1. De Ridder K, Che H, Leroy K, Thienpont B. Benchmarking of methods for DNA methylome deconvolution. Nat Commun 2024; 15:4134. [PMID: 38755121 PMCID: PMC11099101 DOI: 10.1038/s41467-024-48466-z] [Received: 10/20/2023] [Accepted: 04/30/2024] [Indexed: 05/18/2024]
Abstract
Defining the number and abundance of different cell types in tissues is important for understanding disease mechanisms as well as for diagnostic and prognostic purposes. Typically, this is achieved by immunohistological analyses, cell sorting, or single-cell RNA-sequencing. Alternatively, cell-specific DNA methylome information can be leveraged to deconvolve cell fractions from a bulk DNA mixture. However, comprehensive benchmarking of deconvolution methods and modalities has not yet been performed. Here we evaluate 16 deconvolution algorithms, developed either specifically for DNA methylome data or more generically. We assess the performance of these algorithms, and the effect of normalization methods, while modeling variables that impact deconvolution performance, including cell abundance, cell type similarity, reference panel size, method for methylome profiling (array or sequencing), and technical variation. We observe differences in algorithm performance depending on each of these variables, emphasizing the need for tailoring deconvolution analyses. The complexity of the reference, marker selection method, number of marker loci and, for sequencing-based assays, sequencing depth have a marked influence on performance. By developing handles to select the optimal analysis configuration, we provide a valuable source of information for studies aiming to deconvolve array- or sequencing-based methylation data.
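As a rough illustration of what reference-based deconvolution of a bulk methylome involves, the sketch below fits non-negative cell-type fractions by projected gradient descent on a least-squares objective and then rescales them to sum to 1. It is not one of the 16 algorithms evaluated in the paper; the function name `deconvolve`, the toy reference matrix, and the final rescaling step are all assumptions made for this example.

```python
def deconvolve(bulk, reference, iters=2000):
    """Estimate cell-type fractions from a bulk methylation profile.

    bulk:      list of beta values, one per marker locus.
    reference: list of rows (one per locus); each row holds that locus's
               beta value in every reference cell type.
    Returns non-negative fractions rescaled to sum to 1.
    """
    n_types = len(reference[0])
    # The squared Frobenius norm is an upper bound on the Lipschitz constant
    # of the gradient, so its reciprocal is a conservative, stable step size.
    step = 1.0 / sum(v * v for row in reference for v in row)
    frac = [1.0 / n_types] * n_types  # start from a uniform mixture
    for _ in range(iters):
        # Residual between the predicted and the observed bulk profile.
        resid = [sum(r * f for r, f in zip(row, frac)) - b
                 for row, b in zip(reference, bulk)]
        # Gradient of 0.5 * ||residual||^2 with respect to the fractions.
        grad = [sum(row[k] * e for row, e in zip(reference, resid))
                for k in range(n_types)]
        # Gradient step followed by projection onto non-negative values.
        frac = [max(0.0, f - step * g) for f, g in zip(frac, grad)]
    total = sum(frac)
    return [f / total for f in frac] if total > 0 else frac


# Hypothetical reference: 6 marker loci (rows) x 3 cell types (columns).
reference = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.2],
    [0.1, 0.2, 0.8],
]
true_fractions = [0.6, 0.3, 0.1]
# Noise-free synthetic bulk sample: a weighted average of the reference.
bulk = [sum(r * f for r, f in zip(row, true_fractions)) for row in reference]
estimated = deconvolve(bulk, reference)
print([round(f, 3) for f in estimated])
```

On noise-free synthetic data like this the estimate should match the mixing fractions closely; real array or sequencing data would add noise, normalization choices, and marker selection, which are the variables the benchmark examines.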
Affiliation(s)
- Kobe De Ridder
  - Laboratory for Functional Epigenetics, Department of Human Genetics, KU Leuven, 3000, Leuven, Belgium
- Huiwen Che
  - Laboratory for Functional Epigenetics, Department of Human Genetics, KU Leuven, 3000, Leuven, Belgium
- Kaat Leroy
  - Laboratory for Functional Epigenetics, Department of Human Genetics, KU Leuven, 3000, Leuven, Belgium
- Bernard Thienpont
  - Laboratory for Functional Epigenetics, Department of Human Genetics, KU Leuven, 3000, Leuven, Belgium
  - KU Leuven Institute for Single Cell Omics (LISCO), KU Leuven, 3000, Leuven, Belgium
  - KU Leuven Cancer Institute (LKI), KU Leuven, 3000, Leuven, Belgium
2. Cai M, Zhou J, McKennan C, Wang J. scMD facilitates cell type deconvolution using single-cell DNA methylation references. Commun Biol 2024; 7:1. [PMID: 38168620 PMCID: PMC10762261 DOI: 10.1038/s42003-023-05690-5] [Received: 08/14/2023] [Accepted: 12/08/2023] [Indexed: 01/05/2024]
Abstract
The proliferation of single-cell RNA-sequencing data has led to the widespread use of cellular deconvolution, aiding the extraction of cell-type-specific information from extensive bulk data. However, those advances have been mostly limited to transcriptomic data. With recent developments in single-cell DNA methylation (scDNAm), there are emerging opportunities for deconvolving bulk DNAm data, particularly for solid tissues like brain that lack cell-type references. Due to technical limitations, current scDNAm sequences represent a small proportion of the whole genome for each single cell, and those detected regions differ across cells. This makes scDNAm data ultra-high dimensional and ultra-sparse. To deal with these challenges, we introduce scMD (single cell Methylation Deconvolution), a cellular deconvolution framework to reliably estimate cell type fractions from tissue-level DNAm data. To analyze large-scale complex scDNAm data, scMD employs a statistical approach to aggregate scDNAm data at the cell cluster level, identify cell-type marker DNAm sites, and create precise cell-type signature matrixes that surpass state-of-the-art sorted-cell or RNA-derived references. Through thorough benchmarking in several datasets, we demonstrate scMD's superior performance in estimating cellular fractions from bulk DNAm data. With scMD-estimated cellular fractions, we identify cell type fractions and cell type-specific differentially methylated cytosines associated with Alzheimer's disease.
Affiliation(s)
- Manqi Cai
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jingtian Zhou
  - Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California, San Diego, CA, USA
- Chris McKennan
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jiebiao Wang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA
3. Yang P, Hubert SM, Futreal PA, Song X, Zhang J, Lee JJ, Wistuba I, Yuan Y, Zhang J, Li Z. A novel Bayesian model for assessing intratumor heterogeneity of tumor infiltrating leukocytes with multi-region gene expression sequencing. bioRxiv 2023:2023.10.24.563820. [PMID: 37961165 PMCID: PMC10634795 DOI: 10.1101/2023.10.24.563820] [Indexed: 11/15/2023]
Abstract
Intratumor heterogeneity (ITH) of tumor-infiltrated leukocytes (TILs) is an important phenomenon of cancer biology with potentially profound clinical impacts. Multi-region gene expression sequencing data provide a promising opportunity that allows for explorations of TILs and their intratumor heterogeneity for each subject. Although several existing methods are available to infer the proportions of TILs, considerable methodological gaps exist for evaluating intratumor heterogeneity of TILs with multi-region gene expression data. Here, we develop ICeITH, immune cell estimation reveals intratumor heterogeneity, a Bayesian hierarchical model that borrows cell type profiles as prior knowledge to decompose mixed bulk data while accounting for the within-subject correlations among tumor samples. ICeITH quantifies intratumor heterogeneity by the variability of targeted cellular compositions. Through extensive simulation studies, we demonstrate that ICeITH is more accurate in measuring relative cellular abundance and evaluating intratumor heterogeneity compared with existing methods. We also assess the ability of ICeITH to stratify patients by their intratumor heterogeneity score and associate the estimations with the survival outcomes. Finally, we apply ICeITH to two multi-region gene expression datasets from lung cancer studies to classify patients into different risk groups according to the ITH estimations of targeted TILs that shape either pro- or anti-tumor processes. In conclusion, ICeITH is a useful tool to evaluate intratumor heterogeneity of TILs from multi-region gene expression data.
Affiliation(s)
- Peng Yang
  - Department of Statistics, Rice University, Houston, Texas 77005, U.S.A
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, U.S.A
- Shawna M. Hubert
  - Department of Thoracic Head Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- P. Andrew Futreal
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Xingzhi Song
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Jianhua Zhang
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- J. Jack Lee
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, U.S.A
- Ignacio Wistuba
  - Department of Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ying Yuan
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, U.S.A
- Jianjun Zhang
  - Department of Thoracic Head Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, U.S.A
4. Takeuchi F, Liang YQ, Shimizu-Furusawa H, Isono M, Ang MY, Mori K, Mori T, Kakazu E, Yoshio S, Kato N. Gene-regulation modules in nonalcoholic fatty liver disease revealed by single-nucleus ATAC-seq. Life Sci Alliance 2023; 6:e202301988. [PMID: 37491046 PMCID: PMC10368228 DOI: 10.26508/lsa.202301988] [Received: 02/13/2023] [Revised: 07/14/2023] [Accepted: 07/14/2023] [Indexed: 07/27/2023]
Abstract
We investigated the progression of nonalcoholic fatty liver disease from fatty liver to steatohepatitis using single-nucleus and bulk ATAC-seq on the livers of rats fed a high-fat diet (HFD). Rats fed HFD for 4 wk developed fatty liver, and those fed HFD for 8 wk further progressed to steatohepatitis. We observed an increase in the proportion of inflammatory macrophages, consistent with the pathological progression. Utilizing machine learning, we divided global gene regulation into modules, wherein transcription factors within a module could regulate genes within the same module, reaffirming known regulatory relationships between transcription factors and biological processes. We identified core genes (central to co-expression and protein-protein interaction) for the biological processes discovered. Notably, a large part of the core genes overlapped with genes previously implicated in nonalcoholic fatty liver disease. Single-nucleus ATAC-seq, combined with data-driven statistical analysis, offers insight into in vivo global gene regulation as a combination of modules and assists in identifying core genes of relevant biological processes.
Affiliation(s)
- Fumihiko Takeuchi
  - Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Medical Genomics Center, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Systems Genomics Laboratory, Baker Heart and Diabetes Institute, Melbourne, Australia
- Yi-Qiang Liang
  - Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
- Hana Shimizu-Furusawa
  - Department of Hygiene and Public Health, School of Medicine, Teikyo University, Tokyo, Japan
- Masato Isono
  - Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
- Mia Yang Ang
  - Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Department of Clinical Genome Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Kotaro Mori
  - Medical Genomics Center, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
- Taizo Mori
  - Department of Liver Diseases, The Research Center for Hepatitis and Immunology, National Center for Global Health and Medicine, Chiba, Japan
- Eiji Kakazu
  - Department of Liver Diseases, The Research Center for Hepatitis and Immunology, National Center for Global Health and Medicine, Chiba, Japan
- Sachiyo Yoshio
  - Department of Liver Diseases, The Research Center for Hepatitis and Immunology, National Center for Global Health and Medicine, Chiba, Japan
- Norihiro Kato
  - Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Medical Genomics Center, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Department of Clinical Genome Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
5. Cai M, Zhou J, McKennan C, Wang J. scMD: cell type deconvolution using single-cell DNA methylation references. bioRxiv 2023:2023.08.03.551733. [PMID: 37577715 PMCID: PMC10418231 DOI: 10.1101/2023.08.03.551733] [Indexed: 08/15/2023]
Abstract
The proliferation of single-cell RNA sequencing data has led to the widespread use of cellular deconvolution, aiding the extraction of cell type-specific information from extensive bulk data. However, those advances have been mostly limited to transcriptomic data. With recent development in single-cell DNA methylation (scDNAm), new avenues have been opened for deconvolving bulk DNAm data, particularly for solid tissues like the brain that lack cell-type references. Due to technical limitations, current scDNAm sequences represent a small proportion of the whole genome for each single cell, and those detected regions differ across cells. This makes scDNAm data ultra-high dimensional and ultra-sparse. To deal with these challenges, we introduce scMD (single cell Methylation Deconvolution), a cellular deconvolution framework to reliably estimate cell type fractions from tissue-level DNAm data. To analyze large-scale complex scDNAm data, scMD employs a statistical approach to aggregate scDNAm data at the cell cluster level, identify cell-type marker DNAm sites, and create a precise cell-type signature matrix that surpasses state-of-the-art sorted-cell or RNA-derived references. Through thorough benchmarking in several datasets, we demonstrate scMD's superior performance in estimating cellular fractions from bulk DNAm data. With scMD-estimated cellular fractions, we identify cell type fractions and cell type-specific differentially methylated cytosines associated with Alzheimer's disease.
Affiliation(s)
- Manqi Cai
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jingtian Zhou
  - Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA
- Chris McKennan
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jiebiao Wang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA
6. Little P, Liu S, Zhabotynsky V, Li Y, Lin DY, Sun W. A computational method for cell type-specific expression quantitative trait loci mapping using bulk RNA-seq data. Nat Commun 2023; 14:3030. [PMID: 37231002 PMCID: PMC10212972 DOI: 10.1038/s41467-023-38795-w] [Received: 04/21/2022] [Accepted: 05/16/2023] [Indexed: 05/27/2023]
Abstract
Mapping cell type-specific gene expression quantitative trait loci (ct-eQTLs) is a powerful way to investigate the genetic basis of complex traits. A popular method for ct-eQTL mapping is to assess the interaction between the genotype of a genetic locus and the abundance of a specific cell type using a linear model. However, this approach requires transforming RNA-seq count data, which distorts the relation between gene expression and cell type proportions and results in reduced power and/or inflated type I error. To address this issue, we have developed a statistical method called CSeQTL that allows for ct-eQTL mapping using bulk RNA-seq count data while taking advantage of allele-specific expression. We validated the results of CSeQTL through simulations and real data analysis, comparing CSeQTL results to those obtained from purified bulk RNA-seq data or single cell RNA-seq data. Using our ct-eQTL findings, we were able to identify cell types relevant to 21 categories of human traits.
Affiliation(s)
- Paul Little
  - Biostatistics Program, Public Health Science Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Si Liu
  - Biostatistics Program, Public Health Science Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Vasyl Zhabotynsky
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Yun Li
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Dan-Yu Lin
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Wei Sun
  - Biostatistics Program, Public Health Science Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
7. Huang P, Cai M, Lu X, McKennan C, Wang J. Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution. bioRxiv 2023:2023.03.15.532820. [PMID: 36993280 PMCID: PMC10055056 DOI: 10.1101/2023.03.15.532820] [Indexed: 06/19/2023]
Abstract
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions to both deconfound differential expression analyses and infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose Hierarchical Deconvolution (HiDecon) that uses single-cell RNA sequencing references and a hierarchical cell type tree, which models the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with the ground truth of measured cellular fractions, we demonstrate that HiDecon significantly outperforms existing methods and accurately estimates cellular fractions.
Affiliation(s)
- Penghui Huang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Manqi Cai
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Xinghua Lu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Chris McKennan
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jiebiao Wang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
8. Meng G, Tang W, Huang E, Li Z, Feng H. A comprehensive assessment of cell type-specific differential expression methods in bulk data. Brief Bioinform 2023; 24:bbac516. [PMID: 36472568 PMCID: PMC9851321 DOI: 10.1093/bib/bbac516] [Received: 08/16/2022] [Revised: 10/08/2022] [Accepted: 10/29/2022] [Indexed: 12/12/2022]
Abstract
Accounting for cell type compositions has been very successful at analyzing high-throughput data from heterogeneous tissues. Differential gene expression analysis at cell type level is becoming increasingly popular, yielding biomarker discovery in a finer granularity within a particular cell type. Although several computational methods have been developed to identify cell type-specific differentially expressed genes (csDEG) from RNA-seq data, a systematic evaluation is yet to be performed. Here, we thoroughly benchmark six recently published methods: CellDMC, CARseq, TOAST, LRCDE, CeDAR and TCA, together with two classical methods, csSAM and DESeq2, for a comprehensive comparison. We aim to systematically evaluate the performance of popular csDEG detection methods and provide guidance to researchers. In simulation studies, we benchmark available methods under various scenarios of baseline expression levels, sample sizes, cell type compositions, expression level alterations, technical noise and biological dispersions. Real data analyses of three large datasets on inflammatory bowel disease, lung cancer and autism provide evaluation at both the gene level and the pathway level. We find that csDEG calling is strongly affected by effect size, baseline expression level and cell type compositions. Results imply that csDEG discovery is itself a challenging task, with room for improvement in handling low signal-to-noise ratios and low-expression genes.
Affiliation(s)
- Guanqun Meng
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, 44106, Ohio, USA
- Wen Tang
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, 44106, Ohio, USA
- Emina Huang
  - Department of Surgery, The University of Texas Southwestern Medical Center, Dallas, 75390, Texas, USA
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, 77030, Texas, USA
- Hao Feng
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, 44106, Ohio, USA
9. Chen C, Leung YY, Ionita M, Wang LS, Li M. Omnibus and robust deconvolution scheme for bulk RNA sequencing data integrating multiple single-cell reference sets and prior biological knowledge. Bioinformatics 2022; 38:4530-4536. [PMID: 35980155 PMCID: PMC9525013 DOI: 10.1093/bioinformatics/btac563] [Received: 01/19/2022] [Revised: 07/17/2022] [Accepted: 08/17/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Cell-type deconvolution of bulk tissue RNA sequencing (RNA-seq) data is an important step toward understanding the variations in cell-type composition among disease conditions. Owing to recent advances in single-cell RNA sequencing (scRNA-seq) and the availability of large amounts of bulk RNA-seq data in disease-relevant tissues, various deconvolution methods have been developed. However, the performance of existing methods heavily relies on the quality of information provided by external data sources, such as the selection of scRNA-seq data as a reference and prior biological information. RESULTS: We present the Integrated and Robust Deconvolution (InteRD) algorithm to infer cell-type proportions from target bulk RNA-seq data. Owing to the innovative use of penalized regression with a new evaluation criterion for deconvolution, InteRD has three primary advantages. First, it is able to effectively integrate deconvolution results from multiple scRNA-seq datasets. Second, InteRD calibrates estimates from reference-based deconvolution by taking into account extra biological information as priors. Third, the proposed algorithm is robust to inaccurate external information imposed in the deconvolution system. Extensive numerical evaluations and real-data applications demonstrate that InteRD yields more accurate and robust cell-type proportion estimates that agree well with known biology. AVAILABILITY AND IMPLEMENTATION: The proposed InteRD framework is implemented in R and the package is available at https://cran.r-project.org/web/packages/InteRD/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chixiang Chen
  - Department of Epidemiology and Public Health, Division of Biostatistics and Bioinformatics, University of Maryland School of Medicine, Baltimore, MD 21201, USA
  - Department of Neurosurgery, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Yuk Yee Leung
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, Philadelphia, PA 19104, USA
- Matei Ionita
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, Philadelphia, PA 19104, USA
- Li-San Wang
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, Philadelphia, PA 19104, USA
- Mingyao Li
  - Department of Biostatistics Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
10. He D, Chen M, Wang W, Song C, Qin Y. Deconvolution of tumor composition using partially available DNA methylation data. BMC Bioinformatics 2022; 23:355. [PMID: 36002797 PMCID: PMC9400327 DOI: 10.1186/s12859-022-04893-7] [Received: 06/07/2022] [Accepted: 08/16/2022] [Indexed: 11/10/2022]
Abstract
Background: Deciphering proportions of constitutional cell types in tumor tissues is a crucial step for the analysis of tumor heterogeneity and the prediction of response to immunotherapy. In the process of measuring cell population proportions, traditional experimental methods have been greatly hampered by the cost and extensive dropout events. At present, the public availability of large amounts of DNA methylation data makes it possible to use computational methods to predict proportions. Results: In this paper, we proposed PRMeth, a method to deconvolve tumor mixtures using partially available DNA methylation data. By adopting an iteratively optimized non-negative matrix factorization framework, PRMeth took DNA methylation profiles of a portion of the cell types in the tissue mixtures (including blood and solid tumors) as input to estimate the proportions of all cell types as well as the methylation profiles of unknown cell types simultaneously. We compared PRMeth with five different methods through three benchmark datasets and the results show that PRMeth could infer the proportions of all cell types and recover the methylation profiles of unknown cell types effectively. Then, applying PRMeth to four types of tumors from The Cancer Genome Atlas (TCGA) database, we found that the immune cell proportions estimated by PRMeth were largely consistent with previous studies and met biological significance. Conclusions: Our method can circumvent the difficulty of obtaining complete DNA methylation reference data and obtain satisfactory deconvolution accuracy, which will be conducive to exploring the new directions of cancer immunotherapy. PRMeth is implemented in R and is freely available from GitHub (https://github.com/hedingqin/PRMeth). Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04893-7.
Affiliation(s)
- Dingqin He
  - College of Information Technology, Shanghai Ocean University, Hucheng Ring Road, Shanghai, China
  - Key Laboratory of Fisheries Information Ministry of Agriculture, Shanghai, China
- Ming Chen
  - College of Information Technology, Shanghai Ocean University, Hucheng Ring Road, Shanghai, China
  - Key Laboratory of Fisheries Information Ministry of Agriculture, Shanghai, China
- Wenjuan Wang
  - College of Information Technology, Shanghai Ocean University, Hucheng Ring Road, Shanghai, China
  - Key Laboratory of Fisheries Information Ministry of Agriculture, Shanghai, China
- Chunhui Song
  - College of Information Technology, Shanghai Ocean University, Hucheng Ring Road, Shanghai, China
  - Key Laboratory of Fisheries Information Ministry of Agriculture, Shanghai, China
- Yufang Qin
  - College of Information Technology, Shanghai Ocean University, Hucheng Ring Road, Shanghai, China
  - Key Laboratory of Fisheries Information Ministry of Agriculture, Shanghai, China
11. Cai M, Yue M, Chen T, Liu J, Forno E, Lu X, Billiar T, Celedón J, McKennan C, Chen W, Wang J. Robust and accurate estimation of cellular fraction from tissue omics data via ensemble deconvolution. Bioinformatics 2022; 38:3004-3010. [PMID: 35438146 PMCID: PMC9991889 DOI: 10.1093/bioinformatics/btac279] [Received: 01/10/2022] [Revised: 03/22/2022] [Accepted: 04/13/2022] [Indexed: 01/04/2023]
Abstract
MOTIVATION: Tissue-level omics data such as transcriptomics and epigenomics are an average across diverse cell types. To extract cell-type-specific (CTS) signals, dozens of cellular deconvolution methods have been proposed to infer cell-type fractions from tissue-level data. However, these methods produce vastly different results under various real data settings. Simulation-based benchmarking studies showed no universally best deconvolution approaches. There have been attempts at ensemble methods, but they only aggregate multiple single-cell references or reference-free deconvolution methods. RESULTS: To achieve a robust estimation of cellular fractions, we proposed EnsDeconv (Ensemble Deconvolution), which adopts CTS robust regression to synthesize the results from 11 single deconvolution methods, 10 reference datasets, 5 marker gene selection procedures, 5 data normalizations and 2 transformations. Unlike most benchmarking studies based on simulations, we compiled four large real datasets of 4937 tissue samples in total with measured cellular fractions and bulk gene expression from different tissues. Comprehensive evaluations demonstrated that EnsDeconv yields more stable, robust and accurate fractions than existing methods. We illustrated that EnsDeconv estimated cellular fractions enable various CTS downstream analyses such as differential fractions associated with clinical variables. We further extended EnsDeconv to analyze bulk DNA methylation data. AVAILABILITY AND IMPLEMENTATION: EnsDeconv is freely available as an R-package from https://github.com/randel/EnsDeconv. The RNA microarray data from the TRAUMA study are available and can be accessed in GEO (GSE36809). The demographic and clinical phenotypes can be shared on reasonable request to the corresponding authors. The RNA-seq data from the EVAPR study cannot be shared publicly due to the privacy of individuals that participated in the clinical research in compliance with the IRB approval at the University of Pittsburgh. The RNA microarray data from the FHS study are available from dbGaP (phs000007.v32.p13). The RNA-seq data from the ROS study were downloaded from the AD Knowledge Portal. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Manqi Cai
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Molin Yue
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Tianmeng Chen
  - Department of Surgery, University of Pittsburgh, Pittsburgh, PA 15213, USA
  - Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213, USA
- Jinling Liu
  - Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA
  - Department of Biological Sciences, Missouri University of Science and Technology, Rolla, MO 65409, USA
- Erick Forno
  - Department of Pediatrics, University of Pittsburgh Medical Center Children’s Hospital of Pittsburgh, Pittsburgh, PA 15224, USA
- Xinghua Lu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Timothy Billiar
  - Department of Surgery, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Juan Celedón
  - Department of Pediatrics, University of Pittsburgh Medical Center Children’s Hospital of Pittsburgh, Pittsburgh, PA 15224, USA
- Chris McKennan
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Wei Chen
  - Department of Pediatrics, University of Pittsburgh Medical Center Children’s Hospital of Pittsburgh, Pittsburgh, PA 15224, USA
- Jiebiao Wang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
|
12
|
Jin C, Chen M, Lin D, Sun W. Cell type-aware analysis of RNA-seq data. Nat Comput Sci 2021; 1:253-261. [PMID: 34957416 PMCID: PMC8697413 DOI: 10.1038/s43588-021-00055-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/30/2020] [Accepted: 03/10/2021] [Indexed: 12/13/2022]
Abstract
Most tissue samples are composed of different cell types. Differential expression analysis without accounting for cell type composition cannot separate the changes due to cell type composition from those due to cell type-specific expression. We propose a computational framework to address these limitations: Cell Type Aware analysis of RNA-seq (CARseq). CARseq employs a negative binomial distribution that appropriately models the count data from RNA-seq experiments. Simulation studies show that CARseq has substantially higher power than a linear model-based approach and that it also provides more accurate estimates of the rankings of differentially expressed genes. We have applied CARseq to compare gene expression of schizophrenia/autism subjects versus controls, and identified the cell types underlying the differences and similarities of these two neuron-developmental diseases. Our results are consistent with the results from differential expression analysis using single cell RNA-seq data.
Affiliation(s)
- Chong Jin
  - Department of Biostatistics, University of North Carolina at Chapel Hill
- Danyu Lin
  - Department of Biostatistics, University of North Carolina at Chapel Hill
- Wei Sun
  - Department of Biostatistics, University of North Carolina at Chapel Hill
  - Public Health Science Division, Fred Hutchinson Cancer Research Center
  - Department of Biostatistics, University of Washington
|
13
|
EMeth: An EM algorithm for cell type decomposition based on DNA methylation data. Sci Rep 2021; 11:5717. [PMID: 33707472 PMCID: PMC7952399 DOI: 10.1038/s41598-021-84864-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/15/2020] [Accepted: 02/22/2021] [Indexed: 12/31/2022]
Abstract
We introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on the existing reference-based methods by detecting the CpGs whose DNA methylation is inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by the case of studying methylation in tumor cells, where bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and the tumor cell proportion can be estimated from copy number data. We demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods using simulated data and in silico mixtures. Applications in cancer studies show that the proportions of T regulatory cells estimated by DNA methylation have expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.
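The reference-based setting described in this abstract can be illustrated with a small sketch. It is not EMeth's EM algorithm and does not down-weight inconsistent CpGs; it is a plain constrained least-squares fit, assuming the bulk methylation at each CpG is a fraction-weighted average of the cell type-specific reference values. The reference matrix, the mixing fractions, and the function names `project_to_simplex` and `estimate_fractions` are all hypothetical and chosen only for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def estimate_fractions(reference, bulk, steps=2000, step_size=0.2):
    """Projected gradient descent on 0.5 * ||reference @ p - bulk||^2.

    reference: list of rows, one per CpG, each holding one value per cell type.
    bulk: observed bulk methylation, one value per CpG.
    """
    n_loci = len(reference)
    n_types = len(reference[0])
    p = [1.0 / n_types] * n_types  # start from equal fractions
    for _ in range(steps):
        residual = [
            sum(reference[i][k] * p[k] for k in range(n_types)) - bulk[i]
            for i in range(n_loci)
        ]
        grad = [
            sum(reference[i][k] * residual[i] for i in range(n_loci))
            for k in range(n_types)
        ]
        p = project_to_simplex([p[k] - step_size * grad[k] for k in range(n_types)])
    return p


# Hypothetical reference methylation (rows: CpGs, columns: three cell types).
reference = [
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.2, 0.8],
]
true_fractions = [0.6, 0.3, 0.1]  # made-up mixing proportions
# Noise-free in silico mixture built from the reference.
bulk = [sum(r * f for r, f in zip(row, true_fractions)) for row in reference]

estimated = estimate_fractions(reference, bulk)
print([round(f, 3) for f in estimated])
```

With a noise-free mixture and a well-conditioned reference like this one, the estimate should land close to the made-up fractions. The step size is only safe here because the toy matrix is small; real data with more loci would need a smaller step or a dedicated non-negative least-squares solver.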
|
14
|
Hunt GJ, Gagnon-Bartsch JA. The role of scale in the estimation of cell-type proportions. Ann Appl Stat 2021. [DOI: 10.1214/20-aoas1395] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
15
|
Dong M, Thennavan A, Urrutia E, Li Y, Perou CM, Zou F, Jiang Y. SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Brief Bioinform 2020; 22:416-427. [PMID: 31925417 PMCID: PMC7820884 DOI: 10.1093/bib/bbz166] [Citation(s) in RCA: 121] [Impact Index Per Article: 30.3] [Received: 09/03/2019] [Revised: 11/04/2019] [Accepted: 12/02/2019] [Indexed: 12/14/2022]
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
Affiliation(s)
- Fei Zou
  - Corresponding authors: Fei Zou and Yuchao Jiang, Department of Biostatistics and Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
- Yuchao Jiang
  - Corresponding authors: Fei Zou and Yuchao Jiang, Department of Biostatistics and Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
|