1
Liu T, Liu C, Li Q, Zheng X, Zou F. Adaptive Regularized Tri-Factor Non-Negative Matrix Factorization for Cell Type Deconvolution. bioRxiv: the preprint server for biology 2024:2023.12.07.570631. [PMID: 38106220 PMCID: PMC10723472 DOI: 10.1101/2023.12.07.570631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Accurate deconvolution of cell types from bulk gene expression is crucial for understanding cellular compositions and uncovering cell-type specific differential expression and physiological states of diseased tissues. Existing deconvolution methods have limitations, such as requiring complete cellular gene expression signatures or neglecting partial biological information. Moreover, these methods often overlook varying cell-type mRNA amounts, leading to biased proportion estimates. Additionally, they do not effectively utilize valuable reference information from external studies, such as means and ranges of population cell-type proportions. To address these challenges, we introduce an Adaptive Regularized Tri-factor non-negative matrix factorization approach for deconvolution (ARTdeConv). We rigorously establish the numerical convergence of our algorithm. Through benchmark simulations, we demonstrate the superior performance of ARTdeConv compared to state-of-the-art semi-reference-based and reference-free methods. In a real-world application, our method accurately estimates cell proportions, as evidenced by the nearly perfect Pearson's correlation between ARTdeConv estimates and flow cytometry measurements in a dataset from a trivalent influenza vaccine study. Moreover, our analysis of ARTdeConv estimates in COVID-19 patients reveals patterns consistent with important immunological phenomena observed in other studies. The proposed method, ARTdeConv, is implemented as an R package and can be accessed on GitHub for researchers and practitioners.
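The abstract's point that ignoring cell-type mRNA amounts biases proportion estimates can be illustrated with a small worked example. The sketch below is not ARTdeConv itself; the function names, the two-cell-type setup, and the numeric values are hypothetical and chosen only to show the arithmetic.

```python
def rna_fractions(cell_props, mrna_per_cell):
    """Share of bulk RNA contributed by each cell type."""
    weighted = [p * s for p, s in zip(cell_props, mrna_per_cell)]
    total = sum(weighted)
    return [w / total for w in weighted]


def cell_fractions(rna_props, mrna_per_cell):
    """Undo the weighting: divide by per-cell mRNA amount, then renormalize."""
    unweighted = [r / s for r, s in zip(rna_props, mrna_per_cell)]
    total = sum(unweighted)
    return [u / total for u in unweighted]


# Hypothetical values: two cell types in equal numbers, but the second
# carries three times as much mRNA per cell as the first.
true_props = [0.5, 0.5]
mrna_amounts = [1.0, 3.0]

# What a method sees if it treats RNA share as cell share.
biased = rna_fractions(true_props, mrna_amounts)
# Correcting for mRNA amounts recovers the cell-count proportions.
recovered = cell_fractions(biased, mrna_amounts)

print(biased)
print(recovered)
```

In this toy case the uncorrected estimate assigns three quarters of the sample to the second cell type even though the two types are equally abundant, which is the kind of bias a per-cell-type mRNA scaling factor is meant to remove.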
2
Tiong KL, Luzhbin D, Yeang CH. Assessing transcriptomic heterogeneity of single-cell RNASeq data by bulk-level gene expression data. BMC Bioinformatics 2024; 25:209. [PMID: 38867193 PMCID: PMC11167951 DOI: 10.1186/s12859-024-05825-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 06/03/2024] [Indexed: 06/14/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (sc-RNASeq) data illuminate transcriptomic heterogeneity but also possess a high level of noise, abundant missing entries and sometimes inadequate or no cell type annotations at all. Bulk-level gene expression data lack direct information of cell population composition but are more robust and complete and often better annotated. We propose a modeling framework to integrate bulk-level and single-cell RNASeq data to address the deficiencies and leverage the mutual strengths of each type of data and enable a more comprehensive inference of their transcriptomic heterogeneity. Contrary to the standard approaches of factorizing the bulk-level data with one algorithm and (for some methods) treating single-cell RNASeq data as references to decompose bulk-level data, we employed multiple deconvolution algorithms to factorize the bulk-level data, constructed the probabilistic graphical models of cell-level gene expressions from the decomposition outcomes, and compared the log-likelihood scores of these models in single-cell data. We term this framework backward deconvolution as inference operates from coarse-grained bulk-level data to fine-grained single-cell data. As the abundant missing entries in sc-RNASeq data have a significant effect on log-likelihood scores, we also developed a criterion for inclusion or exclusion of zero entries in log-likelihood score computation. RESULTS We selected nine deconvolution algorithms and validated backward deconvolution in five datasets. In the in-silico mixtures of mouse sc-RNASeq data, the log-likelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors of mixture coefficients and cell type specific gene expression signatures. In the true bulk-level mouse data, the sample mixture coefficients were unknown but the log-likelihood scores were strongly correlated with accuracy rates of inferred cell types. 
In the data of autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGN-expressing neurons than normal controls. In datasets of breast cancer and low-grade gliomas (LGG), we compared the log-likelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model that tumors of each subtype were dominated by one cell type persistently outperformed an alternative model that each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. Superiority of the former model is also supported by comparing the real breast cancer sc-RNASeq clusters with those generated by simulated sc-RNASeq data. CONCLUSIONS The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens such as tumors.
Affiliation(s)
- Khong-Loon Tiong
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Dmytro Luzhbin
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
3
Wu CT, Du D, Chen L, Dai R, Liu C, Yu G, Bhardwaj S, Parker SJ, Zhang Z, Clarke R, Herrington DM, Wang Y. CAM3.0: determining cell type composition and expression from bulk tissues with fully unsupervised deconvolution. Bioinformatics 2024; 40:btae107. [PMID: 38407991 PMCID: PMC10924278 DOI: 10.1093/bioinformatics/btae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 01/13/2024] [Accepted: 02/25/2024] [Indexed: 02/28/2024] Open
Abstract
MOTIVATION Complex tissues are dynamic ecosystems consisting of molecularly distinct yet interacting cell types. Computational deconvolution aims to dissect bulk tissue data into cell type compositions and cell-specific expressions. With few exceptions, most existing deconvolution tools exploit supervised approaches requiring various types of references that may be unreliable or even unavailable for specific tissue microenvironments. RESULTS We previously developed a fully unsupervised deconvolution method, Convex Analysis of Mixtures (CAM), that enables estimation of cell type composition and expression from bulk tissues. We now introduce the CAM3.0 tool, which improves this framework with three new and highly efficient algorithms, namely, radius-fixed clustering to identify reliable markers, linear programming to detect an initial scatter simplex, and a smart floating search for the optimum latent variable model. The comparative experimental results obtained from both realistic simulations and case studies show that the CAM3.0 tool can help biologists more accurately identify known or novel cell markers, determine cell proportions, and estimate cell-specific expressions, complementing the existing tools particularly when study- or datatype-specific references are unreliable or unavailable. AVAILABILITY AND IMPLEMENTATION The open-source R scripts of CAM3.0 are freely available at https://github.com/ChiungTingWu/CAM3/ (https://github.com/Bioconductor/Contributions/issues/3205). A user's guide and a vignette are provided.
Affiliation(s)
- Chiung-Ting Wu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, United States
- Dongping Du
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, United States
- Lulu Chen
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, United States
- Rujia Dai
  - Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY 13210, United States
- Chunyu Liu
  - Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY 13210, United States
- Guoqiang Yu
  - Department of Automation, Tsinghua University, Beijing 100084, P. R. China
- Saurabh Bhardwaj
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, United States
  - Department of Electrical and Instrumentation Engineering, Thapar Institute of Engineering & Technology, Punjab 147004, India
- Sarah J Parker
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, United States
- Zhen Zhang
  - Department of Pathology, Johns Hopkins University, Baltimore, MD 21231, United States
- Robert Clarke
  - The Hormel Institute, University of Minnesota, Austin, MN 55912, United States
- David M Herrington
  - Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, United States
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, United States
4
Herrington D, Wang Y. Clinical heterogeneity in the age of big data, advanced analytics, and complexity theory. Transactions of the American Clinical and Climatological Association 2023; 133:56-68. [PMID: 37701617 PMCID: PMC10493739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/14/2023]
Abstract
Clinical heterogeneity remains a challenge in the practice of medicine and is an underlying motivation for much of biomedical research. Unfortunately, despite an abundance of technologies capable of producing millions of discrete data elements with information about a patient's health status or disease prognosis, our ability to translate those data into meaningful improvements in understanding of clinical heterogeneity is limited. To address this gap, we have applied newer approaches to manifold learning and developed additional and complementary techniques to interrogate and interpret complex, high dimensional omics data. The central premise is that there exist manifolds embedded in high dimensional data that represent fundamental biologic processes that may help address the challenges of clinical heterogeneity. Preliminary evidence from several real-world data sets suggests that these techniques can identify coherent and reproducible manifolds embedded in higher dimensional omics data. Work is currently ongoing to determine the clinical informativeness of these novel data structures.
5
MRI Radiogenomics in Precision Oncology: New Diagnosis and Treatment Method. Computational Intelligence and Neuroscience 2022; 2022:2703350. [PMID: 35845886 PMCID: PMC9282990 DOI: 10.1155/2022/2703350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Revised: 05/04/2022] [Accepted: 05/25/2022] [Indexed: 11/21/2022]
Abstract
Precision medicine for cancer offers a way to deliver the most accurate and effective treatment for each individual cancer. Because intertumor and intratumor heterogeneity is high and evolves over time, several obstacles still hinder diagnosis and treatment in clinical practice, despite extensive exploration in recent years. This paper reviews radiogenomics methods in the literature on precision medicine for cancer, focusing on the heterogeneity analysis of tumors. Based on integrative analysis of multimodal (parametric) imaging and molecular data in bulk tumors, we present a comprehensive analysis and discussion of the characterization of tumor heterogeneity in imaging and molecular expression. These investigations are intended to (i) fully mine the multidimensional spatial, temporal, and semantic information in high-dimensional breast magnetic resonance imaging data, integrating the highly specific structured data of genomics and incorporating the diagnostic and cognitive process of doctors, and (ii) establish a radiogenomics data representation model based on multidimensional consistency analysis with multilevel spatial-temporal correlations.
6
Predicting Algorithm of Tissue Cell Ratio Based on Deep Learning Using Single-Cell RNA Sequencing. Applied Sciences-Basel 2022. [DOI: 10.3390/app12125790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/10/2022]
Abstract
Background: Understanding the proportion of cell types in heterogeneous tissue samples is important in bioinformatics. Inferring these proportions from bulk RNA sequencing data is challenging because most traditional algorithms for predicting tissue cell ratios rely heavily on standardized cell-type-specific gene expression profiles and do not consider tissue heterogeneity. As a result, their prediction accuracy is limited and they lack robustness, so new approaches are urgently needed. Methods: In this study, we introduce Autoptcr, an algorithm that automatically predicts tissue cell ratios. The algorithm is trained on data simulated from single-cell RNA sequencing (scRNA-seq) and uses convolutional neural networks (CNNs) to extract intrinsic relationships between genes and predict the cell proportions of tissues. Results: We trained the algorithm using simulated bulk samples and made predictions using real bulk PBMC data. Compared with existing advanced algorithms, Autoptcr achieved the highest Pearson correlation coefficient between actual and predicted values, reaching 0.903. Tested on a bulk sample, Lin's correlation coefficient was 41% higher than that of CSx. The algorithm can infer tissue cell proportions directly from tissue gene expression data. Conclusions: The Autoptcr algorithm uses simulated scRNA-seq data for training to address the dependence on cell-type-specific gene expression profiles. It also has high prediction accuracy and strong noise resistance for the tissue cell ratio. This work is expected to provide new research ideas for the prediction of tissue cell proportions.
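The abstract reports both a Pearson correlation and a correlation coefficient attributed to Lin. Assuming the latter refers to Lin's concordance correlation coefficient (CCC), the sketch below shows how the two metrics differ when comparing predicted and actual cell proportions. The function names and the example values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def _mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation: measures linear association only."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation: also penalizes shifts in mean and scale."""
    n = len(x)
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical proportions: predictions are perfectly linear in the
# actual values but systematically too high by 0.1.
actual = [0.1, 0.2, 0.3, 0.4]
predicted = [0.2, 0.3, 0.4, 0.5]

print(pearson(actual, predicted))
print(lin_ccc(actual, predicted))
```

A constant offset leaves the Pearson correlation at its maximum while lowering the CCC, which is why the two metrics are often reported together when benchmarking deconvolution output.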
7
Wang Y, Gao J, Xuan C, Guan T, Wang Y, Zhou G, Ding T. FSCAM: CAM-Based Feature Selection for Clustering scRNA-seq. Interdiscip Sci 2022; 14:394-408. [PMID: 35028910 DOI: 10.1007/s12539-021-00495-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Revised: 11/22/2021] [Accepted: 11/23/2021] [Indexed: 06/14/2023]
Abstract
Cell type determination based on transcriptome profiles is a key application of single-cell RNA sequencing (scRNA-seq). It is usually achieved through unsupervised clustering. Good feature selection can improve clustering accuracy and is a crucial component of single-cell clustering pipelines. However, most current single-cell feature selection methods are univariable filter methods that ignore gene dependency. Even the multivariable filter methods developed in recent years consider only "one-to-many" relationships between genes. In this paper, a novel single-cell feature selection method based on convex analysis of mixtures (FSCAM) is proposed, which takes into account "many-to-many" relationships. Compared to the previous "one-to-many" methods, FSCAM selects genes with a combination of relevancy, redundancy and completeness. Benchmarking on real datasets is conducted to validate the superiority of FSCAM. By plugging FSCAM into the framework of partition around medoids (PAM) clustering, a single-cell clustering algorithm based on the FSCAM method (SCC_FSCAM) is further developed. Comparing SCC_FSCAM with existing advanced clustering algorithms, the results show that our algorithm has advantages in both internal criteria (clustering number) and external criteria (adjusted Rand index) and has good stability.
Affiliation(s)
- Yan Wang
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Jie Gao
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Chenxu Xuan
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Tianhao Guan
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Yujie Wang
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Gang Zhou
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Tao Ding
  - School of Mathematics Statistics and Physics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK
8
Lu Y, Wu CT, Parker SJ, Cheng Z, Saylor G, Van Eyk JE, Yu G, Clarke R, Herrington DM, Wang Y. COT: an efficient and accurate method for detecting marker genes among many subtypes. Bioinformatics Advances 2022; 2:vbac037. [PMID: 35673616 PMCID: PMC9163574 DOI: 10.1093/bioadv/vbac037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/09/2021] [Revised: 04/10/2022] [Accepted: 05/16/2022] [Indexed: 01/27/2023]
Abstract
Motivation Ideally, a molecularly distinct subtype would be composed of molecular features that are expressed uniquely in the subtype of interest but in no others, so-called marker genes (MGs). MGs play a critical role in the characterization, classification or deconvolution of tissue or cell subtypes. We and others have recognized that the test statistics used by most methods do not exactly satisfy the MG definition and often identify inaccurate MGs. Results We report an efficient and accurate data-driven method, formulated as a Cosine-based One-sample Test (COT) in scatter space, to detect MGs among many subtypes using subtype expression profiles. Fundamentally different from existing approaches, the test statistic in COT precisely matches the mathematical definition of an ideal MG. We demonstrate the performance and utility of COT on both simulated and real gene expression and proteomics data. The open source Python/R tool will allow biologists to efficiently detect MGs and perform a more comprehensive and unbiased molecular characterization of tissue or cell subtypes in many biomedical contexts. Nevertheless, COT complements, rather than replaces, existing methods. Availability and implementation The Python COT software, with a detailed user's manual and a vignette, is freely available at https://github.com/MintaYLu/COT. Supplementary information Supplementary data are available at Bioinformatics Advances online.
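The geometric idea behind a cosine-based marker score can be sketched as follows: a gene expressed in exactly one subtype has a mean-expression vector that lies on a coordinate axis, so its cosine similarity to that axis is 1. This is a simplified illustration under that assumption, not the COT test statistic or the API of the COT software; the function name `marker_score` and the example profiles are hypothetical.

```python
from math import sqrt


def marker_score(profile):
    """Cosine between a non-negative subtype-mean profile and its nearest axis.

    Returns (index of the most expressed subtype, cosine similarity).
    The cosine to the k-th unit axis is profile[k] / ||profile||, so the
    nearest axis is the one with the largest entry.
    """
    norm = sqrt(sum(v * v for v in profile))
    if norm == 0.0:
        return None, 0.0  # gene not expressed in any subtype
    best = max(range(len(profile)), key=lambda k: profile[k])
    return best, profile[best] / norm


# Hypothetical mean expression of three genes across three subtypes.
ideal_marker = [8.0, 0.0, 0.0]    # expressed only in subtype 0
leaky_marker = [8.0, 1.0, 1.0]    # mostly subtype 0, some background
housekeeping = [3.0, 3.0, 3.0]    # equally expressed everywhere

for gene in (ideal_marker, leaky_marker, housekeeping):
    print(marker_score(gene))
```

Scores near 1 indicate subtype-exclusive expression, while a uniformly expressed gene across K subtypes scores 1/sqrt(K), the lowest value possible for a non-negative, nonzero profile.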
Affiliation(s)
- Yingzhou Lu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Chiung-Ting Wu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Sarah J Parker
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, USA
- Zuolin Cheng
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Georgia Saylor
  - Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Jennifer E Van Eyk
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, USA
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Robert Clarke
  - The Hormel Institute, University of Minnesota, Austin, MN 55912, USA
- David M Herrington
  - Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
  - To whom correspondence should be addressed.
9
Comprehensive evaluation of deconvolution methods for human brain gene expression. Nat Commun 2022; 13:1358. [PMID: 35292647 PMCID: PMC8924248 DOI: 10.1038/s41467-022-28655-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 09/05/2019] [Accepted: 01/28/2022] [Indexed: 11/08/2022] Open
Abstract
Transcriptome deconvolution aims to estimate the cellular composition of an RNA sample from its gene expression data, which in turn can be used to correct for composition differences across samples. The human brain is unique in its transcriptomic diversity, and comprises a complex mixture of cell-types, including transcriptionally similar subtypes of neurons. Here, we carry out a comprehensive evaluation of deconvolution methods for human brain transcriptome data, and assess the tissue-specificity of our key observations by comparison with human pancreas and heart. We evaluate eight transcriptome deconvolution approaches and nine cell-type signatures, testing the accuracy of deconvolution using in silico mixtures of single-cell RNA-seq data, RNA mixtures, as well as nearly 2000 human brain samples. Our results identify the main factors that drive deconvolution accuracy for brain data, and highlight the importance of biological factors influencing cell-type signatures, such as brain region and in vitro cell culturing. Transcriptome deconvolution aims to estimate cellular composition based on gene expression data. Here the authors evaluate deconvolution methods for human brain transcriptome and conclude that partial deconvolution algorithms work best, but that appropriate cell-type signatures are also important.
10
Chen L, Wu CT, Lin CH, Dai R, Liu C, Clarke R, Yu G, Van Eyk JE, Herrington DM, Wang Y. swCAM: estimation of subtype-specific expressions in individual samples with unsupervised sample-wise deconvolution. Bioinformatics 2022; 38:1403-1410. [PMID: 34904628 PMCID: PMC8826012 DOI: 10.1093/bioinformatics/btab839] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/10/2021] [Revised: 10/30/2021] [Accepted: 12/10/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Complex biological tissues are often a heterogeneous mixture of several molecularly distinct cell subtypes. Both subtype compositions and subtype-specific (STS) expressions can vary across biological conditions. Computational deconvolution aims to dissect patterns of bulk tissue data into subtype compositions and STS expressions. Existing deconvolution methods can only estimate averaged STS expressions in a population, while many downstream analyses such as inferring co-expression networks in particular subtypes require subtype expression estimates in individual samples. However, individual-level deconvolution is a mathematically underdetermined problem because there are more variables than observations. RESULTS We report a sample-wise Convex Analysis of Mixtures (swCAM) method that can estimate subtype proportions and STS expressions in individual samples from bulk tissue transcriptomes. We extend our previous CAM framework to include a new term accounting for between-sample variations and formulate swCAM as a nuclear-norm and ℓ2,1-norm regularized matrix factorization problem. We determine hyperparameter values using cross-validation with random entry exclusion and obtain a swCAM solution using an efficient alternating direction method of multipliers. Experimental results on realistic simulation data show that swCAM can accurately estimate STS expressions in individual samples and successfully extract co-expression networks in particular subtypes that are otherwise unobtainable using bulk data. In two real-world applications, swCAM analysis of bulk RNASeq data from brain tissue of cases and controls with bipolar disorder or Alzheimer's disease identified significant changes in cell proportion, expression pattern and co-expression module in patient neurons. Comparative evaluation of swCAM versus peer methods is also provided. AVAILABILITY AND IMPLEMENTATION The R Scripts of swCAM are freely available at https://github.com/Lululuella/swCAM. 
A user's guide and a vignette are provided. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
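For readers unfamiliar with the two regularizers named in the swCAM abstract, their standard definitions are given here. This is general background, not the swCAM objective, which the abstract does not state. The symbol $X \in \mathbb{R}^{m \times n}$ is a placeholder matrix with singular values $\sigma_i(X)$, and the column-wise convention for the ℓ2,1-norm is an assumption, since some authors sum over rows instead.

```latex
\|X\|_{*} = \sum_{i=1}^{\min(m,n)} \sigma_i(X),
\qquad
\|X\|_{2,1} = \sum_{j=1}^{n} \Bigl( \sum_{i=1}^{m} X_{ij}^{2} \Bigr)^{1/2}
```

Penalizing the nuclear norm favors low-rank solutions, while penalizing the ℓ2,1-norm favors solutions in which entire columns (or rows, under the other convention) are zero. Both are common ways to make an otherwise underdetermined factorization well posed.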
Affiliation(s)
- Lulu Chen
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Chiung-Ting Wu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Chia-Hsiang Lin
  - Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan
- Rujia Dai
  - Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Chunyu Liu
  - Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Robert Clarke
  - The Hormel Institute, University of Minnesota, Austin, MN 55912, USA
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Jennifer E Van Eyk
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, USA
- David M Herrington
  - Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
11
Boldina G, Fogel P, Rocher C, Bettembourg C, Luta G, Augé F. A2Sign: Agnostic Algorithms for Signatures, a universal method for identifying molecular signatures from transcriptomic datasets prior to cell-type deconvolution. Bioinformatics 2022; 38:1015-1021. [PMID: 34788798 DOI: 10.1093/bioinformatics/btab773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Revised: 09/17/2021] [Accepted: 11/09/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Molecular signatures are critical for inferring the proportions of cell types from bulk transcriptomics data. However, the identification of these signatures is based on a methodology that relies on prior biological knowledge of the cell types being studied. When working with less known biological material, a data-driven approach is required to uncover the underlying classes and generate ad hoc signatures from healthy or pathogenic tissue. RESULTS We present a new approach, A2Sign: Agnostic Algorithms for Signatures, based on a non-negative tensor factorization (NTF) strategy that allows us to identify cell-type-specific molecular signatures, greatly reduce collinearities and also account for inter-individual variability. We propose a global framework that can be applied to uncover molecular signatures for cell-type deconvolution in arbitrary tissues using bulk transcriptome data. We also present two new molecular signatures for deconvolution of up to 16 immune cell types using microarray or RNA-seq data. AVAILABILITY AND IMPLEMENTATION All steps of our analysis were implemented in annotated Python notebooks (https://github.com/paulfogel/A2SIGN). To perform NTF, we used the NMTF package, which can be downloaded using Python pip install. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Galina Boldina
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- Paul Fogel
  - Consultant, F-75006 Paris, France; Advestis, F-75008 Paris, France; Quinten, F-75017 Paris, France
- Corinne Rocher
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- Charles Bettembourg
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- George Luta
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Franck Augé
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
12
Comparative assessment and novel strategy on methods for imputing proteomics data. Sci Rep 2022; 12:1067. [PMID: 35058491 PMCID: PMC8776850 DOI: 10.1038/s41598-022-04938-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 01/04/2022] [Indexed: 11/09/2022] Open
Abstract
Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization, a low-rank global matrix factorization framework, capable of integrating local similarity derived from additional data types. We also explore a biologically-inspired latent variable modeling strategy—convex analysis of mixtures—for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values not authentic missing values. Nevertheless, we show that our fused regularization matrix factorization provides a novel incorporation of external and local information, and the exploratory implementation of convex analysis of mixtures presents a biologically plausible new approach.
13
Saddic L, Orosco A, Guo D, Milewicz DM, Troxlair D, Heide RV, Herrington D, Wang Y, Azizzadeh A, Parker SJ. Proteomic analysis of descending thoracic aorta identifies unique and universal signatures of aneurysm and dissection. JVS Vasc Sci 2022; 3:85-181. [PMID: 35280433 PMCID: PMC8914561 DOI: 10.1016/j.jvssci.2022.01.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/21/2021] [Accepted: 01/05/2022] [Indexed: 01/05/2023] Open
Abstract
Diseases of the descending thoracic aorta such as aneurysms and dissections carry a high degree of morbidity and mortality. At present, a complete understanding is still lacking of the genetics that drive these diseases and why some aortic segments dissect in the presence or absence of an aneurysm. We compared and contrasted the whole proteome expression of descending aortas from patients with normal, dissected, aneurysmal, and aneurysmal with dissected pathology aortic tissue. We uncovered potential tissue markers that might serve as future targets for therapy or predictors of disease progression.
Affiliation(s)
- Louis Saddic
  - Department of Anesthesiology and Perioperative Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, Calif
- Amanda Orosco
  - Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center, Los Angeles, Calif
- Dongchuan Guo
  - Department of Internal Medicine, McGovern Medical School, University of Texas Health Science Center, Houston, Tex
- Dianna M. Milewicz
  - Department of Internal Medicine, McGovern Medical School, University of Texas Health Science Center, Houston, Tex
- Dana Troxlair
  - Department of Pathology, Louisiana State University, New Orleans, La
- David Herrington
  - Department of Cardiovascular Medicine, Wake Forest University, Winston-Salem, NC
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Va
- Ali Azizzadeh
  - Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center, Los Angeles, Calif
- Sarah J. Parker
  - Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center, Los Angeles, Calif
  - Correspondence: Sarah J. Parker, PhD, Department of Cardiology, Smidt Heart Institute, Cedars Sinai Medical Center, AHSP A9228, 8700 Beverly Blvd, Los Angeles, CA 90048
|
14
|
Ahmed M, Lai TH, Kim DR. A Small Fraction of Progenitors Differentiate Into Mature Adipocytes by Escaping the Constraints on the Cell Structure. Front Cell Dev Biol 2021; 9:753042. [PMID: 34708046 PMCID: PMC8542793 DOI: 10.3389/fcell.2021.753042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2021] [Accepted: 09/10/2021] [Indexed: 11/13/2022]
Abstract
Differentiating 3T3-L1 pre-adipocytes are a mixture of non-identical culture cells. It is vital to identify the cell types that respond to the induction stimulus to understand the pre-adipocyte potential and the mature adipocyte behavior. To test this hypothesis, we deconvoluted the gene expression profiles of the cell culture of MDI-induced 3T3-L1 cells. Then we estimated the fractions of the sub-populations and their changes in time. We characterized the sub-populations based on their specific expression profiles. Initial cell cultures comprised three distinct phenotypes. A small fraction of the starting cells responded to the induction and developed into mature adipocytes. Unresponsive cells were probably under structural constraints or were committed to differentiating into alternative phenotypes. Using the same population gene markers, similar proportions were found in induced human primary adipocyte cell cultures. The three sub-populations had diverse responses to treatment with various drugs and compounds. Only the response of the maturating sub-population resembled that estimated from the profiles of the mixture. We then showed that even at a low division rate, a small fraction of cells could increase its share in a dynamic two-population model. Finally, we used a cell cycle expression index to validate that model. In summary, pre-adipocytes are a mixture of different cells, of which a limited fraction become mature adipocytes.
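The claim that a small, slowly dividing fraction can still increase its share can be illustrated with a toy calculation. This is a minimal sketch and not the authors' model: it assumes two populations growing exponentially at constant rates, and the function name, initial fraction, and rates are hypothetical values chosen only for illustration.

```python
import math

def responder_share(f0, r_resp, r_other, t):
    """Share of the responder population at time t under exponential growth.

    f0      -- initial responder fraction (hypothetical)
    r_resp  -- responder division rate per unit time (hypothetical)
    r_other -- division rate of the remaining cells (hypothetical)
    """
    resp = f0 * math.exp(r_resp * t)
    other = (1.0 - f0) * math.exp(r_other * t)
    return resp / (resp + other)

# Illustrative values only: 5% responders dividing slowly, the rest not dividing.
times = list(range(0, 11, 2))
shares = [responder_share(0.05, 0.1, 0.0, t) for t in times]
for t, s in zip(times, shares):
    print(f"t={t:2d}  responder share={s:.3f}")
```

Under these assumptions the responder share grows whenever its rate exceeds that of the other population, however small the difference; only the speed of the shift depends on the rates.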
Affiliation(s)
- Mahmoud Ahmed
  - Department of Biochemistry and Convergence Medical Sciences, Institute of Health Sciences, Gyeongsang National University School of Medicine, Jinju, South Korea
- Trang Huyen Lai
  - Department of Biochemistry and Convergence Medical Sciences, Institute of Health Sciences, Gyeongsang National University School of Medicine, Jinju, South Korea
- Deok Ryong Kim
  - Department of Biochemistry and Convergence Medical Sciences, Institute of Health Sciences, Gyeongsang National University School of Medicine, Jinju, South Korea
|
15
|
Xie Y, Zhao J, Zhang P. A multicompartment model for intratumor tissue-specific analysis of DCE-MRI using non-negative matrix factorization. Med Phys 2021; 48:2400-2411. [PMID: 33608885 DOI: 10.1002/mp.14793] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/12/2020] [Revised: 12/22/2020] [Accepted: 01/29/2021] [Indexed: 11/12/2022]
Abstract
PURPOSE A pharmacokinetic analysis of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) data is subject to inaccuracy and instability partly owing to the partial volume effect (PVE). We proposed a new multicompartment model for a tissue-specific pharmacokinetic analysis in DCE-MRI data to solve the PVE problem and to provide better kinetic parameter maps. METHODS We introduced an independent parameter named fractional volumes of tissue compartments in each DCE-MRI pixel to construct a new linear separable multicompartment model, which simultaneously estimates the pixel-wise time-concentration curves and fractional volumes without the need of the pure-pixel assumption. This simplified convex optimization model was solved using a special type of non-negative matrix factorization (NMF) algorithm called the minimum-volume constraint NMF (MVC-NMF). RESULTS To test the model, synthetic datasets were established based on the general pharmacokinetic parameters. On well-designed synthetic data, the proposed model reached lower bias and lower root mean square fitting error compared to the state-of-the-art algorithm in different noise levels. In addition, the real dataset from QIN-BREAST-DCE-MRI was analyzed, and we observed an improved pharmacokinetic parameter estimation to distinguish the treatment response to chemotherapy applied to breast cancer. CONCLUSION Our model improved the accuracy and stability of the tissue-specific estimation of the fractional volumes and kinetic parameters in DCE-MRI data, and improved the robustness to noise, providing more accurate kinetics for more precise prognosis and therapeutic response evaluation using DCE-MRI.
Affiliation(s)
- Yuhai Xie
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Jun Zhao
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Puming Zhang
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
|
16
|
Amrhein L, Fuchs C. stochprofML: stochastic profiling using maximum likelihood estimation in R. BMC Bioinformatics 2021; 22:123. [PMID: 33722188 PMCID: PMC7958472 DOI: 10.1186/s12859-021-03970-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/22/2020] [Accepted: 01/15/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Tissues are often heterogeneous in their single-cell molecular expression, and this can govern the regulation of cell fate. For the understanding of development and disease, it is important to quantify heterogeneity in a given tissue. RESULTS We present the R package stochprofML which uses the maximum likelihood principle to parameterize heterogeneity from the cumulative expression of small random pools of cells. We evaluate the algorithm's performance in simulation studies and present further application opportunities. CONCLUSION Stochastic profiling outweighs the necessary demixing of mixed samples with a saving in experimental cost and effort and less measurement error. It offers possibilities for parameterizing heterogeneity, estimating underlying pool compositions and detecting differences between cell populations between samples.
Affiliation(s)
- Lisa Amrhein
  - Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
  - Department of Mathematics, Technical University Munich, Boltzmannstrasse 3, 85748 Garching, Germany
- Christiane Fuchs
  - Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
  - Department of Mathematics, Technical University Munich, Boltzmannstrasse 3, 85748 Garching, Germany
  - Faculty of Business Administration and Economics, Bielefeld University, Universitätsstrasse 25, 33615 Bielefeld, Germany
|
17
|
Hunt GJ, Gagnon-Bartsch JA. The role of scale in the estimation of cell-type proportions. Ann Appl Stat 2021. [DOI: 10.1214/20-aoas1395] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
18
|
Data-driven detection of subtype-specific differentially expressed genes. Sci Rep 2021; 11:332. [PMID: 33432005 PMCID: PMC7801594 DOI: 10.1038/s41598-020-79704-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/17/2020] [Accepted: 12/11/2020] [Indexed: 11/08/2022]
Abstract
Among multiple subtypes of tissue or cell, subtype-specific differentially-expressed genes (SDEGs) are defined as being most-upregulated in only one subtype but not in any other. Detecting SDEGs plays a critical role in the molecular characterization and deconvolution of multicellular complex tissues. Classic differential analysis assumes a null hypothesis whose test statistic is not subtype-specific, thus can produce a high false positive rate and/or lower detection power. Here we first introduce a One-Versus-Everyone Fold Change (OVE-FC) test for detecting SDEGs. We then propose a scaled test statistic (OVE-sFC) for assessing the statistical significance of SDEGs that applies a mixture null distribution model and a tailored permutation test. The OVE-FC/sFC test was validated on both type 1 error rate and detection power using extensive simulation data sets generated from real gene expression profiles of purified subtype samples. The OVE-FC/sFC test was then applied to two benchmark gene expression data sets of purified subtype samples and detected many known or previously unknown SDEGs. Subsequent supervised deconvolution results on synthesized bulk expression data, obtained using the SDEGs detected from the independent purified expression data by the OVE-FC/sFC test, showed superior performance in deconvolution accuracy when compared with popular peer methods.
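The one-versus-everyone idea can be sketched in a few lines. This is one plausible reading of the abstract and not the published statistic: it assumes the fold change for a gene is the ratio of the highest subtype mean to the second-highest, which equals the smallest fold change of the top subtype against every other subtype. The function name and the expression values are hypothetical.

```python
def ove_fold_change(subtype_means):
    """Return (top_subtype, fold_change) for one gene.

    subtype_means -- dict mapping subtype name to mean expression
                     (positive values on the linear scale; hypothetical input format).
    The fold change is top mean / runner-up mean, i.e. the minimum ratio of the
    top subtype against each of the other subtypes.
    """
    ranked = sorted(subtype_means.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_mean), (_, runner_up_mean) = ranked[0], ranked[1]
    return top, top_mean / runner_up_mean

# Made-up subtype means for two genes.
genes = {
    "geneA": {"s1": 80.0, "s2": 10.0, "s3": 8.0},   # clearly specific to s1
    "geneB": {"s1": 50.0, "s2": 45.0, "s3": 5.0},   # high in s1 and s2, not specific
}
for name, means in genes.items():
    subtype, fc = ove_fold_change(means)
    print(name, subtype, round(fc, 2))
```

A one-versus-rest comparison would average `s2` and `s3` and could still rank `geneB` highly; comparing against every other subtype individually is what penalizes it here. The significance assessment described in the abstract (mixture null distribution and permutation test) is not covered by this sketch.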
|
19
|
Chen Z, Wu A. Progress and challenge for computational quantification of tissue immune cells. Brief Bioinform 2021; 22:6065002. [PMID: 33401306 DOI: 10.1093/bib/bbaa358] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/28/2020] [Revised: 10/23/2020] [Accepted: 11/07/2020] [Indexed: 12/28/2022]
Abstract
Tissue immune cells have long been recognized as important regulators for the maintenance of balance in the body system. Quantification of the abundance of different immune cells will provide enhanced understanding of the correlation between immune cells and normal or abnormal situations. Currently, computational methods to predict tissue immune cell compositions from bulk transcriptomes have been largely developed. Therefore, summarizing the advantages and disadvantages is appropriate. In addition, an examination of the challenges and possible solutions for these computational models will assist the development of this field. The common hypothesis of these models is that the expression of signature genes for immune cell types might represent the proportion of immune cells that contribute to the tissue transcriptome. In general, we grouped all reported tools into three groups, including reference-free, reference-based scoring and reference-based deconvolution methods. In this review, a summary of all the currently reported computational immune cell quantification tools and their applications, limitations, and perspectives are presented. Furthermore, some critical problems are found that have limited the performance and application of these models, including inadequate immune cell type, the collinearity problem, the impact of the tissue environment on the immune cell expression level, and the deficiency of standard datasets for model validation. To address these issues, tissue specific training datasets that include all known immune cells, a hierarchical computational framework, and benchmark datasets including both tissue expression profiles and the abundances of all the immune cells are proposed to further promote the development of this field.
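The common hypothesis mentioned in this abstract is often written as a linear mixing model. The notation below is an assumption introduced for illustration, since the abstract defines no symbols, and the sum-to-one constraint is a frequent convention rather than a requirement of every tool.

```latex
\[
  \mathbf{b} \approx S\,\mathbf{p},
  \qquad p_k \ge 0 \ \ (k = 1, \dots, K),
  \qquad \sum_{k=1}^{K} p_k = 1
\]
```

Here $\mathbf{b} \in \mathbb{R}^{G}$ is the bulk expression of $G$ signature genes, $S \in \mathbb{R}^{G \times K}$ holds the signature expression of $K$ immune cell types, and $\mathbf{p}$ is the vector of cell-type proportions. In this reading, reference-based deconvolution estimates $\mathbf{p}$ with $S$ given, while reference-free methods estimate $S$ and $\mathbf{p}$ jointly.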
Affiliation(s)
- Ziyi Chen
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
- Aiping Wu
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
|
20
|
Chen L, Wu CT, Wang N, Herrington DM, Clarke R, Wang Y. debCAM: a bioconductor R package for fully unsupervised deconvolution of complex tissues. Bioinformatics 2020; 36:3927-3929. [PMID: 32219387 DOI: 10.1093/bioinformatics/btaa205] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/18/2019] [Revised: 03/05/2020] [Accepted: 03/23/2020] [Indexed: 11/14/2022]
Abstract
SUMMARY We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes based on bulk expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM) that can automatically detect tissue/cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, proteomics and imaging data. With enhanced data preprocessing and prior knowledge incorporation, debCAM software tool will allow biologists to perform a more comprehensive and unbiased characterization of tissue remodeling in many biomedical contexts. AVAILABILITY AND IMPLEMENTATION http://bioconductor.org/packages/debCAM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lulu Chen
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Chiung-Ting Wu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Niya Wang
  - Search Ranking Unit, Google LLC, Mountain View, CA 94043, USA
- David M Herrington
  - Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Robert Clarke
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
|
21
|
Zhou T, Sengupta S, Müller P, Ji Y. RNDClone: Tumor subclone reconstruction based on integrating DNA and RNA sequence data. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
22
|
Integrative analyses prioritize GNL3 as a risk gene for bipolar disorder. Mol Psychiatry 2020; 25:2672-2684. [PMID: 32826963 DOI: 10.1038/s41380-020-00866-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/16/2020] [Revised: 07/30/2020] [Accepted: 08/06/2020] [Indexed: 12/14/2022]
Abstract
Genome-wide association studies (GWASs) have identified numerous single nucleotide polymorphisms (SNPs) associated with bipolar disorder (BD), but what the causal variants are and how they contribute to BD is largely unknown. In this study, we used FUMA, a GWAS annotation tool, to pinpoint potential causal variants and genes from the latest BD GWAS findings, and performed integrative analyses, including brain expression quantitative trait loci (eQTL), gene coexpression network, differential gene expression, protein-protein interaction, and brain intermediate phenotype association analysis to identify the functions of a prioritized gene and its connection to BD. Convergent lines of evidence prioritized protein-coding gene G Protein Nucleolar 3 (GNL3) as a BD risk gene, with integrative analyses revealing GNL3's roles in cell proliferation, neuronal functions, and brain phenotypes. We experimentally revealed that BD-related eQTL SNPs rs10865973, rs12635140, and rs4687644 regulate GNL3 expression using dual luciferase reporter assay and CRISPR interference experiment in human neural progenitor cells. We further identified that GNL3 knockdown and overexpression led to aberrant neuronal proliferation and differentiation, using two-dimensional human neural cell cultures and three-dimensional forebrain organoid model. This study gathers evidence that BD-related genetic variants regulate GNL3 expression which subsequently affects neuronal proliferation and differentiation.
|
23
|
Radiogenomic signatures reveal multiscale intratumour heterogeneity associated with biological functions and survival in breast cancer. Nat Commun 2020; 11:4861. [PMID: 32978398 PMCID: PMC7519071 DOI: 10.1038/s41467-020-18703-2] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 01/21/2020] [Accepted: 09/08/2020] [Indexed: 12/24/2022]
Abstract
Advanced tumours are often heterogeneous, consisting of subclones with various genetic alterations and functional roles. The precise molecular features that characterize the contributions of multiscale intratumour heterogeneity to malignant progression, metastasis, and poor survival are largely unknown. Here, we address these challenges in breast cancer by defining the landscape of heterogeneous tumour subclones and their biological functions using radiogenomic signatures. Molecular heterogeneity is identified by a fully unsupervised deconvolution of gene expression data. Relative prevalence of two subclones associated with cell cycle and primary immunodeficiency pathways identifies patients with significantly different survival outcomes. Radiogenomic signatures of imaging scale heterogeneity are extracted and used to classify patients into groups with distinct subclone compositions. Prognostic value is confirmed by survival analysis accounting for clinical variables. These findings provide insight into how a radiogenomic analysis can identify the biological activities of specific subclones that predict prognosis in a noninvasive and clinically relevant manner. Tumours are made up of heterogeneous subclones. Here, the authors show using breast cancer imaging and gene expression datasets that these subclones can be inferred by the deconvolution of gene expression data, mapped to MRI derived radiogenomic signatures and used to estimate prognosis.
|
24
|
Clarke R, Kraikivski P, Jones BC, Sevigny CM, Sengupta S, Wang Y. A systems biology approach to discovering pathway signaling dysregulation in metastasis. Cancer Metastasis Rev 2020; 39:903-918. [PMID: 32776157 PMCID: PMC7487029 DOI: 10.1007/s10555-020-09921-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/11/2020] [Accepted: 07/13/2020] [Indexed: 02/07/2023]
Abstract
Total metastatic burden is the primary cause of death for many cancer patients. While the process of metastasis has been studied widely, much remains to be understood. Moreover, few agents have been developed that specifically target the major steps of the metastatic cascade. Many individual genes and pathways have been implicated in metastasis but a holistic view of how these interact and cooperate to regulate and execute the process remains somewhat rudimentary. It is unclear whether all of the signaling features that regulate and execute metastasis are yet fully understood. Novel features of a complex system such as metastasis can often be discovered by taking a systems-based approach. We introduce the concepts of systems modeling and define some of the central challenges facing the application of a multidisciplinary systems-based approach to understanding metastasis and finding actionable targets therein. These challenges include appreciating the unique properties of the high-dimensional omics data often used for modeling, limitations in knowledge of the system (metastasis), tumor heterogeneity and sampling bias, and some of the issues key to understanding critical features of molecular signaling in the context of metastasis. We also provide a brief introduction to integrative modeling that focuses on both the nodes and edges of molecular signaling networks. Finally, we offer some observations on future directions as they relate to developing a systems-based model of the metastatic cascade.
Affiliation(s)
- Robert Clarke
  - Department of Oncology, Georgetown University Medical Center, 3970 Reservoir Rd NW, Washington, DC, 20057, USA.
  - Hormel Institute and Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Austin, MN, 55912, USA.
- Pavel Kraikivski
  - Academy of Integrated Science, Division of Systems Biology, Virginia Polytechnic and State University, Blacksburg, VA, 24061, USA
- Brandon C Jones
  - Department of Oncology, Georgetown University Medical Center, 3970 Reservoir Rd NW, Washington, DC, 20057, USA
- Catherine M Sevigny
  - Department of Oncology, Georgetown University Medical Center, 3970 Reservoir Rd NW, Washington, DC, 20057, USA
- Surojeet Sengupta
  - Department of Oncology, Georgetown University Medical Center, 3970 Reservoir Rd NW, Washington, DC, 20057, USA
- Yue Wang
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, 22203, USA
|
25
|
Parker SJ, Chen L, Spivia W, Saylor G, Mao C, Venkatraman V, Holewinski RJ, Mastali M, Pandey R, Athas G, Yu G, Fu Q, Troxlair D, Vander Heide R, Herrington D, Van Eyk JE, Wang Y. Identification of Putative Early Atherosclerosis Biomarkers by Unsupervised Deconvolution of Heterogeneous Vascular Proteomes. J Proteome Res 2020; 19:2794-2806. [PMID: 32202800 DOI: 10.1021/acs.jproteome.0c00118] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/18/2022]
Abstract
Coronary artery disease remains a leading cause of death in industrialized nations, and early detection of disease is a critical intervention target to effectively treat patients and manage risk. Proteomic analysis of mixed tissue homogenates may obscure subtle protein changes that occur uniquely in underlying tissue subtypes. The unsupervised 'convex analysis of mixtures' (CAM) tool has previously been shown to effectively segregate cellular subtypes from mixed expression data. In this study, we hypothesized that CAM would identify proteomic information specifically informative to early atherosclerosis lesion involvement that could lead to potential markers of early disease detection. We quantified the proteome of 99 paired abdominal aorta (AA) and left anterior descending coronary artery (LAD) specimens (N = 198 specimens total) acquired during autopsy of young adults free of diagnosed cardiac disease. The CAM tool was then used to segregate protein subsets uniquely associated with different underlying tissue types, yielding markers of normal and fibrous plaque (FP) tissues in LAD and AA (N = 62 lesions markers). CAM-derived FP marker expression was validated against pathologist estimated luminal surface involvement of FP, as well as in an orthogonal cohort of "pure" fibrous plaque, fatty streak, and normal vascular specimens. A targeted mass spectrometry (MS) assay quantified 39 of 62 CAM-FP markers in plasma from women with angiographically verified coronary artery disease (CAD, N = 46) or free from apparent CAD (control, N = 40). Elastic net variable selection with logistic regression reduced this list to 10 proteins capable of classifying CAD status in this cohort with <6% misclassification error, and a mean area under the receiver operating characteristic curve of 0.992 (confidence interval 0.968-0.998) after cross validation. The proteomics-CAM workflow identified lesion-specific molecular biomarker candidates by distilling the most representative molecules from heterogeneous tissue types.
Affiliation(s)
- Sarah J Parker
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Lulu Chen
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia 24061, United States
- Weston Spivia
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Georgia Saylor
  - Department of Cardiovascular Medicine, Wake Forest University, Winston-Salem, North Carolina 27101, United States
- Chunhong Mao
  - Biocomplexity Institute & Initiative, University of Virginia, Charlottesville, Virginia 22904, United States
- Vidya Venkatraman
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Ronald J Holewinski
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Mitra Mastali
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Rakhi Pandey
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Grace Athas
  - Department of Pathology, Louisiana State University, New Orleans, Louisiana 70112, United States
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia 24061, United States
- Qin Fu
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Dana Troxlair
  - Department of Pathology, Louisiana State University, New Orleans, Louisiana 70112, United States
- Richard Vander Heide
  - Department of Pathology, Louisiana State University, New Orleans, Louisiana 70112, United States
- David Herrington
  - Department of Cardiovascular Medicine, Wake Forest University, Winston-Salem, North Carolina 27101, United States
- Jennifer E Van Eyk
  - Heart Institute & Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia 24061, United States
|
26
|
Psychiatric Genetics, Epigenetics, and Cellular Models in Coming Years. JOURNAL OF PSYCHIATRY AND BRAIN SCIENCE 2019; 4. [PMID: 31608310 PMCID: PMC6788748 DOI: 10.20900/jpbs.20190012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
Psychiatric genetic studies have uncovered hundreds of loci associated with various psychiatric disorders. We take the opportunity to review achievements in the past and provide our view of what is coming in the fields of molecular genetics, epigenetics, and cellular models. We expect that SNP-array and sequencing-based studies of genetic associations will continue to expand, covering more disorders, drug responses, phenotypes, and diverse populations. Epigenetic studies of psychiatric disorders will be another promising field with the growing recognition that environmental factors impact the risk for psychiatric disorders by modulating epigenetic factors. Functional studies of genetic findings will be needed in cellular models to provide important connections between genetic and epigenetic variants and biological phenotypes.
|
27
|
Sun X, Sun S, Yang S. An Efficient and Flexible Method for Deconvoluting Bulk RNA-Seq Data with Single-Cell RNA-Seq Data. Cells 2019; 8:E1161. [PMID: 31569701 PMCID: PMC6830085 DOI: 10.3390/cells8101161] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 08/24/2019] [Revised: 09/23/2019] [Accepted: 09/26/2019] [Indexed: 12/25/2022]
Abstract
Estimating cell type compositions for complex diseases is an important step in investigating cellular heterogeneity, understanding disease etiology, and potentially facilitating early disease diagnosis and prevention. Here, we developed a computational statistical method, referred to as Multi-Omics Matrix Factorization (MOMF), to estimate the cell-type compositions of bulk RNA sequencing (RNA-seq) data by leveraging cell type-specific gene expression levels from single-cell RNA sequencing (scRNA-seq) data. MOMF not only directly models the count nature of gene expression data, but also effectively accounts for the uncertainty of cell type-specific mean gene expression levels. We demonstrate the benefits of MOMF through three real data applications, i.e., glioblastoma (GBM), colorectal cancer (CRC) and type II diabetes (T2D) studies. MOMF is able to accurately estimate disease-related cell type proportions, i.e., oligodendrocyte progenitor cells and macrophage cells, which are strongly associated with the survival of GBM and CRC, respectively.
Affiliation(s)
- Xifang Sun
  - Department of Mathematics, School of Science, Xi'an Shiyou University, 710065 Xi'an, China.
- Shiquan Sun
  - School of Computer Science, Northwestern Polytechnical University, 710072 Xi'an, China.
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
- Sheng Yang
  - Department of Biostatistics, School of Public Health, Nanjing Medical University, 211166 Nanjing, China.
|
28
|
Sompairac N, Nazarov PV, Czerwinska U, Cantini L, Biton A, Molkenov A, Zhumadilov Z, Barillot E, Radvanyi F, Gorban A, Kairov U, Zinovyev A. Independent Component Analysis for Unraveling the Complexity of Cancer Omics Datasets. Int J Mol Sci 2019; 20:E4414. [PMID: 31500324 PMCID: PMC6771121 DOI: 10.3390/ijms20184414] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 08/03/2019] [Revised: 09/02/2019] [Accepted: 09/04/2019] [Indexed: 12/13/2022] Open
Abstract
Independent component analysis (ICA) is a matrix factorization approach where the signals captured by each individual matrix factor are optimized to become as mutually independent as possible. Initially suggested for solving blind source separation problems in various fields, ICA was shown to be successful in analyzing functional magnetic resonance imaging (fMRI) and other types of biomedical data. In the last twenty years, ICA became a part of the standard machine learning toolbox, together with other matrix factorization methods such as principal component analysis (PCA) and non-negative matrix factorization (NMF). Here, we review a number of recent works where ICA was shown to be a useful tool for unraveling the complexity of cancer biology from the analysis of different types of omics data, mainly collected for tumoral samples. Such works highlight the use of ICA in dimensionality reduction, deconvolution, data pre-processing, meta-analysis, and others applied to different data types (transcriptome, methylome, proteome, single-cell data). We particularly focus on the technical aspects of ICA application in omics studies such as using different protocols, determining the optimal number of components, assessing and improving reproducibility of the ICA results, and comparison with other popular matrix factorization techniques. We discuss the emerging ICA applications to the integrative analysis of multi-level omics datasets and introduce a conceptual view on ICA as a tool for defining functional subsystems of a complex biological system and their interactions under various conditions. Our review is accompanied by a Jupyter notebook which illustrates the discussed concepts and provides a practical tool for applying ICA to the analysis of cancer omics datasets.
Affiliation(s)
- Nicolas Sompairac
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
  - Centre de Recherches Interdisciplinaires, Université Paris Descartes, 75004 Paris, France.
- Petr V Nazarov
  - Multiomics Data Science Research Group, Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg.
- Urszula Czerwinska
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
- Laura Cantini
  - Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197, INSERM U1024, Ecole Normale Supérieure, PSL Research University, 75005 Paris, France.
- Anne Biton
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI, USR 3756 Institut Pasteur et CNRS), 75015 Paris, France.
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Zhaxybay Zhumadilov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
  - University Medical Center, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Emmanuel Barillot
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
- Francois Radvanyi
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - CNRS, UMR 144, 75248 Paris, France.
- Alexander Gorban
  - Center for Mathematical Modeling, University of Leicester, Leicester LE1 7RH, UK.
  - Lobachevsky University, 603022 Nizhny Novgorod, Russia.
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan.
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75005 Paris, France.
  - INSERM U900, 75248 Paris, France.
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
|
29
|
Avila Cobos F, Vandesompele J, Mestdagh P, De Preter K. Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics 2018; 34:1969-1979. [PMID: 29351586 DOI: 10.1093/bioinformatics/bty019] [Citation(s) in RCA: 130] [Impact Index Per Article: 26.0] [Received: 08/18/2017] [Accepted: 01/10/2018] [Indexed: 12/22/2022] Open
Abstract
Summary Gene expression analyses of bulk tissues often ignore cell type composition as an important confounding factor, resulting in a loss of signal from lowly abundant cell types. In this review, we highlight the importance and value of computational deconvolution methods to infer the abundance of different cell types and/or cell type-specific expression profiles in heterogeneous samples without performing physical cell sorting. We also explain the various deconvolution scenarios, the mathematical approaches used to solve them and the effect of data processing and different confounding factors on the accuracy of the deconvolution results. Contact katleen.depreter@ugent.be. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Francisco Avila Cobos
  - Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium.
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium.
  - Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium.
- Jo Vandesompele
  - Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium.
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium.
  - Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium.
- Pieter Mestdagh
  - Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium.
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium.
  - Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium.
- Katleen De Preter
  - Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium.
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium.
  - Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium.
|
30
|
Clarke R, Tyson JJ, Tan M, Baumann WT, Jin L, Xuan J, Wang Y. Systems biology: perspectives on multiscale modeling in research on endocrine-related cancers. Endocr Relat Cancer 2019; 26:R345-R368. [PMID: 30965282 PMCID: PMC7045974 DOI: 10.1530/erc-18-0309] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/25/2019] [Accepted: 04/08/2019] [Indexed: 12/12/2022]
Abstract
Drawing on concepts from experimental biology, computer science, informatics, mathematics and statistics, systems biologists integrate data across diverse platforms and scales of time and space to create computational and mathematical models of the integrative, holistic functions of living systems. Endocrine-related cancers are well suited to study from a systems perspective because of the signaling complexities arising from the roles of growth factors, hormones and their receptors as critical regulators of cancer cell biology and from the interactions among cancer cells, normal cells and signaling molecules in the tumor microenvironment. Moreover, growth factors, hormones and their receptors are often effective targets for therapeutic intervention, such as estrogen biosynthesis, estrogen receptors or HER2 in breast cancer and androgen receptors in prostate cancer. Given the complexity underlying the molecular control networks in these cancers, a simple, intuitive understanding of how endocrine-related cancers respond to therapeutic protocols has proved incomplete and unsatisfactory. Systems biology offers an alternative paradigm for understanding these cancers and their treatment. To correctly interpret the results of systems-based studies requires some knowledge of how in silico models are built, and how they are used to describe a system and to predict the effects of perturbations on system function. In this review, we provide a general perspective on the field of cancer systems biology, and we explore some of the advantages, limitations and pitfalls associated with using predictive multiscale modeling to study endocrine-related cancers.
Affiliation(s)
- Robert Clarke
  - Department of Oncology, Georgetown University Medical Center, Washington, District of Columbia, USA
- John J Tyson
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA
- Ming Tan
  - Department of Biostatistics, Bioinformatics & Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- William T Baumann
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA
- Lu Jin
  - Department of Oncology, Georgetown University Medical Center, Washington, District of Columbia, USA
- Jianhua Xuan
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, USA
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, USA
|
31
|
Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat Commun 2019; 10:2209. [PMID: 31101809 PMCID: PMC6525259 DOI: 10.1038/s41467-019-09990-5] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 04/11/2018] [Accepted: 04/11/2019] [Indexed: 11/08/2022] Open
Abstract
Changes in bulk transcriptional profiles of heterogeneous samples often reflect changes in proportions of individual cell types. Several robust techniques have been developed to dissect the composition of such mixed samples given transcriptional signatures of the pure components or their proportions. These approaches are insufficient, however, in situations when no information about individual mixture components is available. This problem is known as the complete deconvolution problem, where the composition is revealed without any a priori knowledge about cell types and their proportions. Here, we identify a previously unrecognized property of tissue-specific genes - their mutual linearity - and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem. Furthermore, our analysis reveals systematic bias of all deconvolution techniques due to differences in cell size or RNA-content, and we demonstrate how to address this bias at the experimental design level.
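To make the linear mixing assumption and the RNA-content bias mentioned in this abstract concrete, here is a minimal Python sketch. It is not the authors' complete deconvolution method: it assumes the cell-type signatures are already known (a partial deconvolution setting), and every number and function name in it is hypothetical. Bulk expression is modeled as a weighted sum of two signatures, the weights are recovered by ordinary least squares, and the recovered mRNA fractions are then rescaled by an assumed RNA amount per cell to obtain cell fractions.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def estimate_mrna_fractions(signatures, bulk):
    """Least-squares mixing weights for two cell types, normalized to sum to 1.

    Negative weights are clipped to zero, which is a crude stand-in for a
    proper non-negative least-squares solver.
    """
    n_types = 2
    # Normal equations: (S^T S) w = S^T y
    sts = [[sum(row[i] * row[j] for row in signatures) for j in range(n_types)]
           for i in range(n_types)]
    sty = [sum(row[i] * y for row, y in zip(signatures, bulk))
           for i in range(n_types)]
    weights = [max(w, 0.0) for w in solve_2x2(sts, sty)]
    total = sum(weights)
    return [w / total for w in weights]


def correct_for_rna_content(mrna_fractions, rna_per_cell):
    """Convert mRNA fractions to cell fractions given RNA amount per cell."""
    scaled = [w / r for w, r in zip(mrna_fractions, rna_per_cell)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical signatures: rows are genes, columns are cell types.
signatures = [[10.0, 1.0], [2.0, 8.0], [5.0, 5.0]]
cell_fractions = [0.5, 0.5]   # true proportions of cells
rna_per_cell = [1.0, 3.0]     # assumed: type 2 carries three times more RNA

# Each cell type contributes mRNA in proportion to cell count times RNA content.
mrna_total = sum(f * r for f, r in zip(cell_fractions, rna_per_cell))
mrna_fractions = [f * r / mrna_total for f, r in zip(cell_fractions, rna_per_cell)]
bulk = [sum(s * w for s, w in zip(row, mrna_fractions)) for row in signatures]

naive = estimate_mrna_fractions(signatures, bulk)       # tracks mRNA share, not cell share
corrected = correct_for_rna_content(naive, rna_per_cell)
```

In this toy setting the naive estimate follows the mRNA share (about 0.25 and 0.75) rather than the true 50/50 cell split, and dividing by the assumed RNA content per cell recovers the cell fractions.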
|
32
|
Radiomic analysis of imaging heterogeneity in tumours and the surrounding parenchyma based on unsupervised decomposition of DCE-MRI for predicting molecular subtypes of breast cancer. Eur Radiol 2019; 29:4456-4467. [PMID: 30617495 DOI: 10.1007/s00330-018-5891-3] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 05/28/2018] [Revised: 10/02/2018] [Accepted: 11/13/2018] [Indexed: 10/27/2022]
Abstract
OBJECTIVES This study aimed to predict the molecular subtypes of breast cancer via intratumoural and peritumoural radiomic analysis with subregion identification based on the decomposition of contrast-enhanced magnetic resonance imaging (DCE-MRI). METHODS The study included 211 women with histopathologically confirmed breast cancer. We utilised a completely unsupervised convex analysis of mixtures (CAM) method by unmixing dynamic imaging series from heterogeneous tissues. Each tumour and the surrounding parenchyma were thus decomposed into multiple subregions, representing different vascular characterisations, from which radiomic features were extracted. A random forest model was trained and tested using a leave-one-out cross-validation (LOOCV) method to predict breast cancer subtypes. The predictive models from tumoural and peritumoural subregions were fused for classification. RESULTS Tumour and peritumour DCE-MR images were decomposed into three compartments, representing plasma input, fast-flow kinetics, and slow-flow kinetics. The tumour subregion related to fast-flow kinetics showed the best performance among the subregions for differentiating between patients with four molecular subtypes (area under the receiver operating characteristic curve (AUC) = 0.832), exhibiting an AUC value significantly (p < 0.0001) higher than that obtained with the entire tumour (AUC = 0.719). When the tumour- and parenchyma-based predictive models were fused, the performance, measured as the AUC, increased to 0.897; this value was significantly higher than that obtained with other tumour partition methods. CONCLUSIONS Radiomic analysis of intratumoural and peritumoural heterogeneity based on the decomposition of image time-series signals has the potential to more accurately identify tumour kinetic features and serve as a valuable clinical marker to enhance the prediction of breast cancer subtypes. 
KEY POINTS • Decomposition of image time-series signals has the potential to more accurately identify tumour kinetic features. • Fusion of intratumoural- and peritumoural-based predictive models improves the prediction of breast cancer subtypes.
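The leave-one-out cross-validation (LOOCV) procedure named in the methods can be sketched as follows. This is only an illustration under stated assumptions: the study trained a random forest on radiomic features, whereas this sketch substitutes a one-dimensional nearest-centroid classifier to stay short, and the samples, labels, and function names are hypothetical.

```python
def nearest_centroid_predict(train, x):
    """Predict the label whose mean training feature is closest to x."""
    sums, counts = {}, {}
    for value, label in train:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def loocv_accuracy(samples):
    """Hold out each sample once, fit on the rest, and score the prediction."""
    correct = 0
    for i, (value, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if nearest_centroid_predict(train, value) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical one-feature data set with two subtypes.
samples = [(0.9, "A"), (1.1, "A"), (1.3, "A"), (2.9, "B"), (3.1, "B"), (3.3, "B")]
accuracy = loocv_accuracy(samples)
```

Each sample is predicted by a model that never saw it, which is why LOOCV is a common choice for cohorts of a few hundred patients where a separate test set would be costly.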
|
33
|
Hunt GJ, Freytag S, Bahlo M, Gagnon-Bartsch JA. dtangle: accurate and robust cell type deconvolution. Bioinformatics 2018; 35:2093-2099. [DOI: 10.1093/bioinformatics/bty926] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 12/21/2017] [Revised: 10/20/2018] [Accepted: 11/06/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation
Cell type composition of tissues is important in many biological processes. To help understand cell type composition using gene expression data, methods of estimating (deconvolving) cell type proportions have been developed. Such estimates are often used to adjust for confounding effects of cell type in differential expression analysis (DEA).
Results
We propose dtangle, a new cell type deconvolution method. dtangle works on a range of DNA microarray and bulk RNA-seq platforms. It estimates cell type proportions using publicly available, often cross-platform, reference data. We evaluate dtangle on 11 benchmark datasets showing that dtangle is competitive with published deconvolution methods, is robust to outliers and selection of tuning parameters, and is fast. As a case study, we investigate the human immune response to Lyme disease. dtangle’s estimates reveal a temporal trend consistent with previous findings and are important covariates for DEA across disease status.
Availability and implementation
dtangle is on CRAN (cran.r-project.org/package=dtangle) or github (dtangle.github.io).
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gregory J Hunt
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
- Saskia Freytag
  - Population Health and Immunity Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC, Australia
- Melanie Bahlo
  - Population Health and Immunity Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC, Australia
|
34
|
Dimitrakopoulou K, Wik E, Akslen LA, Jonassen I. Deblender: a semi-/unsupervised multi-operational computational method for complete deconvolution of expression data from heterogeneous samples. BMC Bioinformatics 2018; 19:408. [PMID: 30404611 PMCID: PMC6223087 DOI: 10.1186/s12859-018-2442-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/16/2018] [Accepted: 10/22/2018] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Towards discovering robust cancer biomarkers, it is imperative to unravel the cellular heterogeneity of patient samples and comprehend the interactions between cancer cells and the various cell types in the tumor microenvironment. The first generation of 'partial' computational deconvolution methods required prior information either on the cell/tissue type proportions or the cell/tissue type-specific expression signatures and the number of involved cell/tissue types. The second generation of 'complete' approaches allowed estimating both of the cell/tissue type proportions and cell/tissue type-specific expression profiles directly from the mixed gene expression data, based on known (or automatically identified) cell/tissue type-specific marker genes. RESULTS We present Deblender, a flexible complete deconvolution tool operating in semi-/unsupervised mode based on the user's access to known marker gene lists and information about cell/tissue composition. In case of no prior knowledge, global gene expression variability is used in clustering the mixed data to substitute marker sets with cluster sets. In addition, we integrate a model selection criterion to predict the number of constituent cell/tissue types. Moreover, we provide a tailored algorithmic scheme to estimate mixture proportions for realistic experimental cases where the number of involved cell/tissue types exceeds the number of mixed samples. We assess the performance of Deblender and a set of state-of-the-art existing tools on a comprehensive set of benchmark and patient cancer mixture expression datasets (including TCGA). CONCLUSION Our results corroborate that Deblender can be a valuable tool to improve understanding of gene expression datasets with implications for prediction and clinical utilization. Deblender is implemented in MATLAB and is available from ( https://github.com/kondim1983/Deblender/ ).
Affiliation(s)
- Konstantina Dimitrakopoulou
  - Centre for Cancer Biomarkers CCBIO, Department of Informatics, University of Bergen, Bergen, Norway.
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway.
- Elisabeth Wik
  - Centre for Cancer Biomarkers CCBIO, Department of Clinical Medicine, Section for Pathology, University of Bergen, Bergen, Norway.
  - Department of Pathology, Haukeland University Hospital, Bergen, Norway.
- Lars A Akslen
  - Centre for Cancer Biomarkers CCBIO, Department of Clinical Medicine, Section for Pathology, University of Bergen, Bergen, Norway.
  - Department of Pathology, Haukeland University Hospital, Bergen, Norway.
- Inge Jonassen
  - Centre for Cancer Biomarkers CCBIO, Department of Informatics, University of Bergen, Bergen, Norway.
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway.
|
35
|
Xie F, Zhou M, Xu Y. BayCount: A Bayesian decomposition method for inferring tumor heterogeneity using RNA-Seq counts. Ann Appl Stat 2018. [DOI: 10.1214/17-aoas1123] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
36
|
Lin CH, Chi CY, Chen L, Miller DJ, Wang Y. Detection of Sources in Non-Negative Blind Source Separation by Minimum Description Length Criterion. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4022-4037. [PMID: 28981430 DOI: 10.1109/tnnls.2017.2749279] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
While non-negative blind source separation (nBSS) has found many successful applications in science and engineering, model order selection, determining the number of sources, remains a critical yet unresolved problem. Various model order selection methods have been proposed and applied to real-world data sets but with limited success, with both order over- and under-estimation reported. By studying existing schemes, we have found that the unsatisfactory results are mainly due to invalid assumptions, model oversimplification, subjective thresholding, and/or to assumptions made solely for mathematical convenience. Building on our earlier work that reformulated model order selection for nBSS with more realistic assumptions and models, we report a newly and formally revised model order selection criterion rooted in the minimum description length (MDL) principle. Adopting widely invoked assumptions for achieving a unique nBSS solution, we consider the mixing matrix as consisting of deterministic unknowns, with the source signals following a multivariate Dirichlet distribution. We derive a computationally efficient, stochastic algorithm to obtain approximate maximum-likelihood estimates of model parameters and apply Monte Carlo integration to determine the description length. Our modeling and estimation strategy exploits the characteristic geometry of the data simplex in nBSS. We validate our nBSS-MDL criterion through extensive simulation studies and on four real-world data sets, demonstrating its strong performance and general applicability to nBSS. The proposed nBSS-MDL criterion consistently detects the true number of sources, in all of our case studies.
|
37
|
Herrington DM, Mao C, Parker SJ, Fu Z, Yu G, Chen L, Venkatraman V, Fu Y, Wang Y, Howard TD, Jun G, Zhao CF, Liu Y, Saylor G, Spivia WR, Athas GB, Troxclair D, Hixson JE, Vander Heide RS, Wang Y, Van Eyk JE. Proteomic Architecture of Human Coronary and Aortic Atherosclerosis. Circulation 2018; 137:2741-2756. [PMID: 29915101 PMCID: PMC6011234 DOI: 10.1161/circulationaha.118.034365] [Citation(s) in RCA: 90] [Impact Index Per Article: 15.0] [Received: 02/14/2018] [Accepted: 04/12/2018] [Indexed: 12/26/2022]
Abstract
BACKGROUND The inability to detect premature atherosclerosis significantly hinders implementation of personalized therapy to prevent coronary heart disease. A comprehensive understanding of arterial protein networks and how they change in early atherosclerosis could identify new biomarkers for disease detection and improved therapeutic targets. METHODS Here we describe the human arterial proteome and proteomic features strongly associated with early atherosclerosis based on mass spectrometry analysis of coronary artery and aortic specimens from 100 autopsied young adults (200 arterial specimens). Convex analysis of mixtures, differential dependent network modeling, and bioinformatic analyses defined the composition, network rewiring, and likely regulatory features of the protein networks associated with early atherosclerosis and how they vary across 2 anatomic distributions. RESULTS The data document significant differences in mitochondrial protein abundance between coronary and aortic samples (coronary>>aortic), and between atherosclerotic and normal tissues (atherosclerotic< CONCLUSIONS The human arterial proteome can be viewed as a complex network whose architectural features vary considerably as a function of anatomic location and the presence or absence of atherosclerosis. The data suggest important reductions in mitochondrial protein abundance in early atherosclerosis and also identify a subset of plasma proteins that are highly predictive of angiographically defined coronary disease.
Affiliation(s)
- David M Herrington
  - Section on Cardiovascular Medicine, Department of Internal Medicine (D.M.H., C.F.Z., G.S.)
- Chunhong Mao
  - Biocomplexity Institute of Virginia Tech, Virginia Tech, Blacksburg (C.M.)
- Sarah J Parker
  - Advanced Clinical Biosystems Research Institute, Cedars-Sinai Heart Institute, and Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA (S.T.P., V.V., W.R.S., J.E.V.E.)
- Zongming Fu
  - Johns Hopkins Medical Institute, Baltimore, MD (Z.F.)
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Tech, Arlington (G.Y., L.C., Y.F., Yizhi Wang, Yue Wang)
- Lulu Chen
  - Department of Electrical and Computer Engineering, Virginia Tech, Arlington (G.Y., L.C., Y.F., Yizhi Wang, Yue Wang)
- Vidya Venkatraman
  - Advanced Clinical Biosystems Research Institute, Cedars-Sinai Heart Institute, and Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA (S.T.P., V.V., W.R.S., J.E.V.E.)
- Yi Fu
  - Department of Electrical and Computer Engineering, Virginia Tech, Arlington (G.Y., L.C., Y.F., Yizhi Wang, Yue Wang)
- Yizhi Wang
  - Department of Electrical and Computer Engineering, Virginia Tech, Arlington (G.Y., L.C., Y.F., Yizhi Wang, Yue Wang)
- Goo Jun
  - Department of Epidemiology, Human Genetics and Environmental Sciences, Human Genetics Center, School of Public Health, University of Texas Health Science Center at Houston (G.J., J.E.H.)
- Caroline F Zhao
  - Section on Cardiovascular Medicine, Department of Internal Medicine (D.M.H., C.F.Z., G.S.)
- Yongmei Liu
  - Department of Epidemiology, Division of Public Health Sciences (Y.L.), Wake Forest School of Medicine, Winston-Salem, NC
- Georgia Saylor
  - Section on Cardiovascular Medicine, Department of Internal Medicine (D.M.H., C.F.Z., G.S.)
- Weston R Spivia
  - Advanced Clinical Biosystems Research Institute, Cedars-Sinai Heart Institute, and Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA (S.T.P., V.V., W.R.S., J.E.V.E.)
- Grace B Athas
  - Department of Pathology, Louisiana State Health Science Center, New Orleans (G.B.A., D.T., R.C.V.H.)
- Dana Troxclair
  - Department of Pathology, Louisiana State Health Science Center, New Orleans (G.B.A., D.T., R.C.V.H.)
- James E Hixson
  - Department of Epidemiology, Human Genetics and Environmental Sciences, Human Genetics Center, School of Public Health, University of Texas Health Science Center at Houston (G.J., J.E.H.)
- Richard S Vander Heide
  - Department of Pathology, Louisiana State Health Science Center, New Orleans (G.B.A., D.T., R.C.V.H.)
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Tech, Arlington (G.Y., L.C., Y.F., Yizhi Wang, Yue Wang)
- Jennifer E Van Eyk
  - Advanced Clinical Biosystems Research Institute, Cedars-Sinai Heart Institute, and Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA (S.T.P., V.V., W.R.S., J.E.V.E.)
|
38
|
Computational de novo discovery of distinguishing genes for biological processes and cell types in complex tissues. PLoS One 2018; 13:e0193067. [PMID: 29494600 PMCID: PMC5832224 DOI: 10.1371/journal.pone.0193067] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/14/2017] [Accepted: 02/02/2018] [Indexed: 11/30/2022] Open
Abstract
Bulk tissue samples examined by gene expression studies are usually heterogeneous. The data gained from these samples display the confounding patterns of mixtures consisting of multiple cell types or similar cell types in various functional states, which hinders the elucidation of the molecular mechanisms underlying complex biological phenomena. A realistic approach to compensate for the limitations of experimentally separating homogenous cell populations from mixed tissues is to computationally identify cell-type specific patterns from bulk, heterogeneous measurements. We designed the CellDistinguisher algorithm to analyze the gene expression data of mixed samples, identifying genes that best distinguish biological processes and cell types. Coupled with a deconvolution algorithm that takes cell type specific gene lists as input, we show that CellDistinguisher performs as well as partial deconvolution algorithms in predicting cell type composition without the need for prior knowledge of cell type signatures. This approach is also better in predicting cell type signatures than the one-step traditional complete deconvolution methods. To illustrate its wide applicability, the algorithm was tested on multiple publicly available data sets. In each case, CellDistinguisher identified genes reflecting biological processes typical for the tissues and development stages of interest and estimated the sample compositions accurately.
|
39
|
Houseman EA, Kile ML, Christiani DC, Ince TA, Kelsey KT, Marsit CJ. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics 2016; 17:259. [PMID: 27358049 PMCID: PMC4928286 DOI: 10.1186/s12859-016-1140-4] [Citation(s) in RCA: 160] [Impact Index Per Article: 20.0] [Received: 04/04/2016] [Accepted: 06/19/2016] [Indexed: 12/28/2022] Open
Abstract
Background Recent interest in reference-free deconvolution of DNA methylation data has led to several supervised methods, but these methods do not easily permit the interpretation of underlying cell types. Results We propose a simple method for reference-free deconvolution that provides both proportions of putative cell types defined by their underlying methylomes, the number of these constituent cell types, as well as a method for evaluating the extent to which the underlying methylomes reflect specific types of cells. We demonstrate these methods in an analysis of 23 Infinium data sets from 13 distinct data collection efforts; these empirical evaluations show that our algorithm can reasonably estimate the number of constituent types, return cell proportion estimates that demonstrate anticipated associations with underlying phenotypic data; and methylomes that reflect the underlying biology of constituent cell types. Conclusions Our methodology permits an explicit quantitation of the mediation of phenotypic associations with DNA methylation by cell composition effects. Although more work is needed to investigate functional information related to estimated methylomes, our proposed method provides a novel and useful foundation for conducting DNA methylation studies on heterogeneous tissues lacking reference data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1140-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- E Andres Houseman
  - School of Biological and Population Health Sciences, College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA.
- Molly L Kile
  - School of Biological and Population Health Sciences, College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA.
- David C Christiani
  - Department of Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
- Tan A Ince
  - Department of Pathology, University of Miami, Miller School of Medicine, Miami, FL, USA.
- Karl T Kelsey
  - Department of Epidemiology, Department of Pathology and Laboratory Medicine, Brown University, Providence, USA.
- Carmen J Marsit
  - Department of Community and Family Medicine, Dartmouth Medical School, Hanover, NH, USA.
|
40
|
Teschendorff AE, Jones A, Widschwendter M. Stochastic epigenetic outliers can define field defects in cancer. BMC Bioinformatics 2016; 17:178. [PMID: 27103033 PMCID: PMC4840974 DOI: 10.1186/s12859-016-1056-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 12/19/2015] [Accepted: 04/16/2016] [Indexed: 12/14/2022]
Abstract
Background: There is growing evidence that DNA methylation alterations may contribute to carcinogenesis. Recent data also suggest that DNA methylation field defects in normal pre-neoplastic tissue represent infrequent stochastic “outlier” events. This presents a statistical challenge for standard feature selection algorithms, which assume frequent alterations in a disease phenotype. Although differential variability has emerged as a novel feature selection paradigm for the discovery of outliers, a growing concern is that these could result from technical confounders, in principle thus favouring algorithms which are robust to outliers. Results: Here we evaluate five differential variability algorithms in over 700 DNA methylomes, including two of the largest cohorts profiling precursor cancer lesions, and demonstrate that most of the novel proposed algorithms lack the sensitivity to detect epigenetic field defects at genome-wide significance. In contrast, algorithms which recognise heterogeneous outlier DNA methylation patterns are able to identify many sites in pre-neoplastic lesions, which display progression in invasive cancer. Thus, we show that many DNA methylation outliers are not technical artefacts, but define epigenetic field defects which are selected for during cancer progression. Conclusions: Given that cancer studies aiming to find epigenetic field defects are likely to be limited by sample size, adopting the novel feature selection paradigm advocated here will be critical to increase assay sensitivity. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1056-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Andrew E Teschendorff
  - CAS Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; Statistical Cancer Genomics, Paul O'Gorman Building, UCL Cancer Institute, University College London, 72 Huntley Street, London, WC1E 6BT, UK; Department of Women's Cancer, University College London, 74 Huntley Street, London, WC1E 6AU, UK.
- Allison Jones
  - Department of Women's Cancer, University College London, 74 Huntley Street, London, WC1E 6AU, UK.
- Martin Widschwendter
  - Department of Women's Cancer, University College London, 74 Huntley Street, London, WC1E 6AU, UK.
|