1
Zhang R, Chen L, Oliver LD, Voineskos AN, Park JY. SAN: Mitigating spatial covariance heterogeneity in cortical thickness data collected from multiple scanners or sites. Hum Brain Mapp 2024; 45:e26692. [PMID: 38712767 PMCID: PMC11075170 DOI: 10.1002/hbm.26692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 03/27/2024] [Accepted: 04/08/2024] [Indexed: 05/08/2024] [Open Access]
Abstract
In neuroimaging studies, combining data collected from multiple study sites or scanners is becoming common to increase the reproducibility of scientific discoveries. At the same time, unwanted variations arise by using different scanners (inter-scanner biases), which need to be corrected before downstream analyses to facilitate replicable research and prevent spurious findings. While statistical harmonization methods such as ComBat have become popular in mitigating inter-scanner biases in neuroimaging, recent methodological advances have shown that harmonizing heterogeneous covariances results in higher data quality. In vertex-level cortical thickness data, heterogeneity in spatial autocorrelation is a critical factor that affects covariance heterogeneity. Our work proposes a new statistical harmonization method called spatial autocorrelation normalization (SAN) that preserves homogeneous covariance of vertex-level cortical thickness data across different scanners. We use an explicit Gaussian process to characterize scanner-invariant and scanner-specific variations to reconstruct spatially homogeneous data across scanners. SAN is computationally feasible, and it easily allows the integration of existing harmonization methods. We demonstrate the utility of the proposed method using cortical thickness data from the Social Processes Initiative in the Neurobiology of the Schizophrenia(s) (SPINS) study. SAN is publicly available as an R package.
Affiliation(s)
- Rongqian Zhang
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Linxi Chen
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Aristotle N. Voineskos
  - Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
- Jun Young Park
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Psychology, University of Toronto, Toronto, Ontario, Canada
2
Shu H, Qu Z, Zhu H. D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2022; 23:169. [PMID: 35983506 PMCID: PMC9380864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the L² space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
Affiliation(s)
- Hai Shu
  - Department of Biostatistics, New York University, New York, NY 10003, USA
- Zhe Qu
  - Department of Mathematics, Tulane University, New Orleans, LA 70118, USA
- Hongtu Zhu
  - Department of Biostatistics, Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
3
Chen T, Hua W, Xu B, Chen H, Xie M, Sun X, Ge X. Robust rank aggregation and cibersort algorithm applied to the identification of key genes in head and neck squamous cell cancer. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:4491-4507. [PMID: 34198450 DOI: 10.3934/mbe.2021228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
OBJECTIVE Although multiple hub genes have been identified in head and neck squamous cell cancer (HNSCC) in recent years, the results are not reliable because of limited sample sizes and inconsistent bioinformatics analysis methods. Therefore, reliable algorithms are urgently needed to find new prognostic markers of HNSCC. METHOD The Robust Rank Aggregation (RRA) method was used to integrate 8 microarray datasets of HNSCC downloaded from the Gene Expression Omnibus (GEO) database to screen differentially expressed genes (DEGs). Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were then carried out to characterize the functions of the DEGs. According to the KEGG results, the DEGs were closely associated with the occurrence and development of HNSCC. The CIBERSORT algorithm was then used to analyze immune cell infiltration in HNSCC, and the main infiltrating immune cells were found to be B cells, dendritic cells and macrophages. A protein-protein interaction (PPI) network was established, and key modules were constructed to select 5 hub genes from the whole network using cytoHubba. Three hub genes showed a significant relationship with prognosis in TCGA-derived HNSCC patients. RESULT The potent DEGs along with the hub genes were selected by the combined bioinformatic approach. The AURKA, BIRC5 and UBE2C genes may be potential prognostic biomarkers and therapeutic targets of HNSCC. CONCLUSIONS The Robust Rank Aggregation method and the CIBERSORT algorithm can accurately predict potential prognostic biomarkers and therapeutic targets of HNSCC from multiple GEO datasets.
Affiliation(s)
- Tingting Chen
  - Department of Radiation Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210000, China
  - Department of Oncology, Northern Jiangsu People's Hospital, Yangzhou, Jiangsu 225000, China
- Wei Hua
  - Department of Oncology, Northern Jiangsu People's Hospital, Yangzhou, Jiangsu 225000, China
- Bing Xu
  - Department of Radiation Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210000, China
- Hui Chen
  - Department of Radiation Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210000, China
- Minhao Xie
  - The First School of Clinical Medicine, Nanjing Medical University, Nanjing, Jiangsu 210000, China
- Xinchen Sun
  - Department of Radiation Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210000, China
- Xiaolin Ge
  - Department of Radiation Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210000, China
4
Determining the number of canonical correlation pairs for high-dimensional vectors. ANN I STAT MATH 2021. [DOI: 10.1007/s10463-020-00776-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
5
Shu H, Wang X, Zhu H. D-CCA: A Decomposition-based Canonical Correlation Analysis for High-Dimensional Datasets. J Am Stat Assoc 2020; 115:292-306. [PMID: 33311817 DOI: 10.1080/01621459.2018.1543599] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 10/27/2022]
Abstract
A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the L² space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between distinctive matrices. D-CCA represents a natural generalization of the traditional canonical correlation analysis. The proposed estimators of common and distinctive matrices are shown to be consistent and have reasonably better performance than some state-of-the-art methods in both simulated data and the real data analysis of breast cancer data obtained from The Cancer Genome Atlas.
Affiliation(s)
- Hai Shu
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center
- Xiao Wang
  - Department of Statistics, Purdue University
- Hongtu Zhu
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center
  - Department of Biostatistics, The University of North Carolina at Chapel Hill
6
Park JY, Lock EF. Integrative factorization of bidimensionally linked matrices. Biometrics 2020; 76:61-74. [PMID: 31444786 PMCID: PMC7036334 DOI: 10.1111/biom.13141] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/06/2019] [Accepted: 08/19/2019] [Indexed: 02/02/2023]
Abstract
Advances in molecular "omics" technologies have motivated new methodologies for the integration of multiple sources of high-content biomedical data. However, most statistical methods for integrating multiple data matrices only consider data shared vertically (one cohort on multiple platforms) or horizontally (different cohorts on a single platform). This is limiting for data that take the form of bidimensionally linked matrices (eg, multiple cohorts measured on multiple platforms), which are increasingly common in large-scale biomedical studies. In this paper, we propose bidimensional integrative factorization (BIDIFAC) for integrative dimension reduction and signal approximation of bidimensionally linked data matrices. Our method factorizes data into (a) globally shared, (b) row-shared, (c) column-shared, and (d) single-matrix structural components, facilitating the investigation of shared and unique patterns of variability. For estimation, we use a penalized objective function that extends the nuclear norm penalization for a single matrix. As an alternative to the complicated rank selection problem, we use results from the random matrix theory to choose tuning parameters. We apply our method to integrate two genomics platforms (messenger RNA and microRNA expression) across two sample cohorts (tumor samples and normal tissue samples) using the breast cancer data from the Cancer Genome Atlas. We provide R code for fitting BIDIFAC, imputing missing values, and generating simulated data.
Affiliation(s)
- Jun Young Park
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Eric F Lock
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
7
Gaynanova I, Li G. Structural learning and integrative decomposition of multi-view data. Biometrics 2019; 75:1121-1132. [PMID: 31254385 DOI: 10.1111/biom.13108] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 12/22/2018] [Accepted: 06/14/2019] [Indexed: 01/09/2023]
Abstract
The increased availability of multi-view data (data on the same samples from multiple sources) has led to strong interest in models based on low-rank matrix factorizations. These models represent each data view via shared and individual components, and have been successfully applied for exploratory dimension reduction, association analysis between the views, and consensus clustering. Despite these advances, there remain challenges in modeling partially-shared components and identifying the number of components of each type (shared/partially-shared/individual). We formulate a novel linked component model that directly incorporates partially-shared structures. We call this model SLIDE for Structural Learning and Integrative DEcomposition of multi-view data. The proposed model-fitting and selection techniques allow for joint identification of the number of components of each type, in contrast to existing sequential approaches. In our empirical studies, SLIDE demonstrates excellent performance in both signal estimation and component selection. We further illustrate the methodology on the breast cancer data from The Cancer Genome Atlas repository.
Affiliation(s)
- Irina Gaynanova
  - Department of Statistics, Texas A&M University, College Station, Texas
- Gen Li
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York City, New York
8
Cabanski C, Gilbert H, Mosesova S. Can Graphics Tell Lies? A Tutorial on How To Visualize Your Data. Clin Transl Sci 2018; 11:371-377. [PMID: 29603646 PMCID: PMC6039197 DOI: 10.1111/cts.12554] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/05/2017] [Accepted: 03/12/2017] [Indexed: 11/30/2022] [Open Access]
Abstract
Visualizations are a powerful tool for telling a story about a data set or analysis. If done correctly, visualizations not only display data but also help the audience digest key information. However, if done haphazardly, visualization has the potential to confuse the audience and, in the most extreme circumstances, deceive. In this tutorial, we provide a set of general principles for creating informative visualizations that tell a complete and accurate story of the data.
Affiliation(s)
- Sofia Mosesova
  - Denali Therapeutics Inc, South San Francisco, California, USA
9
Zhou YH, Marron JS, Wright FA. Computation of ancestry scores with mixed families and unrelated individuals. Biometrics 2017; 74:155-164. [PMID: 28452052 DOI: 10.1111/biom.12708] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 06/01/2016] [Revised: 03/01/2017] [Accepted: 03/01/2017] [Indexed: 01/03/2023]
Abstract
The issue of robustness to family relationships in computing genotype ancestry scores such as eigenvector projections has received increased attention in genetic association, and is particularly challenging when sets of both unrelated individuals and closely related family members are included. The current standard is to compute loadings (left singular vectors) using unrelated individuals and to compute projected scores for remaining family members. However, projected ancestry scores from this approach suffer from shrinkage toward zero. We consider two main novel strategies: (i) matrix substitution based on decomposition of a target family-orthogonalized covariance matrix, and (ii) using family-averaged data to obtain loadings. We illustrate the performance via simulations, including resampling from 1000 Genomes Project data, and analysis of a cystic fibrosis dataset. The matrix substitution approach has similar performance to the current standard, but is simple and uses only a genotype covariance matrix, while the family-average method shows superior performance. Our approaches are accompanied by novel ancillary approaches that provide considerable insight, including individual-specific eigenvalue scree plots.
Affiliation(s)
- Yi-Hui Zhou
  - Department of Biological Sciences, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, U.S.A
- James S Marron
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, U.S.A
- Fred A Wright
  - Department of Biological Sciences and Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, U.S.A
10
Shen D, Shen H, Zhu H, Marron JS. The Statistics and Mathematics of High Dimension Low Sample Size Asymptotics. Stat Sin 2016; 26:1747-1770. [PMID: 28018116 PMCID: PMC5173295 DOI: 10.5705/ss.202015.0088] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/07/2023]
Abstract
The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal an asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.
Affiliation(s)
- Hongtu Zhu
  - University of North Carolina at Chapel Hill
- J S Marron
  - University of North Carolina at Chapel Hill
11
Kuligowski J, Pérez-Guaita D, Sánchez-Illana Á, León-González Z, de la Guardia M, Vento M, Lock EF, Quintás G. Analysis of multi-source metabolomic data using joint and individual variation explained (JIVE). Analyst 2016; 140:4521-9. [PMID: 25988771 DOI: 10.1039/c5an00706b] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 11/21/2022]
Abstract
Metabolic profiling is increasingly being used for understanding biological processes but there is no single analytical technique that provides a complete quantitative or qualitative profiling of the metabolome. Data fusion (i.e. joint analysis of data from multiple sources) has the potential to circumvent this issue facilitating knowledge discovery and reliable biomarker identification. Another field of application of data fusion is the simultaneous analysis of metabolomic changes through several biofluids or tissues. However, metabolomics typically deals with large datasets, with hundreds to thousands of variables and the identification of shared and individual factors or structures across multiple sources is challenging due to the high variable to sample ratios and differences in intensity and noise range. In this work we apply a recent method, Joint and Individual Variation Explained (JIVE), for the integrated unsupervised analysis of metabolomic profiles from multiple data sources. This method separates the shared patterns among data sources (i.e. the joint structure) from the individual structure of each data source that is unrelated to the joint structure. Two examples are described to show the applicability of JIVE for the simultaneous analysis of multi-source data using: (i) plasma samples subjected to different analytical techniques, sample treatment and measurement conditions; and (ii) plasma and urine samples subjected to liquid chromatography-mass spectrometry measured using two ionization conditions.
Affiliation(s)
- Julia Kuligowski
  - Neonatal Research Centre, Health Research Institute La Fe, Valencia, Spain
12
Lock EF, Hoadley KA, Marron J, Nobel AB. JOINT AND INDIVIDUAL VARIATION EXPLAINED (JIVE) FOR INTEGRATED ANALYSIS OF MULTIPLE DATA TYPES. Ann Appl Stat 2013; 7:523-542. [PMID: 23745156 PMCID: PMC3671601 DOI: 10.1214/12-aoas597] [Citation(s) in RCA: 263] [Impact Index Per Article: 23.9] [Indexed: 02/06/2023]
Abstract
Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
Affiliation(s)
- Eric F. Lock
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
- Katherine A. Hoadley
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 450 West Dr. Chapel Hill, NC 27599
- J.S. Marron
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
- Andrew B. Nobel
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
13
Sîrbu A, Kerr G, Crane M, Ruskin HJ. RNA-Seq vs dual- and single-channel microarray data: sensitivity analysis for differential expression and clustering. PLoS One 2012; 7:e50986. [PMID: 23251411 PMCID: PMC3518479 DOI: 10.1371/journal.pone.0050986] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 10/05/2011] [Accepted: 10/30/2012] [Indexed: 01/13/2023] [Open Access]
Abstract
With the fast development of high-throughput sequencing technologies, a new generation of genome-wide gene expression measurements is under way. This is based on mRNA sequencing (RNA-seq), which complements the already mature technology of microarrays, and is expected to overcome some of the latter's disadvantages. These RNA-seq data pose new challenges, however, as strengths and weaknesses have yet to be fully identified. Ideally, Next (or Second) Generation Sequencing measures can be integrated for more comprehensive gene expression investigation to facilitate analysis of whole regulatory networks. At present, however, the nature of these data is not very well understood. In this paper we study three alternative gene expression time series datasets for the Drosophila melanogaster embryo development, in order to compare three measurement techniques: RNA-seq, single-channel and dual-channel microarrays. The aim is to study the state of the art for the three technologies, with a view of assessing overlapping features, data compatibility and integration potential, in the context of time series measurements. This involves using established tools for each of the three different technologies, and technical and biological replicates (for RNA-seq and microarrays, respectively), due to the limited availability of biological RNA-seq replicates for time series data. The approach consists of a sensitivity analysis for differential expression and clustering. In general, the RNA-seq dataset displayed highest sensitivity to differential expression. The single-channel data performed similarly for the differentially expressed genes common to gene sets considered. Cluster analysis was used to identify different features of the gene space for the three datasets, with higher similarities found for the RNA-seq and single-channel microarray dataset.
Affiliation(s)
- Alina Sîrbu
  - Centre for Scientific Computing and Complex Systems Modelling, Dublin City University, Dublin, Ireland
14
Wilkerson MD, Yin X, Hoadley KA, Liu Y, Hayward MC, Cabanski CR, Muldrew K, Miller CR, Randell SH, Socinski MA, Parsons AM, Funkhouser WK, Lee CB, Roberts PJ, Thorne L, Bernard PS, Perou CM, Hayes DN. Lung squamous cell carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal cell types. Clin Cancer Res 2010; 16:4864-75. [PMID: 20643781 DOI: 10.1158/1078-0432.ccr-10-0199] [Citation(s) in RCA: 206] [Impact Index Per Article: 14.7] [Indexed: 12/21/2022]
Abstract
PURPOSE Lung squamous cell carcinoma (SCC) is clinically and genetically heterogeneous, and current diagnostic practices do not adequately substratify this heterogeneity. A robust, biologically based SCC subclassification may describe this variability and lead to more precise patient prognosis and management. We sought to determine if SCC mRNA expression subtypes exist, are reproducible across multiple patient cohorts, and are clinically relevant. EXPERIMENTAL DESIGN Subtypes were detected by unsupervised consensus clustering in five published discovery cohorts of mRNA microarrays, totaling 382 SCC patients. An independent validation cohort of 56 SCC patients was collected and assayed by microarrays. A nearest-centroid subtype predictor was built using discovery cohorts. Validation cohort subtypes were predicted and evaluated for confirmation. Subtype survival outcome, clinical covariates, and biological processes were compared by statistical and bioinformatic methods. RESULTS Four lung SCC mRNA expression subtypes, named primitive, classical, secretory, and basal, were detected and independently validated (P < 0.001). The primitive subtype had the worst survival outcome (P < 0.05) and is an independent predictor of survival (P < 0.05). Tumor differentiation and patient sex were associated with subtype. The expression profiles of the subtypes contained distinct biological processes (primitive: proliferation; classical: xenobiotic metabolism; secretory: immune response; basal: cell adhesion) and suggested distinct pharmacologic interventions. Comparison with lung model systems revealed distinct subtype to cell type correspondence. CONCLUSIONS Lung SCC consists of four mRNA expression subtypes that have different survival outcomes, patient populations, and biological processes. The subtypes stratify patients for more precise prognosis and targeted research.
Affiliation(s)
- Matthew D Wilkerson
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 27599, USA