1
|
Matsui Y, Togayachi A, Sakamoto K, Angata K, Kadomatsu K, Nishihara S. Integrated Systems Analysis Deciphers Transcriptome and Glycoproteome Links in Alzheimer's Disease. bioRxiv: the preprint server for biology 2024:2023.12.25.573290. [PMID: 38234803 PMCID: PMC10793412 DOI: 10.1101/2023.12.25.573290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Glycosylation is increasingly recognized as a potential therapeutic target in Alzheimer's disease. In recent years, evidence of Alzheimer's disease-specific glycoproteins has been established. However, the mechanisms underlying their dysregulation, including tissue- and cell-type specificity, are not fully understood. We aimed to explore the upstream regulators of aberrant glycosylation by integrating multiple data sources using a glycogenomics approach. We identified dysregulation of the glycosyltransferase PLOD3 in oligodendrocytes as an upstream regulator of cerebral vessels and found that it is involved in COL4A5 synthesis, which is strongly correlated with amyloid fiber formation. Furthermore, COL4A5 has been suggested to interact with astrocytes via extracellular matrix receptors as a ligand. This study suggests directions for new therapeutic strategies for Alzheimer's disease targeting glycosyltransferases.
Affiliation(s)
- Yusuke Matsui
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Biomedical and Health Informatics Unit, Department of Integrated Health Science, Nagoya University Graduate School of Medicine, Daiko-minami, Higashi-ku, Nagoya, 461-8673, Japan
| | - Akira Togayachi
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
| | - Kazuma Sakamoto
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Department of Biochemistry, Nagoya University Graduate School of Medicine, Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
| | - Kiyohiko Angata
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
| | - Kenji Kadomatsu
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Department of Biochemistry, Nagoya University Graduate School of Medicine, Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
| | - Shoko Nishihara
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
| |
|
2
|
Smith RN, Rosales IA, Tomaszewski KT, Mahowald GT, Araujo-Medina M, Acheampong E, Bruce A, Rios A, Otsuka T, Tsuji T, Hotta K, Colvin R. Utility of Banff Human Organ Transplant Gene Panel in Human Kidney Transplant Biopsies. Transplantation 2023; 107:1188-1199. [PMID: 36525551 PMCID: PMC10132999 DOI: 10.1097/tp.0000000000004389] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 12/23/2022]
Abstract
BACKGROUND Microarray transcript analysis of human renal transplantation biopsies has successfully identified the many patterns of graft rejection. To evaluate an alternative, this report tests whether gene expression from the Banff Human Organ Transplant (B-HOT) probe set panel, derived from validated microarrays, can identify the relevant allograft diagnoses directly from archival human renal transplant formalin-fixed paraffin-embedded biopsies. To test this hypothesis, principal components (PCs) of gene expression were used to identify allograft diagnoses, to classify diagnoses, and to determine whether the PC data were rich enough to identify diagnostic subtypes by clustering, all of which are needed if the B-HOT panel is to substitute for microarrays. METHODS RNA was isolated from routine, archival formalin-fixed paraffin-embedded tissue renal biopsy cores with both rejection and nonrejection diagnoses. The B-HOT panel expression of 770 genes was analyzed by PCs, which were then tested to determine their ability to identify diagnoses. RESULTS PCs of microarray gene sets identified the Banff categories of renal allograft diagnoses and modeled the aggregate diagnoses well, showing a correspondence with the pathologic diagnoses similar to that of microarrays. Clustering of the PCs identified diagnostic subtypes including non-chronic antibody-mediated rejection with high endothelial expression. PCs of cell types and pathways identified new mechanistic patterns including differential expression of B and plasma cells. CONCLUSIONS Using PCs of gene expression from the B-HOT panel confirms the utility of the panel for identifying allograft diagnoses, with performance similar to microarrays. The B-HOT panel will accelerate and expand transcript analysis and will be useful for longitudinal and outcome studies.
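The PC-based classification described in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the toy expression values, the labels, the use of a single component found by power iteration, and the nearest-centroid rule are all assumptions made for the example.

```python
import math

def center(rows):
    """Subtract per-gene means; return the centered rows and the means."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows], means

def first_pc(centered, iters=200):
    """Leading principal axis via power iteration on X^T X."""
    p = len(centered[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def fit(rows, labels):
    """Project samples on the first PC and store one score centroid per label."""
    centered, means = center(rows)
    axis = first_pc(centered)
    scores = [sum(x * w for x, w in zip(row, axis)) for row in centered]
    centroids = {}
    for lab in set(labels):
        vals = [s for s, l in zip(scores, labels) if l == lab]
        centroids[lab] = sum(vals) / len(vals)
    return means, axis, centroids

def classify(sample, model):
    """Assign the label whose centroid is closest to the sample's PC score."""
    means, axis, centroids = model
    score = sum((x - m) * w for x, m, w in zip(sample, means, axis))
    return min(centroids, key=lambda lab: abs(score - centroids[lab]))

# Hypothetical toy data: three genes, six biopsies (values are invented).
EXPRESSION = [
    [5.0, 4.8, 1.0], [5.2, 5.1, 0.9], [4.9, 5.0, 1.1],
    [1.0, 1.2, 1.0], [0.9, 1.0, 1.2], [1.1, 0.8, 0.9],
]
LABELS = ["rejection"] * 3 + ["nonrejection"] * 3
MODEL = fit(EXPRESSION, LABELS)
print(classify([4.7, 4.9, 1.0], MODEL))
```

A real analysis would typically retain several components and validate the classifier on held-out biopsies; the single-component version is kept only for brevity.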
Affiliation(s)
- Rex N Smith
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
| | - Ivy A Rosales
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
| | - Kristen T Tomaszewski
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
| | - Grace T Mahowald
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Milagros Araujo-Medina
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Ellen Acheampong
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Amy Bruce
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Andrea Rios
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
| | - Takuya Otsuka
- Department of Surgical Pathology, Hokkaido University Hospital, Sapporo, Japan
| | - Takahiro Tsuji
- Department of Pathology, Sapporo City General Hospital, Sapporo, Japan
| | - Kiyohiko Hotta
- Department of Urology, Hokkaido University Hospital, Sapporo, Japan
| | - Robert Colvin
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
| |
|
3
|
Maghsoudi Z, Nguyen H, Tavakkoli A, Nguyen T. A comprehensive survey of the approaches for pathway analysis using multi-omics data integration. Brief Bioinform 2022; 23:6761962. [PMID: 36252928 PMCID: PMC9677478 DOI: 10.1093/bib/bbac435] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/28/2022] [Revised: 08/26/2022] [Accepted: 09/08/2022] [Indexed: 02/07/2023] [Open access]
Abstract
Pathway analysis has been widely used to detect pathways and functions associated with complex disease phenotypes. The proliferation of this approach is due to better interpretability of its results and its higher statistical power compared with the gene-level statistics. A plethora of pathway analysis methods that utilize multi-omics setup, rather than just transcriptomics or proteomics, have recently been developed to discover novel pathways and biomarkers. Since multi-omics gives multiple views into the same problem, different approaches are employed in aggregating these views into a comprehensive biological context. As a result, a variety of novel hypotheses regarding disease ideation and treatment targets can be formulated. In this article, we review 32 such pathway analysis methods developed for multi-omics and multi-cohort data. We discuss their availability and implementation, assumptions, supported omics types and databases, pathway analysis techniques and integration strategies. A comprehensive assessment of each method's practicality, and a thorough discussion of the strengths and drawbacks of each technique will be provided. The main objective of this survey is to provide a thorough examination of existing methods to assist potential users and researchers in selecting suitable tools for their data and analysis purposes, while highlighting outstanding challenges in the field that remain to be addressed for future development.
Affiliation(s)
- Zeynab Maghsoudi
- Department of Computer Science and Engineering, University of Nevada, Reno, 89557, Nevada, USA
| | - Ha Nguyen
- Department of Computer Science and Engineering, University of Nevada, Reno, 89557, Nevada, USA
| | - Alireza Tavakkoli
- Department of Computer Science and Engineering, University of Nevada, Reno, 89557, Nevada, USA
| | - Tin Nguyen
- Corresponding author: Tin Nguyen, Department of Computer Science and Engineering, University of Nevada, Reno, NV, USA. Tel.: +1-775-784-6619
| |
|
4
|
Zhang LX, Yan H, Liu Y, Xu J, Song J, Yu DJ. Enhancing Characteristic Gene Selection and Tumor Classification by the Robust Laplacian Supervised Discriminative Sparse PCA. J Chem Inf Model 2022; 62:1794-1807. [PMID: 35353532 DOI: 10.1021/acs.jcim.1c01403] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/08/2023]
Abstract
Characteristic gene selection and tumor classification of gene expression data play major roles in genomic research. Due to the characteristics of a small sample size and high dimensionality of gene expression data, it is a common practice to perform dimensionality reduction prior to the use of machine learning-based methods to analyze the expression data. In this context, classical principal component analysis (PCA) and its improved versions have been widely used. Recently, methods based on supervised discriminative sparse PCA have been developed to improve the performance of data dimensionality reduction. However, such methods still have limitations: most of them have not taken into consideration the improvement of robustness to outliers and noise, label information, sparsity, as well as capturing intrinsic geometrical structures in one objective function. To address this drawback, in this study, we propose a novel PCA-based method, known as the robust Laplacian supervised discriminative sparse PCA, termed RLSDSPCA, which enforces the L2,1 norm on the error function and incorporates the graph Laplacian into supervised discriminative sparse PCA. To evaluate the efficacy of the proposed RLSDSPCA, we applied it to the problems of characteristic gene selection and tumor classification problems using gene expression data. The results demonstrate that the proposed RLSDSPCA method, when used in combination with other related methods, can effectively identify new pathogenic genes associated with diseases. In addition, RLSDSPCA has also achieved the best performance compared with the state-of-the-art methods on tumor classification in terms of major performance metrics. The codes and data sets used in the study are freely available at http://csbio.njust.edu.cn/bioinf/rlsdspca/.
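The L2,1 norm mentioned in this abstract is commonly defined as the sum of the Euclidean norms of a matrix's rows, which penalises whole rows rather than individual entries. The sketch below uses that row-wise convention as an assumption; the paper may apply the norm over columns instead, and the example matrix is invented.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean (L2) norms of the rows of a matrix."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)

# Hypothetical residual matrix: one row per sample (row-wise convention assumed).
RESIDUALS = [
    [3.0, 4.0],
    [0.0, 0.0],
    [5.0, 12.0],
]
print(l21_norm(RESIDUALS))
```

Because each row contributes its unsquared length, a single outlying row grows the total linearly rather than quadratically, which is the usual argument for its robustness compared with the squared Frobenius norm.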
Affiliation(s)
- Lu-Xing Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - He Yan
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, Victoria 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
|
5
|
Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. Nat Mach Intell 2020. [DOI: 10.1038/s42256-020-00269-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
|
6
|
Deng Y, Wu S, Fan H. Genome-wide pathway-based quantitative multiple phenotypes analysis. PLoS One 2020; 15:e0240910. [PMID: 33175855 PMCID: PMC7657528 DOI: 10.1371/journal.pone.0240910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 11/18/2022] [Open access]
Abstract
For complex diseases, genome-wide pathway association studies have become increasingly promising. Currently, however, pathway-based association analysis mainly focuses on a single phenotype, which may be insufficient to describe complex diseases and physiological processes. This work proposes a combination model to evaluate the association between a pathway and multiple phenotypes and to reduce the run time based on asymptotic results. For a single phenotype, we propose a semi-supervised maximum kernel-based U-statistics (mSKU) method to assess the pathway-based association. For multiple phenotypes, we propose the Fisher combination function with dependent phenotypes (FC), which combines the p-values between the pathway and each marginal phenotype to achieve a pathway-based analysis of multiple phenotypes. With real data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study and the Human Liver Cohort (HLC) study, the FC-mSKU method allows us to specify which pathways are specific to a single phenotype or contribute to common genetic constructions of multiple phenotypes. If we only focus on single-phenotype tests, we may miss some findings for etiology studies. Through extensive simulation studies, the FC-mSKU method demonstrates its advantages compared with its counterparts.
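As background for the Fisher combination step, the classical Fisher method combines k p-values through the statistic -2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom when the p-values are independent. The sketch below implements only that textbook version; the abstract's FC function is designed for dependent phenotypes, so the independence assumption here is a simplification, and the example p-values are invented.

```python
import math

def fisher_combine(p_values):
    """Classical Fisher combination, assuming independent p-values.

    Returns the statistic -2 * sum(ln p) and its chi-square tail
    probability with 2k degrees of freedom, using the closed form
    available for even degrees of freedom.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i          # (x/2)^i / i!
        total += term
    return stat, math.exp(-half) * total

# Hypothetical per-phenotype p-values for one pathway.
stat, combined_p = fisher_combine([0.05, 0.05])
print(round(stat, 3), round(combined_p, 4))
```

With positively correlated phenotypes this independent-case p-value tends to be too small, which is the motivation for a dependence-aware combination.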
Affiliation(s)
- Yamin Deng
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China; Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
| | - Shiman Wu
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
| | - Huifang Fan
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
| |
|
7
|
Odom GJ, Ban Y, Colaprico A, Liu L, Silva TC, Sun X, Pico AR, Zhang B, Wang L, Chen X. PathwayPCA: an R/Bioconductor Package for Pathway Based Integrative Analysis of Multi-Omics Data. Proteomics 2020; 20:e1900409. [PMID: 32430990 PMCID: PMC7677175 DOI: 10.1002/pmic.201900409] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/31/2020] [Revised: 05/01/2020] [Indexed: 01/01/2023]
Abstract
The authors present pathwayPCA, an R/Bioconductor package for integrative pathway analysis that utilizes modern statistical methodology, including supervised and adaptive, elastic-net, sparse principal component analysis. pathwayPCA can be applied to continuous, binary, and survival outcomes in studies with multiple covariates and/or interaction effects. It outperforms several alternative methods at identifying disease-associated pathways in integrative analysis using both simulated and real datasets. In addition, several case studies are provided to illustrate pathwayPCA analysis with gene selection, estimating, and visualizing sample-specific pathway activities, identifying sex-specific pathway effects in kidney cancer, and building integrative models for predicting patient prognosis. pathwayPCA is an open-source R package, freely available through the Bioconductor repository. pathwayPCA is expected to be a useful tool for empowering the wider scientific community to analyze and interpret the wealth of available proteomics data, along with other types of molecular data recently made available by Clinical Proteomic Tumor Analysis Consortium and other large consortiums.
Affiliation(s)
- Gabriel J. Odom
- Department of Biostatistics, Florida International University, Stempel College of Public Health, Miami, FL 33199, USA
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Yuguang Ban
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Antonio Colaprico
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Lizhong Liu
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Tiago Chedraoui Silva
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Xiaodian Sun
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| | - Alexander R. Pico
- Institute for Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158, USA
| | - Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX 77030, USA
| | - Lily Wang
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Dr. John T Macdonald Foundation Department of Human Genetics, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL 33136, USA
| | - Xi Chen
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
| |
|
8
|
Yan KK, Wang X, Lam WWT, Vardhanabhuti V, Lee AWM, Pang HH. Radiomics analysis using stability selection supervised component analysis for right-censored survival data. Comput Biol Med 2020; 124:103959. [PMID: 32905923 PMCID: PMC7501167 DOI: 10.1016/j.compbiomed.2020.103959] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/19/2020] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 02/03/2023]
Abstract
Radiomics is a newly emerging field that involves the extraction of massive quantitative features from biomedical images by using data-characterization algorithms. Distinctive imaging features identified from biomedical images can be used for prognosis and therapeutic response prediction, and they can provide a noninvasive approach for personalized therapy. So far, many of the published radiomics studies identify prognostic markers from biomedical images using existing out-of-the-box algorithms that are not specific to radiomics data. To better utilize biomedical images, we propose a novel machine learning approach, stability selection supervised principal component analysis (SSSuperPCA), that identifies stable features from radiomics big data coupled with dimension reduction for right-censored survival outcomes. The proposed approach allows us to identify a set of stable features that are highly associated with the survival outcomes in a simple yet meaningful manner, while controlling the per-family error rate. We evaluate the performance of SSSuperPCA using simulations and real data sets for non-small cell lung cancer and head and neck cancer, and compare it with other machine learning algorithms. The results demonstrate that our method has a competitive edge over other existing methods in identifying the prognostic markers from biomedical imaging data for the prediction of right-censored survival outcomes.
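The stability selection idea can be illustrated with a small sketch: features are screened repeatedly on random half-samples, and only those chosen in a large fraction of rounds are kept. This is not SSSuperPCA itself. As simplifying assumptions, it screens by absolute Pearson correlation with a continuous outcome instead of a survival model, omits the principal component step and the error-rate control, and uses invented feature names and values.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def stability_selection(features, outcome, top_k=1, n_rounds=100,
                        threshold=0.8, seed=0):
    """Keep features that rank in the top_k on at least `threshold` of half-samples."""
    rng = random.Random(seed)
    n = len(outcome)
    counts = {name: 0 for name in features}
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        y = [outcome[i] for i in idx]
        scores = {name: abs(pearson([col[i] for i in idx], y))
                  for name, col in features.items()}
        for name in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            counts[name] += 1
    freq = {name: c / n_rounds for name, c in counts.items()}
    return [name for name, f in freq.items() if f >= threshold], freq

# Hypothetical data: one informative feature and two uninformative ones.
OUTCOME = [float(i) for i in range(1, 13)]
FEATURES = {
    "texture_a": [2.0 * y for y in OUTCOME],
    "texture_b": [1.0 if i % 2 == 0 else -1.0 for i in range(12)],
    "texture_c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0],
}
selected, freq = stability_selection(FEATURES, OUTCOME)
print(selected, freq)
```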
Affiliation(s)
- Kang K Yan
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Xiaofei Wang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
| | - Wendy W T Lam
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Jockey Club Institute of Cancer Care, Li Ka Shing Faculty of Medicine, Hong Kong SAR, China
| | - Varut Vardhanabhuti
- Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Anne W M Lee
- Department of Clinical Oncology, The University of Hong Kong-Shenzhen Hospital and The University of Hong Kong, Hong Kong SAR, China
| | - Herbert H Pang
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
| |
|
9
|
Somani J, Ramchandran S, Lähdesmäki H. A personalised approach for identifying disease-relevant pathways in heterogeneous diseases. NPJ Syst Biol Appl 2020; 6:17. [PMID: 32518234 PMCID: PMC7283216 DOI: 10.1038/s41540-020-0130-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2019] [Accepted: 03/12/2020] [Indexed: 11/30/2022] [Open access]
Abstract
Numerous time-course gene expression datasets have been generated for studying the biological dynamics that drive disease progression, and nearly as many methods have been proposed to analyse them. However, barely any method exists that can appropriately model time-course data while accounting for the heterogeneity that characterises many complex diseases. Most methods fulfil one of these requirements, but not both. The lack of appropriate methods hinders our capability of understanding the disease process and pursuing preventive treatments. We present a method that models time-course data in a personalised manner using Gaussian processes in order to identify differentially expressed genes (DEGs), and combines the DEG lists on a pathway level using permutation-based empirical hypothesis testing in order to overcome the gene-level variability and inconsistencies prevalent in datasets from heterogeneous diseases. Our method can be applied to study the time-course dynamics, as well as specific time-windows, of heterogeneous diseases. We apply our personalised approach to three longitudinal type 1 diabetes (T1D) datasets, where the first two are used to determine perturbations taking place during early prognosis of the disease, as well as in time-windows before autoantibody positivity and T1D diagnosis, and the third is used to assess the generalisability of our method. By comparing to non-personalised methods, we demonstrate that our approach is biologically motivated and can reveal more insights into the progression of heterogeneous diseases. With its robust capabilities of identifying disease-relevant pathways, our approach could be useful for predicting events in the progression of heterogeneous diseases and even for biomarker identification.
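A permutation-based empirical test at the pathway level can be sketched as follows. The sketch is a generic illustration under stated assumptions, not the authors' procedure: it takes a single DEG list rather than combining per-individual lists, uses the DEG count in the pathway as the test statistic, and builds the null distribution from random gene sets of the same size. All gene names are invented.

```python
import random

def pathway_permutation_p(deg_set, pathway_genes, all_genes,
                          n_perm=2000, seed=0):
    """Empirical p-value for the number of DEGs falling in a pathway."""
    rng = random.Random(seed)
    observed = len(deg_set & pathway_genes)
    size = len(pathway_genes)
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(all_genes, size)   # random gene set, same size
        if sum(g in deg_set for g in sampled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical gene universe, DEG list, and pathway.
ALL_GENES = [f"g{i}" for i in range(100)]
DEGS = {f"g{i}" for i in range(10)}
PATHWAY = {"g0", "g1", "g2", "g3", "g50"}

observed, p_value = pathway_permutation_p(DEGS, PATHWAY, ALL_GENES)
print(observed, p_value)
```

The smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the test.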
Affiliation(s)
- Juhi Somani
- Department of Computer Science, Aalto University, 02150, Espoo, Finland
| | | | - Harri Lähdesmäki
- Department of Computer Science, Aalto University, 02150, Espoo, Finland.
| |
|
10
|
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022] [Open access]
Abstract
Background In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels of linked genes in the genetic networks tend to be highly correlated with each other. Results We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. Conclusions The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods. Electronic supplementary material The online version of this article (10.1186/s12859-019-3040-x) contains supplementary material, which is available to authorized users.
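One common form of network-based regularization is a graph Laplacian penalty, which for an unweighted network equals the sum of squared coefficient differences across linked genes. The abstract does not state which form the authors use, so the sketch below is only an assumed illustration with the unnormalized Laplacian; the gene names, edges, and coefficients are invented.

```python
def laplacian_penalty(edges, beta):
    """beta^T L beta for an unweighted graph: sum over edges of (b_u - b_v)^2."""
    return sum((beta[u] - beta[v]) ** 2 for u, v in edges)

# Hypothetical three-gene network and gene-level coefficients.
EDGES = [("geneA", "geneB"), ("geneB", "geneC")]
SMOOTH = {"geneA": 1.0, "geneB": 1.0, "geneC": 1.0}
ROUGH = {"geneA": 1.0, "geneB": 1.0, "geneC": 0.0}

print(laplacian_penalty(EDGES, SMOOTH), laplacian_penalty(EDGES, ROUGH))
```

The penalty is zero when linked genes share the same coefficient and grows as neighbours diverge, which encourages selection of connected groups of genes.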
Affiliation(s)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea.
| |
|
11
|
Radiomics and MGMT promoter methylation for prognostication of newly diagnosed glioblastoma. Sci Rep 2019; 9:14435. [PMID: 31594994 PMCID: PMC6783410 DOI: 10.1038/s41598-019-50849-y] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 01/04/2019] [Accepted: 09/20/2019] [Indexed: 11/16/2022] [Open access]
Abstract
We attempted to establish a magnetic resonance imaging (MRI)-based radiomic model for stratifying prognostic subgroups of newly diagnosed glioblastoma (GBM) patients and predicting O6-methylguanine-DNA methyltransferase promoter methylation (pMGMT-met) status of the tumor. Preoperative MRI scans from 201 newly diagnosed GBM patients were included in this study. A total of 489 texture features including the first-order feature, second-order features from 162 datasets, and location data from 182 datasets were collected. Supervised principal component analysis was used for prognostication, and predictive modeling for pMGMT-met status was performed based on least absolute shrinkage and selection operator regression. Twenty-two radiomic features that were correlated with prognosis were used to successfully stratify patients into high-risk and low-risk groups (p = 0.004, log-rank test). The radiomic high- and low-risk stratification and pMGMT status were independent prognostic factors. The predictive accuracy of the pMGMT methylation status was 67% when modeled by two significant radiomic features. A significant survival difference was observed among the combined high-risk group, combined intermediate-risk group (this group consists of radiomic low risk and pMGMT-unmet or radiomic high risk and pMGMT-met), and combined low-risk group (p = 0.0003, log-rank test). Radiomics can be used to build a prognostic score for stratifying high- and low-risk GBM, which was a prognostic factor independent of pMGMT methylation status. However, the predictive accuracy of the pMGMT methylation status by radiomic analysis was insufficient for practical use.
|
12
|
Bani-Sadr A, Eker OF, Berner LP, Ameli R, Hermier M, Barritault M, Meyronet D, Guyotat J, Jouanneau E, Honnorat J, Ducray F, Berthezene Y. Conventional MRI radiomics in patients with suspected early- or pseudo-progression. Neurooncol Adv 2019; 1:vdz019. [PMID: 32642655 PMCID: PMC7212855 DOI: 10.1093/noajnl/vdz019] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/06/2023] [Open access]
Abstract
Background After radiochemotherapy, 30% of patients with early worsening MRI experience pseudoprogression (Psp) which is not distinguishable from early progression (EP). We aimed to assess the diagnostic value of radiomics in patients with suspected EP or Psp. Methods Radiomics features (RF) of 76 patients (53 EP and 23 Psp) retrospectively identified were extracted from conventional MRI based on four volumes-of-interest. Subjects were randomly assigned into training and validation groups. Classification model (EP versus Psp) consisted of a random forest algorithm after univariate filtering. Overall (OS) and progression-free survivals (PFS) were predicted using a semi-supervised principal component analysis, and forecasts were evaluated using C-index and integrated Brier scores (IBS). Results Using 11 RFs, radiomics classified patients with 75.0% and 76.0% accuracy, 81.6% and 94.1% sensitivity, 50.0% and 37.5% specificity, respectively, in training and validation phases. Addition of MGMT promoter status improved accuracy to 83% and 79.2%, and specificity to 63.6% and 75%. OS model included 14 RFs and stratified low- and high-risk patients both in the training (hazard ratio [HR], 3.63; P = .002) and the validation (HR, 3.76; P = .001) phases. Similarly, PFS model stratified patients during training (HR, 2.58; P = .005) and validation (HR, 3.58; P = .004) phases using 5 RF. OS and PFS forecasts had C-index of 0.65 and 0.69, and IBS of 0.122 and 0.147, respectively. Conclusions Conventional MRI radiomics has promising diagnostic value, especially when combined with MGMT promoter status, but with moderate specificity. In addition, our results suggest a potential for predicting OS and PFS.
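The C-index used to evaluate the survival forecasts measures how often, among comparable patient pairs, the patient with the higher predicted risk experiences the event first. The sketch below implements the basic Harrell form as an illustration; it ignores tied event times, and the follow-up times, event indicators, and risk scores are invented rather than taken from the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (tied times not handled).

    A pair (i, j) is comparable when subject i had an observed event
    and a shorter follow-up time than subject j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in predicted risk count half
    return concordant / comparable

# Hypothetical follow-up (months), event flags (1 = event, 0 = censored), risks.
TIMES = [2, 4, 6, 8]
EVENTS = [1, 1, 0, 1]
RISKS = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(TIMES, EVENTS, RISKS))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported values of 0.65 and 0.69 in the moderate range.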
Affiliation(s)
- Alexandre Bani-Sadr
- Department of Neuroradiology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Omer Faruk Eker
- Department of Neuroradiology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Lise-Prune Berner
- Department of Neuroradiology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Roxana Ameli
- Department of Neuroradiology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Marc Hermier
- Department of Neuroradiology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Marc Barritault
- Department of Molecular Biology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - David Meyronet
- Department of Neuropathology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - Jacques Guyotat
- Department of Neurosurgery, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France.,Université Claude Bernard Lyon 1, Villeurbanne, France
| | - Emmanuel Jouanneau
- Department of Neurosurgery, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France.,Université Claude Bernard Lyon 1, Villeurbanne, France
| | - Jerome Honnorat
- Université Claude Bernard Lyon 1, Villeurbanne, France.,Department of Neuro-Oncology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
| | - François Ducray
- Université Claude Bernard Lyon 1, Villeurbanne, France.,Department of Neuro-Oncology, East Group Hospital, Hospices Civils de Lyon, Lyon Cedex, France
|
13
|
Walsh EE, Mariani TJ, Chu C, Grier A, Gill SR, Qiu X, Wang L, Holden-Wiltse J, Corbett A, Thakar J, Benoodt L, McCall MN, Topham DJ, Falsey AR, Caserta MT. Aims, Study Design, and Enrollment Results From the Assessing Predictors of Infant Respiratory Syncytial Virus Effects and Severity Study. JMIR Res Protoc 2019; 8:e12907. [PMID: 31199303 PMCID: PMC6595944 DOI: 10.2196/12907] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/25/2018] [Revised: 03/01/2019] [Accepted: 03/03/2019] [Indexed: 01/04/2023] Open
Abstract
Background The majority of infants hospitalized with primary respiratory syncytial virus (RSV) infection have no obvious risk factors for severe disease. Objective The aim of this study (Assessing Predictors of Infant RSV Effects and Severity, AsPIRES) was to identify factors associated with severe disease in full-term healthy infants younger than 10 months with primary RSV infection. Methods RSV infected infants were enrolled from 3 cohorts during consecutive winters from August 2012 to April 2016 in Rochester, New York. A birth cohort was prospectively enrolled and followed through their first winter for development of RSV infection. An outpatient supplemental cohort was enrolled in the emergency department or pediatric offices, and a hospital cohort was enrolled on admission with RSV infection. RSV was diagnosed by reverse transcriptase-polymerase chain reaction. Demographic and clinical data were recorded and samples collected for assays: buccal swab (cytomegalovirus polymerase chain reaction, PCR), nasal swab (RSV qualitative PCR, complete viral gene sequence, 16S ribosomal ribonucleic acid [RNA] amplicon microbiota analysis), nasal wash (chemokine and cytokine assays), nasal brush (nasal respiratory epithelial cell gene expression using RNA sequencing [RNAseq]), and 2 to 3 ml of heparinized blood (flow cytometry, RNAseq analysis of purified cluster of differentiation [CD]4+, CD8+, B cells and natural killer cells, and RSV-specific antibody). Cord blood (RSV-specific antibody) was also collected for the birth cohort. Univariate and multivariate logistic regression will be used for analysis of data using a continuous Global Respiratory Severity Score (GRSS) as the outcome variable. Novel statistical methods will be developed for integration of the large complex datasets. Results A total of 453 infants were enrolled into the 3 cohorts; 226 in the birth cohort, 60 in the supplemental cohort, and 78 in the hospital cohort. 
A total of 126 birth cohort infants remained in the study and were evaluated for 150 respiratory illnesses. Of the 60 RSV positive infants in the supplemental cohort, 42 completed the study, whereas all 78 of the RSV positive hospital cohort infants completed the study. A GRSS was calculated for each RSV-infected infant and is being used to analyze each of the complex datasets by correlation with disease severity in univariate and multivariate methods. Conclusions The AsPIRES study will provide insights into the complex pathogenesis of RSV infection in healthy full-term infants with primary RSV infection. The analysis will allow assessment of multiple factors potentially influencing the severity of RSV infection including the level of RSV specific antibodies, the innate immune response of nasal epithelial cells, the adaptive response by various lymphocyte subsets, the resident airway microbiota, and viral factors. Results of this study will inform disease interventions such as vaccines and antiviral therapies.
Affiliation(s)
- Edward E Walsh
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Thomas J Mariani
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - ChinYi Chu
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Alex Grier
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Steven R Gill
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Xing Qiu
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Lu Wang
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Jeanne Holden-Wiltse
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Anthony Corbett
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Juilee Thakar
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Lauren Benoodt
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Matthew N McCall
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - David J Topham
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Ann R Falsey
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
| | - Mary T Caserta
- University of Rochester School of Medicine and Dentistry, Rochester, NY, United States
|
14
|
Recent Advances in Supervised Dimension Reduction: A Survey. Machine Learning and Knowledge Extraction 2019. [DOI: 10.3390/make1010020] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Indexed: 12/21/2022]
Abstract
Recently, we have witnessed an explosive growth in both the quantity and dimension of data generated, which aggravates the high dimensionality challenge in tasks such as predictive modeling and decision support. Up to now, a large number of unsupervised dimension reduction methods have been proposed and studied. However, there is no specific review focusing on the supervised dimension reduction problem. Most studies performed classification or regression after unsupervised dimension reduction. However, learning the low-dimensional representation and the classification/regression model simultaneously offers two advantages: high accuracy and effective representation. Considering classification or regression as the main goal of dimension reduction, the purpose of this paper is to summarize and organize the current developments in the field into three main classes: PCA-based, Non-negative Matrix Factorization (NMF)-based, and manifold-based supervised dimension reduction methods, as well as to provide detailed discussions of their advantages and disadvantages. Moreover, we outline a dozen open problems that can be further explored to advance the development of this topic.
|
15
|
Majumdar S, Basak SC, Lungu CN, Diudea MV, Grunwald GD. Mathematical structural descriptors and mutagenicity assessment: a study with congeneric and diverse datasets. SAR and QSAR in Environmental Research 2018; 29:579-590. [PMID: 30025481 DOI: 10.1080/1062936x.2018.1496475] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/04/2018] [Accepted: 07/01/2018] [Indexed: 06/08/2023]
Abstract
Quantitative bioactivity and toxicity assessment of chemical compounds plays a central role in drug discovery as it saves a substantial amount of resources. To this end, high-performance computing has enabled researchers and practitioners to leverage hundreds, or even thousands, of computed molecular descriptors for the activity prediction of candidate compounds. In this paper, we evaluate the utility of two large groups of chemical descriptors by such predictive modelling, as well as chemical structure discovery, through empirical analysis. We use a suite of commercially available and in-house software to calculate molecular descriptors for two sets of chemical mutagens - a homogeneous set of 95 amines, and a diverse set of 508 chemicals. Using calculated descriptors, we model the mutagenic activity of these compounds using a number of methods from the statistics and machine-learning literature, and use robust principal component analysis to investigate the low-dimensional subspaces that characterize these chemicals. Our results suggest that combining different sets of descriptors is likely to result in a better predictive model - but that depends on the compounds being modelled and the modelling technique being used.
Affiliation(s)
- S Majumdar
- University of Florida Informatics Institute, Gainesville, USA
| | - S C Basak
- Department of Chemistry and Biochemistry, University of Minnesota, Duluth, MN, USA
| | - C N Lungu
- Department of Chemistry, Babes-Bolyai University, Cluj-Napoca, Romania
| | - M V Diudea
- Department of Chemistry, Babes-Bolyai University, Cluj-Napoca, Romania
| | - G D Grunwald
- Natural Resources Research Institute, University of Minnesota, Duluth, USA
|
16
|
Meng Y, Cai XH, Wang L. Potential Genes and Pathways of Neonatal Sepsis Based on Functional Gene Set Enrichment Analyses. Computational and Mathematical Methods in Medicine 2018; 2018:6708520. [PMID: 30154914 PMCID: PMC6091373 DOI: 10.1155/2018/6708520] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/18/2018] [Revised: 06/04/2018] [Accepted: 06/27/2018] [Indexed: 12/16/2022]
Abstract
BACKGROUND Neonatal sepsis (NS) is considered the most common cause of neonatal death. Although numerous studies focus on gene biomarkers of NS, the predictive value of these biomarkers is low, and NS pathogenesis still needs to be investigated. METHODS After data preprocessing, we used the KEGG enrichment method to identify differentially expressed pathways between NS and normal controls. Then, functional principal component analysis (FPCA) was adopted to calculate gene values in NS. To further study the key signaling pathways of NS, an elastic-net regression model, the Mann-Whitney U (MWU) test, and a coexpression network were used to estimate the weights of signaling pathways and hub genes. RESULTS A total of 115 different pathways between NS and controls were first identified. FPCA made full use of time-series gene expression information and estimated F values of genes in the different pathways. The top 1000 genes were considered the differential genes and were further analyzed by elastic-net regression and the MWU test. There were 7 key signaling pathways between NS and controls, according to different sources. Among the genes involved in key pathways, 7 hub genes (PIK3CA, TGFBR2, CDKN1B, KRAS, E2F3, TRAF6, and CHUK) were determined based on the coexpression network. Most of them were cancer-related genes. PIK3CA was considered the common marker and is highly expressed in the lymphocyte group. Little is known about the correlation of PIK3CA with NS, which suggests a new direction for NS research. CONCLUSION This research might provide perspective for exploring potential novel genes and pathways as NS therapy targets.
Affiliation(s)
- YuXiu Meng
- Department of Neonatology, First People's Hospital of Jining, Jining, Shandong 272000, China
| | - Xue Hong Cai
- Department of Pediatrics, Traditional Chinese Medicine Hospital of Yanzhou, Jining, Shandong 272100, China
| | - LiPei Wang
- Department of Neonatology, First People's Hospital of Jining, Jining, Shandong 272000, China
|
17
|
Kickingereder P, Götz M, Muschelli J, Wick A, Neuberger U, Shinohara RT, Sill M, Nowosielski M, Schlemmer HP, Radbruch A, Wick W, Bendszus M, Maier-Hein KH, Bonekamp D. Large-scale Radiomic Profiling of Recurrent Glioblastoma Identifies an Imaging Predictor for Stratifying Anti-Angiogenic Treatment Response. Clin Cancer Res 2016; 22:5765-5771. [PMID: 27803067 DOI: 10.1158/1078-0432.ccr-16-0702] [Citation(s) in RCA: 193] [Impact Index Per Article: 24.1] [Received: 03/17/2016] [Revised: 06/23/2016] [Accepted: 07/07/2016] [Indexed: 01/03/2023]
Abstract
PURPOSE Antiangiogenic treatment with bevacizumab, a mAb to the VEGF, is the single most widely used therapeutic agent for patients with recurrent glioblastoma. A major challenge is that there are currently no validated biomarkers that can predict treatment outcome. Here we analyze the potential of radiomics, an emerging field of research that aims to utilize the full potential of medical imaging. EXPERIMENTAL DESIGN A total of 4,842 quantitative MRI features were automatically extracted and analyzed from the multiparametric tumor of 172 patients (allocated to a discovery and validation set with a 2:1 ratio) with recurrent glioblastoma prior to bevacizumab treatment. Leveraging a high-throughput approach, radiomic features of patients in the discovery set were subjected to a supervised principal component (superpc) analysis to generate a prediction model for stratifying treatment outcome to antiangiogenic therapy by means of both progression-free and overall survival (PFS and OS). RESULTS The superpc predictor stratified patients in the discovery set into a low or high risk group for PFS (HR = 1.60; P = 0.017) and OS (HR = 2.14; P < 0.001) and was successfully validated for patients in the validation set (HR = 1.85, P = 0.030 for PFS; HR = 2.60, P = 0.001 for OS). CONCLUSIONS Our radiomic-based superpc signature emerges as a putative imaging biomarker for the identification of patients who may derive the most benefit from antiangiogenic therapy, advances the knowledge in the noninvasive characterization of brain tumors, and stresses the role of radiomics as a novel tool for improving decision support in cancer treatment at low cost. Clin Cancer Res; 22(23); 5765-71. ©2016 AACR.
Affiliation(s)
- Philipp Kickingereder
- Department of Neuroradiology, University of Heidelberg Medical Center, Heidelberg, Germany
| | - Michael Götz
- Medical Image Computing, Division Medical and Biological Informatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - John Muschelli
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
| | - Antje Wick
- Neurology Clinic, University of Heidelberg Medical Center, Heidelberg, Germany
| | - Ulf Neuberger
- Department of Neuroradiology, University of Heidelberg Medical Center, Heidelberg, Germany
| | - Russell T Shinohara
- Department of Biostatistics and Epidemiology, Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Martin Sill
- Division of Biostatistics, DKFZ, Heidelberg, Germany
| | - Martha Nowosielski
- Department of Neurology, The Medical University of Innsbruck, Innsbruck, Austria
| | | | - Alexander Radbruch
- Department of Neuroradiology, University of Heidelberg Medical Center, Heidelberg, Germany.,Department of Radiology, DKFZ, Heidelberg, Germany
| | - Wolfgang Wick
- Neurology Clinic, University of Heidelberg Medical Center, Heidelberg, Germany.,Clinical Cooperation Unit Neurooncology, German Cancer Consortium (DKTK), DKFZ, Heidelberg, Germany
| | - Martin Bendszus
- Department of Neuroradiology, University of Heidelberg Medical Center, Heidelberg, Germany
| | - Klaus H Maier-Hein
- Medical Image Computing, Division Medical and Biological Informatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - David Bonekamp
- Department of Neuroradiology, University of Heidelberg Medical Center, Heidelberg, Germany.,Department of Radiology, DKFZ, Heidelberg, Germany
|
18
|
Chan WH, Mohamad MS, Deris S, Zaki N, Kasim S, Omatu S, Corchado JM, Al Ashwal H. Identification of informative genes and pathways using an improved penalized support vector machine with a weighting scheme. Comput Biol Med 2016; 77:102-15. [PMID: 27522238 DOI: 10.1016/j.compbiomed.2016.08.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 03/21/2016] [Revised: 08/03/2016] [Accepted: 08/03/2016] [Indexed: 01/03/2023]
Abstract
Incorporation of pathway knowledge into microarray analysis has brought better biological interpretation of the analysis outcome. However, most pathway data are manually curated without specific biological context. Non-informative genes could be included when the pathway data is used for analysis of context specific data like cancer microarray data. Therefore, efficient identification of informative genes is inevitable. Embedded methods like penalized classifiers have been used for microarray analysis due to their embedded gene selection. This paper proposes an improved penalized support vector machine with absolute t-test weighting scheme to identify informative genes and pathways. Experiments are done on four microarray data sets. The results are compared with previous methods using 10-fold cross validation in terms of accuracy, sensitivity, specificity and F-score. Our method shows consistent improvement over the previous methods and biological validation has been done to elucidate the relation of the selected genes and pathway with the phenotype under study.
Affiliation(s)
- Weng Howe Chan
- Artificial Intelligence and Bioinformatics Research Group, Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
| | - Mohd Saberi Mohamad
- Artificial Intelligence and Bioinformatics Research Group, Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia.
| | - Safaai Deris
- Faculty of Creative Technology & Heritage, Universiti Malaysia Kelantan, Locked Bag 01, Bachok, 16300 Kota Bharu, Kelantan, Malaysia
| | - Nazar Zaki
- College of Information Technology, United Arab Emirate University, Al Ain 15551, United Arab Emirates
| | - Shahreen Kasim
- Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Batu Pahat, Malaysia
| | - Sigeru Omatu
- Department of Electronics, Information and Communication Engineering, Osaka Institute of Technology, Osaka 535-8585, Japan
| | - Juan Manuel Corchado
- Biomedical Research Institute of Salamanca/BISITE Research Group, University of Salamanca, Salamanca, Spain
| | - Hany Al Ashwal
- College of Information Technology, United Arab Emirate University, Al Ain 15551, United Arab Emirates
|
19
|
Kickingereder P, Burth S, Wick A, Götz M, Eidel O, Schlemmer HP, Maier-Hein KH, Wick W, Bendszus M, Radbruch A, Bonekamp D. Radiomic Profiling of Glioblastoma: Identifying an Imaging Predictor of Patient Survival with Improved Performance over Established Clinical and Radiologic Risk Models. Radiology 2016; 280:880-9. [PMID: 27326665 DOI: 10.1148/radiol.2016160845] [Citation(s) in RCA: 274] [Impact Index Per Article: 34.3] [Indexed: 11/11/2022]
Abstract
Purpose To evaluate whether radiomic feature-based magnetic resonance (MR) imaging signatures allow prediction of survival and stratification of patients with newly diagnosed glioblastoma with improved accuracy compared with that of established clinical and radiologic risk models. Materials and Methods Retrospective evaluation of data was approved by the local ethics committee and informed consent was waived. A total of 119 patients (allocated in a 2:1 ratio to a discovery [n = 79] or validation [n = 40] set) with newly diagnosed glioblastoma were subjected to radiomic feature extraction (12 190 features extracted, including first-order, volume, shape, and texture features) from the multiparametric (contrast material-enhanced T1-weighted and fluid-attenuated inversion-recovery imaging sequences) and multiregional (contrast-enhanced and unenhanced) tumor volumes. Radiomic features of patients in the discovery set were subjected to a supervised principal component (SPC) analysis to predict progression-free survival (PFS) and overall survival (OS) and were validated in the validation set. The performance of a Cox proportional hazards model with the SPC analysis predictor was assessed with C index and integrated Brier scores (IBS, lower scores indicating higher accuracy) and compared with Cox models based on clinical (age and Karnofsky performance score) and radiologic (Gaussian normalized relative cerebral blood volume and apparent diffusion coefficient) parameters. Results SPC analysis allowed stratification based on 11 features of patients in the discovery set into a low- or high-risk group for PFS (hazard ratio [HR], 2.43; P = .002) and OS (HR, 4.33; P < .001), and the results were validated successfully in the validation set for PFS (HR, 2.28; P = .032) and OS (HR, 3.45; P = .004). 
The performance of the SPC analysis (OS: IBS, 0.149; C index, 0.654; PFS: IBS, 0.138; C index, 0.611) was higher compared with that of the radiologic (OS: IBS, 0.175; C index, 0.603; PFS: IBS, 0.149; C index, 0.554) and clinical risk models (OS: IBS, 0.161, C index, 0.640; PFS: IBS, 0.139; C index, 0.599). The performance of the SPC analysis model was further improved when combined with clinical data (OS: IBS, 0.142; C index, 0.696; PFS: IBS, 0.132; C index, 0.637). Conclusion An 11-feature radiomic signature that allows prediction of survival and stratification of patients with newly diagnosed glioblastoma was identified, and improved performance compared with that of established clinical and radiologic risk models was demonstrated. © RSNA, 2016. Online supplemental material is available for this article.
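The C index values quoted in this abstract measure how often a model ranks pairs of patients in the correct order of survival. The sketch below shows one common pairwise definition (Harrell's C) in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name, the handling of tied survival times (skipped here), and the toy data are all hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell-style C index.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = expected earlier event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # a is the subject with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the order is unknown
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one censored.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.7, 0.8, 0.1],
)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why values in the 0.6 to 0.7 range are read as moderate discrimination. The integrated Brier score is a separate, calibration-sensitive measure and is not covered by this sketch.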
Affiliation(s)
- Philipp Kickingereder
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Sina Burth
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Antje Wick
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Michael Götz
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Oliver Eidel
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Heinz-Peter Schlemmer
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Klaus H Maier-Hein
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Wolfgang Wick
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Martin Bendszus
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Alexander Radbruch
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - David Bonekamp
- From the Department of Neuroradiology (P.K., S.B., O.E., M.B., A.R., D.B.) and Neurology Clinic (A.W., W.W.), University of Heidelberg Medical Center, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany; Department of Medical Image Computing, Medical and Biological Informatics Division (M.G., K.H.M.H.), Department of Radiology (H.P.S., A.R., D.B.), and Clinical Neuro-oncology Cooperation Unit, German Cancer Consortium (DKTK) (W.W.), German Cancer Research Center (DKFZ), Heidelberg, Germany
|
20
|
Zhang Q, Zhao Y, Zhang R, Wei Y, Yi H, Shao F, Chen F. A Comparative Study of Five Association Tests Based on CpG Set for Epigenome-Wide Association Studies. PLoS One 2016; 11:e0156895. [PMID: 27258058 PMCID: PMC4892473 DOI: 10.1371/journal.pone.0156895] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/13/2016] [Accepted: 05/20/2016] [Indexed: 11/19/2022] Open
Abstract
An epigenome-wide association study (EWAS) is a large-scale study of human disease-associated epigenetic variation, specifically variation in DNA methylation. High throughput technologies enable simultaneous epigenetic profiling of DNA methylation at hundreds of thousands of CpGs across the genome. The clustering of correlated DNA methylation at CpGs is reportedly similar to that of linkage-disequilibrium (LD) correlation in genetic single nucleotide polymorphisms (SNP) variation. However, current analysis methods, such as the t-test and rank-sum test, may be underpowered to detect differentially methylated markers. We propose to test the association between the outcome (e.g., case or control) and a set of CpG sites jointly. Here, we compared the performance of five CpG set analysis approaches: principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), sequence kernel association test (SKAT), and sliced inverse regression (SIR) with Hotelling's T² test and t-test using Bonferroni correction. The simulation results revealed that the first six methods can control the type I error at the significance level, while the t-test is conservative. SPCA and SKAT performed better than other approaches when the correlation among CpG sites was strong. For illustration, these methods were also applied to a real methylation dataset.
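The observation that the per-site t-test with Bonferroni correction is conservative follows from how the correction works: each p-value is multiplied by the number of tests, regardless of how correlated the sites are. A minimal Python sketch of the correction is given here for illustration; the function name and the p-values are hypothetical and are not drawn from the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, rejected


# Hypothetical per-CpG p-values for a set of four sites.
adjusted, rejected = bonferroni([0.001, 0.02, 0.04, 0.5])
print(adjusted)
print(rejected)
```

In this toy set, three sites are nominally significant at 0.05 but only one survives the correction. When sites are strongly correlated the effective number of independent tests is smaller than the number of sites, which is one reason set-based tests can retain more power.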
Affiliation(s)
- Qiuyi Zhang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Yang Zhao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Ruyang Zhang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Yongyue Wei
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Honggang Yi
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Fang Shao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Feng Chen
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
|
21
|
Abstract
The use of high-throughput data to study the changing behavior of biological pathways has focused mainly on examining the changes in the means of pathway genes. In this paper, we propose instead to test for changes in the co-regulated and unregulated variability of pathway genes. We assume that the eigenvalues of previously defined pathways capture biologically relevant quantities, and we develop a test for biologically meaningful changes in the eigenvalues between classes. This test reflects important and often ignored aspects of pathway behavior and provides a useful complement to traditional pathway analyses.
Affiliation(s)
- P Danaher
- NanoString Technologies, 530 Fairview Ave. N, Seattle, Washington 98109, U.S.A
| | - D Paul
- Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, U.S.A
| | - P Wang
- Icahn Institute of Genomics and Multiscale Biology, Icahn Medical School at Mount Sinai, 1470 Madison Avenue, S8-102 New York, New York, 10029, U.S.A
|
22
|
A novel hybrid dimension reduction technique for undersized high dimensional gene expression data sets using information complexity criterion for cancer classification. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:370640. [PMID: 25838836 PMCID: PMC4370236 DOI: 10.1155/2015/370640] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/24/2014] [Accepted: 02/18/2015] [Indexed: 11/21/2022]
Abstract
Gene expression data are typically large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although the classical principal component analysis (PCA) method is widely used as a first standard step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings in the case of data sets involving undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach using the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop the celebrated Akaike's information criterion (AIC), consistent Akaike's information criterion (CAIC), and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach with hybridized smoothed covariance matrix estimators, which do not degenerate, to perform PPCA, reduce the dimension, and carry out supervised classification of cancer groups in high dimensions.
|
23
|
Thomas R, Hubbard AE, McHale CM, Zhang L, Rappaport SM, Lan Q, Rothman N, Vermeulen R, Guyton KZ, Jinot J, Sonawane BR, Smith MT. Characterization of changes in gene expression and biochemical pathways at low levels of benzene exposure. PLoS One 2014; 9:e91828. [PMID: 24786086 PMCID: PMC4006721 DOI: 10.1371/journal.pone.0091828] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2013] [Accepted: 02/14/2014] [Indexed: 11/19/2022] Open
Abstract
Benzene, a ubiquitous environmental pollutant, causes acute myeloid leukemia (AML). Recently, through transcriptome profiling of peripheral blood mononuclear cells (PBMC), we reported dose-dependent effects of benzene exposure on gene expression and biochemical pathways in 83 workers exposed across four airborne concentration ranges (from <1 ppm to >10 ppm) compared with 42 subjects with non-workplace ambient exposure levels. Here, we further characterize these dose-dependent effects with continuous benzene exposure in all 125 study subjects. We estimated air benzene exposure levels in the 42 environmentally-exposed subjects from their unmetabolized urinary benzene levels. We used a novel non-parametric, data-adaptive model selection method to estimate the change with dose in the expression of each gene. We describe non-parametric approaches to model pathway responses and used these to estimate the dose responses of the AML pathway and 4 other pathways of interest. The response patterns of the majority of genes, as captured by mean estimates of the first and second principal components of the dose-response for the five pathways, and the profiles of 6 AML pathway response-representative genes (identified by clustering) exhibited similar apparent supra-linear responses. Responses at or below 0.1 ppm benzene were observed for altered expression of AML pathway genes and CYP2E1. Together, these data show that benzene alters disease-relevant pathways and genes in a dose-dependent manner, with effects apparent at doses as low as 100 ppb in air. Studies with extensive exposure assessment of subjects exposed in the low-dose range between 10 ppb and 1 ppm are needed to confirm these findings.
Affiliation(s)
- Reuben Thomas
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
| | - Alan E. Hubbard
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
| | - Cliona M. McHale
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
| | - Luoping Zhang
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
| | - Stephen M. Rappaport
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
| | - Qing Lan
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Nathaniel Rothman
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Roel Vermeulen
- Institute of Risk assessment Sciences, Utrecht University, Utrecht, The Netherlands
| | - Kathryn Z. Guyton
- National Center for Environmental Assessment, Office of Research and Development, US EPA, Washington, DC, United States of America
| | - Jennifer Jinot
- National Center for Environmental Assessment, Office of Research and Development, US EPA, Washington, DC, United States of America
| | - Babasaheb R. Sonawane
- National Center for Environmental Assessment, Office of Research and Development, US EPA, Washington, DC, United States of America
| | - Martyn T. Smith
- Superfund Research Program, School of Public Health, University of California, Berkeley, California, United States of America
|
24
|
Killeen AP, Morris DG, Kenny DA, Mullen MP, Diskin MG, Waters SM. Global gene expression in endometrium of high and low fertility heifers during the mid-luteal phase of the estrous cycle. BMC Genomics 2014; 15:234. [PMID: 24669966 PMCID: PMC3986929 DOI: 10.1186/1471-2164-15-234] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2013] [Accepted: 03/14/2014] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND In both beef and dairy cattle, the majority of early embryo loss occurs within the first 14 days following insemination. During this time period, embryos are completely dependent on their maternal uterine environment for development, growth and ultimately survival; therefore, an optimum uterine environment is critical to their survival. The objective of this study was to investigate whether differences in endometrial gene expression during the mid-luteal phase of the estrous cycle exist between crossbred beef heifers ranked as either high (HF) or low fertility (LF) (following four rounds of artificial insemination (AI)) using the Affymetrix® 23 K Bovine Gene Chip. RESULTS Conception rates for each of the four rounds of AI were within a normal range: 70-73.3%. Microarray analysis of endometrial tissue collected on day 7 of the estrous cycle detected 419 differentially expressed genes (DEG) between HF (n = 6) and LF (n = 6) animals. The main gene pathways affected were cellular growth and proliferation, angiogenesis, lipid metabolism, cellular and tissue morphology and development, inflammation and metabolic exchange. DEG included FST, SLC45A2, MMP19, FADS1 and GALNT6. CONCLUSIONS This study highlights some of the molecular mechanisms potentially controlling uterine endometrial function during the mid-luteal phase of the estrous cycle, which may contribute to uterine endometrial mediated impaired fertility in cattle. Differentially expressed genes are potential candidate genes for the identification of genetic variation influencing cow fertility, which may be incorporated into future breeding programmes.
Affiliation(s)
- Sinéad M Waters
- Teagasc, Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Grange, Dunsany, County Meath, Ireland.
|
25
|
Hira ZM, Trigeorgis G, Gillies DF. An algorithm for finding biologically significant features in microarray data based on a priori manifold learning. PLoS One 2014; 9:e90562. [PMID: 24595155 PMCID: PMC3940899 DOI: 10.1371/journal.pone.0090562] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2013] [Accepted: 02/02/2014] [Indexed: 11/19/2022] Open
Abstract
Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed, the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion-based methods, the prior knowledge from the KEGG databases is not used in, and does not bias, the classification process; it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
Affiliation(s)
- Zena M. Hira
- Department of Computing, Imperial College London, London, United Kingdom
- * E-mail:
| | - George Trigeorgis
- Department of Computing, Imperial College London, London, United Kingdom
| | - Duncan F. Gillies
- Department of Computing, Imperial College London, London, United Kingdom
|
26
|
SNP set association analysis for genome-wide association studies. PLoS One 2013; 8:e62495. [PMID: 23658731 PMCID: PMC3643925 DOI: 10.1371/journal.pone.0062495] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2012] [Accepted: 03/22/2013] [Indexed: 11/29/2022] Open
Abstract
The genome-wide association study (GWAS) is a promising approach for identifying common genetic variants of diseases on the basis of millions of single nucleotide polymorphisms (SNPs). In order to avoid the low power caused by excessive correction for multiple comparisons in single-locus association studies, some methods have been proposed that group SNPs together into an SNP set based on genomic features and then test the joint effect of the SNP set. We compare the performances of principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), and sliced inverse regression (SIR). Simulated SNP sets are generated under models with 0, 1 and ≥2 causal SNPs. Our simulation results show that all of these methods can control the type I error at the nominal significance level. SPCA is always more powerful than the other methods at different settings of linkage disequilibrium structures and minor allele frequency of the simulated datasets. We also apply these four methods to a real GWAS of non-small cell lung cancer (NSCLC) in a Han Chinese population.
|
27
|
Chen X, Ishwaran H. Pathway hunting by random survival forests. Bioinformatics 2013; 29:99-105. [PMID: 23129299 PMCID: PMC3530909 DOI: 10.1093/bioinformatics/bts643] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2012] [Revised: 07/18/2012] [Accepted: 10/17/2012] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION Pathway or gene set analysis has been widely applied to genomic data. Many current pathway testing methods use univariate test statistics calculated from individual genomic markers, which ignores the correlations and interactions between candidate markers. Random forests-based pathway analysis is a promising approach for incorporating complex correlation and interaction patterns, but one limitation of previous approaches is that pathways have been considered separately, thus pathway cross-talk information was not considered. RESULTS In this article, we develop a new pathway hunting algorithm for survival outcomes using random survival forests, which prioritize important pathways by accounting for gene correlation and genomic interactions. We show that the proposed method performs favourably compared with five popular pathway testing methods using both synthetic and real data. We find that the proposed methodology provides an efficient and powerful pathway modelling framework for high-dimensional genomic data. AVAILABILITY The R code for the analysis used in this article is available upon request.
Affiliation(s)
- Xi Chen
- Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
|
28
|
Wang L, Chen X, Zhang B. Statistical Analysis of Patient-Specific Pathway Activities via Mixed Models. ACTA ACUST UNITED AC 2013; Suppl 8:7313. [PMID: 24124644 DOI: 10.4172/2155-6180.s8-001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
In the study of complex diseases, a major challenge is disease heterogeneity, where the dysregulation of different pathways often leads to similar disease phenotypes. As a result, a given pathway could be differentially expressed with respect to controls for some patients, but not for others. Therefore, to develop a successful personalized treatment regime, in addition to identifying disease-relevant pathways for the entire patient group, it is also important to test whether a particular pathway is dysregulated for an individual patient. To this end, we compare the pathway gene expression profile for a particular individual in the patient group to the "norm" (or standard) established by a group of control patients. We studied statistical analysis of patient-specific pathway activities under the mixed models framework. Using a gene expression dataset with realistic correlation patterns, we showed that the proposed hypothesis testing procedure had the expected false positive rate (type I error). In addition, we illustrated the proposed methodology using a Type 2 Diabetes dataset. Our results showed that a pathway previously associated with diabetes was differentially expressed (relative to the control group) in less than 30% of the diabetes patients, which provided an explanation for the moderate group-level statistical significance seen in a previous study. This result also suggested that targeting this particular pathway would likely be beneficial for only 30% of the patients. In addition to the case-control study we have illustrated, this model can be easily extended to handle more complex designs with additional covariates and multiple sources of variation. Moreover, the proposed model operates within a well-established statistical framework and can be implemented in common statistical packages.
Affiliation(s)
- Lily Wang
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
|
29
|
Yu T, Bai Y. Analyzing LC/MS metabolic profiling data in the context of existing metabolic networks. ACTA ACUST UNITED AC 2012; 1:83-91. [PMID: 24010053 DOI: 10.2174/2213235x11301010084] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Metabolic profiling is the unbiased detection and quantification of low molecular-weight metabolites in a living system. It is rapidly developing in biological and translational research, contributing to disease mechanism elucidation, environmental chemical surveillance, biomarker detection, and health outcome prediction. Recent developments in experimental and computational technology allow more and more known metabolites to be detected and quantified from complex samples. As the coverage of the metabolic network improves, it has become feasible to examine metabolic profiling data from a systems perspective, i.e., interpreting the data and performing statistical inference in the context of pathways and genome-scale metabolic networks. Recently a number of methods have been developed in this area, but much improvement in algorithms and databases is still needed. In this review, we survey some methods for the analysis of metabolic profiling data based on metabolic networks.
Affiliation(s)
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA
|
30
|
Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinformatics 2012; 13 Suppl 10:S15. [PMID: 22759420 PMCID: PMC3382436 DOI: 10.1186/1471-2105-13-s10-s15] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Background Analysis of gene expression data in terms of a priori-defined gene sets has recently received significant attention as this approach typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set-level strategy can also be adopted with similar benefits in predictive classification tasks accomplished with machine learning algorithms. Initial studies into the predictive performance of set-level classifiers have yielded rather controversial results. The goal of this study is to provide a more conclusive evaluation by testing various components of the set-level framework within a large collection of machine learning experiments. Results Genuine curated gene sets constitute better features for classification than sets assembled without biological relevance. For identifying the best gene sets for classification, the Global test outperforms the gene-set methods GSEA and SAM-GS as well as two generic feature selection methods. To aggregate expressions of genes into a feature value, the singular value decomposition (SVD) method as well as the SetSig technique improve on simple arithmetic averaging. Set-level classifiers learned with 10 features constituted by the Global test slightly outperform baseline gene-level classifiers learned with all original data features, although they are slightly less accurate than gene-level classifiers learned with a prior feature-selection step. Conclusion Set-level classifiers do not boost predictive accuracy; however, they do achieve competitive accuracy if learned with the right combination of ingredients. Availability Open-source, publicly available software was used for classifier learning and testing. The gene expression datasets and the gene set database used are also publicly available. The full tabulation of experimental results is available at http://ida.felk.cvut.cz/CESLT.
|
31
|
Adaptive elastic-net sparse principal component analysis for pathway association testing. Stat Appl Genet Mol Biol 2011; 10:/j/sagmb.2011.10.issue-1/1544-6115.1697/1544-6115.1697.xml. [PMID: 23089825 DOI: 10.2202/1544-6115.1697] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
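The computational point in the abstract above (latent variables are estimated without outcome information, so a label permutation test only needs them once) can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the latent scores are assumed to have been computed already by some unsupervised step, and the test statistic (absolute difference in group means for a binary outcome) is a stand-in for the regression models the abstract mentions.

```python
import random


def mean_difference(scores, labels):
    """Absolute difference in mean latent score between the two label groups."""
    group1 = [s for s, y in zip(scores, labels) if y == 1]
    group0 = [s for s, y in zip(scores, labels) if y == 0]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def permutation_p_value(scores, labels, n_perm=999, seed=0):
    """Permutation p-value for one pathway's latent scores.

    The scores are fixed: only the labels are shuffled, so the (assumed)
    unsupervised dimension reduction step is never repeated.
    """
    rng = random.Random(seed)
    observed = mean_difference(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_difference(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical latent scores for eight samples (four controls, four cases).
latent = [0.1, 0.2, 0.0, 0.15, 2.1, 2.0, 1.9, 2.2]
status = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = permutation_p_value(latent, status)
```

With a supervised dimension reduction step, by contrast, the latent scores would have to be recomputed inside the loop for every shuffled label vector.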
|
32
|
Misman MF, Mohamad MS, Deris S, Abdullah A, Hashim SZM. An improved hybrid of SVM and SCAD for pathway analysis. Bioinformation 2011; 7:169-75. [PMID: 22102773 PMCID: PMC3218518 DOI: 10.6026/97320630007169] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2011] [Accepted: 10/02/2011] [Indexed: 11/23/2022] Open
Abstract
Pathway analysis has led to a new era in genomic research by providing further biological process information compared to traditional single gene analysis. Besides this advantage, pathway analysis poses some challenges to researchers, one of which is the quality of the pathway data itself. Pathway data are usually defined free of biological context; when it comes to a specific biological context (e.g., lung cancer), typically only several genes within pathways are responsible for the corresponding cellular process. It may also be that some pathways include uninformative genes or that informative genes were excluded. Moreover, many algorithms in pathway analysis neglect these limitations by treating all the genes within pathways as significant. In a previous study, a hybrid of support vector machines and smoothly clipped absolute deviation with group-specific tuning parameters (gSVM-SCAD) was proposed in order to identify and select the informative genes before the pathway evaluation process. However, gSVM-SCAD showed a limitation in terms of classification accuracy. In order to deal with this limitation, we made an enhancement to the tuning parameter method for gSVM-SCAD by applying the B-Type generalized approximate cross validation (BGACV). Experimental analyses using one simulated dataset and two gene expression datasets have shown that the proposed method obtains significant results in identifying biologically significant genes and pathways, and in classification accuracy.
Affiliation(s)
- Muhammad Faiz Misman
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Skudai, Johor Darul Takzim, Malaysia
| | - Mohd Saberi Mohamad
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Skudai, Johor Darul Takzim, Malaysia
| | - Safaai Deris
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Skudai, Johor Darul Takzim, Malaysia
| | - Afnizanfaizal Abdullah
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Skudai, Johor Darul Takzim, Malaysia
| | - Siti Zaiton Mohd Hashim
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Skudai, Johor Darul Takzim, Malaysia
|
33
|
Capturing changes in gene expression dynamics by gene set differential coordination analysis. Genomics 2011; 98:469-77. [PMID: 21971296 DOI: 10.1016/j.ygeno.2011.09.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2011] [Revised: 09/01/2011] [Accepted: 09/16/2011] [Indexed: 12/31/2022]
Abstract
Analyzing gene expression data at the gene set level greatly improves feature extraction and data interpretation. Currently most efforts in gene set analysis are focused on differential expression analysis, that is, finding gene sets whose genes show a first-order relationship with the clinical outcome. However, the regulation of the biological system is complex, and much of the change in gene expression dynamics does not manifest in the form of differential expression. At the gene set level, capturing the change in expression dynamics is difficult due to the complexity and heterogeneity of the gene sets. Here we report a systematic approach to detect gene sets that show differential coordination patterns with the rest of the transcriptome, as well as pairs of gene sets that are differentially coordinated with each other. We demonstrate that the method can identify biologically relevant gene sets, many of which do not show a first-order relationship with the clinical outcome.
|
34
|
Han B, Li L, Chen Y, Zhu L, Dai Q. A two step method to identify clinical outcome relevant genes with microarray data. J Biomed Inform 2011; 44:229-38. [DOI: 10.1016/j.jbi.2010.11.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2010] [Revised: 10/06/2010] [Accepted: 11/29/2010] [Indexed: 12/29/2022]
|
35
|
Long N, Gianola D, Rosa G, Weigel K. Dimension reduction and variable selection for genomic selection: application to predicting milk yield in Holsteins. J Anim Breed Genet 2011; 128:247-57. [DOI: 10.1111/j.1439-0388.2011.00917.x] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
|
36
|
Chen X, Wang L, Hu B, Guo M, Barnard J, Zhu X. Pathway-based analysis for genome-wide association studies using supervised principal components. Genet Epidemiol 2011; 34:716-24. [PMID: 20842628 DOI: 10.1002/gepi.20532] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Abstract
Many complex diseases are influenced by genetic variations in multiple genes, each with only a small marginal effect on disease susceptibility. Pathway analysis, which identifies biological pathways associated with disease outcome, has become increasingly popular for genome-wide association studies (GWAS). In addition to combining weak signals from a number of SNPs in the same pathway, results from pathway analysis also shed light on the biological processes underlying disease. We propose a new pathway-based analysis method for GWAS, the supervised principal component analysis (SPCA) model. In the proposed SPCA model, a selected subset of SNPs most associated with disease outcome is used to estimate the latent variable for a pathway. The estimated latent variable for each pathway is an optimal linear combination of a selected subset of SNPs; therefore, the proposed SPCA model provides the ability to borrow strength across the SNPs in a pathway. In addition to identifying pathways associated with disease outcome, SPCA also carries out additional within-category selection to identify the most important SNPs within each gene set. The proposed model operates in a well-established statistical framework and can handle design information such as covariate adjustment and matching information in GWAS. We compare the proposed method with currently available methods using data with realistic linkage disequilibrium structures, and we illustrate the SPCA method using the Wellcome Trust Case-Control Consortium Crohn Disease (CD) data set.
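The core idea described above (estimate a pathway's latent variable from only the SNPs most associated with the outcome) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published SPCA model: the association score is taken to be the absolute Pearson correlation, the number of SNPs to keep is a fixed hypothetical parameter `keep`, the first principal component is found by plain power iteration, and the toy genotypes are invented for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def first_pc(columns, iters=500, seed=0):
    """Unit loading vector and sample scores of the first principal component."""
    n, p = len(columns[0]), len(columns)
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    cov = [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
           for ci in centered]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):  # power iteration on the covariance matrix
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(v[j] * centered[j][i] for j in range(p)) for i in range(n)]
    return v, scores


def supervised_pc(columns, outcome, keep):
    """Keep the `keep` SNPs most correlated with the outcome, then take their first PC."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs(pearson(columns[j], outcome)),
                    reverse=True)
    selected = sorted(ranked[:keep])
    loadings, scores = first_pc([columns[j] for j in selected])
    return selected, loadings, scores


# Toy pathway: four SNPs (coded 0/1/2) measured on eight subjects.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
snps = [
    [0, 0, 1, 0, 2, 1, 2, 2],
    [0, 1, 0, 0, 1, 2, 2, 1],
    [1, 0, 2, 1, 1, 0, 2, 1],
    [2, 1, 0, 1, 0, 1, 2, 1],
]
selected, loadings, latent = supervised_pc(snps, outcome, keep=2)
```

Because the SNPs are selected using the outcome, the association between `latent` and `outcome` is optimistically biased, so a naive test on the resulting score would not be valid without accounting for the selection step.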
Affiliation(s)
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA
|
37
|
Robotti E, Demartini M, Gosetti F, Calabrese G, Marengo E. Development of a classification and ranking method for the identification of possible biomarkers in two-dimensional gel-electrophoresis based on principal component analysis and variable selection procedures. MOLECULAR BIOSYSTEMS 2011; 7:677-86. [PMID: 21286649 DOI: 10.1039/c0mb00124d] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
The identification of biomarkers is one of the leading research areas in proteomics. When biomarkers have to be searched for in spot volume datasets produced by 2D gel-electrophoresis, problems may arise related to the large number of spots present in each map and the small number of samples available in each class (control/pathological). In such cases multivariate methods are usually exploited together with variable selection procedures, to provide a set of possible biomarkers: they are however usually aimed to the selection of the smallest set of variables (spots) providing the best performances in prediction. This approach seems not to be suitable for the identification of potential biomarkers since in this case all the possible candidate biomarkers have to be identified to provide a general picture of the "pathological state": in this case exhaustivity has to be preferred to provide a complete understanding of the mechanisms underlying the pathology. We propose here a ranking and classification method, "Ranking-PCA", based on Principal Component Analysis and variable selection in forward search: the method selects one variable at a time as the one providing the best separation of the two classes investigated in the space given by the relevant PCs. The method was applied to an artificial dataset and a real case-study: Ranking-PCA exhaustively identified the potential biomarkers and provided reliable and robust results.
Affiliation(s)
- Elisa Robotti
- Department of Environmental and Life Sciences, University of Eastern Piedmont, Viale T. Michel 11, 15121 Alessandria, Italy
38
Ma S, Dai Y. Principal component analysis based methods in bioinformatics studies. Brief Bioinform 2011; 12:714-22. [PMID: 21242203 DOI: 10.1093/bib/bbq090] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Indexed: 11/12/2022] Open
Abstract
In the analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use genomic studies with gene expression measurements as a representative example, but note that the analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent 'non-standard' applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including the supervised PCA, sparse PCA and functional PCA. The supervised PCA and sparse PCA have been shown to have better empirical performance than the standard PCA. The functional PCA can analyze time-course gene expression data. Last, we raise awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and, more importantly, its most recent developments, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
Affiliation(s)
- Shuangge Ma
- 60 College ST, LEPH 209, School of Public Health, Yale University, New Haven, CT 06520, USA.
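The standard PCA construction summarized in the Ma and Dai abstract above (orthogonal linear combinations of gene expressions that explain variation in fewer dimensions) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy, a synthetic samples-by-genes matrix, and a hypothetical helper name `pca`.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the column-centered matrix X (samples x genes).

    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]                 # rows are orthonormal gene weights
    scores = Xc @ loadings.T                     # PCs: linear combinations of genes
    explained = (s ** 2) / np.sum(s ** 2)        # fraction of variance per PC
    return scores, loadings, explained[:n_components]

# Synthetic data (an assumption): 20 samples, 100 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
scores, loadings, ratio = pca(X, 3)
```

Here `scores` has one row per sample and one column per retained PC, so 100 gene measurements are reduced to 3 features per sample.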
39
Ma S, Kosorok MR, Huang J, Dai Y. Incorporating higher-order representative features improves prediction in network-based cancer prognosis analysis. BMC Med Genomics 2011; 4:5. [PMID: 21226928 PMCID: PMC3037289 DOI: 10.1186/1755-8794-4-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 09/07/2010] [Accepted: 01/12/2011] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND: In cancer prognosis studies with gene expression measurements, an important goal is to construct gene signatures with predictive power. In this study, we describe the coordination among genes using the weighted coexpression network, where nodes represent genes and nodes are connected if the corresponding genes have similar expression patterns across samples. There are subsets of nodes, called modules, that are tightly connected to each other. In several published studies, it has been suggested that the first principal components of individual modules, also referred to as "eigengenes", may sufficiently represent the corresponding modules. RESULTS: In this article, we refer to principal components and their functions as "representative features". We investigate higher-order representative features, which include the principal components other than the first ones and second-order terms (quadratics and interactions). Two gradient thresholding methods are adopted for regularized estimation and feature selection. Analysis of six prognosis studies on lymphoma and breast cancer shows that incorporating higher-order representative features improves prediction performance over using eigengenes only. A simulation study further shows that prediction performance can be less satisfactory if the representative feature set is not properly chosen. CONCLUSIONS: This study introduces multiple ways of defining the representative features and effective thresholding regularized estimation approaches. It provides convincing evidence that the higher-order representative features may have important implications for the prediction of cancer prognosis.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT, USA
- Michael R Kosorok
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jian Huang
- Departments of Statistics and Actuarial Science, and Biostatistics, University of Iowa, Iowa City, IA, USA
- Ying Dai
- School of Public Health, Yale University, New Haven, CT, USA
40
Chen X, Wang L, Ishwaran H. An Integrative Pathway-based Clinical-genomic Model for Cancer Survival Prediction. Stat Probab Lett 2010; 80:1313-1319. [PMID: 21731150 DOI: 10.1016/j.spl.2010.04.011] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 01/19/2023]
Abstract
Prediction models that use gene expression levels are now being proposed for personalized treatment of cancer, but building accurate models that are easy to interpret remains a challenge. In this paper, we describe an integrative clinical-genomic approach that combines both genomic pathway and clinical information. First, we summarize information from genes in each pathway using Supervised Principal Components (SPCA) to obtain pathway-based genomic predictors. Next, we build a prediction model based on clinical variables and pathway-based genomic predictors using Random Survival Forests (RSF). Our rationale for this two-stage procedure is that the underlying disease process may be influenced by environmental exposure (measured by clinical variables) and perturbations in different pathways (measured by pathway-based genomic variables), as well as their interactions. Using two cancer microarray datasets, we show that the pathway-based clinical-genomic model outperforms gene-based clinical-genomic models, with improved prediction accuracy and interpretability.
Affiliation(s)
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
41
Montgomery SB, Dermitzakis ET. The resolution of the genetics of gene expression. Hum Mol Genet 2009; 18:R211-5. [PMID: 19808798 DOI: 10.1093/hmg/ddp400] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 12/25/2022] Open
Abstract
Understanding the influence of genetics on the molecular mechanisms underpinning human phenotypic diversity is fundamental to being able to predict health outcomes and treat disease. To interrogate the role of genetics on cellular state and function, gene expression has been extensively used. Past and present studies have highlighted important patterns of heritability, population differentiation and tissue-specificity in gene expression. Current and future studies are taking advantage of systems biology-based approaches and advances in sequencing technology: new methodology aims to translate regulatory networks to enrich pathways responsible for disease etiology and 2nd generation sequencing now offers single-molecular resolution of the transcriptome providing unprecedented information on the structural and genetic characteristics of gene expression. Such advances are leading to a future where rich cellular phenotypes will facilitate understanding of the transmission of genetic effect from the gene to organism.
Affiliation(s)
- Stephen B Montgomery
- Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva CH-1211, Switzerland
42
A unified mixed effects model for gene set analysis of time course microarray experiments. Stat Appl Genet Mol Biol 2009; 8:Article 47. [PMID: 19954419 DOI: 10.2202/1544-6115.1484] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 11/18/2022]
Abstract
Methods for gene set analysis test for coordinated changes of a group of genes involved in the same biological process or molecular pathway. Higher statistical power is gained for gene set analysis by combining weak signals from a number of individual genes in each group. Although many gene set analysis methods have been proposed for microarray experiments with two groups, few can be applied to time course experiments. We propose a unified statistical model for analyzing time course experiments at the gene set level using random coefficient models, which fall into the more general class of mixed effects models. These models include a systematic component that models the mean trajectory for the group of genes, and a random component (the random coefficients) that models how each gene's trajectory varies about the mean trajectory. We show that the proposed model (1) outperforms currently available methods at discriminating gene sets differentially changed over time from null gene sets; (2) provides more stable results that are less affected by sampling variations; (3) models dependency among genes adequately and preserves type I error rate; and (4) allows for gene ranking based on predicted values of the random effects. We describe simulation studies using gene expression data with "real life" correlations and we demonstrate the proposed random coefficient model using a mouse colon development time course dataset. The agreement between results of the proposed random coefficient model and the previous reports for this proof-of-concept trial further validates this methodology, which provides a unified statistical model for systems analysis of microarray experiments with complex experimental designs when re-sampling based methods are difficult to apply.
43
Lee YL, Xu X, Wallenstein S, Chen J. Gene expression profiles of the one-carbon metabolism pathway. J Genet Genomics 2009; 36:277-82. [PMID: 19447375 DOI: 10.1016/s1673-8527(08)60115-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 11/03/2008] [Revised: 12/24/2008] [Accepted: 01/20/2009] [Indexed: 10/20/2022]
Abstract
One-carbon metabolism plays a critical role in both DNA methylation and DNA synthesis. Accumulating evidence has shown that interruptions of this pathway are associated with many disease outcomes, including cardiovascular diseases and cancers. Mechanistic studies have been performed on genetic polymorphisms involved in one-carbon metabolism. However, the expression profiles of these inter-related genes are not well known. In this study, we examined the gene expression profiles of 11 one-carbon metabolizing genes by quantifying mRNA levels in lymphocytes from 54 healthy individuals and explored the correlations among these genes. We found that these genes were expressed in lymphocytes at moderate levels and showed significant inter-person variation. We also applied principal component analysis to explore potential patterns of expression. The components identified by the program agreed with existing knowledge about one-carbon metabolism. This study helps us better understand the biological functions of one-carbon metabolism.
Affiliation(s)
- Yin Leng Lee
- Department of Community and Preventive Medicine, Mount Sinai School of Medicine, New York 10029, USA
44
Nueda MJ, Sebastián P, Tarazona S, García-García F, Dopazo J, Ferrer A, Conesa A. Functional assessment of time course microarray data. BMC Bioinformatics 2009; 10 Suppl 6:S9. [PMID: 19534758 PMCID: PMC2697656 DOI: 10.1186/1471-2105-10-s6-s9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 11/27/2022] Open
Abstract
Motivation: Time-course microarray experiments study the progress of gene expression over time across one or several experimental conditions. Most developed analysis methods focus on the clustering or the differential expression analysis of genes and do not integrate functional information. The assessment of the functional aspects of time-course transcriptomics data requires approaches that exploit the activation dynamics of the functional categories to which genes are annotated. Methods: We present three novel methodologies for the functional assessment of time-course microarray data. i) maSigFun derives from the maSigPro method, a regression-based strategy to model time-dependent expression patterns and identify genes with differences across series. maSigFun fits a regression model for groups of genes labeled by a functional class and selects those categories which have a significant model. ii) PCA-maSigFun fits a PCA model of each functional class-defined expression matrix to extract orthogonal patterns of expression change, which are then assessed for their fit to a time-dependent regression model. iii) ASCA-functional uses the ASCA model to rank genes according to their correlation to principal time expression patterns and assesses functional enrichment in a GSA fashion. We used simulated and experimental datasets to study these novel approaches. Results were compared to alternative methodologies. Results: Synthetic and experimental data showed that the different methods are able to capture different aspects of the relationship between genes, functions and co-expression that are biologically meaningful. The methods should not be considered as competing; rather, they provide different insights into the molecular and functional dynamic events taking place within the biological system under study.
Affiliation(s)
- María José Nueda
- Department of Statistics and Operation Research, University of Alicante, Ctra, San Vicente del Raspeig, S/N 03690 Alicante, Spain.
45
Chen X, Wang L. Integrating biological knowledge with gene expression profiles for survival prediction of cancer. J Comput Biol 2009; 16:265-78. [PMID: 19183004 DOI: 10.1089/cmb.2008.12tt] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 02/05/2023] Open
Abstract
Due to the large variability in survival times between cancer patients and the plethora of genes on microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in expression dataset with gene annotation databases such as Gene Ontology (GO). Then we construct "supergenes" for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Finally, instead of using genes as predictors, we use these supergenes representing information from each gene category as predictors to predict survival outcome. In addition to identifying gene categories associated with outcome, the proposed approach also carries out additional within-category selection to select important genes within each gene set. We show, using two real breast cancer microarray datasets, that the prediction models constructed based on gene sets (or pathway) information outperform the prediction models based on expression values of single genes, with improved prediction accuracy and interpretability.
Affiliation(s)
- Xi Chen
- Department of Quantitative Health Sciences, The Cleveland Clinic, Cleveland, OH 44195, USA.
46
Wu Z, Zhao X, Chen L. Identifying responsive functional modules from protein-protein interaction network. Mol Cells 2009; 27:271-7. [PMID: 19326072 DOI: 10.1007/s10059-009-0035-x] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 01/23/2009] [Accepted: 01/26/2009] [Indexed: 10/21/2022] Open
Abstract
Proteins interact with each other within a cell, and those interactions give rise to the biological function and dynamical behavior of cellular systems. Generally, protein interactions are temporal, spatial, or condition dependent in a specific cell, where only a small part of the interactions usually take place under certain conditions. Although a large amount of protein interaction data has recently been collected by high-throughput technologies, the interactions are recorded or summarized under various conditions and therefore cannot be directly used to identify signaling pathways or active networks, which are believed to work in specific cells under specific conditions. However, protein interactions activated under specific conditions may give hints about the biological process underlying the corresponding phenotypes. In particular, responsive functional modules consisting of protein interactions activated under specific conditions can provide insight into the mechanisms underlying biological systems; e.g., protein interaction subnetworks found for certain diseases rather than normal conditions may help to discover potential biomarkers. From a computational viewpoint, identifying responsive functional modules can be formulated as an optimization problem. Efficient computational methods for extracting responsive functional modules are therefore in strong demand, owing to the NP-hard nature of such a combinatorial problem. In this review, we first report recent advances in the development of computational methods for extracting responsive functional modules or active pathways from protein interaction networks and microarray data. Then, from a computational perspective, we discuss remaining obstacles and perspectives for this attractive and challenging topic in the area of systems biology.
Affiliation(s)
- Zikai Wu
- Institute of Systems Biology, Shanghai University, Shanghai 200444, China
47
Ma S, Kosorok MR. Identification of differential gene pathways with principal component analysis. Bioinformatics 2009; 25:882-9. [PMID: 19223452 DOI: 10.1093/bioinformatics/btp085] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.9] [Indexed: 11/15/2022]
Abstract
MOTIVATION Development of high-throughput technology makes it possible to measure expressions of thousands of genes simultaneously. Genes have the inherent pathway structure, where pathways are composed of multiple genes with coordinated biological functions. It is of great interest to identify differential gene pathways that are associated with the variations of phenotypes. RESULTS We propose the following approach for detecting differential gene pathways. First, we construct gene pathways using databases such as KEGG or GO. Second, for each pathway, we extract a small number of representative features, which are linear combinations of gene expressions and/or their transformations. Specifically, we propose using (i) principal components (PCs) of gene expression sets, (ii) PCs of expanded gene expression sets and (iii) expanded sets of PCs of gene expressions, as the representative features. Third, we identify differential gene pathways as those with representative features significantly associated with the variations of phenotypes, particularly disease clinical outcomes, in regression models. The false discovery rate approach is used to adjust for multiple comparisons. Analysis of three gene expression datasets suggests that (i) the proposed approach can effectively identify differential gene pathways; (ii) PCs that explain only a small amount of variations of gene expressions may bear significant associations between gene pathways and phenotypes; (iii) including second-order terms of gene expressions may lead to identification of new differential gene pathways; (iv) the proposed approach is relatively insensitive to additional noises; and (v) the proposed approach can identify gene pathways missed by alternative approaches. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shuangge Ma
- Department of Epidemiology and Public Health, Yale University, New Haven, CT 06510, USA.
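The last step of the Ma and Kosorok approach above adjusts per-pathway tests for multiple comparisons with a false discovery rate approach. The abstract does not say which procedure is used, so the sketch below assumes the common Benjamini-Hochberg adjustment. The pathway names, p-values, and the 0.05 threshold are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-pathway p-values, e.g. from regressing a phenotype
# on each pathway's representative features (PCs).
pathway_pvalues = {
    "pathway_A": 0.001,
    "pathway_B": 0.04,
    "pathway_C": 0.03,
    "pathway_D": 0.6,
}
names = list(pathway_pvalues)
qvalues = benjamini_hochberg([pathway_pvalues[n] for n in names])
differential = [n for n, q in zip(names, qvalues) if q <= 0.05]
```

With these illustrative inputs only `pathway_A` stays below the 0.05 threshold after adjustment, although three of the four raw p-values are below 0.05.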