1
Mubeen S, Tom Kodamullil A, Hofmann-Apitius M, Domingo-Fernández D. On the influence of several factors on pathway enrichment analysis. Brief Bioinform 2022; 23:bbac143. [PMID: 35453140 PMCID: PMC9116215 DOI: 10.1093/bib/bbac143] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/17/2022] [Revised: 03/21/2022] [Accepted: 03/30/2022] [Indexed: 02/01/2023]
Abstract
Pathway enrichment analysis has become a widely used knowledge-based approach for the interpretation of biomedical data. Its popularity has led to an explosion of both enrichment methods and pathway databases. While the elegance of pathway enrichment lies in its simplicity, multiple factors can impact the results of such an analysis, which may not be accounted for. Researchers may fail to give influential aspects their due, resorting instead to popular methods and gene set collections, or default settings. Despite ongoing efforts to establish set guidelines, meaningful results are still hampered by a lack of consensus or gold standards around how enrichment analysis should be conducted. Nonetheless, such concerns have prompted a series of benchmark studies specifically focused on evaluating the influence of various factors on pathway enrichment results. In this review, we organize and summarize the findings of these benchmarks to provide a comprehensive overview on the influence of these factors. Our work covers a broad spectrum of factors, spanning from methodological assumptions to those related to prior biological knowledge, such as pathway definitions and database choice. In doing so, we aim to shed light on how these aspects can lead to insignificant, uninteresting or even contradictory results. Finally, we conclude the review by proposing future benchmarks as well as solutions to overcome some of the challenges, which originate from the outlined factors.
Affiliation(s)
- Sarah Mubeen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, 53115 Bonn, Germany
- Fraunhofer Center for Machine Learning, Germany
- Alpha Tom Kodamullil
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, 53115 Bonn, Germany
- Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Fraunhofer Center for Machine Learning, Germany
- Enveda Biosciences, Boulder, CO, 80301, USA
2
Madsen RR, Erickson EC, Rueda OM, Robin X, Caldas C, Toker A, Semple RK, Vanhaesebroeck B. Positive correlation between transcriptomic stemness and PI3K/AKT/mTOR signaling scores in breast cancer, and a counterintuitive relationship with PIK3CA genotype. PLoS Genet 2021; 17:e1009876. [PMID: 34762647 PMCID: PMC8584750 DOI: 10.1371/journal.pgen.1009876] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/08/2021] [Accepted: 10/13/2021] [Indexed: 12/13/2022]
Abstract
A PI3Kα-selective inhibitor has recently been approved for use in breast tumors harboring mutations in PIK3CA, the gene encoding p110α. Preclinical studies have suggested that the PI3K/AKT/mTOR signaling pathway influences stemness, a dedifferentiation-related cellular phenotype associated with aggressive cancer. However, to date, no direct evidence for such a correlation has been demonstrated in human tumors. In two independent human breast cancer cohorts, encompassing nearly 3,000 tumor samples, transcriptional footprint-based analysis uncovered a positive linear association between transcriptionally-inferred PI3K/AKT/mTOR signaling scores and stemness scores. Unexpectedly, stratification of tumors according to PIK3CA genotype revealed a "biphasic" relationship of mutant PIK3CA allele dosage with these scores. Relative to tumor samples without PIK3CA mutations, the presence of a single copy of a hotspot PIK3CA variant was associated with lower PI3K/AKT/mTOR signaling and stemness scores, whereas the presence of multiple copies of PIK3CA hotspot mutations correlated with higher PI3K/AKT/mTOR signaling and stemness scores. This observation was recapitulated in a human cell model of heterozygous and homozygous PIK3CA H1047R expression. Collectively, our analysis (1) provides evidence for a signaling strength-dependent PI3K-stemness relationship in human breast cancer; (2) supports evaluation of the potential benefit of patient stratification based on a combination of conventional PI3K pathway genetic information with transcriptomic indices of PI3K signaling activation.
Affiliation(s)
- Ralitsa R. Madsen
- University College London Cancer Institute, Paul O’Gorman Building, University College London, London, United Kingdom
- Emily C. Erickson
- Department of Pathology, Medicine and Cancer Center, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States of America
- Oscar M. Rueda
- Cancer Research UK Cambridge Institute and Department of Oncology, Li Ka Shing Centre, University of Cambridge, Cambridge, United Kingdom
- Cambridge Breast Unit, Addenbrooke’s Hospital, Cambridge University Hospital NHS Foundation Trust, Cambridge, United Kingdom
- NIHR Cambridge Biomedical Research Centre and Cambridge Experimental Cancer Medicine Centre, Cambridge University Hospital NHS Foundation Trust, Cambridge, United Kingdom
- Xavier Robin
- SIB Swiss Institute of Bioinformatics, Biozentrum, University of Basel, Basel, Switzerland
- Carlos Caldas
- Cancer Research UK Cambridge Institute and Department of Oncology, Li Ka Shing Centre, University of Cambridge, Cambridge, United Kingdom
- Cambridge Breast Unit, Addenbrooke’s Hospital, Cambridge University Hospital NHS Foundation Trust, Cambridge, United Kingdom
- NIHR Cambridge Biomedical Research Centre and Cambridge Experimental Cancer Medicine Centre, Cambridge University Hospital NHS Foundation Trust, Cambridge, United Kingdom
- Alex Toker
- Department of Pathology, Medicine and Cancer Center, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States of America
- Robert K. Semple
- Centre for Cardiovascular Science, Queen’s Medical Research Institute, University of Edinburgh, Edinburgh, United Kingdom
- Bart Vanhaesebroeck
- University College London Cancer Institute, Paul O’Gorman Building, University College London, London, United Kingdom
3
Emmert-Streib F. Grand Challenges for Artificial Intelligence in Molecular Medicine. Frontiers in Molecular Medicine 2021; 1:734659. [PMID: 39087080 PMCID: PMC11285658 DOI: 10.3389/fmmed.2021.734659] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2021] [Accepted: 07/08/2021] [Indexed: 08/02/2024]
Affiliation(s)
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Institute of Biosciences and Medical Technology, Tampere, Finland
4
Maleki F, Ovens K, Hogan DJ, Kusalik AJ. Gene Set Analysis: Challenges, Opportunities, and Future Research. Front Genet 2020; 11:654. [PMID: 32695141 PMCID: PMC7339292 DOI: 10.3389/fgene.2020.00654] [Citation(s) in RCA: 106] [Impact Index Per Article: 21.2] [Received: 02/01/2020] [Accepted: 05/29/2020] [Indexed: 12/14/2022]
Abstract
Gene set analysis methods are widely used to provide insight into high-throughput gene expression data. There are many gene set analysis methods available. These methods rely on various assumptions and have different requirements, strengths and weaknesses. In this paper, we classify gene set analysis methods based on their components, describe the underlying requirements and assumptions for each class, and provide directions for future research in developing and evaluating gene set analysis methods.
5
Emmert-Streib F, Dehmer M, Yli-Harja O. Ensuring Quality Standards and Reproducible Research for Data Analysis Services in Oncology: A Cooperative Service Model. Front Cell Dev Biol 2020; 7:349. [PMID: 31921859 PMCID: PMC6929679 DOI: 10.3389/fcell.2019.00349] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/21/2019] [Accepted: 12/04/2019] [Indexed: 11/13/2022]
Abstract
Modern molecular high-throughput devices, e.g., next-generation sequencing, have transformed medical research. Resulting data sets are usually high-dimensional on a genomic-scale providing multi-factorial information from intertwined molecular and cellular activities of genes and their products. This genomics-revolution installed precision medicine offering breathtaking opportunities for patient's diagnosis and treatment. However, due to the speed of these developments the quality standards of the involved data analyses are lagging behind, as exemplified by the infamous Duke Saga. In this paper, we argue in favor of a two-stage cooperative service model that couples data generation and data analysis in the most beneficial way from the perspective of a patient to ensure data analysis quality standards including reproducible research.
Affiliation(s)
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Institute of Biosciences and Medical Technology, Tampere, Finland
- Matthias Dehmer
- Steyr School of Management, University of Applied Sciences Upper Austria, Steyr, Austria
- Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, China
- Olli Yli-Harja
- Institute of Biosciences and Medical Technology, Tampere, Finland
- Institute for Systems Biology, Seattle, WA, United States
6
Glazko G, Zybailov B, Emmert-Streib F, Baranova A, Rahmatallah Y. Proteome-transcriptome alignment of molecular portraits achieved by self-contained gene set analysis: Consensus colon cancer subtypes case study. PLoS One 2019; 14:e0221444. [PMID: 31437237 PMCID: PMC6705791 DOI: 10.1371/journal.pone.0221444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/01/2019] [Accepted: 08/06/2019] [Indexed: 01/10/2023]
Abstract
Gene set analysis (GSA) has become the common methodology for analyzing transcriptomics data. However, self-contained GSA techniques are rarely, if ever, used for proteomics data analysis. Here we present a self-contained proteome level GSA of four consensus molecular subtypes (CMSs) previously established by transcriptome dissection of colon carcinoma specimens. Despite notable difference in structure of proteomics and transcriptomics data, many pathway-wide characteristic features of CMSs found at the mRNA level were reproduced at the protein level. In particular, CMS1 features show heavy involvement of immune system as well as the pathways related to mismatch repair, DNA replication and functioning of proteasome, while CMS4 tumors upregulate complement pathway and proteins participating in epithelial-to-mesenchymal transition (EMT). In addition, protein level GSA yielded a set of novel observations visible at the proteome, but not at the transcriptome level, including possible involvement of major histocompatibility complex II (MHC-II) antigens in the known immunogenicity of CMS1 and a connection between cholesterol trafficking and the regulation of Integrin-linked kinase (ILK) in CMS3. Overall, this study proves utility of self-contained GSA approaches as a critical tool for analyzing proteomics data in general and dissecting protein-level molecular portraits of human tumors in particular.
Affiliation(s)
- Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States of America
- Boris Zybailov
- Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR, United States of America
- Frank Emmert-Streib
- Computational Medicine and Statistical Learning Laboratory, Tampere University of Technology, Korkeakoulunkatu, Tampere, Finland FI
- Ancha Baranova
- School of Systems Biology, George Mason University, Manassas VA, United States of America
- Research Center for Medical Genetics, Moscow, Russia
- Yasir Rahmatallah
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States of America
7
Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference. Machine Learning and Knowledge Extraction 2019. [DOI: 10.3390/make1030054] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/17/2022]
Abstract
Statistical hypothesis testing is among the most misunderstood quantitative analysis methods from data science. Despite its seeming simplicity, it has complex interdependencies between its procedural components. In this paper, we discuss the underlying logic behind statistical hypothesis testing, the formal meaning of its components and their connections. Our presentation is applicable to all statistical hypothesis tests as generic backbone and, hence, useful across all application domains in data science and artificial intelligence.
8
Ebrahimpoor M, Spitali P, Hettne K, Tsonaka R, Goeman J. Simultaneous Enrichment Analysis of all Possible Gene-sets: Unifying Self-Contained and Competitive Methods. Brief Bioinform 2019; 21:1302-1312. [PMID: 31297505 PMCID: PMC7373179 DOI: 10.1093/bib/bbz074] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/08/2019] [Revised: 05/28/2019] [Accepted: 05/28/2019] [Indexed: 01/23/2023]
Abstract
Studying sets of genomic features is increasingly popular in genomics, proteomics and metabolomics since analyzing at set level not only creates a natural connection to biological knowledge but also offers more statistical power. Currently, there are two gene-set testing approaches, self-contained and competitive, both of which have their advantages and disadvantages, but neither offers the final solution. We introduce simultaneous enrichment analysis (SEA), a new approach for analysis of feature sets in genomics and other omics based on a new unified null hypothesis, which includes the self-contained and competitive null hypotheses as special cases. We employ closed testing using Simes tests to test this new hypothesis. For every feature set, the proportion of active features is estimated, and a confidence bound is provided. Also, for every unified null hypothesis, a P-value is calculated, which is adjusted for family-wise error rate. SEA does not need to assume that the features are independent. Moreover, users are allowed to choose the feature set(s) of interest after observing the data. We develop a novel pipeline and apply it on RNA-seq data of dystrophin-deficient mdx mice, showcasing the flexibility of the method. Finally, the power properties of the method are evaluated through simulation studies.
Affiliation(s)
- Mitra Ebrahimpoor
- Medical statistics, Department of Biomedical Data Science, Leiden University Medical Center, Leiden, The Netherlands
- Pietro Spitali
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Kristina Hettne
- Medical statistics, Department of Biomedical Data Science, Leiden University Medical Center, Leiden, The Netherlands
- Roula Tsonaka
- Medical statistics, Department of Biomedical Data Science, Leiden University Medical Center, Leiden, The Netherlands
- Jelle Goeman
- Medical statistics, Department of Biomedical Data Science, Leiden University Medical Center, Leiden, The Netherlands
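Note on the Simes test named in the abstract of Ebrahimpoor et al.: the sketch below illustrates only the basic Simes combination of feature-level p-values for one feature set. It is not the SEA pipeline and does not implement closed testing or the family-wise error rate adjustment. The function name `simes_p_value` and the toy p-values are assumptions made for illustration.

```python
def simes_p_value(p_values):
    """Simes combined p-value for the joint null of all given hypotheses."""
    if not p_values:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    n = len(ordered)
    # Simes: minimum over ranks i of n * p_(i) / i,
    # where p_(i) is the i-th smallest p-value.
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical per-gene p-values for one gene set (illustrative only).
gene_p_values = [0.01, 0.04, 0.03]
print(simes_p_value(gene_p_values))
```

A small combined value suggests that at least one feature in the set is active. Estimating how many features are active, as the abstract describes, needs the full closed testing procedure.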
9
Zhou W, Altman RB. Data-driven human transcriptomic modules determined by independent component analysis. BMC Bioinformatics 2018; 19:327. [PMID: 30223787 PMCID: PMC6142401 DOI: 10.1186/s12859-018-2338-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 08/28/2017] [Accepted: 08/28/2018] [Indexed: 12/20/2022]
Abstract
Background Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data is inherently noisy and high-dimensional. Gene set analysis is currently widely used to alleviate the issue of high dimensionality, but the user-defined choice of gene sets can introduce bias in results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification, and also their robustness in dealing with small training sets, (2) their regularization of data when clustering samples and (3) the biological relevancy of differentially expressed features. Results We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data is clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
Conclusions The 139 FCs provide biologically-relevant summarization of transcriptomic data, and their performance in low sample settings suggests that they should be employed in such studies in order to harness the data efficiently.
Affiliation(s)
- Weizhuang Zhou
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
- Russ B Altman
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
10
Giummarra L, Crewther SG, Riddell N, Murphy MJ, Crewther DP. Pathway analysis identifies altered mitochondrial metabolism, neurotransmission, structural pathways and complement cascade in retina/RPE/choroid in chick model of form-deprivation myopia. PeerJ 2018; 6:e5048. [PMID: 29967729 PMCID: PMC6026464 DOI: 10.7717/peerj.5048] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 02/19/2018] [Accepted: 05/31/2018] [Indexed: 12/15/2022]
Abstract
Purpose RNA sequencing analysis has demonstrated bidirectional changes in metabolism, structural and immune pathways during early induction of defocus induced myopia. Thus, the aim of this study was to investigate whether similar gene pathways are also related to the more excessive axial growth, ultrastructural and elemental microanalytic changes seen during the induction and recovery from form-deprivation myopia (FDM) in chicks and predicted by the RIDE model of myopia. Methods Archived genomic transcriptome data from the first three days of induction of monocularly occluded form deprived myopia (FDMI) in chicks was obtained from the GEO database (accession # GSE6543) while data from chicks monocularly occluded for 10 days and then given up to 24 h of normal visual recovery (FDMR) were collected. Gene set enrichment analysis (GSEA) software was used to determine enriched pathways during the induction (FDMI) and recovery (FDMR) from FD. Curated gene-sets were obtained from open access sources. Results Clusters of significant changes in mitochondrial energy metabolism, neurotransmission, ion channel transport, G protein coupled receptor signalling, complement cascades and neuron structure and growth were identified during the 10 days of induction of profound myopia and were found to correlate well with change in axial dimensions. Bile acid and bile salt metabolism pathways (cholesterol/lipid metabolism and sodium channel activation) were significantly upregulated during the first 24 h of recovery from 10 days of FDM. Conclusions The gene pathways altered during induction of FDM are similar to those reported in defocus induced myopia and are established indicators of oxidative stress, osmoregulatory and associated structural changes. These findings are also consistent with the choroidal thinning, axial elongation and hyperosmotic ion distribution patterns across the retina and choroid previously reported in FDM and predicted by RIDE.
Affiliation(s)
- Loretta Giummarra
- School of Psychology & Public Health, La Trobe University, Melbourne, Victoria, Australia
- Sheila G Crewther
- School of Psychology & Public Health, La Trobe University, Melbourne, Victoria, Australia
- Nina Riddell
- School of Psychology & Public Health, La Trobe University, Melbourne, Victoria, Australia
- Melanie J Murphy
- School of Psychology & Public Health, La Trobe University, Melbourne, Victoria, Australia
- David P Crewther
- Centre for Psychopharmacology, Swinburne University of Technology, Hawthorn, Victoria, Australia
11
Differential expression of genes and differentially perturbed pathways associated with very high evening fatigue in oncology patients receiving chemotherapy. Support Care Cancer 2017; 26:739-750. [PMID: 28944404 DOI: 10.1007/s00520-017-3883-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 12/27/2016] [Accepted: 09/11/2017] [Indexed: 10/18/2022]
Abstract
PURPOSE Fatigue is the most common symptom associated with cancer and its treatment. Investigation of molecular mechanisms associated with fatigue in oncology patients may identify new therapeutic targets. The objectives of this study were to evaluate the relationships between gene expression and perturbations in biological pathways and evening fatigue severity in oncology patients who received chemotherapy (CTX). METHODS The Lee Fatigue Scale (LFS) and latent class analysis were used to identify evening fatigue phenotypes. We measured 47,214 ribonucleic acid transcripts from whole blood collected prior to a cycle of CTX. Perturbations in biological pathways associated with differential gene expression were identified from public data sets (i.e., Kyoto Encyclopedia of Genes and Genomes, BioCarta). RESULTS Patients were classified into Moderate (n = 65, mean LFS score 3.1) or Very High (n = 195, mean LFS score 6.4) evening fatigue groups. Compared to patients with Moderate fatigue, patients with Very High fatigue exhibited differential expression of 29 genes. A number of the perturbed pathways identified validated prior mechanistic hypotheses for fatigue, including alterations in immune function, inflammation, neurotransmission, energy metabolism, and circadian rhythms. Based on our findings, energy metabolism was further divided into alterations in carbohydrate metabolism and skeletal muscle energy. Alterations in renal function-related pathways were identified as a potential new mechanism. CONCLUSIONS This study identified differential gene expression and perturbed biological pathways that provide new insights into the multiple and likely inter-related mechanisms associated with evening fatigue in oncology patients.
12
Abstract
The analysis of gene sets (in a form of functionally related genes or pathways) has become the method of choice for extracting the strongest signals from omics data. The motivation behind using gene sets instead of individual genes is two-fold. First, this approach incorporates pre-existing biological knowledge into the analysis and facilitates the interpretation of experimental results. Second, it employs a statistical hypotheses testing framework. Here, we briefly review main Gene Set Analysis (GSA) approaches for testing differential expression of gene sets and several GSA approaches for testing statistical hypotheses beyond differential expression that allow extracting additional biological information from the data. We distinguish three major types of GSA approaches testing: (1) differential expression (DE), (2) differential variability (DV), and (3) differential co-expression (DC) of gene sets between two phenotypes. We also present comparative power analysis and Type I error rates for different approaches in each major type of GSA on simulated data. Our evaluation presents a concise guideline for selecting GSA approaches best performing under particular experimental settings. The value of the three major types of GSA approaches is illustrated with real data example. While being applied to the same data set, major types of GSA approaches result in complementary biological information.
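Note on the differential expression (DE) setting described in this abstract: the sketch below shows one generic way to test a gene set between two phenotypes with a self-contained, sample-permutation null. It is an illustration under stated assumptions and is not any of the specific GSA methods the review evaluates. The function name `exact_permutation_p_value`, the mean-score statistic, and the toy data are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(expression, gene_set, group_a):
    """Exact two-sided permutation p-value for a mean shift of a gene set.

    expression: dict mapping gene name -> list of per-sample values
    gene_set:   iterable of gene names forming the set
    group_a:    indices of the samples in the first phenotype
    """
    genes = [g for g in gene_set if g in expression]
    if not genes:
        raise ValueError("no gene of the set is present in the data")
    n_samples = len(expression[genes[0]])
    # One score per sample: the mean expression of the set's genes.
    scores = [mean(expression[g][s] for g in genes) for s in range(n_samples)]

    def statistic(first_group):
        first = set(first_group)
        a = [scores[s] for s in range(n_samples) if s in first]
        b = [scores[s] for s in range(n_samples) if s not in first]
        return abs(mean(a) - mean(b))

    observed = statistic(group_a)
    # Enumerate every relabelling of samples that keeps the group sizes fixed.
    relabellings = list(combinations(range(n_samples), len(group_a)))
    # Small tolerance guards against floating-point ties.
    extreme = sum(statistic(r) >= observed - 1e-12 for r in relabellings)
    return extreme / len(relabellings)


# Toy data (illustrative only): 8 samples, the first 4 form phenotype A.
toy_expression = {
    "g1": [1, 2, 3, 4, 11, 12, 13, 14],
    "g2": [2, 3, 4, 5, 12, 13, 14, 15],
    "g3": [5, 5, 5, 5, 5, 5, 5, 5],  # constant gene, not in the tested set
}
p = exact_permutation_p_value(toy_expression, ["g1", "g2"], group_a=[0, 1, 2, 3])
print(p)
```

Because only the genes inside the set enter the statistic and the null is generated by relabelling samples, this corresponds to a self-contained hypothesis. Full enumeration is feasible only for small sample sizes; larger studies would typically draw random relabellings instead.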
13
Abstract
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data stays difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as common approach; which fundamentally implies genes or proteins as independent units. The independent assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, sample covariance may not be a precise estimation if the sample size is very limited, which is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed by the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. 
We implemented the T2-statistic into an R package T2GA, which is available at https://github.com/roqe/T2GA. Pathway analysis is a common approach to quickly access the pathways being regulated in the experiments. There are numerous statistics to perform pathway analysis; most of them assume that the genes or proteins are independent of each other for statistical ease. This assumption, however, is unrealistic to the real biological system and may cause false positives in practice. A standard way to address this issue is to measure the associations among genes or proteins. Unfortunately, the estimation of associations requires sufficient sample size, which is usually not available for proteomic data produced by mass spectrometry. In this study, we propose a T2-statistic, which estimates the associations among gene products, to perform pathway analysis for quantitative proteomic data. Instead of calculating the associations directly from data, we use the confidence scores retrieved from protein-protein interaction databases. We also design an integrating procedure to reserve pathways of sufficient evidence as a regulated pathway group. We compare the proposed T2-statistic to other popular statistics using five published experimental datasets, and the T2-statistic yields more accurate descriptions in agreement with the discussion of the original papers.
14
Rahmatallah Y, Zybailov B, Emmert-Streib F, Glazko G. GSAR: Bioconductor package for Gene Set analysis in R. BMC Bioinformatics 2017; 18:61. [PMID: 28118818 PMCID: PMC5259853 DOI: 10.1186/s12859-017-1482-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 07/29/2016] [Accepted: 01/10/2017] [Indexed: 01/01/2023]
Abstract
BACKGROUND Gene set analysis (in a form of functionally related genes or pathways) has become the method of choice for analyzing omics data in general and gene expression data in particular. There are many statistical methods that either summarize gene-level statistics for a gene set or apply a multivariate statistic that accounts for intergene correlations. Most available methods detect complex departures from the null hypothesis but lack the ability to identify the specific alternative hypothesis that rejects the null. RESULTS GSAR (Gene Set Analysis in R) is an open-source R/Bioconductor software package for gene set analysis (GSA). It implements self-contained multivariate non-parametric statistical methods testing a complex null hypothesis against specific alternatives, such as differences in mean (shift), variance (scale), or net correlation structure. The package also provides a graphical visualization tool, based on the union of two minimum spanning trees, for correlation networks to examine the change in the correlation structures of a gene set between two conditions and highlight influential genes (hubs). CONCLUSIONS Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives. The methods in package GSAR are applicable to any type of omics data that can be represented in a matrix format. The package, with detailed instructions and examples, is freely available under the GPL (>= 2) license from the Bioconductor web site.
Affiliation(s)
- Yasir Rahmatallah
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
- Boris Zybailov
- Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
- Frank Emmert-Streib
- Computational Medicine and Statistical Learning Laboratory, Tampere University of Technology, Korkeakoulunkatu 1, Tampere, FI-33720, Finland
- Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
|
15
|
Kober KM, Dunn L, Mastick J, Cooper B, Langford D, Melisko M, Venook A, Chen LM, Wright F, Hammer M, Schmidt BL, Levine J, Miaskowski C, Aouizerat BE. Gene Expression Profiling of Evening Fatigue in Women Undergoing Chemotherapy for Breast Cancer. Biol Res Nurs 2016; 18:370-85. [PMID: 26957308 DOI: 10.1177/1099800416629209] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 01/16/2023]
Abstract
Moderate-to-severe fatigue occurs in up to 94% of oncology patients undergoing active treatment. Current interventions for fatigue are not efficacious. A major impediment to the development of effective treatments is a lack of understanding of the fundamental mechanisms underlying fatigue. In the current study, differences in phenotypic characteristics and gene expression profiles were evaluated in a sample of breast cancer patients undergoing chemotherapy (CTX) who reported low (n = 19) and high (n = 25) levels of evening fatigue. Compared to the low group, patients in the high evening fatigue group reported lower functional status scores, higher comorbidity scores, and fewer prior cancer treatments. One gene was identified as upregulated and 11 as downregulated in the high evening fatigue group. Gene set analysis found 24 downregulated and 94 simultaneously up- and downregulated pathways between the two fatigue groups. Transcript origin analysis found that differential expression (DE) originated primarily from monocytes and dendritic cell types. Query of public data sources found 18 gene expression experiments with similar DE profiles. Our analyses revealed that inflammation, neurotransmitter regulation, and energy metabolism are likely mechanisms associated with evening fatigue severity; that CTX may contribute to fatigue seen in oncology patients; and that the patterns of gene expression may be shared with other models of fatigue (e.g., physical exercise and pathogen-induced sickness behavior). These results suggest that the mechanisms that underlie fatigue in oncology patients are multifactorial.
Affiliation(s)
- Kord M Kober
- School of Nursing, University of California, San Francisco, CA, USA
- Laura Dunn
- School of Medicine, University of California, San Francisco, CA, USA
- Judy Mastick
- School of Nursing, University of California, San Francisco, CA, USA
- Bruce Cooper
- School of Nursing, University of California, San Francisco, CA, USA
- Dale Langford
- School of Nursing, University of California, San Francisco, CA, USA
- Michelle Melisko
- School of Medicine, University of California, San Francisco, CA, USA
- Alan Venook
- School of Medicine, University of California, San Francisco, CA, USA
- Lee-May Chen
- School of Medicine, University of California, San Francisco, CA, USA
- Fay Wright
- College of Nursing, New York University, New York, NY, USA
- Marilyn Hammer
- College of Nursing, New York University, New York, NY, USA
- Brian L Schmidt
- Department of Oral and Maxillofacial Surgery, New York University, New York, NY, USA
- Jon Levine
- School of Medicine, University of California, San Francisco, CA, USA
- Bradley E Aouizerat
- School of Nursing, University of California, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, CA, USA
|
16
|
Emmert-Streib F, Zhang SD, Hamilton P. Report from the 2nd Summer School in Computational Biology organized by the Queen's University of Belfast. Genomics Data 2015; 2:37-9. [PMID: 26484064 PMCID: PMC4535836 DOI: 10.1016/j.gdata.2013.12.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2013] [Revised: 12/17/2013] [Accepted: 12/20/2013] [Indexed: 12/01/2022]
Abstract
In this paper, we present a meeting report for the 2nd Summer School in Computational Biology organized by the Queen's University of Belfast. We describe the organization of the summer school, its underlying concept and student feedback we received after the completion of the summer school.
Affiliation(s)
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, United Kingdom
- Corresponding author.
- Shu-Dong Zhang
- Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, United Kingdom
- Peter Hamilton
- Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, United Kingdom
|
17
|
Rahmatallah Y, Emmert-Streib F, Glazko G. Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline. Brief Bioinform 2015; 17:393-407. [PMID: 26342128 PMCID: PMC4870397 DOI: 10.1093/bib/bbv069] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 04/01/2015] [Indexed: 11/15/2022] Open
Abstract
Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in the form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarray practice as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to sample size and heterogeneity, as well as sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarrays or RNA-seq data. Interestingly, we found that competitive methods have lower power and lower robustness to sample heterogeneity than self-contained methods, leading to poor reproducibility of results. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting the GSA approaches that perform best under particular experimental settings in the context of RNA-seq.
|
18
|
Frost HR, Li Z, Asselbergs FW, Moore JH. An Independent Filter for Gene Set Testing Based on Spectral Enrichment. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:1076-1086. [PMID: 26451820 PMCID: PMC4666312 DOI: 10.1109/tcbb.2015.2415815] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
Affiliation(s)
- H. Robert Frost
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
- Zhigang Li
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
- Folkert W. Asselbergs
- Durrer Center for Cardio-genetic Research at the ICIN-Netherlands Heart Institute and the Department of Cardiology, Division of Heart and Lungs at the University Medical Center Utrecht, Utrecht, The Netherlands
- Jason H. Moore
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
|
19
|
Rahmatallah Y, Emmert-Streib F, Glazko G. Comparative evaluation of gene set analysis approaches for RNA-Seq data. BMC Bioinformatics 2014; 15:397. [PMID: 25475910 PMCID: PMC4265362 DOI: 10.1186/s12859-014-0397-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 08/11/2014] [Accepted: 11/24/2014] [Indexed: 11/18/2022] Open
Abstract
Background: Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in the interpretation of the results from RNA-Seq experiments, the limitations of GSA methods developed for microarrays in the context of RNA-Seq data are not well understood. Results: We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach utilizes univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways that do not have any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the methods for combining P-values, and all aforementioned biases are present in detected pathways. Conclusions: Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways for RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.
Affiliation(s)
- Yasir Rahmatallah
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, 97 Lisburn Road, Belfast, BT9 7BL, UK
- Galina Glazko
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
|
20
|
Ulrich R, Puff C, Wewetzer K, Kalkuhl A, Deschl U, Baumgärtner W. Transcriptional changes in canine distemper virus-induced demyelinating leukoencephalitis favor a biphasic mode of demyelination. PLoS One 2014; 9:e95917. [PMID: 24755553 PMCID: PMC3995819 DOI: 10.1371/journal.pone.0095917] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 11/19/2013] [Accepted: 04/01/2014] [Indexed: 01/08/2023] Open
Abstract
Canine distemper virus (CDV)-induced demyelinating leukoencephalitis in dogs (Canis familiaris) is suggested to represent a naturally occurring translational model for subacute sclerosing panencephalitis and multiple sclerosis in humans. The aim of this study was a hypothesis-free microarray analysis of the transcriptional changes within cerebellar specimens of five cases of acute, six cases of subacute demyelinating, and three cases of chronic demyelinating and inflammatory CDV leukoencephalitis as compared to twelve non-infected control dogs. Frozen cerebellar specimens were used for analysis of histopathological changes including demyelination, transcriptional changes employing microarrays, and presence of CDV nucleoprotein RNA and protein using microarrays, RT-qPCR and immunohistochemistry. Microarray analysis revealed 780 differentially expressed probe sets. The dominating change was an up-regulation of genes related to the innate and the humoral immune response and, less distinctly, the cytotoxic T-cell-mediated immune response in all subtypes of CDV leukoencephalitis as compared to controls. Multiple myelin genes including myelin basic protein and proteolipid protein displayed a selective down-regulation in subacute CDV leukoencephalitis, suggestive of an oligodendrocyte dystrophy. In contrast, a marked up-regulation of multiple immunoglobulin-like expressed sequence tags and the delta polypeptide of the CD3 antigen was observed in chronic CDV leukoencephalitis, in agreement with the hypothesis of an immune-mediated demyelination in the late inflammatory phase of the disease. Analysis of pathways intimately linked to demyelination as determined by morphometry employing correlation-based Gene Set Enrichment Analysis highlighted the pathomechanistic importance of up-regulated genes comprised by the gene ontology terms “viral replication” and “humoral immune response” as well as down-regulated genes functionally related to “metabolite and energy generation”.
Affiliation(s)
- Reiner Ulrich
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany
- Center of Systems Neuroscience, Hannover, Germany
- Christina Puff
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany
- Konstantin Wewetzer
- Department of Functional and Applied Anatomy, Hannover Medical School, Hannover, Germany
- Center of Systems Neuroscience, Hannover, Germany
- Arno Kalkuhl
- Department of Non-Clinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach (Riß), Germany
- Ulrich Deschl
- Department of Non-Clinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach (Riß), Germany
- Wolfgang Baumgärtner
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany
- Center of Systems Neuroscience, Hannover, Germany
|
21
|
Bateman AR, El-Hachem N, Beck AH, Aerts HJWL, Haibe-Kains B. Importance of collection in gene set enrichment analysis of drug response in cancer cell lines. Sci Rep 2014; 4:4092. [PMID: 24522610 PMCID: PMC3923229 DOI: 10.1038/srep04092] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 09/25/2013] [Accepted: 01/29/2014] [Indexed: 12/27/2022] Open
Abstract
Gene set enrichment analysis (GSEA) associates gene sets and phenotypes; its use is predicated on the choice of a pre-defined collection of sets. The de facto standard implementation of GSEA provides seven collections, yet there are no guidelines for the choice of collections and the impact of such choice, if any, is unknown. Here we compare each of the standard gene set collections in the context of a large dataset of drug response in human cancer cell lines. We define and test a new collection based on gene co-expression in cancer cell lines to compare the performance of the standard collections to an externally derived cell line based collection. The results show that GSEA findings vary significantly depending on the collection chosen for analysis. Henceforth, collections should be carefully selected and reported in studies that leverage GSEA.
Affiliation(s)
- Alain R Bateman
- Bioinformatics and Computational Genomics Laboratory, Institut de Recherches Cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada
- Nehme El-Hachem
- Bioinformatics and Computational Genomics Laboratory, Institut de Recherches Cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada
- Andrew H Beck
- Department of Pathology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, USA
- Hugo J W L Aerts
- [1] Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA [2] Department of Radiation Oncology & Radiology, Dana-Farber Cancer Institute, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA [3] Department of Radiation Oncology, Maastricht University, Maastricht, The Netherlands
- Benjamin Haibe-Kains
- [1] Bioinformatics and Computational Genomics Laboratory, Institut de Recherches Cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada [2] Ontario Cancer Institute, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
|
22
|
Rahmatallah Y, Emmert-Streib F, Glazko G. Gene Sets Net Correlations Analysis (GSNCA): a multivariate differential coexpression test for gene sets. Bioinformatics 2013; 30:360-8. [PMID: 24292935 PMCID: PMC4023302 DOI: 10.1093/bioinformatics/btt687] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION: To date, gene set analysis approaches primarily focus on identifying differentially expressed gene sets (pathways). Methods for identifying differentially coexpressed pathways also exist but are mostly based on aggregated pairwise correlations or other pairwise measures of coexpression. Instead, we propose Gene Sets Net Correlations Analysis (GSNCA), a multivariate differential coexpression test that accounts for the complete correlation structure between genes. RESULTS: In GSNCA, weight factors are assigned to genes in proportion to the genes' cross-correlations (intergene correlations). The problem of finding the weight vectors is formulated as an eigenvector problem with a unique solution. GSNCA tests the null hypothesis that for a gene set there is no difference in the weight vectors of the genes between two conditions. In simulation studies and the analyses of experimental data, we demonstrate that GSNCA captures changes in the structure of genes' cross-correlations rather than differences in the averaged pairwise correlations. Thus, GSNCA infers differences in coexpression networks while bypassing method-dependent steps of network inference. As an additional result from GSNCA, we define hub genes as genes with the largest weights and show that these genes correspond frequently to major and specific pathway regulators, as well as to genes that are most affected by the biological difference between two conditions. In summary, GSNCA is a new approach for the analysis of differentially coexpressed pathways that also evaluates the importance of the genes in the pathways, thus providing unique information that may result in the generation of novel biological hypotheses. AVAILABILITY AND IMPLEMENTATION: Implementation of the GSNCA test in R is available upon request from the authors.
Affiliation(s)
- Yasir Rahmatallah
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA and Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast BT9 7BL, UK
|