101
|
Fan J, Feng Y, Jiang J, Tong X. Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification. J Am Stat Assoc 2016; 111:275-287. [PMID: 27185970 DOI: 10.1080/01621459.2015.1005212] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing.
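The augmentation step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian kernel, the normal-reference bandwidth rule, and the function names `gaussian_kde`, `normal_reference_bandwidth`, `fit_log_ratio_transforms`, and `augment` are all assumptions introduced here. Each feature is replaced by an estimate of the log ratio of its class-conditional marginal densities; a penalized logistic regression would then be fitted on the transformed features (not shown).

```python
import math
import random


def normal_reference_bandwidth(sample):
    """Rule-of-thumb bandwidth 1.06 * sd * n^(-1/5) (an assumed choice)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def fit_log_ratio_transforms(X, y, eps=1e-12):
    """For each feature j, estimate x -> log(f_j(x) / g_j(x)), where f_j and
    g_j are the marginal densities of feature j in class 1 and class 0."""
    transforms = []
    for j in range(len(X[0])):
        col1 = [row[j] for row, label in zip(X, y) if label == 1]
        col0 = [row[j] for row, label in zip(X, y) if label == 0]
        f = gaussian_kde(col1, normal_reference_bandwidth(col1))
        g = gaussian_kde(col0, normal_reference_bandwidth(col0))
        # eps guards against log(0) far from the data; bind f and g per feature.
        transforms.append(
            lambda x, f=f, g=g: math.log(f(x) + eps) - math.log(g(x) + eps)
        )
    return transforms


def augment(X, transforms):
    """Apply the per-feature log density ratio transforms to every row."""
    return [[t(v) for t, v in zip(transforms, row)] for row in X]


if __name__ == "__main__":
    # Synthetic data: feature 0 is informative, feature 1 is pure noise.
    random.seed(0)
    X = [[random.gauss(2.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(100)]
    X += [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(100)]
    y = [1] * 100 + [0] * 100
    transforms = fit_log_ratio_transforms(X, y)
    print(augment(X[:2], transforms))
```

A positive transformed value means the observation is more typical of class 1 on that feature, a negative value that it is more typical of class 0. The density estimates here reuse the same data that a downstream classifier would see, which is a simplification of this sketch.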
Affiliation(s)
- Jianqing Fan
- Jianqing Fan is Frederick L. Moore Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, 08544
| | - Yang Feng
- Yang Feng is Assistant Professor, Department of Statistics, Columbia University, New York, NY, 10027
| | - Jiancheng Jiang
- Jiancheng Jiang is Associate Professor, Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, 28223
| | - Xin Tong
- Xin Tong is Assistant Professor, Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, 90089
| |
|
102
|
Giddaluru S, Espeseth T, Salami A, Westlye LT, Lundquist A, Christoforou A, Cichon S, Adolfsson R, Steen VM, Reinvang I, Nilsson LG, Le Hellard S, Nyberg L. Genetics of structural connectivity and information processing in the brain. Brain Struct Funct 2016; 221:4643-4661. [PMID: 26852023 PMCID: PMC5102980 DOI: 10.1007/s00429-016-1194-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2015] [Accepted: 01/22/2016] [Indexed: 12/20/2022]
Abstract
Understanding the genetic factors underlying brain structural connectivity is a major challenge in imaging genetics. Here, we present results from genome-wide association studies (GWASs) of whole-brain white matter (WM) fractional anisotropy (FA), an index of microstructural coherence measured using diffusion tensor imaging. Data from independent GWASs of 355 Swedish and 250 Norwegian healthy adults were integrated by meta-analysis to enhance power. Complementary GWASs on behavioral data reflecting processing speed, which is related to microstructural properties of WM pathways, were performed and integrated with WM FA results via multimodal analysis to identify shared genetic associations. One locus on chromosome 17 (rs145994492) showed genome-wide significant association with WM FA (meta P value = 1.87 × 10⁻⁸). Suggestive associations (meta P value < 1 × 10⁻⁶) were observed for 12 loci, including one containing ZFPM2 (lowest meta P value = 7.44 × 10⁻⁸). This locus was also implicated in multimodal analysis of WM FA and processing speed (lowest Fisher P value = 8.56 × 10⁻⁷). ZFPM2 is relevant in specification of corticothalamic neurons during brain development. Analysis of SNPs associated with processing speed revealed association with a locus that included SSPO (lowest meta P value = 4.37 × 10⁻⁸), which has been linked to commissural axon growth. An intergenic SNP (rs183854424) 14 kb downstream of CSMD1, which is implicated in schizophrenia, showed suggestive evidence of association in the WM FA meta-analysis (meta P value = 1.43 × 10⁻⁷) and the multimodal analysis (Fisher P value = 1 × 10⁻⁷). These findings provide novel data on the genetics of WM pathways and processing speed, and highlight a role of ZFPM2 and CSMD1 in information processing in the brain.
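Assuming the "Fisher P value" reported above comes from Fisher's combined probability test (the abstract does not spell this out), combining independent p-values can be sketched as follows. The function names are hypothetical, the input values are illustrative and not taken from the study, and the closed-form chi-square tail used here holds only for even degrees of freedom, which is always the case for Fisher's statistic.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even `df`.

    For df = 2k the tail is exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Returns the statistic -2 * sum(log p_i) and its p-value under a
    chi-square distribution with 2 * len(p_values) degrees of freedom.
    """
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


if __name__ == "__main__":
    # Hypothetical p-values for one SNP from two analyses.
    stat, p = fisher_combine([1e-4, 2e-3])
    print(stat, p)
```

The method assumes the combined p-values are independent; two traits measured in the same individuals may violate that assumption, so the combined value should be read with care.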
Affiliation(s)
- Sudheer Giddaluru
- Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021, Bergen, Norway; K.G. Jebsen Center for Psychosis Research and the Norwegian Center for Mental Disorders Research (NORMENT), Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Thomas Espeseth
- K.G. Jebsen Center for Psychosis Research, Norwegian Center for Mental Disorders Research (NORMENT), Division of Mental Health and Addiction, Oslo University Hospital, 0424, Oslo, Norway; Department of Psychology, University of Oslo, 0317, Oslo, Norway
| | - Alireza Salami
- Umeå Center for Functional Brain Imaging (UFBI), Umeå University, 90187, Umeå, Sweden; Aging Research Center, Karolinska Institutet and Stockholm University, 11330, Stockholm, Sweden
| | - Lars T Westlye
- K.G. Jebsen Center for Psychosis Research, Norwegian Center for Mental Disorders Research (NORMENT), Division of Mental Health and Addiction, Oslo University Hospital, 0424, Oslo, Norway; Department of Psychology, University of Oslo, 0317, Oslo, Norway
| | - Anders Lundquist
- Umeå Center for Functional Brain Imaging (UFBI), Umeå University, 90187, Umeå, Sweden; Department of Statistics, USBF, Umeå University, 90187, Umeå, Sweden
| | - Andrea Christoforou
- Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021, Bergen, Norway; K.G. Jebsen Center for Psychosis Research and the Norwegian Center for Mental Disorders Research (NORMENT), Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Sven Cichon
- Division of Medical Genetics, Department of Biomedicine, University of Basel, 4058, Basel, Switzerland; Institute of Neuroscience and Medicine (INM-1), Research Center Juelich, 52425, Juelich, Germany; Department of Genomics, Life and Brain Center, University of Bonn, 53127, Bonn, Germany
| | - Rolf Adolfsson
- Department of Clinical Sciences, Psychiatry, Umeå University, 90187, Umeå, Sweden
| | - Vidar M Steen
- Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021, Bergen, Norway; K.G. Jebsen Center for Psychosis Research and the Norwegian Center for Mental Disorders Research (NORMENT), Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Ivar Reinvang
- Department of Psychology, University of Oslo, 0317, Oslo, Norway
| | - Lars Göran Nilsson
- Umeå Center for Functional Brain Imaging (UFBI), Umeå University, 90187, Umeå, Sweden; ARC, Karolinska Institutet, Stockholm, Sweden
| | - Stéphanie Le Hellard
- Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021, Bergen, Norway; K.G. Jebsen Center for Psychosis Research and the Norwegian Center for Mental Disorders Research (NORMENT), Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Lars Nyberg
- Umeå Center for Functional Brain Imaging (UFBI), Umeå University, 90187, Umeå, Sweden; Department of Radiation Sciences, Umeå University, 90187, Umeå, Sweden; Department of Integrative Medical Biology, Umeå University, 90187, Umeå, Sweden
| |
|
103
|
Schmid F, Schmid M, Müssel C, Sträng JE, Buske C, Bullinger L, Kraus JM, Kestler HA. GiANT: gene set uncertainty in enrichment analysis. Bioinformatics 2016; 32:1891-4. [DOI: 10.1093/bioinformatics/btw030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2015] [Accepted: 01/12/2016] [Indexed: 11/14/2022] Open
|
104
|
Stöckel D, Kehl T, Trampert P, Schneider L, Backes C, Ludwig N, Gerasch A, Kaufmann M, Gessler M, Graf N, Meese E, Keller A, Lenhof HP. Multi-omics enrichment analysis using the GeneTrail2 web service. Bioinformatics 2016; 32:1502-8. [PMID: 26787660 DOI: 10.1093/bioinformatics/btv770] [Citation(s) in RCA: 112] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2015] [Accepted: 12/28/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Gene set analysis has revolutionized the interpretation of high-throughput transcriptomic data. Nowadays, with comprehensive studies that measure multiple -omics from the same sample, powerful tools for the integrative analysis of multi-omics datasets are required. RESULTS: Here, we present GeneTrail2, a web service allowing the integrated analysis of transcriptomic, miRNomic, genomic and proteomic datasets. It offers multiple statistical tests, a large number of predefined reference sets, as well as a comprehensive collection of biological categories and enables direct comparisons between the computed results. We used GeneTrail2 to explore pathogenic mechanisms of Wilms tumors. We not only succeeded in revealing signaling cascades that may contribute to the malignancy of blastemal subtype tumors but also identified potential biomarkers for nephroblastoma with adverse prognosis. The presented use-case demonstrates that GeneTrail2 is well equipped for the integrative analysis of comprehensive -omics data and may help to shed light on complex pathogenic mechanisms in cancer and other diseases. AVAILABILITY AND IMPLEMENTATION: GeneTrail2 can be freely accessed at https://genetrail2.bioinf.uni-sb.de CONTACT: dstoeckel@bioinf.uni-sb.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daniel Stöckel
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | - Tim Kehl
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | - Patrick Trampert
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | - Lara Schneider
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | - Christina Backes
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | - Nicole Ludwig
- Department of Human Genetics, Medical School, Saarland University, Homburg D-66421
| | - Andreas Gerasch
- Center for Bioinformatics, Eberhard-Karls-University, Tübingen, D-72076
| | - Michael Kaufmann
- Center for Bioinformatics, Eberhard-Karls-University, Tübingen, D-72076
| | - Manfred Gessler
- Theodor-Boveri-Institute/Biocenter, Developmental Biochemistry, and Comprehensive Cancer Center Mainfranken, Würzburg University, Würzburg D-97074
| | - Norbert Graf
- Department of Pediatric Oncology and Hematology, Medical School, Saarland University, Homburg, D-66421, Germany
| | - Eckart Meese
- Department of Human Genetics, Medical School, Saarland University, Homburg D-66421
| | - Andreas Keller
- Center for Bioinformatics, Saarland University, Saarbrücken D-66041
| | | |
|
105
|
Poussin C, Laurent A, Peitsch MC, Hoeng J, De Leon H. Systems toxicology-based assessment of the candidate modified risk tobacco product THS2.2 for the adhesion of monocytic cells to human coronary arterial endothelial cells. Toxicology 2016; 339:73-86. [PMID: 26655683 DOI: 10.1016/j.tox.2015.11.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2015] [Revised: 11/26/2015] [Accepted: 11/30/2015] [Indexed: 12/20/2022]
Abstract
Alterations of endothelial adhesive properties by cigarette smoke (CS) can progressively favor the development of atherosclerosis which may cause cardiovascular disorders. Modified risk tobacco products (MRTPs) are tobacco products developed to reduce smoking-related risks. A systems biology/toxicology approach combined with a functional in vitro adhesion assay was used to assess the impact of a candidate heat-not-burn technology-based MRTP, Tobacco Heating System (THS) 2.2, on the adhesion of monocytic cells to human coronary arterial endothelial cells (HCAECs) compared with a reference cigarette (3R4F). HCAECs were treated for 4h with conditioned media of human monocytic Mono Mac 6 (MM6) cells preincubated with low or high concentrations of aqueous extracts from THS2.2 aerosol or 3R4F smoke for 2h (indirect treatment), unconditioned media (direct treatment), or fresh aqueous aerosol/smoke extracts (fresh direct treatment). Functional and molecular investigations revealed that aqueous 3R4F smoke extract promoted the adhesion of MM6 cells to HCAECs via distinct direct and indirect concentration-dependent mechanisms. Using the same approach, we identified significantly reduced effects of aqueous THS2.2 aerosol extract on MM6 cell-HCAEC adhesion, and reduced molecular changes in endothelial and monocytic cells. Ten- and 20-fold increased concentrations of aqueous THS2.2 aerosol extract were necessary to elicit similar effects to those measured with 3R4F in both fresh direct and indirect exposure modalities, respectively. Our systems toxicology study demonstrated reduced effects of an aqueous aerosol extract from the candidate MRTP, THS2.2, using the adhesion of monocytic cells to human coronary endothelial cells as a surrogate pathophysiologically relevant event in atherogenesis.
Affiliation(s)
- Carine Poussin
- Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland.
| | - Alexandra Laurent
- Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland
| | - Manuel C Peitsch
- Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland
| | - Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland
| | - Hector De Leon
- Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland
| |
|
106
|
Koskela von Sydow A, Janbaz C, Kardeby C, Repsilber D, Ivarsson M. IL-1α Counteract TGF-β Regulated Genes and Pathways in Human Fibroblasts. J Cell Biochem 2015; 117:1622-32. [DOI: 10.1002/jcb.25455] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2015] [Accepted: 12/01/2015] [Indexed: 12/24/2022]
Affiliation(s)
- Anita Koskela von Sydow
- Faculty of Medicine and Health; Örebro University; Örebro Sweden
- Department of Clinical Research Laboratory; University Hospital; Örebro Sweden
| | - Chris Janbaz
- Faculty of Medicine and Health; Örebro University; Örebro Sweden
- Department of Plastic and Reconstructive Surgery; University Hospital; Örebro Sweden
| | - Caroline Kardeby
- Faculty of Medicine and Health; Örebro University; Örebro Sweden
| | - Dirk Repsilber
- Faculty of Medicine and Health; Örebro University; Örebro Sweden
| | - Mikael Ivarsson
- Faculty of Medicine and Health; Örebro University; Örebro Sweden
| |
|
107
|
García-Campos MA, Espinal-Enríquez J, Hernández-Lemus E. Pathway Analysis: State of the Art. Front Physiol 2015; 6:383. [PMID: 26733877 PMCID: PMC4681784 DOI: 10.3389/fphys.2015.00383] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 11/26/2015] [Indexed: 12/02/2022] Open
Abstract
Pathway analysis is a set of widely used tools for research in life sciences intended to give meaning to high-throughput biological data. The methodology of these tools settles in the gathering and usage of knowledge that comprise biomolecular functioning, coupled with statistical testing and other algorithms. Despite their wide employment, pathway analysis foundations and overall background may not be fully understood, leading to misinterpretation of analysis results. This review attempts to comprise the fundamental knowledge to take into consideration when using pathway analysis as a hypothesis generation tool. We discuss the key elements that are part of these methodologies, their capabilities and current deficiencies. We also present an overview of current and all-time popular methods, highlighting different classes across them. In doing so, we show the exploding diversity of methods that pathway analysis encompasses, point out commonly overlooked caveats, and direct attention to a potential new class of methods that attempt to zoom the analysis scope to the sample scale.
Affiliation(s)
| | - Jesús Espinal-Enríquez
- Computational Genomics, National Institute of Genomic Medicine, México City, México; Complejidad en Biología de Sistemas, Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Ciudad de México, México
| | - Enrique Hernández-Lemus
- Computational Genomics, National Institute of Genomic Medicine, México City, México; Complejidad en Biología de Sistemas, Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Ciudad de México, México
| |
|
108
|
Clark NR, Szymkiewicz M, Wang Z, Monteiro CD, Jones MR, Ma'ayan A. Principal Angle Enrichment Analysis (PAEA): Dimensionally Reduced Multivariate Gene Set Enrichment Analysis Tool. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2015; 2015:256-262. [PMID: 26848405 DOI: 10.1109/bibm.2015.7359689] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community.
Affiliation(s)
- Neil R Clark
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai School, One Gustave L. Levy Place, New York, NY, 10029, USA; Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC); Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG)
| | - Maciej Szymkiewicz
- Warsaw School of Information Technology under the auspices of the Polish Academy of Sciences, 6 Newelska St., 01-447, Warsaw, Poland
| | - Zichen Wang
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai School, One Gustave L. Levy Place, New York, NY, 10029, USA; Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC); Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG)
| | - Caroline D Monteiro
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai School, One Gustave L. Levy Place, New York, NY, 10029, USA; Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC); Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG)
| | - Matthew R Jones
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai School, One Gustave L. Levy Place, New York, NY, 10029, USA; Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC); Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG)
| | - Avi Ma'ayan
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai School, One Gustave L. Levy Place, New York, NY, 10029, USA; Big Data to Knowledge (BD2K) Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC); Mount Sinai Knowledge Management Center (KMC) for Illuminating the Druggable Genome (IDG)
| |
|
109
|
Rosenberger A, Friedrichs S, Amos CI, Brennan P, Fehringer G, Heinrich J, Hung RJ, Muley T, Müller-Nurasyid M, Risch A, Bickeböller H. META-GSA: Combining Findings from Gene-Set Analyses across Several Genome-Wide Association Studies. PLoS One 2015; 10:e0140179. [PMID: 26501144 PMCID: PMC4621033 DOI: 10.1371/journal.pone.0140179] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2015] [Accepted: 09/21/2015] [Indexed: 01/31/2023] Open
Abstract
INTRODUCTION: Gene-set analysis (GSA) methods are used as complementary approaches to genome-wide association studies (GWASs). The single marker association estimates of a predefined set of genes are either contrasted with those of all remaining genes or with a null non-associated background. To pool the p-values from several GSAs, it is important to take into account the concordance of the observed patterns resulting from single marker association point estimates across any given gene set. Here we propose META-GSA, an enhanced version of Fisher's inverse χ²-method that weights each study to account for imperfect correlation between association patterns. SIMULATION AND POWER: We investigated the performance of META-GSA by simulating GWASs with 500 cases and 500 controls at 100 diallelic markers in 20 different scenarios, simulating different relative risks between 1 and 1.5 in gene sets of 10 genes. Wilcoxon's rank sum test was applied as GSA for each study. We found that META-GSA has greater power to discover truly associated gene sets than simple pooling of the p-values, e.g. 59% versus 37%, when the true relative risk for 5 of 10 genes was assumed to be 1.5. Under the null hypothesis of no difference in the true association pattern between the gene set of interest and the set of remaining genes, the results of both approaches are almost uncorrelated. We recommend not relying on p-values alone when combining the results of independent GSAs. APPLICATION: We applied META-GSA to pool the results of four case-control GWASs of lung cancer risk (Central European Study and Toronto/Lunenfeld-Tanenbaum Research Institute Study; German Lung Cancer Study and MD Anderson Cancer Center Study), which had already been analyzed separately with four different GSA methods (EASE, SLAT, mSUMSTAT and GenGen). This application revealed the pathway GO0015291 "transmembrane transporter activity" as significantly enriched with associated genes (GSA-method: EASE, p = 0.0315 corrected for multiple testing). Similar results were found for GO0015464 "acetylcholine receptor activity" but only when not corrected for multiple testing (all GSA-methods applied; p ≈ 0.02).
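The per-study step mentioned in the simulation, a Wilcoxon rank sum test used as a gene-set analysis, can be sketched as below. This is an illustrative sketch and not the META-GSA code: the function name `rank_sum_gsa` and the input layout (one association score per gene) are assumptions, and the p-value uses the normal approximation without tie or continuity correction.

```python
import math


def rank_sum_gsa(set_scores, background_scores):
    """Wilcoxon rank sum test of gene-set scores against the remaining genes.

    Returns (z, two_sided_p) from the normal approximation to the
    Mann-Whitney U statistic. Tied values receive midranks.
    """
    combined = sorted(
        [(v, 0) for v in set_scores] + [(v, 1) for v in background_scores]
    )
    n = len(combined)
    rank_sum_set = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            if combined[k][1] == 0:
                rank_sum_set += midrank
        i = j
    n1, n2 = len(set_scores), len(background_scores)
    u = rank_sum_set - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical per-gene association scores (e.g. -log10 p-values).
    gene_set = [2.1, 3.4, 1.9, 2.8, 3.0]
    remaining = [0.3, 1.1, 0.7, 0.2, 1.5, 0.9, 0.4, 1.2]
    print(rank_sum_gsa(gene_set, remaining))
```

A positive z indicates that genes in the set tend to score higher than the remaining genes. With small gene sets the normal approximation is rough, and an exact or permutation-based p-value would be preferable.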
Affiliation(s)
- Albert Rosenberger
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Göttingen, Germany
| | - Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Göttingen, Germany
| | - Christopher I. Amos
- Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States of America
| | - Paul Brennan
- International Agency for Research on Cancer (IARC), Lyon, France
| | - Gordon Fehringer
- Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
| | - Joachim Heinrich
- Institute of Epidemiology I, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
| | - Rayjean J. Hung
- Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
| | - Thomas Muley
- Translational Lung Research Center Heidelberg (TLRC-H), Member of the German Center for Lung Research (DZL), Heidelberg, Germany
- Thoraxklinik at University of Heidelberg, Heidelberg, Germany
| | - Martina Müller-Nurasyid
- Department of Medicine I, Ludwig-Maximilians-University Munich, Munich, Germany
- Institute of Medical Informatics, Biometry and Epidemiology, Chair of Genetic Epidemiology, Ludwig-Maximilians-University, Munich, Germany
- Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- DZHK (German Centre for Cardiovascular Research), partner site Munich Heart Alliance, Munich, Germany
| | - Angela Risch
- Translational Lung Research Center Heidelberg (TLRC-H), Member of the German Center for Lung Research (DZL), Heidelberg, Germany
- Division of Epigenomics and Cancer Risk Factors, German Cancer Research Center, Heidelberg, Germany
- Division of Molecular Biology, University Salzburg, Salzburg, Austria
| | - Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Göttingen, Germany
| |
|
110
|
Rahmatallah Y, Emmert-Streib F, Glazko G. Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline. Brief Bioinform 2015; 17:393-407. [PMID: 26342128 PMCID: PMC4870397 DOI: 10.1093/bib/bbv069] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2015] [Indexed: 11/15/2022] Open
Abstract
Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in a form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarrays practice as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to the sample size and heterogeneity, as well as the sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarrays or RNA-seq data. Interestingly, we found that competitive methods have lower power as well as robustness to the samples heterogeneity than self-contained methods, leading to poor results reproducibility. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting GSA approaches, best performing under particular experimental settings in the context of RNA-seq.
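To make the distinction concrete: a self-contained test asks whether any gene in the set is associated with the phenotype, and is typically evaluated by permuting sample labels, whereas a competitive test compares the set against the genes outside it. The sketch below illustrates only the self-contained case. It is not one of the methods evaluated in the review; the statistic (sum of squared differences in group means) and the function names are assumptions chosen for brevity.

```python
import random


def set_statistic(expr, labels):
    """Sum over genes of the squared difference in group means.

    `expr` is a list of genes, each a list of expression values per sample;
    `labels` holds 0 or 1 for each sample.
    """
    total = 0.0
    for gene in expr:
        g1 = [v for v, lab in zip(gene, labels) if lab == 1]
        g0 = [v for v, lab in zip(gene, labels) if lab == 0]
        total += (sum(g1) / len(g1) - sum(g0) / len(g0)) ** 2
    return total


def self_contained_test(expr, labels, n_perm=999, seed=0):
    """Permutation p-value for the null that no gene in the set differs
    between the two sample groups. Only genes in the set are used."""
    rng = random.Random(seed)
    observed = set_statistic(expr, labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # group sizes are preserved by shuffling
        if set_statistic(expr, permuted) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical gene set of two genes measured in ten samples.
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    expr = [
        [5.1, 4.8, 5.3, 5.0, 4.9, 3.2, 3.0, 3.4, 3.1, 2.9],
        [2.0, 2.2, 1.9, 2.1, 2.3, 2.0, 1.8, 2.2, 2.1, 1.9],
    ]
    print(self_contained_test(expr, labels))
```

Because sample labels are permuted rather than genes, correlations between genes in the set are preserved under the null, which is one reason self-contained tests behave differently from competitive ones.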
|
111
|
An adaptive test for the mean vector in large-p-small-n problems. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2015.03.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
112
|
Frost HR, Li Z, Asselbergs FW, Moore JH. An Independent Filter for Gene Set Testing Based on Spectral Enrichment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1076-1086. [PMID: 26451820 PMCID: PMC4666312 DOI: 10.1109/tcbb.2015.2415815] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
Collapse
Affiliation(s)
- H. Robert Frost
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
| | - Zhigang Li
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
| | - Folkert W. Asselbergs
- Durrer Center for Cardio-genetic Research at the ICIN-Netherlands Heart Institute and the Department of Cardiology, Division of Heart and Lungs at the University Medical Center Utrecht, Utrecht, The Netherlands
| | - Jason H. Moore
- Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755
| |
Collapse
|
113
|
Turner JA, Bolen CR, Blankenship DM. Quantitative gene set analysis generalized for repeated measures, confounder adjustment, and continuous covariates. BMC Bioinformatics 2015; 16:272. [PMID: 26316107 PMCID: PMC4551517 DOI: 10.1186/s12859-015-0707-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2015] [Accepted: 08/17/2015] [Indexed: 12/20/2022] Open
Abstract
Background: Gene set analysis (GSA) of gene expression data can be highly powerful when the biological signal is weak compared to other sources of variability in the data. However, many gene set analysis approaches utilize permutation tests which are not appropriate for complex study designs. For example, the correlation of subjects is broken when comparing time points within a longitudinal study. Linear mixed models provide a method to analyze longitudinal studies as well as adjust for potential confounding factors and account for sources of variability that are not of primary interest. Currently, there are no known gene set analysis approaches that fully account for these study design and analysis aspects. In order to do so, we generalize the QuSAGE gene set analysis algorithm, denoting the generalization Q-Gen, and provide the necessary estimation adjustments to incorporate linear mixed model analyses. Results: We assessed the performance of our generalized method in comparison to the original QuSAGE method in settings such as longitudinal repeated measures analysis and accounting for potential confounders. We demonstrate that the original QuSAGE method cannot control for type-I error when these complexities exist. In addition to statistical appropriateness, analysis of a longitudinal influenza study suggests Q-Gen can allow for greater sensitivity when exploring a large number of gene sets. Conclusions: Q-Gen is an extension to the gene set analysis method of QuSAGE, and allows for linear mixed models to be applied appropriately within a gene set analysis framework. It provides GSA an added layer of flexibility that was not previously available. This flexibility allows for more appropriate statistical modeling of complex data structures that are inherent to many microarray study designs and can provide more sensitivity.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0707-9) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jacob A Turner
- Baylor Research Institute, 3310 Live Oak, Dallas, 75204, TX, USA.
| | - Christopher R Bolen
- Department of Microbiology and Immunology, Stanford University School, Stanford, 94305, CA, USA.
| | | |
Collapse
|
114
|
Frost HR, Li Z, Moore JH. Principal component gene set enrichment (PCGSE). BioData Min 2015; 8:25. [PMID: 26300978 PMCID: PMC4543476 DOI: 10.1186/s13040-015-0059-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Accepted: 08/04/2015] [Indexed: 11/22/2022] Open
Abstract
Background: Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing interpretation methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with mainly zero loadings. Although useful when just a few variables dominate the population PCs, these methods can perform poorly on genomic data, where interesting biological features are frequently represented by the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for unsupervised gene set testing relative to the sample PCs of genomic data. The PCGSE method computes the statistical association between gene sets and individual PCs using a two-stage competitive gene set test. To demonstrate the efficacy of the PCGSE method, we use simulated and real gene expression data to evaluate the performance of various gene set test statistics and significance tests. Conclusions: Gene set testing is an effective approach for interpreting the PCs of high-dimensional genomic data. As shown using both simulated and real datasets, the PCGSE method can generate biologically meaningful and computationally efficient results via a two-stage, competitive parametric test that correctly accounts for inter-gene correlation. Electronic supplementary material: The online version of this article (doi:10.1186/s13040-015-0059-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- H Robert Frost
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, 03756 NH USA ; Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, 03756 NH USA ; Department of Genetics, Dartmouth College, Hanover, 03755 NH USA
| | - Zhigang Li
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, 03756 NH USA ; Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, 03756 NH USA
| | - Jason H Moore
- Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, 03756 NH USA ; Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine, Lebanon, 03756 NH USA ; Department of Genetics, Dartmouth College, Hanover, 03755 NH USA
| |
Collapse
|
115
|
Robinson DG, Wang JY, Storey JD. A nested parallel experiment demonstrates differences in intensity-dependence between RNA-seq and microarrays. Nucleic Acids Res 2015; 43:e131. [PMID: 26130709 PMCID: PMC4787771 DOI: 10.1093/nar/gkv636] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2015] [Accepted: 06/08/2015] [Indexed: 12/03/2022] Open
Abstract
Understanding the differences between microarray and RNA-Seq technologies for measuring gene expression is necessary for informed design of experiments and choice of data analysis methods. Previous comparisons have come to sometimes contradictory conclusions, which we suggest result from a lack of attention to the intensity-dependent nature of variation generated by the technologies. To examine this trend, we carried out a parallel nested experiment performed simultaneously on the two technologies that systematically split variation into four stages (treatment, biological variation, library preparation and chip/lane noise), allowing a separation and comparison of the sources of variation in a well-controlled cellular system, Saccharomyces cerevisiae. With this novel dataset, we demonstrate that power and accuracy are more dependent on per-gene read depth in RNA-Seq than they are on fluorescence intensity in microarrays. However, we carried out quantitative PCR validations which indicate that microarrays may demonstrate greater systematic bias in low-intensity genes than in RNA-seq.
Collapse
Affiliation(s)
- David G Robinson
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jean Y Wang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - John D Storey
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08544, USA; Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA
| |
Collapse
|
116
|
MacNeil SM, Johnson WE, Li DY, Piccolo SR, Bild AH. Inferring pathway dysregulation in cancers from multiple types of omic data. Genome Med 2015; 7:61. [PMID: 26170901 PMCID: PMC4499940 DOI: 10.1186/s13073-015-0189-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2014] [Accepted: 06/16/2015] [Indexed: 11/10/2022] Open
Abstract
Although in some cases individual genomic aberrations may drive disease development in isolation, a complex interplay among multiple aberrations is common. Accordingly, we developed Gene Set Omic Analysis (GSOA), a bioinformatics tool that can evaluate multiple types and combinations of omic data at the pathway level. GSOA uses machine learning to identify dysregulated pathways and improves upon other methods because of its ability to decipher complex, multigene patterns. We compare GSOA to alternative methods and demonstrate its ability to identify pathways known to play a role in various cancer phenotypes. Software implementing the GSOA method is freely available from https://bitbucket.org/srp33/gsoa.
Collapse
Affiliation(s)
- Shelley M MacNeil
- Department of Oncological Sciences, University of Utah, Salt Lake City, UT USA
- Department of Pharmacology and Toxicology, University of Utah, Salt Lake City, UT USA
| | - William E Johnson
- Department of Oncological Sciences, University of Utah, Salt Lake City, UT USA
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA USA
| | - Dean Y Li
- Department of Oncological Sciences, University of Utah, Salt Lake City, UT USA
- Department of Medicine, University of Utah, Salt Lake City, UT USA
- Department of Human Genetics, University of Utah, Salt Lake City, UT USA
| | - Stephen R Piccolo
- Department of Pharmacology and Toxicology, University of Utah, Salt Lake City, UT USA
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA USA
- Department of Biology, Brigham Young University, Provo, UT USA
| | - Andrea H Bild
- Department of Oncological Sciences, University of Utah, Salt Lake City, UT USA
- Department of Pharmacology and Toxicology, University of Utah, Salt Lake City, UT USA
| |
Collapse
|
117
|
Hejblum BP, Skinner J, Thiébaut R. Time-Course Gene Set Analysis for Longitudinal Gene Expression Data. PLoS Comput Biol 2015; 11:e1004310. [PMID: 26111374 PMCID: PMC4482329 DOI: 10.1371/journal.pcbi.1004310] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2014] [Accepted: 04/30/2015] [Indexed: 01/13/2023] Open
Abstract
Gene set analysis methods, which consider predefined groups of genes in the analysis of genomic data, have been successfully applied for analyzing gene expression data in cross-sectional studies. The time-course gene set analysis (TcGSA) introduced here is an extension of gene set analysis to longitudinal data. The proposed method relies on random effects modeling with maximum likelihood estimates. It allows the use of all available repeated measurements while dealing with unbalanced data due to missing at random (MAR) measurements. TcGSA is a hypothesis-driven method that identifies a priori defined gene sets with significant expression variations over time, taking into account the potential heterogeneity of expression within gene sets. When biological conditions are compared, the method indicates whether the time patterns of gene sets significantly differ according to these conditions. The interest of the method is illustrated by its application to two real-life datasets: an HIV therapeutic vaccine trial (DALIA-1 trial), and data from a recent study on influenza and pneumococcal vaccines. In the DALIA-1 trial, TcGSA revealed a significant change in gene expression over time within 69 gene sets during vaccination, while a standard univariate individual gene analysis corrected for multiple testing as well as a standard Gene Set Enrichment Analysis (GSEA) for time series both failed to detect any significant pattern change over time. When applied to the second illustrative data set, TcGSA identified 4 gene sets that were also linked with the influenza vaccine, although previous analyses had found them to be associated with the pneumococcal vaccine only. In our simulation study, TcGSA exhibits good statistical properties and increased power compared to other approaches for analyzing time-course expression patterns of gene sets. The method is made available to the community through an R package.
Gene set analysis methods use prior biological knowledge to analyze gene expression data. This prior knowledge takes the form of predefined groups of genes, linked through their biological function. Gene set analysis methods have been successfully applied in cross-sectional studies, their results being more sensitive and interpretable than those of methods investigating genomic data one gene at a time. The time-course gene set analysis (TcGSA) introduced here is an extension of such gene set analysis to longitudinal data. This method identifies a priori defined groups of genes whose expression is not stable over time, taking into account the potential heterogeneity between patients and between genes. When biological conditions are compared, it identifies the gene sets that have different expression dynamics according to these conditions. Data from 2 studies are analyzed: data from an HIV therapeutic vaccine trial, and data from a recent study on influenza and pneumococcal vaccines. In both cases, TcGSA provided new insights thanks to its increased sensitivity compared to standard approaches. Those results highlight the benefits of the TcGSA method for analyzing gene expression dynamics.
Collapse
Affiliation(s)
- Boris P. Hejblum
- Univ. Bordeaux, ISPED, Centre INSERM U897-Epidemiologie-Biostatistique, F-33000 Bordeaux, France
- INSERM, ISPED, Centre INSERM U897-Epidemiologie-Biostatistique, F-33000 Bordeaux, France
- INRIA, Team SISTM, F-33000 Bordeaux, France
- Vaccine Research Institute-VRI, Hôpital Henri Mondor, Créteil, France
- Baylor Institute for Immunology Research, Dallas, Texas, United States of America
| | - Jason Skinner
- Vaccine Research Institute-VRI, Hôpital Henri Mondor, Créteil, France
- Baylor Institute for Immunology Research, Dallas, Texas, United States of America
| | - Rodolphe Thiébaut
- Univ. Bordeaux, ISPED, Centre INSERM U897-Epidemiologie-Biostatistique, F-33000 Bordeaux, France
- INSERM, ISPED, Centre INSERM U897-Epidemiologie-Biostatistique, F-33000 Bordeaux, France
- INRIA, Team SISTM, F-33000 Bordeaux, France
- Vaccine Research Institute-VRI, Hôpital Henri Mondor, Créteil, France
- Baylor Institute for Immunology Research, Dallas, Texas, United States of America
| |
Collapse
|
118
|
Pathway-based gene signatures predicting clinical outcome of lung adenocarcinoma. Sci Rep 2015; 5:10979. [PMID: 26042604 PMCID: PMC4455286 DOI: 10.1038/srep10979] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2014] [Accepted: 05/11/2015] [Indexed: 01/24/2023] Open
Abstract
Lung adenocarcinoma is often diagnosed at an advanced stage with poor prognosis. Patients with different clinical outcomes may have similar clinico-pathological characteristics. The results of previous studies of biomarkers for lung adenocarcinoma have generally been inconsistent and limited in clinical application. In this study, we used inverse-variance weighting to combine the hazard ratios for the four datasets and performed pathway analysis to identify prognosis-associated gene signatures. A total of 2,418 genes were found to be significantly associated with overall survival. Of these, a 21-gene signature in the HMGB1/RAGE signalling pathway and a 31-gene signature in the clathrin-coated vesicle cycle pathway were significantly associated with prognosis of lung adenocarcinoma across all four datasets (all p-values < 0.05, log-rank test). We combined the scores for the three pathways to derive a combined pathway-based risk (CPBR) score. The three pathway-based signatures and the CPBR score also had more predictive power than single genes. Finally, the CPBR score was validated in two independent cohorts (GSE14814 and GSE13213 in the GEO database) and had significant adjusted hazard ratios of 2.72 (p-value < 0.0001) and 1.71 (p-value < 0.0001), respectively. These results could provide a more complete picture of lung cancer pathogenesis.
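The inverse-variance weighting mentioned in this abstract can be illustrated with a minimal sketch. It assumes a fixed-effect model applied on the log hazard ratio scale, which is the textbook form of the technique; the abstract does not say which variant the authors used. The function name `pool_hazard_ratios` and all numeric inputs are hypothetical and are not taken from the paper.

```python
import math

def pool_hazard_ratios(hazard_ratios, standard_errors):
    """Fixed-effect inverse-variance pooling of hazard ratios.

    hazard_ratios: per-dataset hazard ratios for one gene (hypothetical inputs).
    standard_errors: standard errors of the corresponding log hazard ratios.
    Returns (pooled_hr, pooled_se_of_log_hr, (ci_low, ci_high)).
    """
    if not hazard_ratios or len(hazard_ratios) != len(standard_errors):
        raise ValueError("need one standard error per hazard ratio")
    # Each dataset is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    log_hrs = [math.log(hr) for hr in hazard_ratios]
    total_weight = sum(weights)
    pooled_log_hr = sum(w * b for w, b in zip(weights, log_hrs)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # 95% confidence interval, back-transformed to the hazard ratio scale.
    ci = (math.exp(pooled_log_hr - 1.96 * pooled_se),
          math.exp(pooled_log_hr + 1.96 * pooled_se))
    return math.exp(pooled_log_hr), pooled_se, ci

# Hypothetical estimates for one gene in four datasets (illustration only).
pooled_hr, pooled_se, ci = pool_hazard_ratios([1.8, 1.5, 2.1, 1.6],
                                              [0.30, 0.25, 0.40, 0.20])
print(f"pooled HR = {pooled_hr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Datasets with smaller standard errors dominate the pooled estimate, and the pooled standard error is never larger than the smallest input standard error.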
Collapse
|
119
|
Ye C, Jiang B, Zhang X, Liu JS. dslice: an R package for nonparametric testing of associations with application in QTL and gene set analysis. Bioinformatics 2015; 31:1842-4. [PMID: 25609796 DOI: 10.1093/bioinformatics/btv021] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2014] [Accepted: 01/12/2015] [Indexed: 11/12/2022] Open
Abstract
Many statistical problems in bioinformatics and genetics can be formulated as the testing of associations between a categorical variable and a continuous variable. A dynamic slicing method was proposed for non-parametric dependence testing, which has been demonstrated to have higher power than traditional methods such as the Kolmogorov-Smirnov test. We introduce an R package, dslice, to facilitate the use of the dynamic slicing method in bioinformatic applications such as quantitative trait loci studies and gene set enrichment analysis. AVAILABILITY AND IMPLEMENTATION: dslice is implemented in Rcpp and available in the Comprehensive R Archive Network. The package is distributed under the GNU General Public License (version 2 or later).
Collapse
Affiliation(s)
- Chao Ye
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China, Department of Statistics, Harvard University, Cambridge, MA 02138, USA, School of Life Sciences, Tsinghua University, Beijing 100084, China and Center of Statistics, Tsinghua University, Beijing 100084, China
| | - Bo Jiang
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China, Department of Statistics, Harvard University, Cambridge, MA 02138, USA, School of Life Sciences, Tsinghua University, Beijing 100084, China and Center of Statistics, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China, Department of Statistics, Harvard University, Cambridge, MA 02138, USA, School of Life Sciences, Tsinghua University, Beijing 100084, China and Center of Statistics, Tsinghua University, Beijing 100084, China
| | - Jun S Liu
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China, Department of Statistics, Harvard University, Cambridge, MA 02138, USA, School of Life Sciences, Tsinghua University, Beijing 100084, China and Center of Statistics, Tsinghua University, Beijing 100084, China
| |
Collapse
|
120
|
Lui TWH, Tsui NBY, Chan LWC, Wong CSC, Siu PMF, Yung BYM. DECODE: an integrated differential co-expression and differential expression analysis of gene expression data. BMC Bioinformatics 2015; 16:182. [PMID: 26026612 PMCID: PMC4449974 DOI: 10.1186/s12859-015-0582-4] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2014] [Accepted: 04/22/2015] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND: Both differential expression (DE) and differential co-expression (DC) analyses are appreciated as useful tools in understanding gene regulation related to complex diseases. The performance of integrating DE and DC, however, remains unexplored. RESULTS: In this study, we proposed a novel analytical approach called DECODE (Differential Co-expression and Differential Expression) to integrate DC and DE analyses of gene expression data. DECODE allows one to study the combined features of DC and DE of each transcript between two conditions. By incorporating information on the dependency between DC and DE variables, two optimal thresholds for defining substantial change in expression and co-expression are systematically defined for each gene based on chi-square maximization. By using these thresholds, genes can be categorized into four groups with either high or low DC and DE characteristics. In this study, DECODE was applied to a large breast cancer microarray data set consisting of two thousand tumor samples. By identifying genes with high DE and high DC, we demonstrated that DECODE could improve the detection of some functional gene sets such as those related to the immune system, metastasis, and lipid and glucose metabolism. Further investigation of the identified genes and the associated functional pathways would provide an additional level of understanding of complex disease mechanisms. CONCLUSIONS: By complementing the recent DC and the traditional DE analyses, DECODE is a valuable methodology for investigating biological functions of genes exhibiting disease-associated DE and DC combined characteristics, which may not be easily revealed through a DC or DE approach alone. DECODE is available at the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/web/packages/decode/index.html.
Collapse
Affiliation(s)
- Thomas W H Lui
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| | - Nancy B Y Tsui
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| | - Lawrence W C Chan
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| | - Cesar S C Wong
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| | - Parco M F Siu
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| | - Benjamin Y M Yung
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
| |
Collapse
|
121
|
Abstract
Many methods of gene set analysis developed in recent years have been compared empirically in a number of comprehensive review articles. Although it is recognized that different methods tend to identify different gene sets as significant, no consensus has been reached as to which method is preferable, as the recommendations are often contradictory. In this article, we group and compare different methods in terms of the methodological assumptions pertaining to the definition of a sample and the formulation of the actual null hypothesis. We discuss four models of statistical experiment explicitly or implicitly assumed by most if not all currently available methods of gene set analysis. We analyse the validity of the models in the context of the actual biological experiment. Based on this, we recommend a group of methods that provide biologically interpretable results in a statistically sound way. Finally, we demonstrate how correlated or low signal-to-noise data affect the performance of different methods, observed in terms of the false-positive rate and power.
Collapse
|
122
|
Bessonov K, Gusareva ES, Van Steen K. A cautionary note on the impact of protocol changes for genome-wide association SNP × SNP interaction studies: an example on ankylosing spondylitis. Hum Genet 2015; 134:761-73. [DOI: 10.1007/s00439-015-1560-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2015] [Accepted: 04/26/2015] [Indexed: 12/11/2022]
|
123
|
Larson JL, Owen AB. Moment based gene set tests. BMC Bioinformatics 2015; 16:132. [PMID: 25928861 PMCID: PMC4419444 DOI: 10.1186/s12859-015-0571-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2014] [Accepted: 04/10/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests. RESULTS: We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing |G| permutations, where |G| is the number of genes in set G. For the quadratic statistics, the cost is on the order of |G|^2 permutations, which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared. CONCLUSIONS: We have developed a moment-based approximation to the permutation distribution of linear and quadratic gene set test statistics. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations.
We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org).
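The 1/(M+1) floor on permutation p-values, and the way it drives up the number of permutations under multiple testing, can be made concrete with a short sketch. It shows plain permutation sampling, which the paper seeks to avoid, and not the moment-based approximation itself. The difference-in-means statistic, the Bonferroni-style threshold and the function names `permutation_p_value` and `min_permutations` are assumptions chosen for illustration.

```python
import math
import random

def permutation_p_value(scores, labels, num_permutations, seed=0):
    """Permutation p-value for the absolute difference in group means.

    Uses the (b + 1) / (M + 1) estimate, so the result can never fall
    below 1 / (M + 1). labels must contain both 0 and 1.
    """
    rng = random.Random(seed)

    def statistic(lbls):
        group1 = [s for s, l in zip(scores, lbls) if l == 1]
        group0 = [s for s, l in zip(scores, lbls) if l == 0]
        return abs(sum(group1) / len(group1) - sum(group0) / len(group0))

    observed = statistic(labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # permuting labels keeps the group sizes fixed
        if statistic(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (num_permutations + 1)

def min_permutations(num_tests, alpha):
    """Smallest M with 1 / (M + 1) <= alpha / num_tests (Bonferroni-style)."""
    return max(math.ceil(num_tests / alpha) - 1, 0)

# Hypothetical data: two clearly separated groups of four samples each.
p = permutation_p_value([0, 0, 0, 0, 10, 10, 10, 10],
                        [0, 0, 0, 0, 1, 1, 1, 1], num_permutations=99)
print(p, min_permutations(num_tests=4, alpha=0.5))
```

Because the required M grows linearly with the number of gene sets tested, the total sampling cost grows roughly quadratically, which is the motivation the abstract gives for a parametric approximation.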
Collapse
Affiliation(s)
- Jessica L Larson
- Department of Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, USA. Currently at GenePeeks, Inc., Cambridge, USA.
| | - Art B Owen
- Department of Statistics, Stanford University, Stanford, USA.
| |
Collapse
|
124
|
Chang YH, Yang YL, Chen CM, Chen HY. Apoptosis pathway signature for prediction of treatment response and clinical outcome in childhood high risk B-Precursor acute lymphoblastic leukemia. Am J Cancer Res 2015; 5:1844-1853. [PMID: 26175952 PMCID: PMC4497450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2015] [Accepted: 04/15/2015] [Indexed: 06/04/2023] Open
Abstract
The most common cancer in children is acute lymphoblastic leukemia (ALL), and it has a high cure rate, especially for B-precursor ALL. However, relapse due to drug resistance and overdose treatment remain limitations in patient management. In this study, gene expression microarray data, logistic regression, the significance analysis of microarrays (SAM) method, and gene set analysis were integrated to discover pathway-based signatures associated with treatment response in the original cohort. Results showed that 3772 probes were significantly associated with treatment response. After pathway analysis, only the apoptosis pathway had a significant association with treatment response. An apoptosis pathway signature (APS) derived from 15 significantly expressed genes had 88% accuracy for treatment response prediction. The APS was further validated in two independent cohorts. Results also showed that the APS was significantly associated with induction failure time (adjusted hazard ratio [HR] = 1.60, 95% confidence interval [CI] = [1.13, 2.27]) in the first cohort and significantly associated with event-free survival (adjusted HR = 1.56, 95% CI = [1.13, 2.16]) or overall survival in the second cohort (adjusted HR = 1.74, 95% CI = [1.24, 2.45]). The APS can not only predict clinical outcome but also provide molecular guidance for patient management.
Collapse
Affiliation(s)
- Ya-Hsuan Chang
- Institute of Biomedical Engineering, National Taiwan University, No. 1, Sec. 1, Jen-Ai Rd, Taipei 100, Taiwan
| | - Yung-Li Yang
- Department of Pediatrics, National Taiwan University Hospital and College of Medicine, National Taiwan University, 1 Jen-Ai Road, Section 1, Taipei, Taiwan
- Department of Laboratory Medicine, National Taiwan University Hospital and College of Medicine, National Taiwan University, 1 Jen-Ai Road, Section 1, Taipei, Taiwan
| | - Chung-Ming Chen
- Institute of Biomedical Engineering, National Taiwan University, No. 1, Sec. 1, Jen-Ai Rd, Taipei 100, Taiwan
| | - Hsuan-Yu Chen
- Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei, Taiwan
| |
|
125
|
Abstract
An important data analysis task in statistical genomics involves the integration of genome-wide gene-level measurements with preexisting data on the same genes. A wide variety of statistical methodologies and computational tools have been developed for this general task. We emphasize one particular distinction among methodologies, namely whether they process gene sets one at a time (uniset) or simultaneously via some multiset technique. Owing to the complexity of collections of gene sets, the multiset approach offers some advantages, as it naturally accommodates set-size variations and among-set overlaps. However, this approach presents both computational and inferential challenges. After reviewing some statistical issues that arise in uniset analysis, we examine two model-based multiset methods for gene list data.
Affiliation(s)
- Michael A Newton
- Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706 ; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin 53706
| | - Zhishi Wang
- Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706
| |
|
126
|
Ha NT, Gross JJ, van Dorland A, Tetens J, Thaller G, Schlather M, Bruckmaier R, Simianer H. Gene-based mapping and pathway analysis of metabolic traits in dairy cows. PLoS One 2015; 10:e0122325. [PMID: 25789767 PMCID: PMC4366076 DOI: 10.1371/journal.pone.0122325] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 10/01/2014] [Accepted: 02/05/2015] [Indexed: 11/18/2022] Open
Abstract
The metabolic adaptation of dairy cows during the transition period has been studied intensively in recent decades. However, until now, only a few studies have paid attention to the genetic aspects of this process. Here, we present the results of a gene-based mapping and pathway analysis with the measurements of three key metabolites, (1) non-esterified fatty acids (NEFA), (2) beta-hydroxybutyrate (BHBA) and (3) glucose, characterizing the metabolic adaptability of dairy cows before and after calving. In contrast to the conventional single-marker approach, we identify 99 significant and biologically sensible genes associated with at least one of the considered phenotypes, thus giving evidence for a genetic basis of the metabolic adaptability. Moreover, our results strongly suggest that three pathways involved in the metabolism of steroids and lipids are potential candidates for the adaptive regulation of dairy cows in their early lactation. From our perspective, a closer investigation of our findings will lead to a step forward in understanding the variability in the metabolic adaptability of dairy cows in their early lactation.
Affiliation(s)
- Ngoc-Thuy Ha
- Veterinary Physiology, Vetsuisse Faculty, University of Bern, Bern, Switzerland
- Animal Breeding and Genetics Group, Department of Animal Sciences, Georg-August-University, Goettingen, Germany
| | - Josef Johann Gross
- Veterinary Physiology, Vetsuisse Faculty, University of Bern, Bern, Switzerland
| | - Annette van Dorland
- Veterinary Physiology, Vetsuisse Faculty, University of Bern, Bern, Switzerland
| | - Jens Tetens
- Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Kiel, Germany
| | - Georg Thaller
- Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Kiel, Germany
| | - Martin Schlather
- Chair of Mathematical Statistics, University of Mannheim, Mannheim, Germany
| | - Rupert Bruckmaier
- Veterinary Physiology, Vetsuisse Faculty, University of Bern, Bern, Switzerland
| | - Henner Simianer
- Animal Breeding and Genetics Group, Department of Animal Sciences, Georg-August-University, Goettingen, Germany
| |
|
127
|
Nam D. Effect of the absolute statistic on gene-sampling gene-set analysis methods. Stat Methods Med Res 2015; 26:1248-1260. [PMID: 25733546 DOI: 10.1177/0962280215574014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022]
Abstract
Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of the absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating characteristic curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
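The mechanics of a gene-sampling test with an absolute gene statistic can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `gene_sampling_pvalue`, the choice of the set mean as the set-level score, the number of resamples, and the toy scores are all assumptions made here. The toy data only illustrates the bidirectional case mentioned in the abstract (a set whose genes move in both directions); it does not reproduce the reported false-positive results.

```python
import random

def gene_sampling_pvalue(scores, gene_set, n_perm=2000, absolute=True, seed=0):
    """Compare a gene set's mean score with random gene sets of the same size.

    scores:   dict mapping gene name -> gene-level statistic (e.g. a t-score)
    gene_set: list of gene names that are keys of `scores`
    absolute: if True, use |score| as the gene statistic (hypothetical option name)
    """
    rng = random.Random(seed)
    values = {g: (abs(s) if absolute else s) for g, s in scores.items()}
    k = len(gene_set)
    observed = sum(values[g] for g in gene_set) / k
    genes = list(values)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)  # random gene set of equal size
        if sum(values[g] for g in sample) / k >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data (made up): 200 background genes plus a set with 5 up and 5 down genes.
toy_rng = random.Random(1)
scores = {f"g{i}": toy_rng.gauss(0, 1) for i in range(200)}
bidirectional = [f"s{i}" for i in range(10)]
for i, g in enumerate(bidirectional):
    scores[g] = 4.0 if i % 2 == 0 else -4.0

p_abs = gene_sampling_pvalue(scores, bidirectional, absolute=True)
p_signed = gene_sampling_pvalue(scores, bidirectional, absolute=False)
```

With signed scores the up and down genes cancel in the set mean, so the set looks unremarkable; with absolute scores the same set stands out against random gene sets.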
Affiliation(s)
- Dougu Nam
- Department of Biological Sciences and Department of Mathematical Sciences, UNIST, Ulsan, Republic of Korea
| |
|
128
|
Psychiatric genome-wide association study analyses implicate neuronal, immune and histone pathways. Nat Neurosci 2015; 18:199-209. [PMID: 25599223 PMCID: PMC4378867 DOI: 10.1038/nn.3922] [Citation(s) in RCA: 555] [Impact Index Per Article: 61.7] [Received: 05/31/2014] [Accepted: 12/10/2014] [Indexed: 12/15/2022]
Abstract
Genome-wide association studies (GWAS) of psychiatric disorders have identified multiple genetic associations with such disorders, but better methods are needed to derive the underlying biological mechanisms that these signals indicate. We sought to identify biological pathways in GWAS data from over 60,000 participants from the Psychiatric Genomics Consortium. We developed an analysis framework to rank pathways that requires only summary statistics. We combined this score across disorders to find common pathways across three adult psychiatric disorders: schizophrenia, major depression and bipolar disorder. Histone methylation processes showed the strongest association, and we also found statistically significant evidence for associations with multiple immune and neuronal signaling pathways and with the postsynaptic density. Our study indicates that risk variants for psychiatric disorders aggregate in particular biological pathways and that these pathways are frequently shared between disorders. Our results confirm known mechanisms and suggest several novel insights into the etiology of psychiatric disorders.
|
129
|
Kaever A, Landesfeind M, Feussner K, Mosblech A, Heilmann I, Morgenstern B, Feussner I, Meinicke P. MarVis-Pathway: integrative and exploratory pathway analysis of non-targeted metabolomics data. Metabolomics 2015; 11:764-777. [PMID: 25972773 PMCID: PMC4419191 DOI: 10.1007/s11306-014-0734-y] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Received: 04/22/2014] [Accepted: 09/23/2014] [Indexed: 11/27/2022]
Abstract
A central aim in the evaluation of non-targeted metabolomics data is the detection of intensity patterns that differ between experimental conditions as well as the identification of the underlying metabolites and their association with metabolic pathways. In this context, the identification of metabolites based on non-targeted mass spectrometry data is a major bottleneck. In many applications, this identification needs to be guided by expert knowledge, and interactive tools for exploratory data analysis can significantly support this process. Additionally, the integration of data from other omics platforms, such as DNA microarray-based transcriptomics, can provide valuable hints and thereby facilitate the identification of metabolites via the reconstruction of related metabolic pathways. We here introduce the MarVis-Pathway tool, which allows the user to identify metabolites by annotation of pathways from cross-omics data. The analysis is supported by an extensive framework for pathway enrichment and meta-analysis. The tool allows the mapping of data set features by ID, name, and accurate mass, and can incorporate information from adduct and isotope correction of mass spectrometry data. MarVis-Pathway was integrated in the MarVis-Suite (http://marvis.gobics.de), which features seamless, highly interactive filtering, combination, clustering, and visualization of omics data sets. The functionality of the new software tool is illustrated using combined mass spectrometry and DNA microarray data. This application confirms jasmonate biosynthesis as an important metabolic pathway that is upregulated during the wound response of Arabidopsis plants.
Affiliation(s)
- Alexander Kaever
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Manuel Landesfeind
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Kirstin Feussner
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| | - Alina Mosblech
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| | - Ingo Heilmann
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Ivo Feussner
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| | - Peter Meinicke
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| |
|
130
|
Yang L, Ainali C, Tsoka S, Papageorgiou LG. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework. BMC Bioinformatics 2014; 15:390. [PMID: 25475756 PMCID: PMC4269079 DOI: 10.1186/s12859-014-0390-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/04/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches in which the expression of each gene is treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. RESULTS A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model in such a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. CONCLUSIONS The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. 
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
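The composite feature described in this abstract, a pathway activity formed as a weighted linear summation of constituent gene expression, can be sketched as below. This only shows how an activity value would be computed once weights are known; the optimisation model that chooses the weights is not reproduced. The function name `pathway_activity`, the gene names, and all numbers are hypothetical.

```python
def pathway_activity(expression, weights):
    """Return one activity value per sample as a weighted sum of gene expression.

    expression: dict mapping gene -> list of expression values, one per sample
    weights:    dict mapping gene -> weight; genes left out contribute nothing,
                which mimics a cap on the number of participating genes
    """
    n_samples = len(next(iter(expression.values())))
    return [
        sum(w * expression[gene][s] for gene, w in weights.items())
        for s in range(n_samples)
    ]

# Hypothetical pathway with three genes measured in two samples.
expression = {"A": [1.0, 2.0], "B": [0.5, 0.0], "C": [3.0, 1.0]}
# Hypothetical weights; gene C is excluded (implicit weight of zero).
weights = {"A": 0.5, "B": -1.0}

activity = pathway_activity(expression, weights)
```

Each sample is thereby reduced to a single number per pathway, and a classifier would then be trained on these low-dimensional activity profiles rather than on individual genes.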
Affiliation(s)
- Lingjian Yang
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
| | - Chrysanthi Ainali
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
| | - Sophia Tsoka
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
| | - Lazaros G Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
| |
|
131
|
Rahmatallah Y, Emmert-Streib F, Glazko G. Comparative evaluation of gene set analysis approaches for RNA-Seq data. BMC Bioinformatics 2014; 15:397. [PMID: 25475910 PMCID: PMC4265362 DOI: 10.1186/s12859-014-0397-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 08/11/2014] [Accepted: 11/24/2014] [Indexed: 11/18/2022] Open
Abstract
Background Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over from microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in the interpretation of the results from RNA-Seq experiments, the limitations of GSA methods developed for microarrays in the context of RNA-Seq data are not well understood. Results We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach utilizes univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways that do not have any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the methods for combining P-values, and all aforementioned biases are present in detected pathways. Conclusions Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways in RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.
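The gene-level approach in this abstract combines gene-specific P-values into one pathway P-value. A minimal sketch of one classical combination rule, Fisher's method, is given below; treating Fisher's method as a representative of the techniques compared is an assumption made here, and the function name `fisher_combine` is illustrative. Fisher's method assumes independent P-values, which gene-level tests within a pathway often violate, and that may be one route to the inflated Type I error the abstract warns about.

```python
import math

def fisher_combine(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P-value")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total

# Hypothetical gene-level P-values for one pathway.
pathway_p = fisher_combine([0.01, 0.01])
```

A single P-value is returned unchanged, which gives a quick sanity check of the closed form.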
Affiliation(s)
- Yasir Rahmatallah
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.
| | - Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, 97 Lisburn Road, Belfast, BT9 7BL, UK.
| | - Galina Glazko
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.
| |
|
132
|
Zhang Y, Huo L, Lin L, Zeng Y. The Dantzig Discriminant Analysis with High Dimensional Data. COMMUN STAT-THEOR M 2014. [DOI: 10.1080/03610926.2013.878359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
133
|
Hopp L, Wirth H, Fasold M, Binder H. Portraying the expression landscapes of cancer subtypes. ACTA ACUST UNITED AC 2014. [DOI: 10.4161/sysb.25897] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 12/18/2022]
|
134
|
Schlage WK, Iskandar AR, Kostadinova R, Xiang Y, Sewer A, Majeed S, Kuehn D, Frentzel S, Talikka M, Geertz M, Mathis C, Ivanov N, Hoeng J, Peitsch MC. In vitro systems toxicology approach to investigate the effects of repeated cigarette smoke exposure on human buccal and gingival organotypic epithelial tissue cultures. Toxicol Mech Methods 2014; 24:470-87. [PMID: 25046638 PMCID: PMC4219813 DOI: 10.3109/15376516.2014.943441] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 04/15/2014] [Revised: 06/20/2014] [Accepted: 06/29/2014] [Indexed: 11/13/2022]
Abstract
Smoking has been associated with diseases of the lung, pulmonary airways and oral cavity. Cytologic, genomic and transcriptomic changes in oral mucosa correlate with oral pre-neoplasia, cancer and inflammation (e.g. periodontitis). Alteration of smoking-related gene expression changes in oral epithelial cells is similar to that in bronchial and nasal epithelial cells. Using a systems toxicology approach, we have previously assessed the impact of cigarette smoke (CS) seen as perturbations of biological processes in human nasal and bronchial organotypic epithelial culture models. Here, we report our further assessment using in vitro human oral organotypic epithelium models. We exposed the buccal and gingival organotypic epithelial tissue cultures to CS at the air-liquid interface. CS exposure was associated with increased secretion of inflammatory mediators, induction of cytochrome P450s activity and overall weak toxicity in both tissues. Using microarray technology, gene-set analysis and a novel computational modeling approach leveraging causal biological network models, we identified CS impact on xenobiotic metabolism-related pathways accompanied by a more subtle alteration in inflammatory processes. Gene-set analysis further indicated that the CS-induced pathways in the in vitro buccal tissue models resembled those in the in vivo buccal biopsies of smokers from a published dataset. These findings support the translatability of systems responses from in vitro to in vivo and demonstrate the applicability of oral organotypical tissue models for an impact assessment of CS on various tissues exposed during smoking, as well as for impact assessment of reduced-risk products.
Affiliation(s)
- Walter K. Schlage
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Anita R. Iskandar
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Radina Kostadinova
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Yang Xiang
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Alain Sewer
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Shoaib Majeed
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Diana Kuehn
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Stefan Frentzel
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Marja Talikka
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Marcel Geertz
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Carole Mathis
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Nikolai Ivanov
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| | - Manuel C. Peitsch
- Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
| |
|
135
|
Jiang L, Edwards SM, Thomsen B, Workman CT, Guldbrandtsen B, Sørensen P. A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records. BMC Bioinformatics 2014; 15:315. [PMID: 25253562 PMCID: PMC4181406 DOI: 10.1186/1471-2105-15-315] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 01/07/2013] [Accepted: 09/17/2014] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization. RESULTS We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance. CONCLUSION We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. 
Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.
Affiliation(s)
- Li Jiang
- Department of Molecular Biology and Genetics, Aarhus University, DK-8830 Tjele, Denmark.
|
136
|
Martin F, Sewer A, Talikka M, Xiang Y, Hoeng J, Peitsch MC. Quantification of biological network perturbations for mechanistic insight and diagnostics using two-layer causal models. BMC Bioinformatics 2014; 15:238. [PMID: 25015298 PMCID: PMC4227138 DOI: 10.1186/1471-2105-15-238] [Citation(s) in RCA: 92] [Impact Index Per Article: 9.2] [Received: 04/04/2014] [Accepted: 06/26/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND High-throughput measurement technologies such as microarrays provide complex datasets reflecting mechanisms perturbed in an experiment, typically a treatment vs. control design. Analysis of these information rich data can be guided based on a priori knowledge, such as networks or set of related proteins or genes. Among those, cause-and-effect network models are becoming increasingly popular and more than eighty such models, describing processes involved in cell proliferation, cell fate, cell stress, and inflammation have already been published. A meaningful systems toxicology approach to study the response of a cell system, or organism, exposed to bio-active substances requires a quantitative measure of dose-response at network level, to go beyond the differential expression of single genes. RESULTS We developed a method that quantifies network response in an interpretable manner. It fully exploits the (signed graph) structure of cause-and-effect networks models to integrate and mine transcriptomics measurements. The presented approach also enables the extraction of network-based signatures for predicting a phenotype of interest. The obtained signatures are coherent with the underlying network perturbation and can lead to more robust predictions across independent studies. The value of the various components of our mathematically coherent approach is substantiated using several in vivo and in vitro transcriptomics datasets. As a proof-of-principle, our methodology was applied to unravel mechanisms related to the efficacy of a specific anti-inflammatory drug in patients suffering from ulcerative colitis. A plausible mechanistic explanation of the unequal efficacy of the drug is provided. Moreover, by utilizing the underlying mechanisms, an accurate and robust network-based diagnosis was built to predict the response to the treatment. 
CONCLUSION The presented framework efficiently integrates transcriptomics data and "cause and effect" network models to enable a mathematically coherent framework from quantitative impact assessment and data interpretation to patient stratification for diagnosis purposes.
Affiliation(s)
- Florian Martin
- Philip Morris International, R&D, Biological Systems Research, Quai Jeanrenaud 5, 2000 Neuchatel, Switzerland.
|
137
|
Jelizarow M, Cieza A, Mansmann U. Global permutation tests for multivariate ordinal data: alternatives, test statistics and the null dilemma. J R Stat Soc Ser C Appl Stat 2014. [DOI: 10.1111/rssc.12070] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Affiliation(s)
| | - Alarcos Cieza
- Ludwig-Maximilians University; Munich Germany
- University of Southampton; UK
|
138
|
Zhao J, Zhu Y, Boerwinkle E, Xiong M. Pathway analysis with next-generation sequencing data. Eur J Hum Genet 2014; 23:507-15. [PMID: 24986826 DOI: 10.1038/ejhg.2014.121] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/14/2013] [Revised: 03/29/2014] [Accepted: 04/26/2014] [Indexed: 12/21/2022] Open
Abstract
Although pathway analysis methods have been developed and successfully applied to association studies of common variants, the statistical methods for pathway-based association analysis of rare variants have not been well developed. Many investigators observed highly inflated false-positive rates and low power in pathway-based tests of association of rare variants. The inflated false-positive rates and low true-positive rates of the current methods are mainly due to their lack of ability to account for gametic phase disequilibrium. To overcome these serious limitations, we develop a novel statistic that is based on the smoothed functional principal component analysis (SFPCA) for pathway association tests with next-generation sequencing data. The developed statistic has the ability to capture position-level variant information and account for gametic phase disequilibrium. By intensive simulations, we demonstrate that the SFPCA-based statistic for testing pathway association with either rare or common or both rare and common variants has the correct type 1 error rates. Also the power of the SFPCA-based statistic and 22 additional existing statistics are evaluated. We found that the SFPCA-based statistic has a much higher power than other existing statistics in all the scenarios considered. To further evaluate its performance, the SFPCA-based statistic is applied to pathway analysis of exome sequencing data in the early-onset myocardial infarction (EOMI) project. We identify three pathways significantly associated with EOMI after the Bonferroni correction. In addition, our preliminary results show that the SFPCA-based statistic has much smaller P-values to identify pathway association than other existing methods.
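The Bonferroni correction mentioned in this abstract can be illustrated with a short sketch. The pathway names and P-values below are made up for illustration and are not taken from the EOMI analysis; the function name `bonferroni_significant` and the 0.05 significance level are likewise assumptions.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Keep the tests whose P-value passes the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)  # m = number of pathways tested
    return {name: p for name, p in p_values.items() if p <= threshold}

# Hypothetical pathway-level P-values (four pathways tested).
pathway_p = {
    "pathway_A": 1e-5,
    "pathway_B": 4e-4,
    "pathway_C": 0.03,
    "pathway_D": 0.2,
}

significant = bonferroni_significant(pathway_p)
```

With four tests the per-test threshold is 0.05 / 4 = 0.0125, so a pathway with P = 0.03 would count as significant on its own but not after correction.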
Affiliation(s)
- Jinying Zhao
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
| | - Yun Zhu
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
| | - Eric Boerwinkle
- Human Genetics Center, Division of Biostatistics, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Momiao Xiong
- Human Genetics Center, Division of Biostatistics, University of Texas Health Science Center at Houston, Houston, TX, USA
| |
|
139
|
Roux J, Privman E, Moretti S, Daub JT, Robinson-Rechavi M, Keller L. Patterns of positive selection in seven ant genomes. Mol Biol Evol 2014; 31:1661-85. [PMID: 24782441 PMCID: PMC4069625 DOI: 10.1093/molbev/msu141] [Citation(s) in RCA: 112] [Impact Index Per Article: 11.2] [Indexed: 02/06/2023] Open
Abstract
The evolution of ants is marked by remarkable adaptations that allowed the development of very complex social systems. To identify how ant-specific adaptations are associated with patterns of molecular evolution, we searched for signs of positive selection on amino-acid changes in proteins. We identified 24 functional categories of genes which were enriched for positively selected genes in the ant lineage. We also reanalyzed genome-wide data sets in bees and flies with the same methodology to check whether positive selection was specific to ants or also present in other insects. Notably, genes implicated in immunity were enriched for positively selected genes in the three lineages, ruling out the hypothesis that the evolution of hygienic behaviors in social insects caused a major relaxation of selective pressure on immune genes. Our scan also indicated that genes implicated in neurogenesis and olfaction started to undergo increased positive selection before the evolution of sociality in Hymenoptera. Finally, the comparison between these three lineages allowed us to pinpoint molecular evolution patterns that were specific to the ant lineage. In particular, there was ant-specific recurrent positive selection on genes with mitochondrial functions, suggesting that mitochondrial activity was improved during the evolution of this lineage. This might have been an important step toward the evolution of extreme lifespan that is a hallmark of ants.
Affiliation(s)
- Julien Roux
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Eyal Privman
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Sébastien Moretti
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Josephine T Daub
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; CMPG, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
| | - Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Laurent Keller
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
| |
|
140
|
Schwarzer C, Siatkowski M, Pfeiffer MJ, Baeumer N, Drexler HCA, Wang B, Fuellen G, Boiani M. Maternal age effect on mouse oocytes: new biological insight from proteomic analysis. Reproduction 2014; 148:55-72. [DOI: 10.1530/rep-14-0126] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Indexed: 12/12/2022]
Abstract
The long-standing view of ‘immortal germline vs mortal soma’ poses a fundamental question in biology concerning how oocytes age in molecular terms. A mainstream hypothesis is that maternal ageing of oocytes has its roots in gene transcription. Investigating the proteins resulting from mRNA translation would reveal how far the levels of functionally available proteins correlate with mRNAs and would offer novel insights into the changes oocytes undergo during maternal ageing. Gene ontology (GO) semantic analysis revealed a high similarity of the detected proteome (2324 proteins) to the transcriptome (22 334 mRNAs), although not all proteins had a cognate mRNA. Concerning their dynamics, fourfold changes of abundance were more frequent in the proteome (3%) than the transcriptome (0.05%), with no correlation. Whereas proteins associated with the nucleus (e.g. structural maintenance of chromosomes and spindle-assembly checkpoints) were largely represented among those that change in oocytes during maternal ageing, proteins associated with oxidative stress/damage (e.g. superoxide dismutase) were infrequent. These quantitative alterations are either impoverishing or enriching. According to GO analysis, these alterations do not relate in any simple way to the classic signature of ageing known from somatic tissues. Given the lack of correlation, we conclude that proteome analysis of mouse oocytes may not be surrogated with transcriptome analysis. Furthermore, we conclude that the classic features of ageing may not be transposed from somatic tissues to oocytes in a one-to-one fashion. Overall, there is more to the maternal ageing of oocytes than mere cellular deterioration exemplified by the notorious increase of meiotic aneuploidy.
|
141
|
Mishra P, Törönen P, Leino Y, Holm L. Gene set analysis: limitations in popular existing methods and proposed improvements. Bioinformatics 2014; 30:2747-56. [PMID: 24903419 DOI: 10.1093/bioinformatics/btu374] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION Gene set analysis is the analysis of a set of genes that collectively contribute to a biological process. Most popular gene set analysis methods are based on empirical P-values, which require a large number of permutations. Despite numerous gene set analysis methods developed in the past decade, the most popular methods still suffer from serious limitations. RESULTS We present a gene set analysis method (mGSZ) based on the Gene Set Z-scoring function (GSZ) and asymptotic P-values. Asymptotic P-value calculation requires fewer permutations, and thus speeds up the gene set analysis process. We compare the GSZ-scoring function with seven popular gene set scoring functions and show that GSZ stands out as the best scoring function. In addition, we show improved performance of the GSA method when the max-mean statistic is replaced by the GSZ scoring function. We demonstrate the importance of both gene and sample permutations by showing the consequences in the absence of one or the other. A comparison of asymptotic and empirical methods of P-value estimation demonstrates a clear advantage of asymptotic P-values over empirical P-values. We show that mGSZ outperforms the state-of-the-art methods based on two different evaluations. We compared mGSZ results with permutation and rotation tests and show that rotation does not improve our asymptotic P-values. We also propose well-known asymptotic distribution models for three of the compared methods. AVAILABILITY AND IMPLEMENTATION mGSZ is available as an R package from cran.r-project.org.
Affiliation(s)
- Pashupati Mishra
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland and CSC - IT Center for Science, Ltd., Espoo, Finland
| | - Petri Törönen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland and CSC - IT Center for Science, Ltd., Espoo, Finland
| | - Yrjö Leino
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland and CSC - IT Center for Science, Ltd., Espoo, Finland
| | - Liisa Holm
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland and CSC - IT Center for Science, Ltd., Espoo, Finland
| |
|
142
|
Martini P, Sales G, Calura E, Cagnin S, Chiogna M, Romualdi C. timeClip: pathway analysis for time course data without replicates. BMC Bioinformatics 2014; 15 Suppl 5:S3. [PMID: 25077979 PMCID: PMC4095003 DOI: 10.1186/1471-2105-15-s5-s3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023] Open
Abstract
Background Time-course gene expression experiments are useful tools for exploring biological processes. In this type of experiment, gene expression changes are monitored over time. Unfortunately, replication of time series is still costly, and long time courses usually do not have replicates. Many approaches have been proposed to deal with this data structure, but none of them in the field of pathway analysis. Pathway analyses have acquired great relevance for helping the interpretation of gene expression data. Several methods have been proposed to this aim: from the classical enrichment to the more complex topological analysis that gains power from the topology of the pathway. None of them were devised to identify temporal variations in time course data. Results Here we present timeClip, a topology-based pathway analysis specifically tailored to long time series without replicates. timeClip combines dimension reduction techniques and graph decomposition theory to explore and identify the portion of pathways that is most time-dependent. In the first step, timeClip selects the time-dependent pathways; in the second step, the most time-dependent portions of these pathways are highlighted. We used timeClip on simulated data and on a benchmark dataset from a mouse muscle regeneration model. Our approach shows good performance on different simulated settings. On the real dataset, we identify 76 time-dependent pathways, most of which are known to be involved in the regeneration process. Focusing on the 'mTOR signaling pathway' we highlight the timing of key processes of the muscle regeneration: from the early pathway activation through growth factor signals to the late burst of protein production needed for the fiber regeneration. Conclusions timeClip represents a new improvement in the field of time-dependent pathway analysis. It allows the isolation and dissection of pathways characterized by time-dependent components. Furthermore, using timeClip on a mouse muscle regeneration dataset we were able to characterize the process of muscle fiber regeneration with its correct timing.
|
143
|
Wang Y, Thilmony R, Gu YQ. NetVenn: an integrated network analysis web platform for gene lists. Nucleic Acids Res 2014; 42:W161-6. [PMID: 24771340 PMCID: PMC4086115 DOI: 10.1093/nar/gku331] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 02/07/2023] Open
Abstract
Many lists containing biological identifiers, such as gene lists, have been generated in various genomics projects. Identifying the overlap among gene lists can enable us to understand the similarities and differences between the data sets. Here, we present an interactome network-based web application platform named NetVenn for comparing and mining the relationships among gene lists. NetVenn contains interactome network data publicly available for several species and supports a user upload of customized interactome network data. It has an efficient and interactive graphic tool that provides a Venn diagram view for comparing two to four lists in the context of an interactome network. NetVenn also provides a comprehensive annotation of genes in the gene lists by using enriched terms from multiple functional databases. In addition, it allows for mapping the gene expression data, providing information on the transcription status of genes in the network. The power graph analysis tool is integrated in NetVenn for simplified visualization of gene relationships in the network. NetVenn is freely available at http://probes.pw.usda.gov/NetVenn or http://wheat.pw.usda.gov/NetVenn.
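Note: the overlap step described in this abstract, comparing two to four gene lists as a Venn diagram, can be illustrated with plain set operations. The sketch below is not NetVenn code; the function name `venn_regions` and the example gene identifiers are hypothetical, and the interactome network context is left out.

```python
from itertools import combinations


def venn_regions(gene_lists):
    """Return the genes found in exactly each combination of lists.

    gene_lists: dict mapping a list name to an iterable of gene IDs.
    The result maps a tuple of list names to the genes that occur in
    all of those lists and in none of the others (one Venn region).
    Hypothetical helper for illustration, not part of NetVenn.
    """
    names = sorted(gene_lists)
    sets = {name: set(genes) for name, genes in gene_lists.items()}
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(sets[n] for n in combo))
            outside = set().union(*(sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:  # keep only non-empty regions
                regions[combo] = exclusive
    return regions


# Example with made-up gene identifiers.
example = {
    "A": ["g1", "g2", "g3"],
    "B": ["g2", "g3", "g4"],
    "C": ["g3", "g5"],
}
for combo, genes in sorted(venn_regions(example).items()):
    print(combo, sorted(genes))
```

Each key of the result corresponds to one region of the diagram, so the region sizes are the numbers a Venn view would display.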
Affiliation(s)
- Yi Wang
- USDA-ARS, Western Regional Research Center, Genomics and Gene Discovery Research Unit, Albany, CA 94710, USA; Department of Plant Sciences, University of California, Davis, CA 95616, USA
| | - Roger Thilmony
- USDA-ARS, Western Regional Research Center, Crop Improvement Research Unit, Albany, CA 94710, USA
| | - Yong Q Gu
- USDA-ARS, Western Regional Research Center, Genomics and Gene Discovery Research Unit, Albany, CA 94710, USA
| |
|
144
|
Rhee SY, Mutwil M. Towards revealing the functions of all genes in plants. Trends Plant Sci 2014; 19:212-21. [PMID: 24231067 DOI: 10.1016/j.tplants.2013.10.006] [Citation(s) in RCA: 146] [Impact Index Per Article: 14.6] [Received: 08/06/2013] [Revised: 10/10/2013] [Accepted: 10/16/2013] [Indexed: 05/19/2023]
Abstract
The great recent progress made in identifying the molecular parts lists of organisms revealed the paucity of our understanding of what most of the parts do. In this review, we introduce computational and statistical approaches and omics data used for inferring gene function in plants, with an emphasis on network-based inference. We also discuss caveats associated with network-based function predictions such as performance assessment, annotation propagation, the guilt-by-association concept, and the meaning of hubs. Finally, we note the current limitations and possible future directions such as the need for gold standard data from several species, unified access to data and tools, quantitative comparison of data and tool quality, and high-throughput experimental validation platforms for systematic gene function elucidation in plants.
Affiliation(s)
- Seung Yon Rhee
- Carnegie Institution for Science, Department of Plant Biology, 260 Panama St, Stanford, CA 94305, USA.
| | - Marek Mutwil
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany.
| |
|
145
|
Reshetova P, Smilde AK, van Kampen AHC, Westerhuis JA. Use of prior knowledge for the analysis of high-throughput transcriptomics and metabolomics data. BMC Syst Biol 2014; 8 Suppl 2:S2. [PMID: 25033193 PMCID: PMC4101693 DOI: 10.1186/1752-0509-8-s2-s2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
Abstract
BACKGROUND High-throughput omics technologies have enabled the measurement of many genes or metabolites simultaneously. The resulting high-dimensional experimental data pose significant challenges to transcriptomics and metabolomics data analysis methods, which may lead to spurious instead of biologically relevant results. One strategy to improve the results is the incorporation of prior biological knowledge in the analysis. This strategy is used to reduce the solution space and/or to focus the analysis on biologically meaningful regions. In this article, we review a selection of these methods used in transcriptomics and metabolomics. We combine the reviewed methods in three groups based on the underlying mathematical model: exploratory methods, supervised methods and estimation of the covariance matrix. We discuss which prior knowledge has been used, how it is incorporated and how it modifies the mathematical properties of the underlying methods.
|
146
|
Feala JD, Abdulhameed MDM, Yu C, Dutta B, Yu X, Schmid K, Dave J, Tortella F, Reifman J. Systems biology approaches for discovering biomarkers for traumatic brain injury. J Neurotrauma 2014; 30:1101-16. [PMID: 23510232 DOI: 10.1089/neu.2012.2631] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 01/08/2023] Open
Abstract
The rate of traumatic brain injury (TBI) in service members with wartime injuries has risen rapidly in recent years, and complex, variable links have emerged between TBI and long-term neurological disorders. The multifactorial nature of TBI secondary cellular response has confounded attempts to find cellular biomarkers for its diagnosis and prognosis or for guiding therapy for brain injury. One possibility is to apply emerging systems biology strategies to holistically probe and analyze the complex interweaving molecular pathways and networks that mediate the secondary cellular response through computational models that integrate these diverse data sets. Here, we review available systems biology strategies, databases, and tools. In addition, we describe opportunities for applying this methodology to existing TBI data sets to identify new biomarker candidates and gain insights about the underlying molecular mechanisms of TBI response. As an exemplar, we apply network and pathway analysis to a manually compiled list of 32 protein biomarker candidates from the literature, recover known TBI-related mechanisms, and generate hypothetical new biomarker candidates.
Affiliation(s)
- Jacob D Feala
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
| |
|
147
|
Soneson C, Fontes M. Incorporation of gene exchangeabilities improves the reproducibility of gene set rankings. Comput Stat Data Anal 2014. [DOI: 10.1016/j.csda.2012.07.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
148
|
Kaever A, Landesfeind M, Feussner K, Morgenstern B, Feussner I, Meinicke P. Meta-analysis of pathway enrichment: combining independent and dependent omics data sets. PLoS One 2014; 9:e89297. [PMID: 24586671 PMCID: PMC3938450 DOI: 10.1371/journal.pone.0089297] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 09/06/2013] [Accepted: 01/20/2014] [Indexed: 11/18/2022] Open
Abstract
A major challenge in current systems biology is the combination and integrative analysis of large data sets obtained from different high-throughput omics platforms, such as mass spectrometry based Metabolomics and Proteomics or DNA microarray or RNA-seq-based Transcriptomics. Especially in the case of non-targeted Metabolomics experiments, where it is often impossible to unambiguously map ion features from mass spectrometry analysis to metabolites, the integration of more reliable omics technologies is highly desirable. A popular method for the knowledge-based interpretation of single data sets is the (Gene) Set Enrichment Analysis. In order to combine the results from different analyses, we introduce a methodical framework for the meta-analysis of p-values obtained from Pathway Enrichment Analysis (Set Enrichment Analysis based on pathways) of multiple dependent or independent data sets from different omics platforms. For dependent data sets, e.g. obtained from the same biological samples, the framework utilizes a covariance estimation procedure based on the nonsignificant pathways in single data set enrichment analysis. The framework is evaluated and applied in the joint analysis of Metabolomics mass spectrometry and Transcriptomics DNA microarray data in the context of plant wounding. In extensive studies of simulated data set dependence, the introduced correlation could be fully reconstructed by means of the covariance estimation based on pathway enrichment. By restricting the range of p-values of pathways considered in the estimation, the overestimation of correlation, which is introduced by the significant pathways, could be reduced. When applying the proposed methods to the real data sets, the meta-analysis was shown not only to be a powerful tool to investigate the correlation between different data sets and summarize the results of multiple analyses but also to distinguish experiment-specific key pathways.
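Note: the idea of combining pathway p-values from dependent data sets can be illustrated with a correlation-adjusted Stouffer Z-score. This is a generic textbook technique chosen for illustration and is not confirmed to be the procedure used in the paper; the function name `stouffer_combined_p` and the example correlation matrix are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(p_values, corr=None):
    """Combine one-sided p-values with Stouffer's Z-score method.

    p_values: p-values for the same pathway from several data sets,
              each strictly between 0 and 1.
    corr:     assumed correlation matrix of the per-data-set Z-scores
              (list of lists); None means the data sets are independent.
    Generic illustration, not the framework from the paper.
    """
    norm = NormalDist()
    n = len(p_values)
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    if corr is None:
        corr = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Variance of the sum of Z-scores is the sum of all covariance entries.
    variance = sum(corr[i][j] for i in range(n) for j in range(n))
    z_combined = sum(z_scores) / sqrt(variance)
    return 1.0 - norm.cdf(z_combined)


# Two data sets reporting p = 0.05 for the same pathway.
print(stouffer_combined_p([0.05, 0.05]))                            # independent
print(stouffer_combined_p([0.05, 0.05], [[1.0, 0.5], [0.5, 1.0]]))  # correlated
```

Ignoring a positive correlation makes the combined p-value too small, which is why the covariance between data sets has to be estimated before dependent results are merged.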
Affiliation(s)
- Alexander Kaever
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University, Göttingen, Germany
| | - Manuel Landesfeind
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University, Göttingen, Germany
| | - Kirstin Feussner
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University, Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University, Göttingen, Germany
| | - Ivo Feussner
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, Georg-August-University, Göttingen, Germany
| | - Peter Meinicke
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University, Göttingen, Germany
| |
|
149
|
Hua J, Bittner ML, Dougherty ER. Evaluating gene set enrichment analysis via a hybrid data model. Cancer Inform 2014; 13:1-16. [PMID: 24558298 PMCID: PMC3929260 DOI: 10.4137/cin.s13305] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 09/29/2013] [Revised: 11/12/2013] [Accepted: 11/15/2013] [Indexed: 01/09/2023] Open
Abstract
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance.
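Note: the hybrid data model described in this abstract (pick a master gene, generate phenotype labels from it, resample smaller data sets) can be sketched roughly as follows. The abstract does not say how labels are derived from the master gene, so the median split used here is an assumption, as are the function name `make_hybrid_dataset` and the toy expression values.

```python
import random
from statistics import median


def make_hybrid_dataset(expr, master_gene, sample_size, seed=0):
    """Build one resampled data set with labels driven by a master gene.

    expr:        dict mapping gene name to a list of expression values,
                 one value per sample (same sample order for all genes).
    master_gene: gene whose expression defines the ground truth.
    sample_size: number of samples to draw without replacement.
    Assumption: labels are 1 above the master gene's median, else 0.
    """
    values = expr[master_gene]
    cutoff = median(values)
    labels = [1 if v > cutoff else 0 for v in values]
    rng = random.Random(seed)
    chosen = rng.sample(range(len(values)), sample_size)
    sub_expr = {gene: [vals[i] for i in chosen] for gene, vals in expr.items()}
    sub_labels = [labels[i] for i in chosen]
    return sub_expr, sub_labels


# Toy expression matrix with two genes and six samples.
toy = {"G1": [1, 2, 3, 4, 5, 6], "G2": [6, 5, 4, 3, 2, 1]}
sub_expr, sub_labels = make_hybrid_dataset(toy, "G1", sample_size=4, seed=1)
print(sub_expr, sub_labels)
```

Varying `seed` and `master_gene` yields many small data sets with a known ground truth, which is what makes it possible to score how well a method ranks gene sets.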
Affiliation(s)
- Jianping Hua
- Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA
| | - Michael L Bittner
- Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA; Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, USA
| | - Edward R Dougherty
- Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA; Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
| |
|
150
|
Hu J, Tzeng JY. Integrative gene set analysis of multi-platform data with sample heterogeneity. Bioinformatics 2014; 30:1501-7. [PMID: 24489370 DOI: 10.1093/bioinformatics/btu060] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/22/2022]
Abstract
MOTIVATION Gene set analysis is a popular method for large-scale genomic studies. Because genes that have common biological features are analyzed jointly, gene set analysis often achieves better power and generates more biologically informative results. With the advancement of technologies, genomic studies with multi-platform data have become increasingly common. Several strategies have been proposed that integrate genomic data from multiple platforms to perform gene set analysis. To evaluate the performances of existing integrative gene set methods under various scenarios, we conduct a comparative simulation analysis based on The Cancer Genome Atlas breast cancer dataset. RESULTS We find that existing methods for gene set analysis are less effective when sample heterogeneity exists. To address this issue, we develop three methods for multi-platform genomic data with heterogeneity: two non-parametric methods, multi-platform Mann-Whitney statistics and multi-platform outlier robust T-statistics, and a parametric method, multi-platform likelihood ratio statistics. Using simulations, we show that the proposed multi-platform Mann-Whitney statistics method has higher power for heterogeneous samples and comparable performance for homogeneous samples when compared with the existing methods. Our real data applications to two datasets of The Cancer Genome Atlas also suggest that the proposed methods are able to identify novel pathways that are missed by other strategies. AVAILABILITY AND IMPLEMENTATION http://www4.stat.ncsu.edu/~jytzeng/Software/Multiplatform_gene_set_analysis/
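Note: for readers unfamiliar with the Mann-Whitney statistic named in this abstract, the sketch below computes the classical two-sample U statistic for a single gene on a single platform. How the paper combines such statistics across platforms is not described here, so this is only the standard building block; the function name `mann_whitney_u` and the example values are assumptions.

```python
def mann_whitney_u(cases, controls):
    """Classical Mann-Whitney U statistic for two samples.

    Counts the pairs (case, control) in which the case value is larger;
    ties contribute one half. Standard single-platform statistic only,
    not the multi-platform method proposed in the paper.
    """
    u = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up expression values for one gene.
cases = [3.1, 4.0, 5.2]
controls = [1.0, 2.5]
u = mann_whitney_u(cases, controls)
# Dividing by the number of pairs gives the probability that a random
# case exceeds a random control.
print(u, u / (len(cases) * len(controls)))
```

Because the statistic depends only on ranks, a few extreme samples cannot dominate it, which is a common reason for preferring rank-based tests when samples are heterogeneous.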
Affiliation(s)
- Jun Hu
- Bioinformatics Research Center, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA; Division of Bioinformatics, Omicsoft Inc., 200 Cascade Pointe Lane, Suite 101, Cary, NC 27513, USA; Department of Statistics, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA; and Department of Statistics, National Cheng-Kung University, No.1, University Road, Tainan 701, Taiwan
| | - Jung-Ying Tzeng
- Bioinformatics Research Center, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA; Division of Bioinformatics, Omicsoft Inc., 200 Cascade Pointe Lane, Suite 101, Cary, NC 27513, USA; Department of Statistics, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA; and Department of Statistics, National Cheng-Kung University, No.1, University Road, Tainan 701, Taiwan
| |
|