1
Lu L, Townsend KA, Daigle BJ. GEOlimma: differential expression analysis and feature selection using pre-existing microarray data. BMC Bioinformatics 2021; 22:44. [PMID: 33535967; PMCID: PMC7860207; DOI: 10.1186/s12859-020-03932-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/05/2019] [Accepted: 12/11/2020] [Indexed: 12/14/2022]
Abstract
Background: Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes.

Results: In this study, we propose a novel differential expression and feature selection method, GEOlimma, which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely applied Limma method for differential expression analysis. We first quantified differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets and converted differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities showed enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset.

Conclusions: Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.
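The abstract describes converting per-gene differential expression frequencies into DE prior probabilities and combining them with Limma. The sketch below illustrates one plausible reading of that idea, not the authors' implementation. It assumes the prior is a smoothed fraction of comparisons in which a gene was called DE, and that the prior enters through the log-odds (B) statistic, which in limma's `eBayes` is computed under a single prior proportion (0.01 by default). The function names and the pseudo-count smoothing are hypothetical.

```python
import math


def de_prior(de_count, n_comparisons, pseudo=1.0):
    """Smoothed fraction of comparisons in which a gene was called DE.

    The pseudo-count (an assumption) keeps the prior strictly inside (0, 1)
    so that its logit is always finite.
    """
    return (de_count + pseudo) / (n_comparisons + 2.0 * pseudo)


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def adjust_log_odds(b_stat, gene_prior, default_prior=0.01):
    """Swap the shared prior odds in a log-odds statistic for gene-specific ones.

    b_stat is assumed to be a posterior log-odds computed under default_prior;
    subtracting the old prior log-odds and adding the new ones leaves the
    data-driven part of the statistic unchanged.
    """
    return b_stat - logit(default_prior) + logit(gene_prior)


# Hypothetical counts: a gene called DE in 500 of 2481 comparisons.
prior = de_prior(500, 2481)
adjusted = adjust_log_odds(1.5, prior)
```

Under this reading, genes that were frequently DE in pre-existing data receive higher log-odds than they would under the shared default, and rarely DE genes receive lower ones.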
Affiliation(s)
- Liangqun Lu
  - Department of Biological Sciences, University of Memphis, Memphis, USA; Department of Computer Science, University of Memphis, Memphis, USA
- Kevin A Townsend
  - Department of Computer Science, University of Memphis, Memphis, USA
- Bernie J Daigle
  - Department of Biological Sciences, University of Memphis, Memphis, USA; Department of Computer Science, University of Memphis, Memphis, USA
2
Zhou B, Osinski JM, Mateo JL, Martynoga B, Sim FJ, Campbell CE, Guillemot F, Piper M, Gronostajski RM. Loss of NFIX Transcription Factor Biases Postnatal Neural Stem/Progenitor Cells Toward Oligodendrogenesis. Stem Cells Dev 2015; 24:2114-26. [DOI: 10.1089/scd.2015.0136] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 10/23/2022]
Affiliation(s)
- Bo Zhou
  - Department of Biochemistry, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
- Jason M. Osinski
  - Department of Biochemistry, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
- Juan L. Mateo
  - Centre for Organismal Studies Heidelberg, University of Heidelberg, Heidelberg, Germany
- Ben Martynoga
  - Division of Molecular Neurobiology, MRC, London, United Kingdom
- Fraser J. Sim
  - Department of Genetics, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
  - Department of Pharmacology and Toxicology, State University of New York at Buffalo, Buffalo, New York
- Christine E. Campbell
  - Department of Biochemistry, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
- Michael Piper
  - School of Biomedical Sciences, Queensland Brain Institute, The University of Queensland, Brisbane, Australia
- Richard M. Gronostajski
  - Department of Biochemistry, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
  - Department of Genetics, Genomics and Bioinformatics Program, New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, New York
3
Daigle BJ, Deng A, McLaughlin T, Cushman SW, Cam MC, Reaven G, Tsao PS, Altman RB. Using pre-existing microarray datasets to increase experimental power: application to insulin resistance. PLoS Comput Biol 2010; 6:e1000718. [PMID: 20361040; PMCID: PMC2845644; DOI: 10.1371/journal.pcbi.1000718] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 08/05/2009] [Accepted: 02/22/2010] [Indexed: 11/18/2022]
Abstract
Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed that increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrated that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.
Affiliation(s)
- Bernie J. Daigle
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
- Alicia Deng
  - Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, United States of America
- Tracey McLaughlin
  - Division of Endocrinology, Stanford University School of Medicine, Stanford, California, United States of America
- Samuel W. Cushman
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Margaret C. Cam
  - National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Gerald Reaven
  - Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, United States of America
- Philip S. Tsao
  - Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, United States of America
- Russ B. Altman
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America
  - Department of Bioengineering, Stanford University School of Medicine, Stanford, California, United States of America
4
Abstract
Background: Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address this challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of underlying biology, because biologically significant genes are not necessarily statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure.

Results: We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, edge weights of the graph are set to gene expression levels. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of genes and molecular functions can be simultaneously ranked by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-regulated genes can also be ranked separately.

Conclusion: The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.
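The ranking measure named above, kernelized spatial depth, can be illustrated with a small sketch. It computes the sample spatial depth, one minus the norm of the average unit vector pointing from the sample points to a query point, entirely through kernel evaluations. This is a generic illustration under stated assumptions, not the paper's algorithm: the bipartite gene-function graph is not reproduced, items are represented as plain numeric vectors, and the Gaussian kernel and its `sigma` are arbitrary choices.

```python
import math


def linear_kernel(u, v):
    """Ordinary dot product; with it the depth reduces to plain spatial depth."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel; sigma is an assumed bandwidth."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_spatial_depth(x, points, kernel):
    """Depth of x relative to points: near 1 is central, near 0 is outlying."""
    if not points:
        raise ValueError("points must be non-empty")
    kxx = kernel(x, x)
    # Feature-space distance from x to each point; points coinciding with x
    # contribute a zero vector and are skipped.
    terms = []
    for y in points:
        d2 = kxx + kernel(y, y) - 2.0 * kernel(x, y)
        if d2 > 1e-12:
            terms.append((y, math.sqrt(d2)))
    # Squared norm of the sum of unit vectors, expanded into kernel terms.
    total = 0.0
    for y, dy in terms:
        for z, dz in terms:
            inner = kxx + kernel(y, z) - kernel(x, y) - kernel(x, z)
            total += inner / (dy * dz)
    return 1.0 - math.sqrt(max(total, 0.0)) / len(points)


cloud = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center_depth = kernel_spatial_depth((0.0, 0.0), cloud, gaussian_kernel)
outlier_depth = kernel_spatial_depth((100.0, 0.0), cloud, gaussian_kernel)
```

Sorting items by decreasing depth gives a center-outward ranking; how the paper maps genes and molecular functions into such a space is not stated in the abstract.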
Affiliation(s)
- Cuilan Gao
  - Department of Mathematics, The University of Mississippi, University, MS 38677, USA
5
Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, Schuyler RP, Williams T, Spritz RA, Hunter L. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol 2009; 5:e1000215. [PMID: 19325874; PMCID: PMC2653649; DOI: 10.1371/journal.pcbi.1000215] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Received: 09/25/2008] [Accepted: 02/12/2009] [Indexed: 01/17/2023]
Abstract
The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, are both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
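The abstract mentions a combined reliability score for the relationships between a pair of genes but does not say how it is computed. As a minimal sketch, the example below uses a noisy-OR combination, one common way to merge independent pieces of evidence; treating it as the scoring rule here is an assumption, and the function name and example values are hypothetical.

```python
def noisy_or(reliabilities):
    """Combine per-source reliabilities in [0, 1] into one score.

    The result is the probability that at least one source is correct,
    assuming the sources are independent.
    """
    remaining_doubt = 1.0
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
        remaining_doubt *= 1.0 - r
    return 1.0 - remaining_doubt


# Hypothetical edge supported by two moderately reliable sources.
edge_score = noisy_or([0.5, 0.5])
```

With this rule each additional supporting source can only raise the score, so an edge backed by several weak sources may outrank one backed by a single source of moderate reliability.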
Affiliation(s)
- Sonia M. Leach
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Hannah Tipney
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Weiguo Feng
  - Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
- William A. Baumgartner
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Priyanka Kasliwal
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Ronald P. Schuyler
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- Trevor Williams
  - Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
- Richard A. Spritz
  - Human Medical Genetics Program, University of Colorado at Denver, Denver, Colorado, United States of America
- Lawrence Hunter
  - Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America