1
Gould AL, Baumgartner R, Zhao A. Bayesian screening for feature selection. J Biopharm Stat 2022; 32:832-857. [PMID: 35736220 DOI: 10.1080/10543406.2022.2033760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Biomedical applications such as genome-wide association studies screen large databases with high-dimensional features to identify rare, weakly expressed, and important continuous-valued features for subsequent detailed analysis. We describe an exact, rapid Bayesian screening approach with attractive diagnostic properties using a Gaussian random mixture model focusing on the missed discovery rate (the probability of failing to identify potentially informative features) rather than the false discovery rate ordinarily used with multiple hypothesis testing. The method provides the likelihood that a feature merits further investigation, as well as distributions of the effect magnitudes and the proportion of features with the same expected responses under alternative conditions. Important features include the dependence of the critical values on clinical and regulatory priorities and direct assessment of the diagnostic properties.
Affiliation(s)
- A Lawrence Gould
- Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
- Richard Baumgartner
- Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
- Amanda Zhao
- Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
2
Oh VKS, Li RW. Large-Scale Meta-Longitudinal Microbiome Data with a Known Batch Factor. Genes (Basel) 2022; 13:392. [PMID: 35327945 PMCID: PMC8953633 DOI: 10.3390/genes13030392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2021] [Revised: 02/05/2022] [Accepted: 02/18/2022] [Indexed: 12/04/2022]
Abstract
Data contamination in meta-approaches where multiple biological samples are combined considerably affects the results of subsequent downstream analyses, such as differential abundance tests comparing multiple groups at a fixed time point. Little has been thoroughly investigated regarding the impact of the lurking variable of various batch sources, such as different days or different laboratories, in more complicated time series experimental designs, for instance, repeatedly measured longitudinal data and metadata. We highlight that the influence of batch factors is significant on subsequent downstream analyses, including longitudinal differential abundance tests, by performing a case study of microbiome time course data with two treatment groups and a simulation study of mimic microbiome longitudinal counts.
Affiliation(s)
- Vera-Khlara S. Oh
- United States Department of Agriculture, Agricultural Research Service, Animal Genomics and Improvement Laboratory, Beltsville, MD 20705, USA
- Department of Data Science, College of Natural Sciences, Jeju National University, Jeju City 690-756, Korea
- Robert W. Li
- United States Department of Agriculture, Agricultural Research Service, Animal Genomics and Improvement Laboratory, Beltsville, MD 20705, USA
3
Kapourani CA, Argelaguet R, Sanguinetti G, Vallejos CA. scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biol 2021; 22:114. [PMID: 33879195 PMCID: PMC8056718 DOI: 10.1186/s13059-021-02329-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/22/2020] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
High-throughput single-cell measurements of DNA methylomes can quantify methylation heterogeneity and uncover its role in gene regulation. However, technical limitations and sparse coverage can preclude this task. scMET is a hierarchical Bayesian model which overcomes sparsity, sharing information across cells and genomic features to robustly quantify genuine biological heterogeneity. scMET can identify highly variable features that drive epigenetic heterogeneity, and perform differential methylation and variability analyses. We illustrate how scMET facilitates the characterization of epigenetically distinct cell populations and how it enables the formulation of novel hypotheses on the epigenetic regulation of gene expression. scMET is available at https://github.com/andreaskapou/scMET .
Affiliation(s)
- Chantriolnt-Andreas Kapourani
- MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- School of Informatics, University of Edinburgh, Edinburgh, UK
- Guido Sanguinetti
- School of Informatics, University of Edinburgh, Edinburgh, UK
- SISSA, International School of Advanced Studies, Trieste, Italy
- Catalina A Vallejos
- MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- The Alan Turing Institute, London, UK
4
Bhattacharjee A, Vishwakarma GK. Time-course data prediction for repeatedly measured gene expression. INT J BIOMATH 2019. [DOI: 10.1142/s1793524519500335] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022]
Abstract
Variability in time-course gene expression data is a natural phenomenon. This work aims to predict data at future time points from the observed sample data points, using Bayesian inference. Expression data for 218 genes, measured at 3 time points with 6 replicates, are used to illustrate the method. The estimates are found to be consistent with the HPD intervals for the predicted future gene expression values. The proposed method can be applied to other gene expression data settings to predict future time-course data.
Affiliation(s)
- Atanu Bhattacharjee
- Section of Biostatistics, Centre for Cancer Epidemiology, Tata Memorial Centre, Navi Mumbai 410210, India
- Gajendra K. Vishwakarma
- Department of Applied Mathematics, Indian Institute of Technology (ISM), Dhanbad-826004, India
5
Li B, Sun Z, He Q, Zhu Y, Qin ZS. Bayesian inference with historical data-based informative priors improves detection of differentially expressed genes. Bioinformatics 2016; 32:682-9. [PMID: 26519502 DOI: 10.1093/bioinformatics/btv631] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/15/2015] [Accepted: 10/26/2015] [Indexed: 12/13/2022]
Abstract
MOTIVATION Modern high-throughput biotechnologies such as microarrays are capable of producing a massive amount of information for each sample. However, in a typical high-throughput experiment, only a limited number of samples are assayed, hence the classical 'large p, small n' problem. On the other hand, rapid propagation of these high-throughput technologies has resulted in a substantial collection of data, often generated on the same platform and using the same protocol. It is highly desirable to utilize the existing data when performing analysis and inference on a new dataset.
RESULTS Existing data can be utilized in a straightforward fashion under the Bayesian framework, in which the repository of historical data is exploited to build informative priors that are then used in the analysis of new data. In this work, using microarray data, we investigate the feasibility and effectiveness of deriving informative priors from historical data and using them in the problem of detecting differentially expressed genes. Through simulation and real data analysis, we show that the proposed strategy significantly outperforms existing methods, including the popular and state-of-the-art Bayesian hierarchical model-based approaches. Our work illustrates the feasibility and benefits of exploiting the increasingly available genomics big data in statistical inference and presents a promising practical strategy for dealing with the 'large p, small n' problem.
AVAILABILITY AND IMPLEMENTATION Our method is implemented in the R package IPBT, which is freely available from https://github.com/benliemory/IPBT
CONTACT yuzhu@purdue.edu; zhaohui.qin@emory.edu
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ben Li
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Zhaonan Sun
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Qing He
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Yu Zhu
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA
6
Mahdevar G, Nowzari-Dalini A, Sadeghi M. Inferring gene correlation networks from transcription factor binding sites. Genes Genet Syst 2014; 88:301-9. [PMID: 24694393 DOI: 10.1266/ggs.88.301] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
Gene expression is a highly regulated biological process that is fundamental to the existence of phenotypes in any living organism. The regulatory relations are usually modeled as a network: every gene is modeled as a node, and relations are shown as edges between two related genes. This paper presents a novel method for inferring correlation networks (networks constructed by connecting co-expressed genes) by predicting co-expression levels from the genes' promoter sequences. According to the results, this method works well on biological data, and its outcome is comparable to that of methods that use microarray data as input. The method is written in C++ and is available upon request from the corresponding author.
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
7
Abstract
The process of screening for differentially expressed genes using microarray samples can usually be reduced to a large set of statistical hypothesis tests. In this situation, statistical issues arise that are not encountered in a single hypothesis test, related to the need to identify the specific hypotheses to be rejected and to report an associated error. As in any complex testing problem, it is rarely the case that a single method is always to be preferred, leaving analysts with the problem of selecting the most appropriate method for the particular task at hand. This chapter presents an introduction to current multiple testing methodology, with the objective of clarifying the methodological issues involved and providing the reader with some basis on which to compare and select methods.
Affiliation(s)
- Anthony Almudevar
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, USA
8
Noma H, Matsui S. Empirical Bayes ranking and selection methods via semiparametric hierarchical mixture models in microarray studies. Stat Med 2012; 32:1904-16. [PMID: 23281021 DOI: 10.1002/sim.5718] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 08/30/2011] [Accepted: 12/06/2012] [Indexed: 11/07/2022]
Abstract
The main purpose of microarray studies is the screening of differentially expressed genes as candidates for further investigation. Because resources are limited at this stage, prioritizing genes is a relevant statistical task in microarray studies. For effective gene selection, parametric empirical Bayes methods for ranking and selecting genes with the largest effect sizes have been proposed (Noma et al., 2010; Biostatistics 11: 281-289). The hierarchical mixture model incorporates the differential and non-differential components and allows information borrowing across differential genes, separated from the nuisance, non-differential genes. In this article, we develop empirical Bayes ranking methods via a semiparametric hierarchical mixture model. A nonparametric prior distribution, rather than parametric prior distributions, for effect sizes is specified and estimated using the "smoothing by roughening" approach of Laird and Louis (1991; Computational Statistics and Data Analysis 12: 27-37). We present applications to childhood and infant leukemia clinical studies with microarrays for exploring genes related to prognosis or disease progression.
Affiliation(s)
- Hisashi Noma
- Department of Data Science, The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan
9
Hong Z, Lian H. BOPA: A Bayesian hierarchical model for outlier expression detection. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2012.05.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
10
Wang X, Chen M, Khodursky AB, Xiao G. Bayesian Joint Analysis of Gene Expression Data and Gene Functional Annotations. STATISTICS IN BIOSCIENCES 2012. [DOI: 10.1007/s12561-012-9065-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
11
12
Chen CH, Su WC, Chen CY, Huang JY, Tsai FY, Wang WC, Hsiung CA, Jeng KS, Chang IS. A Bayesian measurement error model for two-channel cell-based RNAi data with replicates. Ann Appl Stat 2012. [DOI: 10.1214/11-aoas496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
13
Jung K, Friede T, Beissbarth T. Reporting FDR analogous confidence intervals for the log fold change of differentially expressed genes. BMC Bioinformatics 2011; 12:288. [PMID: 21756370 PMCID: PMC3154206 DOI: 10.1186/1471-2105-12-288] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 02/24/2011] [Accepted: 07/15/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Gene expression experiments are common in molecular biology, for example to identify genes that play a certain role in a specified biological framework. For that purpose, expression levels of several thousand genes are measured simultaneously using DNA microarrays. To detect differentially expressed genes when comparing two distinct groups of tissue samples, one statistical test is performed per gene, and the resulting p-values are adjusted to control the false discovery rate. In addition, the expression change of each gene is quantified by some effect measure, typically the log fold change. In certain cases, however, a gene with a significant p-value can have a rather small fold change, while in other cases a non-significant gene can have a rather large fold change. The biological relevance of a change in gene expression can be judged more intuitively by a fold change than merely by a p-value. Therefore, confidence intervals for the log fold change that accompany the adjusted p-values are desirable.
RESULTS In a new approach, we employ an existing algorithm for adjusting confidence intervals in the case of high-dimensional data and apply it to a widely used linear model for microarray data. Furthermore, we adopt a concept of different relevance categories for effects in clinical trials to assess the biological relevance of genes in microarray experiments. In a brief simulation study, the properties of the adjusting algorithm are maintained when it is combined with the linear model for microarray data. In two cancer data sets, the adjusted confidence intervals can indicate significance of large fold changes and distinguish them from other large but non-significant fold changes. Adjusting the confidence intervals also corrects the assessment of biological relevance.
CONCLUSIONS Our new combination approach and the categorization of fold changes facilitate the selection of genes in microarray experiments and help to interpret their biological relevance.
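As a rough illustration of the two quantities this abstract contrasts, the sketch below computes a log2 fold change and adjusts a list of p-values for the false discovery rate. The abstract does not name the adjustment procedure; the Benjamini-Hochberg step-up method is assumed here as a common choice, and the function names and toy numbers are hypothetical. The sketch does not reproduce the authors' adjusted confidence intervals.

```python
import math


def log2_fold_change(mean_a, mean_b):
    """log2 ratio of mean expression in group A over group B (both must be > 0)."""
    return math.log2(mean_a / mean_b)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH is used only as a familiar FDR procedure; the paper may
    rely on a different adjustment.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with made-up numbers: four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = bh_adjust(raw_p)
lfc = log2_fold_change(8.0, 2.0)
```

A gene can then be judged on both its adjusted p-value and the size of its log fold change, which is the comparison the abstract argues for.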
Affiliation(s)
- Klaus Jung
- Department of Medical Statistics, University Medical Center Göttingen, D-37099 Göttingen, Germany
14
Arima S, Liseo B, Mariani F, Tardella L. Exploiting blank spots for model-based background correction in discovering genes with DNA array data. STAT MODEL 2011. [DOI: 10.1177/1471082x1001100201] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]
Abstract
Motivated by a real data set deriving from a study on the genetic determinants of the behavior of Mycobacterium tuberculosis (MTB) hosted in macrophage, we take advantage of the presence of control spots and illustrate modelling issues for background correction and the ensuing empirical findings resulting from a Bayesian hierarchical approach to the problem of detecting differentially expressed genes. We prove the usefulness of a fully integrated approach where background correction and normalization are embedded in a single model-based framework, creating a new tailored model to account for the peculiar features of DNA array data where null expressions are planned by design. We also advocate the use of an alternative normalization device resulting from a suitable reparameterization. The new model is validated by using both simulated and our MTB data. This work suggests that the presence of a substantial fraction of exact null expressions might be the effect of an imperfect background calibration and shows how this can be suitably re-calibrated with the information coming from control spots. The proposed idea can be extended to all experiments in which a subset of genes whose expression levels can be ascribed mainly to background noise is planned by design.
Affiliation(s)
- Serena Arima
- Dipartimento di metodi e modelli per l’economia, il territorio e la finanza, Sapienza Università di Roma, via del Castro Laurenziano 9, Roma, 00161, Italy
- Brunero Liseo
- Dipartimento di metodi e modelli per l’economia, il territorio e la finanza, Sapienza Università di Roma, Italy
- Luca Tardella
- Dipartimento di Statistica, Sapienza Università di Roma, Italy
15
Dhavala SS, Datta S, Mallick BK, Carroll RJ, Khare S, Lawhon SD, Adams LG. Bayesian Modeling of MPSS Data: Gene Expression Analysis of Bovine Salmonella Infection. J Am Stat Assoc 2010; 105:956-967. [PMID: 21165171 PMCID: PMC3002112 DOI: 10.1198/jasa.2010.ap08327] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/21/2022]
Abstract
Massively Parallel Signature Sequencing (MPSS) is a high-throughput counting-based technology available for gene expression profiling. It produces output that is similar to Serial Analysis of Gene Expression (SAGE) and is ideal for building complex relational databases for gene expression. Our goal is to compare the in vivo global gene expression profiles of tissues infected with different strains of Salmonella obtained using the MPSS technology. In this article, we develop an exact ANOVA-type model for these count data using a zero-inflated Poisson (ZIP) distribution, in contrast to existing methods that assume continuous densities. We adopt two Bayesian hierarchical models, one parametric and the other semiparametric with a Dirichlet process prior that has the ability to "borrow strength" across related signatures, where a signature is a specific arrangement of the nucleotides, usually 16-21 base pairs long. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using non-parametric approaches while controlling the false discovery rate. We identify several differentially expressed genes that have important biological significance and conclude with a summary of the biological discoveries.
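The zero-inflated Poisson distribution named in this abstract mixes a point mass at zero with an ordinary Poisson count. A minimal sketch of its probability mass function follows; the parameter names `lam` (Poisson mean) and `p_zero` (probability of a structural zero) are illustrative and not taken from the paper, and the hierarchical priors the authors describe are not modeled here.

```python
import math


def zip_pmf(k, lam, p_zero):
    """P(K = k) for a zero-inflated Poisson count.

    With probability p_zero the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise from both the point mass and the Poisson component.
        return p_zero + (1.0 - p_zero) * poisson
    return (1.0 - p_zero) * poisson


# Hypothetical parameters: 30% structural zeros, Poisson mean of 2 counts.
prob_zero = zip_pmf(0, lam=2.0, p_zero=0.3)
prob_three = zip_pmf(3, lam=2.0, p_zero=0.3)
```

Setting `p_zero` to 0 recovers the plain Poisson distribution, which makes the extra mass at zero easy to see by comparison.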
Affiliation(s)
- Soma S. Dhavala
- Department of Statistics, 3143 TAMU, Texas A & M University, College Station, TX 77843
- Sujay Datta
- Statistical Center for HIV/AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, M2-C125, 1100 Fairview Avenue N, Seattle, WA 98109
- Bani K. Mallick
- Department of Statistics, 3143 TAMU, Texas A & M University, College Station, TX 77843
- Raymond J. Carroll
- Department of Statistics, 3143 TAMU, Texas A & M University, College Station, TX 77843
- Sangeeta Khare
- Department of Veterinary Pathobiology, 4467 TAMU, Texas A & M University, College Station, TX 77843
- Sara D. Lawhon
- Department of Veterinary Pathobiology, 4467 TAMU, Texas A & M University, College Station, TX 77843
- L. Garry Adams
- Department of Veterinary Pathobiology, 4467 TAMU, Texas A & M University, College Station, TX 77843
16
Bayesian hierarchical model for estimating gene expression intensity using multiple scanned microarrays. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2010:231950. [PMID: 18464926 DOI: 10.1155/2008/231950] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/07/2007] [Accepted: 11/28/2007] [Indexed: 11/18/2022]
Abstract
We propose a method for improving the quality of signal from DNA microarrays by using several scans at varying scanner sensitivities. A Bayesian latent intensity model is introduced for the analysis of such data. The method improves the accuracy at which expressions can be measured in all ranges and extends the dynamic range of measured gene expression at the high end. Our method is generic and can be applied to data from any organism, for imaging with any scanner that allows varying the laser power, and for extraction with any image analysis software. Results from a self-self hybridization data set illustrate an improved precision in the estimation of the expression of genes compared to what can be achieved by applying standard methods and using only a single scan.
17
Gupta R, Greco D, Auvinen P, Arjas E. Bayesian integrated modeling of expression data: a case study on RhoG. BMC Bioinformatics 2010; 11:295. [PMID: 20515463 PMCID: PMC2894040 DOI: 10.1186/1471-2105-11-295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/04/2009] [Accepted: 06/01/2010] [Indexed: 11/02/2022]
Abstract
BACKGROUND DNA microarrays provide an efficient method for measuring the activity of genes in parallel, even covering all the known transcripts of an organism on a single array. This has to be balanced against the fact that analyzing data emerging from microarrays involves several consecutive steps, each of which is a potential source of errors. Because of this sequential nature, errors tend to accumulate when moving from the lower-level towards the higher-level analyses. Eliminating such errors does not seem feasible without completely changing the technologies, but one should nevertheless try to realistically assess the degree of uncertainty involved when drawing the final conclusions from such analyses.
RESULTS We present a Bayesian hierarchical model for finding differentially expressed genes between two experimental conditions, proposing an integrated statistical approach in which correcting signal saturation, systematic array effects, and dye effects, and finding differentially expressed genes, are all modeled jointly. The integration allows all these components, and also the associated errors, to be considered simultaneously. The inference is based on the full posterior distribution of gene expression indices and on quantities derived from them rather than on point estimates. The model was applied and tested on two different datasets.
CONCLUSIONS The method presents a way of integrating various steps of microarray analysis into a single joint analysis, and thereby enables extracting information on differential expression in a manner that properly accounts for various sources of potential error in the process.
Affiliation(s)
- Rashi Gupta
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
18
Crager MR. Gene identification using true discovery rate degree of association sets and estimates corrected for regression to the mean. Stat Med 2010; 29:33-45. [PMID: 19960511 DOI: 10.1002/sim.3789] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
Abstract
Analyses intended to identify genes with expression that is associated with some clinical outcome or state are often based on ranked p-values from tests of point null hypotheses of no association. Van de Wiel and Kim take the innovative approach of testing the interval null hypotheses that the degree of association for a gene is less than some value of interest against the alternative that it is greater. Combining this idea with the false discovery rate controlling methods of Storey, Taylor and Siegmund gives a computationally simple way to identify true discovery rate degree of association (TDRDA) sets of genes among which a specified proportion are expected to have an absolute association of a specified degree or more. This leads to a gene ranking method that uses the maximum lower bound degree of association for which each gene belongs to a TDRDA set. Estimates of each gene's actual degree of association with approximate correction for 'selection bias' due to regression to the mean (RM) can be derived using simple bivariate normal theory and Efron and Tibshirani's empirical Bayes approach. For a given data set, all possible TDRDA sets can be displayed along with the gene ranking and the RM-corrected estimates of degree of association in a concise graphical summary.
Affiliation(s)
- Michael R Crager
- Department of Biostatistics, Genomic Health, Inc., Redwood City, CA 94063, USA
19
Yanofsky CM, Bickel DR. Validation of differential gene expression algorithms: application comparing fold-change estimation to hypothesis testing. BMC Bioinformatics 2010; 11:63. [PMID: 20109217 PMCID: PMC3224549 DOI: 10.1186/1471-2105-11-63] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 02/13/2009] [Accepted: 01/28/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists has been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but as it requires examining gene lists of given sizes, it may be unstable.
RESULTS Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. Both comparison methodologies are applicable to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria. (The latter two approaches have comparable performance.) The posterior predictive assessment corroborates these findings.
CONCLUSIONS Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups. According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.
Affiliation(s)
- Corey M Yanofsky
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology, and Immunology, University of Ottawa, Ottawa, Ontario, Canada
20
Crespi CM, Boscardin WJ. Bayesian Model Checking for Multivariate Outcome Data. Comput Stat Data Anal 2009; 53:3765-3772. [PMID: 20204167 DOI: 10.1016/j.csda.2009.03.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/24/2022]
Abstract
Bayesian models are increasingly used to analyze complex multivariate outcome data. However, diagnostics for such models have not been well-developed. We present a diagnostic method of evaluating the fit of Bayesian models for multivariate data based on posterior predictive model checking (PPMC), a technique in which observed data are compared to replicated data generated from model predictions. Most previous work on PPMC has focused on the use of test quantities that are scalar summaries of the data and parameters. However, scalar summaries are unlikely to capture the rich features of multivariate data. We introduce the use of dissimilarity measures for checking Bayesian models for multivariate outcome data. This method has the advantage of checking the fit of the model to the complete data vectors or vector summaries with reduced dimension, providing a comprehensive picture of model fit. An application with longitudinal binary data illustrates the methods.
Affiliation(s)
- Catherine M Crespi
- Department of Biostatistics, University of California, Los Angeles, CA, USA
|
21
|
Chen LS, Lin CY. Bayesian P-Values for Testing Independence in 2 × 2 Contingency Tables. COMMUN STAT-THEOR M 2009. [DOI: 10.1080/03610920802513221] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
22
|
Bradley AJ, Green MJ. Factors affecting cure when treating bovine clinical mastitis with cephalosporin-based intramammary preparations. J Dairy Sci 2009; 92:1941-53. [PMID: 19389951 DOI: 10.3168/jds.2008-1497] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Data were collated for an independent scientific analysis from 2 international, multicenter studies that had compared the efficacy of 3 different cephalosporin-containing intramammary preparations in the treatment of clinical mastitis in dairy cattle [cefalexin (first generation) in combination with kanamycin; cefquinome (fourth generation); and cefoperazone (third generation)]. Quarters were assessed using standard bacteriological techniques before treatment and at 16 and 25 d posttreatment. Additional data were also available on individual cows and study farms, including parity, breed, and cow somatic cell count histories, herd bulk milk somatic cell counts, and farm management regimens. Sufficient data for analysis were available from a total of 491 cases on 192 farms in 3 countries (United Kingdom, France, and Germany) with up to 16 cases being recruited from any one farm. Clinical cases were of diverse etiology, representing both contagious and environmental pathogens. Univariable analysis demonstrated that quarters in the cefalexin + kanamycin and cefquinome treatment groups were not significantly different from each other, but were both significantly more likely to be pathogen free posttreatment than quarters in the cefoperazone group. Multivariable analysis was undertaken using conventional random effects models. Two models were built, with the first incorporating only information available to the practitioner at the time of treatment and the second including all information collected during the study. 
These models indicated that country, pretreatment rectal temperature (above-normal temperature associated with an increased chance of being pathogen free posttreatment), individual cow somatic cell count (increased somatic cell count associated with a decreased chance of being pathogen free posttreatment), and pathogen (Staphylococcus aureus isolation associated with a decreased chance of being pathogen free posttreatment) were useful predictors of pathogen free status; parity, yield, bulk milk somatic cell counts, and other farm management factors were not. The importance of country in the analysis demonstrates the need to generate local data when assessing treatment regimens. In addition, these results suggest that the factors important in predicting the outcome of treatment of clinical mastitis cases may be dissimilar to those reported to affect the likelihood of cure when treating subclinical intramammary infections.
Affiliation(s)
- A J Bradley
- University of Bristol, Division of Farm Animal Science, School of Veterinary Science, Langford House, Langford, Bristol, BS40 5DU, United Kingdom.
|
23
|
Green MJ, Medley GF, Browne WJ. Use of posterior predictive assessments to evaluate model fit in multilevel logistic regression. Vet Res 2009; 40:30. [PMID: 19323968 PMCID: PMC2675184 DOI: 10.1051/vetres/2009013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2008] [Accepted: 03/24/2009] [Indexed: 11/15/2022] Open
Abstract
Assessing the fit of a model is an important final step in any statistical analysis, but this is not straightforward when complex discrete response models are used. Cross validation and posterior predictions have been suggested as methods to aid model criticism. In this paper a comparison is made between four methods of model predictive assessment in the context of a three level logistic regression model for clinical mastitis in dairy cattle; cross validation, a prediction using the full posterior predictive distribution and two “mixed” predictive methods that incorporate higher level random effects simulated from the underlying model distribution. Cross validation is considered a gold standard method but is computationally intensive and thus a comparison is made between posterior predictive assessments and cross validation. The analyses revealed that mixed prediction methods produced results close to cross validation whilst the full posterior predictive assessment gave predictions that were over-optimistic (closer to the observed disease rates) compared with cross validation. A mixed prediction method that simulated random effects from both higher levels was best at identifying the outlying level two (farm-year) units of interest. It is concluded that this mixed prediction method, simulating random effects from both higher levels, is straightforward and may be of value in model criticism of multilevel logistic regression, a technique commonly used for animal health data with a hierarchical structure.
Affiliation(s)
- Martin J Green
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, United Kingdom - School of Mathematical Sciences, University of Nottingham, Nottingham, United Kingdom.
|
24
|
Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet 2009; 10:122-33. [PMID: 19139763 DOI: 10.1038/nrg2509] [Citation(s) in RCA: 298] [Impact Index Per Article: 19.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Two related developments are currently changing traditional approaches to computational systems biology modelling. First, stochastic models are being used increasingly in preference to deterministic models to describe biochemical network dynamics at the single-cell level. Second, sophisticated statistical methods and algorithms are being used to fit both deterministic and stochastic models to time course and other experimental data. Both frameworks are needed to adequately describe observed noise, variability and heterogeneity of biological systems over a range of scales of biological organization.
|
25
|
Marot G, Foulley JL, Jaffrézic F. A structural mixed model to shrink covariance matrices for time-course differential gene expression studies. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.04.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
26
|
Bhattacharjee M, Botting C, Sillanpää M. Bayesian biomarker identification based on marker-expression proteomics data. Genomics 2008; 92:384-92. [DOI: 10.1016/j.ygeno.2008.06.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2008] [Revised: 06/09/2008] [Accepted: 06/11/2008] [Indexed: 11/29/2022]
|
27
|
Blangiardo M, Richardson S. A Bayesian calibration model for combining different pre-processing methods in Affymetrix chips. BMC Bioinformatics 2008; 9:512. [PMID: 19046434 PMCID: PMC2639433 DOI: 10.1186/1471-2105-9-512] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2008] [Accepted: 12/01/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In gene expression studies a key role is played by the so called "pre-processing", a series of steps designed to extract the signal and account for the sources of variability due to the technology used rather than to biological differences between the RNA samples. At the moment there is no commonly agreed gold standard pre-processing method and each researcher has the responsibility to choose one method, incurring the risk of false positive and false negative features arising from the particular method chosen. RESULTS We propose a Bayesian calibration model that makes use of the information provided by several pre-processing methods and we show that this model gives a better assessment of the 'true' unknown differential expression between two conditions. We demonstrate how to estimate the posterior distribution of the differential expression values of interest from the combined information. CONCLUSION On simulated data and on the spike-in Latin Square dataset from Affymetrix the Bayesian calibration model proves to have more power than each pre-processing method. Its biological interest is demonstrated through an experimental example on publicly available data.
Affiliation(s)
- Marta Blangiardo
- Centre for Biostatistics, Imperial College, St Mary's Campus, Norfolk Place, London, UK.
|
28
|
Sarholz B, Piepho HP. Variance component estimation for mixed model analysis of cDNA microarray data. Biom J 2008; 50:927-39. [PMID: 19035549 DOI: 10.1002/bimj.200810476] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Microarrays provide a valuable tool for the quantification of gene expression. Usually, however, there is a limited number of replicates leading to unsatisfying variance estimates in a gene-wise mixed model analysis. As thousands of genes are available, it is desirable to combine information across genes. When more than two tissue types or treatments are to be compared it might be advisable to consider the array effect as random. Then information between arrays may be recovered, which can increase accuracy in estimation. We propose a method of variance component estimation across genes for a linear mixed model with two random effects. The method may be extended to models with more than two random effects. We assume that the variance components follow a log-normal distribution. Assuming that the sums of squares from the gene-wise analysis, given the true variance components, follow a scaled $\chi^2$-distribution, we adopt an empirical Bayes approach. The variance components are estimated by the expectation of their posterior distribution. The new method is evaluated in a simulation study. Differentially expressed genes are more likely to be detected by tests based on these variance estimates than by tests based on gene-wise variance estimates. This effect is most visible in studies with small array numbers. Analyzing a real data set on maize endosperm the method is shown to work well.
Affiliation(s)
- Barbara Sarholz
- General Motors Powertrain Germany GmbH, Rüsselsheim, Germany
|
29
|
Kadota K, Nakai Y, Shimizu K. A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol 2008; 3:8. [PMID: 18578891 PMCID: PMC2464587 DOI: 10.1186/1748-7188-3-8] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2007] [Accepted: 06/26/2008] [Indexed: 01/05/2023] Open
Abstract
Background Identification of differentially expressed genes (DEGs) under different experimental conditions is an important task in many microarray studies. However, choosing which method to use for a particular application is problematic because its performance depends on the evaluation metric, the dataset, and so on. In addition, when using the Affymetrix GeneChip® system, researchers must select a preprocessing algorithm from a number of competing algorithms such as MAS, RMA, and DFW, for obtaining expression-level measurements. To achieve optimal performance for detecting DEGs, a suitable combination of gene selection method and preprocessing algorithm needs to be selected for a given probe-level dataset. Results We introduce a new fold-change (FC)-based method, the weighted average difference method (WAD), for ranking DEGs. It uses the average difference and relative average signal intensity so that highly expressed genes are highly ranked on the average for the different conditions. The idea is based on our observation that known or potential marker genes (or proteins) tend to have high expression levels. We compared WAD with seven other methods; average difference (AD), FC, rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT). The evaluation was performed using a total of 38 different binary (two-class) probe-level datasets: two artificial "spike-in" datasets and 36 real experimental datasets. The results indicate that WAD outperforms the other methods when sensitivity and specificity are considered simultaneously: the area under the receiver operating characteristic curve for WAD was the highest on average for the 38 datasets. The gene ranking for WAD was also the most consistent when subsets of top-ranked genes produced from three different preprocessed data (MAS, RMA, and DFW) were compared. 
Overall, WAD performed the best for MAS-preprocessed data and the FC-based methods (AD, WAD, FC, or RP) performed well for RMA and DFW-preprocessed data. Conclusion WAD is a promising alternative to existing methods for ranking DEGs with two classes. Its high performance should increase researchers' confidence in microarray analyses.
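A minimal sketch of a weighted average difference score, assuming the "relative average signal intensity" is a min-max scaled mean of the two condition averages on the log scale. That weighting, the function name `wad_scores`, and the input layout are assumptions made for illustration; the paper itself should be consulted for the exact definition.

```python
def wad_scores(group_a, group_b):
    """Return one WAD-style score per gene.

    group_a, group_b: lists with one inner list of log-scale signals per gene,
    in the same gene order for both conditions.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    avg_a = [mean(g) for g in group_a]
    avg_b = [mean(g) for g in group_b]
    # Average difference (AD) between the two conditions.
    diff = [b - a for a, b in zip(avg_a, avg_b)]
    # Average signal level of each gene across both conditions.
    level = [(a + b) / 2 for a, b in zip(avg_a, avg_b)]
    lo, hi = min(level), max(level)
    span = hi - lo
    # Assumed weight: signal level rescaled to [0, 1]; all 1.0 if levels are equal.
    weight = [(x - lo) / span if span > 0 else 1.0 for x in level]
    return [d * w for d, w in zip(diff, weight)]

# Two genes with the same difference of 2 but very different expression levels,
# plus one unchanged gene.
scores = wad_scores(
    [[2, 2], [10, 10], [6, 6]],
    [[4, 4], [12, 12], [6, 6]],
)
print(scores)  # [0.0, 2.0, 0.0]
```

Genes would then be ranked by the absolute value of the score, so that of two genes with equal average difference, the more highly expressed one ranks higher.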
|
30
|
Zhao H, Chan KL, Cheng LM, Yan H. Multivariate hierarchical Bayesian model for differential gene expression analysis in microarray experiments. BMC Bioinformatics 2008; 9 Suppl 1:S9. [PMID: 18315862 PMCID: PMC2259410 DOI: 10.1186/1471-2105-9-s1-s9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Identification of differentially expressed genes is a typical objective when analyzing gene expression data. Recently, Bayesian hierarchical models have become increasingly popular to solve this type of problem. These models show good performance in accommodating noise, variability and low replication of microarray data. However, the correlation between different fluorescent signals measured from a gene spot is ignored, which can adversely affect the data analysis step. In fact, the intensities of the two signals are significantly correlated across samples. The larger the log-transformed intensities are, the smaller the correlation is. Results Motivated by the complicated error relations in microarray data, we propose a multivariate hierarchical Bayesian framework for data analysis in the replicated microarray experiments. Gene expression data are modelled by a multivariate normal distribution, parameterized by the corresponding mean vectors and covariance matrices with a conjugate prior distribution. Within the Bayesian framework, a generalized likelihood ratio test (GLRT) is also developed to infer the gene expression patterns. Simulation studies show that the proposed approach presents better operating characteristics and lower false discovery rate (FDR) than existing methods, especially when the correlation coefficient is large. The approach is illustrated with two examples of microarray analysis. The proposed method successfully detects significant genes closely related to the experimental states, which are verified by the biological information. Conclusions The multivariate Bayesian model, compatible with the dependence between mean and variance in the univariate Bayesian model, relaxes the constant coefficient of variation assumption between measurements by adding a covariance structure. This model improves the identification of differentially expressed genes significantly since the Bayesian model fits well with the microarray data.
Affiliation(s)
- Hongya Zhao
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
|
31
|
|
32
|
Bochkina N, Richardson S. Tail posterior probability for inference in pairwise and multiclass gene expression data. Biometrics 2008; 63:1117-25. [PMID: 18078482 DOI: 10.1111/j.1541-0420.2007.00807.x] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
We consider the problem of identifying differentially expressed genes in microarray data in a Bayesian framework with a noninformative prior distribution on the parameter quantifying differential expression. We introduce a new rule, tail posterior probability, based on the posterior distribution of the standardized difference, to identify genes differentially expressed between two conditions, and we derive a frequentist estimator of the false discovery rate associated with this rule. We compare it to other Bayesian rules in the considered settings. We show how the tail posterior probability can be extended to testing a compound null hypothesis against a class of specific alternatives in multiclass data.
Affiliation(s)
- N Bochkina
- Centre for Biostatistics, Imperial College, London W2 1PG, UK.
|
33
|
Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics 2007; 24:404-11. [DOI: 10.1093/bioinformatics/btm612] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
34
|
Åstrand M, Mostad P, Rudemo M. Improved Covariance Matrix Estimators for Weighted Analysis of Microarray Data. J Comput Biol 2007; 14:1353-67. [DOI: 10.1089/cmb.2007.0078] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Magnus Åstrand
- Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University, Göteborg, Sweden
| | - Petter Mostad
- Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University, Göteborg, Sweden
| | - Mats Rudemo
- Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University, Göteborg, Sweden
|
35
|
Blangiardo M, Richardson S. Statistical tools for synthesizing lists of differentially expressed features in related experiments. Genome Biol 2007; 8:R54. [PMID: 17428330 PMCID: PMC1896017 DOI: 10.1186/gb-2007-8-4-r54] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2006] [Revised: 11/13/2006] [Accepted: 04/11/2007] [Indexed: 11/10/2022] Open
Abstract
We propose a novel approach for finding a list of features that are commonly perturbed in two or more experiments, quantifying the evidence of dependence between the experiments by a ratio. We present a Bayesian analysis of this ratio, which leads us to suggest two rules for choosing a cut-off on the ranked list of p values. We evaluate and compare the performance of these statistical tools in a simulation study, and show their usefulness on two real datasets.
Affiliation(s)
- Marta Blangiardo
- Centre for Biostatistics, Imperial College, St Mary's Campus, Norfolk Place, London W2 1PG, UK
| | - Sylvia Richardson
- Centre for Biostatistics, Imperial College, St Mary's Campus, Norfolk Place, London W2 1PG, UK
|
36
|
Gottardo R, Li W, Johnson WE, Liu XS. A flexible and powerful bayesian hierarchical model for ChIP-Chip experiments. Biometrics 2007; 64:468-78. [PMID: 17888037 DOI: 10.1111/j.1541-0420.2007.00899.x] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]
Abstract
Chromatin-immunoprecipitation microarrays (ChIP-chip) that enable researchers to identify regions of a given genome that are bound by specific DNA-binding proteins present new challenges for statistical analysis due to the large number of probes, the high noise-to-signal ratio, and the spatial dependence between probes. We propose a method called BAC (Bayesian analysis of ChIP-chip) to detect transcription factor bound regions, which incorporates the dependence between probes while making few assumptions about the bound regions (e.g., length). BAC is robust to probe outliers with an exchangeable prior for the variances, which allows different variances for the probes but still shrinks extreme empirical variances. Parameter estimation is carried out using Markov chain Monte Carlo and inference is based on the joint distribution of the parameters. Bound regions are detected using posterior probabilities computed from the joint posterior distribution of neighboring probes. We show that these posterior probabilities are well calibrated and can be used to obtain an estimate of the false discovery rate. The method is illustrated using two publicly available ChIP-chip data sets containing 18 experimentally validated regions. We compare our method to four other baseline and commonly used techniques, namely, the Wilcoxon rank sum test, TileMap, HGMM, and MAT. We found BAC and HGMM to perform best at detecting validated regions. However, HGMM appears to be very sensitive to probe outliers compared to BAC. In addition, we present a simulation study, which shows that BAC is more powerful than the other four techniques under various simulation scenarios while being robust to model misspecification.
Affiliation(s)
- Raphael Gottardo
- Department of Statistics, University of British Columbia, Vancouver, Canada.
|
37
|
Manda SOM, Walls RE, Gilthorpe MS. A full Bayesian hierarchical mixture model for the variance of gene differential expression. BMC Bioinformatics 2007; 8:124. [PMID: 17439644 PMCID: PMC1876253 DOI: 10.1186/1471-2105-8-124] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2006] [Accepted: 04/17/2007] [Indexed: 12/04/2022] Open
Abstract
Background In many laboratory-based high throughput microarray experiments, there are very few replicates of gene expression levels. Thus, estimates of gene variances are inaccurate. Visual inspection of graphical summaries of these data usually reveals that heteroscedasticity is present, and the standard approach to address this is to take a log2 transformation. In such circumstances, it is then common to assume that gene variability is constant when an analysis of these data is undertaken. However, this is perhaps too stringent an assumption. More careful inspection reveals that the simple log2 transformation does not remove the problem of heteroscedasticity. An alternative strategy is to assume independent gene-specific variances; although again this is problematic as variance estimates based on few replications are highly unstable. More meaningful and reliable comparisons of gene expression might be achieved, for different conditions or different tissue samples, where the test statistics are based on accurate estimates of gene variability; a crucial step in the identification of differentially expressed genes. Results We propose a Bayesian mixture model, which classifies genes according to similarity in their variance. The result is that genes in the same latent class share a similar variance, estimated from a larger number of replicates than purely those per gene, i.e. the total of all replicates of all genes in the same latent class. An example dataset, consisting of 9216 genes with four replicates per condition, resulted in four latent classes based on their similarity of the variance. Conclusion The mixture variance model provides a realistic and flexible estimate for the variance of gene expression data under limited replicates. We believe that in using the latent class variances, estimated from a larger number of genes in each derived latent group, the p-values obtained are more robust than either using a constant gene or gene-specific variance estimate.
Affiliation(s)
- Samuel OM Manda
- Biostatistics Unit, Centre for Epidemiology and Biostatistics, Leeds, LS2 9LN, UK
| | - Rebecca E Walls
- Biostatistics Unit, Centre for Epidemiology and Biostatistics, Leeds, LS2 9LN, UK
- Department of Statistics, University of Leeds, Leeds, UK
| | - Mark S Gilthorpe
- Biostatistics Unit, Centre for Epidemiology and Biostatistics, Leeds, LS2 9LN, UK
|
38
|
Nott DJ, Yu Z, Chan E, Cotsapas C, Cowley MJ, Pulvers J, Williams R, Little P. Hierarchical Bayes variable selection and microarray experiments. J MULTIVARIATE ANAL 2007. [DOI: 10.1016/j.jmva.2006.10.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
39
|
Abstract
Given a set of microarray data, the problem is to detect differentially expressed genes, using a false discovery rate (FDR) criterion. As opposed to common procedures in the literature, we do not base the selection criterion on statistical significance only, but also on the effect size. Therefore, we select only those genes that are significantly more differentially expressed than some f-fold (e.g., f = 2). This corresponds to use of an interval null domain for the effect size. Based on a simple error model, we discuss a naive estimator for the FDR, interpreted as the probability that the parameter of interest lies in the null-domain (e.g., $\mu < \log_2(2) = 1$) given that the test statistic exceeds a threshold. We improve the naive estimator by using deconvolution. That is, the density of the parameter of interest is recovered from the data. We study performance of the methods using simulations and real data.
Affiliation(s)
- Mark A van de Wiel
- Department of Mathematics, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands.
|
40
|
Lewin A, Grieve IC. Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data. BMC Bioinformatics 2006; 7:426. [PMID: 17018143 PMCID: PMC1622761 DOI: 10.1186/1471-2105-7-426] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2006] [Accepted: 10/03/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene Ontology (GO) terms are often used to assess the results of microarray experiments. The most common way to do this is to perform Fisher's exact tests to find GO terms which are over-represented amongst the genes declared to be differentially expressed in the analysis of the microarray experiment. However, due to the high degree of dependence between GO terms, statistical testing is conservative, and interpretation is difficult. RESULTS We propose testing groups of GO terms rather than individual terms, to increase statistical power, reduce dependence between tests and improve the interpretation of results. We use the publicly available package POSOC to group the terms. Our method finds groups of GO terms significantly over-represented amongst differentially expressed genes which are not found by Fisher's tests on individual GO terms. CONCLUSION Grouping Gene Ontology terms improves the interpretation of gene set enrichment for microarray data.
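The per-term Fisher's exact test mentioned above, in its one-sided over-representation form, reduces to a hypergeometric tail probability. The sketch below uses only the standard library; the function name and argument names are hypothetical, and it does not reproduce the grouping step performed with POSOC.

```python
from math import comb

def fisher_enrichment_pvalue(k, n_de, n_term, n_total):
    """One-sided Fisher's exact test p-value for over-representation.

    k       : differentially expressed (DE) genes annotated to the GO term
    n_de    : number of DE genes
    n_term  : number of genes annotated to the GO term
    n_total : number of genes on the array
    Returns P(X >= k) where X is hypergeometric.
    """
    upper = min(n_de, n_term)
    denom = comb(n_total, n_de)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible table configurations contribute nothing.
    tail = sum(
        comb(n_term, x) * comb(n_total - n_term, n_de - x)
        for x in range(k, upper + 1)
    )
    return tail / denom

# 10 genes in total, 3 annotated to the term, 4 DE genes, all 3 annotated genes are DE.
print(fisher_enrichment_pvalue(3, 4, 3, 10))  # 7/210, about 0.0333
```

Because many GO terms share genes, p-values computed this way for individual terms are strongly dependent, which is the difficulty the abstract raises.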
Affiliation(s)
- Alex Lewin
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK
| | - Ian C Grieve
- MRC Clinical Sciences Centre, Imperial College, Hammersmith Hospital, London W12 0NN, UK
|
41
|
Oba S, Ishii S. Semi-supervised discovery of differential genes. BMC Bioinformatics 2006; 7:414. [PMID: 16981994 PMCID: PMC1584253 DOI: 10.1186/1471-2105-7-414] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2006] [Accepted: 09/18/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies to detect clinical marker genes for example, the conditions have not necessarily been controlled well, thus condition labels are sometimes hard to obtain due to physical, financial, and time costs. In such a situation, we can consider an unsupervised case where labels are not available or a semi-supervised case where labels are available for a part of the whole sample set, rather than a well-studied supervised case where all samples have their labels. RESULTS We assume a latent variable model for the expression of active genes and apply the optimal discovery procedure (ODP) proposed by Storey (2005) to the model. Our latent variable model allows gene significance scores to be applied to unsupervised and semi-supervised cases. The ODP framework improves detectability by sharing the estimated parameters of null and alternative models of multiple tests over multiple genes. A theoretical consideration leads to two different interpretations of the latent variable, i.e., it only implicitly affects the alternative model through the model parameters, or it is explicitly included in the alternative model, so that the interpretations correspond to two different implementations of ODP. By comparing the two implementations through experiments with simulation data, we have found that sharing the latent variable estimation is effective for increasing the detectability of truly active genes. We also show that the unsupervised and semi-supervised rating of genes, which takes into account the samples without condition labels, can improve detection of active genes in real gene discovery problems. 
CONCLUSION The experimental results indicate that the ODP framework is effective for hypotheses including latent variables and is further improved by sharing the estimations of hidden variables over multiple tests.
Affiliation(s)
- Shigeyuki Oba
- Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
| | - Shin Ishii
- Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
|