1
|
Sun X, Fu Y. Local false discovery rate estimation with competition-based procedures for variable selection. Stat Med 2024; 43:61-88. [PMID: 37927105 DOI: 10.1002/sim.9942] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 08/23/2023] [Accepted: 09/29/2023] [Indexed: 11/07/2023]
Abstract
Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of accessing the confidence of individual hypotheses. However, most methods estimate fdr through P-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of P-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
Affiliation(s)
- Xiaoya Sun
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yan Fu
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
|
2
|
Denti F, Peluso S, Guindani M, Mira A. Multiple hypothesis screening using mixtures of non-local distributions with applications to genomic studies. Stat Med 2023; 42:1931-1945. [PMID: 36914221 DOI: 10.1002/sim.9705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 10/02/2022] [Accepted: 02/24/2023] [Indexed: 03/15/2023]
Abstract
The analysis of large-scale datasets, especially in biomedical contexts, frequently involves a principled screening of multiple hypotheses. The celebrated two-group model jointly models the distribution of the test statistics with mixtures of two competing densities, the null and the alternative distributions. We investigate the use of weighted densities and, in particular, non-local densities as working alternative distributions, to enforce separation from the null and thus refine the screening procedure. We show how these weighted alternatives improve various operating characteristics, such as the Bayesian false discovery rate, of the resulting tests for a fixed mixture proportion with respect to a local, unweighted likelihood approach. Parametric and nonparametric model specifications are proposed, along with efficient samplers for posterior inference. By means of a simulation study, we exhibit how our model compares with both well-established and state-of-the-art alternatives in terms of various operating characteristics. Finally, to illustrate the versatility of our method, we conduct three differential expression analyses with publicly-available datasets from genomic studies of heterogeneous nature.
Affiliation(s)
- Francesco Denti
  - Department of Statistics, Università Cattolica del Sacro Cuore, Milan, Italy
- Stefano Peluso
  - Department of Statistics and Quantitative Methods, University of Milan - Bicocca, Milan, Italy
- Michele Guindani
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, USA
- Antonietta Mira
  - Faculty of Economics, Università della Svizzera italiana, Lugano, Switzerland
  - Department of Science and High Technology, University of Insubria, Como, Italy
|
3
|
Franzolini B, Lijoi A, Prünster I. Model selection for maternal hypertensive disorders with symmetric hierarchical Dirichlet processes. Ann Appl Stat 2023. [DOI: 10.1214/22-aoas1628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Affiliation(s)
- Beatrice Franzolini
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR)
|
4
|
Gould AL, Baumgartner R, Zhao A. Bayesian screening for feature selection. J Biopharm Stat 2022; 32:832-857. [PMID: 35736220 DOI: 10.1080/10543406.2022.2033760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Biomedical applications such as genome-wide association studies screen large databases with high-dimensional features to identify rare, weakly expressed, and important continuous-valued features for subsequent detailed analysis. We describe an exact, rapid Bayesian screening approach with attractive diagnostic properties using a Gaussian random mixture model focusing on the missed discovery rate (the probability of failing to identify potentially informative features) rather than the false discovery rate ordinarily used with multiple hypothesis testing. The method provides the likelihood that a feature merits further investigation, as well as distributions of the effect magnitudes and the proportion of features with the same expected responses under alternative conditions. Important features include the dependence of the critical values on clinical and regulatory priorities and direct assessment of the diagnostic properties.
Affiliation(s)
- A Lawrence Gould
  - Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
- Richard Baumgartner
  - Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
- Amanda Zhao
  - Biostatistics and Research Decision Sciences, Merck & Co Inc, Kenilworth, New Jersey, USA
|
5
|
Jin Z, Kang J, Yu T. Feature selection and classification over the network with missing node observations. Stat Med 2022; 41:1242-1262. [PMID: 34816464 PMCID: PMC9773124 DOI: 10.1002/sim.9267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Revised: 09/14/2021] [Accepted: 10/29/2021] [Indexed: 12/25/2022]
Abstract
Jointly analyzing transcriptomic data and the existing biological networks can yield more robust and informative feature selection results, as well as better understanding of the biological mechanisms. Selecting and classifying node features over genome-scale networks has become increasingly important in genomic biology and genomic medicine. Existing methods have some critical drawbacks. The first is they do not allow flexible modeling of different subtypes of selected nodes. The second is they ignore nodes with missing values, very likely to increase bias in estimation. To address these limitations, we propose a general modeling framework for Bayesian node classification (BNC) with missing values. A new prior model is developed for the class indicators incorporating the network structure. For posterior computation, we resort to the Swendsen-Wang algorithm for efficiently updating class indicators. BNC can naturally handle missing values in the Bayesian modeling framework, which improves the node classification accuracy and reduces the bias in estimating gene effects. We demonstrate the advantages of our methods via extensive simulation studies and the analysis of the cutaneous melanoma dataset from The Cancer Genome Atlas.
Affiliation(s)
- Jian Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Tianwei Yu
  - School of Data Science and Warshel Institute, The Chinese University of Hong Kong - Shenzhen, and Shenzhen Research Institute of Big Data, Shenzhen, China
|
6
|
Bi R, Liu P. A semi-parametric Bayesian approach for detection of gene expression heterosis with RNA-seq data. J Appl Stat 2021; 50:214-230. [DOI: 10.1080/02664763.2021.2004581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Ran Bi
  - Department of Statistics, Iowa State University, Ames, IA, USA
- Peng Liu
  - Department of Statistics, Iowa State University, Ames, IA, USA
|
7
|
Barrientos AF, Canale A. A Bayesian goodness-of-fit test for regression. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
8
|
Vo T, Mishra A, Ithapu V, Singh V, Newton MA. Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data. Biostatistics 2021; 23:860-874. [PMID: 33616173 PMCID: PMC9295049 DOI: 10.1093/biostatistics/kxab001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2019] [Revised: 01/06/2021] [Accepted: 01/08/2021] [Indexed: 11/15/2022] [Open Access]
Abstract
For large-scale testing with graph-associated data, we present an empirical Bayes mixture technique to score local false-discovery rates (FDRs). Compared to procedures that ignore the graph, the proposed Graph-based Mixture Model (GraphMM) method gains power in settings where non-null cases form connected subgraphs, and it does so by regularizing parameter contrasts between testing units. Simulations show that GraphMM controls the FDR in a variety of settings, though it may lose control with excessive regularization. On magnetic resonance imaging data from a study of brain changes associated with the onset of Alzheimer’s disease, GraphMM produces greater yield than conventional large-scale testing procedures.
Affiliation(s)
- Tien Vo
  - Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, 610 Walnut Street, Madison, WI, USA
- Akshay Mishra
  - Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, 610 Walnut Street, Madison, WI, USA
- Vamsi Ithapu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, 610 Walnut Street, Madison, WI, USA
- Vikas Singh
  - Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, 610 Walnut Street, Madison, WI, USA
- Michael A Newton
  - Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, 610 Walnut Street, Madison, WI, USA
|
9
|
Identifying atypically expressed chromosome regions using RNA-Seq data. STAT METHOD APPL-GER 2020. [DOI: 10.1007/s10260-019-00496-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
10
|
Denti F, Guindani M, Leisen F, Lijoi A, Wadsworth WD, Vannucci M. Two-group Poisson-Dirichlet mixtures for multiple testing. Biometrics 2020; 77:622-633. [PMID: 32535900 DOI: 10.1111/biom.13314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2019] [Revised: 05/21/2020] [Accepted: 05/22/2020] [Indexed: 11/26/2022]
Abstract
The simultaneous testing of multiple hypotheses is common to the analysis of high-dimensional data sets. The two-group model, first proposed by Efron, identifies significant comparisons by allocating observations to a mixture of an empirical null and an alternative distribution. In the Bayesian nonparametrics literature, many approaches have suggested using mixtures of Dirichlet Processes in the two-group model framework. Here, we investigate employing mixtures of two-parameter Poisson-Dirichlet Processes instead, and show how they provide a more flexible and effective tool for large-scale hypothesis testing. Our model further employs nonlocal prior densities to allow separation between the two mixture components. We obtain a closed-form expression for the exchangeable partition probability function of the two-group model, which leads to a straightforward Markov Chain Monte Carlo implementation. We compare the performance of our method for large-scale inference in a simulation study and illustrate its use on both a prostate cancer data set and a case-control microbiome study of the gastrointestinal tracts in children from underdeveloped countries who have been recently diagnosed with moderate-to-severe diarrhea.
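As a reading aid for the two-group model mentioned in this abstract, the sketch below computes the local false discovery rate, fdr(z) = π0 f0(z) / f(z), for a mixture f(z) = π0 f0(z) + (1 − π0) f1(z). It is a minimal illustration only, not the method of the cited paper: the Gaussian null N(0, 1), the Gaussian alternative N(3, 1), the value π0 = 0.9, and the names `normal_pdf` and `local_fdr` are all assumptions chosen for the example.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, null=(0.0, 1.0), alt=(3.0, 1.0)):
    """Local fdr under a two-group model with known Gaussian components.

    pi0  : assumed prior probability that a test is null
    null : (mean, sd) of the assumed null density f0
    alt  : (mean, sd) of the assumed alternative density f1
    """
    f0 = normal_pdf(z, *null)
    f1 = normal_pdf(z, *alt)
    mixture = pi0 * f0 + (1.0 - pi0) * f1
    return pi0 * f0 / mixture


# Hypothetical test statistics; larger values look less like the null.
scores = [0.0, 1.5, 3.0, 4.5]
fdrs = [local_fdr(z, pi0=0.9) for z in scores]

for z, value in zip(scores, fdrs):
    print(f"z = {z:4.1f}  local fdr = {value:.4f}")
```

In practice the components and π0 are unknown and must be estimated, which is where the empirical-null and nonparametric mixture approaches described in the abstract come in; the fixed parameters here only make the formula concrete.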
Affiliation(s)
- Francesco Denti
  - Department of Statistics, University of California, Irvine, California
- Michele Guindani
  - Department of Statistics, University of California, Irvine, California
- Fabrizio Leisen
  - School of Mathematics, Statistics and Actuarial Sciences, University of Kent, Canterbury, UK
- Antonio Lijoi
  - Department of Decision Sciences, Bocconi University, Milan, Italy
  - Bocconi Institute of Data Science and Analytics (BIDSA), Milan, Italy
|
11
|
Liao JG, Berg A, McMurry TL. A Robustified Posterior for Bayesian Inference on a Large Number of Parallel Effects. Am Stat 2020. [DOI: 10.1080/00031305.2019.1701549] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Affiliation(s)
- J. G. Liao
  - Division of Biostatistics & Bioinformatics, Pennsylvania State University, Hershey, PA
- Arthur Berg
  - Division of Biostatistics & Bioinformatics, Pennsylvania State University, Hershey, PA
|
12
|
Wang X, Shojaie A, Zou J. Bayesian Hidden Markov Models for Dependent Large-Scale Multiple Testing. Comput Stat Data Anal 2019; 136:123-136. [PMID: 31662591 DOI: 10.1016/j.csda.2019.01.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 10/27/2022]
Abstract
An optimal and flexible multiple hypotheses testing procedure is constructed for dependent data based on Bayesian techniques, aiming at handling two challenges, namely dependence structure and non-null distribution specification. Ignoring dependence among hypotheses tests may lead to loss of efficiency and bias in decision. Misspecification in the non-null distribution, on the other hand, can result in both false positive and false negative errors. Hidden Markov models are used to accommodate the dependence structure among the tests. Dirichlet mixture process prior is applied on the non-null distribution to overcome the potential pitfalls in distribution misspecification. The testing algorithm based on Bayesian techniques optimizes the false negative rate (FNR) while controlling the false discovery rate (FDR). The procedure is applied to pointwise and clusterwise analysis. Its performance is compared with existing approaches using both simulated and real data examples.
Affiliation(s)
- Xia Wang
  - Department of Mathematical Sciences, University of Cincinnati, Cincinnati, Ohio 45221, U.S.A
- Ali Shojaie
  - Department of Biostatistics, University of Washington, Seattle, Washington 98195, U.S.A
- Jian Zou
  - Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, U.S.A
|
13
|
Affiliation(s)
- Wesley Tansey
  - Department of Computer Science, University of Texas at Austin, Austin, TX
- Oluwasanmi Koyejo
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
- James G. Scott
  - Department of Information, Risk, and Operations Management; Department of Statistics and Data Sciences; University of Texas at Austin, Austin, TX
|
14
|
|
15
|
Kumar N, Hoque MA, Sugimoto M. Robust volcano plot: identification of differential metabolites in the presence of outliers. BMC Bioinformatics 2018; 19:128. [PMID: 29642836 PMCID: PMC5896081 DOI: 10.1186/s12859-018-2117-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 11/24/2017] [Accepted: 03/19/2018] [Indexed: 12/22/2022] [Open Access]
Abstract
BACKGROUND The identification of differential metabolites in metabolomics is still a big challenge and plays a prominent role in metabolomics data analyses. Metabolomics datasets often contain outliers because of analytical, experimental, and biological ambiguity, but the currently available differential metabolite identification techniques are sensitive to outliers. RESULTS We propose a kernel weight based outlier-robust volcano plot for identifying differential metabolites from noisy metabolomics datasets. Two numerical experiments are used to evaluate the performance of the proposed technique against nine existing techniques, including the t-test and the Kruskal-Wallis test. Artificially generated data with outliers reveal that the proposed method results in a lower misclassification error rate and a greater area under the receiver operating characteristic curve compared with existing methods. An experimentally measured breast cancer dataset to which outliers were artificially added reveals that our proposed method produces only two non-overlapping differential metabolites whereas the other nine methods produced between seven and 57 non-overlapping differential metabolites. CONCLUSION Our data analyses show that the performance of the proposed differential metabolite identification technique is better than that of existing methods. Thus, the proposed method can contribute to analysis of metabolomics data with outliers. The R package and user manual of the proposed method are available at https://github.com/nishithkumarpaul/Rvolcano .
Affiliation(s)
- Nishith Kumar
  - Department of Statistics, Rajshahi University, Rajshahi, Bangladesh
  - Bioinformatics Lab, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md. Aminul Hoque
  - Department of Statistics, Rajshahi University, Rajshahi, Bangladesh
- Masahiro Sugimoto
  - Health Promotion and Preemptive Medicine, Research and Development Center for Minimally Invasive Therapies, Tokyo Medical University, Shinjuku, Tokyo, 160-8402 Japan
|
16
|
Madrid-Padilla OH, Polson NG, Scott J. A deconvolution path for mixtures. Electron J Stat 2018. [DOI: 10.1214/18-ejs1430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
17
|
Wang C, Gevertz JL. Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches. Stat Appl Genet Mol Biol 2017; 15:321-47. [PMID: 27226102 DOI: 10.1515/sagmb-2015-0072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
Modern biological experiments often involve high-dimensional data with thousands or more variables. A challenging problem is to identify the key variables that are related to a specific disease. Confounding this task is the vast number of statistical methods available for variable selection. For this reason, we set out to develop a framework to investigate the variable selection capability of statistical methods that are commonly applied to analyze high-dimensional biological datasets. Specifically, we designed six simulated cancers (based on benchmark colon and prostate cancer data) where we know precisely which genes cause a dataset to be classified as cancerous or normal - we call these causative genes. We found that not one statistical method tested could identify all the causative genes for all of the simulated cancers, even though increasing the sample size does improve the variable selection capabilities in most cases. Furthermore, certain statistical tools can classify our simulated data with a low error rate, yet the variables being used for classification are not necessarily the causative genes.
|
18
|
|
19
|
Chen S, Bowman FD, Mayberg HS. A Bayesian hierarchical framework for modeling brain connectivity for neuroimaging data. Biometrics 2015; 72:596-605. [PMID: 26501687 DOI: 10.1111/biom.12433] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 08/01/2014] [Revised: 08/01/2015] [Accepted: 09/01/2015] [Indexed: 01/21/2023]
Abstract
We propose a novel Bayesian hierarchical model for brain imaging data that unifies voxel-level (the most localized unit of measure) and region-level brain connectivity analyses, and yields population-level inferences. Functional connectivity generally refers to associations in brain activity between distinct locations. The first level of our model summarizes brain connectivity for cross-region voxel pairs using a two-component mixture model consisting of connected and nonconnected voxels. We use the proportion of connected voxel pairs to define a new measure of connectivity strength, which reflects the breadth of between-region connectivity. Furthermore, we evaluate the impact of clinical covariates on connectivity between region-pairs at a population level. We perform parameter estimation using Markov chain Monte Carlo (MCMC) techniques, which can be executed quickly relative to the number of model parameters. We apply our method to resting-state functional magnetic resonance imaging (fMRI) data from 32 subjects with major depression and simulated data to demonstrate the properties of our method.
Affiliation(s)
- Shuo Chen
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, Maryland 20742, U.S.A
- F DuBois Bowman
  - Department of Biostatistics, Columbia University, Manhattan, New York 10032, U.S.A
- Helen S Mayberg
  - School of Medicine, Emory University, Atlanta, Georgia 30322, U.S.A
|
20
|
Liu F, Wang C, Liu P. A Semi-parametric Bayesian Approach for Differential Expression Analysis of RNA-seq Data. J Agric Biol Environ Stat 2015; 20:555-576. [PMID: 27570441 DOI: 10.1007/s13253-015-0227-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 03/25/2023]
Abstract
RNA-sequencing (RNA-seq) technologies have revolutionized the way agricultural biologists study gene expression as well as generated a tremendous amount of data waiting for analysis. Detecting differentially expressed genes is one of the fundamental steps in RNA-seq data analysis. In this paper, we model the count data from RNA-seq experiments with a Poisson-Gamma hierarchical model, or equivalently, a negative binomial (NB) model. We derive a semi-parametric Bayesian approach with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. An inference strategy using Gibbs algorithm is developed for differential expression analysis. The results of several simulation studies show that our proposed method outperforms other methods including the popularly applied edgeR and DESeq methods. We also discuss an application of our method to a dataset that compares gene expression between bundle sheath and mesophyll cells in maize leaves.
Affiliation(s)
- Fangfang Liu
  - Department of Statistics, Iowa State University, Ames, IA 50011
- Chong Wang
  - Department of Statistics, Iowa State University, Ames, IA 50011
- Peng Liu
  - Department of Statistics, Iowa State University, Ames, IA 50011
|
21
|
Mollah MMH, Jamal R, Mokhtar NM, Harun R, Mollah MNH. A Hybrid One-Way ANOVA Approach for the Robust and Efficient Estimation of Differential Gene Expression with Multiple Patterns. PLoS One 2015; 10:e0138810. [PMID: 26413858 PMCID: PMC4587675 DOI: 10.1371/journal.pone.0138810] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 01/19/2015] [Accepted: 09/03/2015] [Indexed: 11/22/2022] [Open Access]
Abstract
Background Identifying genes that are differentially expressed (DE) between two or more conditions with multiple patterns of expression is one of the primary objectives of gene expression data analysis. Several statistical approaches, including one-way analysis of variance (ANOVA), are used to identify DE genes. However, most of these methods provide misleading results for two or more conditions with multiple patterns of expression in the presence of outlying genes. In this paper, an attempt is made to develop a hybrid one-way ANOVA approach that unifies the robustness and efficiency of estimation using the minimum β-divergence method to overcome some problems that arise in the existing robust methods for both small- and large-sample cases with multiple patterns of expression. Results The proposed method relies on a β-weight function, which produces values between 0 and 1. The β-weight function with β = 0.2 is used as a measure of outlier detection. It assigns smaller weights (≥ 0) to outlying expressions and larger weights (≤ 1) to typical expressions. The distribution of the β-weights is used to calculate the cut-off point, which is compared to the observed β-weight of an expression to determine whether that gene expression is an outlier. This weight function plays a key role in unifying the robustness and efficiency of estimation in one-way ANOVA. Conclusion Analyses of simulated gene expression profiles revealed that all eight methods (ANOVA, SAM, LIMMA, EBarrays, eLNN, KW, robust BetaEB and proposed) perform almost identically for m = 2 conditions in the absence of outliers. However, the robust BetaEB method and the proposed method exhibited considerably better performance than the other six methods in the presence of outliers. 
In this case, the BetaEB method exhibited slightly better performance than the proposed method for the small-sample cases, but the proposed method exhibited much better performance than the BetaEB method for both the small- and large-sample cases in the presence of more than 50% outlying genes. The proposed method also exhibited better performance than the other methods for m > 2 conditions with multiple patterns of expression, where the BetaEB was not extended for this condition. Therefore, the proposed approach would be more suitable and reliable on average for the identification of DE genes between two or more conditions with multiple patterns of expression.
Affiliation(s)
- Mohammad Manir Hossain Mollah
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Rahman Jamal
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Norfilza Mohd Mokhtar
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
  - Department of Physiology, Faculty of Medicine, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Roslan Harun
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Md. Nurul Haque Mollah
  - Laboratory of Bioinformatics, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
|
22
|
Scott JG, Kelly RC, Smith MA, Zhou P, Kass RE. False discovery rate regression: an application to neural synchrony detection in primary visual cortex. J Am Stat Assoc 2015; 110:459-471. [PMID: 26855459 DOI: 10.1080/01621459.2014.990973] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 10/24/2022]
Abstract
Many approaches for multiple testing begin with the assumption that all tests in a given study should be combined into a global false-discovery-rate analysis. But this may be inappropriate for many of today's large-scale screening problems, where auxiliary information about each test is often available, and where a combined analysis can lead to poorly calibrated error rates within different subsets of the experiment. To address this issue, we introduce an approach called false-discovery-rate regression that directly uses this auxiliary information to inform the outcome of each test. The method can be motivated by a two-groups model in which covariates are allowed to influence the local false discovery rate, or equivalently, the posterior probability that a given observation is a signal. This poses many subtle issues at the interface between inference and computation, and we investigate several variations of the overall approach. Simulation evidence suggests that: (1) when covariate effects are present, FDR regression improves power for a fixed false-discovery rate; and (2) when covariate effects are absent, the method is robust, in the sense that it does not lead to inflated error rates. We apply the method to neural recordings from primary visual cortex. The goal is to detect pairs of neurons that exhibit fine-time-scale interactions, in the sense that they fire together more often than expected due to chance. Our method detects roughly 50% more synchronous pairs versus a standard FDR-controlling analysis. The companion R package FDRreg implements all methods described in the paper.
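As a reading aid, one common way to write a covariate-dependent two-groups model of the kind this abstract describes is sketched below. The notation is assumed here for illustration and is not fixed by the abstract: z_i is the test statistic, x_i the covariates, c(x_i) the prior probability that test i is a signal, and f_0 and f_1 the null and alternative densities. The functional form of c is left unspecified.

```latex
\begin{aligned}
z_i &\sim c(x_i)\, f_1(z_i) + \{1 - c(x_i)\}\, f_0(z_i), \\
w_i &= P(\text{signal} \mid z_i, x_i)
     = \frac{c(x_i)\, f_1(z_i)}{c(x_i)\, f_1(z_i) + \{1 - c(x_i)\}\, f_0(z_i)}, \\
\mathrm{fdr}_i &= 1 - w_i .
\end{aligned}
```

Under this notation the local false discovery rate and the posterior signal probability carry the same information, since each is one minus the other. When c(x_i) is constant in x_i, the expression reduces to the ordinary two-groups model.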
|
23
|
Zhao Y, Kang J, Yu T. A Bayesian nonparametric mixture model for selecting genes and gene subnetworks. Ann Appl Stat 2014; 8:999-1021. [PMID: 25984253 DOI: 10.1214/14-aoas719] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 01/01/2023]
Abstract
It is very challenging to select informative features from tens of thousands of measured features in high-throughput data analysis. Recently, several parametric/regression models have been developed utilizing the gene network information to select genes or pathways strongly associated with a clinical/biological outcome. Alternatively, in this paper, we propose a nonparametric Bayesian model for gene selection incorporating network information. In addition to identifying genes that have a strong association with a clinical outcome, our model can select genes with particular expressional behavior, in which case the regression models are not directly applicable. We show that our proposed model is equivalent to an infinity mixture model for which we develop a posterior computation algorithm based on Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing algorithms that approximate the posterior simulation with good accuracy but relatively low computational cost. We illustrate our methods on simulation studies and the analysis of Spellman yeast cell cycle microarray data.
Affiliation(s)
- Yize Zhao
- Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Rd., Atlanta, Georgia 30322, USA
- Jian Kang
- Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Rd., Atlanta, Georgia 30322, USA
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Rd., Atlanta, Georgia 30322, USA
|
24
|
Dickhaus T, Blankertz B, Meinecke FC. Binary classification with pFDR-pFNR losses. Biom J 2014; 55:463-77. [DOI: 10.1002/bimj.201200054] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/28/2012] [Revised: 11/27/2012] [Accepted: 11/30/2012] [Indexed: 11/11/2022]
Affiliation(s)
- Thorsten Dickhaus
- Department of Mathematics; Humboldt-University Berlin; Unter den Linden 6 D-10099 Berlin Germany
- Benjamin Blankertz
- Neurotechnology Group; Berlin Institute of Technology; Marchstrasse 23 D-10587 Berlin Germany
- Bernstein Center for Computational Neuroscience Berlin; Philippstrasse 13, Haus 6 D-10115 Berlin Germany
- Frank C. Meinecke
- Machine Learning/Intelligent Data Analysis Group; Berlin Institute of Technology, Marchstrasse 23; D-10587 Berlin Germany
|
25
|
Liao JG. Prior robust empirical Bayes inference for large-scale data by conditioning on rank with application to microarray data. Biostatistics 2014; 15:60-73. [PMID: 23934072 PMCID: PMC3862209 DOI: 10.1093/biostatistics/kxt026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/26/2012] [Revised: 07/10/2013] [Accepted: 07/10/2013] [Indexed: 11/13/2022] Open
Abstract
Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method's prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology.
Affiliation(s)
- J. G. Liao
- Division of Biostatistics and Bioinformatics, Penn State University, Hershey, PA 17033, USA
|
26
|
Cao J, Zhang S. A Bayesian extension of the hypergeometric test for functional enrichment analysis. Biometrics 2013; 70:84-94. [PMID: 24320951 DOI: 10.1111/biom.12122] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 09/01/2012] [Revised: 09/01/2013] [Accepted: 10/01/2013] [Indexed: 11/28/2022]
Abstract
Functional enrichment analysis is conducted on high-throughput data to provide functional interpretation for a list of genes or proteins that share a common property, such as being differentially expressed (DE). The hypergeometric P-value has been widely used to investigate whether genes from pre-defined functional terms, for example, Gene Ontology (GO), are enriched in the DE genes. The hypergeometric P-value has three limitations: (1) computed independently for each term, thus neglecting biological dependence; (2) subject to a size constraint that leads to the tendency of selecting less-specific terms; (3) repeated use of information due to overlapping annotations by the true-path rule. We propose a Bayesian approach based on the non-central hypergeometric model. The GO dependence structure is incorporated through a prior on non-centrality parameters. The likelihood function does not include overlapping information. The inference about enrichment is based on posterior probabilities that do not have a size constraint. This method can detect moderate but consistent enrichment signals and identify sets of closely related and biologically meaningful functional terms rather than isolated terms. We also describe the basic ideas of assumption and implementation of different methods to provide some theoretical insights, which are demonstrated via a simulation study. A real application is presented.
Affiliation(s)
- Jing Cao
- Department of Statistical Science, Southern Methodist University, Dallas, Texas 75275, U.S.A
|
27
|
Trentini F, Ji Y, Iwamoto T, Qi Y, Pusztai L, Müller P. Bayesian mixture models for assessment of gene differential behaviour and prediction of pCR through the integration of copy number and gene expression data. PLoS One 2013; 8:e68071. [PMID: 23874497 PMCID: PMC3709899 DOI: 10.1371/journal.pone.0068071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2012] [Accepted: 05/23/2013] [Indexed: 11/18/2022] Open
Abstract
We consider modeling jointly microarray RNA expression and DNA copy number data. We propose Bayesian mixture models that define latent Gaussian probit scores for the DNA and RNA, and integrate between the two platforms via a regression of the RNA probit scores on the DNA probit scores. Such a regression conveniently allows us to include additional sample specific covariates such as biological conditions and clinical outcomes. The two developed methods are aimed respectively to make inference on differential behaviour of genes in patients showing different subtypes of breast cancer and to predict the pathological complete response (pCR) of patients borrowing strength across the genomic platforms. Posterior inference is carried out via MCMC simulations. We demonstrate the proposed methodology using a published data set consisting of 121 breast cancer patients.
Affiliation(s)
- Filippo Trentini
- University Centre of Statistics in the Biomedical Sciences, Vita-Salute San Raffaele University, Milan, Italy
- Yuan Ji
- Center for Clinical and Research Informatics, NorthShore University HealthSystem, Evanston, Illinois, United States of America
- Takayuki Iwamoto
- Department of Breast and Endocrine Surgery, Okayama University Hospital, Okayama, Japan
- Yuan Qi
- Division of Quantitative Sciences, MD Anderson Cancer Center, Houston, Texas, United States of America
- Lajos Pusztai
- Chief of Breast Medical Oncology, Yale School of Medicine, New Haven, Connecticut, United States of America
- Peter Müller
- Department of Mathematics, University of Texas, Austin, Texas, United States of America
|
28
|
Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X. An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One 2013; 8:e67672. [PMID: 23844055 PMCID: PMC3700986 DOI: 10.1371/journal.pone.0067672] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 11/14/2012] [Accepted: 05/21/2013] [Indexed: 12/27/2022] Open
Abstract
The discovery of genetic or genomic markers plays a central role in the development of personalized medicine. A notable challenge exists when dealing with the high dimensionality of the data sets, as thousands of genes or millions of genetic variants are collected on a relatively small number of subjects. Traditional gene-wise selection methods using univariate analyses have difficulty incorporating correlational, structural, or functional structures amongst the molecular measures. For microarray gene expression data, we first summarize solutions in dealing with 'large p, small n' problems, and then propose an integrative Bayesian variable selection (iBVS) framework for simultaneously identifying causal or marker genes and regulatory pathways. A novel partial least squares (PLS) g-prior for iBVS is developed to allow the incorporation of prior knowledge on gene-gene interactions or functional relationships. From the point of view of systems biology, iBVS enables users to directly target the joint effects of multiple genes and pathways in a hierarchical modeling diagram to predict disease status or phenotype. The estimated posterior selection probabilities offer probabilistic and biological interpretations. Both simulated data and a set of microarray data in predicting stroke status are used in validating the performance of iBVS in a Probit model with binary outcomes. iBVS offers a general framework for effective discovery of various molecular biomarkers by combining data-based statistics and knowledge-based priors. Guidelines on making posterior inferences, determining Bayesian significance levels, and improving computational efficiencies are also discussed.
Affiliation(s)
- Bin Peng
- Department of Health Statistics, Chongqing Medical University, Chongqing, China
- Division of Biostatistics, Bayessoft, Inc., Davis, California, United States of America
- Dianwen Zhu
- Hunter College–School of Public Health, City University of New York, New York, United States of America
- Bradley P. Ander
- Medical Investigation of Neurodevelopmental Disorders (MIND) Institute, University of California Davis, Sacramento, California, United States of America
- Xiaoshuai Zhang
- School of Public Health, Shandong University, Jinan, Shandong, China
- Fuzhong Xue
- School of Public Health, Shandong University, Jinan, Shandong, China
- Frank R. Sharp
- Medical Investigation of Neurodevelopmental Disorders (MIND) Institute, University of California Davis, Sacramento, California, United States of America
- Xiaowei Yang
- Division of Biostatistics, Bayessoft, Inc., Davis, California, United States of America
- Hunter College–School of Public Health, City University of New York, New York, United States of America
|
29
|
Shahbaba B, Johnson WO. Bayesian nonparametric variable selection as an exploratory tool for discovering differentially expressed genes. Stat Med 2013; 32:2114-26. [PMID: 23172736 DOI: 10.1002/sim.5680] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/08/2012] [Accepted: 10/28/2012] [Indexed: 01/15/2023]
Abstract
High-throughput scientific studies involving no clear a priori hypothesis are common. For example, a large-scale genomic study of a disease may examine thousands of genes without hypothesizing that any specific gene is responsible for the disease. In these studies, the objective is to explore a large number of possible factors (e.g., genes) in order to identify a small number that will be considered in follow-up studies that tend to be more thorough and on smaller scales. A simple, hierarchical, linear regression model with random coefficients is assumed for case-control data that correspond to each gene. The specific model used will be seen to be related to a standard Bayesian variable selection model. Relatively large regression coefficients correspond to potential differences in responses for cases versus controls and thus to genes that might 'matter'. For large-scale studies, and using a Dirichlet process mixture model for the regression coefficients, we are able to find clusters of regression effects of genes with increasing potential effect or 'relevance', in relation to the outcome of interest. One cluster will always correspond to genes whose coefficients are in a neighborhood that is relatively close to zero and will be deemed least relevant. Other clusters will correspond to increasing magnitudes of the random/latent regression coefficients. Using simulated data, we demonstrate that our approach could be quite effective in finding relevant genes compared with several alternative methods. We apply our model to two large-scale studies. The first study involves transcriptome analysis of infection by human cytomegalovirus. The second study's objective is to identify differentially expressed genes between two types of leukemia.
Affiliation(s)
- Babak Shahbaba
- Department of Statistics, University of California at Irvine, CA, USA.
|
30
|
Abstract
DNA microarrays are a relatively new technology that can simultaneously measure the expression level of thousands of genes. They have become an important tool for a wide variety of biological experiments. One of the most common goals of DNA microarray experiments is to identify genes associated with biological processes of interest. Conventional statistical tests often produce poor results when applied to microarray data owing to small sample sizes, noisy data, and correlation among the expression levels of the genes. Thus, novel statistical methods are needed to identify significant genes in DNA microarray experiments. This article discusses the challenges inherent in DNA microarray analysis and describes a series of statistical techniques that can be used to overcome these challenges. The problem of multiple hypothesis testing and its relation to microarray studies are also considered, along with several possible solutions.
Affiliation(s)
- Eric Bair
- Department of Endodontics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA ; Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
31
|
Conlon EM, Postier BL, Methé BA, Nevin KP, Lovley DR. A Bayesian model for pooling gene expression studies that incorporates co-regulation information. PLoS One 2012; 7:e52137. [PMID: 23284902 PMCID: PMC3532429 DOI: 10.1371/journal.pone.0052137] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/12/2012] [Accepted: 11/13/2012] [Indexed: 12/01/2022] Open
Abstract
Current Bayesian microarray models that pool multiple studies assume gene expression is independent of other genes. However, in prokaryotic organisms, genes are arranged in units that are co-regulated (called operons). Here, we introduce a new Bayesian model for pooling gene expression studies that incorporates operon information into the model. Our Bayesian model borrows information from other genes within the same operon to improve estimation of gene expression. The model produces the gene-specific posterior probability of differential expression, which is the basis for inference. We found in simulations and in biological studies that incorporating co-regulation information improves upon the independence model. We assume that each study contains two experimental conditions: a treatment and control. We note that there exist environmental conditions for which genes that are supposed to be transcribed together lose their operon structure, and that our model is best carried out for known operon structures.
Affiliation(s)
- Erin M Conlon
- Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA, USA.
|
32
|
Hong Z, Lian H. BOPA: A Bayesian hierarchical model for outlier expression detection. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2012.05.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
|
33
|
Mollah MMH, Mollah MNH, Kishino H. β-empirical Bayes inference and model diagnosis of microarray data. BMC Bioinformatics 2012; 13:135. [PMID: 22713095 PMCID: PMC3464654 DOI: 10.1186/1471-2105-13-135] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 12/21/2011] [Accepted: 04/23/2012] [Indexed: 12/04/2022] Open
Abstract
Background: Microarray data enables the high-throughput survey of mRNA expression profiles at the genomic level; however, the data presents a challenging statistical problem because of the large number of transcripts with small sample sizes that are obtained. To reduce the dimensionality, various Bayesian or empirical Bayes hierarchical models have been developed. However, because of the complexity of the microarray data, no model can explain the data fully. It is generally difficult to scrutinize the irregular patterns of expression that are not expected by the usual statistical gene by gene models. Results: As an extension of empirical Bayes (EB) procedures, we have developed the β-empirical Bayes (β-EB) approach based on a β-likelihood measure which can be regarded as an 'evidence-based' weighted (quasi-) likelihood inference. The weight of a transcript t is described as a power function of its likelihood, f^β(y_t | θ). Genes with low likelihoods have unexpected expression patterns and low weights. By assigning low weights to outliers, the inference becomes robust. The value of β, which controls the balance between the robustness and efficiency, is selected by maximizing the predictive β_0-likelihood by cross-validation. The proposed β-EB approach identified six significant (p < 10^-5) contaminated transcripts as differentially expressed (DE) in normal/tumor tissues from the head and neck of cancer patients. These six genes were all confirmed to be related to cancer; they were not identified as DE genes by the classical EB approach. When applied to the eQTL analysis of Arabidopsis thaliana, the proposed β-EB approach identified some potential master regulators that were missed by the EB approach. Conclusions: The simulation data and real gene expression data showed that the proposed β-EB method was robust against outliers. The distribution of the weights was used to scrutinize the irregular patterns of expression and diagnose the model statistically. When β-weights outside the range of the predicted distribution were observed, a detailed inspection of the data was carried out. The β-weights described here can be applied to other likelihood-based statistical models for diagnosis, and may serve as a useful tool for transcriptome and proteome studies.
Affiliation(s)
- Mohammad Manir Hossain Mollah
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan.
|
34
|
Benchmarking historical corporate performance. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2011.11.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
35
|
Abstract
Polya trees (PT) are random probability measures which can assign probability 1 to the set of continuous distributions for certain specifications of the hyperparameters. This feature distinguishes the PT from the popular Dirichlet process (DP) model which assigns probability 1 to the set of discrete distributions. However, the PT is not nearly as widely used as the DP prior. Probably the main reason is an awkward dependence of posterior inference on the choice of the partitioning subsets in the definition of the PT. We propose a generalization of the PT prior that mitigates this undesirable dependence on the partition structure, by allowing the branching probabilities to be dependent within the same level. The proposed new process is not a PT anymore. However, it is still a tail-free process and many of the prior properties remain the same as those for the PT.
Affiliation(s)
- Peter Müller
- Department of Mathematics, The University of Texas at Austin
|
36
|
Polson NG, Scott JG. Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. Ann Appl Stat 2012. [DOI: 10.1214/11-aoas512] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
|
37
|
Han B, Dalal SR. A Bernstein-type estimator for decreasing density with application to p-value adjustments. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2011.08.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
|
38
|
Abstract
In recent years, the capability of synthetic biology to design large genetic circuits has dramatically increased due to rapid advances in DNA synthesis technology and development of tools for large-scale assembly of DNA fragments. Large genetic circuits require more components (parts), especially regulators such as transcription factors, sigma factors, and viral RNA polymerases to provide increased regulatory capability, and also devices such as sensors, receivers, and signaling molecules. All these parts may have a potential impact upon the host that needs to be considered when designing and fabricating circuits. DNA microarrays are a well-established technique for global monitoring of gene expression and therefore are an ideal tool for systematically assessing the impact of expressing parts of genetic circuits in host cells. Knowledge of part impact on the host enables the user to design circuits from libraries of parts taking into account their potential impact and also to possibly modify the host to better tolerate stresses induced by the engineered circuit. In this chapter, we present the complete methodology of performing microarrays from choice of array platform, experimental design, preparing samples for array hybridization, and associated data analysis including preprocessing, normalization, clustering, identifying significantly differentially expressed genes, and interpreting the data based on known biology. With these methodologies, we also include lists of bioinformatic resources and tools for performing data analysis. The aim of this chapter is to provide the reader with the information necessary to be able to systematically catalog the impact of genetic parts on the host and also to optimize the operation of fully engineered genetic circuits.
Affiliation(s)
- Virgil A Rhodius
- Department of Microbiology and Immunology, University of California at San Francisco, San Francisco, California, USA
|
39
|
Tancredi A, Liseo B. A hierarchical Bayesian approach to record linkage and population size problems. Ann Appl Stat 2011. [DOI: 10.1214/10-aoas447] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Indexed: 11/19/2022]
|
40
|
Bogdan M, Chakrabarti A, Frommlet F, Ghosh JK. Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann Stat 2011. [DOI: 10.1214/10-aos869] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Indexed: 11/19/2022]
|
41
|
Ausín MC, Gómez-Villegas MA, González-Pérez B, Rodríguez-Bernal MT, Salazar I, Sanz L. Bayesian Analysis of Multiple Hypothesis Testing with Applications to Microarray Experiments. COMMUN STAT-THEOR M 2011. [DOI: 10.1080/03610921003778183] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
|
42
|
Szczurek E, Biecek P, Tiuryn J, Vingron M. Introducing knowledge into differential expression analysis. J Comput Biol 2010; 17:953-67. [PMID: 20726790 DOI: 10.1089/cmb.2010.0034] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
Gene expression measurements allow determining sets of up- or down-regulated, or unchanged genes in a particular experimental condition. Additional biological knowledge can suggest examples of genes from one of these sets. For instance, known target genes of a transcriptional activator are expected, but are not certain to go down after this activator is knocked out. Available differential expression analysis tools do not take such imprecise examples into account. Here we put forward a novel partially supervised mixture modeling methodology for differential expression analysis. Our approach, guided by imprecise examples, clusters expression data into differentially expressed and unchanged genes. The partially supervised methodology is implemented by two methods: a newly introduced belief-based mixture modeling, and soft-label mixture modeling, a method proved efficient in other applications. We investigate on synthetic data the input example settings favorable for each method. In our tests, both belief-based and soft-label methods prove their advantage over semi-supervised mixture modeling in correcting for erroneous examples. We also compare them to alternative differential expression analysis approaches, showing that incorporation of knowledge yields better performance. We present a broad range of knowledge sources and data to which our partially supervised methodology can be applied. First, we determine targets of Ste12 based on yeast knockout data, guided by a Ste12 DNA-binding experiment. Second, we distinguish miR-1 from miR-124 targets in human by clustering expression data under transfection experiments of both microRNAs, using their computationally predicted targets as examples. Finally, we utilize literature knowledge to improve clustering of time-course expression profiles.
Affiliation(s)
- Ewa Szczurek
- Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
43
|
Scott JG, Berger JO. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann Stat 2010. [DOI: 10.1214/10-aos792] [Citation(s) in RCA: 326] [Impact Index Per Article: 23.3] [Indexed: 11/19/2022]
|
44
|
Bar H, Booth J, Schifano E, Wells MT. Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments. Stat Sci 2010. [DOI: 10.1214/10-sts339] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
|
45
|
Han S, Andrei AC, Tsui KW. A robust method for large-scale multiple hypotheses testing. Biom J 2010; 52:222-32. [PMID: 20391535 DOI: 10.1002/bimj.200900177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
When drawing large-scale simultaneous inference, such as in genomics and imaging problems, multiplicity adjustments should be made, since, otherwise, one would be faced with an inflated type I error. Numerous methods are available to estimate the proportion of true null hypotheses, π₀, among a large number of hypotheses tested. Many methods implicitly assume that π₀ is large, that is, close to 1. However, in practice, mid-range π₀ values are frequently encountered and many of the widely used methods tend to produce highly variable or biased estimates of π₀. As a remedy in such situations, we propose a hierarchical Bayesian model that produces an estimator of π₀ that exhibits considerably less bias and is more stable. Simulation studies seem indicative of good method performance even when low-to-moderate correlation exists among test statistics. Method performance is assessed in simulated settings and its practical usefulness is illustrated in an application to a type II diabetes study.
Affiliation(s)
- Seungbong Han
- Department of Statistics, University of Wisconsin-Madison, Medical Science Center 1300 University Avenue, Madison, WI 53706, USA
|
46
|
Scharpf RB, Tjelmeland H, Parmigiani G, Nobel AB. A Bayesian model for cross-study differential gene expression. J Am Stat Assoc 2009; 104:1295-1310. [PMID: 21127725 DOI: 10.1198/jasa.2009.ap07611] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 12/29/2022]
Abstract
In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, as well as concordant and discordant differential expression across studies. We evaluated the performance of our model in a comprehensive fashion, using both artificial data, and a "split-study" validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis, but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of the Bayesian model. The 1 - AUC values for the Bayesian model are roughly half of the corresponding values for a direct combination of t- and SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most noticeably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of platform-, sample-, and annotation-differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.
Affiliation(s)
- Robert B Scharpf
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
|
47
|
Scott JG. Nonparametric Bayesian multiple testing for longitudinal performance stratification. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas252] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
|
48
|
Conlon EM, Postier BL, Methé BA, Nevin KP, Lovley DR. Hierarchical Bayesian meta-analysis models for cross-platform microarray studies. J Appl Stat 2009. [DOI: 10.1080/02664760802562480] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
49
|
Chudova D, Ihler A, Lin KK, Andersen B, Smyth P. Bayesian detection of non-sinusoidal periodic patterns in circadian expression data. Bioinformatics 2009; 25:3114-20. [PMID: 19773336 DOI: 10.1093/bioinformatics/btp547] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Cyclical biological processes such as cell division and circadian regulation produce coordinated periodic expression of thousands of genes. Identification of such genes and their expression patterns is a crucial step in discovering underlying regulatory mechanisms. Existing computational methods are biased toward discovering genes that follow sine-wave patterns. RESULTS: We present an analysis of variance (ANOVA) periodicity detector and its Bayesian extension that can be used to discover periodic transcripts of arbitrary shapes from replicated gene expression profiles. The models are applicable when the profiles are collected at comparable time points for at least two cycles. We provide an empirical Bayes procedure for estimating parameters of the prior distributions and derive closed-form expressions for the posterior probability of periodicity, enabling efficient computation. The model is applied to two datasets profiling circadian regulation in murine liver and skeletal muscle, revealing a substantial number of previously undetected non-sinusoidal periodic transcripts in each. We also apply quantitative real-time PCR to several highly ranked non-sinusoidal transcripts in liver tissue found by the model, providing independent evidence of circadian regulation of these genes. AVAILABILITY: Matlab software for estimating prior distributions and performing inference is available for download from http://www.datalab.uci.edu/resources/periodicity/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Darya Chudova
- Department of Computer Science, University of California, Irvine, CA 92697, USA.
|
50
|
Wang J, Wen S, Symmans WF, Pusztai L, Coombes KR. The bimodality index: a criterion for discovering and ranking bimodal signatures from cancer gene expression profiling data. Cancer Inform 2009; 7:199-216. [PMID: 19718451 PMCID: PMC2730180 DOI: 10.4137/cin.s2846] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.4] [Indexed: 12/02/2022] Open Access
Abstract
Motivation: Identifying genes with bimodal expression patterns from large-scale expression profiling data is an important analytical task. Model-based clustering is popular for this purpose. That technique commonly uses the Bayesian information criterion (BIC) for model selection. In practice, however, BIC appears to be overly sensitive and may lead to the identification of bimodally expressed genes that are unreliable or not clinically useful. We propose using a novel criterion, the bimodality index, not only to identify but also to rank meaningful and reliable bimodal patterns. The bimodality index can be computed using either a mixture model-based algorithm or Markov chain Monte Carlo techniques. Results: We carried out simulation studies and applied the method to real data from a cancer gene expression profiling study. Our findings suggest that BIC behaves like a lax cutoff based on the bimodality index, and that the bimodality index provides an objective measure to identify and rank meaningful and reliable bimodal patterns from large-scale gene expression datasets. R code to compute the bimodality index is included in the ClassDiscovery package of the Object-Oriented Microarray and Proteomic Analysis (OOMPA) suite available at the web site http://bioinformatics.mdanderson.org/Software/OOMPA.
Affiliation(s)
- Jing Wang
- Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, Houston, TX 77030-4009, USA.
|