1
Dünder E. A modified information criterion for model selection. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2019.1708395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Emre Dünder
- Department of Statistics, Faculty of Science, Ondokuz Mayıs University, Samsun, Turkey
2
Xu N, Solari A, Goeman J. Globaltest confidence regions and their application to ridge regression. Biom J 2021; 63:1351-1365. [PMID: 34046931 PMCID: PMC8519024 DOI: 10.1002/bimj.202000063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/03/2020] [Revised: 04/02/2021] [Accepted: 05/01/2021] [Indexed: 12/25/2022]
Abstract
We construct confidence regions in high dimensions by inverting the globaltest statistics, and use them to choose the tuning parameter for penalized regression. The selected model corresponds to the point in the confidence region of the parameters that minimizes the penalty, making it the least complex model that still has acceptable fit according to the test that defines the confidence region. As the globaltest is particularly powerful in the presence of many weak predictors, it connects well to ridge regression, and we thus focus on ridge penalties in this paper. The confidence region method is quick to calculate, intuitive, and gives decent predictive potential. As a tuning parameter selection method it may even outperform classical methods such as cross-validation in terms of mean squared error of prediction, especially when the signal is weak. We illustrate the method for linear models in a simulation study and for Cox models in real gene expression data of breast cancer samples.
Affiliation(s)
- Ningning Xu
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Aldo Solari
- Department of Economics, Management and Statistics, University of Milano-Bicocca, Milano, Italy
- Jelle Goeman
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
3
Huang J, Jiao Y, Kang L, Liu J, Liu Y, Lu X. GSDAR: a fast Newton algorithm for $\ell_0$ regularized generalized linear models with statistical guarantee. Comput Stat 2021. [DOI: 10.1007/s00180-021-01098-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
4
Goepp V, Thalabard JC, Nuel G, Bouaziz O. Regularized bidimensional estimation of the hazard rate. Int J Biostat 2021; 18:263-277. [PMID: 33768761 DOI: 10.1515/ijb-2019-0003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2019] [Accepted: 02/26/2021] [Indexed: 11/15/2022]
Abstract
In epidemiological or demographic studies, with variable age at onset, a typical quantity of interest is the incidence of a disease (for example the cancer incidence). In these studies, the individuals are usually highly heterogeneous in terms of dates of birth (the cohort) and with respect to the calendar time (the period) and appropriate estimation methods are needed. In this article a new estimation method is presented which extends classical age-period-cohort analysis by allowing interactions between age, period and cohort effects. We introduce a bidimensional regularized estimate of the hazard rate where a penalty is introduced on the likelihood of the model. This penalty can be designed either to smooth the hazard rate or to enforce consecutive values of the hazard to be equal, leading to a parsimonious representation of the hazard rate. In the latter case, we make use of an iterative penalized likelihood scheme to approximate the $L_0$ norm, which makes the computation tractable. The method is evaluated on simulated data and applied to breast cancer survival data from the SEER program.
Affiliation(s)
- Vivien Goepp
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France; MINES ParisTech, CBIO-Centre for Computational Biology, PSL Research University, 75006, Paris, France; Institut Curie, PSL Research University, 75005, Paris, France; Inserm, U900, Paris, France
- Grégory Nuel
- LPSM, CNRS UMR 8001, 4, Place Jussieu, 75005, Paris, France
- Olivier Bouaziz
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France
5
Selig K, Shaw P, Ankerst D. Bayesian information criterion approximations to Bayes factors for univariate and multivariate logistic regression models. Int J Biostat 2020; 17:241-266. [PMID: 33119543 DOI: 10.1515/ijb-2020-0045] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/06/2020] [Accepted: 10/08/2020] [Indexed: 11/15/2022]
Abstract
Schwarz's criterion, also known as the Bayesian Information Criterion or BIC, is commonly used for model selection in logistic regression due to its simple intuitive formula. For tests of nested hypotheses in independent and identically distributed data as well as in Normal linear regression, previous results have motivated use of Schwarz's criterion by its consistent approximation to the Bayes factor (BF), defined as the ratio of posterior to prior model odds. Furthermore, under construction of an intuitive unit-information prior for the parameters of interest to test for inclusion in the nested models, previous results have shown that Schwarz's criterion approximates the BF to higher order in the neighborhood of the simpler nested model. This paper extends these results to univariate and multivariate logistic regression, providing approximations to the BF for arbitrary prior distributions and definitions of the unit-information prior corresponding to Schwarz's approximation. Simulations show accuracies of the approximations for small sample sizes as well as comparisons to conclusions from frequentist testing. We present an application in prostate cancer, the motivating setting for our work, which illustrates the approximation for large data sets in a practical example.
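As a rough illustration of the idea summarized in this abstract, the sketch below computes the standard BIC for two nested models and the textbook Schwarz approximation to the Bayes factor, BF ≈ exp((BIC_simple − BIC_full) / 2). It is not the paper's refined approximation; the function names and the log-likelihood values are hypothetical and chosen only for demonstration.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def approx_bayes_factor(loglik_simple, k_simple, loglik_full, k_full, n_obs):
    """Schwarz approximation to the Bayes factor of the full model
    against the simpler nested model: exp((BIC_simple - BIC_full) / 2)."""
    delta = bic(loglik_simple, k_simple, n_obs) - bic(loglik_full, k_full, n_obs)
    return math.exp(delta / 2.0)


# Hypothetical maximized log-likelihoods for two nested logistic models
# fitted to the same 200 observations (illustrative values only).
bf = approx_bayes_factor(
    loglik_simple=-120.0, k_simple=2,
    loglik_full=-115.0, k_full=3,
    n_obs=200,
)
print(f"Approximate Bayes factor (full vs. simple): {bf:.2f}")
```

Values of the approximate Bayes factor above 1 favor the larger model; the ln(n) term penalizes the extra parameter, so a gain in log-likelihood must outweigh it before the larger model is preferred.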
Affiliation(s)
- Katharina Selig
- Department of Mathematics, Technical University of Munich, Munchen, Germany
- Pamela Shaw
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Donna Ankerst
- Department of Mathematics, Technical University of Munich, Munchen, Germany
6
Model selection in sparse high-dimensional vine copula models with an application to portfolio risk. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2019.03.004] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 11/21/2022]
7
Szulc P, Bogdan M, Frommlet F, Tang H. Joint genotype- and ancestry-based genome-wide association studies in admixed populations. Genet Epidemiol 2017; 41:555-566. [PMID: 28657151 DOI: 10.1002/gepi.22056] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 07/25/2016] [Revised: 04/01/2017] [Accepted: 04/25/2017] [Indexed: 12/21/2022]
Abstract
In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.
Affiliation(s)
- Piotr Szulc
- Faculty of Mathematics, Wroclaw University of Technology, Wroclaw, Poland
- Malgorzata Bogdan
- Faculty of Mathematics and Computer Science, University of Wroclaw, Wroclaw, Poland
- Florian Frommlet
- Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Hua Tang
- Departments of Genetics and Statistics, Stanford University, Stanford, California, United States of America
8
9
Barber RF, Drton M. High-dimensional Ising model selection with Bayesian information criteria. Electron J Stat 2015. [DOI: 10.1214/15-ejs1012] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Indexed: 11/19/2022]
10
Dolejsi E, Bodenstorfer B, Frommlet F. Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian Information Criterion. PLoS One 2014; 9:e103322. [PMID: 25061809 PMCID: PMC4111553 DOI: 10.1371/journal.pone.0103322] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 04/22/2014] [Accepted: 07/01/2014] [Indexed: 01/24/2023]
Abstract
The prevailing method of analyzing GWAS data is still to test each marker individually, although from a statistical point of view it is quite obvious that in case of complex traits such single marker tests are not ideal. Recently several model selection approaches for GWAS have been suggested, most of them based on LASSO-type procedures. Here we will discuss an alternative model selection approach which is based on a modification of the Bayesian Information Criterion (mBIC2) which was previously shown to have certain asymptotic optimality properties in terms of minimizing the misclassification error. Heuristic search strategies are introduced which attempt to find the model which minimizes mBIC2, and which are efficient enough to allow the analysis of GWAS data. Our approach is implemented in a software package called MOSGWA. Its performance in case control GWAS is compared with the two algorithms HLASSO and d-GWASelect, as well as with single marker tests, where we performed a simulation study based on real SNP data from the POPRES sample. Our results show that MOSGWA performs slightly better than HLASSO, where specifically for more complex models MOSGWA is more powerful with only a slight increase in Type I error. On the other hand according to our simulations GWASelect does not at all control the type I error when used to automatically determine the number of important SNPs. We also reanalyze the GWAS data from the Wellcome Trust Case-Control Consortium and compare the findings of the different procedures, where MOSGWA detects for complex diseases a number of interesting SNPs which are not found by other methods.
Affiliation(s)
- Erich Dolejsi
- Center for Medical Statistics, Informatics, and Intelligent Systems/Section of Medical Statistics, Medical University Vienna, Vienna, Austria
- Florian Frommlet
- Center for Medical Statistics, Informatics, and Intelligent Systems/Section of Medical Statistics, Medical University Vienna, Vienna, Austria
11
Lv J, Liu JS. Model selection principles in misspecified models. J R Stat Soc Series B Stat Methodol 2013. [DOI: 10.1111/rssb.12023] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Indexed: 11/30/2022]
Affiliation(s)
- Jinchi Lv
- University of Southern California, Los Angeles, USA
12
Frommlet F, Bogdan M. Some optimality properties of FDR controlling rules under sparsity. Electron J Stat 2013. [DOI: 10.1214/13-ejs808] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]