1
Frommlet F. A neutral comparison of algorithms to minimize L0 penalties for high-dimensional variable selection. Biom J 2024; 66:e2200207. [PMID: 37421205 DOI: 10.1002/bimj.202200207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Revised: 03/09/2023] [Accepted: 04/29/2023] [Indexed: 07/10/2023]
Abstract
Variable selection methods based on L0 penalties have excellent theoretical properties to select sparse models in a high-dimensional setting. There exist modifications of the Bayesian Information Criterion (BIC) which either control the familywise error rate (mBIC) or the false discovery rate (mBIC2) in terms of which regressors are selected to enter a model. However, the minimization of L0 penalties comprises a mixed-integer problem which is known to be NP-hard and therefore becomes computationally challenging with increasing numbers of regressor variables. This is one reason why alternatives like the LASSO have become so popular, which involve convex optimization problems that are easier to solve. The last few years have seen some real progress in developing new algorithms to minimize L0 penalties. The aim of this article is to compare the performance of these algorithms in terms of minimizing L0-based selection criteria. Simulation studies covering a wide range of scenarios that are inspired by genetic association studies are used to compare the values of selection criteria obtained with different algorithms. In addition, some statistical characteristics of the selected models and the runtime of algorithms are compared. Finally, the performance of the algorithms is illustrated in a real data example concerned with expression quantitative trait loci (eQTL) mapping.
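As a rough illustration of the two criteria named in this abstract, the sketch below evaluates mBIC and mBIC2 for a Gaussian linear model. It assumes the form commonly given in the mBIC literature: n log(RSS/n) + k log n + 2k log(p/4) for mBIC, with mBIC2 subtracting 2 log(k!). The constant 4, the function names, and the argument names are illustrative assumptions, not details taken from the abstract.

```python
import math


def mbic(rss, n, k, p, const=4.0):
    """Sketch of mBIC for a linear model with k of p candidate regressors.

    n * log(rss / n) stands in for -2 * log-likelihood (up to an additive
    constant); const=4 is an assumed default, not taken from the abstract.
    """
    if rss <= 0 or n <= 0 or p <= 0 or not 0 <= k <= p:
        raise ValueError("need rss > 0, n > 0, p > 0 and 0 <= k <= p")
    return n * math.log(rss / n) + k * math.log(n) + 2 * k * math.log(p / const)


def mbic2(rss, n, k, p, const=4.0):
    """Sketch of mBIC2: mBIC minus 2 * log(k!), which relaxes the penalty
    as more regressors enter the model."""
    # math.lgamma(k + 1) equals log(k!) for non-negative integer k.
    return mbic(rss, n, k, p, const) - 2 * math.lgamma(k + 1)
```

Under either criterion a lower value indicates the preferred model; the two agree for k = 0 and k = 1 and diverge as k grows.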
Affiliation(s)
- Florian Frommlet
- Institute of Medical Statistics, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria
2
Zhou J, Hoen AG, Mcritchie S, Pathmasiri W, Viles WD, Nguyen QP, Madan JC, Dade E, Karagas MR, Gui J. Information enhanced model selection for Gaussian graphical model with application to metabolomic data. Biostatistics 2022; 23:926-948. [PMID: 33720330 PMCID: PMC9608647 DOI: 10.1093/biostatistics/kxab006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/13/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 11/12/2022] Open
Abstract
In light of the low signal-to-noise nature of many large biological data sets, we propose a novel method to learn the structure of association networks using Gaussian graphical models combined with prior knowledge. Our strategy includes two parts. In the first part, we propose a model selection criterion called structural Bayesian information criterion, in which the prior structure is modeled and incorporated into Bayesian information criterion. It is shown that the popular extended Bayesian information criterion is a special case of structural Bayesian information criterion. In the second part, we propose a two-step algorithm to construct the candidate model pool. The algorithm is data-driven and the prior structure is embedded into the candidate model automatically. Theoretical investigation shows that under some mild conditions structural Bayesian information criterion is a consistent model selection criterion for high-dimensional Gaussian graphical model. Simulation studies validate the superiority of the proposed algorithm over the existing ones and show the robustness to the model misspecification. Application to relative concentration data from infant feces collected from subjects enrolled in a large molecular epidemiological cohort study validates that metabolic pathway involvement is a statistically significant factor for the conditional dependence between metabolites. Furthermore, new relationships among metabolites are discovered which can not be identified by the conventional methods of pathway analysis. Some of them have been widely recognized in biological literature.
Affiliation(s)
- Jie Zhou
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, 3 Rope Ferry Road, Hanover, NH 03755, USA
- Anne G Hoen
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA and Department of Epidemiology, Geisel School of Medicine, Dartmouth College, 3 Rope Ferry Road, Hanover, NH 03755, USA
- Susan Mcritchie
- Nutrition Research Institute, Department of Nutrition, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, 500 Laureate Way, Kannapolis, NC 28081, USA
- Wimal Pathmasiri
- Nutrition Research Institute, Department of Nutrition, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, 500 Laureate Way, Kannapolis, NC 28081, USA
- Weston D Viles
- Department of Mathematics and Statistics, University of Southern Maine, 96 Falmouth St, Portland, ME 04103, USA
- Quang P Nguyen
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA and Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
- Juliette C Madan
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
- Erika Dade
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
- Margaret R Karagas
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
- Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
3
Frommlet F, Szulc P, König F, Bogdan M. Selecting predictive biomarkers from genomic data. PLoS One 2022; 17:e0269369. [PMID: 35709188 PMCID: PMC9202896 DOI: 10.1371/journal.pone.0269369] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/29/2021] [Accepted: 05/13/2022] [Indexed: 11/18/2022] Open
Abstract
Recently there have been tremendous efforts to develop statistical procedures which allow to determine subgroups of patients for which certain treatments are effective. This article focuses on the selection of prognostic and predictive genetic biomarkers based on a relatively large number of candidate Single Nucleotide Polymorphisms (SNPs). We consider models which include prognostic markers as main effects and predictive markers as interaction effects with treatment. We compare different high-dimensional selection approaches including adaptive lasso, a Bayesian adaptive version of the Sorted L-One Penalized Estimator (SLOBE) and a modified version of the Bayesian Information Criterion (mBIC2). These are compared with classical multiple testing procedures for individual markers. Having identified predictive markers we consider several different approaches how to specify subgroups susceptible to treatment. Our main conclusion is that selection based on mBIC2 and SLOBE has similar predictive performance as the adaptive lasso while including substantially fewer biomarkers.
Affiliation(s)
- Florian Frommlet
- Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Piotr Szulc
- Institute of Mathematics, University of Wroclaw, Wroclaw, Poland
- Franz König
- Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Malgorzata Bogdan
- Institute of Mathematics, University of Wroclaw, Wroclaw, Poland
- Department of Statistics, Lund University, Lund, Sweden
4
Wallin J, Bogdan M, Szulc PA, Doerge RW, Siegmund DO. Ghost QTL and hotspots in experimental crosses: novel approach for modeling polygenic effects. Genetics 2021; 217:6067404. [PMID: 33789342 DOI: 10.1093/genetics/iyaa041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2020] [Accepted: 12/10/2020] [Indexed: 11/14/2022] Open
Abstract
Ghost quantitative trait loci (QTL) are the false discoveries in QTL mapping, that arise due to the "accumulation" of the polygenic effects, uniformly distributed over the genome. The locations on the chromosome that are strongly correlated with the total of the polygenic effects depend on a specific sample correlation structure determined by the genotypes at all loci. The problem is particularly severe when the same genotypes are used to study multiple QTL, e.g. using recombinant inbred lines or studying the expression QTL. In this case, the ghost QTL phenomenon can lead to false hotspots, where multiple QTL show apparent linkage to the same locus. We illustrate the problem using the classic backcross design and suggest that it can be solved by the application of the extended mixed effect model, where the random effects are allowed to have a nonzero mean. We provide formulas for estimating the thresholds for the corresponding t-test statistics and use them in the stepwise selection strategy, which allows for a simultaneous detection of several QTL. Extensive simulation studies illustrate that our approach eliminates ghost QTL/false hotspots, while preserving a high power of true QTL detection.
Affiliation(s)
- Jonas Wallin
- Department of Statistics, Lund University, 220 07 Lund, Sweden
- Małgorzata Bogdan
- Department of Statistics, Lund University, 220 07 Lund, Sweden; Department of Mathematics, Institute of Mathematics, University of Wroclaw, 50-137 Wroclaw, Poland
- Piotr A Szulc
- Department of Mathematics, Institute of Mathematics, University of Wroclaw, 50-137 Wroclaw, Poland
- R W Doerge
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- David O Siegmund
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
5
Szulc P, Bogdan M, Frommlet F, Tang H. Joint genotype- and ancestry-based genome-wide association studies in admixed populations. Genet Epidemiol 2017; 41:555-566. [PMID: 28657151 DOI: 10.1002/gepi.22056] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 07/25/2016] [Revised: 04/01/2017] [Accepted: 04/25/2017] [Indexed: 12/21/2022]
Abstract
In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.
Affiliation(s)
- Piotr Szulc
- Faculty of Mathematics, Wroclaw University of Technology, Wroclaw, Poland
- Malgorzata Bogdan
- Faculty of Mathematics and Computer Science, University of Wroclaw, Wroclaw, Poland
- Florian Frommlet
- Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Hua Tang
- Departments of Genetics and Statistics, Stanford University, Stanford, California, United States of America
6
Huang A, Xu S, Cai X. Empirical Bayesian elastic net for multiple quantitative trait locus mapping. Heredity (Edinb) 2014; 114:107-15. [PMID: 25204301 DOI: 10.1038/hdy.2014.79] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 11/22/2013] [Revised: 06/27/2014] [Accepted: 07/04/2014] [Indexed: 01/21/2023] Open
Abstract
In multiple quantitative trait locus (QTL) mapping, a high-dimensional sparse regression model is usually employed to account for possible multiple linked QTLs. The QTL model may include closely linked and thus highly correlated genetic markers, especially when high-density marker maps are used in QTL mapping because of the advancement in sequencing technology. Although existing algorithms, such as Lasso, empirical Bayesian Lasso (EBlasso) and elastic net (EN) are available to infer such QTL models, more powerful methods are highly desirable to detect more QTLs in the presence of correlated QTLs. We developed a novel empirical Bayesian EN (EBEN) algorithm for multiple QTL mapping that inherits the efficiency of our previously developed EBlasso algorithm. Simulation results demonstrated that EBEN provided higher power of detection and almost the same false discovery rate compared with EN and EBlasso. Particularly, EBEN can identify correlated QTLs that the other two algorithms may fail to identify. When analyzing a real dataset, EBEN detected more effects than EN and EBlasso. EBEN provides a useful tool for inferring high-dimensional sparse model in multiple QTL mapping and other applications. An R software package 'EBEN' implementing the EBEN algorithm is available on the Comprehensive R Archive Network (CRAN).
Affiliation(s)
- A Huang
- Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, USA
- S Xu
- Department of Botany and Plant Sciences, University of California, Riverside, CA, USA
- X Cai
- Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, USA
7
Malina M, Ickstadt K, Schwender H, Posch M, Bogdan M. Detection of epistatic effects with logic regression and a classical linear regression model. Stat Appl Genet Mol Biol 2014; 13:83-104. [PMID: 24413217 DOI: 10.1515/sagmb-2013-0028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
To locate multiple interacting quantitative trait loci (QTL) influencing a trait of interest within experimental populations, usually methods as the Cockerham's model are applied. Within this framework, interactions are understood as the part of the joined effect of several genes which cannot be explained as the sum of their additive effects. However, if a change in the phenotype (as disease) is caused by Boolean combinations of genotypes of several QTLs, this Cockerham's approach is often not capable to identify them properly. To detect such interactions more efficiently, we propose a logic regression framework. Even though with the logic regression approach a larger number of models has to be considered (requiring more stringent multiple testing correction) the efficient representation of higher order logic interactions in logic regression models leads to a significant increase of power to detect such interactions as compared to a Cockerham's approach. The increase in power is demonstrated analytically for a simple two-way interaction model and illustrated in more complex settings with simulation study and real data analysis.
8
Frommlet F, Ruhaltinger F, Twaróg P, Bogdan M. Modified versions of Bayesian Information Criterion for genome-wide association studies. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2011.05.005] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 10/18/2022]
9
Żak-Szatkowska M, Bogdan M. Modified versions of the Bayesian Information Criterion for sparse Generalized Linear Models. Comput Stat Data Anal 2011. [DOI: 10.1016/j.csda.2011.04.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 10/18/2022]
10
Statistical properties of QTL linkage mapping in biparental genetic populations. Heredity (Edinb) 2010; 105:257-67. [PMID: 20461101 DOI: 10.1038/hdy.2010.56] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.5] [Indexed: 11/08/2022] Open
Abstract
Quantitative trait gene or locus (QTL) mapping is routinely used in genetic analysis of complex traits. Especially in practical breeding programs, questions remain such as how large a population and what level of marker density are needed to detect QTLs that are useful to breeders, and how likely it is that the target QTL will be detected with the data set in hand. Some answers can be found in studies on conventional interval mapping (IM). However, it is not clear whether the conclusions obtained from IM are the same as those obtained using other methods. Inclusive composite interval mapping (ICIM) is a useful step forward that highlights the importance of model selection and interval testing in QTL linkage mapping. In this study, we investigate the statistical properties of ICIM compared with IM through simulation. Results indicate that IM is less responsive to marker density and population size (PS). The increase in marker density helps ICIM identify independent QTLs explaining >5% of phenotypic variance. When PS is >200, ICIM achieves unbiased estimations of QTL position and effect. For smaller PS, there is a tendency for the QTL to be located toward the center of the chromosome, with its effect overestimated. The use of dense markers makes linked QTL isolated by empty marker intervals and thus improves mapping efficiency. However, only large-sized populations can take advantage of densely distributed markers. These findings are different from those previously found in IM, indicating great improvements with ICIM.
11
Tag SNP selection based on clustering according to dominant sets found using replicator dynamics. ADV DATA ANAL CLASSI 2010. [DOI: 10.1007/s11634-010-0059-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 12/23/2022]
12
Zhang M, Zhang D, Wells MT. Variable selection for large p small n regression models with incomplete data: mapping QTL with epistases. BMC Bioinformatics 2008; 9:251. [PMID: 18510743 PMCID: PMC2435550 DOI: 10.1186/1471-2105-9-251] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 11/26/2007] [Accepted: 05/29/2008] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND: Identifying quantitative trait loci (QTL) for both additive and epistatic effects raises the statistical issue of selecting variables from a large number of candidates using a small number of observations. Missing trait and/or marker values prevent one from directly applying the classical model selection criteria such as Akaike's information criterion (AIC) and Bayesian information criterion (BIC). RESULTS: We propose a two-step Bayesian variable selection method which deals with the sparse parameter space and the small sample size issues. The regression coefficient priors are flexible enough to incorporate the characteristic of "large p small n" data. Specifically, sparseness and possible asymmetry of the significant coefficients are dealt with by developing a Gibbs sampling algorithm to stochastically search through low-dimensional subspaces for significant variables. The superior performance of the approach is demonstrated via simulation study. We also applied it to real QTL mapping datasets. CONCLUSION: The two-step procedure coupled with Bayesian classification offers flexibility in modeling "large p small n" data, especially for the sparse and asymmetric parameter space. This approach can be extended to other settings characterized by high dimension and low sample size.
Affiliation(s)
- Min Zhang
- Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.