251
Wu Y, Cook RJ. Penalized regression for interval-censored times of disease progression: Selection of HLA markers in psoriatic arthritis. Biometrics 2015; 71:782-91. [DOI: 10.1111/biom.12302] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 09/01/2013] [Revised: 12/01/2014] [Accepted: 02/01/2015] [Indexed: 10/23/2022]
Affiliation(s)
- Ying Wu
  - Department of Statistics and Actuarial Science; University of Waterloo; Waterloo, Ontario, Canada N2L 3G1
- Richard J. Cook
  - Department of Statistics and Actuarial Science; University of Waterloo; Waterloo, Ontario, Canada N2L 3G1
252
253
Tessier A, Bertrand J, Chenel M, Comets E. Comparison of Nonlinear Mixed Effects Models and Noncompartmental Approaches in Detecting Pharmacogenetic Covariates. AAPS Journal 2015; 17:597-608. [PMID: 25693489 DOI: 10.1208/s12248-015-9726-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/01/2014] [Accepted: 01/28/2015] [Indexed: 11/30/2022]
Abstract
Genetic data is now collected in many clinical trials, especially in population pharmacokinetic studies. There is no consensus on methods to test the association between pharmacokinetics and genetic covariates. We performed a simulation study inspired by real clinical trials, using the pharmacokinetics (PK) of a compound under development having a nonlinear bioavailability along with genotypes for 176 single nucleotide polymorphisms (SNPs). Scenarios included 78 subjects extensively sampled (16 observations per subject) to simulate a phase I study, or 384 subjects with the same rich design. Under the alternative hypothesis (H1), six SNPs were drawn randomly to affect the log-clearance under an additive linear model. For each scenario, 200 PK data sets were simulated under the null hypothesis (no gene effect) and H1. We compared 16 combinations of four association tests (a stepwise procedure and three penalised regressions: ridge regression, Lasso and HyperLasso) applied to four pharmacokinetic phenotypes (two observed concentrations, area under the curve estimated by noncompartmental analysis, and model-based clearance). The different combinations were compared in terms of true and false positives and the probability of detecting the genetic effects. In the presence of nonlinearity and/or variability in bioavailability, the model-based phenotype gave a higher probability of detecting the SNPs than the other phenotypes. In a realistic setting with a limited number of subjects, all methods showed a low ability to detect genetic effects. Ridge regression had the best probability of detecting SNPs, but also a higher number of false positives. No association test showed a much higher power than the others.
Affiliation(s)
- Adrien Tessier
  - INSERM, IAME, UMR 1137, Faculté de médecine Paris Diderot Paris 7 - site Bichat, 16 rue Henri Huchard, 75018, Paris, France
254
Abstract
In plant and animal breeding studies a distinction is made between the genetic value (additive plus epistatic genetic effects) and the breeding value (additive genetic effects) of an individual since it is expected that some of the epistatic genetic effects will be lost due to recombination. In this article, we argue that the breeder can take advantage of the epistatic marker effects in regions of low recombination. The models introduced here aim to estimate local epistatic line heritability by using genetic map information and combining local additive and epistatic effects. To this end, we have used semiparametric mixed models with multiple local genomic relationship matrices with hierarchical designs. Elastic-net postprocessing was used to introduce sparsity. Our models produce good predictive performance along with useful explanatory information.
255
McNeish DM. Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences. Multivariate Behavioral Research 2015; 50:471-84. [PMID: 26610247 DOI: 10.1080/00273171.2015.1036965] [Citation(s) in RCA: 171] [Impact Index Per Article: 19.0] [Indexed: 05/16/2023]
Abstract
Ordinary least squares and stepwise selection are widespread in behavioral science research; however, these methods are well known to encounter overfitting problems such that R² and regression coefficients may be inflated while standard errors and p values may be deflated, ultimately reducing both the parsimony of the model and the generalizability of conclusions. Better methods for selecting predictors and estimating regression coefficients, such as regularization methods (e.g., Lasso), have existed for decades, are widely implemented in other disciplines, and are available in mainstream software, yet these methods are essentially invisible in the behavioral science literature while the use of suboptimal methods continues to proliferate. This paper discusses potential issues with standard statistical models, provides an introduction to regularization with specific details on both Lasso and its related predecessor ridge regression, provides an example analysis and code for running a Lasso analysis in R and SAS, and discusses limitations and related methods.
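The contrast this abstract draws between ordinary least squares and the Lasso can be illustrated with a minimal sketch. The paper's own example code is in R and SAS and is not reproduced here; the Python below is an independent illustration, and the function names (`soft_threshold`, `lasso_coordinate_descent`), the toy data, and the penalty value 0.5 are all assumptions made for this example. It minimises (1/(2n))·||y − Xb||² + λ·||b||₁ by cyclic coordinate descent, one common way to fit a Lasso, on centred data with no intercept.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1.

    X is a list of rows. No intercept is fitted, so the columns of X and
    the response y are assumed to be centred. lam = 0 gives least squares.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0      # correlation of predictor j with the partial residual
            col_sq = 0.0   # squared norm of predictor j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                col_sq += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (col_sq / n)
    return beta


# Hypothetical toy data: y is roughly 2 * (first predictor) plus small noise;
# the second predictor is irrelevant.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]

ols_fit = lasso_coordinate_descent(X, y, lam=0.0)    # approximately [2.0, 0.1]
lasso_fit = lasso_coordinate_descent(X, y, lam=0.5)  # approximately [1.5, 0.0]
print(ols_fit, lasso_fit)
```

With no penalty, the irrelevant second predictor picks up a small nonzero coefficient from the noise. With the penalty, that coefficient is set exactly to zero and the first coefficient is shrunk toward zero, which is the selection-plus-shrinkage behaviour the abstract refers to.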
Affiliation(s)
- Daniel M McNeish
  - Department of Human Development and Quantitative Methodology, University of Maryland, College Park
256
Wang Z, Gu Q, Ning Y, Liu H. High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality. Advances in Neural Information Processing Systems 2015; 28:2512-2520. [PMID: 28615917 PMCID: PMC5467221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose new inferential procedures for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions. Our theory is supported by thorough numerical results.
Affiliation(s)
- Zhaoran Wang
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
- Quanquan Gu
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
- Yang Ning
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
- Han Liu
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
257
258
Bühlmann P, van de Geer S. High-dimensional inference in misspecified linear models. Electron J Stat 2015. [DOI: 10.1214/15-ejs1041] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Indexed: 11/19/2022]
259
260
Truntzer C, Mostacci E, Jeannin A, Petit JM, Ducoroy P, Cardot H. Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data. BMC Bioinformatics 2014; 15:385. [PMID: 25432156 PMCID: PMC4261611 DOI: 10.1186/s12859-014-0385-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/10/2013] [Accepted: 11/12/2014] [Indexed: 12/02/2022]
Abstract
Background: The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies like mass spectrometry are commonly being used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles correspond to the recording of a large number of proteins, much larger than the number of individuals. These variables come in addition to or to complete classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction.

Results: To achieve this goal, simulated data as well as a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods meet the challenge of high-dimensional data and the selection of predictive markers by using penalization methods (Ridge, Lasso) and dimension reduction techniques (PLS), as well as a combination of both strategies through sparse PLS in the context of a binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the true predictive values. These comparisons were done on clinical-only models, mass-spectrometry-only models and combined models.

Conclusions: It was shown that models which combine both types of data can be more efficient than models that use only clinical or mass spectrometry data when the sample size of the dataset is large enough.
Affiliation(s)
- Caroline Truntzer
  - Proteomic Platform CLIPP, Centre Hospitalier Universitaire, Dijon, 21000, France; University of Burgundy, Dijon, 21000, France.
- Elise Mostacci
  - Proteomic Platform CLIPP, Centre Hospitalier Universitaire, Dijon, 21000, France; Institut de Mathématiques de Bourgogne, UMR CNRS 5584, Dijon, 21000, France; University of Burgundy, Dijon, 21000, France.
- Aline Jeannin
  - Proteomic Platform CLIPP, Centre Hospitalier Universitaire, Dijon, 21000, France; University of Burgundy, Dijon, 21000, France.
- Jean-Michel Petit
  - Service Endocrinologie, Centre Hospitalier Universitaire, Dijon, 21000, France.
- Patrick Ducoroy
  - Proteomic Platform CLIPP, Centre Hospitalier Universitaire, Dijon, 21000, France; University of Burgundy, Dijon, 21000, France.
- Hervé Cardot
  - Institut de Mathématiques de Bourgogne, UMR CNRS 5584, Dijon, 21000, France; University of Burgundy, Dijon, 21000, France.
261
Abstract
We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem that can be implemented with off-the-shelf solvers, and illustrate their use on simulated and real data.
Affiliation(s)
- Samuel M Gross
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Robert Tibshirani
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA; Department of Health Research & Policy, Stanford University, Stanford, CA 94305, USA
262
Meinshausen N. Group bound: confidence intervals for groups of variables in sparse high dimensional regression without assumptions on the design. J R Stat Soc Series B Stat Methodol 2014. [DOI: 10.1111/rssb.12094] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
263
Affiliation(s)
- Michael B. Morrissey
  - School of Biology; University of St Andrews; Dyers Brae House St Andrews KY16 9TH UK
264
Functional multi-locus QTL mapping of temporal trends in Scots pine wood traits. G3: Genes, Genomes, Genetics 2014; 4:2365-79. [PMID: 25305041 PMCID: PMC4267932 DOI: 10.1534/g3.114.014068] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 11/18/2022]
Abstract
Quantitative trait loci (QTL) mapping of wood properties in conifer species has focused on single time point measurements or on trait means based on heterogeneous wood samples (e.g., increment cores), thus ignoring systematic within-tree trends. In this study, functional QTL mapping was performed for a set of important wood properties in increment cores from a 17-yr-old Scots pine (Pinus sylvestris L.) full-sib family with the aim of detecting wood trait QTL for general intercepts (means) and for linear slopes by increasing cambial age. Two multi-locus functional QTL analysis approaches were proposed and their performances were compared on trait datasets comprising 2 to 9 time points, 91 to 455 individual tree measurements, and genotype datasets of amplified fragment length polymorphism (AFLP) and single nucleotide polymorphism (SNP) markers. The first method was a multilevel LASSO analysis whereby trend parameter estimation and QTL mapping were conducted consecutively; the second method was our Bayesian linear mixed model whereby trends and underlying genetic effects were estimated simultaneously. We also compared several different hypothesis testing methods under either the LASSO or the Bayesian framework to perform QTL inference. In total, five and four significant QTL were observed for the intercepts and slopes, respectively, across wood traits such as earlywood percentage, wood density, radial fiber width, and spiral grain angle. Four of these QTL were represented by candidate gene SNPs, thus providing promising targets for future research in QTL mapping and molecular function. Bayesian and LASSO methods both detected similar sets of QTL given datasets that comprised large numbers of individuals.
265