26. Huang PH. A penalized likelihood method for multi-group structural equation modelling. Br J Math Stat Psychol 2018; 71:499-522. PMID: 29500879. DOI: 10.1111/bmsp.12130.
Abstract
In the past two decades, statistical modelling with sparsity has become an active research topic in the fields of statistics and machine learning. Recently, Huang, Chen and Weng (2017, Psychometrika, 82, 329) and Jacobucci, Grimm, and McArdle (2016, Structural Equation Modeling: A Multidisciplinary Journal, 23, 555) both proposed sparse estimation methods for structural equation modelling (SEM). These methods, however, are restricted to performing single-group analysis. The aim of the present work is to establish a penalized likelihood (PL) method for multi-group SEM. Our proposed method decomposes each group model parameter into a common reference component and a group-specific increment component. By penalizing the increment components, the heterogeneity of parameter values across the population can be explored, since the null group-specific effects are expected to diminish. We developed an expectation-conditional maximization algorithm to optimize the PL criterion. A numerical experiment and a real data example are presented to demonstrate the potential utility of the proposed method.
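The reference-plus-increment decomposition can be sketched in a toy setting. The snippet below is illustrative only, not the paper's ECM algorithm for SEM: hypothetical group means are written as a common reference plus lasso-penalized group-specific increments, alternating a soft-thresholding step for the increments with an update of the reference (group values and the penalty 0.3 are made up for illustration).

```python
def soft_threshold(z, lam):
    """Lasso proximal step: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_reference_and_increments(group_means, lam, iters=200):
    """Toy reference-plus-increment fit: alternate updating the common
    reference with soft-thresholding the group-specific increments."""
    ref = sum(group_means) / len(group_means)
    deltas = [0.0] * len(group_means)
    for _ in range(iters):
        deltas = [soft_threshold(m - ref, lam) for m in group_means]
        ref = sum(m - d for m, d in zip(group_means, deltas)) / len(group_means)
    return ref, deltas

ref, deltas = fit_reference_and_increments([1.0, 1.1, 2.5], lam=0.3)
```

The two similar groups receive exactly zero increments and collapse onto the shared reference (about 1.2), while the deviating group keeps a nonzero increment (about 1.0), mirroring how null group-specific effects are expected to vanish under the penalty.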
27. He Y, Lin H, Tu D. A single-index threshold Cox proportional hazard model for identifying a treatment-sensitive subset based on multiple biomarkers. Stat Med 2018; 37:3267-3279. PMID: 29869381. DOI: 10.1002/sim.7837.
Abstract
In this paper, we introduce a single-index threshold Cox proportional hazard model to select and combine biomarkers to identify patients who may be sensitive to a specific treatment. A penalized smoothed partial likelihood is proposed to estimate the parameters in the model. A simple, efficient, and unified algorithm is presented to maximize this likelihood function. The estimators based on this likelihood function are shown to be consistent and asymptotically normal. Under mild conditions, the proposed estimators also achieve the oracle property. The proposed approach is evaluated through simulation analyses and application to the analysis of data from two clinical trials, one involving patients with locally advanced or metastatic pancreatic cancer and one involving patients with resectable lung cancer.
28. Heinze G, Wallisch C, Dunkler D. Variable selection - A review and recommendations for the practicing statistician. Biom J 2018; 60:431-449. PMID: 29292533. PMCID: PMC5969114. DOI: 10.1002/bimj.201700067.
Abstract
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and we are often confronted with 10-30 candidate variables, a number often too large for all of them to be considered in one statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on the application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process that should be routinely reported by software packages offering automated variable selection algorithms.
29. Choi Y, Coram M, Peng J, Tang H. A Poisson Log-Normal Model for Constructing Gene Covariation Network Using RNA-seq Data. J Comput Biol 2017; 24:721-731. PMID: 28557607. PMCID: PMC5510689. DOI: 10.1089/cmb.2017.0053.
Abstract
Constructing expression networks using transcriptomic data is an effective approach for studying gene regulation. A popular approach for constructing such a network is based on the Gaussian graphical model (GGM), in which an edge between a pair of genes indicates that the expression levels of these two genes are conditionally dependent, given the expression levels of all other genes. However, GGMs are not appropriate for non-Gaussian data, such as those generated in RNA-seq experiments. We propose a novel statistical framework that maximizes a penalized likelihood, in which the observed count data follow a Poisson log-normal distribution. To overcome the computational challenges, we use Laplace's method to approximate the likelihood and its gradients, and apply the alternating directions method of multipliers to find the penalized maximum likelihood estimates. The proposed method is evaluated and compared with GGMs using both simulated and real RNA-seq data. The proposed method shows improved performance in detecting edges that represent covarying pairs of genes, particularly for edges connecting low-abundant genes and edges around regulatory hubs.
30. Huang PH, Chen H, Weng LJ. A Penalized Likelihood Method for Structural Equation Modeling. Psychometrika 2017; 82:329-354. PMID: 28417228. DOI: 10.1007/s11336-017-9566-9.
Abstract
A penalized likelihood (PL) method for structural equation modeling (SEM) was proposed as a methodology for exploring the underlying relations among both observed and latent variables. Compared to the usual likelihood method, PL includes a penalty term to control the complexity of the hypothesized model. When the penalty level is appropriately chosen, the PL can yield an SEM model that balances the model goodness-of-fit and model complexity. In addition, the PL results in a sparse estimate that enhances the interpretability of the final model. The proposed method is especially useful when limited substantive knowledge is available for model specifications. The PL method can be also understood as a methodology that links the traditional SEM to the exploratory SEM (Asparouhov & Muthén in Struct Equ Model Multidiscipl J 16:397-438, 2009). An expectation-conditional maximization algorithm was developed to maximize the PL criterion. The asymptotic properties of the proposed PL were also derived. The performance of PL was evaluated through a numerical experiment, and two real data illustrations were presented to demonstrate its utility in psychological research.
31.
Abstract
Finite mixture regression models have been widely used for modelling mixed regression relationships arising from a clustered and thus heterogeneous population. The classical normal mixture model, despite its simplicity and wide applicability, may fail in the presence of severe outliers. Using a sparse, case-specific, and scale-dependent mean-shift mixture model parameterization, we propose a robust mixture regression approach for simultaneously conducting outlier detection and robust parameter estimation. A penalized likelihood approach is adopted to induce sparsity among the mean-shift parameters so that the outliers are distinguished from the remainder of the data, and a generalized expectation-maximization (EM) algorithm is developed to perform stable and efficient computation. The proposed approach is shown to have strong connections with other robust methods, including the trimmed likelihood method and M-estimation approaches. In contrast to several existing methods, the proposed methods show outstanding performance in our simulation studies.
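The case-specific mean-shift parameterization can be illustrated in a miniature setting: give every observation its own shift parameter, penalize the shifts with a lasso penalty, and only observations that resist the fit (outliers) retain a nonzero shift. The sketch below uses a single-slope, no-intercept linear model on made-up data; it is a toy version of the idea, not the authors' EM algorithm for mixture regression.

```python
def soft_threshold(z, lam):
    """Lasso proximal step: shrink z toward zero by lam."""
    return max(z - lam, 0.0) + min(z + lam, 0.0)

def mean_shift_fit(x, y, lam, iters=200):
    """Alternate a least-squares slope update with soft-thresholding
    of the case-specific mean shifts gamma_i."""
    gamma = [0.0] * len(y)
    b = 0.0
    for _ in range(iters):
        # slope refit on the shift-corrected responses y_i - gamma_i
        b = sum(xi * (yi - gi) for xi, yi, gi in zip(x, y, gamma)) \
            / sum(xi * xi for xi in x)
        gamma = [soft_threshold(yi - b * xi, lam) for xi, yi in zip(x, y)]
    return b, gamma

x = [1, 2, 3, 4, 5]
y = [1.0, 2.1, 2.9, 4.2, 10.0]   # last point is a gross outlier
b, gamma = mean_shift_fit(x, y, lam=1.0)
```

The outlier is absorbed by its own shift (about 3.05) while all other shifts stay exactly zero, so the slope settles near 1.19; ordinary least squares on the same data would be dragged up to about 1.47 by the contaminated point.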
32. Wangerin KA, Ahn S, Wollenweber S, Ross SG, Kinahan PE, Manjeshwar RM. Evaluation of lesion detectability in positron emission tomography when using a convergent penalized likelihood image reconstruction method. J Med Imaging (Bellingham) 2016; 4:011002. PMID: 27921073. DOI: 10.1117/1.jmi.4.1.011002.
Abstract
We have previously developed a convergent penalized likelihood (PL) image reconstruction algorithm using the relative difference prior (RDP) and showed that it achieves more accurate lesion quantitation compared to ordered subsets expectation maximization (OSEM). We evaluated the detectability of low-contrast liver and lung lesions using the PL-RDP algorithm compared to OSEM. We performed a two-alternative forced choice study using a channelized Hotelling observer model that was previously validated against human observers. Lesion detectability showed a stronger dependence on lesion size for PL-RDP than OSEM. Lesion detectability was improved using time-of-flight (TOF) reconstruction, with greater benefit for the liver compared to the lung and with increasing benefit for decreasing lesion size and contrast. PL detectability was statistically significantly higher than OSEM for 20 mm liver lesions when contrast was [Formula: see text] ([Formula: see text]), and TOF PL detectability was statistically significantly higher than TOF OSEM for 15 and 20 mm liver lesions with contrast [Formula: see text] and [Formula: see text], respectively. For all other cases, there was no statistically significant difference between PL and OSEM ([Formula: see text]). For the range of studied lesion properties, lesion detectability using PL-RDP was equivalent or improved compared to using OSEM.
33.
Abstract
Survival data with ultrahigh dimensional covariates such as genetic markers have been collected in medical studies and other fields. In this work, we propose a feature screening procedure for the Cox model with ultrahigh dimensional covariates. The proposed procedure is distinguished from the existing sure independence screening (SIS) procedures (Fan, Feng and Wu, 2010; Zhao and Li, 2012) in that the proposed procedure is based on the joint likelihood of potential active predictors, and therefore is not a marginal screening procedure. The proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response without performing an iterative procedure. We develop a computationally effective algorithm to carry out the proposed procedure and establish the ascent property of the proposed algorithm. We further prove that the proposed procedure possesses the sure screening property: that is, with probability tending to one, the selected variable set includes the actual active predictors. We conduct Monte Carlo simulations to evaluate the finite sample performance of the proposed procedure and further compare the proposed procedure and existing SIS procedures. The proposed methodology is also demonstrated through an empirical analysis of a real data example.
34. Barber RF, Sidky EY. MOCCA: Mirrored Convex/Concave Optimization for Nonconvex Composite Functions. J Mach Learn Res 2016; 17:1-51. PMID: 29391859. PMCID: PMC5789814.
Abstract
Many optimization problems arising in high-dimensional statistics decompose naturally into a sum of several terms, where the individual terms are relatively simple but the composite objective function can only be optimized with iterative algorithms. In this paper, we are interested in optimization problems of the form F(Kx) + G(x), where K is a fixed linear transformation, while F and G are functions that may be nonconvex and/or nondifferentiable. In particular, if either of the terms is nonconvex, existing alternating minimization techniques may fail to converge; other types of existing approaches may instead be unable to handle nondifferentiability. We propose the MOCCA (mirrored convex/concave) algorithm, a primal/dual optimization approach that takes a local convex approximation to each term at every iteration. Inspired by optimization problems arising in computed tomography (CT) imaging, this algorithm can handle a range of nonconvex composite optimization problems, and offers theoretical guarantees for convergence when the overall problem is approximately convex (that is, any concavity in one term is balanced out by convexity in the other term). Empirical results show fast convergence for several structured signal recovery problems.
35. Ma S, Carroll RJ, Liang H, Xu S. Estimation and Inference in Generalized Additive Coefficient Models for Nonlinear Interactions with High-Dimensional Covariates. Ann Stat 2015; 43:2102-2131. PMID: 26412908. DOI: 10.1214/15-aos1344.
Abstract
In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423-1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the "large p small n" setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.
36. Greenland S, Mansournia MA. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Stat Med 2015; 34:3133-43. PMID: 26011599. DOI: 10.1002/sim.6537.
Abstract
Penalization is a very general method of stabilizing or regularizing estimates, which has both frequentist and Bayesian rationales. We consider some questions that arise when considering alternative penalties for logistic regression and related models. The most widely programmed penalty appears to be the Firth small-sample bias-reduction method (albeit with small differences among implementations and the results they provide), which corresponds to using the log density of the Jeffreys invariant prior distribution as a penalty function. The latter representation raises some serious contextual objections to the Firth reduction, which also apply to alternative penalties based on t-distributions (including Cauchy priors). Taking simplicity of implementation and interpretation as our chief criteria, we propose that the log-F(1,1) prior provides a better default penalty than other proposals. Penalization based on more general log-F priors is trivial to implement and facilitates mean-squared error reduction and sensitivity analyses of penalty strength by varying the number of prior degrees of freedom. We caution however against penalization of intercepts, which are unduly sensitive to covariate coding and design idiosyncrasies.
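The remark that log-F penalization is trivial to implement refers to a data-augmentation trick: a log-F(m,m) prior on a coefficient is equivalent to adding a pseudo-record with m/2 successes and m/2 failures in which only that covariate is nonzero. A minimal sketch for a single no-intercept coefficient follows, on hypothetical, completely separated data (where the unpenalized maximum likelihood estimate diverges); the scalar Newton solver is our own illustration, not code from the paper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def weighted_logistic_mle(x, y, w, iters=25):
    """Scalar Newton-Raphson for a one-covariate, no-intercept
    logistic model with observation weights."""
    beta = 0.0
    for _ in range(iters):
        p = [sigmoid(beta * xi) for xi in x]
        score = sum(wi * xi * (yi - pi)
                    for wi, xi, yi, pi in zip(w, x, y, p))
        info = sum(wi * xi * xi * pi * (1 - pi)
                   for wi, xi, pi in zip(w, x, p))
        beta += score / info
    return beta

# Completely separated toy data: the unpenalized MLE is infinite.
x = [1, 1, 1, -1, -1]
y = [1, 1, 1, 0, 0]
w = [1.0] * 5

# log-F(1,1) penalty on beta = one pseudo-success and one pseudo-failure,
# each with weight 1/2, at a record where only this covariate equals 1.
x_aug = x + [1, 1]
y_aug = y + [1, 0]
w_aug = w + [0.5, 0.5]

beta = weighted_logistic_mle(x_aug, y_aug, w_aug)
```

The augmented fit converges to a finite estimate (here beta = log 11, about 2.40). Raising the pseudo-record weights to m/2 gives the tunable log-F(m,m) family, which is how the paper's sensitivity analyses of penalty strength can be run in any weighted-regression software.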
37. Li Z, Liu H, Tu W. A sexually transmitted infection screening algorithm based on semiparametric regression models. Stat Med 2015; 34:2844-57. PMID: 25900920. DOI: 10.1002/sim.6515.
Abstract
Sexually transmitted infections (STIs) with Chlamydia trachomatis, Neisseria gonorrhoeae, and Trichomonas vaginalis are among the most common infectious diseases in the United States, disproportionately affecting young women. Because a significant portion of the infections present no symptoms, infection control relies primarily on disease screening. However, universal STI screening in a large population can be expensive. In this paper, we propose a semiparametric model-based screening algorithm. The model quantifies organism-specific infection risks in individual subjects and accounts for the within-subject interdependence of the infection outcomes of different organisms and the serial correlations among the repeated assessments of the same organism. Bivariate thin-plate regression spline surfaces are incorporated to depict the concurrent influences of age and sexual partners on infection acquisition. Model parameters are estimated by using a penalized likelihood method. For inference, we develop a likelihood-based resampling procedure to compare the bivariate effect surfaces across outcomes. Simulation studies are conducted to evaluate the model fitting performance. A screening algorithm is developed using data collected from an epidemiological study of young women at increased risk of STIs. We present evidence that the three organisms have distinct age and partner effect patterns; for C. trachomatis, the partner effect is more pronounced in younger adolescents. Predictive performance of the proposed screening algorithm is assessed through a receiver operating characteristic analysis. We show that the model-based screening algorithm has excellent accuracy in identifying individuals at increased risk, and thus can be used to assist STI screening in clinical practice.
38. Wang X. Firth logistic regression for rare variant association tests. Front Genet 2014; 5:187. PMID: 24995013. PMCID: PMC4063169. DOI: 10.3389/fgene.2014.00187.
39. Biard L, Porcher R, Resche-Rigon M. Permutation tests for centre effect on survival endpoints with application in an acute myeloid leukaemia multicentre study. Stat Med 2014; 33:3047-57. PMID: 24676752. DOI: 10.1002/sim.6153.
Abstract
When analysing multicentre data, it may be of interest to test whether the distribution of the endpoint varies among centres. In a mixed-effect model, testing for such a centre effect consists in testing to zero a random centre effect variance component. It has been shown that the usual asymptotic χ² distribution of the likelihood ratio and score statistics under the null does not necessarily hold. In the case of censored data, mixed-effects Cox models have been used to account for random effects, but few works have concentrated on testing to zero the variance component of the random effects. We propose a permutation test, using random permutation of the cluster indices, to test for a centre effect in multilevel censored data. Results from a simulation study indicate that the permutation tests have correct type I error rates, contrary to standard likelihood ratio tests, and are more powerful. The proposed tests are illustrated using data of a multicentre clinical trial of induction therapy in acute myeloid leukaemia patients.
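The permutation scheme itself is generic and easy to sketch: compute a between-centre statistic on the observed data, then recompute it after randomly reassigning observations to centres. The version below uses a simple variance-of-centre-means statistic on made-up uncensored data; the paper's setting (permuting cluster indices around a mixed-effects Cox model) follows the same template.

```python
import random

def centre_stat(values, sizes):
    """Variance of the centre means, given pooled values laid out
    centre by centre."""
    means, start = [], 0
    for s in sizes:
        means.append(sum(values[start:start + s]) / s)
        start += s
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / len(means)

def permutation_pvalue(values, sizes, n_perm=999, seed=1):
    rng = random.Random(seed)
    observed = centre_stat(values, sizes)
    hits = 0
    pool = list(values)
    for _ in range(n_perm):
        rng.shuffle(pool)          # randomly reassign centre membership
        if centre_stat(pool, sizes) >= observed:
            hits += 1
    # add-one correction so the p-value is never exactly zero
    return observed, (1 + hits) / (1 + n_perm)

# three hypothetical centres; the third is clearly shifted
values = [5.1, 4.9, 5.3, 5.0, 5.2, 1.0, 1.1, 0.9]
obs, p = permutation_pvalue(values, sizes=[3, 2, 3])
```

Because the third centre is far from the other two, permuted statistics rarely reach the observed one and the p-value comes out small, which is the behaviour driving the correct type I error and power findings reported above.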
40. A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics 2014; 197:257-71. PMID: 24532780. DOI: 10.1534/genetics.114.162263.
Abstract
We develop a maximum penalized-likelihood (MPL) method to estimate the fitnesses of amino acids and the distribution of selection coefficients (S = 2Ns) in protein-coding genes from phylogenetic data. This improves on a previous maximum-likelihood method. Various penalty functions are used to penalize extreme estimates of the fitnesses, thus correcting overfitting by the previous method. Using a combination of computer simulation and real data analysis, we evaluate the effect of the various penalties on the estimation of the fitnesses and the distribution of S. We show the new method regularizes the estimates of the fitnesses for small, relatively uninformative data sets, but it can still recover the large proportion of deleterious mutations when present in simulated data. Computer simulations indicate that as the number of taxa in the phylogeny or the level of sequence divergence increases, the distribution of S can be more accurately estimated. Furthermore, the strength of the penalty can be varied to study how informative a particular data set is about the distribution of S. We analyze three protein-coding genes (the chloroplast rubisco protein, mammal mitochondrial proteins, and an influenza virus polymerase) and show the new method recovers a large proportion of deleterious mutations in these data, even under strong penalties, confirming the distribution of S is bimodal in these real data. We recommend the use of the new MPL approach for the estimation of the distribution of S in species phylogenies of protein-coding genes.
41. Johnson VE. On Numerical Aspects of Bayesian Model Selection in High and Ultrahigh-dimensional Settings. Bayesian Anal 2013; 8:741-758. PMID: 24683431. PMCID: PMC3968919. DOI: 10.1214/13-ba818.
Abstract
This article examines the convergence properties of a Bayesian model selection procedure based on a non-local prior density in ultrahigh-dimensional settings. The performance of the model selection procedure is also compared to popular penalized likelihood methods. Coupling diagnostics are used to bound the total variation distance between iterates in a Markov chain Monte Carlo (MCMC) algorithm and the posterior distribution on the model space. In several simulation scenarios in which the number of observations exceeds 100, rapid convergence and high accuracy of the Bayesian procedure are demonstrated. Conversely, the coupling diagnostics are successful in diagnosing lack of convergence in several scenarios for which the number of observations is less than 100. The accuracy of the Bayesian model selection procedure in identifying high-probability models is shown to be comparable to commonly used penalized likelihood methods, including extensions of the smoothly clipped absolute deviation (SCAD) and least absolute shrinkage and selection operator (LASSO) procedures.
42. Lee D, Lee Y, Pawitan Y, Lee W. Sparse partial least-squares regression for high-throughput survival data analysis. Stat Med 2013; 32:5340-52. PMID: 24105836. DOI: 10.1002/sim.5975.
Abstract
The partial least-squares (PLS) method has been adapted to the Cox proportional hazards model for analyzing high-dimensional survival data. But because the latent components constructed in PLS employ all predictors regardless of their relevance, it is often difficult to interpret the results. In this paper, we propose a new formulation of a sparse PLS (SPLS) procedure for survival data to allow simultaneous sparse variable selection and dimension reduction. We develop a computing algorithm for SPLS by modifying an iteratively reweighted PLS algorithm and illustrate the method with the Swedish and the Netherlands Cancer Institute breast cancer datasets. Through the numerical studies, we find that our SPLS method generally performs better than the standard PLS and sparse Cox regression methods in variable selection and prediction.
43. Ghebremichael-Weldeselassie Y, Whitaker HJ, Farrington CP. Self-controlled case series method with smooth age effect. Stat Med 2013; 33:639-49. PMID: 24038284. DOI: 10.1002/sim.5949.
Abstract
The self-controlled case series method, commonly used to investigate potential associations between vaccines and adverse events, requires information on cases only and automatically controls all age-independent multiplicative confounders while allowing for an age-dependent baseline incidence. In the parametric version of the method, the age-specific relative incidence is modelled using a piecewise constant function, whereas in the semiparametric version it is left unspecified. However, mis-specification of age groups in the parametric version can lead to biased estimates of the exposure effect, and the semiparametric approach runs into computational problems when the number of cases in the study is moderately large. We thus propose to use a penalized likelihood approach in which the age effect is modelled using splines. We use a linear combination of cubic M-splines to approximate the age-specific relative incidence and integrated splines for the cumulative relative incidence. We conducted a simulation study to evaluate the performance of the new approach and its efficiency relative to the parametric and semiparametric approaches. Results show that the new approach performs equivalently to the existing methods when the sample size is small and works well for large data sets. We applied the new spline-based approach to data on febrile convulsions and paediatric vaccines.
44. Leffondré K, Touraine C, Helmer C, Joly P. Interval-censored time-to-event and competing risk with death: is the illness-death model more accurate than the Cox model? Int J Epidemiol 2013; 42:1177-86. PMID: 23900486. DOI: 10.1093/ije/dyt126.
Abstract
BACKGROUND: In survival analyses of longitudinal data, death is often a competing event for the disease of interest, and the time-to-disease onset is interval-censored when the diagnosis is made at intermittent follow-up visits. As a result, the disease status at death is unknown for subjects disease-free at the last visit before death. Standard survival analysis consists in right-censoring the time-to-disease onset at that visit, which may induce an underestimation of the disease incidence. By contrast, an illness-death model for interval-censored data accounts for the probability of developing the disease between that visit and death, and provides a better incidence estimate. However, the two approaches have never been compared for estimating the effect of exposure on disease risk.
METHODS: This paper compares through simulations the accuracy of the effect estimates from a semi-parametric illness-death model for interval-censored data and the standard Cox model. The approaches are also compared for estimating the effects of selected risk factors on the risk of dementia, using the French elderly PAQUID cohort data.
RESULTS: The illness-death model provided a more accurate effect estimate of exposures that also affected mortality. The direction and magnitude of the bias from the Cox model depended on the effects of the exposure on disease and death. The application to the PAQUID cohort confirmed the simulation results.
CONCLUSION: If follow-up intervals are wide and the exposure has an impact on death, then the illness-death model for interval-censored data should be preferred to the standard Cox regression analysis.
45. Ayers KL, Cordell HJ. Identification of grouped rare and common variants via penalized logistic regression. Genet Epidemiol 2013; 37:592-602. PMID: 23836590. PMCID: PMC3842118. DOI: 10.1002/gepi.21746.
Abstract
In spite of the success of genome-wide association studies in finding many common variants associated with disease, these variants seem to explain only a small proportion of the estimated heritability. Data collection has turned toward exome and whole genome sequencing, but it is well known that single marker methods frequently used for common variants have low power to detect rare variants associated with disease, even with very large sample sizes. In response, a variety of methods have been developed that attempt to cluster rare variants so that they may gather strength from one another under the premise that there may be multiple causal variants within a gene. Most of these methods group variants by gene or proximity, and test one gene or marker window at a time. We propose a penalized regression method (PeRC) that analyzes all genes at once, allowing grouping of all (rare and common) variants within a gene, along with subgrouping of the rare variants, thus borrowing strength from both rare and common variants within the same gene. The method can incorporate either a burden-based weighting of the rare variants or one in which the weights are data driven. In simulations, our method performs favorably when compared to many previously proposed approaches, including its predecessor, the sparse group lasso [Friedman et al., 2010].
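The grouping machinery described above rests on the group-lasso soft-thresholding (proximal) operator: a whole block of coefficients is shrunk by its Euclidean norm, and set to zero jointly when that norm falls below the penalty. A minimal sketch with made-up coefficient blocks (not the PeRC implementation):

```python
import math

def group_soft_threshold(v, lam):
    """Proximal operator of the group-lasso penalty lam * ||v||_2:
    shrink the whole block toward zero, or kill it entirely."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm <= lam:
        return [0.0] * len(v)          # the whole group is dropped
    scale = 1.0 - lam / norm
    return [scale * vi for vi in v]

# a gene whose variants carry real signal survives, shrunken ...
kept = group_soft_threshold([3.0, 4.0], lam=2.5)
# ... while a weak-signal gene is removed as a block
dropped = group_soft_threshold([0.3, 0.4], lam=2.5)
```

The sparse group lasso cited in the abstract adds a within-group L1 term on top of this block penalty, so individual rare variants can also be zeroed inside a gene that is retained as a whole.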
|
46
|
Tong X, Zhu L, Leng C, Leisenring W, Robison LL. A general semiparametric hazards regression model: efficient estimation and structure selection. Stat Med 2013; 32:4980-94. [PMID: 23824784 DOI: 10.1002/sim.5885] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2012] [Accepted: 05/28/2013] [Indexed: 11/06/2022]
Abstract
We consider a general semiparametric hazards regression model that encompasses the Cox proportional hazards model and the accelerated failure time model for survival analysis. To overcome the nonexistence of the maximum likelihood estimator, we derive a kernel-smoothed profile likelihood function and prove that the resulting estimates of the regression parameters are consistent and achieve semiparametric efficiency. In addition, we develop penalized structure selection techniques to determine which covariates constitute the accelerated failure time model and which covariates constitute the proportional hazards model. The proposed method is able to estimate the model structure consistently and the model parameters efficiently. Furthermore, variance estimation is straightforward. The proposed estimation performs well in simulation studies and is applied to the analysis of a real data set.
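For concreteness, one common parameterization of such a general hazards family — the Chen–Jewell form; the abstract does not state the authors' exact formulation — is:

```latex
% General hazard for covariate vector Z with parameters \beta, \gamma:
\lambda(t \mid Z) = \lambda_0\!\left(t\, e^{\beta^{\top} Z}\right) e^{\gamma^{\top} Z}
% \beta = 0       \Rightarrow Cox proportional hazards: \lambda_0(t)\, e^{\gamma^{\top} Z}
% \beta = \gamma  \Rightarrow accelerated failure time model
```

Structure selection then amounts to deciding, covariate by covariate, which of these constraints the data support.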
|
47
|
Adluru N, Hanlon BM, Lutz A, Lainhart JE, Alexander AL, Davidson RJ. Penalized likelihood phenotyping: unifying voxelwise analyses and multi-voxel pattern analyses in neuroimaging: penalized likelihood phenotyping. Neuroinformatics 2013; 11:227-47. [PMID: 23397550 PMCID: PMC3624987 DOI: 10.1007/s12021-012-9175-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
Abstract
Neuroimage phenotyping for psychiatric and neurological disorders is performed using voxelwise analyses, also known as voxel-based analyses or voxel-based morphometry (VBM). A typical voxelwise analysis treats measurements at each voxel (e.g., fractional anisotropy, gray matter probability) as outcome measures to study the effects of possible explanatory variables (e.g., age, group) in a linear regression setting. Furthermore, each voxel is treated independently until the stage of correction for multiple comparisons. Recently, multi-voxel pattern analyses (MVPA), such as classification, have arisen as an alternative to VBM. The main advantage of MVPA over VBM is that the former employ multivariate methods that can account for interactions among voxels in identifying significant patterns. They also provide ways for computer-aided diagnosis and prognosis at the individual-subject level. However, compared to VBM, the results of MVPA are often more difficult to interpret and prone to arbitrary conclusions. In this paper, first we use penalized likelihood modeling to provide a unified framework for understanding both VBM and MVPA. We then utilize statistical learning theory to provide practical methods for interpreting the results of MVPA beyond commonly used performance metrics, such as leave-one-out cross-validation accuracy and area under the receiver operating characteristic (ROC) curve. Additionally, we demonstrate that there are challenges in MVPA when trying to obtain image phenotyping information in the form of statistical parametric maps (SPMs), which are commonly obtained from VBM, and provide a bootstrap strategy as a potential solution for generating SPMs using MVPA. This technique also allows us to maximize the use of available training data. We illustrate the empirical performance of the proposed framework using two different neuroimaging studies that pose different levels of challenge for classification using MVPA.
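The bootstrap idea for turning multivariate classifier weights into a voxelwise map can be sketched as follows. This is a toy numpy version — a ridge-penalized linear decoder stands in for the classifier, and the z-like summary is only illustrative, not the authors' pipeline:

```python
import numpy as np

def bootstrap_weight_map(X, y, n_boot=200, alpha=1.0, seed=0):
    """Refit a penalized linear decoder on bootstrap resamples and
    summarize its per-voxel weights as a z-like map (an SPM stand-in)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample subjects
        Xb, yb = X[idx], y[idx]
        # Ridge regression on +/-1 labels: a simple multivariate
        # classifier whose penalty keeps the fit stable when p is large.
        W[b] = np.linalg.solve(Xb.T @ Xb + alpha * np.eye(p), Xb.T @ yb)
    return W.mean(axis=0) / (W.std(axis=0) + 1e-12)

# Toy "images": 10 voxels, only voxel 0 differs between the two groups.
rng = np.random.default_rng(2)
y = np.repeat([-1.0, 1.0], 100)
X = rng.standard_normal((200, 10))
X[:, 0] += y                       # group effect in voxel 0 only
z = bootstrap_weight_map(X, y)
print(np.argmax(np.abs(z)))        # the informative voxel dominates
```

Because every subject appears in many resamples, all of the training data contributes to the map, which is the "maximize the use of available training data" point made in the abstract.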
|
48
|
Fu P, Panneerselvam A, Clifford B, Dowlati A, Ma PC, Zeng G, Halmos B, Leidner RS. Simpson's paradox - aggregating and partitioning populations in health disparities of lung cancer patients. Stat Methods Med Res 2012; 24:937-48. [PMID: 22246415 DOI: 10.1177/0962280211434179] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
It is well known that non-small cell lung cancer (NSCLC) is a heterogeneous group of diseases. Previous studies have demonstrated genetic variation among different ethnic groups in the epidermal growth factor receptor (EGFR) in NSCLC. Research by our group and others has recently shown a lower frequency of EGFR mutations in African Americans with NSCLC, as compared to their White counterparts. In this study, we use our original study data of EGFR pathway genetics in African American NSCLC as an example to illustrate that univariate analyses based on aggregation versus partition of the data lead to contradictory results, in order to emphasize the importance of controlling statistical confounding. We further investigate analytic approaches in logistic regression for data with separation, as is the case in our example data set, and apply appropriate methods to identify predictors of EGFR mutation. Our simulation shows that with separated or nearly separated data, penalized maximum likelihood (PML) produces estimates with the smallest bias and approximately maintains the nominal type I error rate, with statistical power equal to or better than that of maximum likelihood and exact conditional likelihood methods. Application of the PML method in our example data set shows that race and EGFR-FISH are independently significant predictors of EGFR mutation.
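The aggregation-versus-partition reversal the abstract warns about is easy to reproduce with hypothetical counts. The numbers below are invented purely for illustration and are unrelated to the EGFR data:

```python
# Hypothetical (events, total) counts per group within each stratum.
strata = [
    {"A": (8, 10),   "B": (70, 100)},  # stratum 1: A 80% vs B 70%
    {"A": (20, 100), "B": (1, 10)},    # stratum 2: A 20% vs B 10%
]

def rate(events, total):
    return events / total

# Within every stratum, group A has the higher event rate...
for s in strata:
    assert rate(*s["A"]) > rate(*s["B"])

# ...yet after aggregating the strata, the comparison reverses,
# because the groups are unevenly distributed across strata.
agg = {g: (sum(s[g][0] for s in strata), sum(s[g][1] for s in strata))
       for g in ("A", "B")}
print(rate(*agg["A"]), rate(*agg["B"]))   # aggregate: B now "wins"
```

Group A is concentrated in the low-risk stratum, so the pooled comparison confounds group membership with stratum — which is why the abstract stresses controlling for confounding before interpreting aggregated rates.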
|
49
|
Rondeau V, Pignon JP, Michiels S. A joint model for the dependence between clustered times to tumour progression and deaths: A meta-analysis of chemotherapy in head and neck cancer. Stat Methods Med Res 2011; 24:711-29. [PMID: 22025414 DOI: 10.1177/0962280211425578] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The observation of time to tumour progression (TTP) or progression-free survival (PFS) may be terminated by a terminal event. In this context, deaths may be due to tumour progression, and the time to the major failure event (death) may be correlated with the TTP. The usual assumption of independence between the TTP process and death, required by many commonly used statistical methods, can be violated. Furthermore, although the relationship between TTP and time to death is most relevant to anti-cancer drug development and to the evaluation of TTP as a surrogate endpoint, statistical models that try to describe the dependence structure between these two endpoints are not frequently used. We propose a joint frailty model for the analysis of two survival endpoints, TTP and time to death, or PFS and time to death, in the context of data clustering (e.g. at the centre or trial level). This approach allows us to simultaneously evaluate the prognostic effects of covariates on the two survival endpoints, while accounting both for the relationship between the outcomes and for data clustering. We show how maximum penalized likelihood estimation can be applied to a nonparametric estimation of the continuous hazard functions in a general joint frailty model with right censoring and delayed entry. The model was motivated by a large meta-analysis of randomized trials for head and neck cancers (Meta-Analysis of Chemotherapy in Head and Neck Cancers), in which the efficacy of chemotherapy on TTP or PFS and overall survival was investigated, as an adjunct to surgery or radiotherapy or both.
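The dependence that a shared frailty induces between the two endpoints can be seen in a toy simulation. This is a numpy sketch of the frailty mechanism only — rates and gamma parameters are arbitrary, and no censoring, clustering, or penalized estimation is modelled:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
# One gamma frailty per subject (mean 1) multiplies BOTH hazards,
# so high-frailty subjects tend to progress and die early together.
w = rng.gamma(shape=6.0, scale=1.0 / 6.0, size=n)
ttp   = rng.exponential(1.0 / (0.2 * w))   # time to tumour progression
death = rng.exponential(1.0 / (0.1 * w))   # time to death
r = np.corrcoef(ttp, death)[0, 1]
print(f"correlation induced by the shared frailty: {r:.2f}")
```

Given the frailty the two times are independent, yet marginally they are positively correlated — exactly the violation of independent censoring that motivates modelling the endpoints jointly.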
|
50
|
Abstract
This paper reviews the literature on sparse high-dimensional models and discusses some applications in economics and finance. Recent developments in the theory, methods, and implementation of penalized least squares and penalized likelihood methods are highlighted. These variable selection methods have proved to be effective in high-dimensional sparse modelling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high-dimensional sparse modelling are also briefly discussed.
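A minimal sketch of the penalized least-squares machinery this review surveys is iterative soft-thresholding (ISTA) for the lasso in a p > n problem. The data and tuning values below are invented for illustration:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, iters=2000):
    """Iterative soft-thresholding for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2   # step = 1 / Lipschitz constant
    w = np.zeros(p)
    for _ in range(iters):
        w -= lr * (X.T @ (X @ w - y) / n)                       # gradient
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox
    return w

# p = 200 > n = 100, but only 3 coefficients are truly nonzero.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 200))
beta = np.zeros(200)
beta[:3] = [3.0, 2.0, -2.0]
y = X @ beta + 0.1 * rng.standard_normal(100)
w = lasso_ista(X, y, lam=0.2)
print(np.flatnonzero(np.abs(w) > 0.5))   # indices of large coefficients
```

Despite having twice as many variables as observations, the L1 penalty drives the irrelevant coefficients to (near) zero — the sparsity phenomenon that makes estimation feasible beyond the classical p < n regime.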
|