1
Khalili A, Shokoohi F, Asgharian M, Lin S. Sparse estimation in semiparametric finite mixture of varying coefficient regression models. Biometrics 2023; 79:3445-3457. [PMID: 37066855] [DOI: 10.1111/biom.13870]
Abstract
Finite mixtures of regression (FMR) models are commonly used to model heterogeneous effects of covariates on a response variable in settings with unknown underlying subpopulations. FMRs, however, cannot accommodate situations where covariates' effects also vary according to an "index" variable; models that can are known as finite mixtures of varying coefficient regression (FM-VCR). Although complex, this situation occurs in real data applications: the osteocalcin (OCN) data analyzed in this manuscript present a heterogeneous relationship in which the effect of a genetic variant on OCN in each hidden subpopulation varies over time. Oftentimes, the number of covariates with varying coefficients also presents a challenge: in the OCN study, genetic variants on the same chromosome are considered jointly. The relative proportions of hidden subpopulations may also change over time. Existing methods cannot accommodate all of these features in real data applications. To fill this gap, we develop statistical methodologies based on regularized local-kernel likelihood for simultaneous parameter estimation and variable selection in sparse FM-VCR models. We study large-sample properties of the proposed methods. We then carry out a simulation study to evaluate the performance of the various penalties adopted in our regularized approach and to assess the ability of a BIC-type criterion to estimate the number of subpopulations. Finally, we apply the FM-VCR model to the OCN data and identify several covariates, including genetic variants, that have age-dependent effects on OCN.
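As a concrete illustration of the mixture-of-regressions machinery that the FM-VCR model builds on, the following hedged sketch fits a plain two-component FMR by EM on synthetic data; all names and values are illustrative, and the varying-coefficient and penalization layers of the paper are omitted.

```python
# Minimal EM sketch for a two-component finite mixture of regressions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two hidden subpopulations with different slopes.
n = 400
x = rng.uniform(-2, 2, n)
z = rng.random(n) < 0.5                       # latent subpopulation labels
y = np.where(z, 2.0 * x, -1.0 * x) + rng.normal(0, 0.3, n)

# Parameters: mixing proportion pi1, slopes b1/b2, common noise sd sigma.
pi1, b1, b2, sigma = 0.5, 1.0, -0.5, 1.0
for _ in range(200):
    # E-step: posterior probability that each point belongs to component 1.
    d1 = pi1 * np.exp(-0.5 * ((y - b1 * x) / sigma) ** 2)
    d2 = (1 - pi1) * np.exp(-0.5 * ((y - b2 * x) / sigma) ** 2)
    w = d1 / (d1 + d2)
    # M-step: weighted least squares for each slope, then pi1 and sigma.
    b1 = np.sum(w * x * y) / np.sum(w * x ** 2)
    b2 = np.sum((1 - w) * x * y) / np.sum((1 - w) * x ** 2)
    pi1 = w.mean()
    resid2 = w * (y - b1 * x) ** 2 + (1 - w) * (y - b2 * x) ** 2
    sigma = np.sqrt(resid2.mean())

print(sorted([b1, b2]))   # slopes recovered near -1 and 2
```

In the paper's setting the slopes would themselves be smooth functions of an index variable, estimated by local-kernel likelihood with a sparsity penalty on the M-step.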
Affiliation(s)
- Abbas Khalili
- Department of Mathematics and Statistics, McGill University, Montreal, Quebec, Canada
- Farhad Shokoohi
- Department of Mathematical Sciences, University of Nevada Las Vegas, Las Vegas, Nevada, USA
- Masoud Asgharian
- Department of Mathematics and Statistics, McGill University, Montreal, Quebec, Canada
- Shili Lin
- Department of Statistics, Ohio State University, Columbus, Ohio, USA
2
Kalligeris EN, Karagrigoriou A, Parpoula C. On stochastic dynamic modeling of incidence data. Int J Biostat 2023:ijb-2021-0134. [PMID: 37118931] [DOI: 10.1515/ijb-2021-0134]
Abstract
In this paper, a Markov regime-switching model of conditional mean with covariates is proposed and investigated for the analysis of incidence rate data. The components of the model are selected by penalized likelihood techniques in conjunction with the expectation-maximization algorithm, with the goal of robustly modeling the dynamic behavior of epidemiological data. In addition to statistical inference, changepoint detection analysis is performed to select the number of regimes, which reduces the complexity associated with likelihood ratio tests. Within this framework, a three-phase procedure for modeling incidence data is proposed and tested on real and simulated data.
Affiliation(s)
- Emmanouil-Nektarios Kalligeris
- Laboratory of Mathematics Raphaël Salem, University of Rouen Normandy, Avenue de l'Université, BP. 12, 76801 Saint Étienne du Rouvray, Rouen, France
- Lab of Statistics and Data Analysis, University of the Aegean, 83200 Karlovasi, Samos, Greece
- Alex Karagrigoriou
- Lab of Statistics and Data Analysis, University of the Aegean, 83200 Karlovasi, Samos, Greece
- Graphic Era Deemed to be University, Dehradun, India
- Christina Parpoula
- Department of Psychology, Panteion University of Social and Political Sciences, 17671 Athens, Greece
3
Webb A, Ma J. Cox models with time-varying covariates and partly-interval censoring-A maximum penalised likelihood approach. Stat Med 2022; 42:815-833. [PMID: 36585040] [PMCID: PMC10107645] [DOI: 10.1002/sim.9645]
Abstract
Time-varying covariates can be important predictors when model-based predictions are considered. A Cox model that includes time-varying covariates is usually referred to as an extended Cox model. When only right censoring is present in the observed survival times, the conventional partial likelihood method is still applicable for estimating the regression coefficients of an extended Cox model. However, if there are interval-censored survival times, the partial likelihood method is not directly available unless an imputation, such as middle-point imputation, is used to replace the left- and interval-censored data; such imputation methods are well known to cause biases. This paper considers fitting extended Cox models using the maximum penalised likelihood method, allowing observed survival times to be partly interval-censored, where a penalty function is used to regularise the baseline hazard estimate. We present simulation studies to demonstrate the performance of the proposed method, and we illustrate the method with applications to two real datasets from medical research.
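The penalised-likelihood idea for interval-censored survival times can be sketched as follows. This is an illustrative toy, not the authors' implementation: a piecewise-constant baseline hazard with no covariates, a made-up smoothing value, and the interval-censored likelihood contribution S(L) - S(R).

```python
# Penalised likelihood for a piecewise-constant hazard under interval censoring.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
t_true = rng.exponential(2.0, 300)           # true constant hazard 0.5
keep = t_true < 10.0                         # drop the few times beyond the grid
left = np.floor(t_true[keep])                # event known only to lie in
right = left + 1.0                           # the interval (left, right]

edges = np.arange(0.0, 11.0)                 # ten unit-width bins on [0, 10]

def cum_hazard(log_h, t):
    # H(t) for a piecewise-constant hazard h_k: sum of bin overlaps times h_k.
    h = np.exp(log_h)
    widths = np.clip(t[:, None] - edges[:-1], 0.0, 1.0)
    return widths @ h

def neg_pen_loglik(log_h, lam=5.0):
    S_left = np.exp(-cum_hazard(log_h, left))
    S_right = np.exp(-cum_hazard(log_h, right))
    loglik = np.sum(np.log(S_left - S_right))         # interval-censored term
    penalty = lam * np.sum(np.diff(log_h) ** 2)       # roughness penalty
    return -loglik + penalty

res = minimize(neg_pen_loglik, np.zeros(10), method="L-BFGS-B")
h_hat = np.exp(res.x)
print(h_hat.round(2))   # estimated hazard per bin; early bins near 0.5
```

The roughness penalty plays the role the abstract describes: it regularises a baseline hazard that would otherwise be poorly identified in bins with few observations.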
Affiliation(s)
- Annabel Webb
- Department of Mathematics and Statistics, Macquarie University, Macquarie Park, New South Wales, Australia
- Jun Ma
- Department of Mathematics and Statistics, Macquarie University, Macquarie Park, New South Wales, Australia
4
Webb A, Ma J, Lô SN. Penalized likelihood estimation of a mixture cure Cox model with partly interval censoring-An application to thin melanoma. Stat Med 2022; 41:3260-3280. [PMID: 35474515] [PMCID: PMC9544451] [DOI: 10.1002/sim.9415]
Abstract
Time-to-event data in medical studies may involve some patients who are cured and will never experience the event of interest. In practice, those cured patients are right-censored. However, when data contain a cured fraction, standard survival methods such as Cox proportional hazards models can produce biased results and therefore misleading interpretations. In addition, for some outcomes, the exact time of an event is not known; instead, an interval of time in which the event occurred is recorded. This article proposes a new computational approach that can deal with both the cured-fraction issue and the interval-censoring challenge. To do so, we extend the traditional mixture cure Cox model to accommodate data with partly interval-censored observed event times. The traditional method for estimating the model parameters is based on the expectation-maximization (EM) algorithm, where the log-likelihood is maximized through an indirect complete-data log-likelihood function. We propose an alternative algorithm that directly optimizes the log-likelihood function. Extensive Monte Carlo simulations are conducted to demonstrate the performance of the new method over the EM algorithm. The main advantage of the new algorithm is the generation of asymptotic variance matrices for all the estimated parameters. The new method is applied to a thin melanoma dataset to predict melanoma recurrence. Various inferences, including survival and hazard function plots with point-wise confidence intervals, are presented. An R package is available on GitHub and will be uploaded to CRAN.
Affiliation(s)
- Annabel Webb
- Department of Mathematics and Statistics, Macquarie University, Sydney, New South Wales, Australia
- Jun Ma
- Department of Mathematics and Statistics, Macquarie University, Sydney, New South Wales, Australia
- Serigne N Lô
- Melanoma Institute Australia, The University of Sydney, North Sydney, New South Wales, Australia; Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia; Institute for Research and Medical Consultations (IRMC), Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
5
Wang Y, Lin L, Thompson CG, Chu H. A penalization approach to random-effects meta-analysis. Stat Med 2022; 41:500-516. [PMID: 34796539] [PMCID: PMC8792303] [DOI: 10.1002/sim.9261]
Abstract
Systematic reviews and meta-analyses are principal tools for synthesizing evidence from multiple independent sources in many research fields. The assessment of heterogeneity among collected studies is a critical step when performing a meta-analysis, given its influence on model selection and conclusions about treatment effects. A common-effect (CE) model is conventionally used when the studies are deemed homogeneous, while a random-effects (RE) model is used for heterogeneous studies. However, both models have limitations. For example, the CE model produces excessively narrow confidence intervals with low coverage probabilities when the collected studies have heterogeneous treatment effects. The RE model, on the other hand, assigns higher weights to small studies than the CE model does. In the presence of small-study effects or publication bias, the over-weighted small studies in an RE model can lead to substantially biased overall treatment effect estimates. In addition, outlying studies may exaggerate between-study heterogeneity. This article introduces penalization methods as a compromise between the CE and RE models. The proposed methods are motivated by the penalized likelihood approach, which is widely used in the current literature to control model complexity and reduce the variance of parameter estimates. We compare the existing and proposed methods on simulated data and several case studies to illustrate the benefits of the penalization methods.
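The compromise the abstract describes can be sketched with a toy objective (not the paper's exact penalty): study-specific effects theta_i are shrunk toward a common mean mu, with a tuning parameter lam bridging the study-specific fit (lam near 0) and the common-effect fit (lam large). All data values below are made up.

```python
# Toy penalized meta-analysis: shrink study effects toward a common mean.
import numpy as np

y = np.array([0.10, 0.30, 0.25, 0.80])   # study effect estimates
v = np.array([0.01, 0.02, 0.02, 0.20])   # within-study variances

def penalized_fit(lam, n_iter=500):
    # Alternately minimize sum((y - theta)^2 / v) + lam * sum((theta - mu)^2).
    mu = np.average(y, weights=1 / v)
    theta = y.copy()
    for _ in range(n_iter):
        theta = (y / v + lam * mu) / (1 / v + lam)   # closed-form theta update
        mu = theta.mean()                            # closed-form mu update
    return theta, mu

theta_small, _ = penalized_fit(lam=0.01)    # close to study-specific effects
theta_large, mu = penalized_fit(lam=1e6)    # close to a single common effect
print(np.ptp(theta_small) > np.ptp(theta_large))  # prints True
```

Choosing lam between the two extremes gives the intermediate estimators that the paper develops and tunes in a principled way.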
Affiliation(s)
- Yipeng Wang
- Department of Statistics, Florida State University, FL, USA; Department of Biostatistics, University of Florida, FL, USA
- Lifeng Lin
- Department of Statistics, Florida State University, FL, USA. Correspondence: Lifeng Lin, 411 OSB, 117 N Woodward Ave, Tallahassee, FL 32306, USA
- Haitao Chu
- Division of Biostatistics, University of Minnesota School of Public Health, MN, USA
6
Ren X, Jung JE, Zhu W, Lee SJ. Penalized-Likelihood PET Image Reconstruction Using Similarity-Driven Median Regularization. Tomography 2022; 8:158-174. [PMID: 35076630] [PMCID: PMC8788485] [DOI: 10.3390/tomography8010013]
Abstract
In this paper, we present a new regularized image reconstruction method for positron emission tomography (PET), where an adaptive weighted median regularizer is used in the context of a penalized-likelihood framework. The motivation of our work is to overcome the limitation of the conventional median regularizer, which has proven useful for tomographic reconstruction but suffers from the negative effect of removing fine details in the underlying image when the edges occupy less than half of the window elements. The crux of our method is inspired by the well-known non-local means denoising approach, which exploits the measure of similarity between the image patches for weighted smoothing. However, our method is different from the non-local means denoising approach in that the similarity measure between the patches is used for the median weights rather than for the smoothing weights. As the median weights, in this case, are spatially variant, they provide adaptive median regularization achieving high-quality reconstructions. The experimental results indicate that our similarity-driven median regularization method not only improves the reconstruction accuracy, but also has great potential for super-resolution reconstruction for PET.
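The core operation described above, a median whose weights come from patch similarity (as in non-local means) rather than from smoothing, can be sketched in a few lines. The 1-pixel "patches" and bandwidth below are toy choices for illustration only.

```python
# Similarity-weighted median: patch-similarity weights drive a weighted median.
import numpy as np

def weighted_median(values, weights):
    # Smallest value whose cumulative weight reaches half of the total weight.
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

def similarity_weights(center_patch, patches, h=0.5):
    # Gaussian similarity between a reference patch and its neighbours.
    d2 = np.sum((patches - center_patch) ** 2, axis=1)
    return np.exp(-d2 / h ** 2)

# Toy edge neighbourhood: three pixels near 1.0, two dissimilar pixels near 5.0.
values = np.array([1.0, 1.1, 0.9, 5.0, 5.1])
patches = np.array([[1.0], [1.1], [0.9], [5.0], [5.1]])
w = similarity_weights(patches[0], patches)
print(weighted_median(values, w))   # prints 1.0
```

Because the dissimilar pixels receive near-zero median weight, the fine edge detail survives even though it occupies less than half of the window, which is exactly the failure mode of the plain median regularizer.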
Affiliation(s)
- Xue Ren
- Department of Electronic Engineering, Pai Chai University, Daejeon 35345, Korea
- Ji Eun Jung
- Image Processing Group, Genoray Company, Ltd., Seongnam 13230, Gyeonggi-Do, Korea
- Wen Zhu
- Department of Electronic Engineering, Pai Chai University, Daejeon 35345, Korea
- Soo-Jin Lee
- Department of Electronic Engineering, Pai Chai University, Daejeon 35345, Korea
7
Rakhmawati TW, Ha ID, Lee H, Lee Y. Penalized variable selection for cause-specific hazard frailty models with clustered competing-risks data. Stat Med 2021; 40:6541-6557. [PMID: 34541690] [DOI: 10.1002/sim.9197]
Abstract
Competing-risks data usually arise when the occurrence of one type of event precludes other types of events from being observed. Such data are often encountered in clustered clinical studies, such as multi-center clinical trials. For clustered competing-risks data, which are correlated within a cluster, competing-risks models allowing for frailty terms have recently been studied. To the best of our knowledge, however, there is no literature on variable selection methods for cause-specific hazard frailty models. In this article, we propose a variable selection procedure for fixed effects in cause-specific competing-risks frailty models using a penalized h-likelihood (HL). We study three penalty functions: LASSO, SCAD, and HL. Simulation studies demonstrate that the proposed procedure with the HL penalty works well, providing a higher probability of choosing the true model than the LASSO and SCAD methods without losing prediction accuracy. The proposed method is illustrated using two clustered competing-risks cancer datasets.
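Of the three penalties compared here, SCAD has a well-known closed form (Fan and Li's definition); the sketch below evaluates its three regimes with the conventional choice a = 3.7. The HL penalty is specific to the h-likelihood framework and is not reproduced here.

```python
# The SCAD penalty: linear near zero, quadratic transition, then flat.
import numpy as np

def scad(t, lam=1.0, a=3.7):
    t = np.abs(t)
    small = lam * t                                            # |t| <= lam
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))  # lam < |t| <= a*lam
    big = lam ** 2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, big))

grid = np.linspace(0.0, 6.0, 121)
vals = scad(grid)
print(vals[0], vals[-1])   # 0 at the origin; flat at lam^2 * (a + 1) / 2
```

The flat tail is what distinguishes SCAD from LASSO: large coefficients are not shrunk further, which reduces estimation bias while still setting small coefficients to zero.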
Affiliation(s)
- Il Do Ha
- Department of Statistics, Pukyong National University, Busan, South Korea
- Hangbin Lee
- Department of Statistics, Seoul National University, Seoul, South Korea
- Youngjo Lee
- Department of Statistics, Seoul National University, Seoul, South Korea
8
Ollier E, Blanchard P, Le Teuff G, Michiels S. Penalized Poisson model for network meta-analysis of individual patient time-to-event data. Stat Med 2021; 41:340-355. [PMID: 34710951] [DOI: 10.1002/sim.9240]
Abstract
Network meta-analysis (NMA) allows the combination of direct and indirect evidence from a set of randomized clinical trials. Performing NMA using individual patient data (IPD) is considered a "gold standard" approach, as it provides several advantages over NMA based on aggregate data; for example, it allows advanced modeling of covariates or covariate-by-treatment interactions. An important issue in IPD NMA is the selection of influential parameters among the terms that account for inconsistency, covariates, covariate-by-treatment interactions, or nonproportionality of treatment effects for time-to-event data. This issue has not been studied in depth in the literature, particularly for time-to-event data. A major difficulty is jointly accounting for between-trial heterogeneity, which can strongly influence the selection process. Penalized generalized mixed-effect models are one solution, but existing implementations have several shortcomings and a computational cost that precludes their use for complex IPD NMA. In this article, we propose a penalized Poisson regression model for IPD NMA of time-to-event data. It is based only on fixed-effect parameters, which improves its computational cost relative to the use of random effects, and it can be implemented easily with existing penalized regression packages. Computer code is shared for implementation. The methods were applied to simulated data to illustrate the importance of accounting for between-trial heterogeneity during the selection procedure. Finally, they were applied to an IPD NMA of overall survival under chemotherapy and radiotherapy in nasopharyngeal carcinoma.
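A hedged sketch of the penalised selection step: L1-penalised Poisson regression fitted by proximal gradient (ISTA). The paper's model adds NMA-specific structure (trial effects, treatment contrasts, follow-up offsets); here a plain Poisson lasso on synthetic data illustrates only how the penalty zeroes out uninformative terms.

```python
# L1-penalised Poisson regression via proximal gradient descent (ISTA).
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))                  # column 0 is informative
beta_true = np.array([0.6, 0.0, 0.0])
y = rng.poisson(np.exp(0.5 + X @ beta_true))

def soft(u, thr):
    # Soft-thresholding: the proximal operator of the L1 penalty.
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

b0, beta = 0.0, np.zeros(3)
step, lam = 1e-4, 100.0
for _ in range(5000):
    mu = np.exp(b0 + X @ beta)
    b0 -= step * np.sum(mu - y)              # intercept left unpenalised
    beta = soft(beta - step * X.T @ (mu - y), step * lam)

print(beta.round(2))   # noise coefficients shrunk toward exactly zero
```

In the paper's setting the same mechanism selects among inconsistency, interaction, and nonproportionality terms rather than among raw covariates.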
Affiliation(s)
- Edouard Ollier
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France; Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France; SAINBIOSE U1059, Equipe DVH, Université Jean Monnet, Saint-Etienne, France
- Pierre Blanchard
- Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France; Département de Radiothérapie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
- Gwénaël Le Teuff
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France; Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Stefan Michiels
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France; Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
9
Castel C, Sommen C, Le Strat Y, Alioum A. A multi-state Markov model using notification data to estimate HIV incidence, number of undiagnosed individuals living with HIV, and delay between infection and diagnosis: Illustration in France, 2008-2018. Stat Methods Med Res 2021; 30:2382-2398. [PMID: 34606379] [DOI: 10.1177/09622802211032697]
Abstract
Thirty-five years after the discovery of the human immunodeficiency virus (HIV), the epidemic is still ongoing in France. To guide HIV prevention strategies and monitor their impact, it is essential to understand the dynamics of the HIV epidemic. The indicator used to track the progress of new infections is HIV incidence. Given that HIV is mainly transmitted by undiagnosed individuals and that earlier treatment leads to less HIV transmission, it is essential to know the number of infected people unaware of their HIV-positive status as well as the time between infection and diagnosis. Our approach is based on a non-homogeneous multi-state Markov model describing the progression of HIV disease. We propose a penalized likelihood approach to estimate the HIV incidence curve as well as the diagnosis rates. The HIV incidence curve was approximated using cubic M-splines, while an approximation of the cross-validation criterion was used to estimate the smoothing parameter. In a simulation study, we evaluate the performance of the model for reconstructing the HIV incidence curve and diagnosis rates. The method is illustrated on the population of men who have sex with men, using HIV surveillance data collected by the French Institute for Public Health Surveillance since 2004.
Affiliation(s)
- Charlotte Castel
- Data Science Division, French Institute for Public Health Surveillance, Saint-Maurice, France; University of Paris-Est, Champs-Sur-Marne, France
- Cecile Sommen
- Data Science Division, French Institute for Public Health Surveillance, Saint-Maurice, France
- Yann Le Strat
- Data Science Division, French Institute for Public Health Surveillance, Saint-Maurice, France
- Ahmadou Alioum
- Epidemiology and Biostatistics Research Center, Inserm Center U1219-Bordeaux Population Health, Bordeaux, France; Inserm Center U1219-Bordeaux Population Health, ISPED, University of Bordeaux 2, Bordeaux, France
10
Heo J, Baek J. A Penalized Matrix Normal Mixture Model for Clustering Matrix Data. Entropy (Basel) 2021; 23:1249. [PMID: 34681973] [DOI: 10.3390/e23101249]
Abstract
Along with advances in technology, matrix data, such as medical and industrial images, have emerged in many practical fields. These data usually have high dimensions and are not easy to cluster due to their intrinsically correlated structure among rows and columns. Most approaches convert matrix data to multidimensional vectors and apply conventional clustering methods, and thus suffer from an extreme high-dimensionality problem as well as a lack of interpretability of the correlated structure among row/column variables. Recently, a regularized model was proposed for clustering matrix-valued data by imposing a sparsity structure on the mean signal of each cluster. We extend this approach by further regularizing the covariance to cope better with the curse of dimensionality for large images. A penalized matrix normal mixture model with lasso-type penalty terms on both the mean and covariance matrices is proposed, and an expectation-maximization algorithm is developed to estimate the parameters. The proposed method combines parsimonious modeling with a proper conditional correlation structure. The estimators are consistent, and their limiting distributions are derived. We applied the proposed method to simulated data as well as real datasets and measured its clustering performance with clustering accuracy (ACC) and the adjusted Rand index (ARI). The experimental results show that the proposed method performed better, with higher ACC and ARI, than conventional methods.
11
Clipp HL, Evans AL, Kessinger BE, Kellner K, Rota CT. A penalized likelihood for multispecies occupancy models improves predictions of species interactions. Ecology 2021; 102:e03520. [PMID: 34468982] [DOI: 10.1002/ecy.3520]
Abstract
Multispecies occupancy models estimate dependence among multiple species of interest from patterns of co-occurrence, but problems associated with separation and boundary estimates can lead to unreasonably large estimates of parameters and associated standard errors when species are rarely observed at the same site or when data are sparse. In this paper, we overcome these issues by implementing a penalized likelihood, which introduces a small bias in parameter estimates in exchange for a potentially large reduction in variance. We compare parameter estimates obtained from both penalized and unpenalized multispecies occupancy models fit to simulated data that exhibit various degrees of separation and to a real-world dataset of bird surveys with little apparent overlap between potentially interacting species. Our simulation results demonstrate that penalized multispecies occupancy models did not exhibit boundary estimates and produced lower bias, lower mean squared error, and improved inference relative to unpenalized models. When applied to real-world data, our penalized multispecies occupancy model constrained boundary estimates and allowed for meaningful inference related to the interactions of two species of conservation concern. To facilitate its use, the techniques demonstrated in this paper have been integrated into the unmarked package in the R programming language.
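The boundary-estimate problem the authors address can be shown in miniature with ordinary logistic regression (their actual model is a multispecies occupancy model; only the penalisation idea is illustrated, and all numbers below are toy values): with completely separated data, the unpenalised estimate drifts toward infinity, while a small ridge-type penalty keeps it finite.

```python
# Separation in logistic regression, with and without a ridge penalty.
import numpy as np

x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])             # perfectly separated responses

def fit_slope(lam, n_iter=200, step=0.5):
    # Gradient descent on the (optionally ridge-penalised) logistic loss.
    b = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-b * x))
        b -= step * (np.sum((p - y) * x) + lam * b)
    return b

b_unpen = fit_slope(lam=0.0)   # keeps growing with more iterations
b_pen = fit_slope(lam=1.0)     # settles at a finite value
print(b_unpen > b_pen)         # prints True
```

This bias-for-variance trade is the same one the abstract describes: the penalty nudges estimates away from the boundary at the cost of a small, controlled bias.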
Affiliation(s)
- Hannah L Clipp
- Division of Forestry and Natural Resources, West Virginia University, Morgantown, West Virginia, 26506, USA
- Amber L Evans
- Division of Forestry and Natural Resources, West Virginia University, Morgantown, West Virginia, 26506, USA
- Brin E Kessinger
- Division of Forestry and Natural Resources, West Virginia University, Morgantown, West Virginia, 26506, USA
- Kenneth Kellner
- Global Wildlife Conservation Center, State University of New York College of Environmental Science and Forestry, Syracuse, New York, 13210, USA
- Christopher T Rota
- Division of Forestry and Natural Resources, West Virginia University, Morgantown, West Virginia, 26506, USA
12
Alsalim N, Baghfalaki T. Variable selection for longitudinal zero-inflated power series transition model. J Biopharm Stat 2021; 31:668-685. [PMID: 34325620] [DOI: 10.1080/10543406.2021.1944177]
Abstract
In many longitudinal clinical studies with count outcomes, an excess of zeros is a common problem. To account for the extra zeros, zero-inflated power series (ZIPS) models have been applied. These models assume a latent mixture consisting of a count component and a degenerate zero component with a unit point mass at zero. Usually, the current response measurement in a longitudinal sequence is a function of previous outcomes; for example, in a study of acute renal allograft rejection, the number of acute rejection episodes for a patient at the current time is a function of this outcome at previous follow-up times. In this paper, we consider a transition model that accounts for the dependence of the current outcome on previous outcomes in the presence of excess zeros. New variable selection methods for the ZIPS transition model using the least absolute shrinkage and selection operator (LASSO), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) penalties are proposed. An expectation-maximization (EM) algorithm using the penalized likelihood is applied for both parameter estimation and variable selection. Simulation studies are performed to investigate the performance of the proposed approach, and the approach is applied to a real dataset.
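The latent-mixture E-step that the penalised EM builds on can be sketched with a plain zero-inflated Poisson model (no covariates, no transition terms, no penalty); all values below are synthetic.

```python
# EM for a zero-inflated Poisson: the E-step weighs "structural" zeros.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
structural = rng.random(n) < 0.3             # true zero-inflation pi = 0.3
y = np.where(structural, 0, rng.poisson(2.0, n))

pi, lam = 0.5, 1.0
for _ in range(300):
    # E-step: P(structural zero | y). Nonzero counts cannot be structural.
    w = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
    # M-step: update the mixing proportion and the Poisson mean.
    pi = w.mean()
    lam = np.sum((1 - w) * y) / np.sum(1 - w)

print(round(pi, 2), round(lam, 2))   # near the true values 0.3 and 2.0
```

The ZIPS transition model replaces the constant pi and lam with regression functions of covariates and lagged outcomes, and adds a penalty to the M-step for variable selection.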
Affiliation(s)
- Nawar Alsalim
- Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
- Taban Baghfalaki
- Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
13
Shin YE, Liu D, Sang H, Ferguson TA, Song PXK. A binary hidden Markov model on spatial network for amyotrophic lateral sclerosis disease spreading pattern analysis. Stat Med 2021; 40:3035-3052. [PMID: 33763884] [DOI: 10.1002/sim.8956]
Abstract
Amyotrophic lateral sclerosis (ALS) is a neurological disease that starts at a focal point and gradually spreads to other parts of the nervous system. One of the main clinical symptoms of ALS is muscle weakness. To study spreading patterns of muscle weakness, we analyze spatiotemporal binary muscle strength data, which indicate whether observed muscle strengths are impaired or healthy. We propose a hidden Markov model-based approach that assumes the observed disease status depends on two latent disease states. The model enables us to estimate the incidence rate of ALS and the probability of disease-state transition. Specifically, the latter is modeled by a logistic autoregression in which the spatial network of susceptible muscles follows a Markov process. The proposed model is flexible enough to allow both historical muscle conditions and their spatial relationships to be included in the analysis. To estimate the model parameters, we provide an iterative algorithm that maximizes a sparse-penalized likelihood with bias correction, and we use the Viterbi algorithm to label the hidden disease states. We apply the proposed approach to analyze ALS patients' data from the EMPOWER study.
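The Viterbi step mentioned above, recovering the most likely latent-state path, can be sketched for a two-state chain; the transition and emission probabilities below are made-up toy numbers, not values from the paper.

```python
# Viterbi decoding for a two-state hidden Markov model in log space.
import numpy as np

states = ["healthy", "impaired"]
log_trans = np.log([[0.9, 0.1],              # healthy -> (healthy, impaired)
                    [0.05, 0.95]])           # impaired is nearly absorbing
log_emit = np.log([[0.8, 0.2],               # P(strong, weak | healthy)
                   [0.3, 0.7]])              # P(strong, weak | impaired)
log_init = np.log([0.95, 0.05])

def viterbi(obs):
    T = len(obs)
    delta = log_init + log_emit[:, obs[0]]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[prev, cur]
        back[t] = scores.argmax(axis=0)            # best predecessor per state
        delta = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrack to the start
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# Observations: 0 = strong measurement, 1 = weak measurement.
print(viterbi([0, 0, 1, 1, 1]))   # switches to "impaired" at the first weak signal
```

In the paper this decoding runs over a spatial network of muscles, with transition probabilities given by the fitted logistic autoregression rather than fixed constants.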
Affiliation(s)
- Yei Eun Shin
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
- Dawei Liu
- Global Analytics and Data Sciences, Biogen, Cambridge, Massachusetts, USA
- Huiyan Sang
- Department of Statistics, Texas A&M University, College Station, Texas, USA
- Toby A Ferguson
- Neurology Research and Early Clinical Development, Biogen, Cambridge, Massachusetts, USA
- Peter X K Song
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
14
Goepp V, Thalabard JC, Nuel G, Bouaziz O. Regularized bidimensional estimation of the hazard rate. Int J Biostat 2021; 18:263-277. [PMID: 33768761] [DOI: 10.1515/ijb-2019-0003]
Abstract
In epidemiological or demographic studies with variable age at onset, a typical quantity of interest is the incidence of a disease (for example, cancer incidence). In these studies, the individuals are usually highly heterogeneous in terms of dates of birth (the cohort) and with respect to calendar time (the period), and appropriate estimation methods are needed. In this article, a new estimation method is presented that extends classical age-period-cohort analysis by allowing interactions between age, period, and cohort effects. We introduce a bidimensional regularized estimate of the hazard rate in which a penalty is added to the likelihood of the model. This penalty can be designed either to smooth the hazard rate or to force consecutive values of the hazard to be equal, leading to a parsimonious representation of the hazard rate. In the latter case, we make use of an iterative penalized likelihood scheme to approximate the L0 norm, which makes the computation tractable. The method is evaluated on simulated data and applied to breast cancer survival data from the SEER program.
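The iterative scheme for approximating an L0 penalty on consecutive differences can be sketched in one dimension (the paper works on a bidimensional hazard surface; this toy, an adaptive-ridge-style reweighting under assumed tuning values, illustrates only the fusing mechanism): a weighted quadratic penalty is re-solved with weights 1 / (diff^2 + eps), which drives most adjacent values to fuse into a piecewise-constant fit.

```python
# Iterative reweighted ridge approximating an L0 penalty on differences.
import numpy as np

# Noisy piecewise-constant signal: 20 points at level 1, then 20 at level 3.
y = np.concatenate([np.full(20, 1.0), np.full(20, 3.0)]) \
    + np.random.default_rng(4).normal(0, 0.2, 40)

n, lam, eps = len(y), 1.0, 1e-6
theta, w = y.copy(), np.ones(n - 1)
D = np.diff(np.eye(n), axis=0)               # first-difference operator
for _ in range(50):
    # Solve (I + lam * D' W D) theta = y for the current weights W.
    A = np.eye(n) + lam * D.T @ (w[:, None] * D)
    theta = np.linalg.solve(A, y)
    # Reweight: small differences get huge weights and are fused further.
    w = 1.0 / (np.diff(theta) ** 2 + eps)

levels = np.unique(np.round(theta, 3))
print(len(levels))   # a few fused levels rather than 40 distinct values
```

Fused differences end up with weight near 1/eps, enforcing near-exact equality, while the one genuine jump keeps a small weight and survives, which is the parsimonious representation the abstract describes.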
Affiliation(s)
- Vivien Goepp
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France; MINES ParisTech, CBIO-Centre for Computational Biology, PSL Research University, 75006, Paris, France; Institut Curie, PSL Research University, 75005, Paris, France; Inserm, U900, Paris, France
- Grégory Nuel
- LPSM, CNRS UMR 8001, 4, Place Jussieu, 75005, Paris, France
- Olivier Bouaziz
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France
15
Geminiani E, Marra G, Moustaki I. Single- and Multiple-Group Penalized Factor Analysis: A Trust-Region Algorithm Approach with Integrated Automatic Multiple Tuning Parameter Selection. Psychometrika 2021; 86:65-95. [PMID: 33768403] [PMCID: PMC8035122] [DOI: 10.1007/s11336-021-09751-8]
Abstract
Penalized factor analysis is an efficient technique that produces a factor loading matrix with many zero elements thanks to the introduction of sparsity-inducing penalties within the estimation process. However, sparse solutions and stable model selection procedures are only possible if the employed penalty is non-differentiable, which poses certain theoretical and computational challenges. This article proposes a general penalized likelihood-based estimation approach for single- and multiple-group factor analysis models. The framework builds upon differentiable approximations of non-differentiable penalties, a theoretically founded definition of degrees of freedom, and an algorithm with integrated automatic multiple tuning parameter selection that exploits second-order analytical derivative information. The proposed approach is evaluated in two simulation studies and illustrated using a real data set. All the necessary routines are integrated into the R package penfa.
Affiliation(s)
- Elena Geminiani
- Department of Statistical Sciences, University of Bologna, Via Delle Belle Arti 41, 40126, Bologna, Italy
- Giampiero Marra
- Department of Statistical Science, University College London, London, UK
- Irini Moustaki
- Department of Statistics, London School of Economics and Political Science, London, UK
16
Van Calster B, van Smeden M, De Cock B, Steyerberg EW. Regression shrinkage methods for clinical prediction models do not guarantee improved performance: Simulation study. Stat Methods Med Res 2020; 29:3166-3178. [PMID: 32401702] [DOI: 10.1177/0962280220921415]
Abstract
When developing risk prediction models on datasets with limited sample size, shrinkage methods are recommended. Earlier studies showed that shrinkage results in better predictive performance on average. This simulation study aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome. We compared standard maximum likelihood with the following shrinkage methods: uniform shrinkage (likelihood-based and bootstrap-based), penalized maximum likelihood (ridge) methods, LASSO logistic regression, adaptive LASSO, and Firth's correction. In the simulation study, we varied the number of predictors and their strength, the correlation between predictors, the event rate of the outcome, and the events per variable. In terms of results, we focused on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). The results can be summarized into three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. In contrast to other shrinkage approaches, Firth's correction had a small shrinkage effect but showed low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative, with Firth's correction as the exception. We conclude that, despite improved performance on average, shrinkage often worked poorly in individual datasets, in particular when it was most needed. The results imply that shrinkage methods do not solve problems associated with small sample size or low number of events per variable.
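The calibration slope reported in this study is straightforward to compute from validation data: regress the observed binary outcomes on the linear predictor (logit) of the risk predictions with a logistic model, and read off the slope coefficient. A minimal sketch in Python; the two-parameter Newton-Raphson fitter and the simulated data are illustrative, not taken from the paper:

```python
import numpy as np

def calibration_slope(y, p, n_iter=25):
    """Slope of the logistic recalibration model logit P(y=1) = a + b * logit(p).

    b < 1 suggests predictions are too extreme (overfitting);
    b > 1 suggests they are not extreme enough.
    """
    lp = np.log(p / (1.0 - p))                  # linear predictor of the prediction model
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(n_iter):                     # Newton-Raphson for the 2-parameter logistic fit
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1.0 - mu)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
    return beta[1]

# Illustration: predictions from the true model should have a slope close to 1.
rng = np.random.default_rng(1)
x = rng.normal(size=(5000, 3))
lp_true = x @ np.array([1.0, -0.5, 0.25])
p_true = 1.0 / (1.0 + np.exp(-lp_true))
y = rng.binomial(1, p_true)
slope = calibration_slope(y, p_true)
```

On new data, overfitted predictions yield a slope well below 1; shrinkage pulls the slope toward 1 on average, but, as the study shows, with considerable between-sample variability.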
Affiliation(s)
- Ben Van Calster
- Department of Development and Regeneration, KU Leuven, Leuven, Belgium; Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
- Maarten van Smeden
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands; Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands
- Bavo De Cock
- Department of Development and Regeneration, KU Leuven, Leuven, Belgium; Department of Accountancy, Finance and Insurance, KU Leuven, Leuven, Belgium
- Ewout W Steyerberg
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
17
Abella M, Martinez C, Desco M, Vaquero JJ, Fessler JA. Simplified Statistical Image Reconstruction for X-ray CT With Beam-Hardening Artifact Compensation. IEEE Trans Med Imaging 2020; 39:111-118. [PMID: 31180844] [PMCID: PMC6995645] [DOI: 10.1109/tmi.2019.2921929]
Abstract
CT images are often affected by beam-hardening artifacts due to the polychromatic nature of the X-ray spectra. These artifacts appear in the image as cupping in homogeneous areas and as dark bands between dense regions such as bones. This paper proposes a simplified statistical reconstruction method for X-ray CT based on Poisson statistics that accounts for the non-linearities caused by beam hardening. The main advantages of the proposed method over previous algorithms are that it avoids the preliminary segmentation step, which can be tricky, especially for low-dose scans, and it does not require knowledge of the whole source spectrum, which is often unknown. Each voxel attenuation is modeled as a mixture of bone and soft tissue by defining density-dependent tissue fractions and maintaining one unknown per voxel. We approximate the energy-dependent attenuation corresponding to different combinations of bone and soft tissues, the so-called beam-hardening function, with the 1D function corresponding to water plus two parameters that can be tuned empirically. Results on both simulated data with Poisson sinogram noise and two rodent studies acquired with the ARGUS/CT system showed a beam hardening reduction (both cupping and dark bands) similar to analytical reconstruction followed by post-processing techniques but with reduced noise and streaks in cases with a low number of projections, as expected for statistical image reconstruction.
18
Chen BE, Wang J. Joint modeling of binary response and survival for clustered data in clinical trials. Stat Med 2019; 39:326-339. [PMID: 31777115] [DOI: 10.1002/sim.8403]
Abstract
In clinical trials, it is often desirable to evaluate the effect of a prognostic factor such as a marker response on a survival outcome. However, the marker response and survival outcome are usually associated with some potentially unobservable factors. In this case, the conventional statistical methods that model these two outcomes separately may not be appropriate. In this paper, we propose a joint model for marker response and survival outcomes for clustered data, providing efficient statistical inference by considering these two outcomes simultaneously. We focus on a special type of marker response: a binary outcome, which is investigated together with survival data using a cluster-specific multivariate random effect variable. A multivariate penalized likelihood method is developed to make statistical inference for the joint model. However, the standard errors obtained from the penalized likelihood method are usually underestimated. This issue is addressed using a jackknife resampling method to obtain a consistent estimate of standard errors. We conduct extensive simulation studies to assess the finite sample performance of the proposed joint model and inference methods in different scenarios. The simulation studies show that the proposed joint model has excellent finite sample properties compared to the separate models when there exists an underlying association between the marker response and survival data. Finally, we apply the proposed method to a symptom control study conducted by the Canadian Cancer Trials Group to explore the prognostic effect of covariates on pain control and overall survival.
Affiliation(s)
- Bingshu E Chen
- Canadian Cancer Trials Group and Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
- Jia Wang
- Population Health Research Institute, Hamilton, Ontario, Canada
19
Šinkovec H, Geroldinger A, Heinze G. Bring More Data!-A Good Advice? Removing Separation in Logistic Regression by Increasing Sample Size. Int J Environ Res Public Health 2019; 16(23):4658. [PMID: 31766753] [PMCID: PMC6926877] [DOI: 10.3390/ijerph16234658]
Abstract
The parameters of logistic regression models are usually obtained by the method of maximum likelihood (ML). However, in analyses of small data sets or data sets with unbalanced outcomes or exposures, ML parameter estimates may not exist. This situation has been termed 'separation' as the two outcome groups are separated by the values of a covariate or a linear combination of covariates. To overcome the problem of non-existing ML parameter estimates, applying Firth's correction (FC) was proposed. In practice, however, a principal investigator might be advised to 'bring more data' in order to solve a separation issue. We illustrate the problem by means of examples from colorectal cancer screening and ornithology. It is unclear if such an increasing sample size (ISS) strategy that keeps sampling new observations until separation is removed improves estimation compared to applying FC to the original data set. We performed an extensive simulation study where the main focus was to estimate the cost-adjusted relative efficiency of ML combined with ISS compared to FC. FC yielded reasonably small root mean squared errors and proved to be the more efficient estimator. Given our findings, we propose not to adapt the sample size when separation is encountered but to use FC as the default method of analysis whenever the number of observations or outcome events is critically low.
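Firth's correction is commonly implemented as penalized IRLS with the adjusted score U*(b) = X'(y - mu + h(1/2 - mu)), where h are the hat values of the weighted fit. The sketch below is ours (not from the paper) and shows the estimate staying finite on a completely separated toy data set, where ordinary maximum likelihood diverges:

```python
import numpy as np

def firth_logistic(X, y, n_iter=50):
    """Firth-penalized logistic regression via the adjusted score
    U*(b) = X'(y - mu + h * (1/2 - mu)), with h the hat values."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1.0 - mu)
        XtWX = X.T @ (W[:, None] * X)
        # Hat values of the weighted least-squares fit
        Xw = np.sqrt(W)[:, None] * X
        h = np.einsum("ij,ij->i", Xw @ np.linalg.inv(XtWX), Xw)
        score = X.T @ (y - mu + h * (0.5 - mu))
        beta += np.linalg.solve(XtWX, score)
    return beta

# Completely separated data: y == x, so the ML slope estimate is +infinity.
x = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])
beta = firth_logistic(X, y)
```

For this saturated 2x2 example, Firth's penalty amounts to adding 1/2 to each cell of the table, giving a finite slope around log(4.5*4.5/(0.5*0.5)) instead of the infinite ML estimate.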
20
Rashid NU, Li Q, Yeh JJ, Ibrahim JG. Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction. J Am Stat Assoc 2019; 115:1125-1138. [PMID: 33012902] [DOI: 10.1080/01621459.2019.1671197]
Abstract
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently non-zero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high dimensional penalized Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
Affiliation(s)
- Naim U Rashid
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
- Quefeng Li
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
- Jen Jen Yeh
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.; Department of Surgery, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.; Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
- Joseph G Ibrahim
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.
21
Jazić I, Haneuse S, French B, MacGrogan G, Rondeau V. Design and analysis of nested case-control studies for recurrent events subject to a terminal event. Stat Med 2019; 38:4348-4362. [PMID: 31290191] [DOI: 10.1002/sim.8302]
Abstract
The process by which patients experience a series of recurrent events, such as hospitalizations, may be subject to death. In cohort studies, one strategy for analyzing such data is to fit a joint frailty model for the intensities of the recurrent event and death, which estimates covariate effects on the two event types while accounting for their dependence. When certain covariates are difficult to obtain, however, researchers may only have the resources to subsample patients on whom to collect complete data: one way is using the nested case-control (NCC) design, in which risk set sampling is performed based on a single outcome. We develop a general framework for the design of NCC studies in the presence of recurrent and terminal events and propose estimation and inference for a joint frailty model for recurrence and death using data arising from such studies. We propose a maximum weighted penalized likelihood approach using flexible spline models for the baseline intensity functions. Two standard error estimators are proposed: a sandwich estimator and a perturbation resampling procedure. We investigate operating characteristics of our estimators as well as design considerations via a simulation study and illustrate our methods using two studies: one on recurrent cardiac hospitalizations in patients with heart failure and the other on local recurrence and metastasis in patients with breast cancer.
Affiliation(s)
- Ina Jazić
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Benjamin French
- Department of Statistics, Radiation Effects Research Foundation, Hiroshima, Japan
- Virginie Rondeau
- Centre de recherche INSERM U1219, Université de Bordeaux-ISPED, Bordeaux, France
22
Sun L, Li S, Wang L, Song X. Variable selection in semiparametric nonmixture cure model with interval-censored failure time data: An application to the prostate cancer screening study. Stat Med 2019; 38:3026-3039. [PMID: 31032999] [DOI: 10.1002/sim.8165]
Abstract
Censored failure time data with a cured subgroup is frequently encountered in many scientific areas including the cancer screening research, tumorigenicity studies, and sociological surveys. Meanwhile, one may also encounter an extraordinary large number of risk factors in practice, such as patient's demographic characteristics, clinical measurements, and medical history, which makes variable selection an emerging need in the data analysis. Motivated by a medical study on prostate cancer screening, we develop a variable selection method in the semiparametric nonmixture or promotion time cure model when interval-censored data with a cured subgroup are present. Specifically, we propose a penalized likelihood approach with the use of the least absolute shrinkage and selection operator, adaptive least absolute shrinkage and selection operator, or smoothly clipped absolute deviation penalties, which can be easily accomplished via a novel penalized expectation-maximization algorithm. We assess the finite-sample performance of the proposed methodology through extensive simulations and analyze the prostate cancer screening data for illustration.
Affiliation(s)
- Liuquan Sun
- School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Shuwei Li
- School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Lianming Wang
- Department of Statistics, University of South Carolina, Columbia, South Carolina
- Xinyuan Song
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong
23
Abstract
The current penalized regression methods for selecting predictor variables and estimating the associated regression coefficients in the sparse Cox model are mainly based on partial likelihood. In this paper, a bias-corrected empirical likelihood method is proposed for the sparse Cox model in conjunction with appropriate penalty functions when the dimensionality of data is high. Theoretical properties of the resulting estimator for the large sample are proved. Simulation studies suggest that penalized empirical likelihood works better than partial likelihood in terms of selecting correct predictors without introducing more model errors. The well-known primary biliary cirrhosis data set is used to illustrate the proposed penalized empirical likelihood method.
Affiliation(s)
- Dongliang Wang
- Department of Public Health and Preventive Medicine, SUNY Upstate Medical University
- Tong Tong Wu
- Department of Biostatistics and Computational Biology, University of Rochester
- Yichuan Zhao
- Department of Mathematics and Statistics, Georgia State University
24
Derkach A, Pfeiffer RM, Chen TH, Sampson JN. High dimensional mediation analysis with latent variables. Biometrics 2019; 75:745-756. [PMID: 30859548] [DOI: 10.1111/biom.13053]
Abstract
We propose a model for high dimensional mediation analysis that includes latent variables. We describe our model in the context of an epidemiologic study for incident breast cancer with one exposure and a large number of biomarkers (i.e., potential mediators). We assume that the exposure directly influences a group of latent, or unmeasured, factors which are associated with both the outcome and a subset of the biomarkers. The biomarkers associated with the latent factors linking the exposure to the outcome are considered "mediators." We derive the likelihood for this model and develop an expectation-maximization algorithm to maximize an L1-penalized version of this likelihood to limit the number of factors and associated biomarkers. We show that the resulting estimates are consistent and that the estimates of the nonzero parameters have an asymptotically normal distribution. In simulations, procedures based on this new model can have significantly higher power for detecting the mediating biomarkers compared with the simpler approaches. We apply our method to a study that evaluates the relationship between body mass index, 481 metabolic measurements, and estrogen-receptor positive breast cancer.
Affiliation(s)
- Andriy Derkach
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland
- Ruth M Pfeiffer
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland
- Ting-Huei Chen
- Department of Mathematics and Statistics, Laval University, Quebec City, Canada
- Joshua N Sampson
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland
25
Huang PH. A penalized likelihood method for multi-group structural equation modelling. Br J Math Stat Psychol 2018; 71:499-522. [PMID: 29500879] [DOI: 10.1111/bmsp.12130]
Abstract
In the past two decades, statistical modelling with sparsity has become an active research topic in the fields of statistics and machine learning. Recently, Huang, Chen and Weng (2017, Psychometrika, 82, 329) and Jacobucci, Grimm, and McArdle (2016, Structural Equation Modeling: A Multidisciplinary Journal, 23, 555) both proposed sparse estimation methods for structural equation modelling (SEM). These methods, however, are restricted to performing single-group analysis. The aim of the present work is to establish a penalized likelihood (PL) method for multi-group SEM. Our proposed method decomposes each group model parameter into a common reference component and a group-specific increment component. By penalizing the increment components, the heterogeneity of parameter values across the population can be explored since the null group-specific effects are expected to diminish. We developed an expectation-conditional maximization algorithm to optimize the PL criteria. A numerical experiment and a real data example are presented to demonstrate the potential utility of the proposed method.
Affiliation(s)
- Po-Hsien Huang
- Department of Psychology, National Cheng Kung University, Taiwan
26
He Y, Lin H, Tu D. A single-index threshold Cox proportional hazard model for identifying a treatment-sensitive subset based on multiple biomarkers. Stat Med 2018; 37:3267-3279. [PMID: 29869381] [DOI: 10.1002/sim.7837]
Abstract
In this paper, we introduce a single-index threshold Cox proportional hazard model to select and combine biomarkers to identify patients who may be sensitive to a specific treatment. A penalized smoothed partial likelihood is proposed to estimate the parameters in the model. A simple, efficient, and unified algorithm is presented to maximize this likelihood function. The estimators based on this likelihood function are shown to be consistent and asymptotically normal. Under mild conditions, the proposed estimators also achieve the oracle property. The proposed approach is evaluated through simulation analyses and application to the analysis of data from two clinical trials, one involving patients with locally advanced or metastatic pancreatic cancer and one involving patients with resectable lung cancer.
Affiliation(s)
- Ye He
- Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
- Huazhen Lin
- Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
- Dongsheng Tu
- Department of Public Health Sciences, Canadian Cancer Trials Group, Queen's University, Kingston, Ontario, Canada
27
Heinze G, Wallisch C, Dunkler D. Variable selection - A review and recommendations for the practicing statistician. Biom J 2018; 60:431-449. [PMID: 29292533] [PMCID: PMC5969114] [DOI: 10.1002/bimj.201700067]
Abstract
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
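One of the review's recommendations, resampling the entire selection process and reporting how often each variable is selected, can be sketched as follows. The lasso-type selector (a small coordinate-descent implementation) and the simulated data are our illustrative stand-ins for whatever selection method is actually in use:

```python
import numpy as np

def lasso(X, y, alpha, n_sweeps=50):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + alpha*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ms = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]       # partial residual excluding x_j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_ms[j]
    return b

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)           # only the first variable is truly active

# Bootstrap the whole selection procedure and tabulate inclusion frequencies.
B = 100
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    freq += lasso(X[idx], y[idx], alpha=0.3) != 0
freq /= B
```

Inclusion frequencies near 1 indicate selections that are stable under resampling; intermediate frequencies flag variables whose inclusion depends on the particular sample.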
Affiliation(s)
- Georg Heinze
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, 1090, Austria
- Christine Wallisch
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, 1090, Austria
- Daniela Dunkler
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, 1090, Austria
28
Abstract
Constructing expression networks using transcriptomic data is an effective approach for studying gene regulation. A popular approach for constructing such a network is based on the Gaussian graphical model (GGM), in which an edge between a pair of genes indicates that the expression levels of these two genes are conditionally dependent, given the expression levels of all other genes. However, GGMs are not appropriate for non-Gaussian data, such as those generated in RNA-seq experiments. We propose a novel statistical framework that maximizes a penalized likelihood, in which the observed count data follow a Poisson log-normal distribution. To overcome the computational challenges, we use Laplace's method to approximate the likelihood and its gradients, and apply the alternating directions method of multipliers to find the penalized maximum likelihood estimates. The proposed method is evaluated and compared with GGMs using both simulated and real RNA-seq data. The proposed method shows improved performance in detecting edges that represent covarying pairs of genes, particularly for edges connecting low-abundant genes and edges around regulatory hubs.
Affiliation(s)
- Yoonha Choi
- Department of Genetics, Stanford University, Stanford, California
- Marc Coram
- Department of Health Research and Policy, Stanford University, Stanford, California
- Jie Peng
- Department of Statistics, University of California, Davis, Davis, California
- Hua Tang
- Department of Genetics, Stanford University, Stanford, California
29
Abstract
A penalized likelihood (PL) method for structural equation modeling (SEM) was proposed as a methodology for exploring the underlying relations among both observed and latent variables. Compared to the usual likelihood method, PL includes a penalty term to control the complexity of the hypothesized model. When the penalty level is appropriately chosen, the PL can yield an SEM model that balances the model goodness-of-fit and model complexity. In addition, the PL results in a sparse estimate that enhances the interpretability of the final model. The proposed method is especially useful when limited substantive knowledge is available for model specifications. The PL method can be also understood as a methodology that links the traditional SEM to the exploratory SEM (Asparouhov & Muthén in Struct Equ Model Multidiscipl J 16:397-438, 2009). An expectation-conditional maximization algorithm was developed to maximize the PL criterion. The asymptotic properties of the proposed PL were also derived. The performance of PL was evaluated through a numerical experiment, and two real data illustrations were presented to demonstrate its utility in psychological research.
Affiliation(s)
- Po-Hsien Huang
- Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
- Department of Psychology, National Cheng Kung University, Tainan, Taiwan
- Hung Chen
- Department of Mathematics, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
- Li-Jen Weng
- Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan.
30
Abstract
Finite mixture regression models have been widely used for modelling mixed regression relationships arising from a clustered and thus heterogeneous population. The classical normal mixture model, despite its simplicity and wide applicability, may fail in the presence of severe outliers. Using a sparse, case-specific, and scale-dependent mean-shift mixture model parameterization, we propose a robust mixture regression approach for simultaneously conducting outlier detection and robust parameter estimation. A penalized likelihood approach is adopted to induce sparsity among the mean-shift parameters so that the outliers are distinguished from the remainder of the data, and a generalized Expectation-Maximization (EM) algorithm is developed to perform stable and efficient computation. The proposed approach is shown to have strong connections with other robust methods including the trimmed likelihood method and M-estimation approaches. In contrast to several existing methods, the proposed methods show outstanding performance in our simulation studies.
Affiliation(s)
- Chun Yu
- School of Statistics, Jiangxi University of Finance and Economics, Nanchang 330013, P. R. China
- Weixin Yao
- Department of Statistics, University of California, Riverside, CA 92521, U.S.A
- Kun Chen
- Department of Statistics, University of Connecticut, Storrs, CT 06269, U.S.A
31
Wangerin KA, Ahn S, Wollenweber S, Ross SG, Kinahan PE, Manjeshwar RM. Evaluation of lesion detectability in positron emission tomography when using a convergent penalized likelihood image reconstruction method. J Med Imaging (Bellingham) 2016; 4:011002. [PMID: 27921073 DOI: 10.1117/1.jmi.4.1.011002] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2016] [Accepted: 10/18/2016] [Indexed: 11/14/2022] Open
Abstract
We have previously developed a convergent penalized likelihood (PL) image reconstruction algorithm using the relative difference prior (RDP) and showed that it achieves more accurate lesion quantitation compared to ordered subsets expectation maximization (OSEM). We evaluated the detectability of low-contrast liver and lung lesions using the PL-RDP algorithm compared to OSEM. We performed a two-alternative forced choice study using a channelized Hotelling observer model that was previously validated against human observers. Lesion detectability showed a stronger dependence on lesion size for PL-RDP than OSEM. Lesion detectability was improved using time-of-flight (TOF) reconstruction, with greater benefit for the liver compared to the lung and with increasing benefit for decreasing lesion size and contrast. PL detectability was statistically significantly higher than OSEM for 20 mm liver lesions when contrast was [Formula: see text] ([Formula: see text]), and TOF PL detectability was statistically significantly higher than TOF OSEM for 15 and 20 mm liver lesions with contrast [Formula: see text] and [Formula: see text], respectively. For all other cases, there was no statistically significant difference between PL and OSEM ([Formula: see text]). For the range of studied lesion properties, lesion detectability using PL-RDP was equivalent or improved compared to using OSEM.
Affiliation(s)
- Kristen A Wangerin
- General Electric Global Research Center, 1 Research Circle, Niskayuna, New York 12309, United States; University of Washington, Department of Bioengineering, 3720 15th Avenue NE, Seattle, Washington 98195, United States
- Sangtae Ahn
- General Electric Global Research Center, 1 Research Circle, Niskayuna, New York 12309, United States
- Scott Wollenweber
- General Electric Healthcare, 3000 North Grandview Boulevard, Waukesha, Wisconsin 53188, United States
- Steven G Ross
- General Electric Healthcare, 3000 North Grandview Boulevard, Waukesha, Wisconsin 53188, United States
- Paul E Kinahan
- University of Washington, Department of Bioengineering, 3720 15th Avenue NE, Seattle, Washington 98195, United States; University of Washington, Department of Radiology, 1959 NE Pacific Street, Seattle, Washington 98195, United States
- Ravindra M Manjeshwar
- General Electric Global Research Center, 1 Research Circle, Niskayuna, New York 12309, United States
32
Abstract
Survival data with ultrahigh dimensional covariates such as genetic markers have been collected in medical studies and other fields. In this work, we propose a feature screening procedure for the Cox model with ultrahigh dimensional covariates. The proposed procedure is distinguished from the existing sure independence screening (SIS) procedures (Fan, Feng and Wu, 2010; Zhao and Li, 2012) in that the proposed procedure is based on the joint likelihood of potential active predictors, and therefore is not a marginal screening procedure. The proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response without performing an iterative procedure. We develop a computationally effective algorithm to carry out the proposed procedure and establish the ascent property of the proposed algorithm. We further prove that the proposed procedure possesses the sure screening property. That is, with probability tending to one, the selected variable set includes the actual active predictors. We conduct Monte Carlo simulation to evaluate the finite sample performance of the proposed procedure and further compare the proposed procedure and existing SIS procedures. The proposed methodology is also demonstrated through an empirical analysis of a real data example.
Affiliation(s)
- Guangren Yang
- School of Economics, Jinan University, Guangzhou, P.R. China
- Ye Yu
- Department of Statistics, The Pennsylvania State University, University Park, PA 16802
- Runze Li
- Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802
- Anne Buu
- School of Nursing, University of Michigan, Ann Arbor, MI 48109, USA
33
Barber RF, Sidky EY. MOCCA: Mirrored Convex/Concave Optimization for Nonconvex Composite Functions. J Mach Learn Res 2016; 17:1-51. [PMID: 29391859 PMCID: PMC5789814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Many optimization problems arising in high-dimensional statistics decompose naturally into a sum of several terms, where the individual terms are relatively simple but the composite objective function can only be optimized with iterative algorithms. In this paper, we are interested in optimization problems of the form F(Kx) + G(x), where K is a fixed linear transformation, while F and G are functions that may be nonconvex and/or nondifferentiable. In particular, if either of the terms is nonconvex, existing alternating minimization techniques may fail to converge; other types of existing approaches may instead be unable to handle nondifferentiability. We propose the MOCCA (mirrored convex/concave) algorithm, a primal/dual optimization approach that takes a local convex approximation to each term at every iteration. Inspired by optimization problems arising in computed tomography (CT) imaging, this algorithm can handle a range of nonconvex composite optimization problems, and offers theoretical guarantees for convergence when the overall problem is approximately convex (that is, any concavity in one term is balanced out by convexity in the other term). Empirical results show fast convergence for several structured signal recovery problems.
Affiliation(s)
- Rina Foygel Barber
- Department of Statistics, University of Chicago, 5747 South Ellis Avenue, Chicago, IL 60637, USA
- Emil Y Sidky
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637, USA
34
Ma S, Carroll RJ, Liang H, Xu S. Estimation and Inference in Generalized Additive Coefficient Models for Nonlinear Interactions with High-Dimensional Covariates. Ann Stat 2015; 43:2102-2131. [PMID: 26412908 DOI: 10.1214/15-aos1344] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423-1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the "large p small n" setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.
35
Greenland S, Mansournia MA. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Stat Med 2015; 34:3133-43. [PMID: 26011599 DOI: 10.1002/sim.6537] [Citation(s) in RCA: 163] [Impact Index Per Article: 18.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Revised: 04/06/2015] [Accepted: 05/01/2015] [Indexed: 11/06/2022]
Abstract
Penalization is a very general method of stabilizing or regularizing estimates, which has both frequentist and Bayesian rationales. We consider some questions that arise when considering alternative penalties for logistic regression and related models. The most widely programmed penalty appears to be the Firth small-sample bias-reduction method (albeit with small differences among implementations and the results they provide), which corresponds to using the log density of the Jeffreys invariant prior distribution as a penalty function. The latter representation raises some serious contextual objections to the Firth reduction, which also apply to alternative penalties based on t-distributions (including Cauchy priors). Taking simplicity of implementation and interpretation as our chief criteria, we propose that the log-F(1,1) prior provides a better default penalty than other proposals. Penalization based on more general log-F priors is trivial to implement and facilitates mean-squared error reduction and sensitivity analyses of penalty strength by varying the number of prior degrees of freedom. We caution however against penalization of intercepts, which are unduly sensitive to covariate coding and design idiosyncrasies.
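The log-F(1,1) penalty the abstract recommends is simple to sketch: its log-prior for a coefficient b is b/2 - log(1 + e^b), maximized at b = 0. Below is a minimal, illustrative Newton-Raphson implementation (not the authors' code; data and function names are invented) that penalizes the slopes while leaving the intercept unpenalized, per the authors' caution:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logf11_logistic(X, y, penalize=None, n_iter=50, tol=1e-8):
    """Logistic regression with a log-F(1,1) penalty on selected coefficients.

    Penalty per coefficient b: b/2 - log(1 + exp(b)), maximized at b = 0.
    Column 0 of X is assumed to be the intercept and is left unpenalized.
    """
    n, p = X.shape
    if penalize is None:
        penalize = np.arange(1, p)  # penalize all slopes, not the intercept
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = expit(eta)
        grad = X.T @ (y - mu)          # score of the binomial log-likelihood
        W = mu * (1 - mu)
        hess = -(X.T * W) @ X          # observed information (negated)
        pb = expit(beta[penalize])
        grad[penalize] += 0.5 - pb     # penalty gradient: 1/2 - expit(b)
        hess[penalize, penalize] -= pb * (1 - pb)  # penalty curvature
        step = np.linalg.solve(hess, grad)
        beta = beta - step             # Newton ascent on the penalized log-lik
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated toy data: plain maximum likelihood diverges,
# but the penalized estimate stays finite.
X = np.column_stack([np.ones(8), np.r_[np.zeros(4), np.ones(4)]])
y = np.r_[np.zeros(4), np.ones(4)]
beta = logf11_logistic(X, y)
```

The same fit can also be obtained by data augmentation (adding a pseudo-record per penalized covariate with the outcome split half success, half failure), which is the "trivial to implement" route the abstract alludes to.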
Affiliation(s)
- Sander Greenland
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, CA, U.S.A.; Department of Statistics, College of Letters and Science, University of California, Los Angeles, CA, U.S.A
- Mohammad Ali Mansournia
- Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
36
Li Z, Liu H, Tu W. A sexually transmitted infection screening algorithm based on semiparametric regression models. Stat Med 2015; 34:2844-57. [PMID: 25900920 DOI: 10.1002/sim.6515] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2014] [Revised: 03/16/2015] [Accepted: 04/02/2015] [Indexed: 11/11/2022]
Abstract
Sexually transmitted infections (STIs) with Chlamydia trachomatis, Neisseria gonorrhoeae, and Trichomonas vaginalis are among the most common infectious diseases in the United States, disproportionately affecting young women. Because a significant portion of the infections present no symptoms, infection control relies primarily on disease screening. However, universal STI screening in a large population can be expensive. In this paper, we propose a semiparametric model-based screening algorithm. The model quantifies organism-specific infection risks in individual subjects and accounts for the within-subject interdependence of the infection outcomes of different organisms and the serial correlations among the repeated assessments of the same organism. Bivariate thin-plate regression spline surfaces are incorporated to depict the concurrent influences of age and sexual partners on infection acquisition. Model parameters are estimated by using a penalized likelihood method. For inference, we develop a likelihood-based resampling procedure to compare the bivariate effect surfaces across outcomes. Simulation studies are conducted to evaluate the model fitting performance. A screening algorithm is developed using data collected from an epidemiological study of young women at increased risk of STIs. We present evidence that the three organisms have distinct age and partner effect patterns; for C. trachomatis, the partner effect is more pronounced in younger adolescents. Predictive performance of the proposed screening algorithm is assessed through a receiver operating characteristic analysis. We show that the model-based screening algorithm has excellent accuracy in identifying individuals at increased risk, and thus can be used to assist STI screening in clinical practice.
Affiliation(s)
- Zhuokai Li
- Duke Clinical Research Institute, 2400 Pratt Street, Durham, NC 27705, U.S.A
- Hai Liu
- Department of Biostatistics, Indiana University Schools of Medicine and Public Health, 410 West 10th Street, Indianapolis, IN 46202, U.S.A
- Wanzhu Tu
- Department of Biostatistics, Indiana University Schools of Medicine and Public Health, 410 West 10th Street, Indianapolis, IN 46202, U.S.A
37
Affiliation(s)
- Xuefeng Wang
- Program in Public Health, Departments of Preventive Medicine, Biomedical Informatics, and Applied Mathematics and Statistics, Stony Brook University Stony Brook, NY, USA
38
Biard L, Porcher R, Resche-Rigon M. Permutation tests for centre effect on survival endpoints with application in an acute myeloid leukaemia multicentre study. Stat Med 2014; 33:3047-57. [PMID: 24676752 DOI: 10.1002/sim.6153] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2013] [Revised: 02/21/2014] [Accepted: 03/02/2014] [Indexed: 11/10/2022]
Abstract
When analysing multicentre data, it may be of interest to test whether the distribution of the endpoint varies among centres. In a mixed-effect model, testing for such a centre effect amounts to testing whether a random centre-effect variance component is zero. It has been shown that the usual asymptotic χ² distribution of the likelihood ratio and score statistics under the null does not necessarily hold. In the case of censored data, mixed-effects Cox models have been used to account for random effects, but few works have considered testing whether the variance component of the random effects is zero. We propose a permutation test, using random permutation of the cluster indices, to test for a centre effect in multilevel censored data. Results from a simulation study indicate that the permutation tests have correct type I error rates, contrary to standard likelihood ratio tests, and are more powerful. The proposed tests are illustrated using data from a multicentre clinical trial of induction therapy in acute myeloid leukaemia patients.
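The permutation scheme itself is generic: compute a between-centre statistic on the observed labels, recompute it under random reassignments of the cluster indices, and take the exceedance proportion as the p-value. The sketch below is deliberately simplified, substituting a toy statistic (variance of centre-specific mean follow-up times) for the mixed-effects Cox likelihood-ratio statistic the paper uses; data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def centre_stat(times, centres):
    """Toy between-centre statistic: variance of centre-specific means.
    Any statistic sensitive to a centre effect fits the same scheme."""
    means = [times[centres == c].mean() for c in np.unique(centres)]
    return np.var(means)

def permutation_pvalue(times, centres, n_perm=999):
    """Permutation test: randomly reassign cluster (centre) labels."""
    observed = centre_stat(times, centres)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(centres)   # random relabelling of centres
        if centre_stat(times, perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)    # add-one (Monte Carlo) correction

# Simulated data: 5 centres, 20 patients each, with a genuine centre effect
centres = np.repeat(np.arange(5), 20)
times = rng.exponential(scale=np.repeat([1.0, 1.5, 2.0, 3.0, 4.0], 20))
p = permutation_pvalue(times, centres)
```

With a strong simulated centre effect, the observed statistic sits far in the tail of the permutation distribution and p is small; under no centre effect the p-value is approximately uniform, which is what gives the test its correct type I error rate.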
Affiliation(s)
- L Biard
- Service de Biostatistique et Information Médicale, Hôpital Saint-Louis, AP-HP, F-75010 Paris, France; Université Paris Diderot - Paris 7, Sorbonne Paris Cité, F-75010 Paris, France; INSERM, ECSTRA Team, UMR-S 1153, F-75010 Paris, France
39
Tamuri AU, Goldman N, dos Reis M. A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics 2014; 197:257-71. [PMID: 24532780 DOI: 10.1534/genetics.114.162263] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open
Abstract
We develop a maximum penalized-likelihood (MPL) method to estimate the fitnesses of amino acids and the distribution of selection coefficients (S = 2Ns) in protein-coding genes from phylogenetic data. This improves on a previous maximum-likelihood method. Various penalty functions are used to penalize extreme estimates of the fitnesses, thus correcting overfitting by the previous method. Using a combination of computer simulation and real data analysis, we evaluate the effect of the various penalties on the estimation of the fitnesses and the distribution of S. We show the new method regularizes the estimates of the fitnesses for small, relatively uninformative data sets, but it can still recover the large proportion of deleterious mutations when present in simulated data. Computer simulations indicate that as the number of taxa in the phylogeny or the level of sequence divergence increases, the distribution of S can be more accurately estimated. Furthermore, the strength of the penalty can be varied to study how informative a particular data set is about the distribution of S. We analyze three protein-coding genes (the chloroplast rubisco protein, mammal mitochondrial proteins, and an influenza virus polymerase) and show the new method recovers a large proportion of deleterious mutations in these data, even under strong penalties, confirming the distribution of S is bimodal in these real data. We recommend the use of the new MPL approach for the estimation of the distribution of S in species phylogenies of protein-coding genes.
40
Abstract
This article examines the convergence properties of a Bayesian model selection procedure based on a non-local prior density in ultrahigh-dimensional settings. The performance of the model selection procedure is also compared to popular penalized likelihood methods. Coupling diagnostics are used to bound the total variation distance between iterates in a Markov chain Monte Carlo (MCMC) algorithm and the posterior distribution on the model space. In several simulation scenarios in which the number of observations exceeds 100, rapid convergence and high accuracy of the Bayesian procedure are demonstrated. Conversely, the coupling diagnostics are successful in diagnosing lack of convergence in several scenarios for which the number of observations is less than 100. The accuracy of the Bayesian model selection procedure in identifying high probability models is shown to be comparable to commonly used penalized likelihood methods, including extensions of smoothly clipped absolute deviation (SCAD) and least absolute shrinkage and selection operator (LASSO) procedures.
41
Lee D, Lee Y, Pawitan Y, Lee W. Sparse partial least-squares regression for high-throughput survival data analysis. Stat Med 2013; 32:5340-52. [PMID: 24105836 DOI: 10.1002/sim.5975] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2012] [Revised: 08/24/2013] [Accepted: 08/27/2013] [Indexed: 11/09/2022]
Abstract
The partial least-squares (PLS) method has been adapted to the Cox proportional hazards model for analyzing high-dimensional survival data. But because the latent components constructed in PLS employ all predictors regardless of their relevance, it is often difficult to interpret the results. In this paper, we propose a new formulation of the sparse PLS (SPLS) procedure for survival data to allow simultaneous sparse variable selection and dimension reduction. We develop a computing algorithm for SPLS by modifying an iteratively reweighted PLS algorithm and illustrate the method with the Swedish and the Netherlands Cancer Institute breast cancer datasets. Through the numerical studies, we find that our SPLS method generally performs better than the standard PLS and sparse Cox regression methods in variable selection and prediction.
Affiliation(s)
- Donghwan Lee
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, 17177 Stockholm, Sweden
42
Ghebremichael-Weldeselassie Y, Whitaker HJ, Farrington CP. Self-controlled case series method with smooth age effect. Stat Med 2013; 33:639-49. [PMID: 24038284 DOI: 10.1002/sim.5949] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2012] [Accepted: 07/29/2013] [Indexed: 11/07/2022]
Abstract
The self-controlled case series method, commonly used to investigate potential associations between vaccines and adverse events, requires information on cases only and automatically controls all age-independent multiplicative confounders while allowing for an age-dependent baseline incidence. In the parametric version of the method, we modelled the age-specific relative incidence by using a piecewise constant function, whereas in the semiparametric version, we left it unspecified. However, mis-specification of age groups in the parametric version can lead to biased estimates of exposure effect, and the semiparametric approach runs into computational problems when the number of cases in the study is moderately large. We thus propose to use a penalized likelihood approach where the age effect is modelled using splines. We use a linear combination of cubic M-splines to approximate the age-specific relative incidence and integrated splines for the cumulative relative incidence. We conducted a simulation study to evaluate the performance of the new approach and its efficiency relative to the parametric and semiparametric approaches. Results show that the new approach performs equivalently to the existing methods when the sample size is small and works well for large data sets. We applied the new spline-based approach to data on febrile convulsions and paediatric vaccines.
43
Leffondré K, Touraine C, Helmer C, Joly P. Interval-censored time-to-event and competing risk with death: is the illness-death model more accurate than the Cox model? Int J Epidemiol 2013; 42:1177-86. [PMID: 23900486 DOI: 10.1093/ije/dyt126] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND In survival analyses of longitudinal data, death is often a competing event for the disease of interest, and the time-to-disease onset is interval-censored when the diagnosis is made at intermittent follow-up visits. As a result, the disease status at death is unknown for subjects disease-free at the last visit before death. Standard survival analysis consists in right-censoring the time-to-disease onset at that visit, which may induce an underestimation of the disease incidence. By contrast, an illness-death model for interval-censored data accounts for the probability of developing the disease between that visit and death, and provides a better incidence estimate. However, the two approaches have never been compared for estimating the effect of exposure on disease risk.
METHODS This paper compares through simulations the accuracy of the effect estimates from a semi-parametric illness-death model for interval-censored data and the standard Cox model. The approaches are also compared for estimating the effects of selected risk factors on the risk of dementia, using the French elderly PAQUID cohort data.
RESULTS The illness-death model provided a more accurate effect estimate of exposures that also affected mortality. The direction and magnitude of the bias from the Cox model depended on the effects of the exposure on disease and death. The application to the PAQUID cohort confirmed the simulation results.
CONCLUSION If follow-up intervals are wide and the exposure has an impact on death, then the illness-death model for interval-censored data should be preferred to the standard Cox regression analysis.
Affiliation(s)
- Karen Leffondré
- University of Bordeaux, ISPED, Centre INSERM U897-Epidemiology-Biostatistics, Bordeaux, France
44
Ayers KL, Cordell HJ. Identification of grouped rare and common variants via penalized logistic regression. Genet Epidemiol 2013; 37:592-602. [PMID: 23836590 PMCID: PMC3842118 DOI: 10.1002/gepi.21746] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2012] [Revised: 05/24/2013] [Accepted: 05/24/2013] [Indexed: 11/09/2022]
Abstract
In spite of the success of genome-wide association studies in finding many common variants associated with disease, these variants seem to explain only a small proportion of the estimated heritability. Data collection has turned toward exome and whole genome sequencing, but it is well known that single marker methods frequently used for common variants have low power to detect rare variants associated with disease, even with very large sample sizes. In response, a variety of methods have been developed that attempt to cluster rare variants so that they may gather strength from one another under the premise that there may be multiple causal variants within a gene. Most of these methods group variants by gene or proximity, and test one gene or marker window at a time. We propose a penalized regression method (PeRC) that analyzes all genes at once, allowing grouping of all (rare and common) variants within a gene, along with subgrouping of the rare variants, thus borrowing strength from both rare and common variants within the same gene. The method can incorporate either a burden-based weighting of the rare variants or one in which the weights are data driven. In simulations, our method performs favorably when compared to many previously proposed approaches, including its predecessor, the sparse group lasso [Friedman et al., 2010].
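The two-level grouping idea can be illustrated with the proximal operator of the sparse group lasso that PeRC builds on: elementwise soft-thresholding (sparsity among rare variants within a gene) followed by groupwise shrinkage (selection of whole genes). This is a sketch of the penalty machinery only, not the authors' method; variable names and data are invented:

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_sparse_group(beta, groups, lam1, lam2):
    """Proximal operator of lam1 * ||b||_1 + lam2 * sum_g ||b_g||_2.

    Step 1 soft-thresholds within each group (within-gene sparsity);
    step 2 shrinks each group's norm and may zero the whole group
    (gene-level selection).
    """
    out = np.empty_like(beta)
    for g in np.unique(groups):
        idx = groups == g
        z = soft(beta[idx], lam1)                     # within-group sparsity
        norm = np.linalg.norm(z)
        scale = max(0.0, 1.0 - lam2 / norm) if norm > 0 else 0.0
        out[idx] = scale * z                          # groupwise shrinkage
    return out

# Two "genes" of three variants each: the second gene's weak signals
# are eliminated as a group, the first gene keeps its strong variant.
beta = np.array([2.0, -0.3, 0.05, 0.4, -0.2, 0.1])
groups = np.array([0, 0, 0, 1, 1, 1])
b = prox_sparse_group(beta, groups, lam1=0.1, lam2=0.5)
```

In a full fit this operator would be applied inside a proximal-gradient loop on the logistic log-likelihood; the burden-based or data-driven weights the abstract mentions would enter as per-variant rescalings of lam1.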
Affiliation(s)
- Kristin L Ayers
- Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne NE1 3BZ, United Kingdom.
45
Tong X, Zhu L, Leng C, Leisenring W, Robison LL. A general semiparametric hazards regression model: efficient estimation and structure selection. Stat Med 2013; 32:4980-94. [PMID: 23824784 DOI: 10.1002/sim.5885] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2012] [Accepted: 05/28/2013] [Indexed: 11/06/2022]
Abstract
We consider a general semiparametric hazards regression model that encompasses the Cox proportional hazards model and the accelerated failure time model for survival analysis. To overcome the nonexistence of the maximum likelihood, we derive a kernel-smoothed profile likelihood function and prove that the resulting estimates of the regression parameters are consistent and achieve semiparametric efficiency. In addition, we develop penalized structure selection techniques to determine which covariates constitute the accelerated failure time model and which covariates constitute the proportional hazards model. The proposed method is able to estimate the model structure consistently and model parameters efficiently. Furthermore, variance estimation is straightforward. The proposed estimation performs well in simulation studies and is applied to the analysis of a real data set.
Affiliation(s)
- Xingwei Tong
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
46
Adluru N, Hanlon BM, Lutz A, Lainhart JE, Alexander AL, Davidson RJ. Penalized likelihood phenotyping: unifying voxelwise analyses and multi-voxel pattern analyses in neuroimaging: penalized likelihood phenotyping. Neuroinformatics 2013; 11:227-47. [PMID: 23397550 PMCID: PMC3624987 DOI: 10.1007/s12021-012-9175-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
Abstract
Neuroimage phenotyping for psychiatric and neurological disorders is performed using voxelwise analyses also known as voxel based analyses or morphometry (VBM). A typical voxelwise analysis treats measurements at each voxel (e.g., fractional anisotropy, gray matter probability) as outcome measures to study the effects of possible explanatory variables (e.g., age, group) in a linear regression setting. Furthermore, each voxel is treated independently until the stage of correction for multiple comparisons. Recently, multi-voxel pattern analyses (MVPA), such as classification, have arisen as an alternative to VBM. The main advantage of MVPA over VBM is that the former employ multivariate methods which can account for interactions among voxels in identifying significant patterns. They also provide ways for computer-aided diagnosis and prognosis at individual subject level. However, compared to VBM, the results of MVPA are often more difficult to interpret and prone to arbitrary conclusions. In this paper, first we use penalized likelihood modeling to provide a unified framework for understanding both VBM and MVPA. We then utilize statistical learning theory to provide practical methods for interpreting the results of MVPA beyond commonly used performance metrics, such as leave-one-out-cross validation accuracy and area under the receiver operating characteristic (ROC) curve. Additionally, we demonstrate that there are challenges in MVPA when trying to obtain image phenotyping information in the form of statistical parametric maps (SPMs), which are commonly obtained from VBM, and provide a bootstrap strategy as a potential solution for generating SPMs using MVPA. This technique also allows us to maximize the use of available training data. We illustrate the empirical performance of the proposed framework using two different neuroimaging studies that pose different levels of challenge for classification using MVPA.
47
Fu P, Panneerselvam A, Clifford B, Dowlati A, Ma PC, Zeng G, Halmos B, Leidner RS. Simpson's paradox - aggregating and partitioning populations in health disparities of lung cancer patients. Stat Methods Med Res 2012; 24:937-48. [PMID: 22246415] [DOI: 10.1177/0962280211434179]
Abstract
It is well known that non-small cell lung cancer (NSCLC) is a heterogeneous group of diseases. Previous studies have demonstrated genetic variation in the epidermal growth factor receptor (EGFR) among different ethnic groups with NSCLC. Research by our group and others has recently shown a lower frequency of EGFR mutations in African Americans with NSCLC than in their White counterparts. In this study, we use our original data on EGFR pathway genetics in African American NSCLC as an example to illustrate that univariate analyses based on aggregating versus partitioning the data lead to contradictory results, in order to emphasize the importance of controlling for statistical confounding. We further investigate logistic regression approaches for data with separation, as is the case in our example data set, and apply appropriate methods to identify predictors of EGFR mutation. Our simulation shows that with separated or nearly separated data, penalized maximum likelihood (PML) produces estimates with the smallest bias and approximately maintains the nominal significance level, with statistical power equal to or better than that of maximum likelihood and exact conditional likelihood methods. Applying the PML method to our example data set shows that race and EGFR-FISH are independently significant predictors of EGFR mutation.
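The separation problem this abstract refers to is easy to reproduce: on perfectly separated data the unpenalized maximum-likelihood slope drifts toward infinity, while any penalty pins it down. The sketch below is a minimal numpy illustration in which a ridge penalty stands in for the Firth-type penalized maximum likelihood the paper actually studies; the data and settings are invented for the demo.

```python
import numpy as np

# Perfectly separated toy data: y = 1 exactly when x > 0.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])   # intercept + slope

def fit_logistic(X, y, lam=0.0, steps=50_000, lr=0.1):
    """Gradient ascent on the log-likelihood minus (lam/2)*||b||^2."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b = b + lr * (X.T @ (y - p) - lam * b)
    return b

b_ml = fit_logistic(X, y)            # slope keeps growing with more steps
b_pml = fit_logistic(X, y, lam=0.1)  # penalty yields a finite estimate
```

With separation, `b_ml[1]` grows without bound as `steps` increases (roughly logarithmically), whereas the penalized slope converges to a finite value — the qualitative behavior that motivates PML in this setting.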
Affiliation(s)
- P Fu, Seidman Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
- A Panneerselvam, Seidman Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
- B Clifford, Seidman Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
- A Dowlati, Seidman Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
- P C Ma, Cleveland Clinic, Taussig Cancer Institute, Cleveland, OH, USA.
- G Zeng, College of Education, Texas A&M University - Corpus Christi, Corpus Christi, TX, USA.
- B Halmos, Herbert Irving Comprehensive Cancer Center, Columbia University, New York, NY, USA.
- R S Leidner, Seidman Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA.
48
Rondeau V, Pignon JP, Michiels S. A joint model for the dependence between clustered times to tumour progression and deaths: a meta-analysis of chemotherapy in head and neck cancer. Stat Methods Med Res 2011; 24:711-29. [PMID: 22025414] [DOI: 10.1177/0962280211425578]
Abstract
The observation of time to tumour progression (TTP) or progression-free survival (PFS) may be terminated by a terminal event: deaths may be due to tumour progression, and the time to the major failure event (death) may be correlated with the TTP. The usual assumption of independence between the TTP process and death, required by many commonly used statistical methods, can therefore be violated. Furthermore, although the relationship between TTP and time to death is highly relevant to anti-cancer drug development and to the evaluation of TTP as a surrogate endpoint, statistical models that describe the dependence structure between these two outcomes are not frequently used. We propose a joint frailty model for the analysis of two survival endpoints, TTP (or PFS) and time to death, in the context of clustered data (e.g. at the centre or trial level). This approach allows us to evaluate simultaneously the prognostic effects of covariates on the two survival endpoints, while accounting both for the relationship between the outcomes and for data clustering. We show how maximum penalized likelihood estimation can be applied to nonparametric estimation of the continuous hazard functions in a general joint frailty model with right censoring and delayed entry. The model was motivated by a large meta-analysis of randomized trials in head and neck cancers (Meta-Analysis of Chemotherapy in Head and Neck Cancers), which investigated the efficacy of chemotherapy, as an adjunct to surgery and/or radiotherapy, on TTP or PFS and overall survival.
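In the notation commonly used for this class of models (sketched here as an assumption, since the abstract gives no formulas), a joint frailty model links the progression and death hazards of subject $j$ in cluster $i$ through a shared random effect $\omega_i$:

```latex
\lambda^{P}_{ij}(t \mid \omega_i) = \omega_i \,\lambda^{P}_{0}(t)\,
  \exp\!\big(\mathbf{X}_{ij}^{\top}\boldsymbol{\beta}_{1}\big),
\qquad
\lambda^{D}_{ij}(t \mid \omega_i) = \omega_i^{\alpha}\,\lambda^{D}_{0}(t)\,
  \exp\!\big(\mathbf{X}_{ij}^{\top}\boldsymbol{\beta}_{2}\big),
```

where $\omega_i$ is typically gamma-distributed with unit mean and variance $\theta$, and the power $\alpha$ controls how strongly the cluster effect on progression carries over to death. The penalized likelihood mentioned in the abstract adds a smoothness penalty (on the second derivative of the baseline hazards $\lambda^{P}_{0}$ and $\lambda^{D}_{0}$) so that both can be estimated as smooth continuous functions.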
Affiliation(s)
- Virginie Rondeau, INSERM, CR897 (Biostatistic), Bordeaux, F-33076, France; Université Bordeaux Segalen, Bordeaux, F-33076, France.
- Jean-Pierre Pignon, Department of Biostatistics and Epidemiology, Institut Gustave-Roussy, Villejuif, F-94805, France.
- Stefan Michiels, Institut Jules Bordet, Université Libre de Bruxelles, Brussels, Belgium.
49
Abstract
This paper reviews the literature on sparse high-dimensional models and discusses some applications in economics and finance. Recent developments in theory, methods, and implementations of penalized least squares and penalized likelihood methods are highlighted. These variable selection methods have proved effective in high-dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high-dimensional sparse modeling are also briefly discussed.
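The penalized least squares estimators surveyed here minimize a criterion of the form $\tfrac{1}{2n}\lVert y - X\beta\rVert^2 + \sum_j p_\lambda(|\beta_j|)$; with the $\ell_1$ penalty this is the LASSO, which the coordinate-descent sketch below solves by cyclic soft-thresholding. The data and tuning value are invented for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                         # running residual y - X b
    col_sq = (X ** 2).sum(axis=0) / n    # per-coordinate curvature
    for _ in range(sweeps):
        for j in range(p):
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (b[j] - new)  # keep residual in sync
            b[j] = new
    return b

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[0, 3]] = [2.0, -1.5]               # sparse truth
y = X @ beta + 0.1 * rng.normal(size=n)
b_hat = lasso_cd(X, y, lam=0.2)
```

The soft-thresholding step sets small coordinates exactly to zero, which is what makes these penalized estimators variable selectors rather than mere shrinkers.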
Affiliation(s)
- Jianqing Fan, Bendheim Center for Finance and Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544.
- Jinchi Lv, Information and Operations Management Department, Marshall School of Business, University of Southern California, Los Angeles, California 90089.
- Lei Qi, Bendheim Center for Finance and Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544.
50
Abstract
Semiparametric additive partial linear models, which contain both linear and nonlinear additive components, are more flexible than linear models and more efficient than general nonparametric regression models because they mitigate the "curse of dimensionality". In this paper, we propose a new estimation approach for these models in which polynomial splines approximate the additive nonparametric components, and we derive the asymptotic normality of the resulting estimators of the parametric components. We also develop a variable selection procedure that identifies significant linear components using the smoothly clipped absolute deviation (SCAD) penalty, and we show that the SCAD-based estimators of the non-zero linear components possess an oracle property. Simulations compare the performance of our approach with several other variable selection methods, including the Bayesian Information Criterion and the Least Absolute Shrinkage and Selection Operator (LASSO). The proposed approach is also applied to real data from a nutritional epidemiology study, in which we explore the relationship between plasma beta-carotene levels and personal characteristics (e.g., age, gender, body mass index (BMI)) as well as dietary factors (e.g., alcohol consumption, smoking status, cholesterol intake).
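The SCAD penalty referenced above (Fan and Li, 2001) is usually defined through its derivative, which applies LASSO-like shrinkage $\lambda$ to small coefficients, tapers it off linearly, and leaves large coefficients unpenalized — the source of the oracle property. The helper below is a small, assumed implementation of that derivative, not code from the paper.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """SCAD penalty derivative p'_lam(theta) for theta >= 0:
    lam on [0, lam], (a*lam - theta)/(a - 1) on (lam, a*lam], 0 beyond."""
    theta = np.asarray(theta, dtype=float)
    out = np.zeros_like(theta)
    out[theta <= lam] = lam
    mid = (theta > lam) & (theta <= a * lam)
    out[mid] = (a * lam - theta[mid]) / (a - 1.0)
    return out

# Small coefficients get full shrinkage lam, large ones none.
d = scad_deriv(np.array([0.5, 2.0, 5.0]), lam=1.0)
```

The default `a=3.7` is the value Fan and Li recommend from a Bayesian-risk argument; the zero derivative beyond `a*lam` is what keeps large estimated coefficients unbiased.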
Affiliation(s)
- Xiang Liu, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, U.S.A.
- Li Wang, Department of Statistics, University of Georgia, Athens, GA 30602, U.S.A.
- Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, U.S.A.