1.
Order selection for heterogeneous semiparametric hidden Markov models. Stat Med 2024; 43:2501-2526. [PMID: 38616718] [DOI: 10.1002/sim.10069] [Received: 05/19/2023] [Revised: 11/26/2023] [Accepted: 03/12/2024]
Abstract
Hidden Markov models (HMMs), which can characterize dynamic heterogeneity, are valuable tools for analyzing longitudinal data. In conventional analysis, the order of an HMM (ie, the number of hidden states) is typically assumed known or predetermined by some model selection criterion. Because prior information about the order is frequently lacking, pairwise comparisons under criterion-based methods become computationally expensive as the model space grows. A few studies have conducted order selection and parameter estimation simultaneously, but they only considered homogeneous parametric instances. This study proposes a Bayesian double penalization (BDP) procedure for simultaneous order selection and parameter estimation of heterogeneous semiparametric HMMs. To overcome the difficulties in updating the order, we create a new Markov chain Monte Carlo algorithm coupled with an effective adjust-bound reversible jump strategy. Simulation results reveal that the proposed BDP procedure performs well in estimation and works noticeably better than conventional criterion-based approaches. Application of the suggested method to the Alzheimer's Disease Neuroimaging Initiative study further supports its usefulness.
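The order K enters the likelihood through the forward recursion. As a hedged sketch (not the paper's BDP procedure, and with all parameters hypothetical), a criterion-based baseline would evaluate a log-likelihood like the one below for each candidate K and compare penalized values:

```python
import math

def hmm_log_likelihood(obs, init, trans, emit):
    """Scaled forward algorithm for a discrete-emission HMM.

    init[k]     : P(state_1 = k)
    trans[j][k] : P(state_{t+1} = k | state_t = j)
    emit[k][o]  : P(obs_t = o | state_t = k)
    The number of hidden states K = len(init) is the model order.
    """
    K = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(K)]
    log_lik = 0.0
    for o in obs[1:]:
        c = sum(alpha)              # rescale to avoid numerical underflow
        log_lik += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[j] * trans[j][k] for j in range(K)) * emit[k][o]
                 for k in range(K)]
    return log_lik + math.log(sum(alpha))
```

A BIC-style comparison would then penalize each candidate order's log-likelihood by its parameter count, which is exactly the pairwise model-space search the BDP procedure avoids.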
2.
Longitudinal varying coefficient single-index model with censored covariates. Biometrics 2024; 80:ujad006. [PMID: 38364803] [PMCID: PMC10871868] [DOI: 10.1093/biomtc/ujad006] [Received: 12/22/2022] [Revised: 08/26/2023] [Accepted: 10/31/2023]
Abstract
It is of interest to health policy research to estimate the population-averaged longitudinal medical cost trajectory from initial cancer diagnosis to death, and to understand how the trajectory curve is affected by patient characteristics. This research question poses several statistical challenges because longitudinal cost data are often non-normally distributed, with skewness, zero-inflation, and heteroscedasticity. The trajectory is nonlinear, and its length and shape depend on survival, which is subject to censoring. Modeling the association between multiple patient characteristics and nonlinear cost trajectory curves of varying lengths should balance parsimony, flexibility, and interpretability. We propose a novel longitudinal varying coefficient single-index model. Multiple patient characteristics are summarized in a single index, representing a patient's overall propensity for healthcare use. The effects of this index on various segments of the cost trajectory depend on both time and survival, which is flexibly modeled by a bivariate varying coefficient function. The model is estimated by generalized estimating equations with an extended marginal mean structure to accommodate censored survival time as a covariate. We establish pointwise confidence intervals for the varying coefficient and a test for the covariate effect. The numerical performance is extensively studied in simulations. We apply the proposed methodology to medical cost data of prostate cancer patients from the Surveillance, Epidemiology, and End Results-Medicare-Linked Database.
3.
Joint semiparametric kernel network regression. Stat Med 2023; 42:5247-5265. [PMID: 37724619] [DOI: 10.1002/sim.9910] [Received: 03/23/2021] [Revised: 08/28/2023] [Accepted: 09/05/2023]
Abstract
Variable selection and graphical modeling play essential roles in highly correlated and high-dimensional (HCHD) data analysis. Variable selection methods have been developed under both parametric and nonparametric model settings. However, variable selection for nonadditive, nonparametric regression with high-dimensional variables is challenging due to complications in modeling unknown dependence structures among HCHD variables. Gaussian graphical models are a popular and useful tool for investigating the conditional dependence between variables via estimating sparse precision matrices. For a given class of interest, the estimated precision matrices can be mapped onto networks for visualization. However, Gaussian graphical models are limited in that they are only applicable to discretized response variables and to the case when p log(p) ≪ n, where p is the number of variables and n is the sample size. It is therefore necessary to develop a joint method for variable selection and graphical modeling. To the best of our knowledge, methods for simultaneously performing variable selection and estimating networks among variables in semiparametric regression settings are quite limited. Hence, in this paper, we develop a joint semiparametric kernel network regression method to address this limitation and to provide a connection between the two tasks. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among those variables. We develop our approach under a semiparametric kernel machine regression framework, which allows for nonlinear or nonadditive associations and complicated interactions among the variables.
The advantages of our approach are that it can (1) simultaneously select variables and build a network among HCHD variables under a regression setting; (2) model unknown and complicated interactions among the variables and estimate the network among them; (3) allow for any form of semiparametric model, including nonadditive, nonparametric models; and (4) provide an interpretable network that relates important variables to a response variable. We demonstrate our approach using a simulation study and a real application to genetic pathway-based analysis.
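As a small, generic illustration of the graphical-model ingredient (standard Gaussian graphical model algebra, not the paper's joint method): a zero off-diagonal entry of the precision matrix encodes conditional independence, and the strength of an edge is the partial correlation computed from the precision matrix:

```python
def partial_correlation(prec, i, j):
    # Partial correlation between variables i and j given all others,
    # computed from the precision (inverse covariance) matrix. A zero
    # off-diagonal entry means conditional independence, i.e. no edge
    # between i and j in the estimated network.
    return -prec[i][j] / (prec[i][i] * prec[j][j]) ** 0.5
```

Thresholding these values (or the sparsity pattern of a penalized precision-matrix estimate) is what turns an estimated precision matrix into a visualizable network.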
4.
A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders. J Am Stat Assoc 2023; 119:27-38. [PMID: 38706706] [PMCID: PMC11068237] [DOI: 10.1080/01621459.2023.2261184] [Received: 07/23/2022] [Accepted: 09/03/2023]
Abstract
Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject's decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but this decision process is not necessarily linear, we model reward sensitivity with a non-decreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task.
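The decision model can be sketched in a few lines. This is a schematic, assumed form only: the square root is a hypothetical stand-in for the non-decreasing, nonlinear sensitivity function the paper estimates with I-splines, and the softmax choice rule is a common convention rather than a detail taken from the paper:

```python
import math

def reward_sensitivity(r):
    # Hypothetical non-decreasing, nonlinear sensitivity function;
    # the paper estimates this nonparametrically via I-splines.
    return math.sqrt(max(r, 0.0))

def q_update(q, r, alpha):
    # Prediction-error update weighted by the subject-specific learning
    # rate alpha, applied to the sensitivity-transformed reward.
    return q + alpha * (reward_sensitivity(r) - q)

def choice_prob(q_values, beta=1.0):
    # Softmax probability of choosing the first action.
    exps = [math.exp(beta * q) for q in q_values]
    return exps[0] / sum(exps)
```

Separating `alpha` (learning) from `reward_sensitivity` (valuation) is exactly what lets the analysis ask whether MDD abnormalities lie in learning ability or reward sensitivity.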
5.
Change-plane analysis for subgroup detection with a continuous treatment. Biometrics 2023; 79:1920-1933. [PMID: 36134534] [PMCID: PMC10030385] [DOI: 10.1111/biom.13762] [Received: 11/29/2021] [Accepted: 09/14/2022]
Abstract
Detecting and characterizing subgroups with differential effects of a binary treatment has been widely studied and led to improvements in patient outcomes and population risk management. Under the setting of a continuous treatment, however, such investigations remain scarce. We propose a semiparametric change-plane model and consequently a doubly robust test statistic for assessing the existence of two subgroups with differential treatment effects under a continuous treatment. The proposed testing procedure is valid when either the baseline function for the covariate effects or the generalized propensity score function for the continuous treatment is correctly specified. The asymptotic distributions of the test statistic under the null and local alternative hypotheses are established. When the null hypothesis of no subgroup is rejected, the change-plane parameters that define the subgroups can be estimated. This paper provides a unified framework of the change-plane method to handle various types of outcomes, including the exponential family of distributions and time-to-event outcomes. Additional extensions with nonparametric estimation approaches are also provided. We evaluate the performance of our proposed methods through extensive simulation studies under various scenarios. An application to the Health Effects of Arsenic Longitudinal Study with a continuous environmental exposure of arsenic is presented.
6.
Semiparametric pseudo-score and pseudo-likelihood for evaluating correlate of protection in vaccine trials. Stat Med 2023. [PMID: 37248751] [DOI: 10.1002/sim.9807] [Received: 11/29/2021] [Revised: 03/03/2023] [Accepted: 05/08/2023]
Abstract
In vaccine clinical trials, vaccine efficacy endpoint analysis is usually associated with high cost or extended study duration, due to the generally low infection rate. A correlate of protection (CoP), a surrogate endpoint (usually an immunological response) that can reliably predict the treatment effect, provides a more efficient and less costly approach to evaluating a vaccine. To handle the challenge of missingness in the unobserved surrogate immune biomarker, the pseudo-score (PS), semiparametric, and pseudo-likelihood (PL) methods each have advantages in different respects. In this article, we propose new methodologies that combine the advantages of the PS and PL methods with semiparametric methods, respectively, to achieve higher estimation efficiency, allow continuous baseline predictor variables, and handle multiple surrogate markers. The advantages of our methodologies are demonstrated in a simulation study under different settings and in a case study; ultimately they can improve the chance of a successful trial.
7.
Noniterative adjustment to regression estimators with population-based auxiliary information for semiparametric models. Biometrics 2023; 79:140-150. [PMID: 34693991] [DOI: 10.1111/biom.13585] [Received: 12/20/2020] [Revised: 10/06/2021] [Accepted: 10/08/2021]
Abstract
Disease registries, surveillance data, and other datasets with extremely large sample sizes are increasingly available, providing population-based information on disease incidence, survival probability, and other important public health characteristics. Such information can be leveraged in studies that collect detailed measurements but have smaller sample sizes. In contrast to recent proposals that formulate additional information as constraints in optimization problems, we develop a general framework to construct simple estimators that update the usual regression estimators with functionals of the data that incorporate the additional information. We consider general settings that involve nuisance parameters in the auxiliary information, non-i.i.d. data such as those from case-control studies, and semiparametric models with infinite-dimensional parameters common in survival analysis. Details of several important data and sampling settings are provided with numerical examples.
8.
Highly robust causal semiparametric U-statistic with applications in biomedical studies. Int J Biostat 2022; 0:ijb-2022-0047. [PMID: 36433631] [PMCID: PMC10225018] [DOI: 10.1515/ijb-2022-0047] [Received: 04/19/2022] [Accepted: 10/31/2022]
Abstract
With our increased ability to capture large data, causal inference has received renewed attention and is playing an ever more important role in biomedicine and economics. However, one major methodological hurdle is that existing methods rely on many unverifiable model assumptions. Robust modeling is thus a critically important approach, complementary to sensitivity analysis, which compares results under various model assumptions: the more robust a method is with respect to model assumptions, the more valuable it is. The doubly robust estimator (DRE) is a significant advance in this direction. However, in practice, many outcome measures are functionals of multiple distributions, and so are the associated estimands, which can only be estimated via U-statistics; most existing DREs therefore do not apply. This article proposes a broad class of highly robust U-statistic estimators (HREs), which use semiparametric specifications for both the propensity score and outcome models in constructing the U-statistic. The HRE is thus more robust than existing DREs. We derive comprehensive asymptotic properties of the proposed estimators and perform extensive simulation studies to evaluate their finite sample performance against the corresponding parametric U-statistics and naive estimators; the comparisons show significant advantages for the HREs. We then apply the method to analyze a clinical trial from the AIDS Clinical Trials Group.
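For intuition, the simplest two-sample U-statistic of this kind is the Mann-Whitney kernel averaged over all treated-control pairs; the paper's HRE would further weight such terms using semiparametric propensity-score and outcome models, so the sketch below is only the naive baseline:

```python
from itertools import product

def mann_whitney_u(y_treat, y_ctrl):
    # Naive (unweighted) two-sample U-statistic estimating
    # P(Y_treat > Y_ctrl): the indicator kernel averaged over
    # all treated-control pairs.
    pairs = list(product(y_treat, y_ctrl))
    return sum(1.0 for a, b in pairs if a > b) / len(pairs)
```

Because the estimand is a functional of two distributions, it has no single-sample mean representation, which is why a standard DRE construction does not directly apply.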
9.
Extending multivariate Student's-t semiparametric mixed models for longitudinal data with censored responses and heavy tails. Stat Med 2022; 41:3696-3719. [PMID: 35596519] [DOI: 10.1002/sim.9443] [Received: 10/05/2020] [Revised: 04/25/2022] [Accepted: 05/10/2022]
Abstract
This article extends the semiparametric mixed model for longitudinal censored data with Gaussian errors by considering the Student's t-distribution. This model allows us to consider a flexible, functional dependence of an outcome variable over the covariates using nonparametric regression. Moreover, the proposed model takes into account the correlation between observations by using random effects. Penalized likelihood equations are applied to derive the maximum likelihood estimates that appear to be robust against outlying observations with respect to the Mahalanobis distance. We estimate nonparametric functions using smoothing splines under an EM-type algorithm framework. Finally, the proposed approach's performance is evaluated through extensive simulation studies and an application to two datasets from acquired immunodeficiency syndrome clinical trials.
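The robustness to outliers comes from the E-step of the EM algorithm for the t-distribution, where each observation receives a weight that shrinks with its squared Mahalanobis distance. This is the standard EM-for-t weight (a generic textbook formula, not this paper's full penalized algorithm), with nu the degrees of freedom and p the response dimension:

```python
def t_em_weight(mahal_sq, nu, p):
    # E-step weight for the multivariate Student's t model:
    # observations with large squared Mahalanobis distance are
    # downweighted, which is the source of robustness to heavy tails.
    # As nu -> infinity the weight tends to 1 (the Gaussian case).
    return (nu + p) / (nu + mahal_sq)
```

In the M-step these weights multiply each observation's contribution to the estimating equations, so outlying trajectories pull the fit far less than under Gaussian errors.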
10.
Identification of subgroups via partial linear regression modeling approach. Biom J 2021; 64:506-522. [PMID: 34897799] [DOI: 10.1002/bimj.202000331] [Received: 11/02/2020] [Revised: 07/22/2021] [Accepted: 08/07/2021]
Abstract
In clinical trials, treatment effects often vary from subject to subject; some subjects may benefit more than others from a specific treatment. One of the aims of subgroup analysis is to identify whether there are subgroups of subjects with differential treatment effects. As in standard analysis, we first test whether subgroups with differential treatment effects exist; if they do, we classify the subjects into different subgroups based on their covariate profiles; otherwise, we conclude that no subgroups in this population have differential treatment effects. Existing methods utilize regression models, particularly linear models, for such analysis. In practice, however, not all effects of covariates on responses are linear. To address this issue, the article proposes a more flexible model, a partial linear model with a nonlinear monotone function describing the effects of some covariates and a linear component describing the effects of the others, develops a model-fitting algorithm, and derives model asymptotics. We then utilize the Wald statistic to test the existence of subgroups and the Neyman-Pearson rule to classify subjects into subgroups. Simulation studies evaluate the finite sample performance of the proposed method in comparison with commonly used linear models. Finally, we apply the methods to a real clinical trial.
11.
A semiparametric Gumbel regression model for analyzing longitudinal data with non-normal tails. Stat Med 2021; 41:736-750. [PMID: 34816477] [DOI: 10.1002/sim.9248] [Received: 03/07/2021] [Revised: 09/14/2021] [Accepted: 10/17/2021]
Abstract
Abnormal longitudinal values of biomarkers can be a sign of abnormal status or signal the development of a disease. Identifying new biomarkers for early and efficient disease detection is crucial for disease prevention. Relative to the majority of the healthy general population, abnormal values lie in the tails of the biomarker distribution. Thus, parametric regression models that accommodate abnormal biomarker values can better detect the association between biomarkers and disease. In this article, we propose semiparametric Gumbel regression models that (1) handle longitudinal continuous biomarker outcomes, (2) flexibly model the time-effect on the outcome, and (3) account for measurement error in the biomarker measurements. We adopt the EM algorithm, in combination with a two-dimensional grid search, to estimate the regression parameters and the time-effect function. We propose an efficient asymptotic variance estimator for the regression parameter estimates; the proposed estimator is asymptotically unbiased both in theory and in simulation studies. We apply the proposed model and two other models to investigate associations between fasting blood glucose and potential risk factors in a diabetes ancillary study to the Atherosclerosis Risk in Communities (ARIC) study, and illustrate the real data application by fitting the proposed regression model and graphically evaluating goodness of fit.
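As a minimal sketch of the distributional building block only (the paper's EM algorithm with grid search, time-effect function, and measurement-error components are not reproduced here), the Gumbel log-density that such a likelihood would sum over observations is:

```python
import math

def gumbel_logpdf(y, mu, beta):
    # Log-density of the Gumbel (type-I extreme value) distribution
    # with location mu and scale beta > 0; its heavy right tail makes
    # it suitable for biomarkers whose abnormal values sit in the tail.
    z = (y - mu) / beta
    return -math.log(beta) - z - math.exp(-z)
```

The density peaks at y = mu and decays slowly to the right, which is what lets the model assign realistic probability to extreme biomarker values that a Gaussian fit would treat as near-impossible.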
12.
Discrete-time survival data with longitudinal covariates. Stat Med 2020; 39:4372-4385. [PMID: 32871614] [DOI: 10.1002/sim.8729] [Received: 02/04/2020] [Revised: 07/21/2020] [Accepted: 07/25/2020]
Abstract
Survival analysis has conventionally been performed on a continuous time scale. In practice, the survival time is often recorded or handled on a discrete scale; when this is the case, discrete-time survival analysis provides results more relevant to the actual data scale. Besides, data on time-dependent covariates in survival analysis are usually collected through intermittent follow-ups, resulting in missing and mismeasured covariate data. In this work, we propose the sufficient discrete hazard (SDH) approach to discrete-time survival analysis with longitudinal covariates that are subject to missingness and mismeasurement. The SDH method employs the conditional score idea, available for dealing with mismeasured covariates, and penalized least squares with a regression spline basis for estimating the missing covariate values. The SDH method is developed for single event analysis with the logistic discrete hazard model, and for competing risks analysis with the multinomial logit model. Simulation results reveal good finite-sample performance of the proposed estimator and the associated asymptotic theory. The proposed SDH method is applied, for illustration, to the scleroderma lung study data, where the time to medication withdrawal and time to death were recorded discretely in months.
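A logistic discrete hazard model is typically fit after expanding each subject's record into person-period rows, one per discrete time unit at risk. This is the standard data-expansion step only, without the conditional-score and spline machinery of the SDH method:

```python
def to_person_period(subjects):
    # Expand {id: (time, event)} records into person-period rows
    # (id, period, event_indicator), so that the discrete hazard
    # P(T = t | T >= t) can be fit by ordinary logistic regression
    # on the expanded data. event = 1 for failure, 0 for censoring.
    rows = []
    for sid, (time, event) in subjects.items():
        for t in range(1, time + 1):
            rows.append((sid, t, int(event and t == time)))
    return rows
```

A censored subject contributes all-zero rows, while a subject who fails at time t contributes zeros up to t - 1 and a one at t; time-dependent covariates would be attached to each row.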
13.
Using sufficient direction factor model to analyze latent activities associated with breast cancer survival. Biometrics 2020; 76:1340-1350. [PMID: 31860141] [PMCID: PMC7305041] [DOI: 10.1111/biom.13208] [Received: 03/23/2018] [Revised: 09/20/2019] [Accepted: 12/16/2019]
Abstract
High-dimensional gene expression data often exhibit intricate correlation patterns as the result of coordinated genetic regulation. In practice, however, it is difficult to directly measure these coordinated underlying activities. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes. Compared to existing approaches, our proposed procedure has several unique characteristics. In the first stage, an important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. Second, to characterize the molecular heterogeneity of breast cancer, our approach provides estimates specific to each cancer subtype. Finally, our proposed framework incorporates a sparsity condition, reflecting the fact that genetic networks are often sparse. In the second stage, we investigate the relationship between latent factor activity levels and censored survival time using a general dimension reduction model in the survival analysis context. Combining the factor model and the sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data.
14.
A Semiparametric Bayesian Approach to Dropout in Longitudinal Studies with Auxiliary Covariates. J Comput Graph Stat 2020; 29:1-12. [PMID: 33013150] [DOI: 10.1080/10618600.2019.1617159]
Abstract
We develop a semiparametric Bayesian approach to missing outcome data in longitudinal studies in the presence of auxiliary covariates. We consider a joint model for the full data response, missingness and auxiliary covariates. We include auxiliary covariates to "move" the missingness "closer" to missing at random (MAR). In particular, we specify a semiparametric Bayesian model for the observed data via Gaussian process priors and Bayesian additive regression trees. These model specifications allow us to capture non-linear and non-additive effects, in contrast to existing parametric methods. We then separately specify the conditional distribution of the missing data response given the observed data response, missingness and auxiliary covariates (i.e. the extrapolation distribution) using identifying restrictions. We introduce meaningful sensitivity parameters that allow for a simple sensitivity analysis. Informative priors on those sensitivity parameters can be elicited from subject-matter experts. We use Monte Carlo integration to compute the full data estimands. Performance of our approach is assessed using simulated datasets. Our methodology is motivated by, and applied to, data from a clinical trial on treatments for schizophrenia.
15.
Joint analysis of panel count and interval-censored data using distribution-free frailty analysis. Biom J 2020; 62:1164-1175. [PMID: 32022280] [DOI: 10.1002/bimj.201900134] [Received: 05/02/2019] [Revised: 07/04/2019] [Accepted: 08/01/2019]
Abstract
We propose a joint analysis of recurrent and nonrecurrent event data subject to general types of interval censoring. The proposed analysis allows for general semiparametric models, including the Box-Cox transformation and inverse Box-Cox transformation models for the recurrent and nonrecurrent events, respectively. A frailty variable is used to account for the potential dependence between the recurrent and nonrecurrent event processes, while leaving the distribution of the frailty unspecified. We apply the pseudolikelihood for interval-censored recurrent event data, usually termed as panel count data, and the sufficient likelihood for interval-censored nonrecurrent event data by conditioning on the sufficient statistic for the frailty and using the working assumption of independence over examination times. Large sample theory and a computation procedure for the proposed analysis are established. We illustrate the proposed methodology by a joint analysis of the numbers of occurrences of basal cell carcinoma over time and time to the first recurrence of squamous cell carcinoma based on a skin cancer dataset, as well as a joint analysis of the numbers of adverse events and time to premature withdrawal from study medication based on a scleroderma lung disease dataset.
16.
Penalized integrative semiparametric interaction analysis for multiple genetic datasets. Stat Med 2019; 38:3221-3242. [PMID: 30993736] [DOI: 10.1002/sim.8172] [Received: 11/09/2018] [Revised: 02/08/2019] [Accepted: 03/27/2019]
Abstract
In this article, we consider a semiparametric additive partially linear interaction model for the integrative analysis of multiple genetic datasets. The goals are to identify important genetic predictors and gene-gene interactions and to estimate the nonparametric functions that describe the environmental effects at the same time. To find the similarities and differences of the genetic effects across different datasets, we impose a group structure on the regression coefficients matrix under the homogeneity assumption, ie, models for different datasets share the same sparsity structure, but the coefficients may differ across datasets. We develop an iterative approach to estimate the parameters of main effects, interactions and nonparametric functions, where a reparametrization of interaction parameters is implemented to meet the strong hierarchy assumption. We demonstrate the advantages of the proposed method in identification, estimation, and prediction in a series of numerical studies. We also apply the proposed method to the Skin Cutaneous Melanoma data and the lung cancer data from the Cancer Genome Atlas.
17.
Bayesian Detection of Abnormal Asynchrony of Division Between Sister Cells in Mutant Caenorhabditis elegans Embryos. J Comput Biol 2019; 26:495-505. [PMID: 30964328] [DOI: 10.1089/cmb.2018.0246]
Abstract
Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis, but how division timings are regulated among cells during development is poorly understood. In this article, we focus on the comparison of the asynchrony of division, that is, the difference in lifetime between sister cells (ADS), between wild-type and mutant individuals of Caenorhabditis elegans. On the one hand, due to the extreme imbalance between the numbers of wild-type and mutant samples, direct comparison of the two ADS distributions is not feasible. On the other hand, we found that ADS is correlated with the lifetime of the corresponding mother cell in the wild type. Hence, a semiparametric Bayesian quantile regression method, with the lifetime of the mother cell as covariate, is developed to estimate the 95% confidence curve of ADS in the wild type; an ADS value in a mutant is then classified as abnormal if it falls outside the corresponding confidence interval. The high accuracy of our method is demonstrated in a large-scale simulation study. Real data analysis shows that ADS is quantitatively related to gene function and expression.
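Quantile regression rests on the check (pinball) loss: at level tau, the constant minimizing total check loss is a tau-th sample quantile, and the regression version replaces the constant with a curve in the covariate (here, mother-cell lifetime). This generic sketch shows the loss and the constant-fit special case, not the paper's Bayesian sampler:

```python
def check_loss(residual, tau):
    # Pinball/check loss underlying quantile regression: positive
    # residuals cost tau per unit, negative residuals cost (1 - tau).
    return residual * (tau - (residual < 0))

def best_constant(data, tau):
    # Among the observed values, the one minimizing total check loss;
    # this recovers a tau-th sample quantile.
    return min(data, key=lambda c: sum(check_loss(y - c, tau) for y in data))
```

Fitting such curves at tau = 0.025 and 0.975 in the wild type would bracket a central 95% band, which is the kind of reference envelope against which mutant ADS values are judged.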
18.
Semiparametric varying-coefficient regression analysis of recurrent events with applications to treatment switching. Stat Med 2018; 37:3959-3974. [PMID: 29992591] [DOI: 10.1002/sim.7856] [Received: 10/04/2017] [Revised: 04/19/2018] [Accepted: 06/01/2018]
Abstract
This paper investigates semiparametric statistical methods for recurrent events. The mean number of recurrent events is modeled with a generalized semiparametric varying-coefficient model that can flexibly capture three types of covariate effects: time-constant effects, time-varying effects, and covariate-varying effects. We assume that the time-varying effects are unspecified functions of time and that the covariate-varying effects are parametric functions of an exposure variable, specified up to a finite number of unknown parameters. Different link functions can be selected to provide a rich family of models for recurrent events data. Profile estimation methods are developed for the parametric and nonparametric components, and the asymptotic properties are established. We also develop hypothesis testing procedures to test the validity of the parametric forms of the covariate-varying effects. The simulation study shows that both the estimation and hypothesis testing procedures perform well. The proposed method is applied to a data set from an acyclovir study to investigate whether acyclovir treatment reduces the mean number of relapse recurrences.
|
19
|
The PSID and Income Volatility: Its Record of Seminal Research and Some New Findings. THE ANNALS OF THE AMERICAN ACADEMY OF POLITICAL AND SOCIAL SCIENCE 2018; 680:48-81. [PMID: 31666745 PMCID: PMC6820686 DOI: 10.1177/0002716218791766] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/09/2023]
Abstract
The Panel Study of Income Dynamics (PSID) has made more contributions to the study of income volatility than any other dataset in the United States. Its record of providing data for seminal research is unmatched. In this article, we first present the reasons that the PSID has made such major contributions to research on the topic. Then we review the major papers that have used the PSID to study income volatility, comparing their results to those using other datasets. Last, we present new results for income volatility among U.S. men through 2014, finding that both gross volatility and the variance of transitory shocks display a three-phase trend: upward trends from the 1970s to the 1980s, a stable period in the 1990s through the early 2000s, and a large increase during the Great Recession.
|
20
|
A Truncation Model for Estimating Species Richness. Int J Biostat 2018; 15:ijb-2017-0035. [PMID: 30048236 DOI: 10.1515/ijb-2017-0035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2017] [Accepted: 06/19/2018] [Indexed: 11/15/2022]
Abstract
We propose a truncation model for the abundance distribution in species richness estimation. This model is inherently semiparametric and incorporates an unknown truncation threshold between rare and abundant observations. Using the conditional likelihood, we derive a class of estimators for the parameters in this model by stepwise maximization. The species richness estimator is the integer maximizing the binomial likelihood, given all other parameters in the model. Under regularity conditions, we show that our estimators of the model parameters are asymptotically efficient. We recover Chao's lower bound estimator of species richness when the parametric part of the model is a single-component Poisson; thus our class of estimators strictly generalizes the latter. We illustrate the performance of the proposed method in a simulation study, comparing it favorably with other widely used estimators, and give an application to estimating the number of distinct vocabulary words in French playwright Molière's Tartuffe.
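The classical special case recovered here, Chao's (1984) lower bound, has a closed form built from the singleton and doubleton frequencies; a minimal sketch (the paper's more general estimator is not shown):

```python
from collections import Counter

def chao1(counts):
    """Chao's lower-bound estimate of species richness.

    counts: abundance of each observed species (zeros ignored).
    Uses singleton/doubleton frequencies f1 and f2:
        S_hat = S_obs + f1^2 / (2 * f2)
    with the bias-corrected form f1*(f1-1)/(2*(f2+1)) when f2 == 0.
    """
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)
    freq = Counter(counts)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Example: 5 species observed; two singletons, one doubleton
print(chao1([1, 1, 2, 4, 7]))  # 5 + 2^2/(2*1) = 7.0
```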
|
21
|
Subgroup analysis with semiparametric models toward precision medicine. Stat Med 2018; 37:1830-1845. [PMID: 29575056 DOI: 10.1002/sim.7638] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/20/2017] [Revised: 01/23/2018] [Accepted: 01/26/2018] [Indexed: 11/11/2022]
Abstract
In analyzing clinical trials, one important objective is to classify patients into treatment-favorable and nonfavorable subgroups. Existing parametric methods are not robust, and the commonly used classification rules ignore the fact that the implications of the treatment-favorable and nonfavorable subgroups can be different. To address these issues, we propose a semiparametric model incorporating both our knowledge and our uncertainty about the true model. A Wald statistic is used to test the existence of subgroups, and the Neyman-Pearson rule is used to classify each subject. Asymptotic properties are derived, simulation studies are conducted to evaluate the performance of the method, and the method is then applied to data from a real-world trial.
|
22
|
Hypothesis tests for stratified mark-specific proportional hazards models with missing covariates, with application to HIV vaccine efficacy trials. Biom J 2018; 60:516-536. [PMID: 29488249 DOI: 10.1002/bimj.201700002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 01/12/2017] [Revised: 08/13/2017] [Accepted: 11/09/2017] [Indexed: 11/06/2022]
Abstract
This article develops hypothesis testing procedures for the stratified mark-specific proportional hazards model with missing covariates, where the baseline functions may vary with strata. The mark-specific proportional hazards model has been studied to evaluate mark-specific relative risks, where the mark is the genetic distance of an infecting HIV sequence to an HIV sequence represented inside the vaccine. This research is motivated by analysis of the RV144 phase 3 HIV vaccine efficacy trial, to understand associations of immune response biomarkers with the mark-specific hazard of HIV infection, where the biomarkers are sampled via a two-phase nested case-control design. We test whether the mark-specific relative risks are unity and how they change with the mark. The developed procedures enable assessment of whether the risk of HIV infection with HIV variants close to or far from the vaccine sequence is modified by immune responses induced by the HIV vaccine; this question is interesting because vaccine protection occurs through immune responses directed at specific HIV sequences. The test statistics are constructed based on augmented inverse probability weighted complete-case estimators. The asymptotic properties of the testing procedures are investigated, demonstrating double robustness and the effectiveness of the predictive auxiliaries in recovering efficiency. The finite-sample performance of the proposed tests is examined through a comprehensive simulation study. The methods are applied to the RV144 trial.
|
23
|
Joint two-part Tobit models for longitudinal and time-to-event data. Stat Med 2017; 36:4214-4229. [PMID: 28795414 DOI: 10.1002/sim.7429] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 10/20/2016] [Revised: 04/20/2017] [Accepted: 07/07/2017] [Indexed: 11/06/2022]
Abstract
In this article, we show how Tobit models can address the problem of identifying characteristics of subjects with left-censored outcomes, in the context of developing a method for jointly analyzing time-to-event and longitudinal data. Methods exist for handling these types of data separately, but they may not be appropriate when the time to event depends on the longitudinal outcome and a substantial portion of values are reported to be below the limits of detection. An alternative approach is to develop a joint model for the time-to-event outcome and a two-part longitudinal outcome, linking them through random effects. We implement this approach to assess the association between the risk of decline of the CD4/CD8 ratio and rates of change in viral load, and to discriminate patients who are potential progressors to AIDS from those who are not. We develop a fully Bayesian approach for fitting joint two-part Tobit models and illustrate the proposed methods on simulated and real data from an AIDS clinical study.
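The Tobit ingredient combines a density term for observed values with a probability mass term for values reported only as below the detection limit. A minimal sketch of that log-likelihood under normal errors (the article's full joint two-part model is far richer; names and signature are illustrative):

```python
import math

def tobit_loglik(y, censored, mu, sigma, limit):
    """Log-likelihood of left-censored normal observations.

    y:         observed values (ignored where censored is True)
    censored:  True if the value is below the detection limit
    mu, sigma: normal mean and standard deviation
    limit:     detection limit L
    """
    def log_norm_pdf(x):
        # log density of N(mu, sigma^2) at x
        z = (x - mu) / sigma
        return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

    def log_norm_cdf(x):
        # log P(Y <= x) under N(mu, sigma^2)
        return math.log(0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))))

    return sum(log_norm_cdf(limit) if c else log_norm_pdf(v)
               for v, c in zip(y, censored))
```

Each censored observation contributes Φ((L − μ)/σ) rather than a density value, which is what lets the model use, rather than discard, values below the limit of detection.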
|
24
|
Orthogonality of the Mean and Error Distribution in Generalized Linear Models. COMMUN STAT-THEOR M 2016; 46:3290-3296. [PMID: 28435181 PMCID: PMC5396964 DOI: 10.1080/03610926.2013.851241] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 03/10/2013] [Accepted: 09/20/2013] [Indexed: 10/20/2022]
Abstract
We show that the mean-model parameter is always orthogonal to the error distribution in generalized linear models. Thus, the maximum likelihood estimator of the mean-model parameter will be asymptotically efficient regardless of whether the error distribution is known completely, known up to a finite vector of parameters, or left completely unspecified, in which case the likelihood is taken to be an appropriate semiparametric likelihood. Moreover, the maximum likelihood estimator of the mean-model parameter will be asymptotically independent of the maximum likelihood estimator of the error distribution. This generalizes some well-known results for the special cases of normal, gamma and multinomial regression models, and, perhaps more interestingly, suggests that asymptotically efficient estimation and inferences can always be obtained if the error distribution is nonparametrically estimated along with the mean. In contrast, estimation and inferences using misspecified error distributions or variance functions are generally not efficient.
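The orthogonality claim can be stated compactly: writing β for the mean-model parameter and η for the error-distribution component, orthogonality means the Fisher information is block diagonal (a sketch in standard notation, not the article's exact display):

```latex
I_{\beta\eta} \;=\; -\,E\!\left[\frac{\partial^{2} \ell(\beta,\eta)}{\partial \beta\, \partial \eta^{\top}}\right] \;=\; 0,
\qquad
I(\beta,\eta) \;=\;
\begin{pmatrix}
I_{\beta\beta} & 0 \\
0 & I_{\eta\eta}
\end{pmatrix} ,
```

so the asymptotic variance of the maximum likelihood estimator of β is $I_{\beta\beta}^{-1}$ whether η is known, parametric, or estimated nonparametrically, and $\hat{\beta}$ is asymptotically independent of $\hat{\eta}$.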
|
25
|
Time-varying coefficients models for recurrent event data when different varying coefficients admit different degrees of smoothness: application to heart disease modeling. Stat Med 2016; 35:4166-82. [PMID: 27238093 DOI: 10.1002/sim.6995] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/25/2015] [Revised: 04/12/2016] [Accepted: 04/26/2016] [Indexed: 11/09/2022]
Abstract
We consider a class of semiparametric marginal rate models for analyzing recurrent event data. In these models, both time-varying and time-free effects are present, and the estimation of time-varying effects may result in non-smooth regression functions. A typical approach for avoiding this problem and producing smooth functions is based on kernel methods. The traditional kernel-based approach, however, assumes a common degree of smoothness for all time-varying regression functions, which may result in suboptimal estimators if the functions have different levels of smoothness. In this paper, we extend the traditional approach by introducing different bandwidths for different regression functions. First, we establish the asymptotic properties of the suggested estimators. Next, we demonstrate the superiority of our proposed method using two finite-sample simulation studies. Finally, we illustrate our methodology by analyzing a real-world heart disease dataset.
|
26
|
Abstract
Decision making can be a complex process requiring the integration of several attributes of choice options. Understanding the neural processes underlying (uncertain) investment decisions is an important topic in neuroeconomics. We analyzed functional magnetic resonance imaging (fMRI) data from an investment decision study for stimulus-related effects. We propose a new technique for identifying activated brain regions: the Cluster, Estimation, Activation, and Decision method. Our analysis focuses on clusters of voxels rather than individual voxels, achieving a higher signal-to-noise ratio within the unit tested and a smaller number of hypothesis tests compared with the often-used general linear model (GLM). We propose to first conduct the brain parcellation by applying spatially constrained spectral clustering. The information within each cluster can then be extracted by the flexible dynamic semiparametric factor model (DSFM) dimension-reduction technique and finally be tested for differences in activation between conditions. This sequence of Cluster, Estimation, Activation, and Decision admits a model-free analysis of the local fMRI signal. Applying a GLM to the DSFM-based time series resulted in a significant correlation between the risk of choice options and changes in fMRI signal in the anterior insula and dorsomedial prefrontal cortex. Additionally, individual differences in decision-related reactions within the DSFM time series predicted individual differences in risk attitudes as modeled within the framework of the mean-variance model.
|
27
|
Statistical Approaches for the Study of Cognitive and Brain Aging. Front Aging Neurosci 2016; 8:176. [PMID: 27486400 PMCID: PMC4949247 DOI: 10.3389/fnagi.2016.00176] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 03/07/2016] [Accepted: 07/04/2016] [Indexed: 01/12/2023]
Abstract
Neuroimaging studies of cognitive and brain aging often yield massive datasets that create many analytic and statistical challenges. In this paper, we discuss and address several limitations in the existing work. (1) Linear models are often used to model the age effects on neuroimaging markers, which may be inadequate in capturing the potential nonlinear age effects. (2) Marginal correlations are often used in brain network analysis, which are not efficient in characterizing a complex brain network. (3) Due to the challenge of high-dimensionality, only a small subset of the regional neuroimaging markers is considered in a prediction model, which could miss important regional markers. To overcome those obstacles, we introduce several advanced statistical methods for analyzing data from cognitive and brain aging studies. Specifically, we introduce semiparametric models for modeling age effects, graphical models for brain network analysis, and penalized regression methods for selecting the most important markers in predicting cognitive outcomes. We illustrate these methods using the healthy aging data from the Active Brain Study.
|
28
|
A novel targeted learning method for quantitative trait loci mapping. Genetics 2014; 198:1369-76. [PMID: 25258376 PMCID: PMC4256757 DOI: 10.1534/genetics.114.168955] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/27/2014] [Accepted: 09/13/2014] [Indexed: 11/18/2022]
Abstract
We present a novel semiparametric method for quantitative trait loci (QTL) mapping in experimental crosses. Conventional genetic mapping methods typically assume parametric models with Gaussian errors and obtain parameter estimates through maximum-likelihood estimation. In contrast with univariate regression and interval-mapping methods, our model requires fewer assumptions and also accommodates various machine-learning algorithms. Estimation is performed with targeted maximum-likelihood learning methods. We demonstrate our semiparametric targeted learning approach in a simulation study and a well-studied barley data set.
|
29
|
Calibrated Precision Matrix Estimation for High-Dimensional Elliptical Distributions. IEEE TRANSACTIONS ON INFORMATION THEORY 2014; 60:7874-7887. [PMID: 25632164 PMCID: PMC4306585 DOI: 10.1109/tit.2014.2360980] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 06/04/2023]
Abstract
We propose a semiparametric method for estimating the precision matrix of high-dimensional elliptical distributions. Unlike most existing methods, our method naturally handles heavy-tailedness and conducts parameter estimation under a calibration framework, thus achieving improved theoretical rates of convergence and better finite-sample performance in heavy-tail applications. We further demonstrate the performance of the proposed method through thorough numerical experiments.
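A standard rank-based ingredient in this line of work: for elliptical distributions the Pearson correlation relates to Kendall's tau by R = sin(πτ/2), so a heavy-tail-robust correlation matrix can be assembled from pairwise taus and plugged into a sparse precision-matrix estimator. A sketch of the pairwise step (the paper's calibrated procedure is more involved):

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    """Kendall's tau-a from pairwise concordance counts (O(n^2), for clarity)."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            s += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return 2 * s / (n * (n - 1))

def elliptical_corr(x, y):
    """Robust correlation estimate for elliptical data: sin(pi * tau / 2)."""
    return math.sin(math.pi * kendall_tau(x, y) / 2.0)
```

Because tau depends only on ranks, outliers from heavy tails cannot distort the estimate the way they distort the sample Pearson correlation.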
|
30
|
NONLINEAR PREDICTIVE LATENT PROCESS MODELS FOR INTEGRATING SPATIO-TEMPORAL EXPOSURE DATA FROM MULTIPLE SOURCES. Ann Appl Stat 2014; 8:1538-1560. [PMID: 29861821 PMCID: PMC5983907 DOI: 10.1214/14-aoas737] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
Abstract
Spatio-temporal prediction of levels of an environmental exposure is an important problem in environmental epidemiology. Our work is motivated by multiple studies on the spatio-temporal distribution of mobile source, or traffic related, particles in the greater Boston area. When multiple sources of exposure information are available, a joint model that pools information across sources maximizes data coverage over both space and time, thereby reducing the prediction error. We consider a Bayesian hierarchical framework in which a joint model consists of a set of submodels, one for each data source, and a model for the latent process that serves to relate the submodels to one another. If a submodel depends on the latent process nonlinearly, inference using standard MCMC techniques can be computationally prohibitive. The implications are particularly severe when the data for each submodel are aggregated at different temporal scales. To make such problems tractable, we linearize the nonlinear components with respect to the latent process and induce sparsity in the covariance matrix of the latent process using compactly supported covariance functions. We propose an efficient MCMC scheme that takes advantage of these approximations. We use our model to address a temporal change of support problem whereby interest focuses on pooling daily and multiday black carbon readings in order to maximize the spatial coverage of the study region.
|
31
|
Inferences on relative failure rates in stratified mark-specific proportional hazards models with missing marks, with application to HIV vaccine efficacy trials. J R Stat Soc Ser C Appl Stat 2014; 64:49-73. [PMID: 25641990 DOI: 10.1111/rssc.12067] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/29/2022]
Abstract
This article develops hypothesis testing procedures for the stratified mark-specific proportional hazards model in the presence of missing marks. The motivating application is preventive HIV vaccine efficacy trials, where the mark is the genetic distance of an infecting HIV sequence to an HIV sequence represented inside the vaccine. The test statistics are constructed based on two-stage efficient estimators, which utilize auxiliary predictors of the missing marks. The asymptotic properties and finite-sample performances of the testing procedures are investigated, demonstrating double-robustness and effectiveness of the predictive auxiliaries to recover efficiency. The methods are applied to the RV144 vaccine trial.
|
32
|
Optimal auxiliary-covariate-based two-phase sampling design for semiparametric efficient estimation of a mean or mean difference, with application to clinical trials. Stat Med 2013; 33:901-17. [PMID: 24123289 DOI: 10.1002/sim.6006] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/22/2012] [Revised: 08/14/2013] [Accepted: 09/19/2013] [Indexed: 11/10/2022]
Abstract
To address the objective in a clinical trial of estimating the mean or mean difference of an expensive endpoint Y, one approach employs a two-phase sampling design, wherein inexpensive auxiliary variables W predictive of Y are measured in everyone, Y is measured in a random sample, and the semiparametric efficient estimator is applied. This approach is made efficient by specifying the phase-two selection probabilities as optimal functions of the auxiliary variables and measurement costs. While this approach is familiar to survey samplers, it apparently has seldom been used in clinical trials, and we develop several novel results practicable for clinical trials. We perform simulations to identify settings where the optimal approach significantly improves efficiency compared to approaches in current practice, and we provide proofs and R code. The optimality results are developed to design an HIV vaccine trial whose objective is to compare the mean 'importance-weighted' breadth (Y) of the T-cell response between randomized vaccine groups. The trial collects an auxiliary response (W) highly predictive of Y and measures Y in the optimal subset. We show that the optimal design-estimation approach can confer anywhere between absent and large efficiency gains (up to 24% in the examples) compared to the approach with the same efficient estimator but simple random sampling, where greater variability in the cost-standardized conditional variance of Y given W yields greater efficiency gains. Accurate estimation of E[Y | W] is important for realizing the efficiency gain, which is aided by an ample phase-two sample and by using a robust fitting method.
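The flavor of the optimal selection probabilities can be illustrated with the classical Neyman-type allocation from survey sampling: select each subject with probability proportional to SD(Y | W) divided by the square root of the measurement cost, scaled to an expected budget. This is a sketch of the classical rule under assumed inputs, not the paper's exact optimality result.

```python
import numpy as np

def optimal_phase2_probs(sd_given_w, cost, budget):
    """Neyman-type phase-two selection probabilities.

    sd_given_w: per-subject estimate of SD(Y | W)
    cost:       per-subject cost of measuring Y
    budget:     total expected budget for phase-two measurements
    """
    sd_given_w = np.asarray(sd_given_w, dtype=float)
    cost = np.asarray(cost, dtype=float)
    raw = sd_given_w / np.sqrt(cost)        # unnormalized probabilities
    scale = budget / np.sum(raw * cost)     # match expected cost to budget
    return np.clip(scale * raw, 0.0, 1.0)   # valid probabilities in [0, 1]
```

Subjects whose Y is most variable given W (and cheapest to measure) are sampled most heavily, which is exactly where a measurement buys the most variance reduction.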
|
33
|
Abstract
We develop asymptotic theory for weighted likelihood estimators (WLEs) under two-phase stratified sampling without replacement, and we consider several variants of WLEs involving estimated weights and calibration. A set of empirical process tools is developed, including a Glivenko-Cantelli theorem, a theorem on rates of convergence of M-estimators, and a Donsker theorem for inverse probability weighted empirical processes under two-phase sampling with sampling without replacement at the second phase. Using these general results, we derive asymptotic distributions of the WLE of a finite-dimensional parameter in a general semiparametric model in which a nuisance parameter is estimable at either regular or nonregular rates. We illustrate these results and methods in the Cox model with right censoring and with interval censoring. We compare the methods via their asymptotic variances under both sampling without replacement and the more usual (and easier to analyze) assumption of Bernoulli sampling at the second phase.
|
34
|
Abstract
We study a class of semiparametric skewed distributions that arise when the sample selection process produces non-randomly sampled observations. Based on semiparametric theory, and taking into account the symmetric nature of the population distribution, we propose both consistent estimators, that is, estimators robust to model misspecification, and efficient estimators, that is, estimators attaining the minimum possible estimation variance, of the location of the symmetric population. We demonstrate the theoretical properties of our estimators through asymptotic analysis and assess their finite-sample performance through simulations. We also apply our methodology to a real data example of ambulatory expenditures to illustrate the use of the estimators in practice.
|
35
|
Abstract
There is an active debate in the literature on censored data about the relative performance of model-based maximum likelihood estimators, IPCW estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates, where one wishes to estimate the mean of an outcome that is subject to missingness. Responses by Robins et al. (2007), Tsiatis and Davidian (2007), Tan (2007), and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs whose parametric submodel respects the global bounds on the continuous outcomes are especially suitable for dealing with positivity violations: in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
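The double robust estimator at the center of this debate, the augmented IPW (AIPW) estimator of a mean under missingness, has a simple closed form; a minimal sketch (TMLE itself, the article's contribution, is a different targeted-fluctuation construction and is not shown):

```python
import numpy as np

def aipw_mean(y, r, pi_hat, m_hat):
    """Augmented IPW (double robust) estimate of E[Y] with missing outcomes.

    y:      outcome, arbitrary where r == 0
    r:      1 if Y is observed, 0 if missing
    pi_hat: estimated P(R = 1 | W)  (missingness model)
    m_hat:  estimated E[Y | W]      (outcome regression)

    Consistent if either pi_hat or m_hat is correctly specified.
    """
    y, r, pi_hat, m_hat = map(np.asarray, (y, r, pi_hat, m_hat))
    y = np.where(r == 1, y, 0.0)  # guard against NaN in missing slots
    return np.mean(r * y / pi_hat - (r - pi_hat) / pi_hat * m_hat)
```

The instability discussed by Kang and Schafer arises exactly here: near-positivity violations make `pi_hat` tiny for some subjects, so the weights `r / pi_hat` explode, and the estimate can leave the range of the observed outcomes, which is what the substitution property of TMLEs rules out.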
|
36
|
Abstract
An objective of randomized placebo-controlled preventive HIV vaccine efficacy trials is to assess the relationship between the vaccine effect to prevent infection and the genetic distance of the exposing HIV to the HIV strain represented in the vaccine construct. Motivated by this objective, recently a mark-specific proportional hazards model with a continuum of competing risks has been studied, where the genetic distance of the transmitting strain is the continuous `mark' defined and observable only in failures. A high percentage of genetic marks of interest may be missing for a variety of reasons, predominantly due to rapid evolution of HIV sequences after transmission before a blood sample is drawn from which HIV sequences are measured. This research investigates the stratified mark-specific proportional hazards model with missing marks where the baseline functions may vary with strata. We develop two consistent estimation approaches, the first based on the inverse probability weighted complete-case (IPW) technique, and the second based on augmenting the IPW estimator by incorporating auxiliary information predictive of the mark. We investigate the asymptotic properties and finite-sample performance of the two estimators, and show that the augmented IPW estimator, which satisfies a double robustness property, is more efficient.
|
37
|
Abstract
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over the standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data-generating distribution in a semiparametric model introduced in van der Laan and Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data in order to make a bias/variance tradeoff targeted toward the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. The TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified. In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semiparametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q, coupled with a succession of increasingly nonparametric estimates for g. In a departure from current state-of-the-art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLEs of Q(0) in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline identifiable.
We present theoretical results for "collaborative double robustness," demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both misspecified, provided that g solves a specified score equation implied by the difference between Q and the true Q(0). This marks an improvement over the current definition of double robustness in the estimating-equation literature. We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth and, as a consequence, can even be super-efficient if the first-stage density estimator does an excellent job with respect to the target parameter. This research provides a template for targeted, efficient, and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite-dimensional) semiparametric models, while still providing statistical inference in terms of confidence intervals and p-values. It also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
|
38
|
Estimating a scale-change effect for time-varying phenotypes in genome-wide association studies. JOURNAL OF APPLIED STATISTICAL SCIENCE 2010; 18:477-493. [PMID: 28255222 PMCID: PMC5330672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
The Cox proportional hazards model has been widely used in genome-wide association (GWA) studies of censored time-varying phenotypes to investigate disease associations expressed through relative hazards. In this paper, we instead apply the so-called accelerated hazards model to explore a novel time scale-change genotypic association, which the traditional Cox model cannot necessarily identify. Our application is motivated and illustrated by a GWA study of a hematopoietic stem cell transplantation cohort.
|
39
|
Analysis of Two-sample Censored Data Using a Semiparametric Mixture Model. ACTA MATHEMATICA SINICA, ENGLISH SERIES 2009; 25:389-398. [PMID: 20622987 PMCID: PMC2901133 DOI: 10.1007/s10255-008-8804-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/29/2023]
Abstract
In this article we study a semiparametric mixture model for the two-sample problem with right censored data. The model implies that the densities for the continuous outcomes are related by a parametric tilt but otherwise unspecified. It provides a useful alternative to the Cox (1972) proportional hazards model for the comparison of treatments based on right censored survival data. We propose an iterative algorithm for the semiparametric maximum likelihood estimates of the parametric and nonparametric components of the model. The performance of the proposed method is studied using simulation. We illustrate our method in an application to melanoma.
|
40
|
Abstract
For time-to-event data with finitely many competing risks, the proportional hazards model has been a popular tool for relating the cause-specific outcomes to covariates [Prentice et al., Biometrics 34 (1978) 541-554]. This article studies an extension of this approach to allow a continuum of competing risks, in which the cause of failure is replaced by a continuous mark only observed at the failure time. We develop inference for the proportional hazards model in which the regression parameters depend nonparametrically on the mark and the baseline hazard depends nonparametrically on both time and mark. This work is motivated by the need to assess HIV vaccine efficacy, while taking into account the genetic divergence of infecting HIV viruses in trial participants from the HIV strain that is contained in the vaccine, and adjusting for covariate effects. Mark-specific vaccine efficacy is expressed in terms of one of the regression functions in the mark-specific proportional hazards model. The new approach is evaluated in simulations and applied to the first HIV vaccine efficacy trial.
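In symbols, the model described here posits that the hazard for failure at time t with mark v, given covariates z, factors as (a sketch consistent with the abstract; the notation is assumed, not quoted from the paper):

```latex
\lambda(t, v \mid z) \;=\; \lambda_{0}(t, v)\,\exp\{\beta(v)^{\top} z\},
```

where $\lambda_{0}(t,v)$ is unspecified in both time and mark and $\beta(\cdot)$ varies nonparametrically with the mark; with a vaccine-assignment indicator among the covariates, mark-specific vaccine efficacy then takes the form $\mathrm{VE}(v) = 1 - \exp\{\beta_{\text{vaccine}}(v)\}$.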
|
41
|
Abstract
Gene copy number changes are common characteristics of many genetic disorders. A new technology, array comparative genomic hybridization (a-CGH), is now widely used to screen for gains and losses in cancers and other genetic diseases at high resolution, genome-wide or for specific chromosomal regions. Statistical methods for analyzing such a-CGH data have been developed, but most existing methods are designed for unrelated individuals, and their results explain only horizontal variation in copy number changes. It is therefore worthwhile to develop a statistical method for family data that can also investigate vertical kinship effects. Here we consider a semiparametric model based on a clustering method, in which the marginal distributions are estimated nonparametrically and the familial dependence structure is modeled by a copula. The model is illustrated and evaluated using simulated data. Our results show that the proposed method is more robust than the commonly used multivariate normal model. Finally, we demonstrate the utility of our method using a real dataset.
|