26. Gierz K, Park K. Detection of multiple change points in a Weibull accelerated failure time model using sequential testing. Biom J 2021; 64:617-634. [PMID: 34873728] [DOI: 10.1002/bimj.202000262]
Abstract
With improvements in cancer diagnosis and treatment, incidence and mortality rates have changed. However, the most commonly used analysis methods do not account for such distributional changes. In survival analysis, change point problems can concern a shift in a distribution for a set of time-ordered observations, potentially under censoring or truncation. We propose a sequential testing approach for detecting multiple change points in the Weibull accelerated failure time model, since this distribution is sufficiently flexible to accommodate increasing, decreasing, or constant hazard rates and is also the only continuous distribution for which the accelerated failure time model can be reparameterized as a proportional hazards model. Our sequential testing procedure does not require the number of change points to be known; this information is instead inferred from the data. We conduct a simulation study to show that the method accurately detects change points and estimates the model. The numerical results, along with real data applications, demonstrate that our proposed method can detect change points in the hazard rate.
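The Weibull accelerated failure time model underlying this approach can be fit directly with the survival package in R; the sketch below is only the base model on simulated data (the paper's change-point detection and sequential testing machinery is not shown, and all variable names are illustrative).

```r
library(survival)

# Simulate Weibull survival times with right censoring (illustrative data)
set.seed(1)
n <- 300
x <- rnorm(n)
t_true <- rweibull(n, shape = 1.5, scale = exp(1 + 0.5 * x))
c_time <- rexp(n, rate = 0.1)
time   <- pmin(t_true, c_time)
status <- as.numeric(t_true <= c_time)

# Weibull accelerated failure time model: log(T) = beta0 + beta1 * x + sigma * W
fit <- survreg(Surv(time, status) ~ x, dist = "weibull")
summary(fit)

# For the Weibull, the AFT model can be rewritten as a proportional hazards
# model: the log hazard ratio for x is -coef(fit)["x"] / fit$scale.
log_hr_x <- -coef(fit)["x"] / fit$scale
exp(log_hr_x)   # hazard ratio implied by the fitted AFT model
```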
27. Bertrand F, Maumy-Bertrand M. Fitting and cross-validating Cox models to censored big data with missing values using extensions of partial least squares regression models. Front Big Data 2021; 4:684794. [PMID: 34790895] [PMCID: PMC8591675] [DOI: 10.3389/fdata.2021.684794]
Abstract
Fitting Cox models in a big data context (data whose volume, intensity, and complexity exceed the capacity of usual analytic tools) is often challenging, and even more so when some data are missing. We previously proposed algorithms that fit Cox models in high-dimensional settings using extensions of partial least squares (PLS) regression to the Cox model, some of which can cope with missing data. We recently extended our most recent algorithms to big data, which allows us to fit Cox models to big data with missing values. When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial log-likelihood, computed with either a naive or a van Houwelingen scheme (the latter makes efficient use of the death times of the left-out data in relation to the death times of all the data). Quite astonishingly, we show, using an extensive simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, whether straightforward or more involved, of PLS regression to the Cox model. This result is interesting for at least two reasons. First, several attractive features of PLS-based models (regularization, interpretability of the components, missing-data support, data visualization through biplots of individuals and variables, and even parsimony or group parsimony for sparse PLS or sparse group PLS based models) explain their common use by statisticians, who usually select hyperparameters by cross-validation. Second, these extensions are almost always featured in benchmarking studies that assess new estimation techniques in high-dimensional or big data contexts, where they often show poor statistical properties. We carried out a vast simulation study to evaluate more than a dozen potential cross-validation criteria, based either on the AUC or on prediction error. Several of them lead to the selection of a reasonable number of components. Using these newly identified cross-validation criteria to fit extensions of PLS regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performance of these techniques. In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted. The R package used in this article is available on CRAN, http://cran.r-project.org/web/packages/plsRcox/index.html. The R package bigPLS will soon be available on CRAN and, until then, is available on GitHub, https://github.com/fbertran/bigPLS.
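For reference, a minimal sketch of the van Houwelingen cross-validated partial log-likelihood for a plain Cox model is given below; it relies on the documented trick of calling coxph with iter.max = 0 to evaluate the partial likelihood at fixed coefficients. The PLS extensions and the AUC- or prediction-error-based criteria studied in the paper are not reproduced, and the function and data names are illustrative.

```r
library(survival)

# van Houwelingen cross-validated partial log-likelihood:
# cvpl_k = l(all data; beta_{-k}) - l(data without fold k; beta_{-k}),
# where beta_{-k} is estimated with fold k left out.
cvpl_vh <- function(formula, data, K = 5) {
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  sum(sapply(seq_len(K), function(k) {
    train <- data[folds != k, ]
    beta  <- coef(coxph(formula, data = train))
    # Evaluate the partial log-likelihood at the fixed beta (no iterations)
    ll_at <- function(d) {
      fit <- coxph(formula, data = d, init = beta,
                   control = coxph.control(iter.max = 0))
      fit$loglik[2]
    }
    ll_at(data) - ll_at(train)
  }))
}

# Example with the built-in lung data
cvpl_vh(Surv(time, status) ~ age + sex,
        data = na.omit(lung[, c("time", "status", "age", "sex")]))
```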
28. Li Y, Liang M, Mao L, Wang S. Robust estimation and variable selection for the accelerated failure time model. Stat Med 2021; 40:4473-4491. [PMID: 34031919] [PMCID: PMC8364878] [DOI: 10.1002/sim.9042]
Abstract
This article concerns robust modeling of the survival time for cancer patients. Accurate prediction of patient survival time is crucial to the development of effective therapeutic strategies. To this end, we propose a unified Expectation-Maximization approach combined with the L1-norm penalty to perform variable selection and parameter estimation simultaneously in the accelerated failure time model with right-censored survival data of moderate size. Our approach accommodates general loss functions and reduces to the well-known Buckley-James method when the squared-error loss is used without regularization. To mitigate the effects of outliers and heavy-tailed noise in real applications, we recommend the use of robust loss functions under the general framework. Furthermore, our approach can be extended to incorporate group structure among covariates. We conduct extensive simulation studies to assess the performance of the proposed methods with different loss functions and apply them to an ovarian carcinoma study as an illustration.
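The squared-error special case can be illustrated as a single Buckley-James imputation step followed by an L1-penalized fit. This is only a rough sketch of the idea, not the paper's EM algorithm with general robust losses and group penalties; it makes simplifying choices (one imputation step, no correction when the largest residual is censored), and all names are illustrative.

```r
library(survival)
library(glmnet)

# One Buckley-James-type imputation step followed by a lasso fit on the
# imputed log survival times (squared-error loss; no correction for a
# censored largest residual).
bj_lasso_step <- function(x, time, status, beta) {
  y <- log(time)
  e <- y - as.vector(x %*% beta)                 # residuals at the current beta
  shift <- min(e) - 1                            # keep "times" positive for survfit
  km <- survfit(Surv(e - shift, status) ~ 1)     # Kaplan-Meier of the residuals
  km_t <- km$time + shift
  S <- stepfun(km_t, c(1, km$surv))              # residual survival function
  jump <- c(1, head(km$surv, -1)) - km$surv      # KM jump sizes (mass at events)
  y_imp <- y
  for (i in which(status == 0)) {                # impute E[log T | log T > observed]
    p_tail <- S(e[i])
    if (p_tail > 0) {
      idx <- km_t > e[i]
      y_imp[i] <- as.vector(x[i, , drop = FALSE] %*% beta) +
        sum(km_t[idx] * jump[idx]) / p_tail
    }
  }
  glmnet(x, y_imp, alpha = 1)                    # lasso on the completed responses
}
```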
29. Yi GY, He W, Carroll RJ. Feature screening with large-scale and high-dimensional survival data. Biometrics 2021; 78:894-907. [PMID: 33881782] [DOI: 10.1111/biom.13479]
Abstract
Data of huge size present great challenges for modeling, inference, and computation. In handling big data, much attention has been directed to settings with "large p, small n", where p represents the number of variables and n the sample size; relatively less work has addressed problems in which p and n are both large, although such data have now become more accessible than before. The big volume of data does not automatically ensure good quality of inferences, because a large number of unimportant variables may be collected in the process of gathering informative variables. To carry out valid statistical analysis, it is imperative to screen out noisy variables that have no predictive value for explaining the outcome variable. In this paper, we develop a screening method for handling large-sized survival data, where the sample size n is large and the dimension p of the covariates is of non-polynomial order of n (the so-called NP-dimension). We rigorously establish theoretical results for the proposed method and conduct numerical studies to assess its performance. Our research offers multiple extensions of existing work and enlarges the scope of high-dimensional data analysis. The proposed method capitalizes on the connections among useful regression settings and offers a computationally efficient screening procedure. It can be applied to a range of large-scale data settings, including genomic data.
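A generic marginal screening step of the kind this literature builds on can be sketched as follows. This is not the authors' procedure, just per-covariate Cox fits ranked by the absolute z-statistic, with illustrative names and simulated data.

```r
library(survival)

# Marginal screening: fit one univariate Cox model per covariate and keep
# the d covariates with the largest absolute z-statistics.
marginal_cox_screen <- function(x, time, status, d = 20) {
  z <- apply(x, 2, function(xj) {
    fit <- coxph(Surv(time, status) ~ xj)
    summary(fit)$coefficients[1, "z"]
  })
  order(abs(z), decreasing = TRUE)[seq_len(min(d, ncol(x)))]
}

# Example: screen 200 noise covariates plus 2 informative ones
set.seed(2)
n <- 400; p <- 202
x <- matrix(rnorm(n * p), n, p)
haz <- exp(0.8 * x[, 1] - 0.8 * x[, 2])
t_true <- rexp(n, rate = haz)
c_time <- rexp(n, rate = 0.2)
keep <- marginal_cox_screen(x, pmin(t_true, c_time), as.numeric(t_true <= c_time))
keep
```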
30. Influence of noise-limited censored path loss on model fitting and path loss-based positioning. Sensors 2021; 21:s21030987. [PMID: 33540651] [PMCID: PMC7867288] [DOI: 10.3390/s21030987]
Abstract
Positioning is considered one of the key features in various novel industry verticals in future radio systems. Since path loss (PL) or received-signal-strength measurements are widely available in the majority of wireless standards, PL-based positioning has an important role among positioning technologies. Conventionally, PL-based positioning has two phases: fitting a PL model to training data, and positioning based on the link distance estimates. In both phases, however, the maximum measurable PL is limited by measurement noise. Such immeasurable samples are called censored PL data, and such noisy data are commonly neglected in both the model fitting and the positioning phase. For censored PL, the loss is known to exceed a known threshold level, and that information can be used in model fitting and in the positioning phase. In this paper, we examine and propose how to use censored PL data in PL model-based positioning. Additionally, we demonstrate with several simulations the potential of the proposed approach for considerable improvements in positioning accuracy (23–57%) and improved robustness against PL model fitting errors.
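The censored model-fitting phase can be sketched as a Tobit-style maximum-likelihood fit of the usual log-distance path loss model, where samples above a maximum measurable loss contribute only through a tail probability. This is a minimal sketch with illustrative parameter names; the positioning phase described in the paper is not shown.

```r
# Log-distance path loss model: PL(d) = PL0 + 10 * n * log10(d / d0) + noise,
# with noise ~ N(0, sigma^2) and measurements censored above PL_max.
fit_censored_pl <- function(d, pl_obs, censored, pl_max, d0 = 1) {
  negll <- function(par) {
    pl0 <- par[1]; n_exp <- par[2]; sigma <- exp(par[3])
    mu <- pl0 + 10 * n_exp * log10(d / d0)
    -sum(ifelse(censored,
                pnorm(pl_max, mean = mu, sd = sigma,
                      lower.tail = FALSE, log.p = TRUE),   # P(PL > PL_max)
                dnorm(pl_obs, mean = mu, sd = sigma, log = TRUE)))
  }
  optim(c(40, 2, log(6)), negll)
}

# Simulated example with a 110 dB measurement limit
set.seed(3)
d  <- runif(500, 10, 500)
pl <- 40 + 10 * 3 * log10(d) + rnorm(500, sd = 6)
cen <- pl > 110
fit <- fit_censored_pl(d, ifelse(cen, NA, pl), cen, pl_max = 110)
fit$par[1:2]        # estimated PL0 and path loss exponent
exp(fit$par[3])     # estimated shadowing standard deviation
```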
31. Olivari RC, Garay AM, Lachos VH, Matos LA. Mixed-effects models for censored data with autoregressive errors. J Biopharm Stat 2020; 31:273-294. [PMID: 33315523] [DOI: 10.1080/10543406.2020.1852246]
Abstract
Mixed-effects models, with modifications to accommodate censored observations (LMEC/NLMEC), are routinely used to analyze measurements that are collected irregularly over time and are often subject to upper and lower detection limits. This paper presents a likelihood-based approach for fitting LMEC/NLMEC models whose error terms follow an autoregressive dependence structure of order p (AR(p)). An EM-type algorithm is developed to compute the maximum likelihood estimates, yielding as byproducts the standard errors of the fixed effects and the likelihood value. Moreover, the constraints on the parameter space that arise from the stationarity conditions on the autoregressive parameters are handled within the EM algorithm by a reparameterization scheme, as discussed in Lin and Lee (2007). To examine the performance of the proposed method, we present simulation studies and analyze a real AIDS case study. The proposed algorithm and methods are implemented in the new R package ARpLMEC.
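A stationarity-preserving reparameterization of the kind referred to above can be sketched by mapping unconstrained real parameters to partial autocorrelations in (-1, 1) and then to AR(p) coefficients via the Durbin-Levinson recursion. This is only an assumed sketch of that generic device, not the ARpLMEC implementation or its EM algorithm.

```r
# Map unconstrained parameters theta in R^p to stationary AR(p) coefficients:
# 1) partial autocorrelations r_k = tanh(theta_k) lie in (-1, 1);
# 2) the Durbin-Levinson recursion turns them into AR coefficients.
unconstrained_to_ar <- function(theta) {
  r   <- tanh(theta)
  phi <- numeric(0)
  for (k in seq_along(r)) {
    phi <- c(phi - r[k] * rev(phi), r[k])
  }
  phi
}

# Example: any theta gives a stationary AR(2) process
phi <- unconstrained_to_ar(c(1.2, -0.7))
all(Mod(polyroot(c(1, -phi))) > 1)   # roots of 1 - phi1*z - phi2*z^2 lie outside the unit circle
```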
32. Simoneau G, Moodie EEM, Nijjar JS, Platt RW. Finite sample variance estimation for optimal dynamic treatment regimes of survival outcomes. Stat Med 2020; 39:4466-4479. [PMID: 32929753] [DOI: 10.1002/sim.8735]
Abstract
Deriving valid confidence intervals for complex estimators is a challenging task in practice. Estimators of dynamic weighted survival modeling (DWSurv), a method to estimate an optimal dynamic treatment regime of censored outcomes, are asymptotically normal and consistent for their target parameters when at least a subset of the nuisance models is correctly specified. However, their behavior in finite samples and the impact of model misspecification on inferences remain unclear. In addition, the estimators' nonregularity may negatively affect the inferences under some specific data generating mechanisms. Our objective was to compare five methods for constructing confidence intervals for the DWSurv parameters in finite samples: two asymptotic variance formulas (with and without adjustment for the estimation of the nuisance parameters) and three bootstrap approaches. Via simulations, we considered practical scenarios, for example, when some nuisance models are misspecified or when nonregularity is problematic. We also compared the five methods in an application to the treatment of rheumatoid arthritis. We found that the bootstrap approaches performed consistently well at the cost of longer computational times. The asymptotic variance with adjustments generally yielded conservative confidence intervals. The asymptotic variance without adjustments yielded nominal coverages for large sample sizes. We recommend using the asymptotic variance with adjustments in small samples and the bootstrap if computationally feasible. Caution should be taken when nonregularity may be an issue.
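For context, a plain nonparametric bootstrap percentile interval of the kind being compared can be sketched generically as below, re-estimating an arbitrary user-supplied estimator on resampled subjects. It does not reproduce the DWSurv estimator itself; the example simply bootstraps Cox log hazard ratios on a built-in dataset.

```r
# Nonparametric bootstrap percentile confidence interval for a generic
# estimator; `estimator` takes a data frame and returns a numeric vector.
boot_percentile_ci <- function(data, estimator, B = 1000, level = 0.95) {
  est <- replicate(B, estimator(data[sample(nrow(data), replace = TRUE), , drop = FALSE]))
  alpha <- (1 - level) / 2
  apply(rbind(est), 1, quantile, probs = c(alpha, 1 - alpha))
}

# Example: percentile CIs for Cox log hazard ratios on the built-in lung data
library(survival)
lung2 <- na.omit(lung[, c("time", "status", "age", "sex")])
boot_percentile_ci(lung2,
                   function(d) coef(coxph(Surv(time, status) ~ age + sex, data = d)),
                   B = 200)
```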
33. Loumponias K, Tsaklidis G. Kalman filtering with censored measurements. J Appl Stat 2020; 49:317-335. [PMID: 35707209] [PMCID: PMC9196092] [DOI: 10.1080/02664763.2020.1810645]
Abstract
This paper concerns Kalman filtering when the measurements of the process are censored. The censored measurements are addressed by the Tobit model of Type I and are one-dimensional with two censoring limits, while the (hidden) state vectors are multidimensional. For this model, Bayesian estimates for the state vectors are provided through a recursive algorithm of Kalman filtering type. Experiments are presented to illustrate the effectiveness and applicability of the algorithm. The experiments show that the proposed method outperforms other filtering methodologies in minimizing the computational cost as well as the overall Root Mean Square Error (RMSE) for synthetic and real data sets.
34. Arfè A, Alexander B, Trippa L. Optimality of testing procedures for survival data in the nonproportional hazards setting. Biometrics 2020; 77:587-598. [PMID: 32535892] [DOI: 10.1111/biom.13315]
Abstract
Most statistical tests for treatment effects used in randomized clinical trials with survival outcomes are based on the proportional hazards assumption, which often fails in practice. Data from early exploratory studies may provide evidence of nonproportional hazards, which can guide the choice of alternative tests in the design of practice-changing confirmatory trials. We developed a test to detect treatment effects in a late-stage trial, which accounts for the deviations from proportional hazards suggested by early-stage data. Conditional on early-stage data, among all tests that control the frequentist Type I error rate at a fixed α level, our testing procedure maximizes the Bayesian predictive probability that the study will demonstrate the efficacy of the experimental treatment. Hence, the proposed test provides a useful benchmark for other tests commonly used in the presence of nonproportional hazards, for example, weighted log-rank tests. We illustrate this approach in simulations based on data from a published cancer immunotherapy phase III trial.
35. Simoneau G, Moodie EEM, Azoulay L, Platt RW. Adaptive treatment strategies with survival outcomes: an application to the treatment of type 2 diabetes using a large observational database. Am J Epidemiol 2020; 189:461-469. [PMID: 31903490] [DOI: 10.1093/aje/kwz272]
Abstract
Sequences of treatments that adapt to a patient's changing condition over time are often needed for the management of chronic diseases. An adaptive treatment strategy (ATS) consists of personalized treatment rules to be applied through the course of a disease that input the patient's characteristics at the time of decision-making and output a recommended treatment. An optimal ATS is the sequence of tailored treatments that yields the best clinical outcome for patients sharing similar characteristics. Methods for estimating optimal adaptive treatment strategies, which must disentangle short- and long-term treatment effects, can be theoretically involved and hard to explain to clinicians, especially when the outcome to be optimized is a survival time subject to right-censoring. In this paper, we describe dynamic weighted survival modeling, a method for estimating an optimal ATS with survival outcomes. Using data from the Clinical Practice Research Datalink, a large primary-care database, we illustrate how it can answer an important clinical question about the treatment of type 2 diabetes. We identify an ATS pertaining to which drug add-ons to recommend when metformin in monotherapy does not achieve the therapeutic goals.
36. Zhao YQ, Zhu R, Chen G, Zheng Y. Constructing dynamic treatment regimes with shared parameters for censored data. Stat Med 2020; 39:1250-1263. [PMID: 31951041] [PMCID: PMC7305816] [DOI: 10.1002/sim.8473]
Abstract
Dynamic treatment regimes are sequential decision rules that adapt throughout disease progression according to a patient's evolving characteristics. In many clinical applications, it is desirable that the format of the decision rules remains consistent over time. Unlike the estimation of dynamic treatment regimes in regular settings, where decision rules are formed without shared parameters, the derivation of the shared decision rules requires estimating shared parameters indexing the decision rules across different decision points. Estimation of such rules becomes more complicated when the clinical outcome of interest is a survival time subject to censoring. To address these challenges, we propose two novel methods: censored shared-Q-learning and censored shared-O-learning. Both methods incorporate clinical preferences into a qualitative rule, where the parameters indexing the decision rules are shared across different decision points and estimated simultaneously. We use simulation studies to demonstrate the superior performance of the proposed methods. The methods are further applied to the Framingham Heart Study to derive treatment rules for cardiovascular disease.
37. Baharith LA, AL-Beladi KM, Klakattawi HS. The odds exponential-Pareto IV distribution: regression model and application. Entropy 2020; 22:e22050497. [PMID: 33286270] [PMCID: PMC7516982] [DOI: 10.3390/e22050497]
Abstract
This article introduces the odds exponential-Pareto IV distribution, which belongs to the odds family of distributions. We studied the statistical properties of this new distribution. The odds exponential-Pareto IV distribution provides decreasing, increasing, and upside-down hazard functions. We employed the maximum likelihood method to estimate the distribution parameters. The estimators' performance was assessed by conducting simulation studies. A new log location-scale regression model based on the odds exponential-Pareto IV distribution was also introduced. Parameter estimates of the proposed model were obtained using both maximum likelihood and jackknife methods for right-censored data. Real data sets were analyzed under the odds exponential-Pareto IV distribution and the log odds exponential-Pareto IV regression model to show their flexibility and potential.
38. Wang X, Zhong Y, Mukhopadhyay P, Schaubel DE. Computationally efficient inference for center effects based on restricted mean survival time. Stat Med 2019; 38:5133-5145. [PMID: 31502288] [DOI: 10.1002/sim.8356]
Abstract
Restricted mean survival time (RMST) has gained increased attention in biostatistical and clinical studies. Directly modeling RMST (as opposed to modeling then transforming the hazard function) is appealing computationally and in terms of interpreting covariate effects. We propose computationally convenient methods for evaluating center effects based on RMST. A multiplicative model for the RMST is assumed. Estimation proceeds through an algorithm analogous to stratification, which permits the evaluation of thousands of centers. We derive the asymptotic properties of the proposed estimators and evaluate finite sample performance through simulation. We demonstrate that considerable decreases in computational burden are achievable through the proposed methods, in terms of both storage requirements and run time. The methods are applied to evaluate more than 5000 US dialysis facilities using data from a national end-stage renal disease registry.
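The RMST itself is simply the area under the survival curve up to a truncation time tau; a minimal sketch computing it from a Kaplan-Meier fit is shown below. The multiplicative center-effect model and the stratification-like estimation algorithm of the paper are not shown.

```r
library(survival)

# Restricted mean survival time up to tau: area under the KM curve on [0, tau]
rmst_km <- function(time, status, tau) {
  fit <- survfit(Surv(time, status) ~ 1)
  t_pts <- c(0, fit$time[fit$time <= tau], tau)   # step-change points up to tau
  s_pts <- c(1, fit$surv[fit$time <= tau])        # survival on each interval
  sum(diff(t_pts) * s_pts)                        # sum of rectangle areas
}

# Example on the built-in lung data, truncated at 365 days
rmst_km(lung$time, lung$status, tau = 365)
```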
39. Wang H, Li G. Extreme learning machine Cox model for high-dimensional survival analysis. Stat Med 2019; 38:2139-2156. [PMID: 30632193] [PMCID: PMC6498851] [DOI: 10.1002/sim.8090]
Abstract
Some interesting recent studies have shown that neural network models are useful alternatives for modeling survival data when the assumptions of a classical parametric or semiparametric survival model, such as the Cox (1972) model, are seriously violated. However, to the best of our knowledge, the plausibility of adapting the emerging extreme learning machine (ELM) algorithm for single-hidden-layer feedforward neural networks to survival analysis has not been explored. In this paper, we present a kernel ELM Cox model regularized by an L0-based broken adaptive ridge (BAR) penalization method. We then demonstrate that the resulting method, referred to as ELMCoxBAR, can outperform some other state-of-the-art survival prediction methods, such as L1- or L2-regularized Cox regression, random survival forest with various splitting rules, and the boosted Cox model, in terms of predictive performance using both simulated and real-world datasets. In addition to its good predictive performance, we illustrate that the proposed method has a key computational advantage over the above competing methods in terms of computation time, using a real-world ultra-high-dimensional survival dataset.
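The core construction can be sketched as a random single-hidden-layer feature map followed by a penalized Cox fit. The sketch below substitutes a ridge penalty via glmnet for the paper's BAR penalty and kernel formulation, so it is only an ELM-style approximation with illustrative names.

```r
library(survival)
library(glmnet)

# ELM-style Cox model: random hidden layer, then a penalized Cox fit on the
# hidden features (ridge penalty here instead of the BAR penalty).
elm_cox <- function(x, time, status, n_hidden = 100) {
  w <- matrix(rnorm(ncol(x) * n_hidden), ncol(x), n_hidden)  # random input weights
  b <- rnorm(n_hidden)                                       # random biases
  h <- tanh(sweep(x %*% w, 2, b, "+"))                       # hidden-layer features
  y <- cbind(time = time, status = status)
  fit <- cv.glmnet(h, y, family = "cox", alpha = 0)
  list(w = w, b = b, fit = fit)
}

# Example on simulated nonlinear survival data
set.seed(4)
n <- 300; p <- 10
x <- matrix(rnorm(n * p), n, p)
haz <- exp(sin(x[, 1]) + x[, 2]^2 - 1)
t_true <- rexp(n, rate = haz)
c_time <- rexp(n, rate = 0.2)
model <- elm_cox(x, pmin(t_true, c_time), as.numeric(t_true <= c_time))
```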
40. Arboretti R, Bathke AC, Carrozzo E, Pesarin F, Salmaso L. Multivariate permutation tests for two sample testing in presence of nondetects with application to microarray data. Stat Methods Med Res 2019; 29:258-271. [PMID: 30799774] [DOI: 10.1177/0962280219832225]
Abstract
Very often, data collected in medical research are characterized by censored observations and/or data with mass at the value zero. This happens, for example, when some measurements fall below the detection limits of the specific instrument used. This type of left-censored observation is called a "nondetect". Such a situation of an excessive number of zeros in a data set is also referred to as zero-inflated data. In the present work, we aim at comparing different multivariate permutation procedures for two-sample testing with data containing nondetects. The effect of censoring is investigated with regard to the different values that may be attributed to nondetected values, both under the null hypothesis and under the alternative. We motivate the problem using data from allergy research.
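A basic two-sample permutation test of the kind being compared can be sketched as follows: univariate, with nondetects set to a single substitution value before computing the test statistic. The multivariate combination procedures and the alternative nondetect treatments studied in the paper are not shown; all names are illustrative.

```r
# Two-sample permutation test with left-censored observations (nondetects).
# Nondetects are replaced by a fixed substitution value before computing the
# statistic; the permutation distribution is over group-label reassignments.
perm_test_nondetect <- function(y, detected, group, sub_value, B = 5000) {
  y_star <- ifelse(detected, y, sub_value)
  stat <- function(g) abs(mean(y_star[g == 1]) - mean(y_star[g == 2]))
  obs  <- stat(group)
  perm <- replicate(B, stat(sample(group)))
  mean(perm >= obs)                      # permutation p-value
}

# Example: group 2 shifted upward, with values below a detection limit of 1
set.seed(5)
y <- c(rlnorm(40), rlnorm(40, meanlog = 0.6))
group <- rep(1:2, each = 40)
detected <- y >= 1
perm_test_nondetect(y, detected, group, sub_value = 0.5)
```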
41. Lachos VH, Matos LA, Castro LM, Chen MH. Flexible longitudinal linear mixed models for multiple censored responses data. Stat Med 2018; 38:1074-1102. [PMID: 30421470] [DOI: 10.1002/sim.8017]
Abstract
In biomedical studies and clinical trials, repeated measures are often subject to some upper and/or lower limits of detection. Hence, the responses are either left or right censored. A complication arises when more than one series of responses is repeatedly collected on each subject at irregular intervals over a period of time and the data exhibit tails heavier than the normal distribution. The multivariate censored linear mixed effect (MLMEC) model is a frequently used tool for a joint analysis of more than one series of longitudinal data. In this context, we develop a robust generalization of the MLMEC based on the scale mixtures of normal distributions. To take into account the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is considered. For this complex longitudinal structure, we propose an exact estimation procedure to obtain the maximum-likelihood estimates of the fixed effects and variance components using a stochastic approximation of the EM algorithm. This approach allows us to estimate the parameters of interest easily and quickly as well as to obtain the standard errors of the fixed effects, the predictions of unobservable values of the responses, and the log-likelihood function as a byproduct. The proposed method is applied to analyze a set of AIDS data and is examined via a simulation study.
42. Chik AHS, Schmidt PJ, Emelko MB. Learning something from nothing: the critical importance of rethinking microbial non-detects. Front Microbiol 2018; 9:2304. [PMID: 30344512] [PMCID: PMC6182096] [DOI: 10.3389/fmicb.2018.02304]
Abstract
Accurate estimation of microbial concentrations is necessary to inform many important environmental science and public health decisions and regulations. Critically, widespread misconceptions about laboratory-reported microbial non-detects have led to their erroneous description and handling as "censored" values. This ultimately compromises their interpretation and undermines efforts to describe and model microbial concentrations accurately. Herein, these misconceptions are dispelled by (1) discussing the critical differences between discrete microbial observations and continuous data acquired using analytical chemistry methodologies and (2) demonstrating the bias introduced by statistical approaches tailored for chemistry data and misapplied to discrete microbial data. Notably, these approaches especially preclude the accurate representation of low concentrations and those estimated using microbial methods with low or variable analytical recovery, which can be expected to result in non-detects. Techniques that account for the probabilistic relationship between observed data and underlying microbial concentrations have been widely demonstrated, and their necessity for handling non-detects (in a way which is consistent with the handling of positive observations) is underscored herein. Habitual reporting of raw microbial observations and sample sizes is proposed to facilitate accurate estimation and analysis of microbial concentrations.
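The point that non-detects are informative discrete observations is easy to make concrete: under a Poisson count model, a non-detect is simply a zero count, with probability exp(-c * volume * recovery), so counts and non-detects can be combined in one likelihood. The sketch below is an assumed minimal illustration with hypothetical names, not the authors' analysis.

```r
# Maximum-likelihood estimate of a microbial concentration c from plate counts,
# where non-detects are zero counts: count_i ~ Poisson(c * volume_i * recovery_i).
estimate_concentration <- function(count, volume, recovery = 1) {
  negll <- function(log_c)
    -sum(dpois(count, lambda = exp(log_c) * volume * recovery, log = TRUE))
  exp(optimize(negll, interval = c(-20, 20))$minimum)
}

# Example: 12 samples of 100 mL at a true concentration of 0.02 organisms/mL,
# several of which are expected to be non-detects (zero counts).
set.seed(6)
counts <- rpois(12, lambda = 0.02 * 100)
counts                                   # the zeros are the "non-detects"
estimate_concentration(counts, volume = rep(100, 12))
```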
43. Szarka AZ, Hayworth CG, Ramanarayanan TS, Joseph RSI. Statistical techniques to analyze Pesticide Data Program food residue observations. J Agric Food Chem 2018; 66:7165-7171. [PMID: 29902006] [DOI: 10.1021/acs.jafc.8b00863]
Abstract
The U.S. EPA conducts dietary-risk assessments to ensure that levels of pesticides on food in the U.S. food supply are safe. Often these assessments utilize conservative residue estimates, maximum residue levels (MRLs), and a high-end estimate derived from registrant-generated field-trial data sets. A more realistic estimate of consumers' pesticide exposure from food may be obtained by utilizing residues from food-monitoring programs, such as the Pesticide Data Program (PDP) of the U.S. Department of Agriculture. A substantial portion of food-residue concentrations in PDP monitoring programs are below the limits of detection (left-censored), which makes the comparison of regulatory field-trial and PDP residue levels difficult. In this paper, we present a novel adaptation of established statistical techniques, the Kaplan-Meier estimator (K-M), robust regression on order statistics (ROS), and the maximum-likelihood estimator (MLE), to quantify pesticide-residue concentrations in the presence of heavily censored data sets. The examined statistical approaches include the most commonly used parametric and nonparametric methods for handling left-censored data in the medical and environmental sciences. This work presents a case study in which data on thiamethoxam residue on bell pepper generated from registrant field trials were compared with PDP-monitoring residue values. The results from the statistical techniques were evaluated and compared with commonly used simple substitution methods for the determination of summary statistics. The MLE was found to be the most appropriate statistical method for analyzing this residue data set. Using the MLE technique, the data analyses showed that the median and mean PDP bell pepper residue levels were approximately 19 and 7 times lower, respectively, than the corresponding statistics of the field-trial residues.
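The MLE approach singled out above can be sketched for a lognormal residue distribution, with non-detects contributing through the probability of falling below their limit of detection. This is an assumed, generic sketch with illustrative names, not the study's actual analysis or data.

```r
# Lognormal MLE with left-censored observations: detects contribute a density
# term, non-detects contribute P(X < LOD) at their limit of detection.
fit_lognormal_censored <- function(x, detected, lod) {
  negll <- function(par) {
    mu <- par[1]; sigma <- exp(par[2])
    -sum(ifelse(detected,
                dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE),
                plnorm(lod, meanlog = mu, sdlog = sigma, log.p = TRUE)))
  }
  opt <- optim(c(0, 0), negll)
  c(meanlog = opt$par[1], sdlog = exp(opt$par[2]),
    mean = exp(opt$par[1] + exp(opt$par[2])^2 / 2))
}

# Example: about half the residues fall below a detection limit of 0.05
set.seed(7)
res <- rlnorm(200, meanlog = -3, sdlog = 1)
det <- res >= 0.05
fit_lognormal_censored(ifelse(det, res, NA), det, lod = rep(0.05, 200))
```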
44. Orbe J, Virto J. Penalized spline smoothing using Kaplan-Meier weights with censored data. Biom J 2018; 60:947-961. [PMID: 29943440] [DOI: 10.1002/bimj.201700213]
Abstract
In this paper, we consider the problem of nonparametric curve fitting in the specific context of censored data. We propose an extension of the penalized splines approach using Kaplan-Meier weights to take into account the effect of censorship, together with generalized cross-validation techniques to choose the smoothing parameter adapted to the case of censored samples. Using various simulation studies, we analyze the effectiveness of the proposed censored penalized splines method and show that its performance is quite satisfactory. We have extended this proposal to a generalized additive models (GAM) framework by introducing a correction for the censorship effect, thus enabling more complex models to be estimated immediately. The Stanford Heart Transplant data are also used to illustrate the proposed methodology, which is shown to be a good alternative when the probability distribution of the response variable and the functional form are not known in censored regression models.
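The Kaplan-Meier (Stute-type) weights that drive this kind of approach can be sketched directly: each uncensored response receives the jump of the KM estimator at its value, and the weights are passed to an off-the-shelf penalized smoother. The sketch below uses smooth.spline with its default smoothing-parameter selection rather than the authors' censoring-adapted generalized cross-validation, and all names are illustrative.

```r
library(survival)

# Kaplan-Meier (Stute) weights: the jump of the KM estimator of the response
# at each uncensored observation; censored observations get weight zero.
km_weights <- function(y, status) {
  km <- survfit(Surv(y, status) ~ 1)
  jump <- c(1, head(km$surv, -1)) - km$surv        # KM jumps at the sorted times
  w <- jump[match(y, km$time)]                     # map jumps back to observations
  ifelse(status == 1, w, 0)
}

# Example: censored nonparametric regression of y on x
set.seed(8)
n <- 300
x <- runif(n, 0, 3)
t_true <- sin(2 * x) + 3 + 0.3 * rnorm(n)          # true (uncensored) response
c_time <- runif(n, 2, 6)                           # right-censoring values
y      <- pmin(t_true, c_time)
status <- as.numeric(t_true <= c_time)

w   <- km_weights(y, status)
fit <- smooth.spline(x[w > 0], y[w > 0], w = w[w > 0])   # weighted penalized spline
```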
45. Lin TI, Lachos VH, Wang WL. Multivariate longitudinal data analysis with censored and intermittent missing responses. Stat Med 2018; 37:2822-2835. [PMID: 29740829] [DOI: 10.1002/sim.7692]
Abstract
The multivariate linear mixed model (MLMM) has emerged as an important analytical tool for longitudinal data with multiple outcomes. However, the analysis of multivariate longitudinal data could be complicated by the presence of censored measurements because of a detection limit of the assay in combination with unavoidable missing values arising when subjects miss some of their scheduled visits intermittently. This paper presents a generalization of the MLMM approach, called the MLMM-CM, for a joint analysis of the multivariate longitudinal data with censored and intermittent missing responses. A computationally feasible expectation maximization-based procedure is developed to carry out maximum likelihood estimation within the MLMM-CM framework. Moreover, the asymptotic standard errors of fixed effects are explicitly obtained via the information-based method. We illustrate our methodology by using simulated data and a case study from an AIDS clinical trial. Experimental results reveal that the proposed method is able to provide more satisfactory performance as compared with the traditional MLMM approach.
46.
Abstract
In modeling censored data, survival forest models are a competitive nonparametric alternative to traditional parametric or semiparametric models when the functional forms are possibly misspecified or the underlying assumptions are violated. In this work, we propose a survival forest approach with trees constructed using novel pseudo-R2 splitting rules. By studying well-known benchmark data sets, we find that the proposed model generally outperforms popular survival models, such as random survival forests with different splitting rules, the Cox proportional hazards model, and the generalized boosted model, in terms of the C-index metric.
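A benchmark comparison of the kind described can be sketched with off-the-shelf tools: a random survival forest with the standard log-rank splitting rule against a Cox model, scored by Harrell's C-index. The pseudo-R2 splitting rule itself is not available here, so this only illustrates the evaluation setup, on an assumed built-in dataset.

```r
library(survival)
library(randomForestSRC)

# Compare a random survival forest (log-rank splitting) with a Cox model on
# the veteran lung cancer data, using Harrell's C-index.
rsf <- rfsrc(Surv(time, status) ~ ., data = veteran, splitrule = "logrank")
cox <- coxph(Surv(time, status) ~ ., data = veteran)

# For survival forests in randomForestSRC, the reported OOB error is 1 - C-index.
c_rsf <- 1 - tail(rsf$err.rate[!is.na(rsf$err.rate)], 1)
c_cox <- summary(cox)$concordance[1]
c(forest = c_rsf, cox = unname(c_cox))
```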
47. Khan MHR. On the performance of adaptive preprocessing technique in analyzing high-dimensional censored data. Biom J 2018; 60:687-702. [PMID: 29603360] [DOI: 10.1002/bimj.201600256]
Abstract
Preprocessing high-dimensional censored datasets, such as microarray data, is generally considered an important technique for gaining stability by reducing potential noise in the data. When variable selection including inference is carried out with high-dimensional censored data, the objective is to obtain a smaller subset of variables and then perform the inferential analysis using model estimates based on that selected subset. This two-stage inferential analysis is prone to circularity bias because of the noise that might still remain in the dataset. In this work, I propose an adaptive preprocessing technique that uses the sure independence screening (SIS) idea to accomplish variable selection and reduce the circularity bias of several well-known refined high-dimensional methods, namely the elastic net, adaptive elastic net, weighted elastic net, elastic net-AFT, and two greedy variable selection methods known as TCS and PC-simple, all implemented with accelerated lifetime models. The proposed technique addresses several features, including collinearity between important and some unimportant covariates (often the case in high-dimensional settings under a variable selection framework) and different levels of censoring. Simulation studies, along with an empirical analysis of a real microarray dataset on mantle cell lymphoma, are carried out to demonstrate the performance of the adaptive preprocessing technique.
48. Li X, Xie S, Zeng D, Wang Y. Efficient ℓ0-norm feature selection based on augmented and penalized minimization. Stat Med 2018; 37:473-486. [PMID: 29082539] [PMCID: PMC5768461] [DOI: 10.1002/sim.7526]
Abstract
Advances in high-throughput technologies in genomics and imaging yield unprecedentedly large numbers of prognostic biomarkers. To accommodate the scale of biomarkers and study their association with disease outcomes, penalized regression is often used to identify important biomarkers. The ideal variable selection procedure would search for the best subset of predictors, which is equivalent to imposing an ℓ0-penalty on the regression coefficients. Since this optimization is a nondeterministic polynomial-time hard (NP-hard) problem that does not scale with the number of biomarkers, alternative methods mostly place smooth penalties on the regression parameters, which lead to computationally feasible optimization problems. However, empirical studies and theoretical analyses show that convex approximations of the ℓ0-norm (e.g., ℓ1) do not outperform their ℓ0 counterpart. Progress on ℓ0-norm feature selection has been relatively slower, the main methods being greedy algorithms such as stepwise regression or orthogonal matching pursuit; penalized regression based on regularizing the ℓ0-norm remains much less explored in the literature. In this work, inspired by the recently popular augmenting and data-splitting algorithms, including the alternating direction method of multipliers, we propose a two-stage procedure for ℓ0-penalty variable selection, referred to as augmented penalized minimization-L0 (APM-L0). APM-L0 targets the ℓ0-norm as closely as possible while keeping computation tractable, efficient, and simple, which is achieved by iterating between a convex regularized regression and a simple hard-thresholding estimation. The procedure can be viewed as arising from regularized optimization with a truncated ℓ1 norm. Thus, we propose to treat the regularization parameter and the thresholding parameter as tuning parameters and to select them by cross-validation. A one-step coordinate descent algorithm is used in the first stage to significantly improve computational efficiency. Through extensive simulation studies and a real data application, we demonstrate the superior performance of the proposed method in terms of selection accuracy and computational speed compared with existing methods. The proposed APM-L0 procedure is implemented in the R package APML0.
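The iterate-between-convex-regression-and-hard-thresholding idea can be illustrated with a single pass: a lasso fit, hard thresholding of its coefficients, and a refit on the retained support. This sketch is not the APML0 package interface, the threshold is fixed rather than tuned by cross-validation, and all names are illustrative.

```r
library(glmnet)

# One pass of a "convex fit + hard threshold + refit" scheme for sparse
# linear regression (a simplified stand-in for the two-stage APM-L0 idea).
lasso_threshold_refit <- function(x, y, threshold) {
  cvfit <- cv.glmnet(x, y, alpha = 1)                       # stage 1: convex (lasso) fit
  beta  <- as.vector(coef(cvfit, s = "lambda.min"))[-1]     # drop the intercept
  keep  <- which(abs(beta) > threshold)                     # stage 2: hard thresholding
  refit <- lm(y ~ x[, keep, drop = FALSE])                  # refit on the retained support
  list(selected = keep, fit = refit)
}

# Example: 5 true signals among 200 predictors
set.seed(9)
n <- 200; p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:5] %*% rep(2, 5)) + rnorm(n)
lasso_threshold_refit(x, y, threshold = 0.5)$selected
```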
49. Jaspers S, Komárek A, Aerts M. Bayesian estimation of multivariate normal mixtures with covariate-dependent mixing weights, with an application in antimicrobial resistance monitoring. Biom J 2018; 60:7-19. [PMID: 28898442] [DOI: 10.1002/bimj.201600253]
Abstract
Bacteria with reduced susceptibility to antimicrobials pose a major threat to public health. Therefore, large programs have been set up to collect minimum inhibitory concentration (MIC) values. These values can be used to monitor the distribution of nonsusceptible isolates in the general population. Data are collected within several countries and over a number of years. In addition, the sampled bacterial isolates were not tested for susceptibility against a single antimicrobial, but rather against an entire range of substances. Interest is therefore in the analysis of the joint distribution of MIC data on two or more antimicrobials, while accounting for a possible effect of covariates. In this regard, we present a Bayesian semiparametric density estimation routine based on multivariate Gaussian mixtures. The mixing weights are allowed to depend on certain covariates, thereby allowing the user to detect changes over, for example, time. The new approach was applied to data collected in Europe in 2010, 2012, and 2013. We investigated the susceptibility of Escherichia coli isolates against ampicillin and trimethoprim, where we found that there seems to be a significant increase in the proportion of nonsusceptible isolates. In addition, a simulation study was carried out, showing the promising behavior of the proposed method in the field of antimicrobial resistance.
50. Fang EX, Ning Y, Liu H. Testing and confidence intervals for high dimensional proportional hazards model. J R Stat Soc Series B Stat Methodol 2017; 79:1415-1437. [PMID: 37854943] [PMCID: PMC10584375] [DOI: 10.1111/rssb.12224]
Abstract
This paper proposes a decorrelation-based approach to test hypotheses and construct confidence intervals for the low dimensional component of high dimensional proportional hazards models. Motivated by the geometric projection principle, we propose new decorrelated score, Wald, and partial likelihood ratio statistics. Without assuming model selection consistency, we prove the asymptotic normality of these test statistics and establish their semiparametric optimality. We also develop new procedures for constructing pointwise confidence intervals for the baseline hazard function and the baseline survival function. Thorough numerical results are provided to back up our theory.