1
|
Zhang Z. Variable selection with stepwise and best subset approaches. Ann Transl Med 2016; 4:136. [PMID: 27162786 PMCID: PMC4842399 DOI: 10.21037/atm.2016.03.35] [Citation(s) in RCA: 319] [Impact Index Per Article: 35.4] [Received: 12/25/2015] [Accepted: 01/24/2016] [Indexed: 02/05/2023]
Abstract
While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are performed automatically by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with the character values "forward", "backward" and "both". The bestglm() function begins with a data frame containing explanatory variables and a response variable. The response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion (AIC).
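The closing remark about BIC can be illustrated with the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below is written in Python purely for illustration and does not reproduce stepAIC() or bestglm(); the sample size, parameter counts and log-likelihoods are hypothetical numbers, not results from the article.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits on n = 200 observations (illustrative numbers only):
# the larger model adds 3 parameters and gains 4 log-likelihood units.
n = 200
small = {"log_lik": -120.0, "k": 3}
large = {"log_lik": -116.0, "k": 6}

# Lower values are better for both criteria.
aic_prefers_large = aic(**large) < aic(**small)
bic_prefers_large = bic(n=n, **large) < bic(n=n, **small)

print("AIC prefers larger model:", aic_prefers_large)
print("BIC prefers larger model:", bic_prefers_large)
```

With these made-up inputs the two criteria disagree: the per-parameter penalty is 2 for AIC but ln(n) for BIC, and ln(n) exceeds 2 as soon as n is 8 or more, which is why BIC tends to keep fewer variables.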
|
editorial |
9 |
319 |
2
|
Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med 2016; 4:218. [PMID: 27386492 PMCID: PMC4916348 DOI: 10.21037/atm.2016.03.37] [Citation(s) in RCA: 295] [Impact Index Per Article: 32.8] [Received: 01/25/2016] [Accepted: 02/18/2016] [Indexed: 02/05/2023]
Abstract
Machine learning techniques have been widely used in many scientific fields, but their use in the medical literature is limited, partly because of technical difficulties. k-nearest neighbors (kNN) is a simple method of machine learning. The article introduces some basic ideas underlying the kNN algorithm, and then focuses on how to perform kNN modeling with R. The dataset should be prepared before running the knn() function in R. After prediction of the outcome with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, distance calculation and choice of appropriate predictors all have a significant impact on model performance.
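The idea behind kNN can be sketched in a few lines. The example below is a minimal Python illustration, not the R knn() workflow the article describes; the toy coordinates, class labels and the helper names knn_predict and accuracy are all hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(predicted, actual):
    """Proportion of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Toy, well-separated data (hypothetical values for illustration only)
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
test_x = [(1.1, 1.0), (3.0, 3.0), (2.9, 3.1)]
test_y = ["A", "B", "B"]

preds = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
print(preds, accuracy(preds, test_y))
```

In practice predictors are usually rescaled before distances are computed, since a variable measured on a large scale would otherwise dominate the vote; that step is omitted here to keep the sketch short.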
|
editorial |
9 |
295 |
3
|
Zhang Z. Model building strategy for logistic regression: purposeful selection. Ann Transl Med 2016; 4:111. [PMID: 27127764 PMCID: PMC4828741 DOI: 10.21037/atm.2016.02.15] [Citation(s) in RCA: 289] [Impact Index Per Article: 32.1] [Received: 12/30/2015] [Accepted: 01/19/2016] [Indexed: 02/06/2023]
Abstract
Logistic regression is one of the most commonly used models to account for confounders in the medical literature. The article introduces how to perform the purposeful selection model building strategy with R. I stress the use of the likelihood ratio test to see whether deleting a variable will have a significant impact on model fit. A deleted variable should also be checked for whether it is an important adjustment of the remaining covariates. Interactions should be checked to disentangle complex relationships between covariates and their synergistic effect on the response variable. The model should be checked for goodness-of-fit (GOF), that is, how well the fitted model reflects the real data. The Hosmer-Lemeshow GOF test is the most widely used for logistic regression models.
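The likelihood ratio test mentioned above compares the log-likelihoods of two nested models. The following Python sketch covers only the simplest case, dropping a single parameter (1 degree of freedom); the function name lr_test_df1 and the log-likelihood values are hypothetical and are not taken from the article.

```python
import math

def lr_test_df1(loglik_full, loglik_reduced):
    """Likelihood ratio test for dropping ONE parameter (df = 1).

    Returns the test statistic and its p-value. The models must be nested,
    so the full model's log-likelihood is at least that of the reduced one.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    # For 1 df, the chi-square upper tail probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods of a full model and the model without one covariate
stat, p = lr_test_df1(loglik_full=-210.4, loglik_reduced=-213.9)
print(f"LR statistic = {stat:.2f}, p = {p:.4f}")
```

A small p-value suggests that removing the variable worsens the fit noticeably, so the variable would be retained; for variables with more than one parameter (for example a multi-level factor) a general chi-square tail function is needed instead of the 1 df shortcut used here.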
|
editorial |
9 |
289 |
4
|
Zhang Z. Multiple imputation with multivariate imputation by chained equation (MICE) package. Ann Transl Med 2016; 4:30. [PMID: 26889483 PMCID: PMC4731595 DOI: 10.3978/j.issn.2305-5839.2015.12.63] [Citation(s) in RCA: 267] [Impact Index Per Article: 29.7] [Received: 11/05/2015] [Accepted: 12/15/2015] [Indexed: 02/05/2023]
Abstract
Multiple imputation (MI) is an advanced technique for handling missing values. It is superior to single imputation in that it takes into account the uncertainty in missing value imputation. However, MI is underutilized in the medical literature because of lack of familiarity and computational challenges. The article provides a step-by-step approach to performing MI with the R multivariate imputation by chained equation (MICE) package. The procedure first imputes m complete datasets by calling the mice() function. Then statistical analyses such as univariate analysis and regression modeling can be performed within each dataset by calling the with() function. This function sets the environment for statistical analysis. Lastly, the results obtained from each analysis are combined by using the pool() function.
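The final pooling step is conventionally based on Rubin's rules, which combine the m per-dataset estimates and their variances. The Python sketch below shows only that arithmetic under stated assumptions: the five estimates and variances are hypothetical, the function name pool_rubin is invented for illustration, and the degrees-of-freedom adjustment that a full implementation reports is omitted.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates with Rubin's rules.

    estimates: point estimates from each imputed dataset
    variances: squared standard errors from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    within = statistics.fmean(variances)     # within-imputation variance
    between = statistics.variance(estimates) # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between   # total variance
    return q_bar, total

# Hypothetical regression coefficient estimated in m = 5 imputed datasets
est = [0.50, 0.60, 0.55, 0.45, 0.65]
var = [0.010, 0.012, 0.011, 0.009, 0.013]

pooled, total_var = pool_rubin(est, var)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {total_var ** 0.5:.3f}")
```

The between-imputation term is what single imputation ignores: it inflates the standard error to reflect how much the estimate changes from one imputed dataset to the next.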
|
editorial |
9 |
267 |
5
|
Zhang Z, Xu X, Ni H. Small studies may overestimate the effect sizes in critical care meta-analyses: a meta-epidemiological study. Crit Care 2013; 17:R2. [PMID: 23302257 PMCID: PMC4056100 DOI: 10.1186/cc11919] [Citation(s) in RCA: 227] [Impact Index Per Article: 18.9] [Received: 10/29/2012] [Revised: 12/18/2012] [Accepted: 01/07/2013] [Indexed: 02/07/2023] Open
Abstract
INTRODUCTION Small-study effects refer to the fact that trials with limited sample sizes are more likely to report larger beneficial effects than large trials. However, this has never been investigated in critical care medicine. Thus, the present study aimed to examine the presence and extent of small-study effects in critical care medicine. METHODS Critical care meta-analyses that involved randomized controlled trials and reported mortality as an outcome measure were considered eligible for the study. Component trials were classified as large (≥100 patients per arm) or small (<100 patients per arm) according to their sample sizes. The ratio of odds ratios (ROR) was calculated for each meta-analysis and then the RORs were combined using a meta-analytic approach. ROR<1 indicated a larger beneficial effect in small trials. Small and large trials were compared in methodological quality, including sequence generation, blinding, allocation concealment, intention to treat and sample size calculation. RESULTS A total of 27 critical care meta-analyses involving 317 trials were included. Of them, five meta-analyses showed statistically significant RORs <1, and the other meta-analyses did not reach statistical significance. Overall, the pooled ROR was 0.60 (95% CI: 0.53 to 0.68); the heterogeneity was moderate with an I² of 50.3% (chi-squared = 52.30; P = 0.002). Large trials showed significantly better reporting quality than small trials in terms of sequence generation, allocation concealment, blinding, intention to treat, sample size calculation and incomplete follow-up data. CONCLUSIONS Small trials are more likely to report larger beneficial effects than large trials in critical care medicine, which could be partly explained by the lower methodological quality of small trials. Caution should be exercised in the interpretation of meta-analyses involving small trials.
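The reported heterogeneity can be checked against the usual definition I² = max(0, (Q - df) / Q) x 100. The Python sketch below assumes df = 27 - 1 = 26, the conventional choice when 27 estimates are pooled; the abstract does not state the degrees of freedom explicitly. The ror helper is likewise an assumption about the direction of the ratio (small-trial OR divided by large-trial OR), inferred from the statement that ROR<1 indicates a larger benefit in small trials, and the odds ratios passed to it are hypothetical.

```python
def i_squared(q, df):
    """Higgins' I^2 (in percent) from Cochran's Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def ror(or_small, or_large):
    """Ratio of odds ratios, assumed here to be small-trial OR over large-trial OR."""
    return or_small / or_large

# Values from the abstract: chi-squared (Q) = 52.30 across 27 meta-analyses.
# Assumption: df = number of meta-analyses - 1.
i2 = i_squared(52.30, 27 - 1)
print(round(i2, 1))  # 50.3, matching the reported I-squared

# Hypothetical pooled mortality odds ratios within a single meta-analysis
example_ror = ror(or_small=0.60, or_large=0.90)
print(example_ror < 1)  # small trials show the larger apparent benefit
```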
|
Meta-Analysis |
12 |
227 |
6
|
Zhang Z, Ho KM, Hong Y. Machine learning for the prediction of volume responsiveness in patients with oliguric acute kidney injury in critical care. Crit Care 2019; 23:112. [PMID: 30961662 PMCID: PMC6454725 DOI: 10.1186/s13054-019-2411-z] [Citation(s) in RCA: 197] [Impact Index Per Article: 32.8] [Received: 01/31/2019] [Accepted: 03/26/2019] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND AND OBJECTIVES Excess fluid balance in acute kidney injury (AKI) may be harmful, and conversely, some patients may respond to fluid challenges. This study aimed to develop a prediction model that can be used to differentiate between volume-responsive (VR) and volume-unresponsive (VU) AKI. METHODS AKI patients with urine output < 0.5 ml/kg/h for the first 6 h after ICU admission and fluid intake > 5 l in the following 6 h in the US-based critical care database (Medical Information Mart for Intensive Care (MIMIC-III)) were considered. Patients who received diuretics and renal replacement on day 1 were excluded. Two predictive models, using either machine learning extreme gradient boosting (XGBoost) or logistic regression, were developed to predict urine output > 0.65 ml/kg/h during 18 h succeeding the initial 6 h for assessing oliguria. Established models were assessed by using out-of-sample validation. The whole sample was split into training and testing samples by the ratio of 3:1. MAIN RESULTS Of the 6682 patients included in the analysis, 2456 (36.8%) patients were volume responsive with an increase in urine output after receiving > 5 l fluid. Urinary creatinine, blood urea nitrogen (BUN), age, and albumin were the important predictors of VR. The machine learning XGBoost model outperformed the traditional logistic regression model in differentiating between the VR and VU groups (AU-ROC, 0.860; 95% CI, 0.842 to 0.878 vs. 0.728; 95% CI 0.703 to 0.753, respectively). CONCLUSIONS The XGBoost model was able to differentiate between patients who would and would not respond to fluid intake in urine output better than a traditional logistic regression model. This result suggests that machine learning techniques have the potential to improve the development and validation of predictive modeling in critical care research.
|
research-article |
6 |
197 |
7
|
Zhang Z, Kattan MW. Drawing Nomograms with R: applications to categorical outcome and survival data. Ann Transl Med 2017; 5:211. [PMID: 28603726 PMCID: PMC5451623 DOI: 10.21037/atm.2017.04.01] [Citation(s) in RCA: 172] [Impact Index Per Article: 21.5] [Received: 09/22/2016] [Accepted: 03/23/2017] [Indexed: 02/05/2023]
Abstract
Outcome prediction is a major task in clinical medicine. The standard approach to this work is to collect a variety of predictors and build a model of an appropriate type. The model is a mathematical equation that connects the outcome of interest with the predictors. The outcome of a new patient with given clinical characteristics can be predicted with this model. However, the equation describing the relationship between predictors and outcome is often complex, and the computation requires software for practical use. An alternative is the nomogram, a graphical calculating device that allows an approximate graphical computation of a mathematical function. In this article, we describe how to draw nomograms for various outcomes with the nomogram() function. A binary outcome is fit by a logistic regression model, and the outcome of interest is the probability of the event of interest. Ordinal outcome variables are also discussed. Survival data can be fit with a parametric model to fully describe the distribution of survival time. Statistics such as the median survival time and the survival probability up to a specific time point are taken as the outcome of interest.
|
Editorial |
8 |
172 |
8
|
Zhang Z. Univariate description and bivariate statistical inference: the first step delving into data. Ann Transl Med 2016; 4:91. [PMID: 27047950 PMCID: PMC4791343 DOI: 10.21037/atm.2016.02.11] [Citation(s) in RCA: 142] [Impact Index Per Article: 15.8] [Received: 12/25/2015] [Accepted: 01/10/2016] [Indexed: 02/05/2023]
Abstract
In observational studies, the first step is usually to explore the data distribution and the baseline differences between groups. Data description includes central tendency (e.g., mean, median, and mode) and dispersion (e.g., standard deviation, range, interquartile range). There are a variety of bivariate statistical inference methods such as Student's t-test, the Mann-Whitney U test and the Chi-square test, for normal, skewed and categorical data, respectively. The article shows how to perform these analyses with R code. Furthermore, I believe that automation of the whole workflow is of paramount importance in that (I) it allows others to repeat your results; (II) you can easily find out how you performed the analysis during revision; (III) it spares data input by hand and is less error-prone; and (IV) when you correct your original dataset, the final result can be automatically corrected by executing the code. Therefore, the process of making a publication-quality table incorporating all the abovementioned statistics and P values is provided, allowing readers to customize the code to their own needs.
|
editorial |
9 |
142 |
9
|
Zhang Z. Propensity score method: a non-parametric technique to reduce model dependence. Ann Transl Med 2017; 5:7. [PMID: 28164092 PMCID: PMC5253298 DOI: 10.21037/atm.2016.08.57] [Citation(s) in RCA: 125] [Impact Index Per Article: 15.6] [Received: 06/05/2016] [Accepted: 07/02/2016] [Indexed: 02/05/2023]
Abstract
Propensity score analysis (PSA) is a powerful technique that balances pretreatment covariates, making causal effect inference from observational data as reliable as possible. The use of PSA in the medical literature has increased exponentially in recent years, and the trend continues to rise. The article introduces the rationale behind PSA, followed by an illustration of how to perform PSA in R with the MatchIt package. There are a variety of methods available for PS matching, such as nearest neighbor, full matching, exact matching and genetic matching. The task can be done simply by assigning a string value to the method argument in the matchit() function. The generic summary() and plot() functions can be applied to an object of class matchit to check covariate balance after matching. Furthermore, there is a useful package, PSAgraphics, that contains several graphical functions to check covariate balance between treatment groups across strata. If covariate balance is not achieved, one can modify the model specification or use other techniques such as random forest and recursive partitioning to better represent the underlying structure between pretreatment covariates and treatment assignment. The process can be repeated until the desired covariate balance is achieved.
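Nearest neighbor matching, the first method listed above, can be sketched as a greedy pairing on the propensity score. The Python example below is an illustration under stated assumptions and is not the matchit() implementation: the propensity scores are hypothetical and assumed to be already estimated, matching is 1:1 without replacement, and processing treated units from the highest score downward is an ordering chosen for this sketch.

```python
def nearest_neighbor_match(treated, control, caliper=None):
    """Greedy 1:1 matching without replacement on the propensity score.

    treated, control: dicts mapping unit id -> propensity score.
    caliper: optional maximum allowed score difference within a pair.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not yet used
    pairs = []
    # Process treated units from highest to lowest score (ordering assumed for this sketch)
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if caliper is None or abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement
    return pairs

# Hypothetical propensity scores
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
control = {"c1": 0.78, "c2": 0.50, "c3": 0.33, "c4": 0.10, "c5": 0.62}

pairs = nearest_neighbor_match(treated, control)
print(pairs)
```

After matching, covariate balance between the paired groups still has to be checked, as the abstract notes; a caliper trades sample size for closer matches by leaving poorly matched treated units unpaired.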
|
Editorial |
8 |
125 |
10
|
Zhang Z. Survival analysis in the presence of competing risks. Ann Transl Med 2017; 5:47. [PMID: 28251126 PMCID: PMC5326634 DOI: 10.21037/atm.2016.08.62] [Citation(s) in RCA: 108] [Impact Index Per Article: 13.5] [Received: 06/20/2016] [Accepted: 07/09/2016] [Indexed: 02/05/2023]
Abstract
Survival analysis in the presence of competing risks imposes additional challenges for clinical investigators in that the hazard function (the rate) has no one-to-one link to the cumulative incidence function (CIF, the risk). The CIF is of particular interest and can be estimated non-parametrically with the cuminc() function. This function also allows for group comparison and visualization of the estimated CIF. The effect of covariates on the cause-specific hazard can be explored using the conventional Cox proportional hazard model by treating competing events as censoring. However, the effect on hazard cannot be directly linked to the effect on CIF because there is no one-to-one correspondence between hazard and cumulative incidence. The Fine-Gray model directly models the covariate effect on CIF and reports the subdistribution hazard ratio (SHR). However, the SHR only provides information on the ordering of CIF curves at different levels of covariates; it has no practical interpretation comparable to that of the HR in the absence of competing risks. The Fine-Gray model can be fit with the crr() function shipped with the cmprsk package. Time-varying covariates are allowed in the crr() function and are specified by the cov2 and tf arguments. Prediction and visualization of the CIF for subjects with given covariate values are available for a crr object. Alternatively, competing risk models can be fit with the riskRegression package by employing different link functions between covariates and outcomes. The assumption of proportionality can be checked by testing the statistical significance of interaction terms involving failure time. Schoenfeld residuals provide another way to check the model assumption.
|
Editorial |
8 |
108 |
11
|
Zhang Z. Missing data imputation: focusing on single imputation. Ann Transl Med 2016; 4:9. [PMID: 26855945 PMCID: PMC4716933 DOI: 10.3978/j.issn.2305-5839.2015.12.38] [Citation(s) in RCA: 101] [Impact Index Per Article: 11.2] [Received: 11/18/2015] [Accepted: 12/08/2015] [Indexed: 02/05/2023]
Abstract
Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from the analysis. Therefore, many imputation methods have been developed to close this gap. The present article focuses on single imputation. Imputations with the mean, median and mode are simple but, like complete case analysis, can introduce bias in the mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
|
editorial |
9 |
101 |
12
|
Zhang Z, Zheng C, Kim C, Van Poucke S, Lin S, Lan P. Causal mediation analysis in the context of clinical research. Ann Transl Med 2016; 4:425. [PMID: 27942516 PMCID: PMC5124624 DOI: 10.21037/atm.2016.11.11] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 08/02/2016] [Accepted: 09/25/2016] [Indexed: 02/05/2023]
Abstract
Clinical studies usually collect numerous intermediate variables besides treatment and outcome. These variables are often incorrectly treated as confounding factors and are thus controlled using a variety of multivariable regression models, depending on the type of outcome variable. However, these methods fail to disentangle the underlying mediating processes. Causal mediation analysis (CMA) is a method to dissect the total effect of a treatment into direct and indirect effects. The indirect effect is transmitted via a mediator to the outcome. The mediation package is designed to perform CMA under the assumption of sequential ignorability. It reports the average causal mediation effect (ACME), average direct effect (ADE) and total effect. The package also provides a visualization tool for these estimated effects. Sensitivity analysis is designed to examine whether the results are robust to violation of the sequential ignorability assumption, since the assumption has been criticized as too strong to be satisfied in research practice.
|
Editorial |
9 |
90 |
13
|
Zhang Z, Zhang G, Goyal H, Mo L, Hong Y. Identification of subclasses of sepsis that showed different clinical outcomes and responses to amount of fluid resuscitation: a latent profile analysis. Crit Care 2018; 22:347. [PMID: 30563548 PMCID: PMC6299613 DOI: 10.1186/s13054-018-2279-3] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Received: 09/06/2018] [Accepted: 11/26/2018] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND AND OBJECTIVE Sepsis is a heterogeneous disease and identification of its subclasses may facilitate and optimize clinical management. This study aimed to identify subclasses of sepsis and its responses to different amounts of fluid resuscitation. METHODS This was a retrospective study conducted in an intensive care unit at a large tertiary care hospital. The patients fulfilling the diagnostic criteria of sepsis from June 1, 2001 to October 31, 2012 were included. Clinical and laboratory variables were used to perform the latent profile analysis (LPA). A multivariable logistic regression model was used to explore the independent association of fluid input and mortality outcome. RESULTS In total, 14,993 patients were included in the study. The LPA identified four subclasses of sepsis: profile 1 was characterized by the lowest mortality rate and having the largest proportion and was considered the baseline type; profile 2 was characterized by respiratory dysfunction; profile 3 was characterized by multiple organ dysfunction (kidney, coagulation, liver, and shock), and profile 4 was characterized by neurological dysfunction. Profile 3 showed the highest mortality rate (45.4%), followed by profile 4 (27.4%), 2 (18.2%), and 1 (16.9%). Overall, the amount of fluid needed for resuscitation was the largest on day 1 (median 5115 mL, interquartile range (IQR) 2662 to 8800 mL) and decreased rapidly on day 2 (median 2140 mL, IQR 900 to 3872 mL). Higher cumulative fluid input in the first 48 h was associated with reduced risk of hospital mortality for profile 3 (odds ratio (OR) 0.89, 95% CI 0.83 to 0.95 for each 1000 mL increase in fluid input) and with increased risk of death for profile 4 (OR 1.20, 95% CI 1.11 to 1.30). CONCLUSION The study identified four subphenotypes of sepsis, which showed different mortality outcomes and responses to fluid resuscitation. Prospective trials are needed to validate our findings.
|
research-article |
7 |
88 |
14
|
Zhang Z, Murtagh F, Van Poucke S, Lin S, Lan P. Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Ann Transl Med 2017; 5:75. [PMID: 28275620 PMCID: PMC5337204 DOI: 10.21037/atm.2017.02.05] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Received: 09/19/2016] [Accepted: 01/18/2017] [Indexed: 02/05/2023]
Abstract
Big data clinical research typically involves thousands of patients and there are numerous variables available. Conventionally, these variables can be handled by multivariable regression modeling. In this article, the hierarchical cluster analysis (HCA) is introduced. This method is used to explore similarity between observations and/or clusters. The result can be visualized using heat maps and dendrograms. Sometimes, it would be interesting to add scatter plot and smooth lines into the panels of the heat map. The inherent R heatmap package does not provide this function. A series of scatter plots can be created using lattice package, and then background color of each panel is mapped to the regression coefficient by using custom-made panel functions. This is the unique feature of the lattice package. Dendrograms and color keys can be added as the legend elements of the lattice system. The latticeExtra package provides some useful functions for the work.
|
Editorial |
8 |
80 |
15
|
Zhang Z, Gayle AA, Wang J, Zhang H, Cardinal-Fernández P. Comparing baseline characteristics between groups: an introduction to the CBCgrps package. Ann Transl Med 2017; 5:484. [PMID: 29299446 PMCID: PMC5750271 DOI: 10.21037/atm.2017.09.39] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Received: 08/25/2017] [Accepted: 09/21/2017] [Indexed: 02/05/2023]
Abstract
A usual practice in observational studies is the comparison of baseline characteristics of participants between study groups. The overall population can be grouped by clinical outcome or exposure status. A combined table reporting baseline characteristics is usually displayed, for the overall population and then separately for each group. The last column usually gives the P value for the comparison between study groups. In the conventional research model, the variables for which data are collected are limited in number. It is thus feasible to calculate descriptive data one by one and to manually create the table. The availability of EHR and big data mining techniques makes it possible to explore a far larger number of variables. However, manual tabulation of big data is particularly error prone; it is exceedingly time-consuming to create and revise such tables manually. In this paper, we introduce an R package called CBCgrps, which is designed to automate and streamline the generation of such tables when working with big data. The package contains two functions, twogrps() and multigrps(), which are used for comparisons between two and multiple groups, respectively.
|
Editorial |
8 |
80 |
16
|
Zhang Z. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med 2016; 4:484. [PMID: 28149846 PMCID: PMC5233524 DOI: 10.21037/atm.2016.08.45] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Received: 05/20/2016] [Accepted: 06/23/2016] [Indexed: 02/05/2023]
Abstract
The Weibull regression model is one of the most popular forms of parametric regression model; it provides an estimate of the baseline hazard function as well as coefficients for covariates. Because of technical difficulties, the Weibull regression model is seldom used in the medical literature compared with the semi-parametric proportional hazard model. To make clinical investigators familiar with the Weibull regression model, this article introduces some basic knowledge of the model and then illustrates how to fit it with R software. The SurvRegCensCov package is useful for converting estimated coefficients to clinically relevant statistics such as the hazard ratio (HR) and event time ratio (ETR). Model adequacy can be assessed by inspecting Kaplan-Meier curves stratified by a categorical variable. The eha package provides an alternative way to fit the Weibull regression model. The check.dist() function helps to assess the goodness-of-fit of the model. Variable selection is based on the importance of a covariate, which can be tested using the anova() function. Alternatively, backward elimination starting from a full model is an efficient way to develop the model. Visualization of the Weibull regression model after model development is useful in that it provides another way to report your findings.
|
Editorial |
9 |
68 |
17
|
Zhang Z, Xu X, Fan H, Li D, Deng H. Higher serum chloride concentrations are associated with acute kidney injury in unselected critically ill patients. BMC Nephrol 2013; 14:235. [PMID: 24164963 PMCID: PMC4231437 DOI: 10.1186/1471-2369-14-235] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.6] [Received: 07/18/2013] [Accepted: 10/09/2013] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Chloride administration has been found to be harmful to the kidney in critically ill patients. However, the association between plasma chloride concentration and renal function has never been investigated. METHODS This was a retrospective study conducted in a tertiary 24-bed intensive care unit from September 2010 to November 2012. Data on serum chloride for each patient during their ICU stay were abstracted from an electronic database. Cl0 referred to the initial chloride on ICU entry; Cl(max), Cl(min) and Cl(mean) referred to the maximum, minimum and mean chloride values before the onset of AKI, respectively. AKI was defined according to the conventional AKIN criteria. Univariate and multivariable analyses were performed to examine the association between chloride and AKI development. RESULTS A total of 1221 patients were included in the analysis during the study period. Three hundred and fifty-seven patients (29.2%) developed AKI. Cl(max) was significantly higher in the AKI than in the non-AKI group (111.8 ± 8.1 vs 107.9 ± 5.4 mmol/l; p < 0.001); Cl0 was not significantly different between AKI and non-AKI patients; Cl(mean) was significantly higher in AKI than in non-AKI patients (104.3 ± 5.8 vs 103.4 ± 4.5; p = 0.0047). Cl(max) remained associated with AKI in multivariable analysis (OR: 1.10, 95% CI: 1.08-1.13). CONCLUSION Chloride overload as represented by Cl(mean) and Cl(max) is significantly associated with the development of AKI.
|
research-article |
12 |
67 |
18
|
Zhang Z, Zhang H, Khanal MK. Development of scoring system for risk stratification in clinical medicine: a step-by-step tutorial. Ann Transl Med 2017; 5:436. [PMID: 29201888 PMCID: PMC5690964 DOI: 10.21037/atm.2017.08.22] [Citation(s) in RCA: 62] [Impact Index Per Article: 7.8] [Received: 03/16/2017] [Accepted: 08/11/2017] [Indexed: 02/05/2023]
Abstract
Risk scores play an important role in clinical medicine. With advances in information technology and the availability of electronic healthcare records, scoring systems for less commonly seen diseases and populations can be developed. The aim of the article is to provide a tutorial on how to develop and validate risk scores based on a virtual dataset by using R software. The dataset we generated includes numeric and categorical variables. First, the numeric variables are converted to factor variables according to cutoff points identified by the LOESS smoother. Then risk points for each variable, which are related to the coefficients in logistic regression, are assigned to each level of the converted factor variables and other categorical variables. Finally, the total score is calculated for each subject to represent the predicted probability of the outcome event. The original dataset is split into training and validation subsets. Discrimination and calibration are evaluated in the validation subset. R code with explanations is presented in the main text.
|
Editorial |
8 |
62 |
19
|
Zhang Z, Castelló A. Principal components analysis in clinical studies. Ann Transl Med 2017; 5:351. [PMID: 28936445 PMCID: PMC5599285 DOI: 10.21037/atm.2017.07.12] [Citation(s) in RCA: 61] [Impact Index Per Article: 7.6] [Received: 04/26/2017] [Accepted: 06/28/2017] [Indexed: 02/05/2023]
Abstract
In multivariate analysis, independent variables are usually correlated with each other, which can introduce multicollinearity into regression models. One approach to solving this problem is to apply principal components analysis (PCA) to these variables. This method uses an orthogonal transformation to represent sets of potentially correlated variables with principal components (PCs) that are linearly uncorrelated. PCs are ordered so that the first PC has the largest possible variance, and only some components are selected to represent the correlated variables. As a result, the dimension of the variable space is reduced. This tutorial illustrates how to perform PCA in the R environment; the example is a simulated dataset in which two PCs are responsible for the majority of the variance in the data. Furthermore, the visualization of PCA is highlighted.
|
Editorial |
8 |
61 |
20
|
Zhang Z, Xu X, Ni H, Deng H. Predictive value of ionized calcium in critically ill patients: an analysis of a large clinical database MIMIC II. PLoS One 2014; 9:e95204. [PMID: 24736693 PMCID: PMC3988144 DOI: 10.1371/journal.pone.0095204] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Received: 01/31/2014] [Accepted: 03/24/2014] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND AND OBJECTIVE Ionized calcium (iCa) has been investigated for its association with mortality in intensive care unit (ICU) patients in many studies. However, these studies are small in sample size and the results are conflicting. The present study aimed to establish the association of iCa with mortality by using a large clinical database. METHODS The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC II) database was used for analysis. Patients older than 15 years were eligible, and patients without iCa measured during their ICU stay were excluded. Demographic data and clinical characteristics were extracted and compared between survivors and non-survivors. The iCa measured on ICU admission was defined as Ca0; Camax was the maximum iCa during the ICU stay; Camin was the minimum iCa during the ICU stay; Camean was the arithmetic mean iCa during the ICU stay. MAIN RESULTS A total of 15409 ICU admissions satisfied our inclusion criteria and were included in our analysis. The prevalence of hypocalcemia on ICU entry was 62.06%. Ca0 was significantly lower in non-survivors than in survivors (1.11 ± 0.14 vs 1.13 ± 0.10 mmol/l, p<0.001). In multivariate analysis, moderate hypocalcemia in Ca0 was significantly associated with increased risk of death (OR: 1.943; 95% CI: 1.340-2.817), and mild hypercalcemia was associated with lower mortality (OR: 0.553, 95% CI: 0.400-0.767). While moderate and mild hypocalcemia in Camean is associated with increased risk of death (OR: 1.153, 95% CI: 1.006-1.322 and OR: 2.520, 95% CI: 1.485-4.278), hypercalcemia in Camean is not significantly associated with ICU mortality. CONCLUSION The relationship between Ca0 and clinical outcome follows a U-shaped curve with the nadir at the normal range, extending slightly to hypercalcemia. Mild hypercalcemia in Ca0 is protective, whereas moderate and mild hypocalcemia in Camean is associated with increased risk of death.
|
research-article |
11 |
58 |
21
|
Zhang Z, Geskus RB, Kattan MW, Zhang H, Liu T. Nomogram for survival analysis in the presence of competing risks. ANNALS OF TRANSLATIONAL MEDICINE 2017; 5:403. [PMID: 29152503 PMCID: PMC5673789 DOI: 10.21037/atm.2017.07.27] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 06/28/2017] [Accepted: 07/12/2017] [Indexed: 02/05/2023]
Abstract
Clinical research usually involves time-to-event survival analysis, in which competing events are common. It is acceptable to use conventional Cox proportional hazards regression to model the cause-specific hazard. However, the cause-specific hazard cannot be translated directly into the cumulative incidence function, and the latter is usually the clinically relevant quantity. Subdistribution hazard regression directly quantifies the impact of covariates on the cumulative incidence. When estimating the subdistribution hazard, subjects who experience a competing event continue to contribute to the risk set, and censoring weights are assigned to them after the competing event time. The weights are the conditional probability that a subject remains uncensored, and they can be modelled to depend on the covariates of a subject. The first option for regression on the subdistribution hazard was the crr() function in the cmprsk package. However, it is not straightforward to draw a nomogram, which is a user-friendly tool for risk prediction, with the crr() function. To overcome this problem, we show an alternative method that uses a nomogram function based on the results of subdistribution hazard modeling.
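The cumulative incidence function mentioned in this abstract can be made concrete with a small sketch. The code below is a plain nonparametric estimator of the cumulative incidence (of the Aalen-Johansen type) written in Python. It is not the subdistribution hazard regression, the crr() function, or the nomogram method of the article, and the function name, cause coding, and data are assumptions made for illustration.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Nonparametric cumulative incidence for one cause with competing risks.

    causes: 0 = censored, 1, 2, ... = event type (coding assumed for this sketch).
    Returns a list of (event_time, cumulative_incidence) pairs.
    """
    event_times = sorted({t for t, c in zip(times, causes) if c != 0})
    surv = 1.0  # probability of being free of ANY event just before time t
    cif = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        d_interest = sum(
            1 for u, c in zip(times, causes) if u == t and c == cause_of_interest
        )
        d_any = sum(1 for u, c in zip(times, causes) if u == t and c != 0)
        # An event of interest at t requires surviving all causes up to t.
        cif += surv * d_interest / at_risk
        surv *= 1 - d_any / at_risk
        curve.append((t, cif))
    return curve


# Hypothetical data: five subjects, two competing causes, one censored at t = 3.
times = [1, 2, 3, 4, 5]
causes = [1, 2, 0, 1, 2]
cif1 = cumulative_incidence(times, causes, 1)
cif2 = cumulative_incidence(times, causes, 2)
print(cif1[-1], cif2[-1])
```

The overall survival term is what couples the causes: a competing event lowers the chance of later observing the event of interest, which is why a cause-specific hazard alone does not determine the cumulative incidence.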
|
Editorial |
8 |
56 |
22
|
Zhang Z. Multivariable fractional polynomial method for regression model. ANNALS OF TRANSLATIONAL MEDICINE 2016; 4:174. [PMID: 27275487 PMCID: PMC4876277 DOI: 10.21037/atm.2016.05.01] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 12/25/2015] [Accepted: 01/24/2016] [Indexed: 02/05/2023]
|
editorial |
9 |
48 |
23
|
Zhang Z. Naïve Bayes classification in R. ANNALS OF TRANSLATIONAL MEDICINE 2016; 4:241. [PMID: 27429967 PMCID: PMC4930525 DOI: 10.21037/atm.2016.03.38] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 01/25/2016] [Accepted: 02/24/2016] [Indexed: 02/05/2023]
Abstract
Naïve Bayes classification is a simple probabilistic classification method based on Bayes' theorem with the assumption of independence between features. The model is trained on a training dataset and then makes predictions with the predict() function. This article introduces two functions, naiveBayes() and train(), for performing Naïve Bayes classification.
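The independence assumption described above means that the class score is the prior multiplied by one conditional probability per feature. A minimal sketch for categorical features follows, written in Python rather than with the R functions naiveBayes() and train() that the article covers. The helper names, the Laplace smoothing parameter alpha, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, y)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values, alpha


def predict_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, feature_counts, values, alpha = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, ny in class_counts.items():
        lp = math.log(ny / n)  # log prior
        for j, v in enumerate(row):
            # Laplace-smoothed P(feature j = v | class y)
            num = feature_counts[(j, y)][v] + alpha
            den = ny + alpha * len(values[j])
            lp += math.log(num / den)
        if lp > best_lp:
            best, best_lp = y, lp
    return best


# Hypothetical training data: (temperature, cough) -> status
rows = [("fever", "cough"), ("fever", "none"), ("normal", "cough"),
        ("normal", "none"), ("fever", "cough"), ("normal", "none")]
labels = ["sick", "sick", "well", "well", "sick", "well"]
model = train_nb(rows, labels)
print(predict_nb(model, ("fever", "cough")), predict_nb(model, ("normal", "none")))
```

Working in log probabilities avoids numerical underflow when many features are multiplied, and the smoothing term keeps an unseen feature value from forcing a class probability to zero.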
|
editorial |
9 |
47 |
24
|
Zhang Z, Pan Q, Ge H, Xing L, Hong Y, Chen P. Deep learning-based clustering robustly identified two classes of sepsis with both prognostic and predictive values. EBioMedicine 2020; 62:103081. [PMID: 33181462 PMCID: PMC7658497 DOI: 10.1016/j.ebiom.2020.103081] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 08/06/2020] [Revised: 09/19/2020] [Accepted: 10/07/2020] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Sepsis is a heterogenous syndrome and individualized management strategy is the key to successful treatment. Genome wide expression profiling has been utilized for identifying subclasses of sepsis, but the clinical utility of these subclasses was limited because of the classification instability, and the lack of a robust class prediction model with extensive external validation. The study aimed to develop a parsimonious class model for the prediction of class membership and validate the model for its prognostic and predictive capability in external datasets. METHODS The Gene Expression Omnibus (GEO) and ArrayExpress databases were searched from inception to April 2020. Datasets containing whole blood gene expression profiling in adult sepsis patients were included. Autoencoder was used to extract representative features for k-means clustering. Genetic algorithms (GA) were employed to derive a parsimonious 5-gene class prediction model. The class model was then applied to external datasets (n = 780) to evaluate its prognostic and predictive performance. FINDINGS A total of 12 datasets involving 1613 patients were included. Two classes were identified in the discovery cohort (n = 685). Class 1 was characterized by immunosuppression with higher mortality than class 2 (21.8% [70/321] vs. 12.1% [44/364]; p < 0.01 for Chi-square test). A 5-gene class model (C14orf159, AKNA, PILRA, STOM and USP4) was developed with GA. In external validation cohorts, the 5-gene class model (AUC: 0.707; 95% CI: 0.664 - 0.750) performed better in predicting mortality than sepsis response signature (SRS) endotypes (AUC: 0.610; 95% CI: 0.521 - 0.700), and performed equivalently to the APACHE II score (AUC: 0.681; 95% CI: 0.595 - 0.767). In the dataset E-MTAB-7581, the use of hydrocortisone was associated with increased risk of mortality (OR: 3.15 [1.13, 8.82]; p = 0.029) in class 2. The effect was not statistically significant in class 1 (OR: 1.88 [0.70, 5.09]; p = 0.211). 
INTERPRETATION Our study identified two classes of sepsis that showed different mortality rates and responses to hydrocortisone therapy. Class 1 was characterized by immunosuppression with higher mortality rate than class 2. We further developed a 5-gene class model to predict class membership. FUNDING The study was funded by the National Natural Science Foundation of China (Grant No. 81901929).
|
research-article |
5 |
45 |
25
|
Zhang Z. Residuals and regression diagnostics: focusing on logistic regression. ANNALS OF TRANSLATIONAL MEDICINE 2016; 4:195. [PMID: 27294091 PMCID: PMC4885900 DOI: 10.21037/atm.2016.03.36] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 01/15/2016] [Accepted: 02/02/2016] [Indexed: 02/05/2023]
Abstract
Up to now I have introduced most steps in regression model building and validation. The last step is to check whether there are observations that have a significant impact on the model coefficients and specification. The article first describes plotting Pearson residuals against predictors. Such plots help to identify non-linearity and provide hints on how to transform predictors. Next, I focus on outliers, leverage and influence, which may have a significant impact on model building. An outlier is an observation whose response value is unusual conditional on its covariate pattern. A high-leverage observation has a covariate pattern that lies far from the center of the regressor space. Influence is the product of outlyingness and leverage: when an influential observation is dropped from the model, the coefficients shift substantially. The summary statistics for outliers, leverage and influence are studentized residuals, hat values and Cook's distance, respectively. They can be easily visualized with graphs and formally tested using the car package.
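The three summary statistics named in this abstract can be computed by hand in the simplest case. The sketch below uses ordinary least squares with a single predictor and is written in Python, whereas the article focuses on logistic regression and the R car package, so it should be read only as an illustration of the definitions. The function name and the data are hypothetical.

```python
import math


def ols_diagnostics(x, y):
    """Hat values, internally studentized residuals and Cook's distance
    for simple linear regression y = a + b*x."""
    n, p = len(x), 2  # p = number of estimated coefficients
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance
    # Leverage: distance of each covariate value from the center of the x values.
    hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    student = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, hat)]
    # Influence combines outlyingness (residual) with leverage.
    cooks = [r ** 2 * h / (p * (1 - h)) for r, h in zip(student, hat)]
    return hat, student, cooks


# Hypothetical data: the last point is far from the other x values and off the trend.
x = [1, 2, 3, 4, 5, 12]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 6.0]
hat, student, cooks = ols_diagnostics(x, y)
print([round(h, 3) for h in hat])
print([round(d, 3) for d in cooks])
```

The hat values sum to the number of coefficients, so a single value close to 1 signals an observation that largely determines its own fitted value.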
|
editorial |
9 |
39 |