1
Gao J, Bonzel CL, Hong C, Varghese P, Zakir K, Gronsbell J. Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms. J Am Med Inform Assoc 2024; 31:640-650. [PMID: 38128118 PMCID: PMC10873838 DOI: 10.1093/jamia/ocad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 09/22/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023] Open
Abstract
OBJECTIVE: High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly-supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimation of the receiver operating characteristic (ROC) parameters of PAs (eg, sensitivity, specificity).
MATERIALS AND METHODS: ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation to yield more precise estimates of PA performance relative to classical supervised ROC analysis (supROC) using only labeled data. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB).
RESULTS: ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the estimates from ssROC are 30% to 60% less variable than supROC on average.
DISCUSSION: ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software.
CONCLUSION: When used in conjunction with weakly-supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
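As a rough illustration of the supervised baseline (supROC) mentioned in this abstract, the sketch below computes the empirical sensitivity and specificity of a phenotyping score at one cutoff from labeled data only. The function name `sup_roc_point`, the toy scores, and the cutoff are hypothetical and are not taken from the paper or its R software; the nonparametric imputation step that distinguishes ssROC is not shown.

```python
def sup_roc_point(scores, labels, threshold):
    """Empirical (sensitivity, specificity) of the rule `score >= threshold`.

    `labels` holds gold-standard 0/1 phenotype labels for the labeled subset.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: five chart-reviewed patients.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
sensitivity, specificity = sup_roc_point(scores, labels, threshold=0.5)
```

With only a handful of labels, each estimate rests on very few patients, which is the variance problem the abstract says ssROC is meant to reduce.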
Affiliation(s)
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Clara-Lea Bonzel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States
- Paul Varghese
  - Health Informatics, Verily Life Sciences, Cambridge, MA, United States
- Karim Zakir
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
  - Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
2
Wang L, Wang X, Liao KP, Cai T. Semisupervised transfer learning for evaluation of model classification performance. Biometrics 2024; 80:ujae002. [PMID: 38465982 PMCID: PMC10926267 DOI: 10.1093/biomtc/ujae002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Revised: 12/17/2023] [Accepted: 01/17/2024] [Indexed: 03/12/2024]
Abstract
In many modern machine learning applications, changes in covariate distributions and difficulty in acquiring outcome information have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt the model itself to some unlabeled target populations using existing labeled data in a source population. However, there is a paucity of literature on transferring performance metrics, especially receiver operating characteristic (ROC) parameters, of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier on an unlabeled target population based on ROC analysis. We propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM), an efficient three-step estimation procedure that employs (1) double-index modeling to construct calibrated density ratio weights and (2) robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency. We establish the consistency and asymptotic normality of the proposed estimator under the correct specification of either the density ratio model or the outcome model. We also correct for potential overfitting bias in the estimators in finite samples with cross-validation. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method by evaluating the prediction performance of a phenotyping model for rheumatoid arthritis (RA) on a temporally evolving EHR cohort.
Affiliation(s)
- Linshanshan Wang
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
- Xuan Wang
  - Division of Biostatistics, Department of Population Health Sciences, University of Utah, Salt Lake City, UT 84108, United States
- Katherine P Liao
  - Division of Rheumatology, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Tianxi Cai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
3
Abdurrab I, Mahmood T, Sheikh S, Aijaz S, Kashif M, Memon A, Ali I, Peerwani G, Pathan A, Alkhodre AB, Siddiqui MS. Predicting the Length of Stay of Cardiac Patients Based on Pre-Operative Variables-Bayesian Models vs. Machine Learning Models. Healthcare (Basel) 2024; 12:249. [PMID: 38255136 PMCID: PMC10815919 DOI: 10.3390/healthcare12020249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/04/2024] [Accepted: 01/16/2024] [Indexed: 01/24/2024] Open
Abstract
Length of stay (LoS) prediction is deemed important for a medical institution's operational and logistical efficiency. Sound estimates of a patient's stay increase clinical preparedness and reduce aberrations. Various statistical methods and techniques are used to quantify and predict the LoS of a patient based on pre-operative clinical features. This study evaluates and compares the results of Bayesian (simple Bayesian regression and hierarchical Bayesian regression) models and machine learning (ML) regression models against multiple evaluation metrics for the problem of LoS prediction of cardiac patients admitted to Tabba Heart Institute, Karachi, Pakistan (THI) between 2015 and 2020. In addition, the study also presents the use of hierarchical Bayesian regression to account for data variability and skewness without homogenizing the data (by removing outliers). LoS estimates from the hierarchical Bayesian regression model resulted in a root mean squared error (RMSE) and mean absolute error (MAE) of 1.49 and 1.16, respectively. Simple Bayesian regression (without hierarchy) achieved an RMSE and MAE of 3.36 and 2.05, respectively. The average RMSE and MAE of ML models remained at 3.36 and 1.98, respectively.
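For readers comparing the error figures above, the following is a minimal sketch of how RMSE and MAE are computed from predicted and observed lengths of stay. The function names and toy values are hypothetical and are not the study's data or code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in days."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


# Hypothetical observed vs. predicted stays (days) for four patients.
observed = [3, 5, 2, 7]
predicted = [2, 5, 4, 7]
errors = (rmse(observed, predicted), mae(observed, predicted))
```

RMSE is never smaller than MAE on the same predictions, and a wide gap between the two (as in the 3.36 versus 2.05 reported for simple Bayesian regression) indicates that a few large errors dominate.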
Affiliation(s)
- Ibrahim Abdurrab
  - Department of Computer Science, Institute of Business Administration, Karachi 75270, Pakistan
- Tariq Mahmood
  - Department of Computer Science, Institute of Business Administration, Karachi 75270, Pakistan
- Sana Sheikh
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Saba Aijaz
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Muhammad Kashif
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Ahson Memon
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Imran Ali
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Ghazal Peerwani
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Asad Pathan
  - Department of Clinical Research Cardiology, Tabba Heart Institute, Karachi 75950, Pakistan
- Ahmad B. Alkhodre
  - Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah 42351, Saudi Arabia
- Muhammad Shoaib Siddiqui
  - Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah 42351, Saudi Arabia
4
Zhang D, Khalili A, Asgharian M. Post-model-selection inference in linear regression models: An integrated review. STATISTICS SURVEYS 2022. [DOI: 10.1214/22-ss135] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Affiliation(s)
- Dongliang Zhang
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Abbas Khalili
  - Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada
- Masoud Asgharian
  - Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada
5
Ng TL, Newton MA. Random weighting in LASSO regression. Electron J Stat 2022. [DOI: 10.1214/22-ejs2020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Tun Lee Ng
  - Department of Statistics, 1300 University Ave, Madison WI 53706
6
Zhang HG, Hejblum BP, Weber GM, Palmer NP, Churchill SE, Szolovits P, Murphy SN, Liao KP, Kohane IS, Cai T. ATLAS: an automated association test using probabilistically linked health records with application to genetic studies. J Am Med Inform Assoc 2021; 28:2582-2592. [PMID: 34608931 DOI: 10.1093/jamia/ocab187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2021] [Revised: 08/14/2021] [Accepted: 08/22/2021] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE: Large amounts of health data are becoming available for biomedical research. Synthesizing information across databases may capture more comprehensive pictures of patient health and enable novel research studies. When no gold standard mappings between patient records are available, researchers may probabilistically link records from separate databases and analyze the linked data. However, previous linked data inference methods are constrained to certain linkage settings and exhibit low power. Here, we present ATLAS, an automated, flexible, and robust association testing algorithm for probabilistically linked data.
MATERIALS AND METHODS: Missing variables are imputed at various thresholds using a weighted average method that propagates uncertainty from probabilistic linkage. Next, estimated effect sizes are obtained using a generalized linear model. ATLAS then conducts the threshold combination test by optimally combining P values obtained from data imputed at varying thresholds using Fisher's method and perturbation resampling.
RESULTS: In simulations, ATLAS controls for type I error and exhibits high power compared to previous methods. In a real-world genetic association study, meta-analysis of ATLAS-enabled analyses on a linked cohort with analyses using an existing cohort yielded additional significant associations between rheumatoid arthritis genetic risk score and laboratory biomarkers.
DISCUSSION: Weighted average imputation weathers false matches and increases contribution of true matches to mitigate linkage error-induced bias. The threshold combination test avoids arbitrarily choosing a threshold to rule a match, thus automating linked data-enabled analyses and preserving power.
CONCLUSION: ATLAS promises to enable novel and powerful research studies using linked data to capitalize on all available data sources.
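The abstract mentions Fisher's method for combining P values. The sketch below shows the classical version only, which assumes the P values are independent; the function name `fisher_combine` and the inputs are hypothetical. P values computed from the same data at different linkage thresholds are generally correlated, which is presumably why ATLAS pairs the statistic with perturbation resampling rather than the chi-square reference distribution used here.

```python
import math


def fisher_combine(p_values):
    """Classical Fisher combination of k independent p-values.

    Returns (statistic, combined_p), where statistic = -2 * sum(log p_i)
    is chi-square with 2k degrees of freedom under the global null.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values from analyses at two thresholds.
statistic, combined_p = fisher_combine([0.05, 0.05])
```

A single P value is returned unchanged by this procedure, which is a quick sanity check on the closed-form tail probability.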
Affiliation(s)
- Harrison G Zhang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, Massachusetts, USA
  - Department of Biological Sciences, Columbia University, New York City, New York, USA
- Boris P Hejblum
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Bordeaux Population Health, Université de Bordeaux, Inserm U1219, Inria SISTM, Bordeaux, France
- Griffin M Weber
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Nathan P Palmer
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Susanne E Churchill
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Peter Szolovits
  - Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Shawn N Murphy
  - Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Research IS and Computing, Mass General Brigham HealthCare, Charlestown, Massachusetts, USA
- Katherine P Liao
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Isaac S Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Tianxi Cai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
7
Zhang X, Fang K, Zhang Q. Multivariate functional generalized additive models. J STAT COMPUT SIM 2021. [DOI: 10.1080/00949655.2021.1979550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Xiaochen Zhang
  - Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, People's Republic of China
- Kuangnan Fang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, People's Republic of China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, People's Republic of China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, People's Republic of China
8
Fang F, Zhao J, Ahmed SE, Qu A. A weak-signal-assisted procedure for variable selection and statistical inference with an informative subsample. Biometrics 2021. [DOI: 10.1111/biom.13346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/27/2022]
Affiliation(s)
- Fang Fang
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, School of Statistics, East China Normal University, Shanghai, China
- Jiwei Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin
- S. Ejaz Ahmed
  - Faculty of Mathematics and Science, Brock University, St. Catharines, Ontario, Canada
- Annie Qu
  - Department of Statistics, University of California, Irvine, California
9
Yu Q, Li Y, Wang Y, Yang Y, Zheng Z. Scalable and efficient inference via CPE. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1936044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Qin Yu
  - International Institute of Finance, The School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Yang Li
  - International Institute of Finance, The School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Yumeng Wang
  - International Institute of Finance, The School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Yachong Yang
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Zemin Zheng
  - International Institute of Finance, The School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
10
Cheng D, Ananthakrishnan AN, Cai T. Robust and efficient semi-supervised estimation of average treatment effects with application to electronic health records data. Biometrics 2021; 77:413-423. [PMID: 32413171 PMCID: PMC7758040 DOI: 10.1111/biom.13298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2018] [Revised: 04/30/2020] [Accepted: 05/01/2020] [Indexed: 11/29/2022]
Abstract
We consider the problem of estimating the average treatment effect (ATE) in a semi-supervised learning setting, where a very small proportion of the entire set of observations are labeled with the true outcome but features predictive of the outcome are available among all observations. This problem arises, for example, when estimating treatment effects in electronic health records (EHR) data because gold-standard outcomes are often not directly observable from the records but are observed for a limited number of patients through small-scale manual chart review. We develop an imputation-based approach for estimating the ATE that is robust to misspecification of the imputation model. This effectively allows information from the predictive features to be safely leveraged to improve efficiency in estimating the ATE. The estimator is additionally doubly-robust in that it is consistent under correct specification of either an initial propensity score model or a baseline outcome model. It is also locally semiparametric efficient under an ideal semi-supervised model where the distribution of the unlabeled data is known. Simulations exhibit the efficiency and robustness of the proposed method compared to existing approaches in finite samples. We illustrate the method by comparing rates of treatment response to two biologic agents for treatment of inflammatory bowel disease using EHR data from Partners' Healthcare.
Affiliation(s)
- David Cheng
  - VA Boston Healthcare System, Boston, Massachusetts, U.S.A
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, U.S.A
11
Zheng Z, Liu L, Li Y, Zhao N. High-dimensional statistical inference via DATE. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1909733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Zemin Zheng
  - School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Lei Liu
  - School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Yang Li
  - School of Management, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Ni Zhao
  - School of Mathematics and Physics Sciences, Anhui Jianzhu University, Hefei, Anhui, P. R. China
12
Fei Z, Li Y. Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2021; 22:58. [PMID: 34531706 PMCID: PMC8442657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The focus of modern biomedical studies has gradually shifted to explanation and estimation of joint effects of high dimensional predictors on disease risks. Quantifying uncertainty in these estimates may provide valuable insight into prevention strategies or treatment decisions for both patients and physicians. High dimensional inference, including confidence intervals and hypothesis testing, has sparked much interest. While much work has been done in the linear regression setting, there is a lack of literature on inference for high dimensional generalized linear models. We propose a novel and computationally feasible method, which accommodates a variety of outcome types, including normal, binomial, and Poisson data. We use a "splitting and smoothing" approach, which splits samples into two parts, performs variable selection using one part and conducts partial regression with the other part. Averaging the estimates over multiple random splits, we obtain the smoothed estimates, which are numerically stable. We show that the estimates are consistent and asymptotically normal, and we construct confidence intervals with proper coverage probabilities for all predictors. We examine the finite sample performance of our method by comparing it with the existing methods and applying it to analyze a lung cancer cohort study.
Affiliation(s)
- Zhe Fei
  - Department of Biostatistics, UCLA, Los Angeles, California, 90025
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109
13
Zhao J, Chen C. A Nuisance-Free Inference Procedure Accounting for the Unknown Missingness with Application to Electronic Health Records. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E1154. [PMID: 33286923 PMCID: PMC7597318 DOI: 10.3390/e22101154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/24/2020] [Revised: 09/27/2020] [Accepted: 10/12/2020] [Indexed: 11/16/2022]
Abstract
We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider might be a traditional setting or a modern high-dimensional setting where the sparsity assumption is usually imposed and the regularization technique is popularly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt the conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via a data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for the post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., patient-reported outcomes or real-world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.
Affiliation(s)
- Jiwei Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
- Chi Chen
  - Novartis Institutes for Biomedical Research, Shanghai 201203, China
14
Robust high-dimensional regression for data with anomalous responses. ANN I STAT MATH 2020. [DOI: 10.1007/s10463-020-00764-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
15
16
Solution paths for the generalized lasso with applications to spatially varying coefficients regression. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2019.106821] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/21/2022]
17
18
Lin J, Wang D, Zheng Q. Regression analysis and variable selection for two-stage multiple-infection group testing data. Stat Med 2019; 38:4519-4533. [PMID: 31297869 DOI: 10.1002/sim.8311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/03/2018] [Revised: 03/03/2019] [Accepted: 06/14/2019] [Indexed: 12/17/2022]
Abstract
Group testing, as a cost-effective strategy, has been widely used to perform large-scale screening for rare infections. Recently, the use of multiplex assays has transformed the goal of group testing from detecting a single disease to diagnosing multiple infections simultaneously. Existing research on multiple-infection group testing data either excludes individual covariate information or ignores possible retests on suspicious individuals. To incorporate both, we propose a new regression model. This new model allows us to perform a regression analysis for each infection using multiple-infection group testing data. Furthermore, we introduce an efficient variable selection method to reveal truly relevant risk factors for each disease. Our methodology also allows for the estimation of the assay sensitivity and specificity when they are unknown. We examine the finite sample performance of our method through extensive simulation studies and apply it to a chlamydia and gonorrhea screening data set to illustrate its practical usefulness.
Affiliation(s)
- Juexin Lin
  - Department of Statistics, University of South Carolina, South Carolina
- Dewei Wang
  - Department of Statistics, University of South Carolina, South Carolina
- Qi Zheng
  - Department of Bioinformatics and Biostatistics, University of Louisville, Kentucky
19
Abstract
Summary
The lasso is a popular estimation procedure in multiple linear regression. We develop and establish the validity of a perturbation bootstrap method for approximating the distribution of the lasso estimator in a heteroscedastic linear regression model. We allow the underlying covariates to be either random or nonrandom, and show that the proposed bootstrap method works irrespective of the nature of the covariates. We also investigate finite-sample properties of the proposed bootstrap method in a moderately large simulation study.
Affiliation(s)
- Debraj Das
  - Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 7 S.J.S. Sansanwal Marg, Delhi 110016, India
- S N Lahiri
  - Department of Statistics, North Carolina State University, 2311 Stinson Drive, Raleigh, North Carolina 27695, USA
20
Affiliation(s)
- Jingshen Wang
  - Department of Statistics, University of Michigan, Ann Arbor, MI
- Xuming He
  - Department of Statistics, University of Michigan, Ann Arbor, MI
- Gongjun Xu
  - Department of Statistics, University of Michigan, Ann Arbor, MI
21
Cilluffo G, Sottile G, La Grutta S, Muggeo VM. The Induced Smoothed lasso: A practical framework for hypothesis testing in high dimensional regression. Stat Methods Med Res 2019; 29:765-777. [PMID: 30991902 DOI: 10.1177/0962280219842890] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
Abstract
This paper focuses on hypothesis testing in lasso regression, when one is interested in judging statistical significance for the regression coefficients in a regression equation involving many covariates. To get reliable p-values, we propose a new lasso-type estimator relying on the idea of induced smoothing, which makes it possible to obtain an appropriate covariance matrix and Wald statistic relatively easily. Some simulation experiments reveal that our approach exhibits good performance when contrasted with the recent inferential tools in the lasso framework. Two real data analyses are presented to illustrate the proposed framework in practice.
Affiliation(s)
- Giovanna Cilluffo
  - Institute of Biomedicine and Molecular Immunology, National Research Council, Palermo, Italy
- Gianluca Sottile
  - Dipartimento di Scienze Economiche, Aziendali e Statistiche, Università degli Studi di Palermo, Palermo, Italy
- Stefania La Grutta
  - Institute of Biomedicine and Molecular Immunology, National Research Council, Palermo, Italy
- Vito MR Muggeo
  - Dipartimento di Scienze Economiche, Aziendali e Statistiche, Università degli Studi di Palermo, Palermo, Italy
22
Gronsbell J, Minnier J, Yu S, Liao K, Cai T. Automated feature selection of predictors in electronic medical records data. Biometrics 2019; 75:268-277. [PMID: 30353541 DOI: 10.1111/biom.12987] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 04/30/2017] [Accepted: 10/01/2018] [Indexed: 01/29/2023]
Abstract
The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. 
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
Affiliation(s)
- Jessica Gronsbell: Department of Biomedical Data Science, Stanford University, Stanford, California
- Jessica Minnier: OHSU-PSU School of Public Health, Oregon Health & Science University, Portland, Oregon
- Sheng Yu: Center for Statistical Science, Tsinghua University, Beijing, China
- Tianxi Cai: Department of Biostatistics, Harvard University, Boston, Massachusetts
|
23
|
Das D, Lahiri S. Second order correctness of perturbation bootstrap M-estimator of multiple linear regression parameter. BERNOULLI 2019. [DOI: 10.3150/17-bej1001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
|
24
|
Lee SMS, Wu Y. A bootstrap recipe for post-model-selection inference under linear regression models. Biometrika 2018. [DOI: 10.1093/biomet/asy046] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Affiliation(s)
- S M S Lee: Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong
- Y Wu: Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada
|
25
|
Wang L, Van Keilegom I, Maidman A. Wild residual bootstrap inference for penalized quantile regression with heteroscedastic errors. Biometrika 2018. [DOI: 10.1093/biomet/asy037] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Affiliation(s)
- Lan Wang: School of Statistics, University of Minnesota, 224 Church Street South East, Minneapolis, Minnesota, USA
- Ingrid Van Keilegom: Research Centre for Operations Research and Business Statistics, KU Leuven, Naamsestraat 69, Leuven, Belgium
- Adam Maidman: School of Statistics, University of Minnesota, 224 Church Street South East, Minneapolis, Minnesota, USA
|
26
|
|
27
|
Tuson M, Turlach B, Vickery A, Whyatt D. Reducing Bruzzi's Formula to Remove Instability in the Estimation of Population Attributable Fraction for Health Outcomes. Am J Epidemiol 2018; 187:170-179. [PMID: 28595350 DOI: 10.1093/aje/kwx200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2015] [Accepted: 02/27/2017] [Indexed: 11/13/2022]
Abstract
The aim of this study was to reconcile 3 approaches to calculating population attributable fractions and attributable burden percentage: the approach of Bruzzi et al. (Am J Epidemiol. 1985;122(5):904-914.), the maximum-likelihood method of Greenland and Drescher (Biometrics. 1993;49(3):865-872.), and the multivariable method of Tanuseputro et al. (Popul Health Metr. 2015;13:5.). Using data from a statewide point prevalence survey (Western Australian Point Prevalence Survey, 2014) linked to an administrative database, we compared estimates of attributable burden percentage obtained using the contrasting methods in 6 logistic models of health outcomes from the survey, estimating 95% confidence intervals using nonparametric and weighted bootstrap approaches. Our results show that instability can arise from the fundamental algebraic construction of Bruzzi's formula, and that this instability may substantially influence the calculation of attributable burden percentage and associated confidence intervals. These observations were confirmed in a simulation study. The algebraic reduction of Bruzzi's formula to the 2 alternative methods resulted in markedly more stable estimates for population attributable fraction and attributable burden percentage in cross-sectional studies and cohort designs with fixed follow-up time. We advocate the widespread implementation of the maximum-likelihood approach and the multivariable method.
Affiliation(s)
- Matthew Tuson: Medical School, Faculty of Health and Medical Sciences, University of Western Australia, Perth, Australia
- Berwin Turlach: School of Mathematics and Statistics, Faculty of Engineering and Mathematical Sciences, University of Western Australia, Perth, Australia
- Alistair Vickery: Medical School, Faculty of Health and Medical Sciences, University of Western Australia, Perth, Australia
- David Whyatt: Medical School, Faculty of Health and Medical Sciences, University of Western Australia, Perth, Australia
|
28
|
Gronsbell JL, Cai T. Semi-supervised approaches to efficient evaluation of model prediction performance. J R Stat Soc Series B Stat Methodol 2017. [DOI: 10.1111/rssb.12264] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
|
29
|
Geva A, Gronsbell JL, Cai T, Cai T, Murphy SN, Lyons JC, Heinz MM, Natter MD, Patibandla N, Bickel J, Mullen MP, Mandl KD. A Computable Phenotype Improves Cohort Ascertainment in a Pediatric Pulmonary Hypertension Registry. J Pediatr 2017; 188. [PMID: 28625502 PMCID: PMC5572538 DOI: 10.1016/j.jpeds.2017.05.037] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 01/28/2023]
Abstract
OBJECTIVES To compare registry and electronic health record (EHR) data mining approaches for cohort ascertainment in patients with pediatric pulmonary hypertension (PH) in an effort to overcome some of the limitations of registry enrollment alone in identifying patients with particular disease phenotypes. STUDY DESIGN This study was a single-center retrospective analysis of EHR and registry data at Boston Children's Hospital. The local Informatics for Integrating Biology and the Bedside (i2b2) data warehouse was queried for billing codes, prescriptions, and narrative data related to pediatric PH. Computable phenotype algorithms were developed by fitting penalized logistic regression models to a physician-annotated training set. Algorithms were applied to a candidate patient cohort, and performance was evaluated using a separate set of 136 records and 179 registry patients. We compared clinical and demographic characteristics of patients identified by computable phenotype and the registry. RESULTS The computable phenotype had an area under the receiver operating characteristics curve of 90% (95% CI, 85%-95%), a positive predictive value of 85% (95% CI, 77%-93%), and identified 413 patients (an additional 231%) with pediatric PH who were not enrolled in the registry. Patients identified by the computable phenotype were clinically distinct from registry patients, with a greater prevalence of diagnoses related to perinatal distress and left heart disease. CONCLUSIONS Mining of EHRs using computable phenotypes identified a large cohort of patients not recruited using a classic registry. Fusion of EHR and registry data can improve cohort ascertainment for the study of rare diseases. TRIAL REGISTRATION ClinicalTrials.gov: NCT02249923.
Affiliation(s)
- Alon Geva: Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA; Division of Critical Care Medicine, Department of Anesthesiology, Perioperative, and Pain Medicine, Boston Children’s Hospital, Boston, MA; Department of Anaesthesia, Harvard Medical School, Boston, MA
- Jessica L. Gronsbell: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA
- Tianxi Cai: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA
- Tianrun Cai: Division of Rheumatology, Immunology and Allergy, Brigham and Women’s Hospital, Boston, MA
- Shawn N. Murphy: Department of Research Information Services and Computing, Partners Healthcare, Boston, MA; Department of Neurology, Massachusetts General Hospital, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Jessica C. Lyons: Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Michelle M. Heinz: Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA
- Marc D. Natter: Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA; Department of Pediatrics, Harvard Medical School, Boston, MA
- Nandan Patibandla: Information Services Department, Boston Children’s Hospital, Boston, MA
- Jonathan Bickel: Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA; Information Services Department, Boston Children’s Hospital, Boston, MA; Department of Pediatrics, Harvard Medical School, Boston, MA
- Mary P. Mullen: Department of Cardiology, Boston Children’s Hospital, Boston, MA; Department of Pediatrics, Harvard Medical School, Boston, MA
- Kenneth D. Mandl: Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA; Department of Pediatrics, Harvard Medical School, Boston, MA
|
30
|
|
31
|
Marino M, Buxton OM, Li Y. Covariate Selection for Multilevel Models with Missing Data. Stat (Int Stat Inst) 2017; 6:31-46. [PMID: 28239457 DOI: 10.1002/sta4.133] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
Abstract
Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population.
Affiliation(s)
- Miguel Marino: Department of Family Medicine, Department of Public Health, Division of Biostatistics, Oregon Health and Science University, Portland, OR 97239 USA
- Orfeu M Buxton: Associate Professor, Department of Biobehavioral Health, Pennsylvania State University, University Park, PA 16802. Lecturer on Medicine, Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115. Associate Neuroscientist, Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115. Adjunct Associate Professor, Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA 02115
- Yi Li: Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109 USA
|
32
|
Johnson BA, Long Q, Huang Y, Chansky K, Redman M. Model selection and inference for censored lifetime medical expenditures. Biometrics 2016; 72:731-41. [PMID: 26689300 PMCID: PMC5741192 DOI: 10.1111/biom.12464] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/01/2015] [Revised: 11/01/2015] [Accepted: 11/01/2015] [Indexed: 11/30/2022]
Abstract
Identifying factors associated with increased medical cost is important for many micro- and macro-institutions, including the national economy and public health, insurers and the insured. However, assembling comprehensive national databases that include both the cost and individual-level predictors can prove challenging. Alternatively, one can use data from smaller studies with the understanding that conclusions drawn from such analyses may be limited to the participant population. At the same time, smaller clinical studies have limited follow-up and lifetime medical cost may not be fully observed for all study participants. In this context, we develop new model selection methods and inference procedures for secondary analyses of clinical trial data when lifetime medical cost is subject to induced censoring. Our model selection methods extend a theory of penalized estimating function to a calibration regression estimator tailored for this data type. Next, we develop a novel inference procedure for the unpenalized regression estimator using perturbation and resampling theory. Then, we extend this resampling plan to accommodate regularized coefficient estimation of censored lifetime medical cost and develop postselection inference procedures for the final model. Our methods are motivated by data from Southwest Oncology Group Protocol 9509, a clinical trial of patients with advanced nonsmall cell lung cancer, and our models of lifetime medical cost are specific to this population. But the methods presented in this article are built on rather general techniques and could be applied to larger databases as those data become available.
Affiliation(s)
- Brent A Johnson: Department of Biostatistics and Computational Biology, University of Rochester, Rochester, New York, U.S.A.
- Qi Long: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A
- Yijian Huang: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A
- Kari Chansky: The Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
- Mary Redman: The Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
|
33
|
Tibshirani RJ, Taylor J, Lockhart R, Tibshirani R. Exact Post-Selection Inference for Sequential Regression Procedures. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1108848] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Indexed: 10/21/2022]
|
34
|
Laurin C, Boomsma D, Lubke G. The use of vector bootstrapping to improve variable selection precision in Lasso models. Stat Appl Genet Mol Biol 2016; 15:305-20. [PMID: 27248122 PMCID: PMC5131926 DOI: 10.1515/sagmb-2015-0043] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/15/2022]
Abstract
The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model as well as under a model with effect sizes representative of typical GWAS results. We compared these approaches to each other as well as to software defaults for the Lasso. Nested cross-validation had the most precise variable selection at small effect sizes. At larger effect sizes, there was no advantage to nesting. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, we found that the default Lasso selected low-reliability SNPs and interactions which were excluded by bootstrapping.
|
35
|
Lin CY, Halabi S. A Simple Method for Deriving the Confidence Regions for the Penalized Cox's Model via the Minimand Perturbation. COMMUN STAT-THEOR M 2016; 46:4791-4808. [PMID: 29326496 DOI: 10.1080/03610926.2015.1085568] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/21/2022]
Abstract
We propose a minimand perturbation method to derive the confidence regions for the regularized estimators for the Cox's proportional hazards model. Although the regularized estimation procedure produces a more stable point estimate, it remains challenging to provide an interval estimator or an analytic variance estimator for the associated point estimate. Based on the sandwich formula, the current variance estimator provides a simple approximation, but its finite sample performance is not entirely satisfactory. Besides, the sandwich formula can only provide variance estimates for the non-zero coefficients. In this article, we present a generic description for the perturbation method and then introduce a computation algorithm using the adaptive least absolute shrinkage and selection operator (LASSO) penalty. Through simulation studies, we demonstrate that our method can better approximate the limiting distribution of the adaptive LASSO estimator and produces more accurate inference compared with the sandwich formula. The simulation results also indicate the possibility of extending the applications to the adaptive elastic-net penalty. We further demonstrate our method using data from a phase III clinical trial in prostate cancer.
Affiliation(s)
- Susan Halabi: Department of Biostatistics and Bioinformatics, Duke University Durham, NC 27710
|
36
|
Lu S, Liu Y, Yin L, Zhang K. Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization. J R Stat Soc Series B Stat Methodol 2016. [DOI: 10.1111/rssb.12184] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Affiliation(s)
- Shu Lu: University of North Carolina at Chapel Hill; USA
- Yufeng Liu: University of North Carolina at Chapel Hill; USA
- Liang Yin: University of North Carolina at Chapel Hill; USA
- Kai Zhang: University of North Carolina at Chapel Hill; USA
|
37
|
Mandozzi J, Bühlmann P. Hierarchical Testing in the High-Dimensional Setting With Correlated Variables. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1007209] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/24/2022]
|
38
|
Sinnott JA, Cai T. Inference for survival prediction under the regularized Cox model. Biostatistics 2016; 17:692-707. [PMID: 27107008 DOI: 10.1093/biostatistics/kxw016] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/23/2015] [Accepted: 03/23/2016] [Indexed: 12/31/2022]
Abstract
When a moderate number of potential predictors are available and a survival model is fit with regularization to achieve variable selection, providing accurate inference on the predicted survival can be challenging. We investigate inference on the predicted survival estimated after fitting a Cox model under regularization guaranteeing the oracle property. We demonstrate that existing asymptotic formulas for the standard errors of the coefficients tend to underestimate the variability for some coefficients, while typical resampling such as the bootstrap tends to overestimate it; these approaches can both lead to inaccurate variance estimation for predicted survival functions. We propose a two-stage adaptation of a resampling approach that brings the estimated error in line with the truth. In stage 1, we estimate the coefficients in the observed data set and in [Formula: see text] resampled data sets, and allow the resampled coefficient estimates to vote on whether each coefficient should be 0. For those coefficients voted as zero, we set both the point and interval estimates to [Formula: see text]. In stage 2, to make inference about coefficients not voted as zero in stage 1, we refit the penalized model in the observed data and in the [Formula: see text] resampled data sets with only variables corresponding to those coefficients. We demonstrate that ensemble voting-based point and interval estimators of the coefficients perform well in finite samples, and prove that the point estimator maintains the oracle property. We extend this approach to derive inference procedures for survival functions and demonstrate that our proposed interval estimation procedures substantially outperform estimators based on asymptotic inference or standard bootstrap. We further illustrate our proposed procedures to predict breast cancer survival in a gene expression study.
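The stage 1 voting step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the voting rule, so the majority threshold, the numerical tolerance, the function name `vote_zero`, and the toy coefficient values are all assumptions made for illustration.

```python
from typing import List


def vote_zero(resampled: List[List[float]],
              threshold: float = 0.5,
              tol: float = 1e-8) -> List[bool]:
    """For each coefficient, return True when the share of resampled fits
    that estimate it as (numerically) zero exceeds `threshold`.

    `threshold` and `tol` are hypothetical choices, not taken from the paper.
    """
    n_fits = len(resampled)
    n_coef = len(resampled[0])
    votes = []
    for j in range(n_coef):
        zero_share = sum(abs(fit[j]) <= tol for fit in resampled) / n_fits
        votes.append(zero_share > threshold)
    return votes


# Toy resampled coefficient estimates (made-up numbers): 4 resamples, 3 coefficients.
fits = [
    [0.8, 0.0, 0.0],
    [0.7, 0.0, 0.1],
    [0.9, 0.0, 0.0],
    [0.6, 0.2, 0.0],
]
voted_zero = vote_zero(fits)
# Indices of coefficients that would be carried into a stage 2 refit.
kept = [j for j, is_zero in enumerate(voted_zero) if not is_zero]
print(voted_zero, kept)
```

Coefficients voted as zero would be fixed at zero, and only the indices in `kept` would enter the refit on the observed and resampled data sets.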
Affiliation(s)
- Jennifer A Sinnott: Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
- Tianxi Cai: Department of Biostatistics, Harvard University, Boston, MA 02115, USA
|
39
|
Agniel D, Liao KP, Cai T. Estimation and testing for multiple regulation of multivariate mixed outcomes. Biometrics 2016; 72:1194-1205. [PMID: 26910481 DOI: 10.1111/biom.12495] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/01/2015] [Revised: 11/01/2015] [Accepted: 12/01/2015] [Indexed: 11/27/2022]
Abstract
Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
Affiliation(s)
- Denis Agniel: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, U.S.A. 02115
- Katherine P Liao: Brigham and Women's Hospital, Boston, Massachusetts, U.S.A. 02115
- Tianxi Cai: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, U.S.A. 02115
|
40
|
|
41
|
Kim HL, Halabi S, Li P, Mayhew G, Simko J, Nixon AB, Small EJ, Rini B, Morris MJ, Taplin ME, George D. A Molecular Model for Predicting Overall Survival in Patients with Metastatic Clear Cell Renal Carcinoma: Results from CALGB 90206 (Alliance). EBioMedicine 2015; 2:1814-20. [PMID: 26870806 PMCID: PMC4740313 DOI: 10.1016/j.ebiom.2015.09.012] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 08/05/2015] [Revised: 09/06/2015] [Accepted: 09/07/2015] [Indexed: 11/30/2022]
Abstract
BACKGROUND Prognosis associated with metastatic renal cell carcinoma (mRCC) can vary widely. METHODS This study used pretreatment nephrectomy specimens from a randomized phase III trial. Expression levels of candidate genes were determined from archival tumors using the OpenArray® platform for TaqMan® RT-qPCR. The dataset was randomly divided at 2:1 ratio into training (n = 221) and testing (n = 103) sets to develop a multigene prognostic signature. FINDINGS Gene expressions were measured in 324 patients. In the training set, multiple models testing 424 candidate genes identified a prognostic signature containing 8 genes plus MSKCC clinical risk factors. In the testing set, the time dependent (td) AUC for a prognostic model containing the 8 genes with and without MSKCC risk factors were 0.72 and 0.69, respectively. The tdAUC for the clinical risk factors alone was 0.61. Additional primary mRCCs from patients with mRCC (n = 12) were sampled in multiple sites and standard deviations of gene expressions within a tumor were used as a measure of heterogeneity. All 8 genes in the final prognostic model met our criteria for minimal heterogeneity. CONCLUSIONS A molecular prognostic signature based on 8 genes was developed and is ready for external validation in this patient population and other related settings such as nonmetastatic RCC.
Affiliation(s)
- Hyung L Kim: Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Susan Halabi: Department of Biostatistics and Bioinformatics, and Alliance Statistics and Data Center, Duke University, Durham, NC, United States
- Ping Li: Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Greg Mayhew: GeneCentric Diagnostics, Durham, NC, United States
- Jeff Simko: University of California at San Francisco, San Francisco, CA, United States
- Eric J Small: University of California at San Francisco, San Francisco, CA, United States
- Brian Rini: Cleveland Clinic Taussig Cancer Institute, Cleveland, OH, United States
- Michael J Morris: Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Daniel George: Department of Biostatistics and Bioinformatics, and Alliance Statistics and Data Center, Duke University, Durham, NC, United States
|
42
|
Bühlmann P, van de Geer S. High-dimensional inference in misspecified linear models. Electron J Stat 2015. [DOI: 10.1214/15-ejs1041] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Indexed: 11/19/2022]
|
43
|
|
44
|
Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A SIGNIFICANCE TEST FOR THE LASSO. Ann Stat 2014; 42:413-468. [PMID: 25574062 DOI: 10.1214/13-aos1175] [Citation(s) in RCA: 335] [Impact Index Per Article: 33.5] [Indexed: 11/19/2022]
Abstract
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a [Formula: see text] distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than [Formula: see text] under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the [Formula: see text] penalty. 
Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties (adaptivity and shrinkage), and its null distribution is tractable and asymptotically Exp(1).
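Because the null distribution is asymptotically Exp(1), an approximate p-value for an observed statistic T is the Exp(1) upper-tail probability, exp(-T). The sketch below shows only that final conversion; it does not compute the covariance statistic itself, and the function name and the clamping of negative inputs to zero are assumptions made for illustration.

```python
import math


def exp1_upper_tail(t_stat: float) -> float:
    """Upper-tail probability P(E >= t) for E ~ Exp(1), which equals exp(-t).

    Negative inputs are clamped to 0 (p-value 1.0); this handling is an
    assumption of the sketch, not something stated in the abstract.
    """
    return math.exp(-max(t_stat, 0.0))


# A statistic of log(20), roughly 3.0, corresponds to an approximate p-value of 0.05.
p_value = exp1_upper_tail(math.log(20))
print(round(p_value, 4))
```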
Affiliation(s)
- Richard Lockhart: Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
- Jonathan Taylor: Department of Statistics, Stanford University, Stanford, California 94305, USA
- Ryan J Tibshirani: Departments of Statistics and Machine Learning, Carnegie Mellon University, 229B Baker Hall, Pittsburgh, Pennsylvania 15213, USA
- Robert Tibshirani: Department of Health, Research & Policy, Department of Statistics, Stanford University, Stanford, California 94305, USA
|
45
|
|
46
|
|
47
|
Halabi S, Lin CY, Kelly WK, Fizazi KS, Moul JW, Kaplan EB, Morris MJ, Small EJ. Updated prognostic model for predicting overall survival in first-line chemotherapy for patients with metastatic castration-resistant prostate cancer. J Clin Oncol 2014; 32:671-7. [PMID: 24449231 DOI: 10.1200/jco.2013.52.3696] [Citation(s) in RCA: 366] [Impact Index Per Article: 36.6] [Indexed: 11/20/2022]
Abstract
PURPOSE Prognostic models for overall survival (OS) for patients with metastatic castration-resistant prostate cancer (mCRPC) are dated and do not reflect significant advances in treatment options available for these patients. This work developed and validated an updated prognostic model to predict OS in patients receiving first-line chemotherapy. METHODS Data from a phase III trial of 1,050 patients with mCRPC were used (Cancer and Leukemia Group B CALGB-90401 [Alliance]). The data were randomly split into training and testing sets. A separate phase III trial served as an independent validation set. Adaptive least absolute shrinkage and selection operator selected eight factors prognostic for OS. A predictive score was computed from the regression coefficients and used to classify patients into low- and high-risk groups. The model was assessed for its predictive accuracy using the time-dependent area under the curve (tAUC). RESULTS The model included Eastern Cooperative Oncology Group performance status, disease site, lactate dehydrogenase, opioid analgesic use, albumin, hemoglobin, prostate-specific antigen, and alkaline phosphatase. Median OS values in the high- and low-risk groups, respectively, in the testing set were 17 and 30 months (hazard ratio [HR], 2.2; P < .001); in the validation set they were 14 and 26 months (HR, 2.9; P < .001). The tAUCs were 0.73 (95% CI, 0.70 to 0.73) and 0.76 (95% CI, 0.72 to 0.76) in the testing and validation sets, respectively. CONCLUSION An updated prognostic model for OS in patients with mCRPC receiving first-line chemotherapy was developed and validated on an external set. This model can be used to predict OS, as well as to better select patients to participate in trials on the basis of their prognosis.
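The scoring step in this abstract (a score computed from regression coefficients, then a split into low- and high-risk groups) can be sketched as a linear predictor with a cutoff. Everything numeric here is hypothetical: the coefficient values, the three-variable subset, the covariate coding, and the cutoff are made up for illustration and are not the published model.

```python
from typing import Dict

# Hypothetical coefficients for a few of the factors named in the abstract.
# These values are illustrative only and do NOT come from the paper.
COEFFICIENTS: Dict[str, float] = {
    "ecog_status": 0.4,
    "ldh_elevated": 0.3,
    "opioid_use": 0.2,
}
CUTOFF = 0.7  # hypothetical threshold separating low from high risk


def risk_score(patient: Dict[str, float]) -> float:
    """Linear predictor: sum of coefficient times covariate value."""
    return sum(beta * patient[name] for name, beta in COEFFICIENTS.items())


def risk_group(patient: Dict[str, float]) -> str:
    """Classify a patient as 'high' or 'low' risk by comparing the score to CUTOFF."""
    return "high" if risk_score(patient) > CUTOFF else "low"


patient_a = {"ecog_status": 2, "ldh_elevated": 1, "opioid_use": 1}
patient_b = {"ecog_status": 0, "ldh_elevated": 0, "opioid_use": 1}
print(risk_group(patient_a), risk_group(patient_b))
```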
Affiliation(s)
- Susan Halabi
- Susan Halabi, Chen-Yen Lin, and Ellen B. Kaplan, Duke University; Judd W. Moul, Duke Cancer Institute, Durham, NC; W. Kevin Kelly, Thomas Jefferson University, Philadelphia, PA; Karim S. Fizazi, Institut Gustave Roussy, University of Paris Sud, Villejuif, France; Michael J. Morris, Memorial Sloan-Kettering Cancer Center, New York, NY; and Eric J. Small, University of California, San Francisco, San Francisco, CA
|
48
|
Halabi S, Lin CY, Small EJ, Armstrong AJ, Kaplan EB, Petrylak D, Sternberg CN, Shen L, Oudard S, de Bono J, Sartor O. Prognostic model predicting metastatic castration-resistant prostate cancer survival in men treated with second-line chemotherapy. J Natl Cancer Inst 2013; 105:1729-37. [PMID: 24136890 DOI: 10.1093/jnci/djt280] [Citation(s) in RCA: 128] [Impact Index Per Article: 11.6] [Indexed: 12/30/2022]
Abstract
BACKGROUND Several prognostic models for overall survival (OS) have been developed and validated in men with metastatic castration-resistant prostate cancer (mCRPC) who receive first-line chemotherapy. We sought to develop and validate a prognostic model to predict OS in men who had progressed after first-line chemotherapy and were selected to receive second-line chemotherapy. METHODS Data from a phase III trial in men with mCRPC who had developed progressive disease after first-line chemotherapy (TROPIC trial) were used. The TROPIC was randomly split into training (n = 507) and testing (n = 248) sets. Another dataset consisting of 488 men previously treated with docetaxel (SPARC trial) was used for external validation. Adaptive least absolute shrinkage and selection operator selected nine prognostic factors of OS. A prognostic score was computed from the regression coefficients. The model was assessed on the testing and validation sets for its predictive accuracy using the time-dependent area under the curve (tAUC). RESULTS The nine prognostic variables in the final model were Eastern Cooperative Oncology Group performance status, time since last docetaxel use, measurable disease, presence of visceral disease, pain, duration of hormonal use, hemoglobin, prostate specific antigen, and alkaline phosphatase. The tAUCs for this model were 0.73 (95% confidence interval [CI] = 0.72 to 0.74) and 0.70 (95% CI = 0.68 to 0.72) for the testing and validation sets, respectively. CONCLUSIONS A prognostic model of OS in the postdocetaxel, second-line chemotherapy, mCRPC setting was developed and externally validated. This model incorporates novel prognostic factors and can be used to provide predicted probabilities for individual patients and to select patients to participate in clinical trials on the basis of their prognosis. Prospective validation is needed.
Affiliation(s)
- Susan Halabi
- Affiliations of authors: Department of Biostatistics and Bioinformatics (SH, C-YL, EK) and Alliance Statistics and Data Center (SH), Duke University, Durham, NC; Departments of Medicine and Urology, University of California-San Francisco, San Francisco, CA (EJS); Division of Medical Oncology, Duke Prostate Center and the Duke Cancer Institute, Durham, NC (AJA); Departments of Medical Oncology and Urology, Yale University Cancer Center, New Haven, CT (DP); Department of Medical Oncology, San Camillo and Forlanini Hospital, Rome, Italy (CNS); Sanofi, Malvern, PA (LS); Department of Medical Oncology, Georges Pompidou European Hospital, Paris, France (SO); Department of Clinical Studies, Royal Marsden Hospital and Institute of Cancer Research, Surrey, United Kingdom (JdB); Urology Department, Tulane Cancer Center, New Orleans, LA (OS)
|
49
|
|
50
|
Chatterjee A, Lahiri SN. Rates of convergence of the Adaptive LASSO estimators to the Oracle distribution and higher order refinements by the bootstrap. Ann Stat 2013. [DOI: 10.1214/13-aos1106] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Indexed: 11/19/2022]
|