1
Cheng X, Wu Z, Lin J, Wang B, Huang S, Liu M, Yang J. A two-stage ensemble learning based prediction and grading model for PD-1/PD-L1 inhibitor-related cardiac adverse events: A multicenter retrospective study. Computer Methods and Programs in Biomedicine 2024; 255:108360. [PMID: 39163785 DOI: 10.1016/j.cmpb.2024.108360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 06/12/2024] [Accepted: 07/27/2024] [Indexed: 08/22/2024]
Abstract
BACKGROUND Immune-related cardiac adverse events (ircAEs) caused by programmed cell death protein-1 (PD-1) and programmed death-ligand-1 (PD-L1) inhibitors can lead to fulminant and even fatal consequences. This study aims to develop a prediction and grading model for ircAEs, enabling graded management of patients. METHODS This study used medical record systems from two medical institutions to develop a prediction and grading model for ircAEs using ten machine learning algorithms and two variable screening methods. The model was developed based on a two-stage ensemble learning framework. In the first stage, ircAE and non-ircAE cases were classified. In the second stage, ircAE cases were grouped into grades 1-2 and 3-5. The experiments were evaluated using five-fold cross-validation. The model's prediction performance was assessed using accuracy, precision, recall, F1 score, Brier score, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR). RESULTS A total of 615 patients were included in the study; 147 experienced ircAEs, and 44 experienced grade 3-5 ircAEs. The soft voting classifier trained using the variables screened by feature importance ranking performed better than other classifiers in both stages. The average AUC for the first and second stages was 84.18% and 85.13%, respectively. In the first stage, the three most important variables were N-terminal pro-B-type natriuretic peptide (NT-proBNP), interleukin-2 (IL-2), and C-reactive protein (CRP). In the second stage, the patient's age, NT-proBNP, and left ventricular ejection fraction (LVEF) were the three most critical variables. CONCLUSIONS The prediction and grading model of ircAEs based on two-stage ensemble learning established in this study has good performance and potential clinical application.
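The two-stage soft-voting idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the probabilities are hypothetical, and the base models are assumed to have already produced positive-class probabilities. Soft voting is taken here in its usual sense of averaging those probabilities across models.

```python
from statistics import mean

def soft_vote(prob_lists):
    """Average the per-model positive-class probabilities for each sample."""
    # prob_lists holds one list of probabilities per base model, all equally long.
    return [mean(sample_probs) for sample_probs in zip(*prob_lists)]

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return mean((p - y) ** 2 for y, p in zip(y_true, y_prob))

def two_stage_predict(stage1_probs, stage2_probs, threshold=0.5):
    """Stage 1 separates ircAE from non-ircAE; stage 2 grades flagged cases only."""
    labels = []
    for p1, p2 in zip(soft_vote(stage1_probs), soft_vote(stage2_probs)):
        if p1 < threshold:
            labels.append("no ircAE")
        elif p2 < threshold:
            labels.append("grade 1-2")
        else:
            labels.append("grade 3-5")
    return labels

# Hypothetical probabilities from three base models for four patients.
stage1 = [[0.2, 0.7, 0.9, 0.6], [0.1, 0.8, 0.8, 0.5], [0.3, 0.6, 1.0, 0.7]]
stage2 = [[0.5, 0.2, 0.9, 0.4], [0.5, 0.3, 0.7, 0.3], [0.5, 0.1, 0.8, 0.2]]
print(two_stage_predict(stage1, stage2))
```

In this sketch the second-stage probability is ignored for any patient the first stage classifies as non-ircAE, which mirrors the cascade described in the abstract.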
Affiliation(s)
- Xitong Cheng
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
- Zhaochun Wu
  - Department of Pharmacy, Fujian Medical University Affiliated Nanping First Hospital, Nanping, PR China
- Jierong Lin
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
- Bitao Wang
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
- Shunming Huang
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
- Maobai Liu
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
- Jing Yang
  - Department of Pharmacy, Fujian Medical University Union Hospital, Fuzhou, PR China; College of Pharmacy, Fujian Medical University, Fuzhou, PR China
2
El Badisy I, Graffeo N, Khalis M, Giorgi R. Multi-metric comparison of machine learning imputation methods with application to breast cancer survival. BMC Med Res Methodol 2024; 24:191. [PMID: 39215245 PMCID: PMC11363416 DOI: 10.1186/s12874-024-02305-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 08/08/2024] [Indexed: 09/04/2024] Open
Abstract
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we applied the methods to data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. A number of single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower's distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower's distance, CART and missForest were the most accurate, while missMDA and CART excelled for binary covariates; missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores. Despite offering better predictive accuracy, the study found that SI methods introduced more bias into the regression coefficients compared to MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
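Two of the model-free metrics named in this abstract, NRMSE for continuous covariates and PFC for categorical ones, can be sketched as follows. This is an illustrative sketch only: the abstract does not give the exact formulas, so the normalization by the population variance of the true values is an assumption (other normalizations exist), and the function names are hypothetical. Both functions compare imputed entries against the known true values, which is only possible in a simulation where values were masked deliberately.

```python
from math import sqrt
from statistics import mean, pvariance

def nrmse(true_vals, imputed_vals):
    """Root mean squared imputation error, normalized by the variance of the truth.

    Assumption: population variance is used as the normalizer.
    """
    mse = mean((t - i) ** 2 for t, i in zip(true_vals, imputed_vals))
    return sqrt(mse / pvariance(true_vals))

def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among imputed categorical values."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)

# Hypothetical masked entries and their imputations.
print(nrmse([0.0, 2.0], [1.0, 1.0]))
print(pfc(["a", "b", "a", "b"], ["a", "a", "a", "b"]))
```

For both metrics, lower values indicate imputations closer to the masked truth, and 0 means the imputed entries match exactly.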
Affiliation(s)
- Imad El Badisy
  - Mohammed VI Center For Research and Innovation, Rabat, Morocco
  - International School of Public Health, Mohammed VI University of Sciences and Health, Casablanca, Morocco
  - Aix Marseille Univ, INSERM, IRD, ISSPAM, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France
- Nathalie Graffeo
  - Aix Marseille Univ, INSERM, IRD, ISSPAM, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France
- Mohamed Khalis
  - Mohammed VI Center For Research and Innovation, Rabat, Morocco
  - International School of Public Health, Mohammed VI University of Sciences and Health, Casablanca, Morocco
- Roch Giorgi
  - Aix Marseille Univ, INSERM, IRD, ISSPAM, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France
  - Aix Marseille Univ, APHM, INSERM, IRD, SESSTIM, Hop Timone, Biostatistique et Technologies de l'Information et de la Communication, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Hop Timone, BioSTIC, Biostatistique et Technologies de l'Information et de la Communication, Marseille, France
3
Santipas B, Veerakanjana K, Ittichaiwong P, Chavalparit P, Wilartratsami S, Luksanapruksa P. Development and internal validation of machine-learning models for predicting survival in patients who underwent surgery for spinal metastases. Asian Spine J 2024; 18:325-335. [PMID: 38764230 PMCID: PMC11222881 DOI: 10.31616/asj.2023.0314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 01/17/2024] [Accepted: 01/23/2024] [Indexed: 05/21/2024] Open
Abstract
STUDY DESIGN A retrospective study. PURPOSE This study aimed to develop machine-learning algorithms for predicting survival in patients who underwent surgery for spinal metastasis. OVERVIEW OF LITERATURE This study develops machine-learning models to predict postoperative survival in spinal metastasis patients, filling the gaps of traditional prognostic systems. Utilizing data from 389 patients, the study highlights XGBoost and CatBoost algorithms' effectiveness for 90-, 180-, and 365-day survival predictions, with preoperative serum albumin as a key predictor. These models offer a promising approach for enhancing clinical decision-making and personalized patient care. METHODS A registry of patients who underwent surgery (instrumentation, decompression, or fusion) for spinal metastases between 2004 and 2018 was used. The outcome measure was survival at postoperative days 90, 180, and 365. Preoperative variables were used to develop machine-learning algorithms to predict survival chance in each period. The performance of the algorithms was measured using the area under the receiver operating characteristic curve (AUC). RESULTS A total of 389 patients were identified, with 90-, 180-, and 365-day mortality rates of 18%, 41%, and 45% postoperatively, respectively. The XGBoost algorithm showed the best performance for predicting 180-day and 365-day survival (AUCs of 0.744 and 0.693, respectively). The CatBoost algorithm demonstrated the best performance for predicting 90-day survival (AUC of 0.758). Serum albumin had the highest positive correlation with survival after surgery. CONCLUSIONS These machine-learning algorithms showed promising results in predicting survival in patients who underwent spinal palliative surgery for spinal metastasis, which may assist surgeons in choosing appropriate treatment and increasing awareness of mortality-related factors before surgery.
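The AUC used as the performance measure in this abstract can be read as the probability that a randomly chosen positive case (for example, a patient who survived to a given day) receives a higher predicted score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function name and the example scores are hypothetical, and this is not the authors' evaluation code.

```python
def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = survived) and predicted survival scores.
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently on larger data.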
Affiliation(s)
- Borriwat Santipas
  - Department of Orthopaedic Surgery, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Kanyakorn Veerakanjana
  - Siriraj Informatics and Data Innovation Center, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Piyalitt Ittichaiwong
  - Siriraj Informatics and Data Innovation Center, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Piya Chavalparit
  - Department of Orthopaedic Surgery, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
  - Department of Orthopaedic Surgery, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand
- Sirichai Wilartratsami
  - Department of Orthopaedic Surgery, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Panya Luksanapruksa
  - Department of Orthopaedic Surgery, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
4
Liu J, Duan Z, Hu X, Zhong J, Yin Y. Detracking Autoencoding Conditional Generative Adversarial Network: Improved Generative Adversarial Network Method for Tabular Missing Value Imputation. Entropy (Basel, Switzerland) 2024; 26:402. [PMID: 38785651 PMCID: PMC11120050 DOI: 10.3390/e26050402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 04/20/2024] [Accepted: 04/21/2024] [Indexed: 05/25/2024]
Abstract
Due to various reasons, such as limitations in data collection and interruptions in network transmission, gathered data often contain missing values. Existing state-of-the-art generative adversarial imputation methods face three main issues: limited applicability, neglect of latent categorical information that could reflect relationships among samples, and an inability to balance local and global information. We propose a novel generative adversarial model named DTAE-CGAN that incorporates detracking autoencoding and conditional labels to address these issues. This enhances the network's ability to learn inter-sample correlations and makes full use of all data information in incomplete datasets, rather than learning random noise. We conducted experiments on six real datasets of varying sizes, comparing our method with four classic imputation baselines. The results demonstrate that our proposed model consistently exhibited superior imputation accuracy.
Affiliation(s)
- Jingrui Liu
  - College of Computer Science, Chongqing University, Chongqing 400044, China
  - Chongqing University-University of Cincinnati Joint Co-op Institute, Chongqing University, Chongqing 400044, China
- Zixin Duan
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Xinkai Hu
  - College of Computer Science, Chongqing University, Chongqing 400044, China
- Jingxuan Zhong
  - College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044, China
- Yunfei Yin
  - College of Computer Science, Chongqing University, Chongqing 400044, China
5
Li J, Hao Y, Liu Y, Wu L, Liang H, Ni L, Wang F, Wang S, Duan Y, Xu Q, Xiao J, Yang D, Gao G, Ding Y, Gao C, Xiao J, Zhao H. Supervised machine learning algorithms to predict the duration and risk of long-term hospitalization in HIV-infected individuals: a retrospective study. Front Public Health 2024; 11:1282324. [PMID: 38249414 PMCID: PMC10796994 DOI: 10.3389/fpubh.2023.1282324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 12/13/2023] [Indexed: 01/23/2024] Open
Abstract
Objective The study aimed to use supervised machine learning models to predict the length of hospital stay and the risk of prolonged hospitalization in people living with HIV (PLWH), to help physicians intervene in a timely manner and avoid wasting health resources. Methods Regression models based on RF, KNN, SVM, and XGB were established to predict the length of hospital stay and were evaluated using RMSE, MAE, MAPE, and R². Classification models based on RF, KNN, SVM, NN, and XGB were established to predict the risk of prolonged hospital stay and were evaluated using accuracy, PPV, NPV, specificity, sensitivity, and kappa. Visual evaluation of all models based on AUROC, AUPRC, calibration curves, and decision curves was used for internal validation. Results Among the regression models, the XGB model performed best in internal validation (RMSE = 16.81, MAE = 10.39, MAPE = 0.98, R² = 0.47) for predicting the length of hospital stay. Among the classification models, the NN model showed good fit and stable features and performed best on the testing sets, with excellent accuracy (0.7623), PPV (0.7853), NPV (0.7092), sensitivity (0.8754), specificity (0.5882), and kappa (0.4672); visual evaluation further indicated the largest AUROC (0.9779) and AUPRC (0.773) and well-performing calibration and decision curves in internal validation. Conclusion This study showed that the XGB model was effective in predicting the length of hospital stay, while the NN model was effective in predicting the risk of prolonged hospitalization in PLWH. Based on these predictive models, an intelligent medical prediction system may be developed to predict the length of stay and risk of prolonged hospitalization of HIV patients from their medical records, which may help reduce the waste of healthcare resources.
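The classification metrics listed in this abstract all derive from a 2x2 confusion matrix. The following sketch shows the standard definitions, including Cohen's kappa as agreement corrected for chance; the function name and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary classifier."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Expected chance agreement: product of predicted and actual marginals per class.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    return {
        "accuracy": accuracy,
        "ppv": tp / (tp + fp),            # positive predictive value (precision)
        "npv": tn / (tn + fn),            # negative predictive value
        "sensitivity": tp / (tp + fn),    # recall on the positive class
        "specificity": tn / (tn + fp),    # recall on the negative class
        "kappa": (accuracy - p_e) / (1 - p_e),
    }

# Hypothetical counts: 40 true positives, 10 false positives, 10 false negatives, 40 true negatives.
print(binary_metrics(tp=40, fp=10, fn=10, tn=40))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, which is why the two are usually reported together for imbalanced outcomes.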
Affiliation(s)
- Jialu Li
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yiwei Hao
  - Division of Medical Record and Statistics, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Ying Liu
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Liang Wu
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Hongyuan Liang
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Liang Ni
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Fang Wang
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Sa Wang
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yujiao Duan
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Qiuhua Xu
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Jinjing Xiao
  - Department of Clinical Medicine, Zhengzhou University, Zhengzhou, China
- Di Yang
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Guiju Gao
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yi Ding
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Chengyu Gao
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Jiang Xiao
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Hongxin Zhao
  - Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China
6
Kondo M, Oba K. Handling of outcome missing data dependent on measured or unmeasured background factors in micro-randomized trial: Simulation and application study. Digit Health 2024; 10:20552076241249631. [PMID: 38698826 PMCID: PMC11064756 DOI: 10.1177/20552076241249631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 04/08/2024] [Indexed: 05/05/2024] Open
Abstract
Background Micro-randomized trials (MRTs) enhance the effects of mHealth by determining the optimal components, timings, and frequency of interventions. Appropriate handling of missing values is crucial in clinical research; however, it remains insufficiently explored in the context of MRTs. Our study aimed to investigate appropriate methods for missing data in simple MRTs with uniform intervention randomization and no time-dependent covariates. We focused on outcome missing data depending on the participants' background factors. Methods We evaluated the performance of available data analysis (AD) and multiple imputation in generalized estimating equations (GEE) and random effects models (RE) through simulations. The scenarios were examined based on the presence of unmeasured background factors and the presence of interaction effects. We used regression and propensity score methods for multiple imputation. These missing data handling methods were also applied to actual MRT data. Results Without the interaction effect, AD was biased for GEE, but there was almost no bias for RE. With the interaction effect, estimates were biased for both. For multiple imputation, regression methods gave unbiased estimates when the imputation models were correct, but bias occurred when the models were incorrect. However, this bias was reduced by including the random effects in the imputation model. In the propensity score method, bias occurred even when the missing probability model was correct. Conclusions Without the interaction effect, AD of RE was preferable. When employing GEE or anticipating interactions, we recommend multiple imputation, especially with regression methods that include individual-level random effects.
Affiliation(s)
- Masahiro Kondo
  - Biostatistics Unit, Clinical and Translational Research Center, Keio University Hospital, Tokyo, Japan
  - Graduate School of Health Management, Keio University, Kanagawa, Japan
- Koji Oba
  - Interfaculty Initiative in Information Studies, the University of Tokyo, Tokyo, Japan
  - Department of Biostatistics, School of Public Health, Graduate School of Medicine, the University of Tokyo, Tokyo, Japan
7
Hong WT, Clifton G, Nelson JD. Railway accident causation analysis: Current approaches, challenges and potential solutions. Accident; Analysis and Prevention 2023; 186:107049. [PMID: 36989961 DOI: 10.1016/j.aap.2023.107049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Revised: 03/23/2023] [Accepted: 03/24/2023] [Indexed: 06/19/2023]
Abstract
Railway accident causation analysis is fundamental to understanding the nature of railway safety. Although a considerable number of prior studies have investigated this context, many of them suffer from the need to deal with a large amount of textual data given that most railway safety-related information is recorded and stored in the form of text. To gain a better understanding of the limitations imposed by overreliance on textual analysis, a scoping review of the academic literature on how railway accident causation analysis is addressed has been conducted. The results confirm the high frequency of using textual data, a single case study, and in-depth analysis frameworks. While the value of exploring causational factors is clear, the high level of human intervention and the labour-intensive analysis processes based on a large volume of textual data hinder researchers from understanding the complex nature of the rail safety system. Recently, growing attention has been given to the application of Natural Language Processing (NLP) to aid the practice of analysing a large corpus of textual data, but only limited studies to date in railway safety use such techniques and none address railway accident causation analysis. To fill this gap, a supplementary review is conducted to identify opportunities, challenges, boundaries and limitations in the application of NLP approaches to railway accident causation analysis. Findings indicate that novel techniques using off-the-shelf tools have strong potential to overcome the limitations of overreliance on manual analysis in practice and theory, but the absence of shared railway safety-related benchmark corpora restricts implementation. This study sheds light on a new approach to railway accident causation analysis and clarifies future applicable utilisations for further research.
Affiliation(s)
- Wei-Ting Hong
  - Institute of Transport and Logistics Studies (ITLS), The University of Sydney Business School, The University of Sydney, NSW 2006, Australia
- Geoffrey Clifton
  - Institute of Transport and Logistics Studies (ITLS), The University of Sydney Business School, The University of Sydney, NSW 2006, Australia
- John D Nelson
  - Institute of Transport and Logistics Studies (ITLS), The University of Sydney Business School, The University of Sydney, NSW 2006, Australia
8
Pelgrims I, Devleesschauwer B, Vandevijvere S, De Clercq EM, Vansteelandt S, Gorasso V, Van der Heyden J. Using random-forest multiple imputation to address bias of self-reported anthropometric measures, hypertension and hypercholesterolemia in the Belgian health interview survey. BMC Med Res Methodol 2023; 23:69. [PMID: 36966305 PMCID: PMC10040120 DOI: 10.1186/s12874-023-01892-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2022] [Accepted: 03/16/2023] [Indexed: 03/27/2023] Open
Abstract
BACKGROUND In many countries, the prevalence of non-communicable disease risk factors is commonly assessed through self-reported information from health interview surveys. It has been shown, however, that using self-reported instead of objective data leads to an underestimation of the prevalence of obesity, hypertension and hypercholesterolemia. This study aimed to assess the agreement between self-reported and measured height, weight, hypertension and hypercholesterolemia and to identify an adequate approach for valid measurement error correction. METHODS Nine thousand four hundred thirty-nine participants of the 2018 Belgian health interview survey (BHIS) older than 18 years, of whom 1184 participated in the 2018 Belgian health examination survey (BELHES), were included in the analysis. Regression calibration was compared with multiple imputation by chained equations based on parametric and non-parametric techniques. RESULTS This study confirmed the underestimation of risk factor prevalence based on self-reported data. With both regression calibration and multiple imputation, adjusted estimation of these variables in the BHIS made it possible to generate national prevalence estimates that were closer to their BELHES clinical counterparts. For overweight, obesity and hypertension, all methods provided smaller standard errors than those obtained with clinical data. However, for hypercholesterolemia, for which the regression model's accuracy was poor, multiple imputation was the only approach which provided smaller standard errors than those based on clinical data. CONCLUSIONS Random-forest multiple imputation proves to be the method of choice to correct the bias related to self-reported data in the BHIS. This method is particularly useful to enable improved secondary analysis of self-reported data by using information included in the BELHES. Whenever feasible, combined information from HIS and objective measurements should be used in risk factor monitoring.
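Multiple imputation yields one estimate per imputed dataset, and these are conventionally combined with Rubin's rules to obtain a single estimate and standard error. The abstract does not state how the authors pooled their results, so the sketch below is an assumption about the usual procedure rather than a description of this study; the function name and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Combine m per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance between imputations
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

# Hypothetical prevalence estimates (in %) from three imputed datasets.
estimate, std_error = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, std_error)
```

The between-imputation term is what carries the uncertainty due to the missing or mismeasured values into the final standard error.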
Affiliation(s)
- Ingrid Pelgrims
  - Service Risk and Health Impact Assessment, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
  - Applied Mathematics, Computer Science and Statistics, Ghent University, Krijgslaan 281, S9, BE-9000, Ghent, Belgium
  - Department of Epidemiology and Public Health, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
- Brecht Devleesschauwer
  - Department of Epidemiology and Public Health, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
  - Department of Translational Physiology, Infectiology and Public Health, Ghent University, Salisburylaan 133, Hoogbouw, B-9820, Merelbeke, Belgium
- Stefanie Vandevijvere
  - Department of Epidemiology and Public Health, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
- Eva M De Clercq
  - Service Risk and Health Impact Assessment, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
- Stijn Vansteelandt
  - Applied Mathematics, Computer Science and Statistics, Ghent University, Krijgslaan 281, S9, BE-9000, Ghent, Belgium
- Vanessa Gorasso
  - Department of Epidemiology and Public Health, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
  - Department of Public Health and Primary Care, Ghent University, Corneel Heymanslaan 10, 9000, Ghent, Belgium
- Johan Van der Heyden
  - Department of Epidemiology and Public Health, Sciensano, Rue Juliette Wytsman 14, 1050, Brussels, Belgium
9
Li D, Wong J, Li X, Toh S, Wang R. Imputing missing covariates in time-to-event analysis within distributed research networks: A simulation study. Pharmacoepidemiol Drug Saf 2023; 32:330-340. [PMID: 36380400 DOI: 10.1002/pds.5563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 09/13/2022] [Accepted: 10/26/2022] [Indexed: 11/18/2022]
Abstract
PURPOSE In distributed research network (DRN) settings, multiple imputation cannot be directly implemented because pooling individual-level data is often not feasible. The performance of multiple imputation in combination with meta-analysis is not well understood within DRNs. METHODS To evaluate the performance of imputation for missing baseline covariate data in combination with meta-analysis for time-to-event analysis within DRNs, we compared two parametric algorithms, an approximated linear imputation model (Approx) and a nonlinear substantive model compatible imputation model (SMC), and two non-parametric machine learning algorithms, random forest (RF) and classification and regression trees (CART), through simulation studies motivated by a real-world data set. RESULTS Under the setting with small effect sizes (i.e., log-hazard ratios [logHR]) and homogeneous missingness mechanisms across sites, all imputation methods produced unbiased and more efficient estimates, while the complete-case analysis could be biased and inefficient; under heterogeneous missingness mechanisms, estimates with the RF method could have higher efficiency. Estimates from the distributed imputation combined by meta-analysis were similar to those from the imputation using pooled data. When logHRs were large, the SMC imputation algorithm generally performed better than the others. CONCLUSIONS These findings suggest the validity and feasibility of imputation within DRNs in the presence of missing covariate data in time-to-event analysis under various settings. The performance of the four imputation algorithms varies with the effect sizes and level of missingness.
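Combining site-level results by meta-analysis, as described in this abstract, means each site imputes and analyzes its own data and shares only a summary estimate such as a logHR with its standard error. The abstract does not say which meta-analytic model was used, so the fixed-effect inverse-variance pooling below is an assumption chosen for illustration; the function name and the site values are hypothetical.

```python
from math import exp, sqrt

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of site-level estimates (fixed-effect model)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical logHR estimates and standard errors from two sites.
log_hr, se = pool_fixed_effect([0.2, 0.4], [0.1, 0.1])
print(log_hr, se, exp(log_hr))  # exp() converts the pooled logHR to a hazard ratio
```

Only the two summary numbers per site cross institutional boundaries in this scheme, which is what makes it compatible with the data-sharing constraints of a DRN.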
Affiliation(s)
- Dongdong Li
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Jenna Wong
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Xiaojuan Li
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Sengwee Toh
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Rui Wang
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
10
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data are normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on a suitable MVI method, we provide a decision chart which facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can impact proper MVI, such as the presence of confounders (e.g., batch effects) that can influence MVI performance. These, too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Harvard Wai Hann Hui
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Hui Peng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
|
11
|
Leist AK, Klee M, Kim JH, Rehkopf DH, Bordas SPA, Muniz-Terrera G, Wade S. Mapping of machine learning approaches for description, prediction, and causal inference in the social and health sciences. Science Advances 2022; 8:eabk1942. [PMID: 36260666 PMCID: PMC9581488 DOI: 10.1126/sciadv.abk1942] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/28/2021] [Accepted: 09/01/2022] [Indexed: 05/20/2023]
Abstract
Machine learning (ML) methodology used in the social and health sciences needs to fit the intended research purposes of description, prediction, or causal inference. This paper provides a comprehensive, systematic meta-mapping of research questions in the social and health sciences to appropriate ML approaches by incorporating the necessary requirements to statistical analysis in these disciplines. We map the established classification into description, prediction, counterfactual prediction, and causal structural learning to common research goals, such as estimating prevalence of adverse social or health outcomes, predicting the risk of an event, and identifying risk factors or causes of adverse outcomes, and explain common ML performance metrics. Such mapping may help to fully exploit the benefits of ML while considering domain-specific aspects relevant to the social and health sciences and hopefully contribute to the acceleration of the uptake of ML applications to advance both basic and applied social and health sciences research.
Affiliation(s)
- Anja K. Leist
- Department of Social Sciences, Institute for Research on Socio-Economic Inequality (IRSEI), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Corresponding author.
- Matthias Klee
- Department of Social Sciences, Institute for Research on Socio-Economic Inequality (IRSEI), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Jung Hyun Kim
- Department of Social Sciences, Institute for Research on Socio-Economic Inequality (IRSEI), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- David H. Rehkopf
- Department of Epidemiology and Population Health, Stanford University, Palo Alto, CA, USA
- Graciela Muniz-Terrera
- Centre for Dementia Prevention, University of Edinburgh, Edinburgh, UK
- Ohio University, Athens, OH, USA
- Sara Wade
- School of Mathematics, University of Edinburgh, Edinburgh, UK
|
12
|
Abstract
Multiple imputation techniques are commonly used when data are missing; however, there are many options one can consider. Multivariate imputation by chained equations is a popular method for generating imputations but relies on specifying models when imputing missing values. In this work, we introduce multiple imputation by super learning, an update to the multivariate imputation by chained equations method to generate imputations with ensemble learning. Ensemble methodologies have recently gained attention for use in inference and prediction as they optimally combine a variety of user-specified parametric and non-parametric models and perform well when estimating complex functions, including those with interaction terms. Through two simulations, we compare inferences made using the multiple imputation by super learning approach to those made with other commonly used multiple imputation methods and demonstrate multiple imputation by super learning as a superior option when considering characteristics such as bias, confidence interval coverage rate, and confidence interval width.
Affiliation(s)
- Thomas Carpenito
- Department of Health Sciences, Northeastern University, Boston, MA, USA
- Justin Manjourides
- Department of Health Sciences, Northeastern University, Boston, MA, USA
|