1
|
Nehler KJ, Schultze M. Simulation-Based Performance Evaluation of Missing Data Handling in Network Analysis. Multivariate Behav Res 2024:1-21. [PMID: 38247019 DOI: 10.1080/00273171.2023.2283638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/23/2024]
Abstract
Network analysis has gained popularity as an approach to investigate psychological constructs. However, there are currently no guidelines for applied researchers when encountering missing values. In this simulation study, we compared the performance of a two-step EM algorithm with separated steps for missing handling and regularization, a combined direct EM algorithm, and pairwise deletion. We investigated conditions with varying network sizes, numbers of observations, missing data mechanisms, and percentages of missing values. These approaches are evaluated with regard to recovering population networks in terms of loss in the precision matrix, edge set identification, and network statistics. The simulation showed adequate performance only in conditions with large samples (n ≥ 500) or small networks (p = 10). Comparing the missing data approaches, the direct EM appears to be more sensitive and superior in nearly all chosen conditions. The two-step EM yields better results when the ratio of n/p is very large, being less sensitive but more specific. Pairwise deletion failed to converge across numerous conditions and yielded inferior results overall. In sum, direct EM is recommended in most cases, as it is able to mitigate the impact of missing data quite well, while modifications to two-step EM could improve its performance.
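As a rough illustration of the pairwise deletion baseline named in this abstract, the sketch below estimates each covariance from only those rows in which both variables are observed. The function names and the toy data are assumptions made for illustration; this is not the authors' simulation code.

```python
import math

def pairwise_cov(data, i, j):
    """Covariance of columns i and j using only rows where both are observed."""
    pairs = [(row[i], row[j]) for row in data
             if row[i] is not None and row[j] is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")  # too few jointly observed rows
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    return sum((a - mean_i) * (b - mean_j) for a, b in pairs) / (n - 1)

def pairwise_cov_matrix(data):
    """Full covariance matrix built entry by entry with pairwise deletion."""
    p = len(data[0])
    return [[pairwise_cov(data, i, j) for j in range(p)] for i in range(p)]

# Hypothetical toy data; None marks a missing value.
data = [
    [1.0, 2.0, None],
    [2.0, None, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, None],
]
cov = pairwise_cov_matrix(data)
```

Because every entry may rest on a different subset of rows, a matrix assembled this way is not guaranteed to be positive definite, which is a general caveat of pairwise deletion.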
|
2
|
Kim M, Kim TH, Kim D, Lee D, Kim D, Heo J, Kang S, Ha T, Kim J, Moon DH, Heo Y, Kim WJ, Lee SJ, Kim Y, Park SW, Han SS, Choi HS. In-Advance Prediction of Pressure Ulcers via Deep-Learning-Based Robust Missing Value Imputation on Real-Time Intensive Care Variables. J Clin Med 2023; 13:36. [PMID: 38202043 PMCID: PMC10780209 DOI: 10.3390/jcm13010036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 12/06/2023] [Accepted: 12/15/2023] [Indexed: 01/12/2024] Open
Abstract
Pressure ulcers (PUs) are a prevalent skin disease affecting patients with impaired mobility and in high-risk groups. These ulcers increase patients' suffering, medical expenses, and burden on medical staff. This study introduces a clinical decision support system and verifies it for predicting real-time PU occurrences within the intensive care unit (ICU) by using MIMIC-IV and in-house ICU data. We develop various machine learning (ML) and deep learning (DL) models for predicting PU occurrences in real time using the MIMIC-IV and validate using the MIMIC-IV and Kangwon National University Hospital (KNUH) dataset. To address the challenge of missing values in time series, we propose a novel recurrent neural network model, GRU-D++. This model outperformed other experimental models by achieving the area under the receiver operating characteristic curve (AUROC) of 0.945 for the on-time prediction and AUROC of 0.912 for 48h in-advance prediction. Furthermore, in the external validation with the KNUH dataset, the fine-tuned GRU-D++ model demonstrated superior performances, achieving an AUROC of 0.898 for on-time prediction and an AUROC of 0.897 for 48h in-advance prediction. The proposed GRU-D++, designed to consider temporal information and missing values, stands out for its predictive accuracy. Our findings suggest that this model can significantly alleviate the workload of medical staff and prevent the worsening of patient conditions by enabling timely interventions for PUs in the ICU.
Affiliation(s)
- Minkyu Kim
  - Department of Research & Development, Ziovision Co., Ltd., Chuncheon 24341, Republic of Korea
- Tae-Hoon Kim
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Dowon Kim
  - Department of Research & Development, Ziovision Co., Ltd., Chuncheon 24341, Republic of Korea
- Donghoon Lee
  - Department of Research & Development, Ziovision Co., Ltd., Chuncheon 24341, Republic of Korea
- Dohyun Kim
  - Department of Research & Development, Ziovision Co., Ltd., Chuncheon 24341, Republic of Korea
- Jeongwon Heo
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Seonguk Kang
  - Department of Convergence Security, Kangwon National University, Chuncheon 24341, Republic of Korea
- Taejun Ha
  - Biomedical Research Institute, Kangwon National University Hospital, Chuncheon 24289, Republic of Korea
- Jinju Kim
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Da Hye Moon
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
  - Department of Pulmonology, Kangwon National University Hospital, Chuncheon 24289, Republic of Korea
- Yeonjeong Heo
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
  - Department of Pulmonology, Kangwon National University Hospital, Chuncheon 24289, Republic of Korea
- Woo Jin Kim
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Seung-Joon Lee
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Yoon Kim
  - Department of Computer Science and Engineering, Kangwon National University, Chuncheon 24341, Republic of Korea
- Sang Won Park
  - Department of Medical Informatics, School of Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
  - Institute of Medical Science, School of Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Seon-Sook Han
  - Department of Internal Medicine, Kangwon National University, Chuncheon 24341, Republic of Korea
- Hyun-Soo Choi
  - Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
|
3
|
Aßmann C, Gaasch JC, Stingl D. A Bayesian Approach Towards Missing Covariate Data in Multilevel Latent Regression Models. Psychometrika 2023; 88:1495-1528. [PMID: 36418780 PMCID: PMC10656345 DOI: 10.1007/s11336-022-09888-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2021] [Revised: 08/29/2022] [Accepted: 09/20/2022] [Indexed: 06/16/2023]
Abstract
The measurement of latent traits and investigation of relations between these and a potentially large set of explaining variables is typical in psychology, economics, and the social sciences. Corresponding analysis often relies on surveyed data from large-scale studies involving hierarchical structures and missing values in the set of considered covariates. This paper proposes a Bayesian estimation approach based on the device of data augmentation that addresses the handling of missing values in multilevel latent regression models. Population heterogeneity is modeled via multiple groups enriched with random intercepts. Bayesian estimation is implemented in terms of a Markov chain Monte Carlo sampling approach. To handle missing values, the sampling scheme is augmented to incorporate sampling from the full conditional distributions of missing values. We suggest to model the full conditional distributions of missing values in terms of non-parametric classification and regression trees. This offers the possibility to consider information from latent quantities functioning as sufficient statistics. A simulation study reveals that this Bayesian approach provides valid inference and outperforms complete cases analysis and multiple imputation in terms of statistical efficiency and computation time involved. An empirical illustration using data on mathematical competencies demonstrates the usefulness of the suggested approach.
Affiliation(s)
- Christian Aßmann
  - Leibniz Institute for Educational Trajectories Bamberg, Bamberg, Germany
  - Otto-Friedrich-Universität Bamberg, Bamberg, Germany
- Doris Stingl
  - Otto-Friedrich-Universität Bamberg, Bamberg, Germany
|
4
|
Karamti H, Alharthi R, Anizi AA, Alhebshi RM, Eshmawi AA, Alsubai S, Umer M. Improving Prediction of Cervical Cancer Using KNN Imputed SMOTE Features and Multi-Model Ensemble Learning Approach. Cancers (Basel) 2023; 15:4412. [PMID: 37686692 PMCID: PMC10486648 DOI: 10.3390/cancers15174412] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2023] [Revised: 08/02/2023] [Accepted: 08/09/2023] [Indexed: 09/10/2023] Open
Abstract
Objective: Cervical cancer ranks among the top causes of death among females in developing countries. The most important procedures that should be followed to guarantee the minimizing of cervical cancer's aftereffects are early identification and treatment under the finest medical guidance. One of the best methods to find this sort of malignancy is by looking at a Pap smear image. For automated detection of cervical cancer, the available datasets often have missing values, which can significantly affect the performance of machine learning models. Methods: To address these challenges, this study proposes an automated system for predicting cervical cancer that efficiently handles missing values with SMOTE features to achieve high accuracy. The proposed system employs a stacked ensemble voting classifier model that combines three machine learning models, along with KNN Imputer and SMOTE up-sampled features for handling missing values. Results: The proposed model achieves 99.99% accuracy, 99.99% precision, 99.99% recall, and 99.99% F1 score when using KNN imputed SMOTE features. The study compares the performance of the proposed model with multiple other machine learning algorithms under four scenarios: with missing values removed, with KNN imputation, with SMOTE features, and with KNN imputed SMOTE features. The study validates the efficacy of the proposed model against existing state-of-the-art approaches. Conclusions: This study investigates the issue of missing values and class imbalance in the data collected for cervical cancer detection and might aid medical practitioners in timely detection and providing cervical cancer patients with better care.
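The k-nearest-neighbor imputation step mentioned in this abstract can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions (Euclidean distance over jointly observed features, unweighted mean of the k nearest donors); the function name, the data, and the choice k=2 are hypothetical and do not reproduce the authors' pipeline or any specific library implementation.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Euclidean distance over features observed in both rows,
        # rescaled by the share of features that could be compared.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        sq = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(sq * len(a) / len(shared))

    imputed = [list(r) for r in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature c.
            donors = [(distance(row, other), other[c])
                      for o, other in enumerate(rows)
                      if o != r and other[c] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[r][c] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Hypothetical toy data; None marks a missing value.
rows = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [1.0, 1.0, None],
]
filled = knn_impute(rows, k=2)
```

In this toy example the last row is closest to the first two rows, so its missing third feature is filled with the mean of their values. Distances are always computed from the original, unimputed rows, so the order in which cells are filled does not matter.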
Affiliation(s)
- Hanen Karamti
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Raed Alharthi
  - Department of Computer Science and Engineering, University of Hafr Al-Batin, Hafar Al-Batin 39524, Saudi Arabia
- Amira Al Anizi
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Reemah M. Alhebshi
  - Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Ala’ Abdulmajid Eshmawi
  - Department of Cybersecurity, College of Computer Science and Engineering, University of Jeddah, Jeddah 23218, Saudi Arabia
- Shtwai Alsubai
  - Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, P.O. Box 151, Al-Kharj 11942, Saudi Arabia
- Muhammad Umer
  - Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
|
5
|
Abstract
Missing values are a notable challenge when analyzing mass spectrometry-based proteomics data. While the field is still actively debating the best practices, the challenge increased with the emergence of mass spectrometry-based single-cell proteomics and the dramatic increase in missing values. A popular approach to deal with missing values is to perform imputation. Imputation has several drawbacks for which alternatives exist, but currently, imputation is still a practical solution widely adopted in single-cell proteomics data analysis. This perspective discusses the advantages and drawbacks of imputation. We also highlight 5 main challenges linked to missing value management in single-cell proteomics. Future developments should aim to solve these challenges, whether it is through imputation or data modeling. The perspective concludes with recommendations for reporting missing values, for reporting methods that deal with missing values, and for proper encoding of missing values.
Affiliation(s)
- Christophe Vanderaa
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium
- Laurent Gatto
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium
|
6
|
Stahlmann K, Reitsma JB, Zapf A. Missing values and inconclusive results in diagnostic studies - A scoping review of methods. Stat Methods Med Res 2023; 32:1842-1855. [PMID: 37559474 PMCID: PMC10540494 DOI: 10.1177/09622802231192954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/11/2023]
Abstract
Most diagnostic studies exclude missing values and inconclusive results from the analysis or apply simple methods resulting in biased accuracy estimates. This may be due to the lack of availability or awareness of appropriate methods. This scoping review aimed to provide an overview of strategies to handle missing values and inconclusive results in the reference standard or index test in diagnostic accuracy studies. Conducting a systematic literature search in MEDLINE, Cochrane Library, and Web of Science, we could identify many articles proposing methods for addressing missing values in the reference standard. There are also several articles describing methods regarding missing values or inconclusive results in the index test. The latter encompass imputation, frequentist and Bayesian likelihood, model-based, and latent class methods. While methods for missing values in the reference standard are regularly applied in practice, this is not true for methods addressing missing values and inconclusive results in the index test. Our comprehensive overview and description of available methods may raise further awareness of these methods and will enhance their application. Future research is needed to compare the performance of these methods under different conditions to give valid and robust recommendations for their usage in various diagnostic accuracy research scenarios.
Affiliation(s)
- Katharina Stahlmann
  - Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Germany
- Johannes B Reitsma
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, the Netherlands
- Antonia Zapf
  - Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Germany
|
7
|
Mayer I, Josse J. Generalizing treatment effects with incomplete covariates: Identifying assumptions and multiple imputation algorithms. Biom J 2023; 65:e2100294. [PMID: 36907999 DOI: 10.1002/bimj.202100294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2021] [Revised: 01/24/2023] [Accepted: 02/13/2023] [Indexed: 03/14/2023]
Abstract
We focus on the problem of generalizing a causal effect estimated on a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods such as inverse propensity sampling weighting are not designed to handle missing values, which are however common in both data sources. In addition to coupling the assumptions for causal effect identifiability and for the missing values mechanism and to defining appropriate estimation strategies, one difficulty is to consider the specific structure of the data with two sources and treatment and outcome only available in the RCT. We propose three multiple imputation strategies to handle missing values when generalizing treatment effects, each handling the multisource structure of the problem differently (separate imputation, joint imputation with fixed effect, joint imputation ignoring source information). As an alternative to multiple imputation, we also propose a direct estimation approach that treats incomplete covariates as semidiscrete variables. The multiple imputation strategies and the latter alternative rely on different sets of assumptions concerning the impact of missing values on identifiability. We discuss these assumptions and assess the methods through an extensive simulation study. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality in major trauma patients admitted to intensive care units. The analysis illustrates how the missing values handling can impact the conclusion about the effect generalized from the RCT to the target population.
Affiliation(s)
- Imke Mayer
  - Institute of Public Health, Charité - Universitätsmedizin, Berlin, Germany
  - PreMeDICaL, Inria Sophia-Antipolis, Montpellier, France
- Julie Josse
  - PreMeDICaL, Inria Sophia-Antipolis, Montpellier, France
|
8
|
Pandolfi S, Bartolucci F, Pennoni F. A hidden Markov model for continuous longitudinal data with missing responses and dropout. Biom J 2023; 65:e2200016. [PMID: 37035989 DOI: 10.1002/bimj.202200016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Revised: 01/04/2023] [Accepted: 01/10/2023] [Indexed: 04/11/2023]
Abstract
We propose a hidden Markov model for multivariate continuous longitudinal responses with covariates that accounts for three different types of missing pattern: (I) partially missing outcomes at a given time occasion, (II) completely missing outcomes at a given time occasion (intermittent pattern), and (III) dropout before the end of the period of observation (monotone pattern). The missing-at-random (MAR) assumption is formulated to deal with the first two types of missingness, while to account for the informative dropout, we rely on an extra absorbing state. Estimation of the model parameters is based on the maximum likelihood method that is implemented by an expectation-maximization (EM) algorithm relying on suitable recursions. The proposal is illustrated by a Monte Carlo simulation study and an application based on historical data on primary biliary cholangitis.
Affiliation(s)
- Silvia Pandolfi
  - Department of Economics, University of Perugia, Perugia, Italy
- Fulvia Pennoni
  - Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
|
9
|
Buczak P, Chen JJ, Pauly M. Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms. Entropy (Basel) 2023; 25:521. [PMID: 36981409 PMCID: PMC10048089 DOI: 10.3390/e25030521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 03/10/2023] [Accepted: 03/14/2023] [Indexed: 06/18/2023]
Abstract
Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.
Affiliation(s)
- Philip Buczak
  - Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany
- Jian-Jia Chen
  - Department of Computer Science, TU Dortmund University, 44227 Dortmund, Germany
- Markus Pauly
  - Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany
  - UA Ruhr, Research Center Trustworthy Data Science and Security, 44227 Dortmund, Germany
|
10
|
Buyukozkan M, Benedetti E, Krumsiek J. rox: A Statistical Model for Regression with Missing Values. Metabolites 2023; 13:127. [PMID: 36677052 PMCID: PMC9861384 DOI: 10.3390/metabo13010127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Revised: 11/15/2022] [Accepted: 11/17/2022] [Indexed: 01/18/2023] Open
Abstract
High-dimensional omics datasets frequently contain missing data points, which typically occur due to concentrations below the limit of detection (LOD) of the profiling platform. The presence of such missing values significantly limits downstream statistical analysis and result interpretation. Two common techniques to deal with this issue include the removal of samples with missing values and imputation approaches that substitute the missing measurements with reasonable estimates. Both approaches, however, suffer from various shortcomings and pitfalls. In this paper, we present "rox", a novel statistical model for the analysis of omics data with missing values without the need for imputation. The model directly incorporates missing values as "low" concentrations into the calculation. We show the superiority of rox over common approaches on simulated data and on six metabolomics datasets. Fully leveraging the information contained in LOD-based missing values, rox provides a powerful tool for the statistical analysis of omics data.
|
11
|
Chen X, Aljrees T, Umer M, Saidani O, Almuqren L, Mzoughi O, Ishaq A, Ashraf I. Cervical cancer detection using K nearest neighbor imputer and stacked ensemble learning model. Digit Health 2023; 9:20552076231203802. [PMID: 37799501 PMCID: PMC10548812 DOI: 10.1177/20552076231203802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 09/08/2023] [Indexed: 10/07/2023] Open
Abstract
Objective: Cervical cancer stands as a leading cause of mortality among women in developing nations. To ensure the reduction of its adverse consequences, the primary protocols to be adhered to involve early detection and treatment under the guidance of expert medical professionals. An effective approach for identifying this form of malignancy involves the examination of Pap smear images. However, in the context of automating cervical cancer detection, many of the existing datasets frequently exhibit missing data points, a factor that can substantially impact the effectiveness of machine learning models. Methods: In response to these hurdles, this research introduces an automated system designed to predict cervical cancer with a dual focus: adeptly managing missing data while attaining remarkable accuracy. The system's core is built upon a stacked ensemble voting classifier model, which amalgamates three distinct machine learning models, all harmoniously integrated with the KNN Imputer to address the issue of missing values. Results: The model put forth attains an accuracy of 99.41%, precision of 97.63%, recall of 95.96%, and an F1 score of 96.76% when incorporating the KNN imputation method. The investigation conducts a comparative analysis, contrasting the performance of this model with seven alternative machine learning algorithms in two scenarios: one where missing values are eliminated, and another employing KNN imputation. This study offers validation of the effectiveness of the proposed model in comparison to current state-of-the-art methodologies. Conclusions: This research delves into the challenge of handling missing data in the dataset utilized for cervical cancer detection. The findings have the potential to assist healthcare professionals in achieving early detection and enhancing the quality of care provided to individuals affected by cervical cancer.
Affiliation(s)
- Xiaoyuan Chen
  - Huzhou Key Laboratory of Green Energy Materials and Battery Cascade Utilization, School of Intelligent Manufacturing, Huzhou College, Huzhou, P.R. China
- Turki Aljrees
  - Department College of Computer Science and Engineering, University of Hafr Al-Batin, Hafar Al-Batin, Saudi Arabia
- Muhammad Umer
  - Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Oumaima Saidani
  - Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Latifah Almuqren
  - Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Olfa Mzoughi
  - Department of Computer Science, College of Sciences and Humanities-Aflaj, Prince Sattam bin Abdulaziz University, Aflaj, Saudi Arabia
- Abid Ishaq
  - Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Imran Ashraf
  - Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, South Korea
|
12
|
Witte J, Foraita R, Didelez V. Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data. Stat Med 2022; 41:4716-4743. [PMID: 35908775 DOI: 10.1002/sim.9535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Revised: 06/12/2022] [Accepted: 07/11/2022] [Indexed: 11/08/2022]
Abstract
Causal discovery algorithms estimate causal graphs from observational data. This can provide a valuable complement to analyses focusing on the causal relation between individual treatment-outcome pairs. Constraint-based causal discovery algorithms rely on conditional independence testing when building the graph. Until recently, these algorithms have been unable to handle missing values. In this article, we investigate two alternative solutions: test-wise deletion and multiple imputation. We establish necessary and sufficient conditions for the recoverability of causal structures under test-wise deletion, and argue that multiple imputation is more challenging in the context of causal discovery than for estimation. We conduct an extensive comparison by simulating from benchmark causal graphs: as one might expect, we find that test-wise deletion and multiple imputation both clearly outperform list-wise deletion and single imputation. Crucially, our results further suggest that multiple imputation is especially useful in settings with a small number of either Gaussian or discrete variables, but when the dataset contains a mix of both neither method is uniformly best. The methods we compare include random forest imputation and a hybrid procedure combining test-wise deletion and multiple imputation. An application to data from the IDEFICS cohort study on diet- and lifestyle-related diseases in European children serves as an illustrating example.
Affiliation(s)
- Janine Witte
  - Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany
  - Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
- Ronja Foraita
  - Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany
- Vanessa Didelez
  - Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany
  - Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
|
13
|
Gu Y, Preisser JS, Zeng D, Shrestha P, Shah M, Simancas-Pallares MA, Ginnis J, Divaris K. PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA. Ann Appl Stat 2022; 16:551-572. [PMID: 35356492 DOI: 10.1214/21-aoas1516] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/02/2023]
Abstract
Community water fluoridation is an important component of oral health promotion, as fluoride exposure is a well-documented dental caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals' fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches: two conventional spatial interpolation methods (inverse-distance weighting, IDW, and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN, and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.
Affiliation(s)
- Yu Gu
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
- John S Preisser
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
- Donglin Zeng
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
- Poojan Shrestha
- Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill; Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
- Molina Shah
- Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill
- Miguel A Simancas-Pallares
- Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill
- Jeannie Ginnis
- Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill
- Kimon Divaris
- Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill; Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
|
14
|
Alsaber A, Al-Herz A, Pan J, Al-Sultan AT, Mishra D. Handling missing data in a rheumatoid arthritis registry using random forest approach. Int J Rheum Dis 2021; 24:1282-1293. [PMID: 34382756 DOI: 10.1111/1756-185x.14203] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/14/2021] [Revised: 07/13/2021] [Accepted: 07/23/2021] [Indexed: 12/01/2022]
Abstract
Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce the power of statistical analysis, and can introduce bias if the cause of missing data is related to a patient's response to treatment. Multiple imputation provides a solution to predict the values of missing data. The main objective of this study is to estimate and impute missing values in patient records. Data from the Kuwait Registry for Rheumatic Diseases were used. Several methods were implemented to deal with missing data, and the best imputation method was chosen as the one with the lowest root mean square error (RMSE). Among 1735 rheumatoid arthritis patients, missing values ranged from 5% to 65.5% of the total observations. The results show that the sequential random forest method can estimate these missing values with a high level of accuracy. The RMSE varied between 2.5 and 5.0. missForest had the lowest imputation error for both continuous and categorical variables under each missing data rate (10%, 20%, and 30%) and had the smallest prediction error difference when the models used the imputed laboratory values.
Affiliation(s)
- Ahmad Alsaber
- Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK
- Adeeba Al-Herz
- Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Jiazhu Pan
- Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK
- Ahmad T Al-Sultan
- Department of Community Medicine and Behavioral Sciences, Kuwait University, Kuwait City, Kuwait
- Divya Mishra
- Department of Plant Pathology, Kansas State University, Kansas, MN, USA
-
- Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
|
15
|
Amro L, Pauly M, Ramosaj B. Asymptotic-based bootstrap approach for matched pairs with missingness in a single arm. Biom J 2021; 63:1389-1405. [PMID: 34240446 DOI: 10.1002/bimj.202000051] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2020] [Revised: 12/11/2020] [Accepted: 01/20/2021] [Indexed: 11/06/2022]
Abstract
Missing values are a common difficulty when dealing with paired data. Several test procedures have been developed in the literature to tackle this problem. Some of them are even robust under deviations and control type-I error quite accurately. However, most of these methods are not applicable when missing values are present only in a single arm. For this case, we provide asymptotically correct resampling tests that are robust under heteroskedasticity and skewed distributions. The tests are based on a meaningful restructuring of all observed information in quadratic form-type test statistics. An extensive simulation study is conducted exemplifying the tests for finite sample sizes under different missingness mechanisms. In addition, illustrative data examples based on real-life studies are analyzed.
Affiliation(s)
- Lubna Amro
- Mathematical Statistics and Applications in Industry, Faculty of Statistics, Technical University of Dortmund, Dortmund, Germany
- Markus Pauly
- Mathematical Statistics and Applications in Industry, Faculty of Statistics, Technical University of Dortmund, Dortmund, Germany
- Burim Ramosaj
- Mathematical Statistics and Applications in Industry, Faculty of Statistics, Technical University of Dortmund, Dortmund, Germany
|
16
|
Sendi P, Ramadani A, Bornstein MM. Prevalence of Missing Values and Protest Zeros in Contingent Valuation in Dental Medicine. Int J Environ Res Public Health 2021; 18:7219. [PMID: 34299670 DOI: 10.3390/ijerph18147219] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/13/2021] [Revised: 06/30/2021] [Accepted: 07/02/2021] [Indexed: 12/26/2022]
Abstract
Background: The number of contingent valuation (CV) studies in dental medicine using willingness-to-pay (WTP) methodology has substantially increased in recent years. Missing values due to absent information (i.e., missingness) or false information (i.e., protest zeros) are a common problem in WTP studies. The objective of this study is to evaluate the prevalence of missing values in CV studies in dental medicine, to assess how these have been dealt with, and to suggest recommendations for future research. Methods: We systematically searched electronic databases (MEDLINE, Web of Science, Cochrane Library, PROSPERO) on 8 June 2021, and hand-searched references of selected reviews. CV studies in clinical dentistry using WTP for valuing a good or service were included. Results: We included 49 WTP studies in our review. Out of these, 19 (38.8%) reported missing values due to absent information, and 28 (57.1%) reported zero values (i.e., WTP valued at zero). Zero values were further classified into true zeros (i.e., representing the underlying preference of the respondent) or protest zeros (i.e., false information as a protest behavior) in only 9 studies. Most studies used a complete case analysis to address missingness while only one study used multiple imputation. Conclusions: There is uncertainty in the dental literature on how to address missing values and zero values in CV studies. Zero values need to be classified as true zeros versus protest zeros with follow-up questions after the WTP elicitation procedure, and then need to be handled differently. Advanced statistical methods are available to address both missing values due to missingness and due to protest zeros but these are currently underused in dental medicine. Failing to appropriately address missing values in CV studies may lead to biased WTP estimates of dental interventions.
|
17
|
Abstract
Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.
Affiliation(s)
- Janine Egert
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
- Eva Brombacher
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Bettina Warscheid
- Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Clemens Kreutz
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
|
18
|
Dabke K, Kreimer S, Jones MR, Parker SJ. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets. J Proteome Res 2021; 20:3214-3229. [PMID: 33939434 DOI: 10.1021/acs.jproteome.1c00070] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 01/07/2023]
Abstract
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level, the fragment level, improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
Affiliation(s)
- Kruttika Dabke
- Center for Bioinformatics and Functional Genomics, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States; Graduate Program in Biomedical Sciences, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
- Simion Kreimer
- Advanced Clinical Biosystems Research Institute, Smidt Heart Institute, Departments of Cardiology and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
- Michelle R Jones
- Center for Bioinformatics and Functional Genomics, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
- Sarah J Parker
- Advanced Clinical Biosystems Research Institute, Smidt Heart Institute, Departments of Cardiology and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
|
19
|
Abstract
Purpose: The data in a patient's laboratory test result is a notable resource to support clinical investigation and enhance medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such a phenomenon reduces the degree to which this data can be utilized to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact that no approaches effectively leverage the contextual information (1) within individual or (2) between laboratory test variables. Method: We introduce an approach to combine an unsupervised prefilling strategy with a supervised machine learning approach, in the form of extreme gradient boosting (XGBoost), to leverage both types of context for imputation purposes. We evaluated the methodology through a series of experiments on approximately 8,200 patients' records in the MIMIC-III dataset. Result: The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model exhibits an imputation improvement of over 20%, on average. Conclusion: Missing data imputation on the temporal variables can be largely improved via a prefilling strategy and the supervised training technique, which leverages both the longitudinal and cross-sectional context simultaneously.
Affiliation(s)
- Chao Yan
- Vanderbilt University, Nashville, TN, USA
- Cheng Gao
- Vanderbilt University Medical Center, Nashville, TN, USA
- You Chen
- Vanderbilt University Medical Center, Nashville, TN, USA
|
20
|
Abstract
Isobaric labeling has the promise of combining high sample multiplexing with precise quantification. However, normalization issues and the missing value problem of complete n-plexes hamper quantification across more than one n-plex. Here, we introduce two novel algorithms implemented in MaxQuant that substantially improve the data analysis with multiple n-plexes. First, isobaric matching between runs makes use of the three-dimensional MS1 features to transfer identifications from identified to unidentified MS/MS spectra between liquid chromatography–mass spectrometry runs in order to utilize reporter ion intensities in unidentified spectra for quantification. On typical datasets, we observe a significant gain in MS/MS spectra that can be used for quantification. Second, we introduce a novel PSM-level normalization, applicable to data with and without the common reference channel. It is a weighted median-based method, in which the weights reflect the number of ions that were used for fragmentation. On a typical dataset, we observe complete removal of batch effects and dominance of the biological sample grouping after normalization. Furthermore, we provide many novel processing and normalization options in Perseus, the companion software for the downstream analysis of quantitative proteomics results. All novel tools and algorithms are available with the regular MaxQuant and Perseus releases, which are downloadable at http://maxquant.org.
Affiliation(s)
- Sung-Huan Yu
- Computational Systems Biochemistry, Max-Planck Institute of Biochemistry, Am Klopferspitz 18, Martinsried 82152, Germany
- Pelagia Kyriakidou
- Computational Systems Biochemistry, Max-Planck Institute of Biochemistry, Am Klopferspitz 18, Martinsried 82152, Germany
- Jürgen Cox
- Computational Systems Biochemistry, Max-Planck Institute of Biochemistry, Am Klopferspitz 18, Martinsried 82152, Germany; Department of Biological and Medical Psychology, University of Bergen, Jonas Liesvei 91, Bergen 5009, Norway
|
21
|
Abstract
High-throughput biological data, such as mass spectrometry (MS)-based proteomics data, suffer from systematic non-biological variance due to systematic errors. This hinders the estimation of "real" biological signals and, in turn, decreases the power of statistical tests and biases the identification of differentially expressed proteins. To remove such unintended variation, while retaining the biological signal of interest, analysis workflows for quantitative MS data typically comprise normalization prior to their statistical analysis. Several normalization methods, such as quantile normalization (QN), were originally developed for microarray data. In contrast to microarray data, proteomics data may contain features, in the form of protein intensities, that are consistently high across experimental conditions and, hence, are encountered in the tails of the protein intensity distribution. If QN is applied in the presence of such proteins, statistical inferences of the features' intensity profiles are impeded due to the biased estimation of their variance. A freely available, novel approach is introduced which serves as an improvement of the classical QN by preserving the biological signals of features in the tails of the intensity distribution and by accounting for sample-dependent missing values (MVs): the "tail-robust quantile normalization" (TRQN).
Affiliation(s)
- Eva Brombacher
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, 79104, Freiburg, Germany; Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, 79104, Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, 79104, Freiburg, Germany; German Cancer Consortium (DKTK), 79106, Freiburg, Germany; German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany
- Ariane Schad
- Center for Biosystems Analysis (ZBSA), University of Freiburg, 79104, Freiburg, Germany
- Clemens Kreutz
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, 79104, Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, 79104, Freiburg, Germany
|
22
|
Gillies CE, Jennaro TS, Puskarich MA, Sharma R, Ward KR, Fan X, Jones AE, Stringer KA. A Multilevel Bayesian Approach to Improve Effect Size Estimation in Regression Modeling of Metabolomics Data Utilizing Imputation with Uncertainty. Metabolites 2020; 10:E319. [PMID: 32781624 PMCID: PMC7465156 DOI: 10.3390/metabo10080319] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/15/2020] [Revised: 07/29/2020] [Accepted: 08/03/2020] [Indexed: 01/12/2023] Open
Abstract
To ensure scientific reproducibility of metabolomics data, alternative statistical methods are needed. A paradigm shift away from the p-value toward embracing uncertainty and interval estimation of a metabolite's true effect size may lead to improved study design and greater reproducibility. Multilevel Bayesian models are one approach that offers the added opportunity of incorporating imputed value uncertainty when missing data are present. We designed simulations of metabolomics data to compare multilevel Bayesian models to standard logistic regression with corrections for multiple hypothesis testing. Our simulations altered the sample size and the fraction of significant metabolites truly different between two outcome groups. We then introduced missingness to further assess model performance. Across simulations, the multilevel Bayesian approach more accurately estimated the effect size of metabolites that were significantly different between groups. Bayesian models also had greater power and mitigated the false discovery rate. In the presence of increased missing data, Bayesian models were able to accurately impute the true concentration, and incorporating the uncertainty of these estimates improved overall prediction. In summary, our simulations demonstrate that a multilevel Bayesian approach accurately quantifies the estimated effect size of metabolite predictors in regression modeling, particularly in the presence of missing data.
Affiliation(s)
- Christopher E. Gillies
- Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA;
- Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, MI 48109, USA;
- Michigan Institute for Data Science (MIDAS), Office of Research, University of Michigan, Ann Arbor, MI 48109, USA
- Theodore S. Jennaro
- Department of Clinical Pharmacy, College of Pharmacy, University of Michigan, Ann Arbor, MI 48109, USA;
- Michael A. Puskarich
- Department of Emergency Medicine, University of Minnesota, Minneapolis, MN 55455, USA;
- Ruchi Sharma
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA;
- Kevin R. Ward
- Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA;
- Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, MI 48109, USA;
- Michigan Institute for Data Science (MIDAS), Office of Research, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA;
- Xudong Fan
- Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, MI 48109, USA;
- Michigan Institute for Data Science (MIDAS), Office of Research, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA;
- Alan E. Jones
- Department of Emergency Medicine, University of Mississippi Medical Center, Jackson, MS 39216, USA;
- Kathleen A. Stringer
- Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, MI 48109, USA;
- The NMR Metabolomics Laboratory, Department of Clinical Pharmacy, College of Pharmacy, University of Michigan, Ann Arbor, MI 48109, USA
- Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
|
23
|
Kakileti ST, Manjunath G, Dekker A, Wee L. Robust Estimation of Breast Cancer Incidence Risk in Presence of Incomplete or Inaccurate Information. Asian Pac J Cancer Prev 2020; 21:2307-2313. [PMID: 32856859 PMCID: PMC7771951 DOI: 10.31557/apjcp.2020.21.8.2307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2020] [Indexed: 11/25/2022] Open
Abstract
Purpose: To evaluate the robustness of multiple machine learning classifiers for breast cancer risk estimation in the presence of incomplete or inaccurate information. Data and methods: Open data for this study was obtained from the BCSC Data Resource (http://breastscreening.cancer.gov/). We conducted two ablation-type experiments to compare the robustness of different classifiers where we randomly switched known information to missing with a missing probability of pm in one experiment, and randomly corrupted the existing information with a probability of pc in another experiment. We considered three prominent machine-learning classifiers such as Logistic regression (LR), Random Forests (RF) and a custom Neural Network (NN) architecture and compared their degradation of discrimination performance as a function of increasing probability of missing or inaccurate data. Results: LR, RF and custom NN resulted in an Area Under Curve (AUC) of 0.645, 0.643 and 0.649, respectively, on a test set with 500,000 total observations. When we manipulated the data by varying probabilities pm and pc from 0 to 1, NN resulted in better performance in terms of AUC compared to RF and LR as long as less than half the data was missing/inaccurate (that is, for values of pm < 0.5 and pc < 0.5). However, for missing (pm) or corruption (pc) probabilities above 0.5, LR gave similar performance as the custom NN. RF resulted in overall poorer performance when the data had additional missing or incorrect entries. Conclusion: In cases where the input information is missing or inaccurate, our experiments show that the proposed custom NN provides reliable risk estimates in medical datasets like BCSC. These results are particularly important in health care applications where not every attribute of the individual participant might be available.
Affiliation(s)
- Siva Teja Kakileti
- Niramai Health Analytix Pvt Ltd., Koramangala, Bangalore, Karnataka, India; Department of Radiation Oncology (MAASTRO Clinic), GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Geetha Manjunath
- Niramai Health Analytix Pvt Ltd., Koramangala, Bangalore, Karnataka, India
- Andre Dekker
- Department of Radiation Oncology (MAASTRO Clinic), GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Leonard Wee
- Department of Radiation Oncology (MAASTRO Clinic), GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre+, Maastricht, The Netherlands
|
24
|
Hossain T, Ahad MAR, Inoue S. A Method for Sensor-Based Activity Recognition in Missing Data Scenario. Sensors (Basel) 2020; 20:E3811. [PMID: 32650486 DOI: 10.3390/s20143811] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/24/2020] [Revised: 06/09/2020] [Accepted: 06/30/2020] [Indexed: 11/30/2022]
Abstract
Sensor-based human activity recognition has various applications in healthcare, elderly smart-home, sports, etc. Numerous works in this field recognize various human activities from sensor data. However, those works rely on clean data with almost no missing values, whereas missing data are a genuine concern for real-life healthcare centers. Therefore, to address this problem, we explored sensor-based activity recognition when some partial data were lost in a random pattern. In this paper, we propose a novel method to improve activity recognition in the presence of missing data without any data recovery. For the missing data pattern, we considered data to be missing in a random pattern, which is a realistic missing pattern for sensor data collection. Initially, we created different percentages of random missing data only in the test data, while the training was performed on good quality data. In our proposed approach, we explicitly induce different percentages of missing data randomly in the raw sensor data to train the model with missing data. Learning with missing data reinforces the model to regulate missing data during the classification of various activities that have missing data in the test module. This approach demonstrates the plausibility of the machine learning model, as it can learn and predict from an identical domain. We exploited several time-series statistical features to extract better features in order to comprehend various human activities. We explored both support vector machine and random forest as machine learning models for activity classification. We developed a synthetic dataset to empirically evaluate the performance and show that the method can effectively improve the recognition accuracy from 80.8% to 97.5%. Afterward, we tested our approach with activities from two challenging benchmark datasets: the human activity sensing consortium (HASC) dataset and single chest-mounted accelerometer dataset.
We examined the method for different missing percentages, varied window sizes, and diverse window sliding widths. Our explorations demonstrated improved recognition performances even in the presence of missing data. The achieved results provide persuasive findings on sensor-based activity recognition in the presence of missing data.
|
25
|
Liu M, Dongre A. Proper imputation of missing values in proteomics datasets for differential expression analysis. Brief Bioinform 2020; 22:5855395. [PMID: 32520347 DOI: 10.1093/bib/bbaa112] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 02/25/2020] [Revised: 04/16/2020] [Accepted: 05/11/2020] [Indexed: 01/01/2023] Open
Abstract
Label-free shotgun proteomics is an important tool in biomedical research, where tandem mass spectrometry with data-dependent acquisition (DDA) is frequently used for protein identification and quantification. However, DDA datasets contain a significant number of missing values (MVs) that severely hinder proper analysis. Existing literature suggests that different imputation methods should be used for the two types of MVs: missing completely at random or missing not at random. However, the simulated or biased datasets utilized by most such studies offer few clues about the composition and thus proper imputation of MVs in real-life proteomic datasets. Moreover, the impact of imputation methods on downstream differential expression analysis, a critical goal for many biomedical projects, is largely undetermined. In this study, we investigated public DDA datasets of various tissue/sample types to determine the composition of MVs in them. We then developed simulated datasets that imitate the MV profile of real-life datasets. Using such datasets, we compared the impact of various popular imputation methods on the analysis of differentially expressed proteins. Finally, we make recommendations on which imputation method(s) to use for proteomic data beyond just DDA datasets.
|
26
|
Abstract
Multivariate longitudinal data arising in medical studies often exhibit complex features such as censored responses, intermittent missing values, and atypical or outlying observations. The multivariate-t linear mixed model (MtLMM) has been recognized as a powerful tool for robust modeling of multivariate longitudinal data in the presence of potential outliers or fat-tailed noise. This paper presents a generalization of MtLMM, called the MtLMM-CM, to properly adjust for censorship due to detection limits of the assay and missingness embodied within multiple outcome variables recorded at irregular occasions. An expectation conditional maximization either (ECME) algorithm is developed to compute parameter estimates using the maximum likelihood (ML) approach. The asymptotic standard errors of the ML estimators of fixed effects are obtained by inverting the empirical information matrix according to Louis' method. The techniques for the estimation of random effects and imputation of missing responses are also investigated. The proposed methodology is illustrated on two real-world examples from HIV-AIDS studies and a simulation study under a variety of scenarios.
Affiliation(s)
- Tsung-I Lin
- Institute of Statistics, National Chung Hsing University, Taichung, Taiwan
- Department of Public Health, China Medical University, Taichung, Taiwan
- Wan-Lun Wang
- Department of Statistics, Graduate Institute of Statistics and Actuarial Science, Feng Chia University, Taichung, Taiwan
|
27
|
Awad A, Bader-El-Den M, McNicholas J, Briggs J, El-Sonbaty Y. Predicting hospital mortality for intensive care unit patients: Time-series analysis. Health Informatics J 2019; 26:1043-1059. [PMID: 31347428 DOI: 10.1177/1460458219850323] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/20/2022]
Abstract
Current mortality prediction models and scoring systems for intensive care unit patients are generally usable only after at least 24 or 48 h of admission, as some parameters are unclear at admission. However, some of the most relevant measurements are available shortly following admission. It is hypothesized that outcome prediction may be made using information available in the earliest phase of intensive care unit admission. This study aims to investigate how early hospital mortality can be predicted for intensive care unit patients. We conducted a thorough time-series analysis on the performance of different data mining methods during the first 48 h of intensive care unit admission. The results showed that the discrimination power of the machine-learning classification methods after 6 h of admission outperformed the main scoring systems used in intensive care medicine (Acute Physiology and Chronic Health Evaluation, Simplified Acute Physiology Score and Sequential Organ Failure Assessment) after 48 h of admission.
Affiliation(s)
- Aya Awad
- University of Portsmouth, UK; Arab Academy for Science and Technology, Egypt
|
28
|
Battey HS, Cox DR, Jackson MV. On the linear in probability model for binary data. R Soc Open Sci 2019; 6:190067. [PMID: 31218050 PMCID: PMC6549984 DOI: 10.1098/rsos.190067] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/11/2019] [Accepted: 04/05/2019] [Indexed: 05/23/2023]
Abstract
The analysis of binary response data commonly uses models linear in the logistic transform of probabilities. This paper considers some of the advantages and disadvantages of simple least-squares estimates based on a linear representation of the probabilities themselves, this in particular sometimes allowing a more direct empirical interpretation of underlying parameters. A sociological study is used in illustration.
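As a minimal illustration of a model that is linear in the probabilities themselves, the sketch below fits P(y = 1 | x) = intercept + slope * x by ordinary least squares. The function name and the small data set are hypothetical and are not taken from the paper; with a single binary predictor, the fitted slope is simply the difference between the two observed proportions, which is the kind of direct empirical interpretation the abstract refers to.

```python
def fit_linear_probability(x, y):
    """Ordinary least squares fit of P(y = 1 | x) = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: binary predictor and binary response.
exposure = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_linear_probability(exposure, outcome)
# intercept is the observed proportion of successes when exposure == 0;
# slope is the difference in observed proportions between the two groups.
print(intercept, slope)
```

One known drawback of this representation is that fitted values are not constrained to lie between 0 and 1, which the logistic transform avoids.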
Affiliation(s)
- H. S. Battey
- Department of Mathematics, Imperial College London, London, UK
- M. V. Jackson
- Department of Sociology, Stanford University, Stanford, CA, USA
|
29
|
Dong X, Chen C, Geng Q, Cao Z, Chen X, Lin J, Jin Y, Zhang Z, Shi Y, Zhang XD. An Improved Method of Handling Missing Values in the Analysis of Sample Entropy for Continuous Monitoring of Physiological Signals. Entropy (Basel) 2019; 21:e21030274. [PMID: 33266989 PMCID: PMC7514754 DOI: 10.3390/e21030274] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/11/2019] [Revised: 03/08/2019] [Accepted: 03/09/2019] [Indexed: 11/17/2022]
Abstract
Medical devices generate huge amounts of continuous time series data. However, missing values commonly found in these data can prevent us from directly using analytic methods such as sample entropy to reveal the information contained in these data. To minimize the influence of missing points on the calculation of sample entropy, we propose a new method to handle missing values in continuous time series data. We use both experimental and simulated datasets to compare the performance (in percentage error) of our proposed method with three currently used methods: skipping the missing values, linear interpolation, and bootstrapping. Unlike the methods that involve modifying the input data, our method modifies the calculation process. This keeps the data unchanged, which is less intrusive to the structure of the data. The results demonstrate that our method has a consistently lower average percentage error than the other three commonly used methods across multiple common physiological signals. Across common physiological signal types, data sizes, and missing-data generating mechanisms, our method extracts the information contained in continuously monitored data more accurately than traditional methods. It may therefore serve as an effective tool for handling missing values and may have broad utility in analyzing sample entropy for common physiological signals. This could help develop new tools for disease diagnosis and evaluation of treatment effects.
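The abstract does not spell out how the calculation is modified, so the following is only a sketch of one plausible approach, not the authors' procedure: the series is left untouched (missing points stay in place as `None`), and any template that contains a missing point is left out of the match counts. The function name, the `None` encoding, and the use of an absolute tolerance `r` (rather than a fraction of the standard deviation) are all assumptions made for illustration.

```python
import math


def sample_entropy_skip_missing(series, m=2, r=0.2):
    """Sample entropy where templates containing a missing point are ignored.

    series: list of floats, with None marking a missing observation.
    m: template length; r: absolute tolerance (Chebyshev distance).
    """
    n = len(series)

    def count_matches(length):
        # Same n - m starting points for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        # Keep the data unchanged; only drop templates that touch a gap.
        valid = [t for t in templates if all(v is not None for v in t)]
        count = 0
        for i in range(len(valid)):
            for j in range(i + 1, len(valid)):
                distance = max(abs(a - b) for a, b in zip(valid[i], valid[j]))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        return float("nan")   # entropy undefined without matches
    return -math.log(a / b)


# Hypothetical signal with one missing sample.
signal = [1.0, 1.0, None, 1.0, 1.0, 1.0, 1.0, 1.0]
entropy = sample_entropy_skip_missing(signal, m=2, r=0.1)
print(entropy)
```

Because one gap invalidates every template that overlaps it, the number of usable templates shrinks faster than the number of missing points, so a check on the remaining template count is advisable for heavily gapped signals.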
Affiliation(s)
- Xinzheng Dong
- School of Software Engineering, South China University of Technology, Guangzhou 510006, China
- Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai 519041, China
- Chang Chen
- Faculty of Health Sciences, University of Macau, Taipa, Macau 999078, China
- Qingshan Geng
- Guangdong General Hospital, Guangdong Academy of Medical Science, Guangzhou 510080, China
- Zhixin Cao
- Beijing Engineering Research Center of Diagnosis and Treatment of Respiratory and Critical Care Medicine, Beijing Chaoyang Hospital, Beijing 100043, China
- Xiaoyan Chen
- Department of Endocrinology, First Affiliated Hospital of Guangzhou Medical University, Guangzhou 510120, China
- Jinxiang Lin
- Department of Endocrinology, First Affiliated Hospital of Guangzhou Medical University, Guangzhou 510120, China
- Yu Jin
- Faculty of Health Sciences, University of Macau, Taipa, Macau 999078, China
- Zhaozhi Zhang
- School of Law, Washington University, St. Louis, MO 63130, USA
- Yan Shi
- Beijing Engineering Research Center of Diagnosis and Treatment of Respiratory and Critical Care Medicine, Beijing Chaoyang Hospital, Beijing 100043, China
- Department of Mechanical and Electronic Engineering, Beihang University, Beijing 100191, China
- Xiaohua Douglas Zhang
- Faculty of Health Sciences, University of Macau, Taipa, Macau 999078, China
- Correspondence: ; Tel: +853-8822-4813
|
30
|
Li L, Lee JH, Sutton SK, Simmons VN, Brandon TH. A Bayesian transition model for missing longitudinal binary outcomes and an application to a smoking cessation study. STAT MODEL 2019; 20:310-338. [PMID: 33854408 DOI: 10.1177/1471082x18821489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Smoking cessation intervention studies often produce data on smoking status at discrete follow-up assessments, often with missing data in different amounts at each assessment. Smoking status in these studies is a dynamic process with individuals transitioning from smoking to abstinent, as well as abstinent to smoking, at different times during the intervention. Directly assessing transitions provides an opportunity to answer important questions like 'Does the proposed intervention help smokers remain abstinent or quit smoking more effectively than other interventions?' In this article, we model changes in smoking status and examine how interventions and other covariates affect the transitions. We propose a Bayesian approach for fitting the transition model to the observed data and impute missing outcomes based on a logistic model, which accounts for both missing at random (MAR) and missing not at random (MNAR) mechanisms. The proposed Bayesian approach treats missing data as additional unknown quantities and samples them from their posterior distributions. The performance of the proposed method is investigated through simulation studies and illustrated by data from a randomized controlled trial of smoking cessation interventions. Finally, posterior predictive checking and log pseudo marginal likelihood (LPML) are used to assess model assumptions and perform model comparisons, respectively.
Affiliation(s)
- Li Li
- Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM, USA
- Ji-Hyun Lee
- Division of Quantitative Sciences, University of Florida Health Cancer Center; Department of Biostatistics, University of Florida, Gainesville, FL, USA
- Steven K Sutton
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Vani N Simmons
- Department of Health Outcomes and Behaviour, Moffitt Cancer Center, Tampa, FL, USA
- Thomas H Brandon
- Department of Health Outcomes and Behaviour, Moffitt Cancer Center, Tampa, FL, USA
|
31
|
Pohl S, von Davier M. Commentary: On the Importance of the Speed-Ability Trade-Off When Dealing With Not Reached Items. Front Psychol 2018; 9:1988. [PMID: 30425667 PMCID: PMC6218577 DOI: 10.3389/fpsyg.2018.01988] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/31/2018] [Accepted: 09/27/2018] [Indexed: 11/13/2022] Open
Affiliation(s)
- Steffi Pohl
- Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
|
32
|
Abstract
This study assessed the completeness of the Major Trauma Registry of Navarra (MTR-N) data and their concordance with the patients' medical files. It retrospectively reviewed all the MTR-N cases documented in June and July of 2014 and 2015. For each case, the values of 42 parameters were taken from the MTR-N. To assess concordance between the MTR-N and medical files, the same variables' values were re-recorded. Data completeness was calculated for all cases and data correctness for those documented in the MTR-N, separately for each variable. The overall average completeness rate for all variables was 92.8%. The percentages of completely missing data ranged from 0% (29 variables) to 76.8% (base excess). The overall average rate of correctness was 98.0%. Exact concordance ranged from 93.0% (7 variables) to 100% (22 variables). This study demonstrates the reliability and validity of the MTR-N data and its effectiveness for quality improvement and research in our community.
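The two per-variable rates described above can be sketched as follows. This is an illustrative reading of the abstract rather than the study's actual code: completeness is taken as the share of cases with a recorded value, and correctness as the share of recorded values that agree exactly with the medical file. The function names, the `None` encoding for missing entries, and the example values are hypothetical.

```python
def completeness_rate(registry_values):
    """Percentage of cases with a documented (non-missing) registry value."""
    documented = sum(v is not None for v in registry_values)
    return 100.0 * documented / len(registry_values)


def correctness_rate(registry_values, chart_values):
    """Percentage of documented registry values that match the medical file."""
    documented = [
        (reg, chart)
        for reg, chart in zip(registry_values, chart_values)
        if reg is not None
    ]
    matches = sum(reg == chart for reg, chart in documented)
    return 100.0 * matches / len(documented)


# Hypothetical values of one variable for four cases.
registry = [120, None, 95, 110]
medical_file = [120, 130, 95, 115]

print(completeness_rate(registry))               # 3 of 4 cases documented
print(correctness_rate(registry, medical_file))  # 2 of 3 documented values agree
```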
Affiliation(s)
- Bismil Ali Ali
- Accident & Emergency Department, Complejo Hospitalario de Navarra, Health Service of Navarra - Osasunbidea, Pamplona, Spain
- Rolf Lefering
- Institute for Research in Operative Medicine (IFOM), University of Witten/Herdecke, Cologne, Germany
- Tomas Belzunegui Otano
- Accident & Emergency Department, Complejo Hospitalario de Navarra, Health Service of Navarra - Osasunbidea, Pamplona, Spain
- Department of Health, Public University of Navarra, Pamplona, Spain
|
33
|
Rosinska M, Pantazis N, Janiec J, Pharris A, Amato-Gauci AJ, Quinten C. Potential adjustment methodology for missing data and reporting delay in the HIV Surveillance System, European Union/European Economic Area, 2015. Euro Surveill 2018; 23:1700359. [PMID: 29897039 PMCID: PMC6152165 DOI: 10.2807/1560-7917.es.2018.23.23.1700359] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 01/22/2023] Open
Abstract
Accurate case-based surveillance data remain the key data source for estimating HIV burden and monitoring prevention efforts in Europe. We carried out a literature review and exploratory analysis of surveillance data regarding two crucial issues affecting European surveillance for HIV: missing data and reporting delay. Initial screening showed substantial variability of these data issues, both in time and across countries. In terms of missing data, the CD4+ cell count is the most problematic variable because of the high proportion of missing values. In 20 of 31 countries of the European Union/European Economic Area (EU/EEA), CD4+ counts are systematically missing for all or some years. One of the key challenges related to reporting delays is that countries undertake specific one-off actions in an effort to capture previously unreported cases, and that these cases are subsequently reported with excessive delays. Slightly different underlying assumptions and effectively different models may be required for individual countries to adjust for missing data and reporting delays. However, using a similar methodology is recommended to foster harmonisation and to improve the accuracy and usability of HIV surveillance data at national and EU/EEA levels.
Affiliation(s)
- Magdalena Rosinska
- National Institute of Public Health – National Institute of Hygiene, Warsaw, Poland
- Nikos Pantazis
- National and Kapodistrian University of Athens, Athens, Greece
- Janusz Janiec
- National Institute of Public Health – National Institute of Hygiene, Warsaw, Poland
- Anastasia Pharris
- European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden
- Chantal Quinten
- European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden
|
34
|
Ondeck NT, Fu MC, Skrip LA, McLynn RP, Su EP, Grauer JN. Treatments of Missing Values in Large National Data Affect Conclusions: The Impact of Multiple Imputation on Arthroplasty Research. J Arthroplasty 2018; 33:661-7. [PMID: 29153865 DOI: 10.1016/j.arth.2017.10.034] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 09/10/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. METHODS Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. RESULTS A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. CONCLUSION The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research.
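To make the cost of complete case analysis concrete, the sketch below shows the filtering step on a few hypothetical patient records and then reproduces the arithmetic from the reported counts (6117 patients, 3467 removed). The record layout and function name are assumptions for illustration; the study's own data handling and its multiple imputation procedure are not shown here.

```python
def complete_cases(records):
    """Keep only records in which every field has a value (None = missing)."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical records with the two laboratory values named in the abstract.
patients = [
    {"albumin": 4.1, "hematocrit": 41.0},
    {"albumin": None, "hematocrit": 39.5},
    {"albumin": 3.6, "hematocrit": None},
    {"albumin": 3.9, "hematocrit": 43.2},
]
kept = complete_cases(patients)
print(len(kept), "of", len(patients), "records survive complete case analysis")

# Arithmetic on the counts reported in the abstract.
total_patients = 6117
removed = 3467
retained = total_patients - removed
share_removed = 100.0 * removed / total_patients
print(retained, round(share_removed, 1))
```

The share removed agrees with the 56.7% of patients reported as missing at least one value, so fewer than half of the cohort would remain for a complete case regression.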
|
35
|
Ebert JF, Huibers L, Christensen B, Christensen MB. Paper- or Web-Based Questionnaire Invitations as a Method for Data Collection: Cross-Sectional Comparative Study of Differences in Response Rate, Completeness of Data, and Financial Cost. J Med Internet Res 2018; 20:e24. [PMID: 29362206 PMCID: PMC5801515 DOI: 10.2196/jmir.8353] [Citation(s) in RCA: 171] [Impact Index Per Article: 28.5] [Received: 07/05/2017] [Revised: 10/12/2017] [Accepted: 11/16/2017] [Indexed: 11/27/2022] Open
Abstract
Background Paper questionnaires have traditionally been the first choice for data collection in research. However, declining response rates over the past decade have increased the risk of selection bias in cross-sectional studies. The growing use of the Internet offers new ways of collecting data, but trials using Web-based questionnaires have so far seen mixed results. A secure, online digital mailbox (e-Boks) linked to a civil registration number became mandatory for all Danish citizens in 2014 (exemption granted only in extraordinary cases). Approximately 89% of the Danish population have a digital mailbox, which is used for correspondence with public authorities. Objective We aimed to compare response rates, completeness of data, and financial costs for different invitation methods: traditional surface mail and digital mail. Methods We designed a cross-sectional comparative study. An invitation to participate in a survey on help-seeking behavior in out-of-hours care was sent to two groups of randomly selected citizens from age groups 30-39 and 50-59 years and parents of children aged 0-4 years using either traditional surface mail (paper group) or digital mail sent to a secure online mailbox (digital group). Costs per respondent were measured by adding up all costs for handling, dispatch, printing, and work salary and then dividing the total figure by the number of respondents. Data completeness was assessed by comparing the number of missing values between the two methods. Socioeconomic variables (age, gender, family income, education duration, immigrant status, and job status) were compared both between respondents and nonrespondents and within these groups to evaluate the degree of selection bias. Results A total of 3600 citizens were invited in each group; 1303 (36.29%) responded to the digital invitation and 1653 (45.99%) to the paper invitation (difference 9.66%, 95% CI 7.40-11.92). The costs were €1.51 per respondent for the digital group and €15.67 per respondent for the paper group. Paper questionnaires generally had more missing values; this was significant in five of 17 variables (P<.05). Substantial differences were found in the socioeconomic variables between respondents and nonrespondents, whereas only minor differences were seen within the groups of respondents and nonrespondents. Conclusions Although we found lower response rates for Web-based invitations, this solution was more cost-effective (by a factor of 10) and had slightly lower numbers of missing values than questionnaires sent with paper invitations. Analyses of socioeconomic variables showed almost no difference between nonrespondents in both groups, which could imply that the lower response rate in the digital group does not necessarily increase the level of selection bias. Invitations to questionnaire studies via digital mail may be an excellent option for collecting research data in the future. This study may serve as the foundational pillar of digital data collection in health care research in Scandinavia and other countries considering implementing similar systems.
Affiliation(s)
- Jonas Fynboe Ebert
- Department of Public Health, Research Unit for General Practice, Aarhus University, Aarhus, Denmark
- Linda Huibers
- Department of Public Health, Research Unit for General Practice, Aarhus University, Aarhus, Denmark
- Bo Christensen
- Department of Public Health, Section for General Medical Practice, Aarhus University, Aarhus, Denmark
- Morten Bondo Christensen
- Department of Public Health, Research Unit for General Practice, Aarhus University, Aarhus, Denmark
|
36
|
Cahsai A, Anagnostopoulos C, Triantafillou P. Scalable Data Quality for Big Data: The Pythia Framework for Handling Missing Values. Big Data 2015; 3:159-172. [PMID: 27442958 DOI: 10.1089/big.2015.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user community continuously grow, the problem can only be exacerbated. In an attempt to deal with such a problem, in our previous work, we introduced a novel framework coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, the Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts to perform locally a missing value substitution algorithm (MVA). This selection relies on the principle that a particular subset of cohorts maintains the most relevant partition of the dataset. In addition, as Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla), which uses the entire massive dataset to compute imputation requests. Although this article is an extension of our previous work, we particularly investigate the robustness of the Pythia framework and show that the Pythia is independent from any MVA and signature construction algorithms. In order to facilitate our research, we considered two well-known MVAs (namely K-nearest neighbor and expectation-maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of the Pythia against Godzilla and showcase the benefits stemming from this framework.
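For readers unfamiliar with K-nearest neighbor imputation, one of the two MVAs mentioned above, here is a minimal single-machine sketch. It is not the Pythia framework and makes several simplifying assumptions: missing entries are encoded as `None`, neighbors are drawn only from fully observed rows, distance is Euclidean over the columns observed in the incomplete row, and the function name is hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if all(v is not None for v in r)]
    if not complete:
        raise ValueError("at least one complete row is needed as a donor")

    imputed = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(donor, row=row, observed=observed):
            # Compare only on the columns observed in the incomplete row.
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = list(row)
        for j in missing:
            filled[j] = sum(nb[j] for nb in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed


# Hypothetical data: the last row is missing its second value.
data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
result = knn_impute(data, k=2)
print(result[3])
```

Each imputation request scans every complete row, which illustrates why the cost of such methods grows with the volume and dimension of the data.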
Affiliation(s)
- Atoshum Cahsai
- School of Computing Science, University of Glasgow, Glasgow, United Kingdom
- Peter Triantafillou
- School of Computing Science, University of Glasgow, Glasgow, United Kingdom
|
37
|
Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Stat Med 2015; 34:1818-33. [PMID: 25656439 PMCID: PMC4393794 DOI: 10.1002/sim.6454] [Citation(s) in RCA: 96] [Impact Index Per Article: 10.7] [Received: 02/11/2014] [Revised: 11/13/2014] [Accepted: 01/20/2015] [Indexed: 12/13/2022]
Abstract
In the fight against hard-to-treat diseases such as cancer, it is often difficult to discover new treatments that benefit all subjects. For regulatory agency approval, it is more practical to identify subgroups of subjects for whom the treatment has an enhanced effect. Regression trees are natural for this task because they partition the data space. We briefly review existing regression tree algorithms. Then, we introduce three new ones that are practically free of selection bias and are applicable to data from randomized trials with two or more treatments, censored response variables, and missing values in the predictor variables. The algorithms extend the generalized unbiased interaction detection and estimation (GUIDE) approach by using three key ideas: (i) treatment as a linear predictor, (ii) chi-squared tests to detect residual patterns and lack of fit, and (iii) proportional hazards modeling via Poisson regression. Importance scores with thresholds for identifying influential variables are obtained as by-products. A bootstrap technique is used to construct confidence intervals for the treatment effects in each node. The methods are compared using real and simulated data.
Affiliation(s)
- Wei-Yin Loh
- Department of Statistics, University of Wisconsin, Madison, WI 53706, U.S.A.
- Xu He
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Michael Man
- Eli Lilly and Company, Indianapolis, IN 46285, U.S.A.
|
38
|
Goetz CG, Luo S, Wang L, Tilley BC, LaPelle NR, Stebbins GT. Handling missing values in the MDS-UPDRS. Mov Disord 2015; 30:1632-8. [PMID: 25649812 PMCID: PMC5072275 DOI: 10.1002/mds.26153] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 10/03/2014] [Revised: 12/16/2014] [Accepted: 12/18/2014] [Indexed: 11/10/2022] Open
Abstract
This study was undertaken to define the number of missing values permissible to render valid total scores for each Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) part. To handle missing values, imputation strategies serve as guidelines to reject an incomplete rating or create a surrogate score. We tested a rigorous, scale-specific, data-based approach to handling missing values for the MDS-UPDRS. From two large MDS-UPDRS datasets, we sequentially deleted item scores, either consistently (same items) or randomly (different items) across all subjects. Lin's Concordance Correlation Coefficient (CCC) compared scores calculated without missing values with prorated scores based on sequentially increasing missing values. The maximal number of missing values retaining a CCC greater than 0.95 determined the threshold for rendering a valid prorated score. A second confirmatory sample was selected from the MDS-UPDRS international translation program. To provide valid part scores applicable across all Hoehn and Yahr (H&Y) stages when the same items are consistently missing, one missing item from Part I, one from Part II, three from Part III, but none from Part IV can be allowed. To provide valid part scores applicable across all H&Y stages when random item entries are missing, one missing item from Part I, two from Part II, seven from Part III, but none from Part IV can be allowed. All cutoff values were confirmed in the validation sample. These analyses are useful for constructing valid surrogate part scores for MDS-UPDRS when missing items fall within the identified threshold and give scientific justification for rejecting partially completed ratings that fall below the threshold.
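The comparison described above rests on two quantities: a prorated part score and Lin's Concordance Correlation Coefficient. The sketch below assumes the common proration rule (mean of the answered items multiplied by the number of items in the part), which the abstract does not state explicitly, and uses Lin's standard formula with variances and covariance divided by n. Function names and example values are hypothetical.

```python
def prorated_score(items, n_items):
    """Mean of the answered items scaled up to the full number of items."""
    answered = [v for v in items if v is not None]
    if not answered:
        raise ValueError("cannot prorate a part with no answered items")
    return sum(answered) / len(answered) * n_items


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical four-item part with one missing item.
print(prorated_score([2, None, 4, 3], n_items=4))

# Identical scores agree perfectly; a constant shift lowers the CCC even
# though the ordinary correlation would still be 1.
full_scores = [1.0, 2.0, 3.0]
print(lins_ccc(full_scores, [1.0, 2.0, 3.0]))
print(lins_ccc(full_scores, [2.0, 3.0, 4.0]))
```

Unlike Pearson's correlation, the CCC penalizes systematic offsets between the complete and prorated scores, which is why it suits a threshold such as the 0.95 cutoff used in the study.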
Affiliation(s)
- Christopher G Goetz
- Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
- Sheng Luo
- Division of Biostatistics, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Lu Wang
- Division of Biostatistics, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Barbara C Tilley
- Division of Biostatistics, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Nancy R LaPelle
- Division of Preventive and Behavioral Medicine, University of Massachusetts, Worcester, MA, USA
- Glenn T Stebbins
- Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
|
39
|
Abstract
The missing data problem degrades the statistical power of any analysis made in clinical studies. To infer valid results from such studies, a suitable method is required to replace the missing values. No method is universally applicable for handling missing values, and the main objective of this paper is to introduce a common method applicable in all cases of missing data. In this paper, a Bayesian Genetic Algorithm (BGA) is proposed to effectively impute both continuous and discrete missing values using a heuristic search algorithm, the genetic algorithm, together with Bayes' rule. BGA is applied to impute missing values in a real cancer dataset under Missing At Random (MAR) and Missing Completely At Random (MCAR) conditions. For both discrete and continuous attributes, the results show better classification accuracy and RMSE% than many existing methods.
Affiliation(s)
- R Devi Priya
- Kongu Engineering College, Erode 638 052, Tamil Nadu, India
- S Kuppuswami
- Kongu Engineering College, Erode 638 052, Tamil Nadu, India
|
40
|
Abstract
PURPOSE To clarify the effects of missing values due to behavioral and psychological symptoms in dementia (BPSD) in Alzheimer's disease (AD) patients on neuropsychological tests, this study describes the pattern of missing values due to BPSD and its influence on the tests. MATERIALS AND METHODS Drug-naïve probable AD patients with BPSD (n=127) and without BPSD (n=32) were assessed with the Seoul Neuropsychological Screening Battery, including measures of memory, intelligence, and executive functioning. Moreover, patients were rated on the Korean Neuropsychiatry Inventory (K-NPI). RESULTS The more severe the K-NPI score, the fewer neuropsychological tests were assessable, leading to many missing values. Patients with BPSD were more severely demented than those without BPSD. K-NPI scores were significantly correlated with the number of missing values. The effect of BPSD was largest for tests measuring frontal functions. The replacement of the missing values due to BPSD by the lowest observed score also showed the largest effect on tests of frontal function. CONCLUSION The global cognitive and behavior scales are related to missing values. Among K-NPI sub-domains, delusion, depression, apathy, and aberrant motor behavior are significantly correlated with missing values. Data imputation of missing values due to BPSD provides a more differentiated picture of cognitive deficits in AD with BPSD.
Affiliation(s)
- Yong Tae Kwak
- Department of Neurology, Hyoja Geriatric Hospital, 1-30 Jungbu-daero 874beon-gil, Giheung-gu, Yongin 446-512, Korea.
|
41
|
Abstract
Research on the prevention and management of chronic illnesses involves understanding changes in complex and interrelated aspects of each individual. To capture these changes or to control for them, nursing and health research needs to be longitudinal. However, there are a number of potential pitfalls in analyzing longitudinal data from a chronically ill population. This paper will examine four major pitfalls: selection of time points, measurement, choosing appropriate statistical procedures, and missing values. Although the frequency of data collection is often driven primarily by practical concerns, it will affect the results. In addition, outcome measures may capture different constructs at different points in time. Traditional analysis techniques often have assumptions about data characteristics that are violated in clinical populations. Missing values are common in research with chronically ill individuals because of problems of subject retention and because individuals have frequent medical complications. Solutions to these pitfalls are also discussed.
|
42
|
Plachta-Danielzik S, Bartel C, Raspe H, Thyen U, Landsberg B, Müller MJ. Assessment of representativity of a study population - experience of the Kiel Obesity Prevention Study (KOPS). Obes Facts 2008; 1:325-30. [PMID: 20054196 PMCID: PMC6452140 DOI: 10.1159/000176609] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
OBJECTIVE Using data from the Kiel Obesity Prevention Study (KOPS) as an example, different methods to control for response bias and to assess representativity were compared. METHODS Cross-sectional data of 4,997 5- to 7-year-old German children (main cohort) were collected between 1996 and 2001 within the school entry examination. A subgroup responded to a questionnaire on socio-demographic and lifestyle factors (responders, n = 2,631). Representativity of the main cohort was tested in comparison to the total population. To control for response bias within the responders, a non-response analysis as well as an analysis of missing values were performed. RESULTS The comparison with the total population showed a higher prevalence of obese boys and girls from families of low socio-economic status (SES) within the main cohort. The responders were less frequently obese and overweight and more rarely belonged to low SES families when compared with non-responders. Analysis of missing values did not detect any further biases. According to an epidemiological assessment of differences, the main cohort of KOPS is suggested to be representative for all 5- to 7-year-old children in Kiel, whereas the responders can be at best called 'relatively' representative. CONCLUSION The analysis of non-response is the most sensitive method to detect group differences, but a comparison with the total population can also be used to control for biases. In addition, representativity has to be proven not only for the main cohort but also for the subgroup of responders with which data analysis will be done.
Affiliation(s)
- Sandra Plachta-Danielzik
- Institut für Humanernährung und Lebensmittelkunde, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
- Carmen Bartel
- Institut für Sozialmedizin, Campus Lübeck, Lübeck, Germany
- Heiner Raspe
- Institut für Sozialmedizin, Campus Lübeck, Lübeck, Germany
- Ute Thyen
- Institut für Kinder- und Jugendheilkunde, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Beate Landsberg
- Institut für Humanernährung und Lebensmittelkunde, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
- Manfred James Müller
- Institut für Humanernährung und Lebensmittelkunde, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
- *Prof. Dr. Manfred James Müller, Institut für Humanernährung und Lebensmittelkunde, Christian-Albrechts-Universität zu Kiel, Düsternbrooker Weg 17, 24105 Kiel, Germany, Tel.: +49 431 880 56-70, Fax -79,
|
43
|
Velden MVD, Bijmolt THA. Generalized canonical correlation analysis of matrices with missing rows: a simulation study. Psychometrika 2006; 71:323-331. [PMID: 28197957 DOI: 10.1007/s11336-004-1168-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/11/2006] [Accepted: 06/09/2006] [Indexed: 06/06/2023]
Abstract
A method is presented for generalized canonical correlation analysis of two or more matrices with missing rows. The method is a combination of Carroll's (1968) method and the missing data approach of the OVERALS technique (Van der Burg, 1988). In a simulation study we assess the performance of the method and compare it to an existing procedure called GENCOM, proposed by Green and Carroll (1988). We find that the proposed method outperforms the GENCOM algorithm both with respect to model fit and recovery of the true structure.
Affiliation(s)
- Michel van de Velden
- Erasmus University Rotterdam, Rotterdam.
- Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, The Netherlands.
|
44
|
Abstract
Standard statistical analyses of randomized controlled trials with partially missing outcome data often exclude valuable information from individuals with incomplete follow-up. This may lead to biased estimates of the intervention effect and loss of precision. We consider a randomized trial with a repeatedly measured outcome, in which the value of the outcome on the final occasion is of primary interest. We propose a modelling strategy in which the model is successively extended to include baseline values of the outcome, then intermediate values of the outcome, and finally values of other outcome variables. Likelihood-based estimation of random effects models is used, allowing the incorporation of data from individuals with some missing outcomes. Each estimated intervention effect is free of non-response bias under a different missing-at-random assumption. These assumptions become more plausible as the more complex models are fitted, so we propose using the trend in estimated intervention effects to assess the nature of any non-response bias. The methods are applied to data from a trial comparing intensive case management with standard case management for severely psychotic patients. All models give similar estimates of the intervention effect and we conclude that non-response bias is likely to be small.
Affiliation(s)
- Ian R White
- MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK.
|