1. Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN. Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinformatics 2017; 18:114. PMID: 28219348; PMCID: PMC5319174; DOI: 10.1186/s12859-017-1547-6.
Abstract
BACKGROUND High throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise for both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which can alter the results of downstream analyses. RESULTS Here we present a modified version of the K-nearest neighbor (KNN) approach that accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches, such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the limit of detection (LOD). The effectiveness of each approach was analyzed using the root mean square error (RMSE) as well as the metabolite list concordance index (MLCI), which measures the influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values than the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero, or the LOD. MLCI values for KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. CONCLUSION Our findings demonstrate that KNN-TN generally imputes the missing values of the different datasets better than KNN-CR and KNN-EU when missingness arises from a missing-at-random mechanism combined with an LOD. The results shown in this study are from the field of metabolomics, but the method is applicable to any high throughput technology with missingness due to an LOD.
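The core KNN idea behind this paper can be illustrated with a short sketch. This is not the authors' full KNN-TN procedure, which first standardizes each metabolite using truncated-normal estimates of its mean and SD (truncation at the LOD) before computing distances; the function name and the simple nearest-rows scheme here are illustrative assumptions.

```python
import numpy as np

def knn_impute(X, k=3):
    """Impute each NaN with the mean of that column across the k nearest
    rows, where distance is the root mean squared difference over the
    columns both rows observe. Simplified sketch: KNN-TN additionally
    standardizes each metabolite via truncated-normal estimates first."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i in range(X.shape[0]):
        missing = np.where(np.isnan(X[i]))[0]
        if missing.size == 0:
            continue
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists.append((np.sqrt(np.mean(diff ** 2)), j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        for c in missing:
            vals = [X[j, c] for j in neighbors if not np.isnan(X[j, c])]
            if vals:
                out[i, c] = np.mean(vals)
    return out
```

For example, a row missing one metabolite borrows the mean of that metabolite from its k most similar rows, with similarity computed only over jointly observed columns.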
2. Liu M, Li S, Yuan H, Ong MEH, Ning Y, Xie F, Saffari SE, Shang Y, Volovici V, Chakraborty B, Liu N. Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques. Artif Intell Med 2023; 142:102587. PMID: 37316097; DOI: 10.1016/j.artmed.2023.102587.
Abstract
OBJECTIVE The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on data types, intending to assist healthcare researchers from various disciplines in dealing with missing data. MATERIALS AND METHODS We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published before February 8, 2023 that described the use of DL-based models for imputation. We examined the selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. RESULTS Out of 1822 articles, 111 were included, of which tabular static data (29%, 32/111) and temporal data (40%, 44/111) were the most frequently investigated. Our findings revealed a discernible pattern in the choice of model backbones for particular data types, for example, the dominance of autoencoders and recurrent neural networks for tabular temporal data. A discrepancy in imputation strategy usage among data types was also observed. The "integrated" imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52%, 23/44) and multi-modal data (56%, 5/9). Moreover, DL-based imputation methods yielded higher imputation accuracy than non-DL methods in most studies. CONCLUSION DL-based imputation models are a family of techniques with diverse network structures, and their design in healthcare applications is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, they can achieve satisfactory results for a particular data type or dataset. There are, however, still issues of portability, interpretability, and fairness associated with current DL-based imputation models.
3. Das S, Sil J. Managing uncertainty in imputing missing symptom value for healthcare of rural India. Health Inf Sci Syst 2019; 7:5. PMID: 30863541; DOI: 10.1007/s13755-019-0066-4.
Abstract
Purpose In India, 67% of the total population lives in remote areas, where providing primary healthcare is a real challenge due to the scarcity of doctors. Health kiosks are deployed in remote villages, and basic health data such as blood pressure, pulse rate, height and weight, BMI, and oxygen saturation level (SpO2) are collected. The acquired data are often imprecise due to measurement error and contain missing values. The paper proposes a comprehensive framework to impute missing symptom values by managing the uncertainty present in the data set. Methods The data sets are fuzzified to manage uncertainty, and the fuzzy c-means clustering algorithm is applied to group the symptom feature vectors into different disease classes. The missing symptom values corresponding to each disease are imputed using multiple fuzzy regression models. Relations between different symptoms are framed with the help of experts and the medical literature. The blood pressure symptom is handled with a novel approach because its characteristics differ from those of the other symptoms. Patients' records obtained from the kiosks are not adequate in number, so additional data are simulated by the Monte Carlo method to avoid over-fitting while imputing missing symptom values. The generated datasets are verified using the Kullback-Leibler (K-L) distance and distance correlation (dCor), showing that the simulated data sets are well correlated with the real data set. Results Using these data sets, the proposed model is built, and new patients are provisionally diagnosed using a softmax cost function. Multiple class labels (diseases) are determined with about 98% accuracy, verified against the ground truth provided by the experts. Conclusions It is worth mentioning that the system is intended for primary healthcare; in emergency cases, patients are referred to the experts.
4. Rethinking modeling Alzheimer's disease progression from a multi-task learning perspective with deep recurrent neural network. Comput Biol Med 2021; 138:104935. PMID: 34656869; DOI: 10.1016/j.compbiomed.2021.104935.
Abstract
Alzheimer's disease (AD) is a severe neurodegenerative disorder that usually starts slowly and progressively worsens. Predicting the progression of Alzheimer's disease through longitudinal analysis of time series data has recently received increasing attention. However, training an accurate progression model faces two major challenges: missing features and the small sample sizes of follow-up studies. We thoroughly analyze the correlations among the multiple predictive tasks of AD progression at multiple time points, and propose a multi-task learning framework that can adaptively impute missing values and predict future progression over time from a subject's historical measurements. Progression is measured in terms of MRI volumetric measurements, trajectories of a cognitive score, and clinical status. To this end, we take a new perspective, predicting AD progression within a multi-task learning paradigm. We hypothesize that inherent correlations exist among: (i) the prediction tasks of clinical diagnosis, cognition, and ventricular volume at each time point; (ii) the tasks of imputation and prediction; and (iii) the prediction tasks at multiple future time points. Based on these task correlations, we develop an end-to-end deep multi-task learning method, with balanced multi-task dynamic weight optimization, to jointly improve the performance of imputing missing values and making predictions. With in-depth analysis and empirical evidence on the Alzheimer's Disease Neuroimaging Initiative (ADNI) data, we show the benefits and flexibility of the proposed multi-task learning model, especially for prediction at the M60 time point. The proposed approach achieves improvements of 5.6% in mAUC, 5.7% in BCA, and 4.0% and 11.8% in MAE (for ADAS-Cog13 and Ventricles), respectively.
5. Kuang J, Michel K, Scoglio C. GeCoNet-Tool: a software package for gene co-expression network construction and analysis. BMC Bioinformatics 2023; 24:281. PMID: 37434115; DOI: 10.1186/s12859-023-05382-1.
Abstract
BACKGROUND Network analysis is a powerful tool for studying gene regulation and identifying biological processes associated with gene function. However, constructing gene co-expression networks can be a challenging task, particularly when dealing with a large number of missing values. RESULTS We introduce GeCoNet-Tool, an integrated gene co-expression network construction and analysis tool. The tool comprises two main parts: network construction and network analysis. In the network construction part, GeCoNet-Tool offers users various options for processing gene co-expression data derived from diverse technologies. The output of the tool is an edge list, optionally with a weight associated with each link. In the network analysis part, the user can produce a table that includes several network properties, such as communities, cores, and centrality measures. With GeCoNet-Tool, users can explore and gain insights into the complex interactions between genes.
6. Bhushan S, Kumar A, Pokhrel R, Bakr ME, Mekiso GT. Design based synthetic imputation methods for domain mean. Sci Rep 2024; 14:4250. PMID: 38378823; PMCID: PMC10879151; DOI: 10.1038/s41598-024-53909-0.
Abstract
In real life, situations arise in which the available data are insufficient to provide accurate estimates for a domain; in such cases, the small area estimation (SAE) technique is used to obtain accurate estimates for the variable under study. Missing data are a serious problem for sample surveys in general, and small area estimates are especially prone to them. This paper suggests design-based synthetic imputation methods for domain mean estimation using simple random sampling, in order to address the issue of missing data under SAE. Expressions for the mean square error of the proposed imputation methods are obtained up to first-order approximation. The efficiency conditions are determined, and a thorough simulation study is carried out using artificially generated data sets. An application with real data further supports this study.
7. Bu Z, Dai Z, Zhang Y, Long Q. MISNN: Multiple Imputation via Semi-parametric Neural Networks. Advances in Knowledge Discovery and Data Mining (PAKDD) 2023; 13935:430-442. PMID: 38370342; PMCID: PMC10869892; DOI: 10.1007/978-3-031-33374-3_34.
Abstract
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social, and econometric research in order to avoid improper inference in downstream data analysis. In the presence of high-dimensional data, imputation models that include feature selection, especially ℓ1-regularized regression (such as Lasso, adaptive Lasso, and Elastic Net), are common choices to keep the model from being underdetermined. However, conducting MI with feature selection is difficult: existing methods are often computationally inefficient and perform poorly. We propose MISNN, a novel and efficient algorithm that incorporates feature selection into MI. Leveraging the approximation power of neural networks, MISNN is a general and flexible framework compatible with any feature selection method, any neural network architecture, high- and low-dimensional data, and general missing patterns. In empirical experiments, MISNN demonstrated clear advantages over state-of-the-art imputation methods (e.g., Bayesian Lasso and matrix completion) in terms of imputation accuracy, statistical consistency, and computation speed.
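After any MI procedure (including MISNN) produces m completed data sets, the per-imputation estimates are conventionally combined with Rubin's rules: the pooled point estimate is the mean of the m estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. A minimal sketch of that combining step (the function name is an assumption, not from the paper):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates by Rubin's rules.
    Returns the pooled point estimate and the total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B the between-imputation variance."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = len(est)
    qbar = est.mean()                 # pooled point estimate
    w = var.mean()                    # within-imputation variance
    b = est.var(ddof=1)               # between-imputation variance
    t = w + (1 + 1 / m) * b
    return float(qbar), float(t)
```

The between-imputation term is what distinguishes MI from single imputation: it propagates the uncertainty about the missing values into the final standard errors.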
8. Kuang J, Buchon N, Michel K, Scoglio C. A global Anopheles gambiae gene co-expression network constructed from hundreds of experimental conditions with missing values. BMC Bioinformatics 2022; 23:170. PMID: 35534830; PMCID: PMC9082846; DOI: 10.1186/s12859-022-04697-9.
Abstract
BACKGROUND Gene co-expression networks (GCNs) can be used to determine gene regulation and attribute gene function to biological processes. Different high throughput technologies, including one- and two-channel microarrays and RNA-sequencing, allow thousands of gene expression values to be evaluated simultaneously, but these methodologies provide results that cannot be directly compared. It is therefore complex to analyze co-expression relations between genes, especially when there are missing values arising for experimental reasons. Networks are a helpful tool for studying gene co-expression, in which nodes represent genes and edges represent co-expression of pairs of genes. RESULTS In this paper, we establish a method for constructing a gene co-expression network for the Anopheles gambiae transcriptome from 257 unique studies obtained with different methodologies and experimental designs. We introduce a sliding threshold approach to select node pairs with high Pearson correlation coefficients. The resulting network, which we name AgGCN1.0, is robust to random removal of conditions and has characteristics similar to those of small-world and scale-free networks. Analysis of network sub-graphs revealed that the core is largely comprised of genes that encode components of the mitochondrial respiratory chain and the ribosome, while different communities are enriched for genes involved in distinct biological processes. CONCLUSION Analysis of the network reveals that both the architecture of the core sub-network and the network communities are based on gene function, supporting the power of the proposed method for GCN construction. Application of network science methodology reveals that the overall network structure is driven to maximize the integration of essential cellular functions, possibly allowing the flexibility to add novel functions.
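The basic construction step, computing Pearson correlations over pairwise-complete conditions and keeping high-correlation pairs as edges, can be sketched as follows. This sketch uses a single fixed cutoff for simplicity, whereas the paper's sliding threshold approach varies the cutoff; the function name is hypothetical.

```python
import numpy as np

def coexpression_edges(expr, genes, threshold=0.8, min_obs=3):
    """Build a weighted edge list from a genes x conditions expression
    matrix that may contain NaNs. Pearson correlation is computed over
    pairwise-complete conditions only, and a gene pair becomes an edge
    when |r| meets the threshold and enough conditions are shared."""
    edges = []
    n = expr.shape[0]
    for a in range(n):
        for b in range(a + 1, n):
            shared = ~np.isnan(expr[a]) & ~np.isnan(expr[b])
            if shared.sum() < min_obs:
                continue  # too few jointly observed conditions
            r = np.corrcoef(expr[a, shared], expr[b, shared])[0, 1]
            if abs(r) >= threshold:
                edges.append((genes[a], genes[b], round(float(r), 3)))
    return edges
```

Skipping pairs with too few shared conditions (`min_obs`) guards against spurious correlations computed from a handful of points, which matters when many values are missing.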
9. Mohammedqasem R, Mohammedqasim H, Asad Ali Biabani S, Ata O, Alomary MN, Almehmadi M, Amer Alsairi A, Azam Ansari M. Multi-objective deep learning framework for COVID-19 dataset problems. J King Saud Univ Sci 2023; 35:102527. PMID: 36590237; PMCID: PMC9795799; DOI: 10.1016/j.jksus.2022.102527.
Abstract
Background A deadly virus known as COVID-19 arose in China and spread rapidly throughout the country and then the world, and a large number of people died. It quickly became an epidemic; the absence of apparent symptoms and the lack of sufficient laboratory results caused confusion, and intelligent algorithms came to be used to support decisions on clinical outcomes. Methods This study developed a new framework for medical datasets with high missing-value rates based on deep-learning optimization models. The robustness of our model is achieved by combining two components: the Data Missing Care (DMC) framework, which overcomes the problem of high missing rates in medical datasets, and grid-search optimization, used to develop an improved deep predictive training model for patients with COVID-19 by setting multiple hyperparameters and tuning assessments on three deep learning algorithms: ANN (artificial neural network), CNN (convolutional neural network), and RNN (recurrent neural network). Results Experiments conducted on three medical datasets showed the effectiveness of our hybrid approach and an improvement in accuracy and efficiency, since all the evaluation metrics were close to ideal for all deep learning classifiers. We obtained the best evaluation on the COVID-19 dataset provided on GitHub, with 98% accuracy, 98.5% precision, 98.6% F1-score, and a ROC curve of 95% to 99%. On the second dataset, COVID-19 data provided by Albert Einstein Hospital with a high proportion of missing data, accuracy reached more than 91% after applying our approach. On the third dataset, a cervical cancer dataset provided by Kaggle, all evaluation metrics exceeded 95%. Conclusions The proposed formula for processing this type of data can replace traditional formats in optimization while providing high accuracy and less time to classify patients. The experimental results of our approach, supported by comprehensive statistical analysis, improve the overall evaluation performance on the problem of classifying medical data sets with high missing-value rates. The approach can therefore be used in many areas, such as energy management, the environment, and medicine.
10. Amene E, Horn B, Pirie R, Lake R, Döpfer D. Filling gaps in notification data: a model-based approach applied to travel related campylobacteriosis cases in New Zealand. BMC Infect Dis 2016; 16:475. PMID: 27600394; PMCID: PMC5011939; DOI: 10.1186/s12879-016-1784-8.
Abstract
Background Data on notified cases of disease are often compromised by incomplete or partial information about individual cases. In an effort to enhance the value of information from enteric disease notifications in New Zealand, this study explored the use of Bayesian and multiple imputation models to fill risk factor data gaps. As a test case, overseas travel as a risk factor for infection with campylobacteriosis was examined. Methods Two methods, Bayesian Specification (BAS) and Multiple Imputation (MI), were compared with regard to predictive performance at various levels of artificially induced missingness of overseas travel status in campylobacteriosis notification data. Predictive performance was assessed through the Brier score, the area under the ROC curve, and the percent bias of regression coefficients. Finally, the best model was selected and applied to predict the missing overseas travel status of campylobacteriosis notifications. Results No difference was observed in the predictive performance of the BAS and MI methods at a lower rate of missingness (<10%), but the BAS approach performed better than MI at higher rates of missingness (50%, 65%, 80%). The estimated proportion (95% credibility interval) of travel related cases was greatest in the highly urban District Health Boards (DHBs) of Counties Manukau, Auckland and Waitemata, at 0.37 (0.12, 0.57), 0.33 (0.13, 0.55) and 0.28 (0.10, 0.49), whereas the lowest proportions were estimated for the more rural West Coast, Northland and Tairawhiti DHBs at 0.02 (0.01, 0.05), 0.03 (0.01, 0.08) and 0.04 (0.01, 0.06), respectively. The national rate of travel related campylobacteriosis cases was estimated at 0.16 (0.02, 0.48). Conclusion The use of BAS offers a flexible approach to data augmentation, particularly when the missing rate is very high and the Missing At Random (MAR) assumption holds. The high rates of travel associated cases predicted for urban regions of New Zealand are plausible given the high rate of travel in these regions, including to destinations with a higher risk of infection. An added advantage of the Bayesian approach is that the model's prediction can be improved whenever new information becomes available.
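Two of the performance measures used in this comparison are straightforward to compute: the Brier score is the mean squared difference between predicted probabilities and binary outcomes (lower is better; 0.25 corresponds to always predicting 0.5), and percent bias measures the relative deviation of an estimated regression coefficient from its true value. A minimal sketch (function names are assumptions, not from the paper):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and
    observed binary outcomes; 0 is perfect calibration and discrimination."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

def percent_bias(beta_true, beta_est):
    """Relative deviation (in %) of an estimated coefficient from the
    value obtained on the complete data."""
    return 100.0 * (beta_est - beta_true) / beta_true
```

In a missingness study such as this one, these metrics are computed on the artificially masked cases, whose true travel status is known.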
11. Yin Y, Yuan Z, Tanvir IM, Bao X. Electronic medical records imputation by temporal Generative Adversarial Network. BioData Min 2024; 17:19. PMID: 38926718; PMCID: PMC11202349; DOI: 10.1186/s13040-024-00372-2.
Abstract
The loss of data from electronic medical records has seriously affected their practical application in biomedicine, so effectively filling in these lost data is a meaningful research effort. Currently, state-of-the-art methods focus on using Generative Adversarial Networks (GANs) to fill the missing values of electronic medical records, and have achieved breakthrough progress. However, when facing datasets with high missing rates, the imputation accuracy of these methods decreases sharply. This motivated us to explore the uncertainty of GANs and improve GAN-based imputation methods. In this paper, a GRUD (Gated Recurrent Unit with Decay) network and a UGAN (Uncertainty Generative Adversarial Network) are proposed and combined into UGAN-GRUD. In UGAN-GRUD, the GAN generates imputed values and the GRUD network then compensates for them. The UGAN learns the distribution pattern and uncertainty of the data iteratively through its Generator and Discriminator; the GRUD network compensates for the UGAN by exploiting a time decay factor that captures the specific temporal relations in electronic medical records. Experiments on publicly available biomedical datasets show that UGAN-GRUD outperforms the current state-of-the-art methods, with average improvements of 13% in RMSE (root mean squared error) and 24.5% in MAPE (mean absolute percentage error).
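The two error measures reported here are evaluated on entries whose true values are known, for example cells that were artificially masked before imputation. A minimal sketch of both (function names assumed):

```python
import numpy as np

def rmse(actual, imputed):
    """Root mean squared error between held-out true values and
    the corresponding imputed values."""
    a = np.asarray(actual, dtype=float)
    b = np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mape(actual, imputed):
    """Mean absolute percentage error; assumes the true values are
    nonzero, since each error is scaled by the true value."""
    a = np.asarray(actual, dtype=float)
    b = np.asarray(imputed, dtype=float)
    return float(np.mean(np.abs((a - b) / a)) * 100.0)
```

RMSE penalizes large absolute deviations, while MAPE expresses error relative to the magnitude of each true value, which is why the two can rank imputation methods differently.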