1
Chang Z, Liu S, Qiu R, Song S, Cai Z, Tu G. Time-aware neural ordinary differential equations for incomplete time series modeling. THE JOURNAL OF SUPERCOMPUTING 2023; 79:1-29. [PMID: 37359342 PMCID: PMC10192786 DOI: 10.1007/s11227-023-05327-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/19/2023] [Indexed: 06/28/2023]
Abstract
The Internet of Things realizes the ubiquitous connection of all things, generating countless time-tagged data called time series. However, real-world time series are often plagued with missing values on account of noise or malfunctioning sensors. Existing methods for modeling such incomplete time series typically involve preprocessing steps, such as deletion or missing data imputation using statistical learning or machine learning methods. Unfortunately, these methods unavoidably destroy time information and introduce accumulated error into the subsequent model. To this end, this paper introduces a novel continuous neural network architecture, named Time-aware Neural-Ordinary Differential Equations (TN-ODE), for incomplete time series modeling. The proposed method not only supports imputing missing values at arbitrary time points, but also enables multi-step prediction at desired time points. Specifically, TN-ODE employs a time-aware Long Short-Term Memory as an encoder, which effectively learns the posterior distribution from partially observed data. Additionally, the derivative of the latent states is parameterized with a fully connected network, thereby enabling continuous-time latent dynamics generation. The proposed TN-ODE model is evaluated on both real-world and synthetic incomplete time-series datasets through data interpolation and extrapolation tasks as well as a classification task. Extensive experiments show that the TN-ODE model outperforms baseline methods in terms of Mean Square Error for the imputation and prediction tasks, as well as accuracy in the downstream classification task.
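As a rough illustration of the continuous-time idea in this abstract (a latent derivative parameterized by a fully connected network, evaluated at arbitrary time points), the NumPy sketch below integrates a randomly initialized network with fixed-step Euler. It is not the TN-ODE implementation: the layer sizes, the untrained random weights, the Euler solver, and the names `latent_derivative` and `integrate_latent` are all assumptions made for illustration, and the time-aware LSTM encoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, HIDDEN_DIM = 4, 16  # hypothetical sizes, not taken from the paper

# Untrained weights of a two-layer fully connected network (assumption).
W1 = rng.normal(scale=0.1, size=(LATENT_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, LATENT_DIM))
b2 = np.zeros(LATENT_DIM)


def latent_derivative(z):
    """dz/dt as a small fully connected network of the latent state z."""
    return np.tanh(z @ W1 + b1) @ W2 + b2


def integrate_latent(z0, times, max_step=0.01):
    """Euler-integrate from t=0 and return the state at each requested time.

    `times` must be non-negative and sorted in ascending order.
    """
    z = np.asarray(z0, dtype=float).copy()
    t = 0.0
    states = []
    for target in times:
        # Split the gap to the next query time into steps no larger than max_step.
        n_steps = max(1, int(np.ceil((target - t) / max_step)))
        h = (target - t) / n_steps
        for _ in range(n_steps):
            z = z + h * latent_derivative(z)
        t = target
        states.append(z.copy())
    return np.stack(states)


# Query the latent trajectory at irregular time points.
z0 = np.ones(LATENT_DIM)
query_times = np.array([0.0, 0.37, 1.5, 4.2])
states = integrate_latent(z0, query_times)
print(states.shape)
```

Because the state is defined by an ODE rather than by discrete recurrence steps, the query times need not lie on a regular grid, which is what makes imputation and prediction at arbitrary time points possible in this family of models.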
Affiliation(s)
- Zhuoqing Chang
- School of Computer Science, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
- Shubo Liu
- School of Computer Science, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
- Run Qiu
- School of Computer Science, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
- Song Song
- School of Computer Science, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
- Zhaohui Cai
- School of Computer Science, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
- Guoqing Tu
- School of Cyber Science and Engineering, Wuhan University, 299# Bayi Rd, Wuchang District, Wuhan, 430072 Hubei China
2
Qin R, Wang Y. ImputeGAN: Generative Adversarial Network for Multivariate Time Series Imputation. ENTROPY (BASEL, SWITZERLAND) 2023; 25:137. [PMID: 36673278 PMCID: PMC9858206 DOI: 10.3390/e25010137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 01/06/2023] [Accepted: 01/08/2023] [Indexed: 06/17/2023]
Abstract
Since missing values in multivariate time series data are inevitable, many researchers have come up with methods to deal with the missing data. These include case deletion methods, statistics-based imputation methods, and machine learning-based imputation methods. However, these methods cannot handle temporal information, or the complementation results are unstable. We propose a model based on generative adversarial networks (GANs) and an iterative strategy based on the gradient of the complementary results to solve these problems. This ensures the generalizability of the model and the reasonableness of the complementation results. We conducted experiments on three large-scale datasets and compared the results with traditional complementation methods. The experimental results show that imputeGAN outperforms traditional complementation methods in terms of accuracy of complementation.
3
Missing Values Imputation Using Fuzzy K-Top Matching Value. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2022. [DOI: 10.1016/j.jksuci.2022.12.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
4
Hu X, Shen Y, Pedrycz W, Li Y, Wu G. Granular Fuzzy Rule-Based Modeling With Incomplete Data Representation. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:6420-6433. [PMID: 33909582 DOI: 10.1109/tcyb.2021.3071145] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Incomplete data are frequently encountered and bring difficulties when it comes to further processing. The concepts of granular computing (GrC) help deliver a higher level of abstraction to address this problem. Most of the existing data imputation and related modeling methods are of numeric nature and require prior numeric models to be provided. The underlying objective of this study is to introduce a novel and straightforward approach that uses information granules as a vehicle to effectively represent missing data and build granular fuzzy models directly from resulting hybrid granular and numeric data. The evaluation and optimization of this method are guided by the principle of justifiable granularity engaging the coverage and specificity criteria and carried out with the help of particle swarm optimization. We provide a collection of experimental studies using a synthetic dataset and several publicly available real-world datasets to demonstrate the feasibility and analyze the main features of this method.
5
Datasets on South Korean manufacturing factories' electricity consumption and demand response participation. Sci Data 2022; 9:227. [PMID: 35610251 PMCID: PMC9130238 DOI: 10.1038/s41597-022-01357-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2021] [Accepted: 04/28/2022] [Indexed: 11/23/2022] Open
Abstract
This study describes the release of electricity consumption data of some manufacturing factories located in South Korea that participate in the demand response (DR) market. The data (in kilowatt) comprise individual factories’ total power usage details that were acquired using advanced metering infrastructures. They further contain details on the manufacture types, DR participation dates, mandatory reduction capacities, and response capacities of the factories. For data acquisition, 10 manufacturing companies are representatively selected according to the process regularity and company size standard of this study. Entire datasets are newly collected and available at one-minute intervals for seven months from 1 March to 30 September 2019. These datasets can be used in a variety of ways to contribute to the functioning of power systems and markets, including the conduction of industrial load characteristic analysis for load flexibility, estimation of demand-side considerations for virtual power plant design, and determination of energy markets and incentives to achieve carbon neutrality targets at the national level.
Measurement(s): electricity consumption. Technology Type(s): advanced metering infrastructure.
6
Liu C, Cui G, Liu S. CGCNImp: a causal graph convolutional network for multivariate time series imputation. PeerJ Comput Sci 2022; 8:e966. [PMID: 35634128 PMCID: PMC9138184 DOI: 10.7717/peerj-cs.966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2021] [Accepted: 04/08/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND: Multivariate time series data generally contains missing values, which can be an obstacle to subsequent analysis and may compromise downstream applications. One challenge in this endeavor is the presence of the missing values brought about by sensor failure and transmission packet loss. Imputation is the usual remedy in such circumstances. However, in some multivariate time series data, the complex correlation and temporal dependencies, coupled with the non-stationarity of the data, make imputation difficult. METHODS: To address this problem, we propose a novel model for multivariate time series imputation called CGCNImp that considers both correlation and temporal dependency modeling. The correlation dependency module leverages neural Granger causality and a GCN to capture the correlation dependencies among different attributes of the time series data, while the temporal dependency module relies on an attention-driven long short term memory (LSTM) and a time lag matrix to learn its dependencies. Missing values and noise are addressed with total variation reconstruction. RESULTS: We conduct thorough empirical analyses on two real-world datasets. Imputation results show that CGCNImp achieves state-of-the-art performance when compared to previous methods.
Affiliation(s)
- Caizheng Liu
- Department of Data Science, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Department of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
- Guangfan Cui
- Department of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
- Shenghua Liu
- Department of Data Science, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
7
Data imputation via conditional generative adversarial network with fuzzy c mean membership based loss term. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02661-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
8
Nagarajan G, Dhinesh Babu LD. Missing data imputation on biomedical data using deeply learned clustering and L2 regularized regression based on symmetric uncertainty. Artif Intell Med 2022; 123:102214. [PMID: 34998512 DOI: 10.1016/j.artmed.2021.102214] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/22/2020] [Revised: 11/08/2021] [Accepted: 11/08/2021] [Indexed: 12/28/2022]
Abstract
Big data era in healthcare led to the generation of high dimensional datasets like genomic datasets, electronic health records etc. One among the critical issues to be addressed in such datasets is handling incomplete data that may yield misleading results if not handled properly. Imputation is considered to be an effective way when the missing data rate is high. While imputation accuracy and classification accuracy are the two important metrics generally considered by most of the imputation techniques, high dimensional datasets such as genomic datasets motivated the need for imputation techniques that are also computationally efficient and preserves the structure of the dataset. This paper proposes a novel approach to missing data imputation in biomedical datasets using an ensemble of deeply learned clustering and L2 regularized regression based on symmetric uncertainty. The experiments are conducted with different proportion of missing data on both genomic and non-genomic biomedical datasets for different types of missingness pattern. Our proposed approach is compared with seven proven baseline imputation methods and two recently proposed imputation approaches. The results show that the proposed approach outperforms the other approaches considered in our experimentation in terms of imputation accuracy and computational efficiency despite preserving the structure of the dataset. Thus, the overall classification accuracy of the biomedical classification tasks is also improved when our proposed missing data imputation technique is used.
Affiliation(s)
- L D Dhinesh Babu
- School of Information Technology and Engineering, VIT university, India.
9
A novel clustering-based purity and distance imputation for handling medical data with missing values. Soft comput 2021. [DOI: 10.1007/s00500-021-05947-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
10
Martínez-Plumed F, Ferri C, Nieves D, Hernández-Orallo J. Missing the missing values: The ugly duckling of fairness in machine learning. INT J INTELL SYST 2021. [DOI: 10.1002/int.22415] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Affiliation(s)
- Fernando Martínez-Plumed
- Joint Research Centre European Commission Seville Spain
- Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València València Spain
- Cèsar Ferri
- Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València València Spain
- David Nieves
- Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València València Spain
- José Hernández-Orallo
- Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València València Spain
- Leverhulme Centre for the Future of Intelligence, University of Cambridge Cambridge UK
11
Chicco G. Data Consistency for Data-Driven Smart Energy Assessment. Front Big Data 2021; 4:683682. [PMID: 34056585 PMCID: PMC8155608 DOI: 10.3389/fdata.2021.683682] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/21/2021] [Accepted: 04/19/2021] [Indexed: 11/16/2022] Open
Abstract
In the smart grid era, the number of data available for different applications has increased considerably. However, data could not perfectly represent the phenomenon or process under analysis, so their usability requires a preliminary validation carried out by experts of the specific domain. The process of data gathering and transmission over the communication channels has to be verified to ensure that data are provided in a useful format, and that no external effect has impacted on the correct data to be received. Consistency of the data coming from different sources (in terms of timings and data resolution) has to be ensured and managed appropriately. Suitable procedures are needed for transforming data into knowledge in an effective way. This contribution addresses the previous aspects by highlighting a number of potential issues and the solutions in place in different power and energy system, including the generation, grid and user sides. Recent references, as well as selected historical references, are listed to support the illustration of the conceptual aspects.
Affiliation(s)
- Gianfranco Chicco
- Dipartimento Energia "Galileo Ferraris," Politecnico di Torino, Torino, Italy
12
A simple and efficient incremental missing data imputation method for evolving neo-fuzzy network. EVOLVING SYSTEMS 2021. [DOI: 10.1007/s12530-021-09376-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
13
Thomas T, Rajabi E. A systematic review of machine learning-based missing value imputation techniques. DATA TECHNOLOGIES AND APPLICATIONS 2021. [DOI: 10.1108/dta-12-2020-0298] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/17/2022]
Abstract
Purpose: The primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are (1) what are the ML-based imputation methods studied and proposed during 2010–2020? (2) How the experimentation setup, characteristics of data sets and missingness are employed in these studies? (3) What metrics were used for the evaluation of imputation method?
Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers totaling at 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions.
Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are most used evaluation metrics in these studies. For experimentation, majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism are pertaining to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.
Originality/value: It is understood from the review that there is no single universal solution to missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms which are simple and easy to implement make it popular across various domains.
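The evaluation protocol described in the findings above (a complete data set as the baseline, artificially induced missingness, and RMSE measured on the imputed entries) can be sketched as follows. This is a minimal illustration and not a method from any reviewed paper: the synthetic data, the 20% missing-completely-at-random mask, and column-mean imputation are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Complete baseline data set (synthetic, assumed for illustration).
complete = rng.normal(loc=10.0, scale=2.0, size=(200, 5))

# Artificially induce missingness: each cell is dropped with probability 0.2.
mask = rng.random(complete.shape) < 0.2
incomplete = complete.copy()
incomplete[mask] = np.nan

# Stand-in imputer: replace each missing cell with its column mean,
# computed from the observed values only.
col_means = np.nanmean(incomplete, axis=0)
imputed = np.where(mask, col_means, incomplete)

# RMSE is computed only over the cells that were masked out.
rmse = np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2))
print(f"missing ratio: {mask.mean():.2f}, RMSE on imputed entries: {rmse:.3f}")
```

Any other imputer (kNN, clustering-based, and so on) can be swapped in for the column-mean step while the masking and scoring stay the same, which is what makes results comparable across methods under this protocol.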
14
Xue Y, Tang Y, Xu X, Liang J, Neri F. Multi-Objective Feature Selection With Missing Data in Classification. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2021.3074147] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 11/10/2022]
15
Brief Report: Predicting Social Skills from Semantic, Syntactic, and Pragmatic Language Among Young Children with Autism Spectrum Disorder. J Autism Dev Disord 2020; 50:4165-4175. [PMID: 32215820 DOI: 10.1007/s10803-020-04445-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 10/24/2022]
Abstract
The language and social skill deficits associated with autism spectrum disorder (ASD) warrant further study. Existing research has focused on the contributions of pragmatic language to social skills, with little attention to other aspects of language. We examined the associations across three language domains (semantics, syntax, and pragmatics) and their relations to parent- and teacher-rated social skills among children with ASD. When parent-reported language skills were considered simultaneously, only semantics significantly predicted children's social skills. For teacher-reported language skills, all three language domains predicted children's social skills, but none made unique contributions above and beyond one another. Further research should consider the impact of social context on language expectations and interventions targeting semantic language on children's development of social skills.
16
Kejia S, Parvin H, Qasem SN, Tuan BA, Pho KH. A classification model based on svm and fuzzy rough set for network intrusion detection. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-191621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Intrusion Detection Systems (IDS) are designed to provide security in computer networks. Different classification models such as the Support Vector Machine (SVM) have been successfully applied to network data. Meanwhile, extending or improving the current models by performing prototype selection simultaneously with their training phase is crucial because of serious inefficiencies during training (i.e., learning overhead). This paper introduces an improved model for prototype selection. Applying the proposed prototype selection along with the SVM classification model increases the attack discovery rate. In this article, we use fuzzy rough sets theory (FRST) for prototype selection to enhance SVM in intrusion detection. Testing and evaluation of the proposed IDS have been mainly performed on the NSL-KDD dataset, a refined version of KDD-CUP99. Experiments indicate that the proposed IDS outperforms the basic and simple IDSs and modern IDSs in terms of precision, recall, and accuracy rate.
Affiliation(s)
- Shen Kejia
- The Second Affiliated Hospital of the Second Military Medical University, Shanghai City, China
- Hamid Parvin
- Institute of Research and Development, Duy Tan University, Da Nang, Vietnam
- Faculty of Information Technology, Duy Tan University, Da Nang, Vietnam
- Department of Computer Science, Nourabad Mamasani Branch, Islamic Azad University, Mamasani, Iran
- Sultan Noman Qasem
- Computer Science Department, College of Computer and Information Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
- Computer Science Department, Faculty of Applied Science, Taiz University, Taiz, Yemen
- Bui Anh Tuan
- Department of Mathematics Education, Teachers College, Can Tho University, Can Tho City, Vietnam
- Kim-Hung Pho
- Fractional Calculus, Optimization and Algebra Research Group, Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam
17
Dutta A, Breloff SP, Dai F, Sinsel EW, Carey RE, Warren CM, Wu JZ. Fusing imperfect experimental data for risk assessment of musculoskeletal disorders in construction using canonical polyadic decomposition. AUTOMATION IN CONSTRUCTION 2020; 119:10.1016/j.autcon.2020.103322. [PMID: 33897107 PMCID: PMC8064735 DOI: 10.1016/j.autcon.2020.103322] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/12/2023]
Abstract
Field or laboratory data collected for work-related musculoskeletal disorder (WMSD) risk assessment in construction often becomes unreliable as a large amount of data go missing due to technology-induced errors, instrument failures or sometimes at random. Missing data can adversely affect the assessment conclusions. This study proposes a method that applies Canonical Polyadic Decomposition (CPD) tensor decomposition to fuse multiple sparse risk-related datasets and fill in missing data by leveraging the correlation among multiple risk indicators within those datasets. Two knee WMSD risk-related datasets-3D knee rotation (kinematics) and electromyography (EMG) of five knee postural muscles-collected from previous studies were used for the validation and demonstration of the proposed method. The analysis results revealed that for a large portion of missing values (40%), the proposed method can generate a fused dataset that provides reliable risk assessment results highly consistent (70%-87%) with those obtained from the original experimental datasets. This signified the usefulness of the proposed method for use in WMSD risk assessment studies when data collection is affected by a significant amount of missing data, which will facilitate reliable assessment of WMSD risks among construction workers. In the future, findings of this study will be implemented to explore whether, and to what extent, the fused dataset outperforms the datasets with missing values by comparing consistencies of the risk assessment results obtained from these datasets for further investigation of the fusion performance.
Affiliation(s)
- Amrita Dutta
- Department of Civil and Environmental Engineering, West Virginia University, P.O. Box 6103, Morgantown, WV 26506, United States of America
- Scott P. Breloff
- National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, United States of America
- Fei Dai
- Department of Civil and Environmental Engineering, West Virginia University, P.O. Box 6103, Morgantown, WV 26506, United States of America
- Erik W. Sinsel
- National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, United States of America
- Robert E. Carey
- National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, United States of America
- Christopher M. Warren
- National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, United States of America
- John Z. Wu
- National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505, United States of America
18
Cheng CH, Chang JR, Huang HH. A novel weighted distance threshold method for handling medical missing values. Comput Biol Med 2020; 122:103824. [PMID: 32658729 DOI: 10.1016/j.compbiomed.2020.103824] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/26/2020] [Revised: 05/14/2020] [Accepted: 05/14/2020] [Indexed: 01/04/2023]
Abstract
Data in the medical field often contain missing values and may result in biased research results. Therefore, the objective of this work is to propose a new imputation method, a novel weighted distance threshold method, to impute missing values. After several experiments, we find that the proposed imputation method has the following benefits. (1) The proposed method with purity can reassign instances into the nearest class of the dataset, and the purity computation can filter outliers; (2) The proposed method redefines the degree of missing values and can determine attributes and instances relative to the missing values in different datasets; and (3) The proposed method need not set the k value of the nearest neighborhood because this study identifies the k value based on the best threshold to calculate purity to enhance the results of imputation. In addition, the distance threshold can adjust the optimal nearest neighborhood to estimate missing values. This study implements several experiments to compare the proposed method with other imputation methods using different missing types, missing degrees, and types of datasets. The results indicate that the proposed imputation method is better than the listed methods. Moreover, this study uses the stroke dataset from the International Stroke Trial (IST) to verify whether the proposed method can be effectively applied in practice, and the results show that the proposed method achieves 90% accuracy in the Stroke dataset.
Affiliation(s)
- Ching-Hsue Cheng
- Department of Information Management, National Yunlin University of Science & Technology, 123, section 3, University Road, Touliu, Yunlin 640, Taiwan.
- Jing-Rong Chang
- Department of Information Management, Chaoyang University of Technology, Taichung, Taiwan
- Hao-Hsuan Huang
- Information Center, China Medical University Hospital, Taichung, Taiwan
19
Sefidian AM, Daneshpour N. Estimating missing data using novel correlation maximization based methods. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106249] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/29/2022]
20
An Integrated Fuzzy C-Means Method for Missing Data Imputation Using Taxi GPS Data. SENSORS 2020; 20:s20071992. [PMID: 32252432 PMCID: PMC7181140 DOI: 10.3390/s20071992] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/17/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 11/17/2022]
Abstract
Various traffic-sensing technologies have been employed to facilitate traffic control. Due to certain factors, e.g., malfunctioning devices and artificial mistakes, missing values typically occur in the Intelligent Transportation System (ITS) sensing datasets, resulting in a decrease in the data quality. In this study, an integrated imputation algorithm based on fuzzy C-means (FCM) and the genetic algorithm (GA) is proposed to improve the accuracy of the estimated values. The GA is applied to optimize the parameter of the membership degree and the number of cluster centroids in the FCM model. An experimental test of the taxi global positioning system (GPS) data in Manhattan, New York City, is employed to demonstrate the effectiveness of the integrated imputation approach. Three evaluation criteria, the root mean squared error (RMSE), correlation coefficient (R), and relative accuracy (RA), are used to verify the experimental results. Under the ±5% and ±10% thresholds, the average RAs obtained by the integrated imputation method are 0.576 and 0.785, which remain the highest among different methods, indicating that the integrated imputation method outperforms the history imputation method and the conventional FCM method. On the other hand, the clustering imputation performance with the Euclidean distance is better than that with the Manhattan distance. Thus, our proposed integrated imputation method can be employed to estimate the missing values in the daily traffic management.
21
Three-Way Decision for Handling Uncertainty in Machine Learning: A Narrative Review. ROUGH SETS 2020. [PMCID: PMC7338178 DOI: 10.1007/978-3-030-52705-1_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
In this work we introduce a framework, based on three-way decision (TWD) and the trisecting-acting-outcome model, to handle uncertainty in Machine Learning (ML). We distinguish between handling uncertainty affecting the input of ML models, when TWD is used to identify and properly take into account the uncertain instances; and handling the uncertainty lying in the output, where TWD is used to allow the ML model to abstain. We then present a narrative review of the state of the art of applications of TWD in regard to the different areas of concern identified by the framework, and in so doing, we will highlight both the points of strength of the three-way methodology, and the opportunities for further research.
|
22
|
Khan SI, Hoque ASML. SICE: an improved missing data imputation technique. Journal of Big Data 2020; 7:37. [PMID: 32547903 PMCID: PMC7291187 DOI: 10.1186/s40537-020-00313-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 02/08/2020] [Accepted: 05/29/2020] [Indexed: 05/16/2023]
Abstract
In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern to stakeholders, efficiently handling missing values becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
Affiliation(s)
- Shahidul Islam Khan
- Department of CSE, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Department of CSE, International Islamic University Chittagong, Chittagong, Bangladesh
|
23
|
Raja PS, Sasirekha K, Thangavel K. A Novel Fuzzy Rough Clustering Parameter-based missing value imputation. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04535-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/25/2022]
|
24
|
Vluymans S, Mac Parthaláin N, Cornelis C, Saeys Y. Weight selection strategies for ordered weighted average based fuzzy rough sets. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.05.085] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 10/26/2022]
|
25
|
Hamidzadeh J, Moradi M. Enhancing data analysis: uncertainty-resistance method for handling incomplete data. Appl Intell 2019. [DOI: 10.1007/s10489-019-01514-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/26/2022]
|
26
|
Nagarajan G, Dhinesh Babu LD. A hybrid of whale optimization and late acceptance hill climbing based imputation to enhance classification performance in electronic health records. J Biomed Inform 2019; 94:103190. [PMID: 31054960 DOI: 10.1016/j.jbi.2019.103190] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/14/2018] [Revised: 04/13/2019] [Accepted: 04/26/2019] [Indexed: 11/19/2022]
Abstract
Electronic health records (EHR) are a major source of information in biomedical informatics. Yet missing values are a prominent characteristic of EHR, and prediction on datasets with missing values results in inaccurate inferences. Nearest neighbour imputation, based on a lazy learning approach, is a proven technique for missing data imputation and is recognized as one of the top ten data mining algorithms owing to its simplicity and understandability. However, its performance deteriorates under the curse of dimensionality, as unimportant features are likely to dominate. We address this problem by proposing a novel approach to feature weighting for the nearest neighbour imputation method, based on a hybrid of the metaheuristic whale optimization algorithm (WOA) and the local search late acceptance hill climbing algorithm (LAHCA). Our proposed approach, Metaheuristic and Local Search based Feature Weighted Nearest Neighbour Imputation (kNN+LAHCAWOA), also learns different k values for different test points. Our approach is tested on benchmark EHR datasets with three proven classifiers: Support Vector Machines (SVM), Random Forest (RF) and Deep Neural Networks (DNN). The results show that kNN+LAHCAWOA is an effective imputation strategy and helps improve classification performance compared with its competitor methods.
Affiliation(s)
- Gayathri Nagarajan
- School of Information Technology and Engineering, VIT University, India.
- L D Dhinesh Babu
- School of Information Technology and Engineering, VIT University, India
|
27
|
Aieb A, Madani K, Scarpa M, Bonaccorso B, Lefsih K. A new approach for processing climate missing databases applied to daily rainfall data in Soummam watershed, Algeria. Heliyon 2019; 5:e01247. [PMID: 30886916 PMCID: PMC6384304 DOI: 10.1016/j.heliyon.2019.e01247] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 10/22/2018] [Revised: 12/16/2018] [Accepted: 02/13/2019] [Indexed: 12/04/2022] Open
Abstract
Missing data is a very frequent problem in climatology; it affects the quality of results in hydrological studies as well as in water resources management. This paper proposes a new imputation algorithm based on the optimization of several regression methods, namely hot deck, k-nearest-neighbors imputation, weighted k-nearest-neighbors imputation, multiple imputation, linear regression and the simple average method. The choice of these methods was justified by qualitative and quantitative statistical test analysis. However, the reliability of the results depends mainly on the percentage of missing data, the choice of neighboring stations, and the missingness mechanism, which should be missing at random. During the study it was found that most stations in the Soummam watershed are not well correlated, because of either the large loss of rainfall data or the geology of the watershed, which links station position to rainfall variability. For this case, principal component analysis was applied to a set of stations; it showed a positive impact of altitude, latitude and longitude on the correlation index between selected stations. Graphical analysis of the normal distribution of the RMSE values, obtained by applying the proposed technique to several random missingness cases (4%, 8%, 12% and 16%), confirmed the validity and performance of this approach.
Affiliation(s)
- Amir Aieb
- Laboratoire de Biomathématiques, Biophysique, Biochimie, et Scientométrie (L3BS), Université de Bejaia, 06000 Bejaia, Algérie; Department of Computer Science, Faculty of Exact Science, Abderrahmane Mira University, Bejaïa 06000, Algeria
- Khodir Madani
- Laboratoire de Biomathématiques, Biophysique, Biochimie, et Scientométrie (L3BS), Université de Bejaia, 06000 Bejaia, Algérie
- Marco Scarpa
- Department of Engineering, University of Messina, Italy
- Khalef Lefsih
- Laboratoire de Biomathématiques, Biophysique, Biochimie, et Scientométrie (L3BS), Université de Bejaia, 06000 Bejaia, Algérie
|
28
|
Xu Y, Hu S. Extended rough set model based on modified data-driven valued tolerance relation. Journal of Intelligent & Fuzzy Systems 2019. [DOI: 10.3233/jifs-18658] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Affiliation(s)
- Yi Xu
- Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China
- School of Computer Science and Technology, Anhui University, Hefei, China
- Shanzhong Hu
- School of Computer Science and Technology, Anhui University, Hefei, China
|
29
|
Yadav ML, Roychoudhury B. Handling missing values: A study of popular imputation packages in R. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.06.012] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 11/16/2022]
|
30
|
|