1
Darji J, Biswas N, Padul V, Gill J, Kesari S, Ashili S. Efficient use of binned data for imputing univariate time series data. Front Big Data 2024; 7:1422650. [PMID: 39234189] [PMCID: PMC11371617] [DOI: 10.3389/fdata.2024.1422650] [Received: 04/24/2024] [Accepted: 08/05/2024]
Abstract
Time series data are recorded across many sectors, producing large volumes of data. However, their continuity is often interrupted, leaving periods of missing data. Many algorithms exist to impute missing values, and their performance varies widely. Beyond the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data over different time spans and imputed them using different algorithms applied to binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data rather than the entire dataset, particularly for the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min spans of missing data, with the greatest reduction for the 15-min span. We also observed an effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binning can support imputation of a wide variety of data, including biological heart rate data derived from an Internet of Things (IoT) smartwatch and non-biological data such as household power consumption data.
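The binning idea in this abstract can be illustrated with a minimal sketch (illustrative only, not the authors' implementation): impute each missing value from the mean of its local bin rather than from the whole series, then compare RMSE against the ground truth at the imputed positions.

```python
import numpy as np

def impute_with_bins(series, bin_size):
    """Fill NaNs with the mean of the surrounding bin, a simple stand-in
    for restricting imputation to binned data instead of the full series."""
    filled = series.copy()
    for start in range(0, len(series), bin_size):
        window = filled[start:start + bin_size]          # view into `filled`
        window[np.isnan(window)] = np.nanmean(series[start:start + bin_size])
    return filled

def rmse(truth, estimate, mask):
    """Root mean square error restricted to the imputed positions."""
    return float(np.sqrt(np.mean((truth[mask] - estimate[mask]) ** 2)))

# Synthetic trending signal with 20% of points removed at random.
rng = np.random.default_rng(0)
truth = np.linspace(0.0, 10.0, 200)
observed = truth.copy()
observed[rng.choice(200, size=40, replace=False)] = np.nan
mask = np.isnan(observed)

global_fill = np.where(mask, np.nanmean(observed), observed)
binned_fill = impute_with_bins(observed, bin_size=20)
# With a trending signal, local bin means track the truth far better
# than the global mean, mirroring the RMSE reduction reported above.
```

The sketch uses simple mean imputation inside each bin; the study's point carries over to other imputers (e.g. EM) run on binned data.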
Affiliation(s)
- Jay Darji
- Rhenix Lifesciences, Hyderabad, Telangana, India
- Nupur Biswas
- Rhenix Lifesciences, Hyderabad, Telangana, India
- CureScience, San Diego, CA, United States
- Vijay Padul
- Rhenix Lifesciences, Hyderabad, Telangana, India
- Jaya Gill
- CureScience, San Diego, CA, United States
- Santosh Kesari
- Department of Translational Neurosciences, Pacific Neuroscience Institute and Saint John's Cancer Institute at Providence Saint John's Health Center, Santa Monica, CA, United States
2
Yin Y, Huang C, Bao X. ContrAttNet: Contribution and attention approach to multivariate time-series data imputation. Network (Bristol, England) 2024:1-24. [PMID: 38828665] [DOI: 10.1080/0954898x.2024.2360157] [Received: 08/28/2023] [Accepted: 05/22/2024]
Abstract
The imputation of missing values in multivariate time-series data is a fundamental and widely used data-processing task. Recently, some studies have exploited Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) to impute the missing values in multivariate time-series data. However, when faced with datasets with high missing rates, the imputation error of these methods increases dramatically. To this end, we propose a neural network model based on dynamic contribution and attention, denoted ContrAttNet. ContrAttNet consists of three novel modules: a feature attention module, an iLSTM (imputation Long Short-Term Memory) module, and a 1D-CNN (one-dimensional Convolutional Neural Network) module. ContrAttNet exploits temporal information and spatial feature information to predict missing values; the iLSTM attenuates the memory of the LSTM according to the characteristics of the missing values, learning the contributions of different features. The feature attention module introduces a contribution-based attention mechanism to calculate supervised weights, and under the influence of these weights the 1D-CNN processes the time-series data by treating them as spatial features. Experimental results show that ContrAttNet outperforms other state-of-the-art models in missing-value imputation for multivariate time-series data, with an average of 6% MAPE and 9% MAE on the benchmark datasets.
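A toy illustration of the contribution idea (hypothetical code, not the authors' ContrAttNet architecture): score each feature by how often it is actually observed, then turn those scores into softmax attention weights so that sparsely observed features contribute less.

```python
import numpy as np

def contribution_weights(data):
    """Softmax weights over features, scored by their observed fraction.
    Features with fewer missing entries (NaNs) receive larger weights."""
    observed_rate = 1.0 - np.mean(np.isnan(data), axis=0)   # per-feature score
    shifted = observed_rate - observed_rate.max()           # numerically stable softmax
    exp = np.exp(shifted)
    return exp / exp.sum()

# Two features: the first fully observed, the second half missing.
data = np.array([[1.0, np.nan],
                 [2.0, 4.0],
                 [3.0, np.nan],
                 [4.0, 8.0]])
w = contribution_weights(data)
# w[0] > w[1]: the reliably observed feature gets more attention weight.
```

In the paper the contributions are learned dynamically by the iLSTM rather than fixed from missingness statistics; this sketch only conveys the weighting intuition.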
Affiliation(s)
- Yunfei Yin
- College of Computer Science, Chongqing University, Chongqing, China
- Caihao Huang
- College of Computer Science, Chongqing University, Chongqing, China
- Xianjian Bao
- Department of Computer Science, Maharishi University of Management, Fairfield, USA
3
Li Y, Zhou Q, Fan Y, Pan G, Dai Z, Lei B. A novel machine learning-based imputation strategy for missing data in step-stress accelerated degradation test. Heliyon 2024; 10:e26429. [PMID: 38434061] [PMCID: PMC10906311] [DOI: 10.1016/j.heliyon.2024.e26429] [Received: 04/10/2023] [Revised: 11/25/2023] [Accepted: 02/13/2024]
Abstract
The presence of missing data is a significant data-quality issue that degrades the accuracy and reliability of data analysis. The issue is especially relevant for accelerated tests, particularly step-stress accelerated degradation tests. While missing data can arise from objective factors or human error, a high missing rate inevitably occurs during the conversion of accelerated test data, manifesting as a degradation dataset with unequal measurement intervals. An imputation method suited to accelerated test data is therefore essential. In this study, we propose a novel hybrid imputation method that combines LSSVM (least-squares support vector machine) and RBF (radial basis function) models to address missing data problems. The proposed model is compared with various traditional and machine learning imputation methods on simulated data to justify its advantages over existing methods. Finally, the proposed model is applied to real degradation datasets of a super-luminescent diode (SLD) to validate its performance and effectiveness in handling missing data in step-stress accelerated degradation tests. Because of its generality, the proposed method is expected to be applicable in other scenarios with high missing-data rates.
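As a rough sketch of the modeling ingredients (least-squares SVM regression is closely related to kernel ridge regression; this is an illustration, not the paper's hybrid LSSVM-RBF method), an RBF-kernel regressor can fill gaps in a smooth degradation curve sampled at unequal intervals:

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian RBF kernel matrix between two 1-D time vectors."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def rbf_ridge_impute(t_obs, y_obs, t_miss, gamma=10.0, lam=1e-6):
    """Fit a kernel ridge model on observed (time, value) pairs and
    predict the degradation values at the missing times."""
    K = rbf_kernel(t_obs, t_obs, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(t_obs)), y_obs)
    return rbf_kernel(t_miss, t_obs, gamma) @ alpha

# Smooth degradation-like curve with a few interior points missing,
# leaving observations at unequal intervals.
t = np.linspace(0.0, 1.0, 25)
y = t ** 2
miss = np.array([5, 11, 18])
keep = np.setdiff1d(np.arange(25), miss)
y_hat = rbf_ridge_impute(t[keep], y[keep], t[miss])
```

The `gamma` and `lam` values here are arbitrary choices for the toy curve; in practice they would be tuned, and the paper additionally combines the LSSVM with an RBF network.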
Affiliation(s)
- Yaqiu Li
- China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
- Key Laboratory of Active Medical Devices Quality & Reliability Management and Assessment, No. 76, West Zhucun Avenue, Guangzhou, China
- Qijie Zhou
- China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
- Key Laboratory of Active Medical Devices Quality & Reliability Management and Assessment, No. 76, West Zhucun Avenue, Guangzhou, China
- Ye Fan
- Beijing Institute of Structure and Environment Engineer, No.1, South Dahongmen Avenue, Beijing, China
- Guangze Pan
- China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
- Guangdong Provincial Key Laboratory of Electronic Information Products Reliability Technology, No. 76, West Zhucun Avenue, Guangzhou, China
- Zongbei Dai
- China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
- Baimao Lei
- China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
4
Ngueilbaye A, Huang JZ, Khan M, Wang H. Data quality model for assessing public COVID-19 big datasets. The Journal of Supercomputing 2023:1-33. [PMID: 37359333] [PMCID: PMC10230148] [DOI: 10.1007/s11227-023-05410-0] [Accepted: 05/17/2023]
Abstract
High-quality data are crucial for decision-making support and evidence-based healthcare, particularly when the relevant knowledge is lacking. For public health practitioners and researchers, reported COVID-19 data need to be accurate and easily available. Each nation has a system in place for reporting COVID-19 data, although the efficacy of these systems has not been thoroughly evaluated, and the COVID-19 pandemic has shown widespread flaws in data quality. We propose a data quality model (a canonical data model, four adequacy levels, and Benford's law) to assess quality issues in the COVID-19 data reporting carried out by the World Health Organization (WHO) in the six Central African Economic and Monetary Community (CEMAC) region countries between March 6, 2020, and June 22, 2022, and we suggest potential solutions. The adequacy levels can be interpreted as dependability indicators for the inspection of big datasets. The model effectively identified the quality of input data for big-dataset analytics. Its future development requires scholars and institutions from all sectors to deepen their understanding of its core concepts, improve its integration with other data-processing technologies, and broaden the scope of its applications.
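The Benford's-law component can be sketched as follows (a minimal illustration, not the paper's full model): compare the leading-digit distribution of a reported series against Benford's expected frequencies P(d) = log10(1 + 1/d), flagging series whose digits deviate strongly.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_deviation(values):
    """Mean absolute deviation of observed leading-digit frequencies
    from Benford's law; larger values flag possible quality problems."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum(abs(counts.get(d, 0) / n - math.log10(1 + 1 / d))
               for d in range(1, 10)) / 9

# Powers of 2 are a classic Benford-conforming sequence; a flat
# (uniform) digit distribution deviates much more.
conforming = [2 ** k for k in range(1, 300)]
flat = list(range(1, 10)) * 30
```

Real case-count series would replace the synthetic sequences here; the deviation score is only one of several statistics one could use (chi-squared tests against Benford frequencies are also common).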
Affiliation(s)
- Alladoumbaye Ngueilbaye
- Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 Guangdong China
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060 Guangdong China
- Joshua Zhexue Huang
- Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 Guangdong China
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060 Guangdong China
- Mehak Khan
- Department of Computer Science, AI Lab, Oslo Metropolitan University, Oslo, Norway
- Hongzhi Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001 Heilongjiang China
5
Multiple imputation method of missing credit risk assessment data based on generative adversarial networks. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109273]
6
Support vector machine regression to predict gas diffusion coefficient of biochar-amended soil. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109345]
7
Ngueilbaye A, Wang H, Mahamat DA, Elgendy IA. SDLER: stacked dedupe learning for entity resolution in big data era. The Journal of Supercomputing 2021; 77:10959-10983. [DOI: 10.1007/s11227-021-03710-x] [Accepted: 02/23/2021]