1
Josephides JM, Chen CL. Unravelling single-cell DNA replication timing dynamics using machine learning reveals heterogeneity in cancer progression. Nat Commun 2025; 16:1472. [PMID: 39922809 PMCID: PMC11807193 DOI: 10.1038/s41467-025-56783-0] [Received: 02/15/2024] [Accepted: 01/29/2025] [Indexed: 02/10/2025]
Abstract
Genomic heterogeneity has largely been overlooked in single-cell replication timing (scRT) studies. Here, we develop MnM, an efficient machine learning-based tool that allows disentangling scRT profiles from heterogeneous samples. We use single-cell copy number data to accurately perform missing value imputation, identify cell replication states, and detect genomic heterogeneity. This allows us to separate somatic copy number alterations from copy number changes resulting from DNA replication. Our methodology brings critical insights into chromosomal aberrations and highlights the ubiquitous aneuploidy process during tumorigenesis. The copy number and scRT profiles obtained by analysing >119,000 high-quality human single cells from different cell lines, patient tumours and patient-derived xenograft samples lead to a multi-sample heterogeneity-resolved scRT atlas. This atlas is an important resource for cancer research and demonstrates that scRT profiles can be used to study replication timing heterogeneity in cancer. Our findings also highlight the importance of studying cancer tissue samples to comprehensively grasp the complexities of DNA replication because cell lines, although convenient, lack dynamic environmental factors. These results facilitate future research at the interface of genomic instability and replication stress during cancer progression.
Affiliation(s)
- Joseph M Josephides
  - Institut Curie, PSL Research University, CNRS UMR3244, Dynamics of Genetic Information, Sorbonne Université, Paris, France
- Chun-Long Chen
  - Institut Curie, PSL Research University, CNRS UMR3244, Dynamics of Genetic Information, Sorbonne Université, Paris, France

2
Sasu GV, Ciubotaru BI, Goga N, Vasilățeanu A. Addressing Missing Data Challenges in Geriatric Health Monitoring: A Study of Statistical and Machine Learning Imputation Methods. SENSORS (BASEL, SWITZERLAND) 2025; 25:614. [PMID: 39943253 PMCID: PMC11820420 DOI: 10.3390/s25030614] [Received: 11/14/2024] [Revised: 12/24/2024] [Accepted: 01/07/2025] [Indexed: 02/16/2025]
Abstract
In geriatric healthcare, missing data pose significant challenges, especially in systems used for frailty monitoring in elderly individuals. This study explores advanced imputation techniques used to enhance data quality and maintain model performance in a system designed to detect frailty insights. We introduce three missing data mechanisms, Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), into a dataset collected from smart bracelets, simulating real-world conditions. Imputation methods, including Expectation-Maximization (EM), matrix completion, Bayesian networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Generative Adversarial Imputation Networks (GAINs), Variational Autoencoder (VAE), and GRU-D, were evaluated based on normalized Mean Squared Error (MSE), Mean Absolute Error (MAE), and R² metrics. The results demonstrate that KNN and SVM consistently outperform other methods across all three mechanisms due to their ability to adapt to diverse patterns of missingness. Specifically, KNN and SVM excel in MAR conditions by leveraging observed data relationships to accurately infer missing values, while their robustness to randomness enables superior performance under MCAR scenarios. In MNAR contexts, KNN and SVM effectively handle unobserved dependencies by identifying underlying patterns in the data, outperforming methods like GRU-D and VAE. These findings highlight the importance of selecting imputation methods based on the characteristics of missing data mechanisms, emphasizing the versatility and reliability of KNN and SVM in healthcare applications. This study advocates for hybrid approaches in healthcare applications like the cINnAMON project, which supports elderly individuals at risk of frailty through non-intrusive home monitoring systems.
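The evaluation loop described in this abstract (hide known values, impute them, score the imputations) can be sketched in a few lines. The following is a minimal illustration only, not the authors' pipeline: the function names `mask_mcar`, `knn_impute` and `masked_errors` are hypothetical, only the MCAR mechanism is simulated, and the KNN variant shown (mean of the k nearest donor rows, with distance taken over jointly observed columns) is one simple choice among several.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Hide each cell independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def _distance(a, b):
    """Root mean squared difference over the columns observed in both rows."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(shared) / len(shared)) if shared else math.inf


def knn_impute(rows, k=3):
    """Replace each None with the mean of that column over the k nearest donor rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:  # stays None if no row has column j observed
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def masked_errors(truth, masked, filled):
    """Return (MSE, MAE) computed only over cells that were hidden and then filled."""
    diffs = [
        filled[i][j] - truth[i][j]
        for i, row in enumerate(masked)
        for j, v in enumerate(row)
        if v is None and filled[i][j] is not None
    ]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


# Hand-built example: one hidden cell in the third row.
truth = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [10.0, 20.0]]
masked = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [10.0, 20.0]]
filled = knn_impute(masked, k=2)
mse, mae = masked_errors(truth, masked, filled)
```

For a randomised run, `mask_mcar(truth, 0.2, random.Random(0))` would produce the masked copy instead of the hand-built one. MAR and MNAR would need masks that depend on observed or unobserved values respectively, which this sketch does not attempt.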
Affiliation(s)
- Gabriel-Vasilică Sasu
  - Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica Bucharest, 060042 Bucharest, Romania
- Bogdan-Iulian Ciubotaru
  - Military Equipment and Technologies Research Agency (METRA), Ministry of National Defence, Clinceni, 077025 Ilfov, Romania
- Nicolae Goga
  - The Faculty of Engineering in Foreign Languages, National University of Science and Technology Politehnica Bucharest, 060042 Bucharest, Romania
- Andrei Vasilățeanu
  - The Faculty of Engineering in Foreign Languages, National University of Science and Technology Politehnica Bucharest, 060042 Bucharest, Romania

3
Dong W, Da Roza CC, Cheng D, Zhang D, Xiang Y, Seto WK, Wong WCW. Development and validation of HBV surveillance models using big data and machine learning. Ann Med 2024; 56:2314237. [PMID: 38340309 PMCID: PMC10860422 DOI: 10.1080/07853890.2024.2314237] [Received: 09/25/2023] [Accepted: 01/30/2024] [Indexed: 02/12/2024]
Abstract
BACKGROUND The construction of a robust healthcare information system is fundamental to enhancing countries' capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China's rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization's (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. METHODS Relevant data records extracted from the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital's Hospital Information System were structured using state-of-the-art Natural Language Processing techniques. Several ML methods were used to develop HBV risk assessment models. The performance of the ML model was then interpreted using the Shapley value (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 using a five-fold cross-validation framework. RESULTS The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients' physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness of fit test p-value >0.05). CONCLUSIONS Suspected case detection models of HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in primary care settings in China.
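The abstract does not spell out how the 2:1 division and the five-fold cross-validation fit together. One plausible reading, sketched below, is a random 2:1 split into derivation and validation samples, followed by five-fold cross-validation inside the derivation sample. The function names and the seed are hypothetical. The reported derivation sample size (105,992) is exactly two-thirds of the 158,988 records, which is consistent with this reading.

```python
import random


def split_two_to_one(n, seed=0):
    """Shuffle indices 0..n-1 and split them 2:1 into derivation and validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = (2 * n) // 3
    return indices[:cut], indices[cut:]


def five_folds(indices, n_folds=5):
    """Yield (train, held_out) index lists, one pair per fold."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        train = [idx for m, fold in enumerate(folds) if m != i for idx in fold]
        yield train, held_out


derivation, validation = split_two_to_one(158988)
folds = list(five_folds(derivation))
```

Each record lands in exactly one held-out fold, so every derivation case is used once for internal testing while the validation third stays untouched until the end.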
Affiliation(s)
- Weinan Dong
  - Department of Family Medicine and Primary Care, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Cecilia Clara Da Roza
  - Department of Family Medicine and Primary Care, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Dandan Cheng
  - Department of Family Medicine and Primary Care, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China
- Dahao Zhang
  - Department of Family Medicine and Primary Care, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China
- Yuling Xiang
  - Department of Family Medicine and Primary Care, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China
- Wai Kay Seto
  - Department of Medicine and State Key Laboratory of Liver Research, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
  - Department of Medicine, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China
- William C. W. Wong
  - Department of Family Medicine and Primary Care, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
  - Department of Family Medicine and Primary Care, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China

4
Hu YH, Wu RY, Lin YC, Lin TY. A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications. BMC Med Res Methodol 2024; 24:269. [PMID: 39516783 PMCID: PMC11546113 DOI: 10.1186/s12874-024-02392-2] [Received: 08/11/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024]
Abstract
BACKGROUND Missing values in datasets present significant challenges for data analysis, particularly in the medical field where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored. METHODS This study introduces a novel imputation method, "recursive feature elimination-MissForest" (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Different missing data rates, ranging from 10 to 50%, are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired samples t-tests are employed to analyze the statistical significance of differences among the outcomes. RESULTS The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates in relation to varying missing data rates. 
CONCLUSION This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.
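The two metrics named in the METHODS can be computed as follows. This is a sketch under stated assumptions: NRMSE is normalised here by the variance of the true values, one common convention (others divide by the range or standard deviation), and PFC is taken to mean the share of categorical entries that were imputed incorrectly. The abstract does not give either formula, so both definitions and the function names are assumptions. Inputs are the true and imputed values at the artificially masked positions only.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """Root of (mean squared error / variance of the true values); lower is better."""
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / len(true_values)
    # Population variance; raises ZeroDivisionError below if all true values are equal.
    return math.sqrt(mse / statistics.pvariance(true_values))


def pfc(true_labels, imputed_labels):
    """Fraction of categorical entries whose imputed label differs from the truth."""
    wrong = sum(1 for t, p in zip(true_labels, imputed_labels) if t != p)
    return wrong / len(true_labels)


numeric_score = nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
categorical_score = pfc(["a", "b", "a", "c"], ["a", "b", "c", "c"])
```

Both scores are 0 for a perfect imputation, which makes them directly comparable across missing data rates.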
Affiliation(s)
- Ya-Han Hu
  - Department of Information Management, National Central University, Taoyuan City, Taiwan
- Ruei-Yan Wu
  - Department of Information Management, National Central University, Taoyuan City, Taiwan
- Yen-Cheng Lin
  - Department of Information Management, National Central University, Taoyuan City, Taiwan
- Ting-Yin Lin
  - Department of Laboratory Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan

5
Rahadian RE, Tan HQ, Ho BS, Kumaran A, Villanueva A, Sng J, Tan RSYC, Tan TJY, Tan VKM, Tan BKT, Lim GH, Cai Y, Nei WL, Wong FY. Using Machine Learning Models to Predict Pathologic Complete Response to Neoadjuvant Chemotherapy in Breast Cancer. JCO Clin Cancer Inform 2024; 8:e2400071. [PMID: 39576956 DOI: 10.1200/cci.24.00071] [Received: 03/27/2024] [Revised: 08/15/2024] [Accepted: 10/11/2024] [Indexed: 11/24/2024]
Abstract
PURPOSE Neoadjuvant chemotherapy (NAC) is increasingly used in breast cancer. Predictive modeling is useful in predicting pathologic complete response (pCR) to NAC. We test machine learning (ML) models to predict pCR in breast cancer and explore methods of handling missing data. METHODS Four hundred and ninety-nine patients with breast cancer treated with NAC in two centers in Singapore (National Cancer Centre Singapore [NCCS] and KK Hospital) between January 2014 and December 2017 were included. Eleven clinical features were used to train five different ML models. Listwise deletion and imputation were evaluated on handling missing data. Model performance was evaluated by AUC and calibration (Brier score). Feature importance from the best performing model in the external testing data set was calculated using Shapley additive explanations. RESULTS Seventy-two (24.6%), 18 (24.7%), and 31 (24.8%) patients attained pCR in NCCS training, NCCS testing, and KK Women's and Children's Hospital (KKH) testing data sets, respectively. The random forest (RF) base and imputed models have the highest AUCs in the KKH cohort of 0.794 (95% CI, 0.709 to 0.873) and 0.795 (95% CI, 0.706 to 0.871), respectively, and were the best calibrated with the lowest Brier score. No statistically significant difference was noted between AUCs of the base and imputed models in all data sets. The imputed model had a larger positive predictive value (PPV; 98.2% v 95.1%) and negative predictive value (NPV; 96.7% v 90.0%) than the base model in the KKH data set. Estrogen receptor intensity, human epidermal growth factor 2 intensity, and age at diagnosis were the three most important predictors. CONCLUSION ML, particularly RF, demonstrates reasonable accuracy in pCR prediction after NAC. Imputing missing fields in the data can improve the PPV and NPV of the pCR prediction model.
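For readers unfamiliar with the calibration and predictive-value measures reported here, the sketch below shows how a Brier score, PPV and NPV are computed from binary outcomes and predicted probabilities. The 0.5 decision threshold and the toy data are assumptions for illustration; the abstract does not state which threshold the authors used.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def ppv_npv(y_true, y_prob, threshold=0.5):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


outcomes = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.6]
score = brier_score(outcomes, probabilities)
ppv, npv = ppv_npv(outcomes, probabilities)
```

A lower Brier score indicates better calibrated probabilities, while PPV and NPV depend on both the threshold and the prevalence of pCR in the cohort.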
Affiliation(s)
- Hong Qi Tan
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Bryan Shihan Ho
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Arjunan Kumaran
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Andre Villanueva
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Joy Sng
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Ryan Shea Ying Cong Tan
  - Division of Medical Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Tira Jing Ying Tan
  - Division of Medical Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Veronique Kiak Mien Tan
  - Division of Breast Surgery, Singapore General Hospital, Singapore Health Services, Singapore, Singapore
- Benita Kiat Tee Tan
  - Division of Breast Surgery, Sengkang General Hospital, Singapore Health Services, Singapore, Singapore
- Geok Hoon Lim
  - Breast Department, KK Women's and Children's Hospital, Singapore, Singapore
- Yiyu Cai
  - School of Mechanical & Aerospace Engineering, Nanyang Technological University, Singapore, Singapore
- Wen Long Nei
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore
- Fuh Yong Wong
  - Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore

6
Alsaber AR, Al-Herz A, Alawadhi B, Doush IA, Setiya P, AL-Sultan AT, Saleh K, Al-Awadhi A, Hasan E, Al-Kandari W, Mokaddem K, Ghanem AA, Attia Y, Hussain M, AlHadhood N, Ali Y, Tarakmeh H, Aldabie G, AlKadi A, Alhajeri H. Machine learning-based remission prediction in rheumatoid arthritis patients treated with biologic disease-modifying anti-rheumatic drugs: findings from the Kuwait rheumatic disease registry. Front Big Data 2024; 7:1406365. [PMID: 39421133 PMCID: PMC11484091 DOI: 10.3389/fdata.2024.1406365] [Received: 03/25/2024] [Accepted: 09/12/2024] [Indexed: 10/19/2024]
Abstract
Background Rheumatoid arthritis (RA) is a common condition treated with biological disease-modifying anti-rheumatic medicines (bDMARDs). However, many patients exhibit resistance, necessitating the use of machine learning models to predict remissions in patients treated with bDMARDs, thereby reducing healthcare costs and minimizing negative effects. Objective The study aims to develop machine learning models using data from the Kuwait Registry for Rheumatic Diseases (KRRD) to identify clinical characteristics predictive of remission in RA patients treated with biologics. Methods The study collected follow-up data from 1,968 patients treated with bDMARDs from four public hospitals in Kuwait from 2013 to 2022. Machine learning techniques like lasso, ridge, support vector machine, random forest, XGBoost, and Shapley additive explanation were used to predict remission at a 1-year follow-up. Results The study used the Shapley plot in explainable Artificial Intelligence (XAI) to analyze the effects of predictors on remission prognosis across different types of bDMARDs. Top clinical features were identified for patients treated with bDMARDs, each associated with specific mean SHAP values. The findings highlight the importance of clinical assessments and specific treatments in shaping treatment outcomes. Conclusion The proposed machine learning model system effectively identifies clinical features predicting remission in bDMARDs, potentially improving treatment efficacy in rheumatoid arthritis patients.
Affiliation(s)
- Ahmad R. Alsaber
  - College of Business and Economics, American University of Kuwait, Salmiya, Kuwait
- Adeeba Al-Herz
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Balqees Alawadhi
  - Department of Food and Nutritional Sciences, The Public Authority for Applied Education & Training, Shuwaikh Industrial, Kuwait
- Iyad Abu Doush
  - College of Engineering and Applied Sciences, American University of Kuwait, Salmiya, Kuwait
  - Computer Science Department, Yarmouk University, Irbid, Jordan
- Parul Setiya
  - College of Agriculture, Govind Ballabh Pant University of Agriculture and Technology, Pantnagar, India
- Ahmad T. AL-Sultan
  - Department of Community Medicine and Behavioral Sciences, Kuwait University, Safat, Kuwait
- Khulood Saleh
  - Department of Rheumatology, Farwaniya Hospital, Kuwait City, Kuwait
- Adel Al-Awadhi
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Eman Hasan
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Khalid Mokaddem
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Aqeel A. Ghanem
  - Department of Rheumatology, Mubarak Al-Kabeer Hospital, Kuwait City, Kuwait
- Yousef Attia
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Mohammed Hussain
  - Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
- Naser AlHadhood
  - Department of Rheumatology, Farwaniya Hospital, Kuwait City, Kuwait
- Yaser Ali
  - Department of Rheumatology, Mubarak Al-Kabeer Hospital, Kuwait City, Kuwait
- Hoda Tarakmeh
  - Department of Rheumatology, Mubarak Al-Kabeer Hospital, Kuwait City, Kuwait
- Ghaydaa Aldabie
  - Department of Rheumatology, Farwaniya Hospital, Kuwait City, Kuwait
- Amjad AlKadi
  - Department of Rheumatology, Al-Sabah Hospital, Kuwait City, Kuwait
- Hebah Alhajeri
  - Department of Rheumatology, Mubarak Al-Kabeer Hospital, Kuwait City, Kuwait

7
Dong W, Wan EYF, Fong DYT, Tan KCB, Tsui WWS, Hui EMT, Chan KH, Fung CSC, Lam CLK. Development and validation of 10-year risk prediction models of cardiovascular disease in Chinese type 2 diabetes mellitus patients in primary care using interpretable machine learning-based methods. Diabetes Obes Metab 2024; 26:3969-3987. [PMID: 39010291 DOI: 10.1111/dom.15745] [Received: 03/12/2024] [Revised: 06/03/2024] [Accepted: 06/11/2024] [Indexed: 07/17/2024]
Abstract
AIM To develop 10-year cardiovascular disease (CVD) risk prediction models in Chinese patients with type 2 diabetes mellitus (T2DM) managed in primary care using machine learning (ML) methods. METHODS In this 10-year population-based retrospective cohort study, 141 516 Chinese T2DM patients aged 18 years or above, without history of CVD or end-stage renal disease and managed in public primary care clinics in 2008, were included and followed up until December 2017. Two-thirds of the patients were randomly selected to develop sex-specific CVD risk prediction models. The remaining one-third of patients were used as the validation sample to evaluate the discrimination and calibration of the models. ML-based methods were applied to missing data imputation, predictor selection, risk prediction modelling, model interpretation, and model evaluation. Cox regression was used to develop the statistical models in parallel for comparison. RESULTS During a median follow-up of 9.75 years, 32 445 patients (22.9%) developed CVD. Age, T2DM duration, urine albumin-to-creatinine ratio (ACR), estimated glomerular filtration rate (eGFR), systolic blood pressure variability and glycated haemoglobin (HbA1c) variability were the most important predictors. ML models also identified nonlinear effects of several predictors, particularly the U-shaped effects of eGFR and body mass index. The ML models showed a Harrell's C statistic of >0.80 and good calibration. The ML models performed significantly better than the Cox regression models in CVD risk prediction and achieved better risk stratification for individual patients. CONCLUSION Using routinely available predictors and ML-based algorithms, this study established 10-year CVD risk prediction models for Chinese T2DM patients in primary care. The findings highlight the importance of renal function indicators, and variability in both blood pressure and HbA1c as CVD predictors, which deserve more clinical attention. 
The derived risk prediction tools have the potential to support clinical decision making and encourage patients towards self-care, subject to further research confirming the models' feasibility, acceptability and applicability at the point of care.
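Harrell's C statistic, the discrimination measure reported in the RESULTS, is the proportion of comparable patient pairs in which the patient with the earlier event was also assigned the higher predicted risk. The sketch below is a plain O(n²) illustration with a hypothetical function name and toy data, not the authors' implementation; ties in predicted risk are counted as half-concordant, a common convention.

```python
def harrell_c(times, events, risks):
    """Concordance index for right-censored data.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up in years, event indicator (0 = censored), predicted risk.
follow_up = [2, 4, 6, 8]
had_event = [1, 1, 0, 1]
c_perfect = harrell_c(follow_up, had_event, [0.9, 0.7, 0.5, 0.1])
c_partial = harrell_c(follow_up, had_event, [0.9, 0.1, 0.5, 0.7])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported C statistic of >0.80 indicates that the models order patients by risk well.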
Affiliation(s)
- Weinan Dong
  - Department of Family Medicine and Primary Care, The University of Hong Kong, Hong Kong, China
- Eric Yuk Fai Wan
  - Department of Family Medicine and Primary Care, The University of Hong Kong, Hong Kong, China
  - Centre for Safe Medication Practice and Research, Department of Pharmacology and Pharmacy, The University of Hong Kong, Hong Kong, China
  - Advanced Data Analytics for Medical Science (ADAMS) Limited, Hong Kong, China
- Wendy Wing-Sze Tsui
  - Department of Family Medicine & Primary Healthcare, Hong Kong West Cluster, Hospital Authority, Hong Kong, China
- Eric Ming-Tung Hui
  - Department of Family Medicine, New Territories East Cluster, Hospital Authority, Hong Kong, China
- King Hong Chan
  - Department of Family Medicine & General Out-patient Clinics, Kowloon Central Cluster, Hospital Authority, Hong Kong, China
- Colman Siu Cheung Fung
  - Department of Family Medicine and Primary Care, The University of Hong Kong, Hong Kong, China
- Cindy Lo Kuen Lam
  - Department of Family Medicine and Primary Care, The University of Hong Kong, Hong Kong, China
  - Department of Family Medicine, The University of Hong Kong Shenzhen Hospital, Shenzhen, China

8
Afkanpour M, Hosseinzadeh E, Tabesh H. Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review. BMC Med Res Methodol 2024; 24:188. [PMID: 39198744 PMCID: PMC11351057 DOI: 10.1186/s12874-024-02310-6] [Received: 04/06/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024]
Abstract
BACKGROUND AND OBJECTIVES Comprehending the research dataset is crucial for obtaining reliable and valid outcomes. Health analysts must have a deep comprehension of the data being analyzed. This comprehension allows them to suggest practical solutions for handling missing data, in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in crucial areas like clinical research. With data's increasing diversity and complexity, numerous scholars have developed a range of imputation techniques. To address this, we conducted a systematic review to introduce various imputation techniques based on tabular dataset characteristics, including the mechanism, pattern, and ratio of missingness, to identify the most appropriate imputation methods in the healthcare field. MATERIALS AND METHODS We searched four information databases namely PubMed, Web of Science, Scopus, and IEEE Xplore, for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in a clinically structured dataset. Our investigation of selected articles focused on four key aspects: the mechanism, pattern, ratio of missingness, and various imputation strategies. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset. RESULTS Out of 2955 articles, 58 were included in the analysis. The findings from the development of the evidence map, based on the structure of the missing values and the types of imputation methods used in the extracted items from these studies, revealed that 45% of the studies employed conventional statistical methods, 31% utilized machine learning and deep learning methods, and 24% applied hybrid imputation techniques for handling missing values. 
CONCLUSION Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate data imputation technique, especially within conventional statistical methods. Accurately estimating missing values to reflect reality enhances the likelihood of obtaining high-quality and reusable data, contributing significantly to precise medical decision-making processes. Performing this review study creates a guideline for choosing the most appropriate imputation methods in data preprocessing stages to perform analytical processes on structured clinical datasets.
Affiliation(s)
- Marziyeh Afkanpour
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Elham Hosseinzadeh
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Hamed Tabesh
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran

9
Xu L, Zhao W, He J, Hou S, He J, Zhuang Y, Wang Y, Yang H, Xiao J, Qiu Y. Abdominal perfusion pressure is critical for survival analysis in patients with intra-abdominal hypertension: mortality prediction using incomplete data. Int J Surg 2024; 111:01279778-990000000-01889. [PMID: 39166944 PMCID: PMC11745648 DOI: 10.1097/js9.0000000000002026] [Received: 06/09/2024] [Accepted: 07/30/2024] [Indexed: 08/23/2024]
Abstract
BACKGROUND Abdominal perfusion pressure (APP) is a salient feature in the design of a prognostic model for patients with intra-abdominal hypertension (IAH). However, incomplete data significantly limits the size of the beneficiary patient population in clinical practice. Using advanced artificial intelligence methods, we developed a robust mortality prediction model with APP from incomplete data. METHODS We retrospectively evaluated the patients with IAH from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. Incomplete data were filled in using generative adversarial imputation nets (GAIN). Lastly, demographic, clinical, and laboratory findings were combined to build a 7-day mortality prediction model. RESULTS We included 1354 patients in this study, from whom 63 features were extracted. Data imputation with GAIN achieved the best performance. Patients with an APP < 60 mmHg had significantly higher all-cause mortality within 7 to 90 days. The difference remained significant in long-term survival even after propensity score matching (PSM) eliminated other mortality risks between groups. Lastly, the built machine learning model for 7-day mortality prediction achieved the best results with an AUC of 0.80 in patients with confirmed IAH, outperforming the other four traditional clinical scoring systems. CONCLUSIONS APP reduction is an important survival predictor affecting the survival prognosis of patients with IAH. We constructed a robust model to predict the 7-day mortality probability of patients with IAH, which is superior to the commonly used clinical scoring systems.
Affiliation(s)
- Liang Xu
  - Department of General Surgery, The Second Affiliated Hospital of the Army Medical University
  - Bio-Med Informatics Research Centre and Clinical Research Centre, The Second Affiliated Hospital of the Army Medical University
- Weijie Zhao
  - Bioengineering College, Chongqing University
- Jiao He
  - Department of Respiratory and Critical Care Medicine, The First Affiliated Hospital of Chongqing Medical University
- Siyu Hou
  - Bio-Med Informatics Research Centre and Clinical Research Centre, The Second Affiliated Hospital of the Army Medical University
- Jialin He
  - Department of Gastroenterology, The Second Affiliated Hospital of the Army Medical University
- Yan Zhuang
  - Medical Big Data Research Center, Chinese PLA General Hospital, Beijing, People’s Republic of China
- Ying Wang
  - Department of General Surgery, The Second Affiliated Hospital of the Army Medical University
- Hua Yang
  - Department of General Surgery, Chongqing General Hospital, Chongqing
- Jingjing Xiao
  - Bio-Med Informatics Research Centre and Clinical Research Centre, The Second Affiliated Hospital of the Army Medical University
- Yuan Qiu
  - Department of General Surgery, The Second Affiliated Hospital of the Army Medical University

10
Li J, Wang Z, Wu L, Qiu S, Zhao H, Lin F, Zhang K. Method for Incomplete and Imbalanced Data Based on Multivariate Imputation by Chained Equations and Ensemble Learning. IEEE J Biomed Health Inform 2024; 28:3102-3113. [PMID: 38483807 DOI: 10.1109/jbhi.2024.3376428] [Indexed: 03/20/2024]
Abstract
The classification analysis of incomplete and imbalanced data is still a challenging task, since both issues can negatively impact the training of classifiers, as we also found in our study on the physical fitness assessment of patients. In fields such as healthcare, there are higher requirements for the accuracy of the generated imputation values. To train a high-performance classifier and pursue high accuracy, we attempted to resolve any potential negative impact by using a novel algorithmic approach based on the combination of multivariate imputation by chained equations and the ensemble learning method (MICEEN), which can solve the two problems simultaneously. We used multivariate imputation by chained equations to generate more accurate imputation values for the training set, which was then passed to ensemble learning to build a predictor. In addition, missing values were introduced into minority-class samples and used to generate new samples belonging to the minority classes, in order to balance the distribution of classes. We performed extensive experiments on real-world datasets to assess our method and compare it to other state-of-the-art approaches. The advantages of the proposed method are demonstrated by experimental results on benchmark datasets and on self-collected datasets of physical fitness assessments of tumor patients with varying missing rates.
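As a rough illustration of the chained-equations idea mentioned in this abstract, the following is a minimal sketch and not the authors' MICEEN implementation. It assumes only two numeric columns and a plain least-squares fit, and it runs a single deterministic chain, whereas full MICE draws imputations from a predictive distribution and produces several completed datasets. The ensemble-learning and minority-class resampling steps are not shown. The names `fit_line` and `chained_imputation` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean of y
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def chained_imputation(rows, n_iter=10):
    """Iteratively impute None entries in a list of two-column rows."""
    n_cols = 2  # assumption of this sketch: exactly two variables
    missing = [[v is None for v in row] for row in rows]
    # Step 1: initialise every missing entry with its column mean.
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    data = [
        [col_means[j] if r[j] is None else float(r[j]) for j in range(n_cols)]
        for r in rows
    ]
    # Step 2: cycle through the columns, regressing each one on the other
    # using the rows where the target column was actually observed.
    for _ in range(n_iter):
        for target in range(n_cols):
            predictor = 1 - target
            train = [r for r, m in zip(data, missing) if not m[target]]
            slope, intercept = fit_line(
                [r[predictor] for r in train], [r[target] for r in train]
            )
            for r, m in zip(data, missing):
                if m[target]:
                    r[target] = slope * r[predictor] + intercept
    return data


# Toy data in which the second column is twice the first.
rows = [[1, 2], [2, 4], [3, None], [None, 8], [5, 10]]
completed = chained_imputation(rows)
```

Observed values are left untouched and only the `None` entries are overwritten on each pass, which is the property that lets the chain refine earlier guesses.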
|
11
|
Ou H, Yao Y, He Y. Missing Data Imputation Method Combining Random Forest and Generative Adversarial Imputation Network. Sensors (Basel, Switzerland) 2024; 24:1112. [PMID: 38400270 PMCID: PMC10893362 DOI: 10.3390/s24041112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 01/27/2024] [Accepted: 02/06/2024] [Indexed: 02/25/2024]
Abstract
(1) Background: To address the problem of time-series data that are missing because of the acquisition system or external factors, an interpolation method for missing time-series data based on random forest and a generative adversarial imputation network (GAIN) is proposed. (2) Methods: First, the position of the missing part of the data is calibrated, and the trained random forest algorithm is used for the first data interpolation. The output value of the random forest algorithm is used as the input value of the GAIN, and the GAIN is used to calibrate the position. The data are then interpolated a second time, combining the advantages of the two algorithms so that the interpolation result is closer to the true value. (3) Results: The filling performance of the algorithm is tested on a bearing data set, and the root mean square error (RMSE) is used to evaluate the interpolation results. The results show that the RMSE values of the combined random forest and GAIN algorithm for single-segment and multi-segment missing data are only 0.0157, 0.0386, and 0.0527, which is better than the random forest algorithm, the GAIN algorithm, and the K-nearest neighbor algorithm. (4) Conclusions: The proposed algorithm performs well in each data set and provides a reference method in the field of data filling.
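The RMSE-based evaluation described in this abstract can be sketched as follows: hide a contiguous segment of a complete series, fill it in, and score only the hidden positions. This is a minimal sketch under stated assumptions. The mean-value filler is a stand-in for the random forest and GAIN stages, which are not reproduced here, the sine series is synthetic rather than bearing data, and the function names are hypothetical.

```python
import math


def mask_segment(series, start, length):
    """Hide one contiguous segment, returning the masked copy and hidden indices."""
    hidden = list(range(start, start + length))
    masked = [None if start <= i < start + length else v for i, v in enumerate(series)]
    return masked, hidden


def mean_impute(masked):
    """Stand-in imputer: fill every gap with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_on_hidden(truth, imputed, hidden):
    """Root mean square error computed over the hidden positions only."""
    squared = sum((truth[i] - imputed[i]) ** 2 for i in hidden)
    return math.sqrt(squared / len(hidden))


# Synthetic series used purely for illustration.
truth = [math.sin(0.3 * t) for t in range(50)]
masked, hidden = mask_segment(truth, start=20, length=10)
imputed = mean_impute(masked)
score = rmse_on_hidden(truth, imputed, hidden)
```

Scoring only the hidden positions keeps the observed values, which every method reproduces exactly, from diluting the error. A multi-segment test would repeat the masking step for several segments before imputing.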
Affiliation(s)
- Yunan Yao
- School of Naval Architecture, Ocean and Energy Power Engineering, Wuhan University of Technology, Wuhan 430063, China; (H.O.); (Y.H.)
|
12
|
Liu M, Li S, Yuan H, Ong MEH, Ning Y, Xie F, Saffari SE, Shang Y, Volovici V, Chakraborty B, Liu N. Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques. Artif Intell Med 2023; 142:102587. [PMID: 37316097 DOI: 10.1016/j.artmed.2023.102587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2022] [Revised: 04/08/2023] [Accepted: 05/16/2023] [Indexed: 06/16/2023]
Abstract
OBJECTIVE The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on the types of data, intending to assist healthcare researchers from various disciplines in dealing with missing data. MATERIALS AND METHODS We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to February 8, 2023 that described the use of DL-based models for imputation. We examined selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. RESULTS Out of 1822 articles, a total of 111 were included, of which tabular static data (29%, 32/111) and temporal data (40%, 44/111) were the most frequently investigated. Our findings revealed a discernible pattern in the choice of model backbones and data types, for example, the dominance of autoencoder and recurrent neural networks for tabular temporal data. The discrepancy in imputation strategy usage among data types was also observed. The "integrated" imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52%, 23/44) and multi-modal data (56%, 5/9). Moreover, DL-based imputation methods yielded a higher level of imputation accuracy than non-DL methods in most studies. CONCLUSION The DL-based imputation models are a family of techniques, with diverse network structures. Their designation in healthcare is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, it is highly possible for them to achieve satisfactory results for a particular data type or dataset. There are, however, still issues with regard to portability, interpretability, and fairness associated with current DL-based imputation models.
Affiliation(s)
- Mingxuan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Siqi Li
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Han Yuan
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Marcus Eng Hock Ong
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore
- Yilin Ning
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Feng Xie
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
- Seyed Ehsan Saffari
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
- Yuqing Shang
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Victor Volovici
- Department of Neurosurgery, Erasmus MC University Medical Center, Rotterdam, the Netherlands
- Bibhas Chakraborty
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; Department of Statistics and Data Science, National University of Singapore, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
- Nan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; SingHealth AI Office, Singapore Health Services, Singapore; Institute of Data Science, National University of Singapore, Singapore.
|
13
|
Cascella M, Scarpati G, Bignami EG, Cuomo A, Vittori A, Di Gennaro P, Crispo A, Coluccia S. Utilizing an artificial intelligence framework (conditional generative adversarial network) to enhance telemedicine strategies for cancer pain management. Journal of Anesthesia, Analgesia and Critical Care (Online) 2023; 3:19. [PMID: 37386680 DOI: 10.1186/s44158-023-00104-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/24/2023] [Accepted: 06/13/2023] [Indexed: 07/01/2023]
Abstract
BACKGROUND The utilization of artificial intelligence (AI) in healthcare has significant potential to revolutionize the delivery of medical services, particularly in the field of telemedicine. In this article, we investigate the capabilities of a specific deep learning model, a generative adversarial network (GAN), and explore its potential for enhancing the telemedicine approach to cancer pain management. MATERIALS AND METHODS We implemented a structured dataset comprising demographic and clinical variables from 226 patients and 489 telemedicine visits for cancer pain management. The deep learning model, specifically a conditional GAN, was employed to generate synthetic samples that closely resemble real individuals in terms of their characteristics. Subsequently, four machine learning (ML) algorithms were used to assess the variables associated with a higher number of remote visits. RESULTS The generated dataset exhibits a distribution comparable to the reference dataset for all considered variables, including age, number of visits, tumor type, performance status, characteristics of metastasis, opioid dosage, and type of pain. Among the algorithms tested, random forest demonstrated the highest performance in predicting a higher number of remote visits, achieving an accuracy of 0.8 on the test data. The simulations based on ML indicated that individuals who are younger than 45 years old, and those experiencing breakthrough cancer pain, may require an increased number of telemedicine-based clinical evaluations. CONCLUSION As the advancement of healthcare processes relies on scientific evidence, AI techniques such as GANs can play a vital role in bridging knowledge gaps and accelerating the integration of telemedicine into clinical practice. Nonetheless, it is crucial to carefully address the limitations of these approaches.
Affiliation(s)
- Marco Cascella
- Department of Anesthesia and Critical Care, Istituto Nazionale Tumori-IRCCS, Fondazione Pascale, 80100, Naples, Italy.
- Giuliana Scarpati
- Department of Medicine, Surgery and Dentistry "Scuola Medica Salernitana," University of Salerno, 84084, Baronissi, SA, Italy
- Elena Giovanna Bignami
- Critical Care and Pain Medicine Division, Department of Medicine and Surgery, University of Parma, Viale Gramsci 14, 43126, Parma, Italy
- Arturo Cuomo
- Department of Anesthesia and Critical Care, Istituto Nazionale Tumori-IRCCS, Fondazione Pascale, 80100, Naples, Italy
- Alessandro Vittori
- Department of Anesthesia and Critical Care, ARCO Roma, Ospedale Pediatrico Bambino Gesù IRCCS, Rome, Italy
- Piergiacomo Di Gennaro
- Epidemiology and Biostatistics Unit, Istituto Nazionale Tumori-IRCCS, Fondazione Pascale, 80100, Naples, Italy
- Anna Crispo
- Epidemiology and Biostatistics Unit, Istituto Nazionale Tumori-IRCCS, Fondazione Pascale, 80100, Naples, Italy
- Sergio Coluccia
- Epidemiology and Biostatistics Unit, Istituto Nazionale Tumori-IRCCS, Fondazione Pascale, 80100, Naples, Italy
|
14
|
Ge Y, Li Z, Zhang J. A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods. Sci Rep 2023; 13:9432. [PMID: 37296269 PMCID: PMC10256703 DOI: 10.1038/s41598-023-36509-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/07/2023] [Accepted: 06/05/2023] [Indexed: 06/12/2023] Open
Abstract
The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect their performance. In designing the application scenarios, we considered different missing mechanisms, sample sizes, missing rates, the correlation between variables, value distributions, and the number of missing variables. We used data simulation techniques to establish a variety of compound scenarios for missing dichotomous variables and conducted real-data validation on two real-world medical datasets. We comprehensively compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario. Accuracy and mean absolute error (MAE) were used to evaluate their performance. The results showed that missing mechanisms, value distributions and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and were of potential applicability. Researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications when encountering dichotomous missing data.
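To make the two evaluation metrics in this abstract concrete, the sketch below applies mode imputation, the simplest of the eight methods listed, to a 0/1 variable and scores it on deliberately hidden entries. It is a minimal illustration on made-up values, not the study's simulation design, and the function names are hypothetical.

```python
from collections import Counter


def mode_impute(values):
    """Fill every None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def accuracy_and_mae(truth, imputed, hidden):
    """Accuracy and mean absolute error over the hidden positions only."""
    n = len(hidden)
    correct = sum(truth[i] == imputed[i] for i in hidden)
    mae = sum(abs(truth[i] - imputed[i]) for i in hidden) / n
    return correct / n, mae


# Made-up dichotomous variable; four entries are hidden to act as missing data.
truth = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
hidden = [1, 3, 7, 9]
masked = [None if i in hidden else v for i, v in enumerate(truth)]
imputed = mode_impute(masked)
accuracy, mae = accuracy_and_mae(truth, imputed, hidden)
```

When the imputed values are hard 0/1 labels, each absolute error is either 0 or 1, so MAE equals one minus accuracy. The two metrics carry different information only when an imputer outputs probabilities or other non-binary values.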
Affiliation(s)
- Yingfeng Ge
- Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China
- Zhiwei Li
- Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China
- Jinxin Zhang
- Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China.
|
15
|
Youn HM, Quan J, Mak IL, Yu EYT, Lau CS, Ip MSM, Tang SCW, Wong ICK, Lau KK, Lee MSF, Ng CS, Grépin KA, Chao DVK, Ko WWK, Lam CLK, Wan EYF. Long-term spill-over impact of COVID-19 on health and healthcare of people with non-communicable diseases: a study protocol for a population-based cohort and health economic study. BMJ Open 2022; 12:e063150. [PMID: 35973704 PMCID: PMC9385580 DOI: 10.1136/bmjopen-2022-063150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
INTRODUCTION The COVID-19 pandemic has a significant spill-over effect on people with non-communicable diseases (NCDs) over the long term, beyond the direct effect of COVID-19 infection. Evaluating changes in health outcomes, health service use and costs can provide evidence to optimise care for people with NCDs during and after the pandemic, and to better prepare outbreak responses in the future. METHODS AND ANALYSIS This is a population-based cohort study using electronic health records of the Hong Kong Hospital Authority (HA) CMS, economic modelling and serial cross-sectional surveys on health service use. This study includes people aged ≥18 years who have a documented diagnosis of diabetes mellitus, hypertension, cardiovascular disease, cancer, chronic respiratory disease or chronic kidney disease with at least one attendance at an HA hospital or clinic between 1 January 2010 and 31 December 2019, and without COVID-19 infection. Changes in all-cause mortality, disease-specific outcomes, and health service use rates and costs will be assessed between the pre-COVID-19 and post-COVID-19 pandemic periods, or during each wave, using an interrupted time series analysis. The long-term health economic impact of healthcare disruptions during the COVID-19 pandemic will be studied using microsimulation modelling. Multivariable Cox proportional hazards regression and Poisson/negative binomial regression will be used to evaluate the effect of different modes of supplementary care on health outcomes. ETHICS AND DISSEMINATION The study was approved by the institutional review board of the University of Hong Kong, the HA Hong Kong West Cluster (reference number UW 21-297). The study findings will be disseminated through peer-reviewed publications and international conferences.
Affiliation(s)
- Hin Moi Youn
- Department of Family Medicine and Primary Care, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Jianchao Quan
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Ivy Lynn Mak
- Department of Family Medicine and Primary Care, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Esther Yee Tak Yu
- Department of Family Medicine and Primary Care, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Chak Sing Lau
- School of Clinical Medicine, Department of Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Mary Sau Man Ip
- Division of Respiratory, Department of Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Sydney Chi Wai Tang
- Division of Nephrology, Department of Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Ian Chi Kei Wong
- Centre for Safe Medication Practice and Research, Department of Pharmacology and Pharmacy, The University of Hong Kong, Hong Kong SAR, China
- School of Pharmacy, University College London, London, UK
- Aston Pharmacy School, Aston University, Birmingham, UK
- Department of Pharmacy, The University of Hong Kong-Shenzhen Hospital, Shenzhen, China
- Laboratory of Data Discovery for Health (D24H), Hong Kong SAR, China
- Kui Kai Lau
- Division of Neurology, Department of Medicine, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- The State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong SAR, China
- Michael Shing Fung Lee
- Department of Clinical Oncology, Tuen Mun Hospital, Hospital Authority, Hong Kong SAR, China
- Department of Clinical Oncology, Queen Mary Hospital, Hong Kong SAR, China
- Department of Radiation Oncology, National University Cancer Institute, Singapore
- Carmen S Ng
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Karen Ann Grépin
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- David Vai Kiong Chao
- Department of Family Medicine and Primary Health Care, Hospital Authority Kowloon East Cluster, Hong Kong SAR, China
- Welchie Wai Kit Ko
- Department of Family Medicine and Primary Health Care, Hospital Authority Hong Kong West Cluster, Hong Kong SAR, China
- Cindy Lo Kuen Lam
- Department of Family Medicine and Primary Care, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Department of Family Medicine, The University of Hong Kong Shenzhen Hospital, Shenzhen, China
- Eric Yuk Fai Wan
- Department of Family Medicine and Primary Care, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Centre for Safe Medication Practice and Research, Department of Pharmacology and Pharmacy, The University of Hong Kong, Hong Kong SAR, China
- Laboratory of Data Discovery for Health (D24H), Hong Kong SAR, China
|
16
|
Jiang X, Yang Z, Wang S, Deng S. “Big Data” Approaches for Prevention of the Metabolic Syndrome. Front Genet 2022; 13:810152. [PMID: 35571045 PMCID: PMC9095427 DOI: 10.3389/fgene.2022.810152] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/06/2021] [Accepted: 03/28/2022] [Indexed: 11/21/2022] Open
Abstract
Metabolic syndrome (MetS) is characterized by the concurrence of multiple metabolic disorders, resulting in an increased risk of a variety of diseases related to disrupted metabolic homeostasis. The prevalence of MetS has reached a pandemic level worldwide. In recent years, an extensive amount of data has been generated by research targeting or related to the condition, using techniques including high-throughput screening and artificial intelligence. With these “big data”, the prevention of MetS could be pushed to an earlier stage using different data sources, data mining tools and analytic tools at different levels. In this review we briefly summarize the recent advances in the study of “big data” applications in the three-level disease prevention for MetS, and illustrate how these technologies could contribute to better preventive strategies.
Affiliation(s)
- Xinping Jiang
- Department of United Ultrasound, The First Hospital of Jilin University, Changchun, China
- Zhang Yang
- Department of Vascular Surgery, The First Hospital of Jilin University, Changchun, China
- Shuai Wang
- Department of Vascular Surgery, The First Hospital of Jilin University, Changchun, China
- Shuanglin Deng
- Department of Oncological Neurosurgery, The First Hospital of Jilin University, Changchun, China
- *Correspondence: Shuanglin Deng,
|
17
|
Festag S, Denzler J, Spreckelsen C. Generative Adversarial Networks for Biomedical Time Series Forecasting and Imputation: A systematic review. J Biomed Inform 2022; 129:104058. [DOI: 10.1016/j.jbi.2022.104058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Revised: 02/23/2022] [Accepted: 03/22/2022] [Indexed: 10/18/2022]
|
18
|
Missing Data Imputation – A Survey. International Journal of Decision Support System Technology 2022. [DOI: 10.4018/ijdsst.292446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Many real-world datasets contain missing values for various reasons. These incomplete datasets can pose severe issues for the underlying machine learning algorithms and decision support systems, resulting in high computational cost, skewed output and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees or Bayesian inference. In this paper, the missing data problem is discussed in detail with a comprehensive review of the approaches to tackle it. The paper concludes with a discussion on the effectiveness of three imputation methods, namely imputation based on Multiple Linear Regression (MLR), Predictive Mean Matching (PMM) and Classification And Regression Tree (CART), in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets highlight that the MLR-based imputation method is more efficient on high-dimensional incomplete datasets.
|