1
Mpouzika M, Karanikola M, Blot S. The conundrum of predicting neurological outcomes in non-traumatic coma patients: True prediction or "Flipping a Coin"? Intensive Crit Care Nurs 2024; 83:103707. [PMID: 38636295 DOI: 10.1016/j.iccn.2024.103707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/20/2024]
Affiliation(s)
- Meropi Mpouzika
  - Nursing Department, Cyprus University of Technology, Limassol, Cyprus
- Maria Karanikola
  - Nursing Department, Cyprus University of Technology, Limassol, Cyprus
- Stijn Blot
  - Department of Internal Medicine and Pediatrics, Ghent University, Ghent, Belgium
2
Bellmann L, Wiederhold AJ, Trübe L, Twerenbold R, Ückert F, Gottfried K. Introducing Attribute Association Graphs to Facilitate Medical Data Exploration: Development and Evaluation Using Epidemiological Study Data. JMIR Med Inform 2024; 12:e49865. [PMID: 39046780 DOI: 10.2196/49865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 10/11/2023] [Accepted: 05/04/2024] [Indexed: 07/25/2024] Open
Abstract
BACKGROUND Interpretability and intuitive visualization facilitate medical knowledge generation through big data. In addition, robustness to high-dimensional and missing data is a requirement for statistical approaches in the medical domain. A method tailored to the needs of physicians must meet all the abovementioned criteria. OBJECTIVE This study aims to develop an accessible tool for visual data exploration without the need for programming knowledge, adjusting complex parameterizations, or handling missing data. We sought to use statistical analysis using the setting of disease and control cohorts familiar to clinical researchers. We aimed to guide the user by identifying and highlighting data patterns associated with disease and reveal relations between attributes within the data set. METHODS We introduce the attribute association graph, a novel graph structure designed for visual data exploration using robust statistical metrics. The nodes capture frequencies of participant attributes in disease and control cohorts as well as deviations between groups. The edges represent conditional relations between attributes. The graph is visualized using the Neo4j (Neo4j, Inc) data platform and can be interactively explored without the need for technical knowledge. Nodes with high deviations between cohorts and edges of noticeable conditional relationship are highlighted to guide the user during the exploration. The graph is accompanied by a dashboard visualizing variable distributions. For evaluation, we applied the graph and dashboard to the Hamburg City Health Study data set, a large cohort study conducted in the city of Hamburg, Germany. All data structures can be accessed freely by researchers, physicians, and patients. In addition, we developed a user test conducted with physicians incorporating the System Usability Scale, individual questions, and user tasks. 
RESULTS We evaluated the attribute association graph and dashboard through an exemplary data analysis of participants with a general cardiovascular disease in the Hamburg City Health Study data set. All results extracted from the graph structure and dashboard are in accordance with findings from the literature, except for unusually low cholesterol levels in participants with cardiovascular disease, which could be induced by medication. In addition, 95% CIs of Pearson correlation coefficients were calculated for all associations identified during the data analysis, confirming the results. In addition, a user test with 10 physicians assessing the usability of the proposed methods was conducted. A System Usability Scale score of 70.5% and average successful task completion of 81.4% were reported. CONCLUSIONS The proposed attribute association graph and dashboard enable intuitive visual data exploration. They are robust to high-dimensional as well as missing data and require no parameterization. The usability for clinicians was confirmed via a user test, and the validity of the statistical results was confirmed by associations known from literature and standard statistical inference.
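The abstract reports a System Usability Scale (SUS) result but not how individual questionnaires were scored. As a minimal sketch, assuming the standard SUS scoring rule (ten items rated 1 to 5, odd-numbered items positively worded, even-numbered items negatively worded), one questionnaire could be scored as follows. The function name and the example responses are hypothetical and are not taken from the study.

```python
def sus_score(responses):
    """Score one SUS questionnaire: ten answers, each an integer from 1 to 5."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute rating - 1.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute 5 - rating.
            total += 5 - rating
    # The summed contributions (0 to 40) are rescaled to a 0 to 100 range.
    return total * 2.5


# Hypothetical answers from a single participant (not study data).
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))  # 75.0
```

A study-level figure such as the one reported above would typically be the mean of these per-participant scores.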
Affiliation(s)
- Louis Bellmann
  - Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Leona Trübe
  - Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Raphael Twerenbold
  - Department of Cardiology, University Heart & Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK) Partner Site Hamburg-Kiel-Lübeck, Hamburg, Germany
  - University Center of Cardiovascular Science, University Heart & Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Frank Ückert
  - Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Karl Gottfried
  - Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
3
Napravnik M, Hržić F, Tschauner S, Štajduhar I. Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database. BioData Min 2024; 17:22. [PMID: 38997749 PMCID: PMC11245804 DOI: 10.1186/s13040-024-00373-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 06/30/2024] [Indexed: 07/14/2024] Open
Abstract
BACKGROUND The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity. RESULTS An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation. CONCLUSIONS The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
Affiliation(s)
- Mateja Napravnik
  - Faculty of Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia
- Franko Hržić
  - Faculty of Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia
  - Center for Artificial Intelligence and Cybersecurity, Radmile Matejcic 2, Rijeka, 51000, Croatia
- Sebastian Tschauner
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Neue Stiftingtalstraße 6, Graz, 8010, Austria
- Ivan Štajduhar
  - Faculty of Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia
  - Center for Artificial Intelligence and Cybersecurity, Radmile Matejcic 2, Rijeka, 51000, Croatia
4
Mulat Tebeje T, Kindie Yenit M, Gedlu Nigatu S, Bizuneh Mengistu S, Kidie Tesfie T, Byadgie Gelaw N, Moges Chekol Y. Prediction of diabetic retinopathy among type 2 diabetic patients in University of Gondar Comprehensive Specialized Hospital, 2006-2021: A prognostic model. Int J Med Inform 2024; 190:105536. [PMID: 38970878 DOI: 10.1016/j.ijmedinf.2024.105536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 06/26/2024] [Accepted: 07/01/2024] [Indexed: 07/08/2024]
Abstract
BACKGROUND There has been a paucity of evidence for the development of a prediction model for diabetic retinopathy (DR) in Ethiopia. Predicting the risk of developing DR based on the patient's demographic, clinical, and behavioral data is helpful in resource-limited areas where regular screening for DR is not available, and it helps practitioners estimate the future risk of their patients. METHODS A retrospective follow-up study was conducted at the University of Gondar (UoG) Comprehensive Specialized Hospital from January 2006 to May 2021 among 856 patients with type 2 diabetes (T2DM). Variables were selected using the Least Absolute Shrinkage and Selection Operator (LASSO) regression. The data were validated by 10-fold cross-validation. Four ML techniques (naïve Bayes, K-nearest neighbor, decision tree, and logistic regression) were employed. The performance of each algorithm was measured, and logistic regression performed best. After multivariable logistic regression and model reduction, a nomogram was developed to predict the individual risk of DR. RESULTS Logistic regression was the best algorithm for predicting DR with an area under the curve of 92%, sensitivity of 87%, specificity of 83%, precision of 84%, F1-score of 85%, and accuracy of 85%. The logistic regression model selected seven predictors: total cholesterol, duration of diabetes, glycemic control, adherence to anti-diabetic medications, other microvascular complications of diabetes, sex, and hypertension. A nomogram was developed and deployed as a web-based application. A decision curve analysis showed that the model was useful in clinical practice and was better than treating all or none of the patients. CONCLUSIONS The model has excellent performance and a better net benefit, and it can be used in clinical practice to estimate the future probability of having DR. Identifying those with a higher risk of DR helps in the early identification of and intervention for DR.
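The reported metrics can be cross-checked against each other, since the F1-score is the harmonic mean of precision and recall, and recall is another name for sensitivity. The short sketch below recomputes F1 from the precision (84%) and sensitivity (87%) quoted in the abstract. The function name is illustrative and the check is not part of the original study.

```python
def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)


# Precision and sensitivity as reported in the abstract above.
f1 = f1_from(precision=0.84, recall=0.87)
print(round(f1, 2))  # 0.85, matching the reported F1-score of 85%
```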
Affiliation(s)
- Tsion Mulat Tebeje
  - School of Public Health, College of Health Science and Medicine, Dilla University, Dilla, Ethiopia
- Melaku Kindie Yenit
  - Department of Epidemiology and Biostatistics, College of Medicine and Health Science, University of Gondar, Gondar, Ethiopia
- Solomon Gedlu Nigatu
  - Department of Epidemiology and Biostatistics, College of Medicine and Health Science, University of Gondar, Gondar, Ethiopia
- Segenet Bizuneh Mengistu
  - Department of Internal Medicine, School of Medicine, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Tigabu Kidie Tesfie
  - Department of Epidemiology and Biostatistics, College of Medicine and Health Science, University of Gondar, Gondar, Ethiopia
- Negalgn Byadgie Gelaw
  - Department of Public Health, Mizan Aman College of Health Science, Mizan Aman, Southwest Ethiopia, Ethiopia
- Yazachew Moges Chekol
  - Department of Health Information Technology, Mizan Aman College of Health Science, Mizan Aman, Southwest Ethiopia, Ethiopia
5
Maekawa E, Grua EM, Nakamura CA, Scazufca M, Araya R, Peters T, van de Ven P. Bayesian Networks for Prescreening in Depression: Algorithm Development and Validation. JMIR Ment Health 2024; 11:e52045. [PMID: 38963925 PMCID: PMC11258528 DOI: 10.2196/52045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Revised: 04/02/2024] [Accepted: 04/17/2024] [Indexed: 07/06/2024] Open
Abstract
BACKGROUND Identifying individuals with depressive symptomatology (DS) promptly and effectively is of paramount importance for providing timely treatment. Machine learning models have shown promise in this area; however, studies often fall short in demonstrating the practical benefits of using these models and fail to provide tangible real-world applications. OBJECTIVE This study aims to establish a novel methodology for identifying individuals likely to exhibit DS, identify the most influential features in a more explainable way via probabilistic measures, and propose tools that can be used in real-world applications. METHODS The study used 3 data sets: PROACTIVE, the Brazilian National Health Survey (Pesquisa Nacional de Saúde [PNS]) 2013, and PNS 2019, comprising sociodemographic and health-related features. A Bayesian network was used for feature selection. Selected features were then used to train machine learning models to predict DS, operationalized as a score of ≥10 on the 9-item Patient Health Questionnaire. The study also analyzed the impact of varying sensitivity rates on the reduction of screening interviews compared to a random approach. RESULTS The methodology allows the users to make an informed trade-off among sensitivity, specificity, and a reduction in the number of interviews. At the thresholds of 0.444, 0.412, and 0.472, determined by maximizing the Youden index, the models achieved sensitivities of 0.717, 0.741, and 0.718, and specificities of 0.644, 0.737, and 0.766 for PROACTIVE, PNS 2013, and PNS 2019, respectively. The area under the receiver operating characteristic curve was 0.736, 0.801, and 0.809 for these 3 data sets, respectively. For the PROACTIVE data set, the most influential features identified were postural balance, shortness of breath, and how old people feel they are. In the PNS 2013 data set, the features were the ability to do usual activities, chest pain, sleep problems, and chronic back problems. 
The PNS 2019 data set shared 3 of the most influential features with the PNS 2013 data set. However, the difference was the replacement of chronic back problems with verbal abuse. It is important to note that the features contained in the PNS data sets differ from those found in the PROACTIVE data set. An empirical analysis demonstrated that using the proposed model led to a potential reduction in screening interviews of up to 52% while maintaining a sensitivity of 0.80. CONCLUSIONS This study developed a novel methodology for identifying individuals with DS, demonstrating the utility of using Bayesian networks to identify the most significant features. Moreover, this approach has the potential to substantially reduce the number of screening interviews while maintaining high sensitivity, thereby facilitating improved early identification and intervention strategies for individuals experiencing DS.
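The thresholds in this abstract were chosen by maximizing the Youden index, J = sensitivity + specificity - 1. As a rough sketch of that selection step (not the authors' code), the function below scans every observed score as a candidate cut-off and keeps the one with the largest J. The scores and labels are invented toy values, and the convention that a score at or above the threshold counts as positive is an assumption.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing the Youden index.

    A case is predicted positive when its score is >= threshold (assumed convention).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best = None
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        j = sensitivity + specificity - 1
        # Keep the first threshold that reaches the highest J seen so far.
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1], best[2], best[3]


# Invented toy data: predicted probabilities and true labels (1 = DS present).
scores = [0.10, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
print(youden_threshold(scores, labels))  # (0.4, 1.0, 0.75)
```

Lowering the threshold below the Youden optimum raises sensitivity at the cost of specificity, which is the trade-off the abstract describes when it fixes sensitivity at 0.80.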
Affiliation(s)
- Eduardo Maekawa
  - Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland
  - Health Research Institute, University of Limerick, Limerick, Ireland
- Eoin Martino Grua
  - Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland
  - Health Research Institute, University of Limerick, Limerick, Ireland
- Carina Akemi Nakamura
  - Departamento de Psiquiatria, Faculdade de Medicina da Universidade de Sao Paulo, Universidade de Sao Paulo, Sao Paulo, Brazil
- Marcia Scazufca
  - Departamento de Psiquiatria, Faculdade de Medicina da Universidade de Sao Paulo, Universidade de Sao Paulo, Sao Paulo, Brazil
  - Instituto de Psiquiatria, Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo, Faculdade de Medicina, Universidade de Sao Paulo, Sao Paulo, Brazil
- Ricardo Araya
  - Centre for Global Mental Health, King's College London, London, United Kingdom
- Tim Peters
  - Bristol Dental School, University of Bristol, Bristol, United Kingdom
- Pepijn van de Ven
  - Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland
  - Health Research Institute, University of Limerick, Limerick, Ireland
6
Sajdeya R, Narouze S. Harnessing artificial intelligence for predicting and managing postoperative pain: a narrative literature review. Curr Opin Anaesthesiol 2024:00001503-990000000-00209. [PMID: 39011674 DOI: 10.1097/aco.0000000000001408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
PURPOSE OF REVIEW This review examines recent research on artificial intelligence focusing on machine learning (ML) models for predicting postoperative pain outcomes. We also identify technical, ethical, and practical hurdles that demand continued investigation and research. RECENT FINDINGS Current ML models leverage diverse datasets, algorithmic techniques, and validation methods to identify predictive biomarkers, risk factors, and phenotypic signatures associated with increased acute and chronic postoperative pain and persistent opioid use. ML models demonstrate satisfactory performance to predict pain outcomes and their prognostic trajectories, identify modifiable risk factors and at-risk patients who benefit from targeted pain management strategies, and show promise in pain prevention applications. However, further evidence is needed to evaluate the reliability, generalizability, effectiveness, and safety of ML-driven approaches before their integration into perioperative pain management practices. SUMMARY Artificial intelligence (AI) has the potential to enhance perioperative pain management by providing more accurate predictive models and personalized interventions. By leveraging ML algorithms, clinicians can better identify at-risk patients and tailor treatment strategies accordingly. However, successful implementation needs to address challenges in data quality, algorithmic complexity, and ethical and practical considerations. Future research should focus on validating AI-driven interventions in clinical practice and fostering interdisciplinary collaboration to advance perioperative care.
Affiliation(s)
- Ruba Sajdeya
  - Department of Anesthesiology, Duke University School of Medicine, Durham, North Carolina
- Samer Narouze
  - Division of Pain Medicine, University Hospitals Medical Center, Cleveland, Ohio, USA
7
Yehuala TZ, Agimas MC, Derseh NM, Wubante SM, Fente BM, Yismaw GA, Tesfie TK. Machine learning algorithms to predict healthcare-seeking behaviors of mothers for acute respiratory infections and their determinants among children under five in sub-Saharan Africa. Front Public Health 2024; 12:1362392. [PMID: 38962762 PMCID: PMC11220189 DOI: 10.3389/fpubh.2024.1362392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 06/03/2024] [Indexed: 07/05/2024] Open
Abstract
Background Acute respiratory infections (ARIs) are the leading cause of death in children under the age of 5 globally. Maternal healthcare-seeking behavior may help minimize mortality associated with ARIs, since mothers make decisions about the kind and frequency of healthcare services for their children. Therefore, this study aimed to predict the absence of maternal healthcare-seeking behavior and identify its associated factors among children under the age of 5 in sub-Saharan Africa (SSA) using machine learning models. Methods The sub-Saharan African countries' demographic health survey was the source of the dataset. We used a weighted sample of 16,832 under-five children in this study. The data were processed using Python (version 3.9), and machine learning models such as extreme gradient boosting (XGB), random forest, decision tree, logistic regression, and Naïve Bayes were applied. In this study, we used evaluation metrics, including the AUC ROC curve, accuracy, precision, recall, and F-measure, to assess the performance of the predictive models. Result In this study, a weighted sample of 16,832 under-five children was used in the final analysis. Among the proposed machine learning models, the random forest (RF) was the best-performing model with an accuracy of 88.89%, a precision of 89.5%, an F-measure of 83%, an AUC ROC curve of 95.8%, and a recall of 77.6% in predicting the absence of mothers' healthcare-seeking behavior for ARIs. The accuracy for Naïve Bayes was the lowest (66.41%) when compared to other proposed models. No media exposure, living in rural areas, not breastfeeding, poor wealth status, home delivery, no ANC visit, no maternal education, mothers' age group of 35-49 years, and distance to health facilities were significant predictors for the absence of mothers' healthcare-seeking behaviors for ARIs. 
On the other hand, undernourished children with stunting, underweight, and wasting status, diarrhea, birth size, married women, being a male or female sex child, and having a maternal occupation were significantly associated with good maternal healthcare-seeking behaviors for ARIs among under-five children. Conclusion The RF model provides greater predictive power for estimating mothers' healthcare-seeking behaviors based on ARI risk factors. Machine learning could help achieve early prediction and intervention in children with high-risk ARIs. This leads to a recommendation for policy direction to reduce child mortality due to ARIs in sub-Saharan countries.
Affiliation(s)
- Tirualem Zeleke Yehuala
  - Department Health Informatics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Muluken Chanie Agimas
  - Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Nebiyu Mekonnen Derseh
  - Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Sisay Maru Wubante
  - Department Health Informatics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Bezawit Melak Fente
  - Department of General Midwifery, School of Midwifery, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Getaneh Awoke Yismaw
  - Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Tigabu Kidie Tesfie
  - Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
8
Tran VN, Zhou W, Kim T, Mazepa V, Valdayskikh V, Ivanov VY. Daily station-level records of air temperature, snow depth, and ground temperature in the Northern Hemisphere. Sci Data 2024; 11:645. [PMID: 38890309 PMCID: PMC11189437 DOI: 10.1038/s41597-024-03483-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 06/06/2024] [Indexed: 06/20/2024] Open
Abstract
Air temperature (Ta), snow depth (Sd), and soil temperature (Tg) are crucial variables for studying the above- and below-ground thermal conditions, especially in high latitudes. However, in-situ observations are frequently sparse and inconsistent across various datasets, with a significant amount of missing data. This study has assembled a comprehensive dataset of in-situ observations of Ta, Sd, and Tg for the Northern Hemisphere (higher than 30°N latitude), spanning 1960-2021. This dataset encompasses metadata and daily data time series for 27,768, 32,417, and 659 gages for Ta, Sd, and Tg, respectively. Using the ERA5-Land reanalysis data product, we applied deep learning methodology to reconstruct the missing data that account for 54.5%, 59.3%, and 74.3% of Ta, Sd, and Tg daily time series, respectively. The obtained high temporal resolution dataset can be used to better understand physical phenomena and relevant mechanisms, such as the dynamics of land-surface-atmosphere energy exchange, snowpack, and permafrost.
Affiliation(s)
- Vinh Ngoc Tran
  - Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
- Wenbo Zhou
  - Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
- Taeho Kim
  - Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
- Valeriy Mazepa
  - Institute of Plant and Animal Ecology, the Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia
- Valeriy Y Ivanov
  - Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
9
Chen Y, Lin F, Wang K, Chen F, Wang R, Lai M, Chen C, Wang R. Development of a predictive model for 1-year postoperative recovery in patients with lumbar disk herniation based on deep learning and machine learning. Front Neurol 2024; 15:1255780. [PMID: 38919973 PMCID: PMC11197993 DOI: 10.3389/fneur.2024.1255780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Accepted: 05/23/2024] [Indexed: 06/27/2024] Open
Abstract
Background The aim of this study is to develop a predictive model utilizing deep learning and machine learning techniques that will inform clinical decision-making by predicting the 1-year postoperative recovery of patients with lumbar disk herniation. Methods The clinical data of 470 inpatients who underwent tubular microdiscectomy (TMD) between January 2018 and January 2021 were retrospectively analyzed as variables. The dataset was randomly divided into a training set (n = 329) and a test set (n = 141) using a 10-fold cross-validation technique. Various deep learning and machine learning algorithms including Random Forests, Extreme Gradient Boosting, Support Vector Machines, Extra Trees, K-Nearest Neighbors, Logistic Regression, Light Gradient Boosting Machine, and MLP (Artificial Neural Networks) were employed to develop predictive models for the recovery of patients with lumbar disk herniation 1 year after surgery. The cure rate score of lumbar JOA score 1 year after TMD was used as an outcome indicator. The primary evaluation metric was the area under the receiver operating characteristic curve (AUC), with additional measures including decision curve analysis (DCA), accuracy, sensitivity, specificity, and others. Results The heat map of the correlation matrix revealed low inter-feature correlation. The predictive model employing both machine learning and deep learning algorithms was constructed using 15 variables after feature engineering. Among the eight algorithms utilized, the MLP algorithm demonstrated the best performance. Conclusion Our study findings demonstrate that the MLP algorithm provides superior predictive performance for the recovery of patients with lumbar disk herniation 1 year after surgery.
Affiliation(s)
- Yan Chen
  - Pingtan Comprehensive Experimentation Area Hospital, Pingtan, China
  - Fujian Medical University Union Hospital, Fuzhou, Fujian, China
- Fabin Lin
  - Pingtan Comprehensive Experimentation Area Hospital, Pingtan, China
  - Fujian Medical University Union Hospital, Fuzhou, Fujian, China
- Kaifeng Wang
  - Fujian Medical University, Fuzhou, Fujian, China
- Feng Chen
  - Fujian Medical University, Fuzhou, Fujian, China
- Ruxian Wang
  - Fujian Medical University, Fuzhou, Fujian, China
- Minyun Lai
  - Fujian Medical University, Fuzhou, Fujian, China
- Chunmei Chen
  - Pingtan Comprehensive Experimentation Area Hospital, Pingtan, China
  - Fujian Medical University Union Hospital, Fuzhou, Fujian, China
- Rui Wang
  - Pingtan Comprehensive Experimentation Area Hospital, Pingtan, China
  - Fujian Medical University Union Hospital, Fuzhou, Fujian, China
10
Coats TJ, Mirkes EM. Missing data in emergency care: a pitfall in the interpretation of analysis and research based on electronic patient records. Emerg Med J 2024:emermed-2024-214097. [PMID: 38834288 DOI: 10.1136/emermed-2024-214097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2024] [Accepted: 04/15/2024] [Indexed: 06/06/2024]
Abstract
Electronic patient records (EPRs) are potentially valuable sources of data for service development or research but often contain large amounts of missing data. Using complete case analysis or imputation of missing data seem like simple solutions, and are increasingly easy to perform in software packages, but can easily distort data and give misleading results if used without an understanding of missingness. So, knowing about patterns of missingness, and when to get expert data science (data engineering and analytics) help, will be a fundamental future skill for emergency physicians. This will maximise the good and minimise the harm of the easy availability of large patient datasets created by the introduction of EPRs.
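To make the warning about complete case analysis concrete, here is a small sketch on invented toy records in which a measurement is recorded less often for one patient group. It first summarizes the missingness pattern per group, then shows what a complete case mean is computed from. The field names, groups, and values are hypothetical and do not come from the article.

```python
# Invented toy records: (triage_group, heart_rate), with None when not recorded.
records = [
    ("urgent", 110), ("urgent", 120), ("urgent", 130), ("urgent", None),
    ("minor", 70), ("minor", None), ("minor", None), ("minor", None),
]


def missing_fraction_by_group(rows):
    """Fraction of missing values within each group (the missingness pattern)."""
    totals, missing = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (1 if value is None else 0)
    return {group: missing[group] / totals[group] for group in totals}


def complete_case_mean(rows):
    """Mean over rows whose value was recorded, i.e. complete case analysis."""
    observed = [value for _, value in rows if value is not None]
    return sum(observed) / len(observed)


print(missing_fraction_by_group(records))  # {'urgent': 0.25, 'minor': 0.75}
print(complete_case_mean(records))         # 107.5
```

Both groups have four patients, yet three of the four recorded values come from the urgent group, so the complete case mean mostly describes urgent patients. Checking the per-group missing fraction first is one simple way to spot this kind of pattern before choosing between complete case analysis and imputation.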
Affiliation(s)
- Evgeny M Mirkes
  - University of Leicester, Leicester, UK
  - School of Computing and Mathematical Sciences, University of Leicester, Leicester, UK
11
Li YX, Liu YC, Wang M, Huang YL. Prediction of gestational diabetes mellitus at the first trimester: machine-learning algorithms. Arch Gynecol Obstet 2024; 309:2557-2566. [PMID: 37477677 DOI: 10.1007/s00404-023-07131-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 06/27/2023] [Indexed: 07/22/2023]
Abstract
PURPOSE Short- and long-term complications of gestational diabetes mellitus (GDM) involving pregnancies and offspring warrant the development of an effective individualized risk prediction model to reduce and prevent GDM together with its associated co-morbidities. The aim is to use machine learning (ML) algorithms to study data gathered throughout the first trimester in order to predict GDM. METHODS Two independent cohorts with forty-five features gathered throughout the first trimester were included. We constructed prediction models based on three different algorithms and traditional logistic regression, and deployed two additional ensemble algorithms to identify the importance of individual features. RESULTS 4799 and 2795 pregnancies were included in the Xinhua Hospital Chongming branch (XHCM) and the Shanghai Pudong New Area People's Hospital (SPNPH) cohorts, respectively. Extreme gradient boosting (XGBoost) predicted GDM with moderate performance (the area under the receiver operating characteristic curve (AUC) = 0.75) at pregnancy initiation and good-to-excellent performance (AUC = 0.99) at the end of the first trimester in the XHCM cohort. The trained XGBoost showed moderate performance in the SPNPH cohort (AUC = 0.83). The top predictive features for GDM diagnosis were pre-pregnancy BMI and maternal abdominal circumference at pregnancy initiation, and FPG and HbA1c at the end of the first trimester. CONCLUSION Our work demonstrated that ML models based on the data gathered throughout the first trimester achieved moderate performance in the external validation cohort.
Affiliation(s)
- Yi-Xin Li
- Department of Obstetrics and Gynecology, Chongming Hospital Affiliated to Shanghai University of Medicine and Health Sciences (Xinhua Hospital Chongming Branch), Shanghai, China
| | - Yi-Chen Liu
- Department of Nephrology, Chongming Hospital Affiliated to Shanghai University of Medicine and Health Sciences (Xinhua Hospital Chongming Branch), Shanghai, China
| | - Mei Wang
- Department of Gynecology, Shanghai Pudong New Area People's Hospital, Shanghai, China
| | - Yu-Li Huang
- Department of Obstetrics and Gynecology, Chongming Hospital Affiliated to Shanghai University of Medicine and Health Sciences (Xinhua Hospital Chongming Branch), Shanghai, China.
| |
|
12
|
Lee CC, Su SY, Sung SF. Machine learning-based survival analysis approaches for predicting the risk of pneumonia post-stroke discharge. Int J Med Inform 2024; 186:105422. [PMID: 38518677 DOI: 10.1016/j.ijmedinf.2024.105422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Revised: 02/25/2024] [Accepted: 03/19/2024] [Indexed: 03/24/2024]
Abstract
BACKGROUND Post-stroke pneumonia (PSP) is common among stroke patients. PSP occurring after hospital discharge continues to increase the risk of poor functional outcomes and death among stroke survivors. Currently, there is no prediction model specifically designed to predict the occurrence of PSP beyond the acute stage of stroke. This study aimed to explore the use of machine learning (ML) methods in predicting the risk of PSP after hospital discharge. METHODS This study analyzed data from 5,754 hospitalized stroke patients. The dataset was randomly divided into a training set and a holdout test set, with a ratio of 80:20. Several clinical and laboratory variables were utilized as predictors and different ML algorithms were employed to model time-to-event data. The ML model's predictive performance was compared to existing risk-scoring systems. A model-agnostic method based on Shapley additive explanations was utilized to interpret the ML model. RESULTS The study found that 5.7% of the study patients experienced pneumonia within one year after discharge. Based on repeated 5-fold cross-validation on the training set, the random survival forest (RSF) model had the highest C-index among the various ML algorithms and traditional Cox regression analysis. The final RSF model achieved a C-index of 0.787 (95% confidence interval: 0.737-0.840) on the holdout test set, outperforming five existing risk-scoring systems. The top three important predictors were the Glasgow Coma Scale score, age, and length of hospital stay. CONCLUSIONS The RSF model demonstrated superior discriminative ability compared to other ML algorithms and traditional Cox regression analysis, suggesting a non-linear relationship between predictors and outcomes. The developed ML model can be integrated into the hospital information system to provide personalized risk assessments.
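The C-index reported in the abstract above measures how often a model ranks patients correctly by risk in time-to-event data. As a minimal sketch of Harrell's concordance index (not the authors' implementation; the function name and toy data are illustrative assumptions), a pair of patients is treated as comparable when the one with the shorter follow-up time had an observed event, and it counts as concordant when that patient also received the higher predicted risk:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times       -- follow-up time per patient
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- predicted risk; higher means an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable when patient i had an observed
            # event strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # correctly ordered pair
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risk)
print(f"C-index on toy data: {toy_c_index:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported 0.787 on the holdout test set.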
Affiliation(s)
- Chang-Ching Lee
- Division of Pulmonary Medicine, Department of Internal Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan
| | - Sheng-You Su
- Clinical Medicine Research Center, Department of Medical Research, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan
| | - Sheng-Feng Sung
- Division of Neurology, Department of Internal Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan; Department of Beauty & Health Care, Min-Hwei Junior College of Health Care Management, Tainan, Taiwan.
| |
|
13
|
Tutsoy O, Sumbul HE. A novel deep machine learning algorithm with dimensionality and size reduction approaches for feature elimination: thyroid cancer diagnoses with randomly missing data. Brief Bioinform 2024; 25:bbae344. [PMID: 39007597 PMCID: PMC11247408 DOI: 10.1093/bib/bbae344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 06/04/2024] [Accepted: 07/02/2024] [Indexed: 07/16/2024] Open
Abstract
Thyroid cancer incidence continues to increase even though a large number of inspection tools have been developed recently. Since there is no standard, definitive procedure for diagnosing thyroid cancer, clinicians need to conduct various tests. This scrutiny process yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data; both are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularity in learning problems that stems from randomly distributed missing data is treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with the hierarchical clustering algorithm is performed to eliminate considerably similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results yield 100% preciseness in training and 83% in testing on the unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.
Affiliation(s)
- Onder Tutsoy
- Adana Alparslan Turkes Science and Technology University, Adana, Turkey
| | - Hilmi Erdem Sumbul
- University of Health Sciences, Adana City Training and Research Hospital, Adana, Turkey
| |
|
14
|
Xie P, Wang H, Xiao J, Xu F, Liu J, Chen Z, Zhao W, Hou S, Wu D, Ma Y, Xiao J. Development and Validation of an Explainable Deep Learning Model to Predict In-Hospital Mortality for Patients With Acute Myocardial Infarction: Algorithm Development and Validation Study. J Med Internet Res 2024; 26:e49848. [PMID: 38728685 PMCID: PMC11127140 DOI: 10.2196/49848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 10/02/2023] [Accepted: 04/02/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND Acute myocardial infarction (AMI) is one of the most severe cardiovascular diseases and is associated with a high risk of in-hospital mortality. However, the current deep learning models for in-hospital mortality prediction lack interpretability. OBJECTIVE This study aims to establish an explainable deep learning model to provide individualized in-hospital mortality prediction and risk factor assessment for patients with AMI. METHODS In this retrospective multicenter study, we used data for consecutive patients hospitalized with AMI from the Chongqing University Central Hospital between July 2016 and December 2022 and the Electronic Intensive Care Unit Collaborative Research Database. These patients were randomly divided into training (7668/10,955, 70%) and internal test (3287/10,955, 30%) data sets. In addition, data of patients with AMI from the Medical Information Mart for Intensive Care database were used for external validation. Deep learning models were used to predict in-hospital mortality in patients with AMI, and they were compared with linear and tree-based models. The Shapley Additive Explanations method was used to explain the model with the highest area under the receiver operating characteristic curve in both the internal test and external validation data sets to quantify and visualize the features that drive predictions. RESULTS A total of 10,955 patients with AMI who were admitted to Chongqing University Central Hospital or included in the Electronic Intensive Care Unit Collaborative Research Database were randomly divided into a training data set of 7668 (70%) patients and an internal test data set of 3287 (30%) patients. A total of 9355 patients from the Medical Information Mart for Intensive Care database were included for independent external validation. 
In-hospital mortality occurred in 8.74% (670/7668), 8.73% (287/3287), and 9.12% (853/9355) of the patients in the training, internal test, and external validation cohorts, respectively. The Self-Attention and Intersample Attention Transformer model performed best in both the internal test data set and the external validation data set among the 9 prediction models, with the highest area under the receiver operating characteristic curve of 0.86 (95% CI 0.84-0.88) and 0.85 (95% CI 0.84-0.87), respectively. Older age, high heart rate, and low body temperature were the 3 most important predictors of increased mortality, according to the explanations of the Self-Attention and Intersample Attention Transformer model. CONCLUSIONS The explainable deep learning model that we developed could provide estimates of mortality and visual contribution of the features to the prediction for a patient with AMI. The explanations suggested that older age, unstable vital signs, and metabolic disorders may increase the risk of mortality in patients with AMI.
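The cohort sizes and mortality rates quoted in the abstract above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch; the variable and function names are illustrative assumptions, and the counts are copied directly from the abstract.

```python
# Counts as reported in the abstract: (in-hospital deaths, cohort size).
cohorts = {
    "training": (670, 7668),
    "internal test": (287, 3287),
    "external validation": (853, 9355),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# In-hospital mortality per cohort, in percent.
mortality = {name: percent(deaths, size) for name, (deaths, size) in cohorts.items()}

# The training and internal test sets together form the pooled cohort
# that the abstract describes as split 70/30.
total_internal = cohorts["training"][1] + cohorts["internal test"][1]
train_share = percent(cohorts["training"][1], total_internal)

for name, rate in mortality.items():
    print(f"{name}: {rate}% in-hospital mortality")
print(f"training share of {total_internal} patients: {train_share}%")
```

The computed rates agree with the reported 8.74%, 8.73%, and 9.12%, and the two internal subsets sum to the stated 10,955 patients.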
Affiliation(s)
- Puguang Xie
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Hao Wang
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Jun Xiao
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Fan Xu
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Jingyang Liu
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Zihang Chen
- Bioengineering College, Chongqing University, Chongqing, China
| | - Weijie Zhao
- Bioengineering College, Chongqing University, Chongqing, China
| | - Siyu Hou
- Bio-Med Informatics Research Centre & Clinical Research Centre, Xinqiao Hospital, Army Medical University, Chongqing, China
| | - Dongdong Wu
- Medical Big Data Research Centre, Chinese People's Liberation Army General Hospital, Beijing, China
| | - Yu Ma
- Chongqing Emergency Medical Centre, Chongqing University Central Hospital, School of Medicine, Chongqing University, Chongqing, China
| | - Jingjing Xiao
- Bio-Med Informatics Research Centre & Clinical Research Centre, Xinqiao Hospital, Army Medical University, Chongqing, China
| |
|
15
|
Kazdaghli S, Kerenidis I, Kieckbusch J, Teare P. Improved clinical data imputation via classical and quantum determinantal point processes. eLife 2024; 12:RP89947. [PMID: 38722146 PMCID: PMC11081629 DOI: 10.7554/elife.89947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2024] Open
Abstract
Imputing data is a critical issue for machine learning practitioners, including in the life sciences domain, where missing clinical data is a typical situation and the reliability of the imputation is of great importance. Currently, there is no canonical approach for imputation of clinical data and widely used algorithms introduce variance in the downstream classification. Here we propose novel imputation methods based on determinantal point processes (DPP) that enhance popular techniques such as the multivariate imputation by chained equations and MissForest. Their advantages are twofold: improving the quality of the imputed data demonstrated by increased accuracy of the downstream classification and providing deterministic and reliable imputations that remove the variance from the classification results. We experimentally demonstrate the advantages of our methods by performing extensive imputations on synthetic and real clinical data. We also perform quantum hardware experiments by applying the quantum circuits for DPP sampling since such quantum algorithms provide a computational advantage with respect to classical ones. We demonstrate competitive results with up to 10 qubits for small-scale imputation tasks on a state-of-the-art IBM quantum processor. Our classical and quantum methods improve the effectiveness and robustness of clinical data prediction modeling by providing better and more reliable data imputations. These improvements can add significant value in settings demanding high precision, such as in pharmaceutical drug trials where our approach can provide higher confidence in the predictions made.
Affiliation(s)
| | | | - Jens Kieckbusch
- Emerging Innovations Unit, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom
| | - Philip Teare
- Centre for AI, Data Science & AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom
| |
|
16
|
Tejani AS, Ng YS, Xi Y, Rayan JC. Understanding and Mitigating Bias in Imaging Artificial Intelligence. Radiographics 2024; 44:e230067. [PMID: 38635456 DOI: 10.1148/rg.230067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/20/2024]
Abstract
Artificial intelligence (AI) algorithms are prone to bias at multiple stages of model development, with potential for exacerbating health disparities. However, bias in imaging AI is a complex topic that encompasses multiple coexisting definitions. Bias may refer to unequal preference to a person or group owing to preexisting attitudes or beliefs, either intentional or unintentional. However, cognitive bias refers to systematic deviation from objective judgment due to reliance on heuristics, and statistical bias refers to differences between true and expected values, commonly manifesting as systematic error in model prediction (ie, a model with output unrepresentative of real-world conditions). Clinical decisions informed by biased models may lead to patient harm due to action on inaccurate AI results or exacerbate health inequities due to differing performance among patient populations. However, while inequitable bias can harm patients in this context, a mindful approach leveraging equitable bias can address underrepresentation of minority groups or rare diseases. Radiologists should also be aware of bias after AI deployment such as automation bias, or a tendency to agree with automated decisions despite contrary evidence. Understanding common sources of imaging AI bias and the consequences of using biased models can guide preventive measures to mitigate its impact. Accordingly, the authors focus on sources of bias at stages along the imaging machine learning life cycle, attempting to simplify potentially intimidating technical terminology for general radiologists using AI tools in practice or collaborating with data scientists and engineers for AI tool development. The authors review definitions of bias in AI, describe common sources of bias, and present recommendations to guide quality control measures to mitigate the impact of bias in imaging AI. 
Understanding the terms featured in this article will enable a proactive approach to identifying and mitigating bias in imaging AI. Published under a CC BY 4.0 license. Test Your Knowledge questions for this article are available in the supplemental material. See the invited commentary by Rouzrokh and Erickson in this issue.
Affiliation(s)
- Ali S Tejani
- From the Department of Radiology, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390
| | - Yee Seng Ng
- From the Department of Radiology, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390
| | - Yin Xi
- From the Department of Radiology, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390
| | - Jesse C Rayan
- From the Department of Radiology, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390
| |
|
17
|
Shi S, Bao J, Guo Z, Han Y, Xu Y, Egbeagu UU, Zhao L, Jiang N, Sun L, Liu X, Liu W, Chang N, Zhang J, Sun Y, Xu X, Fu S. Improving prediction of N2O emissions during composting using model-agnostic meta-learning. The Science of the Total Environment 2024; 922:171357. [PMID: 38431167 DOI: 10.1016/j.scitotenv.2024.171357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Revised: 02/24/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Nitrous oxide (N2O) represents a significant environmental challenge as a harmful, long-lived greenhouse gas that contributes to the depletion of stratospheric ozone and exacerbates global anthropogenic greenhouse warming. Composting is considered a promising and economically feasible strategy for the treatment of organic waste. However, recent research indicates that composting is a source of N2O, contributing to atmospheric pollution and the greenhouse effect. Consequently, there is a need for effective, cost-efficient methodologies to quantify N2O emissions accurately. In this study, we employed the model-agnostic meta-learning (MAML) method to improve the prediction of N2O emissions during manure composting. The highest R² and lowest root mean squared error (RMSE) values achieved were 0.939 and 18.42 mg d⁻¹, respectively. Five machine learning methods, namely the backpropagation neural network, extreme learning machine (ELM), an integrated machine learning method based on ELM and random forest, gradient boosting decision tree, and extreme gradient boosting, were adopted for comparison to further demonstrate the effectiveness of the MAML prediction model. Feature analysis showed that the moisture content of the structure material and the ammonium concentration during the composting process were the two most significant features affecting N2O emissions. This study serves as a proof of concept for applying MAML to N2O emissions prediction and gives new insights into the effects of manure material properties and composting process data on N2O emissions. This approach helps determine strategies for mitigating N2O emissions.
Affiliation(s)
- Shuai Shi
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Jiaxin Bao
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Zhiheng Guo
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Yue Han
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Yonghui Xu
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Ugochi Uzoamaka Egbeagu
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Liyan Zhao
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Nana Jiang
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Lei Sun
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Xinda Liu
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Wanying Liu
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Nuo Chang
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Jining Zhang
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Yu Sun
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China
| | - Xiuhong Xu
- College of Resources and Environment, Northeast Agricultural University, Harbin 150030, China.
| | - Song Fu
- School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150030, China.
| |
|
18
|
Syed T, Krujatz F, Ihadjadene Y, Mühlstädt G, Hamedi H, Mädler J, Urbas L. A review on machine learning approaches for microalgae cultivation systems. Comput Biol Med 2024; 172:108248. [PMID: 38493599 DOI: 10.1016/j.compbiomed.2024.108248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 02/15/2024] [Accepted: 03/06/2024] [Indexed: 03/19/2024]
Abstract
Microalgae play a crucial role in biomass production within aquatic environments and are increasingly recognized for their potential in generating biofuels, biomaterials, bioactive compounds, and bio-based chemicals. This growing significance is driven by the need to address imminent global challenges such as food and fuel shortages. Enhancing the value chain of bio-based products necessitates the implementation of an advanced screening and monitoring system. This system is crucial for tailoring and optimizing the cultivation conditions, ensuring the lucrative and efficient production of the final desired product. This, in turn, underscores the necessity for robust predictive models to accurately emulate algae growth in different conditions during the initial cultivation phase and simulate their subsequent processing in the downstream stage. In pursuit of these objectives, diverse mechanistic and machine learning-based methods have been independently employed to model and optimize microalgae processes. This review article thoroughly examines the techniques delineated in the literature for modeling, predicting, and monitoring microalgal biomass across various applications such as bioenergy, pharmaceuticals, and the food industry. While highlighting the merits and limitations of each method, we delve into the realm of newly emerging hybrid approaches and conduct an exhaustive survey of this evolving methodology. The challenges currently impeding the practical implementation of hybrid techniques are explored, and drawing inspiration from successful applications in other machine-learning-assisted fields, we review various plausible solutions to overcome these obstacles.
Affiliation(s)
- Tehreem Syed
- Institute of Automation, Technische Universität Dresden, 01062, Saxony, Germany
| | - Felix Krujatz
- Faculty of Natural and Environmental Sciences, University of Applied Sciences Zittau/Görlitz, 02763, Zittau, Germany; Institute of Natural Materials Technology, Technische Universität Dresden, 01069, Saxony, Germany
| | - Yob Ihadjadene
- Institute of Natural Materials Technology, Technische Universität Dresden, 01069, Saxony, Germany
| | | | - Homa Hamedi
- Institute of Process Engineering and Environmental Technology, Technische Universität Dresden, 01062, Saxony, Germany
| | - Jonathan Mädler
- Institute of Process Engineering and Environmental Technology, Technische Universität Dresden, 01062, Saxony, Germany.
| | - Leon Urbas
- Institute of Automation, Technische Universität Dresden, 01062, Saxony, Germany; Institute of Process Engineering and Environmental Technology, Technische Universität Dresden, 01062, Saxony, Germany
| |
|
19
|
Ellen JG, Matos J, Viola M, Gallifant J, Quion J, Anthony Celi L, Abu Hussein NS. Participant flow diagrams for health equity in AI. J Biomed Inform 2024; 152:104631. [PMID: 38548006 DOI: 10.1016/j.jbi.2024.104631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Revised: 12/29/2023] [Accepted: 03/26/2024] [Indexed: 04/01/2024]
Abstract
Selection bias can arise through many aspects of a study, including recruitment, inclusion/exclusion criteria, input-level exclusion and outcome-level exclusion, and often reflects the underrepresentation of populations historically disadvantaged in medical research. The effects of selection bias can be further amplified when non-representative samples are used in artificial intelligence (AI) and machine learning (ML) applications to construct clinical algorithms. Building on the "Data Cards" initiative for transparency in AI research, we advocate for the addition of a participant flow diagram for AI studies detailing relevant sociodemographic and/or clinical characteristics of excluded participants across study phases, with the goal of identifying potential algorithmic biases before their clinical implementation. We include both a model for this flow diagram as well as a brief case study explaining how it could be implemented in practice. Through standardized reporting of participant flow diagrams, we aim to better identify potential inequities embedded in AI applications, facilitating more reliable and equitable clinical algorithms.
Affiliation(s)
| | - João Matos
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Faculty of Engineering, University of Porto, Porto, Portugal; Institute for Systems and Computer Engineering, Technology and Science (INESCTEC), Porto, Portugal
| | | | - Jack Gallifant
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Critical Care, Guy's and St Thomas' NHS Trust, London, United Kingdom
| | - Justin Quion
- University of the East Ramon Magsaysay Memorial Medical School, Quezon City, Philippines
| | - Leo Anthony Celi
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
| | | |
|
20
|
Chadaga K, Prabhu S, Sampathila N, Chadaga R, Bhat D, Sharma AK, Swathi KS. SADXAI: Predicting social anxiety disorder using multiple interpretable artificial intelligence techniques. SLAS Technol 2024; 29:100129. [PMID: 38508237 DOI: 10.1016/j.slast.2024.100129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Accepted: 03/17/2024] [Indexed: 03/22/2024]
Abstract
Social anxiety disorder (SAD), also known as social phobia, is a psychological condition in which a person has a persistent and overwhelming fear of being negatively judged or observed by other individuals. This fear can affect them at work, in relationships, and in other social activities. An intricate combination of several environmental and biological factors is the reason for the onset of this mental condition. SAD is diagnosed using a test called the "Diagnostic and Statistical Manual of Mental Health Disorders" (DSM-5), which is based on several physical, emotional, and demographic symptoms. Artificial intelligence has been a boon for medicine and is regularly used to diagnose various health conditions and diseases. Hence, this study used demographic, emotional, and physical symptoms and multiple machine learning (ML) techniques to diagnose SAD. A thorough descriptive and statistical analysis was conducted before using the classifiers. Among all the models, AdaBoost and logistic regression obtained the highest accuracy of 88% each. Four explainable artificial intelligence (XAI) techniques are utilized to make the predictions interpretable, transparent, and understandable. According to XAI, the "Liebowitz Social Anxiety Scale questionnaire" and "The fear of speaking in public" are the most critical attributes in the diagnosis of SAD. This clinical decision support system framework could be utilized in various suitable locations such as schools, hospitals, and workplaces to identify SAD in people.
Affiliation(s)
- Krishnaraj Chadaga
- Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India
| | - Srikanth Prabhu
- Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India.
| | - Niranjana Sampathila
- Department of Biomedical Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India.
| | - Rajagopala Chadaga
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India
| | - Devadas Bhat
- Department of Biomedical Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India
| | - Akhilesh Kumar Sharma
- Department of Data Science and Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
| | - K S Swathi
- Prasanna School of Public Health, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India
| |
|
21
|
Muludi K, Setianingsih R, Sholehurrohman R, Junaidi A. Exploiting nearest neighbor data and fuzzy membership function to address missing values in classification. PeerJ Comput Sci 2024; 10:e1968. [PMID: 38660203 PMCID: PMC11042039 DOI: 10.7717/peerj-cs.1968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 03/07/2024] [Indexed: 04/26/2024]
Abstract
The accuracy of most classification methods is significantly affected by missing values. Therefore, this study aimed to propose a data imputation method that handles missing values by applying nearest neighbor data and a fuzzy membership function, and to compare the results with standard methods. A total of five datasets related to classification problems, obtained from the UCI Machine Learning Repository, were used. The results showed that the proposed method had higher accuracy than standard imputation methods. Moreover, the triangular fuzzy membership function performed better than the Gaussian one. This showed that the combination of nearest neighbor data and a fuzzy membership function was more effective in handling missing values and improving classification accuracy.
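The abstract above does not spell out how neighbor data and membership functions are combined, so the following is only a generic sketch under stated assumptions, not the authors' algorithm. It shows the standard triangular and Gaussian membership functions and one hypothetical way to use them: a missing value is filled with the mean of its neighbors' values, each weighted by a triangular membership of its distance. All function names and parameters are illustrative.

```python
import math


def triangular_membership(x, left, peak, right):
    """Standard triangular membership: 0 outside (left, right), 1 at peak."""
    if not left < peak < right:
        raise ValueError("require left < peak < right")
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def gaussian_membership(x, center, sigma):
    """Standard Gaussian membership centered at `center` with width `sigma`."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def weighted_impute(neighbor_values, neighbor_distances, max_distance):
    """Hypothetical imputation: membership-weighted mean of neighbor values.

    Each neighbor's weight is a triangular membership of its distance,
    equal to 1 at distance 0 and falling linearly to 0 at max_distance.
    """
    weights = [
        triangular_membership(d, -max_distance, 0.0, max_distance)
        for d in neighbor_distances
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all neighbors lie at or beyond max_distance")
    return sum(w * v for w, v in zip(weights, neighbor_values)) / total


# Toy example: a neighbor at distance 0 counts twice as much as one at distance 5.
imputed = weighted_impute([10.0, 20.0], [0.0, 5.0], max_distance=10.0)
print(f"imputed value: {imputed:.3f}")
```

Swapping `triangular_membership` for `gaussian_membership` in the weighting step changes how quickly distant neighbors lose influence, which is the kind of comparison the abstract reports.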
Affiliation(s)
- Kurnia Muludi
- Informatics and Business Institute Darmajaya, Bandar Lampung, Lampung Province, Indonesia
| | - Revita Setianingsih
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| | - Ridho Sholehurrohman
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| | - Akmal Junaidi
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| |
|
22
|
Parsaei M, Arvin A, Taebi M, Seyedmirzaei H, Cattarinussi G, Sambataro F, Pigoni A, Brambilla P, Delvecchio G. Machine Learning for prediction of violent behaviors in schizophrenia spectrum disorders: a systematic review. Front Psychiatry 2024; 15:1384828. [PMID: 38577400 PMCID: PMC10991827 DOI: 10.3389/fpsyt.2024.1384828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2024] [Accepted: 03/08/2024] [Indexed: 04/06/2024] Open
Abstract
Background Schizophrenia spectrum disorders (SSD) can be associated with an increased risk of violent behavior (VB), which can harm patients, others, and property. Prediction of VB could help reduce the SSD burden on patients and healthcare systems. Some recent studies have used machine learning (ML) algorithms to identify SSD patients at risk of VB. In this article, we aimed to review studies that used ML to predict VB in SSD patients and discuss the most successful ML methods and predictors of VB. Methods We performed a systematic search in PubMed, Web of Science, Embase, and PsycINFO on September 30, 2023, to identify studies on the application of ML in predicting VB in SSD patients. Results We included 18 studies with data from 11,733 patients diagnosed with SSD. Different ML models demonstrated mixed performance, with an area under the receiver operating characteristic curve of 0.56-0.95 and an accuracy of 50.27-90.67% in predicting violence among SSD patients. Our comparative analysis demonstrated superior performance for the gradient boosting model compared to other ML models in predicting VB among SSD patients. Various sociodemographic, clinical, metabolic, and neuroimaging features were associated with VB, with age and olanzapine equivalent dose at the time of discharge being the most frequently identified factors. Conclusion ML models demonstrated varied VB prediction performance in SSD patients, with gradient boosting outperforming the other models. Further research is warranted for clinical applications of ML methods in this field.
Affiliation(s)
- Mohammadamin Parsaei
- Maternal, Fetal & Neonatal Research Center, Family Health Research Institute, Tehran University of Medical Sciences, Tehran, Iran
| | - Alireza Arvin
- Center for Orthopedic Trans-disciplinary Applied Research (COTAR), Tehran University of Medical Sciences, Tehran, Iran
| | - Morvarid Taebi
- Center for Orthopedic Trans-disciplinary Applied Research (COTAR), Tehran University of Medical Sciences, Tehran, Iran
| | - Homa Seyedmirzaei
- Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, Tehran, Iran
| | - Giulia Cattarinussi
- Department of Neuroscience (DNS), Padua Neuroscience Center, University of Padova, Padua, Italy
- Padua Neuroscience Center, University of Padova, Padua, Italy
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, Kings College London, London, United Kingdom
| | - Fabio Sambataro
- Department of Neuroscience (DNS), Padua Neuroscience Center, University of Padova, Padua, Italy
- Padua Neuroscience Center, University of Padova, Padua, Italy
| | - Alessandro Pigoni
- Social and Affective Neuroscience Group, MoMiLab, Institutions, Markets, Technologies (IMT) School for Advanced Studies Lucca, Lucca, Italy
- Department of Pathophysiology and Transplantation, University of Milan, Milan, Italy
| | - Paolo Brambilla
- Social and Affective Neuroscience Group, MoMiLab, Institutions, Markets, Technologies (IMT) School for Advanced Studies Lucca, Lucca, Italy
- Department of Pathophysiology and Transplantation, University of Milan, Milan, Italy
- Department of Neurosciences and Mental Health, Fondazione Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Giuseppe Delvecchio
- Department of Neurosciences and Mental Health, Fondazione Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy
| |
|
23
|
Wang M, Ye XW, Ying XH, Jia JD, Ding Y, Zhang D, Sun F. Data Imputation of Soil Pressure on Shield Tunnel Lining Based on Random Forest Model. Sensors (Basel, Switzerland) 2024; 24:1560. [PMID: 38475093 DOI: 10.3390/s24051560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 02/24/2024] [Accepted: 02/27/2024] [Indexed: 03/14/2024]
Abstract
With the advancement of engineering techniques, underground shield tunneling projects have started incorporating emerging technologies to monitor the forces and displacements during the construction and operation phases of shield tunnels. Monitoring devices installed on the tunnel segment components generate a large amount of data. However, due to various factors, data may be missing. Hence, completing the incomplete data is imperative to ensure the safety of the engineering project. In this research, a missing data imputation technique utilizing Random Forest (RF) is introduced. The optimal combination of the number of decision trees, maximum depth, and number of features in the RF is determined by minimizing the Mean Squared Error (MSE). Subsequently, complete soil pressure data are artificially manipulated to create incomplete datasets with missing rates of 20%, 40%, and 60%. A comparative analysis of the imputation results using three methods (median, mean, and RF) reveals that the proposed method has the smallest imputation error. As the missing rate increases, the mean squared error of the Random Forest method and of the other two methods also increases, with a maximum difference of about 70%. This indicates that the random forest method is suitable for imputing monitoring data.
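The evaluation protocol described above (mask a complete series at a chosen missing rate, impute, then score the imputed entries by MSE) can be sketched as follows. This is a standard-library illustration with hypothetical function names and made-up readings; it covers only the mean and median baselines, and the Random Forest imputer itself is not reproduced here.

```python
import random
import statistics


def mask_values(values, missing_rate, seed=0):
    """Hide a fraction of the entries (set to None) and return the hidden indices."""
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)


def impute_constant(masked, fill_fn):
    """Fill every None with one statistic (e.g. mean or median) of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = fill_fn(observed)
    return [fill if v is None else v for v in masked]


def mse_on_missing(truth, imputed, missing_idx):
    """Mean squared error computed only over the entries that were hidden."""
    return sum((truth[i] - imputed[i]) ** 2 for i in missing_idx) / len(missing_idx)


# Usage with illustrative (not measured) pressure readings.
readings = [101.0, 103.5, 99.8, 102.2, 104.1, 100.4, 98.9, 105.0, 102.7, 101.6]
for rate in (0.2, 0.4, 0.6):
    masked, idx = mask_values(readings, rate)
    for name, fn in (("mean", statistics.mean), ("median", statistics.median)):
        error = mse_on_missing(readings, impute_constant(masked, fn), idx)
        print(f"rate={rate:.0%} {name}: MSE={error:.3f}")
```

Any model-based imputer can be scored the same way by replacing `impute_constant` with a function that returns a fully filled list.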
Affiliation(s)
- Min Wang
- Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Xiao-Wei Ye
- Department of Civil Engineering, Zhejiang University, Hangzhou 310058, China
| | - Xin-Hong Ying
- Department of Civil Engineering, Zhejiang University, Hangzhou 310058, China
| | - Jin-Dian Jia
- Department of Civil Engineering, Zhejiang University, Hangzhou 310058, China
| | - Yang Ding
- Department of Civil Engineering, Hangzhou City University, Hangzhou 310015, China
| | - Di Zhang
- China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan 430063, China
| | - Feng Sun
- China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan 430063, China
| |
|
24
|
Li J, Guo S, Ma R, He J, Zhang X, Rui D, Ding Y, Li Y, Jian L, Cheng J, Guo H. Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets. BMC Med Res Methodol 2024; 24:41. [PMID: 38365610 PMCID: PMC10870437 DOI: 10.1186/s12874-024-02173-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 02/05/2024] [Indexed: 02/18/2024] Open
Abstract
BACKGROUND Missing data is often an unavoidable issue in cohort studies and can adversely affect a study's findings. We assess the effectiveness of eight frequently utilized statistical and machine learning (ML) imputation methods for dealing with missing data in predictive modelling of cohort study datasets. This evaluation is based on real data and predictive models for cardiovascular disease (CVD) risk. METHODS The data is from a real-world cohort study in Xinjiang, China. It includes personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects with a total of 37 variables. Simple imputation (Simple), regression imputation (Regression), expectation-maximization (EM), multiple imputation (MICE), K nearest neighbor classification (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart) were the chosen imputation methods. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) were utilised to assess the performance of the different methods for missing data imputation at a missing rate of 20%. The datasets processed with the different imputation methods were employed to construct a CVD risk prediction model utilizing the support vector machine (SVM). The predictive performance was then compared using the area under the curve (AUC). RESULTS The most effective imputation results were attained by KNN (MAE: 0.2032, RMSE: 0.7438, AUC: 0.730, CI: 0.719-0.741) and RF (MAE: 0.3944, RMSE: 1.4866, AUC: 0.777, CI: 0.769-0.785). The subsequent best performances were achieved by EM, Cart, and MICE, while Simple, Regression, and Cluster attained the worst performances. The CVD risk prediction model built on the complete data (AUC: 0.804, CI: 0.796-0.812) served as the comparison, differing from all other models at p<0.05. CONCLUSION KNN and RF exhibit superior performance and are more adept at imputing missing data in predictive modelling of cohort study datasets.
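As an illustration of the two error metrics and of nearest-neighbor imputation, here is a minimal standard-library sketch. The `knn_impute` helper is a simplified, hypothetical variant (mean of the k closest rows, with distance taken over the columns both rows have observed) and is not the configuration used in the study.

```python
import math


def mae(truth, pred):
    """Mean Absolute Error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def rmse(truth, pred):
    """Root Mean Square Error between true and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest donor rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor must have column j observed
                shared = [c for c in range(len(row))
                          if c != j and row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root mean squared difference over the shared observed columns.
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Usage: the last row is missing its second value; its two closest rows supply it.
table = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [1.05, None]]
completed = knn_impute(table, k=2)
```

In practice the columns would usually be standardized first so that no single variable dominates the distance.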
Affiliation(s)
- JiaHang Li
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - ShuXia Guo
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - RuLin Ma
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - Jia He
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - XiangHui Zhang
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - DongSheng Rui
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - YuSong Ding
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - Yu Li
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China
| | - LeYao Jian
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
| | - Jing Cheng
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China
| | - Heng Guo
- Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
- Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
| |
|
25
|
Ghaedi H, Davey SK, Feilotter H. Variant Classification Discordance: Contributing Factors and Predictive Models. J Mol Diagn 2024; 26:115-126. [PMID: 38008287 DOI: 10.1016/j.jmoldx.2023.11.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 08/04/2023] [Accepted: 11/07/2023] [Indexed: 11/28/2023] Open
Abstract
An ever-growing catalog of human variants is hosted in the ClinVar database. In this database, submissions on a variant are combined into a multisubmitter record; in the case of discordance in variant classification between submitters, the record is labeled as conflicting. The current study used ClinVar data to identify characteristics that make variants more likely to fall into the conflict class. Furthermore, the Extreme Gradient Boosting algorithm was used to train classifier models that predict classification discordance for single-submission variants in the ClinVar database. Population allele frequency, the gene harboring the variant, variant type, consequence on protein, variant deleteriousness score, first submitter identity, and submission count were associated with conflict in variant classification. Using such features, the optimized classifier showed an accuracy of 88% on the test set, with weighted averages of precision, recall, and f1-score of 0.84, 0.88, and 0.85, respectively. There were pronounced associations between variant classification discordance and allele frequency, gene type, and the identity of the first submitter. The study provides the predicted discordance status for single-submitter variants deposited in ClinVar. This approach can be used to assess whether single-submitter variants are likely to be supported by, or in conflict with, future entries; this knowledge may help laboratories with clinical variant assessment.
Affiliation(s)
- Hamid Ghaedi
- Department of Pathology and Molecular Medicine, Queen's University, Kingston, Ontario, Canada
| | - Scott K Davey
- Division of Cancer Biology and Genetics, Department of Pathology and Molecular Medicine, Queen's University Cancer Research Institute, Kingston, Ontario, Canada
| | - Harriet Feilotter
- Department of Pathology and Molecular Medicine, Queen's University, Kingston, Ontario, Canada.
| |
|
26
|
Joseph J, Niemczak C, Lichtenstein J, Kobrina A, Magohe A, Leigh S, Ealer C, Fellows A, Reike C, Massawe E, Gui J, Buckey JC. Central auditory test performance predicts future neurocognitive function in children living with and without HIV. Sci Rep 2024; 14:2712. [PMID: 38302516 PMCID: PMC10834399 DOI: 10.1038/s41598-024-52380-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 01/18/2024] [Indexed: 02/03/2024] Open
Abstract
Tests of the brain's ability to process complex sounds (central auditory tests) correlate with overall measures of neurocognitive performance. In low- and middle-income countries, where resources to conduct detailed cognitive testing are limited, tests that assess the central auditory system may provide a novel and useful way to track neurocognitive performance. This could be particularly useful for children living with HIV (CLWH). To evaluate this, we administered central auditory tests to CLWH and children living without HIV and examined whether central auditory tests given early in a child's life could predict later neurocognitive performance. We used a machine learning technique to incorporate factors known to affect performance on neurocognitive tests, such as education. The results show that central auditory tests are useful predictors of neurocognitive performance and perform as well as, or in some cases better than, factors such as education. Central auditory tests may offer an objective way to track neurocognitive performance in CLWH.
Affiliation(s)
- Jeff Joseph
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Christopher Niemczak
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
| | - Jonathan Lichtenstein
- Department of Psychiatry, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
- The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Anastasiya Kobrina
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Albert Magohe
- Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania
| | - Samantha Leigh
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Christin Ealer
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Abigail Fellows
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Catherine Reike
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Enica Massawe
- Muhimbili University of Health and Allied Sciences, Dar es Salaam, Tanzania
| | - Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Jay C Buckey
- Department of Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Hanover, NH, USA.
- Space Medicine Innovations Laboratory, Geisel School of Medicine at Dartmouth, Hanover, NH, USA.
- Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA.
| |
|
27
|
Jacob Junior AFL, do Carmo FA, de Santana AL, Santana EEC, Lobato FMF. EvoImp: Multiple Imputation of Multi-label Classification data with a genetic algorithm. PLoS One 2024; 19:e0297147. [PMID: 38241256 PMCID: PMC10798481 DOI: 10.1371/journal.pone.0297147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Accepted: 12/28/2023] [Indexed: 01/21/2024] Open
Abstract
Missing data is a prevalent problem that requires attention, as most data analysis techniques are unable to handle it. This is particularly critical in Multi-Label Classification (MLC), where only a few studies have investigated missing data. MLC differs from Single-Label Classification (SLC) by allowing an instance to be associated with multiple classes. Movie classification is a didactic example, since a movie can be "drama" and "biography" simultaneously. One of the most common missing data treatment methods is data imputation, which seeks plausible values to fill in the missing ones. In this scenario, we propose a novel imputation method based on a multi-objective genetic algorithm for optimizing multiple data imputations, called Multiple Imputation of Multi-label Classification data with a genetic algorithm, or simply EvoImp. We applied the proposed method in multi-label learning and evaluated its performance using six synthetic databases, considering various missing value distribution scenarios. The method was compared with other state-of-the-art imputation strategies, such as K-Means Imputation (KMI) and weighted K-Nearest Neighbors Imputation (WKNNI). The results showed that the proposed method outperformed the baselines in all scenarios, achieving the best evaluation measures for Exact Match, Accuracy, and Hamming Loss. The superior results were consistent across dataset domains and sizes, demonstrating the robustness of EvoImp. Thus, EvoImp represents a feasible solution to missing data treatment for multi-label learning.
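For readers unfamiliar with two of the multi-label measures named above, the sketch below computes Exact Match and Hamming Loss using their usual textbook definitions. The function names and the toy label matrices are illustrative assumptions, and the abstract does not state which exact variants the authors used.

```python
def exact_match(y_true, y_pred):
    """Share of instances whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Share of individual label slots that are predicted incorrectly."""
    n_labels = len(y_true[0])
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * n_labels)


# Usage: columns could stand for ("drama", "biography", "comedy").
truth = [[1, 0, 1], [0, 1, 0]]
predicted = [[1, 0, 1], [0, 0, 0]]
em = exact_match(truth, predicted)   # one of two rows matches exactly
hl = hamming_loss(truth, predicted)  # one of six label slots is wrong
```

Exact Match is strict (a single wrong label fails the whole instance), while Hamming Loss gives partial credit per label, which is why the two are often reported together.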
Affiliation(s)
- Antonio Fernando Lavareda Jacob Junior
- Graduate Program in Electrical Engineering (PPGEE), Federal University of Maranhão (UFMA), São Luís, Maranhão, Brazil
- Graduate Program in Computer Engineering and Systems (PECS), State University of Maranhão (UEMA), São Luís, Maranhão, Brazil
| | - Fabricio Almeida do Carmo
- Graduate Program in Computer Engineering and Systems (PECS), State University of Maranhão (UEMA), São Luís, Maranhão, Brazil
| | | | - Ewaldo Eder Carvalho Santana
- Graduate Program in Electrical Engineering (PPGEE), Federal University of Maranhão (UFMA), São Luís, Maranhão, Brazil
- Graduate Program in Computer Engineering and Systems (PECS), State University of Maranhão (UEMA), São Luís, Maranhão, Brazil
| | - Fabio Manoel Franca Lobato
- Graduate Program in Computer Engineering and Systems (PECS), State University of Maranhão (UEMA), São Luís, Maranhão, Brazil
- Institute of Engineering and Geosciences, Federal University of Western Pará (UFOPA), Santarém, Pará, Brazil
| |
|
28
|
Chung CW, Chou SC, Hsiao TH, Zhang GJ, Chung YF, Chen YM. Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records. BioData Min 2024; 17:1. [PMID: 38183082 PMCID: PMC10770905 DOI: 10.1186/s13040-023-00352-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 12/19/2023] [Indexed: 01/07/2024] Open
Abstract
BACKGROUND Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) require a positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records. METHODS Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA equal to or exceeding 1:640. RESULTS A total of 946 patients with SLE and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in differentiating SLE from non-SLE. The leading features in the SHAP diagram were anti-double strand DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively associated with SLE were discovered. CONCLUSIONS ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.
Affiliation(s)
- Chih-Wei Chung
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Seng-Cho Chou
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Tzu-Hung Hsiao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Department of Public Health, Fu Jen Catholic University, New Taipei City, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Grace Joyce Zhang
- Department of Cellular and Physiological Sciences, The University of British Columbia, Vancouver, BC, Canada
| | - Yu-Fang Chung
- Department of Electrical Engineering, Tunghai University, Taichung, Taiwan
| | - Yi-Ming Chen
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan.
- Division of Allergy, Immunology and Rheumatology, Department of Internal Medicine, Taichung Veterans General Hospital, 1650, Section 4, Taiwan Boulevard, Xitun Dist., Taichung City, 407, Taiwan.
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
- School of Medicine, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan.
- Rong Hsing Research Center for Translational Medicine & Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan.
- Precision Medicine Research Center, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
| |
|
29
|
Weng X, Song H, Lin Y, Wu Y, Zhang X, Liu B, Yang J. A joint learning method for incomplete and imbalanced data in electronic health record based on generative adversarial networks. Comput Biol Med 2024; 168:107687. [PMID: 38007974 DOI: 10.1016/j.compbiomed.2023.107687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 10/07/2023] [Accepted: 11/06/2023] [Indexed: 11/28/2023]
Abstract
Electronic health records (EHR) present challenges of incomplete and imbalanced data in clinical predictions. Previous studies addressed these two issues separately in two steps, which decreased the performance of prediction tasks. In this paper, we propose a unified framework to simultaneously address the challenges of incomplete and imbalanced data in EHR. Based on the framework, we develop a model called Missing Value Imputation and Imbalanced Learning Generative Adversarial Network (MVIIL-GAN). We use MVIIL-GAN to perform joint learning on the imputation process of high missing rate data and the conditional generation process of EHR data. The joint learning is achieved by introducing two discriminators to distinguish the fake data from the generated data at sample-level and variable-level. MVIIL-GAN integrates missing value imputation and data generation in one step, improving the consistency of parameter optimization and the performance of prediction tasks. We evaluate our framework using the public dataset MIMIC-IV, which contains data with high missing rates and imbalanced classes. Experimental results show that MVIIL-GAN outperforms existing methods in prediction performance. The implementation of MVIIL-GAN can be found at https://github.com/Peroxidess/MVIIL-GAN.
Affiliation(s)
- Xutao Weng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
| | - Hong Song
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Yucong Lin
- School of Optics and Photonics, Beijing Institute of Technology, Beijing, 100081, China
| | - You Wu
- School of Medical Technology, Beijing Institute of Technology, Beijing, 100081, China
| | - Xi Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
| | - Bowen Liu
- School of Medical Technology, Beijing Institute of Technology, Beijing, 100081, China
| | - Jian Yang
- School of Optics and Photonics, Beijing Institute of Technology, Beijing, 100081, China.
| |
|
30
|
Xu Z, Tang J, Qi C, Yao D, Liu C, Zhan Y, Lukasiewicz T. Cross-domain attention-guided generative data augmentation for medical image analysis with limited data. Comput Biol Med 2024; 168:107744. [PMID: 38006826 DOI: 10.1016/j.compbiomed.2023.107744] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/05/2023] [Revised: 11/12/2023] [Accepted: 11/20/2023] [Indexed: 11/27/2023]
Abstract
Data augmentation is widely applied to medical image analysis tasks in limited datasets with imbalanced classes and insufficient annotations. However, traditional augmentation techniques cannot supply extra information, making the performance of diagnosis unsatisfactory. GAN-based generative methods have thus been proposed to obtain additional useful information and realize more effective data augmentation, but existing generative data augmentation techniques mainly encounter two problems: (i) current generative data augmentation lacks the capability to use cross-domain differential information to extend limited datasets; (ii) existing generative methods cannot provide effective supervised information in medical image segmentation tasks. To solve these problems, we propose an attention-guided cross-domain tumor image generation model (CDA-GAN) with an information enhancement strategy. The CDA-GAN can generate diverse samples to expand the scale of datasets, improving the performance of medical image diagnosis and treatment tasks. In particular, we incorporate channel attention into a CycleGAN-based cross-domain generation network that captures inter-domain information and generates positive or negative samples of brain tumors. In addition, we propose a semi-supervised spatial attention strategy to guide spatial information of features at the pixel level in tumor generation. Furthermore, we add spectral normalization to prevent the discriminator from mode collapse and stabilize the training procedure. Finally, to resolve an inapplicability problem in the segmentation task, we further propose an application strategy of using this data augmentation model to achieve more accurate medical image segmentation with limited data.
Experimental studies on two public brain tumor datasets (BraTS and TCIA) show that the proposed CDA-GAN model greatly outperforms the state-of-the-art generative data augmentation in both practical medical image classification tasks and segmentation tasks; e.g. CDA-GAN is 0.50%, 1.72%, 2.05%, and 0.21% better than the best SOTA baseline in terms of ACC, AUC, Recall, and F1, respectively, in the classification task of BraTS, while its improvements w.r.t. the best SOTA baseline in terms of Dice, Sens, HD95, and mIOU, in the segmentation task of TCIA are 2.50%, 0.90%, 14.96%, and 4.18%, respectively.
Affiliation(s)
- Zhenghua Xu
- State Key Laboratory of Reliability and Intelligence of Electrical Equipment, School of Health Sciences and Biomedical Engineering, Hebei University of Technology, Tianjin, China
| | - Jiaqi Tang
- State Key Laboratory of Reliability and Intelligence of Electrical Equipment, School of Health Sciences and Biomedical Engineering, Hebei University of Technology, Tianjin, China
| | - Chang Qi
- State Key Laboratory of Reliability and Intelligence of Electrical Equipment, School of Health Sciences and Biomedical Engineering, Hebei University of Technology, Tianjin, China; Institute of Logic and Computation, Vienna University of Technology, Vienna, Austria.
| | - Dan Yao
- State Key Laboratory of Reliability and Intelligence of Electrical Equipment, School of Health Sciences and Biomedical Engineering, Hebei University of Technology, Tianjin, China
| | - Caihua Liu
- College of Computer Science and Technology, Civil Aviation University of China, Tianjin, China
| | - Yuefu Zhan
- Department of Radiology, Hainan Women and Children's Medical Center, Haikou, China
| | - Thomas Lukasiewicz
- Institute of Logic and Computation, Vienna University of Technology, Vienna, Austria; Department of Computer Science, University of Oxford, Oxford, United Kingdom
| |
|
31
|
Yoon M, Park JJ, Hur T, Hua CH, Hussain M, Lee S, Choi DJ. Application and Potential of Artificial Intelligence in Heart Failure: Past, Present, and Future. International Journal of Heart Failure 2024; 6:11-19. [PMID: 38303917 PMCID: PMC10827704 DOI: 10.36628/ijhf.2023.0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 11/24/2023] [Accepted: 11/26/2023] [Indexed: 02/03/2024]
Abstract
The prevalence of heart failure (HF) is increasing, necessitating accurate diagnosis and tailored treatment. The accumulation of clinical information from patients with HF generates big data, which poses challenges for traditional analytical methods. To address this, big data approaches and artificial intelligence (AI) have been developed that can effectively predict future observations and outcomes, enabling precise diagnoses and personalized treatments of patients with HF. Machine learning (ML) is a subfield of AI that allows computers to analyze data, find patterns, and make predictions without explicit instructions. ML can be supervised, unsupervised, or semi-supervised. Deep learning is a branch of ML that uses artificial neural networks with multiple layers to find complex patterns. These AI technologies have shown significant potential in various aspects of HF research, including diagnosis, outcome prediction, classification of HF phenotypes, and optimization of treatment strategies. In addition, integrating multiple data sources, such as electrocardiography, electronic health records, and imaging data, can enhance the diagnostic accuracy of AI algorithms. Currently, wearable devices and remote monitoring aided by AI enable the earlier detection of HF and improved patient care. This review focuses on the rationale behind utilizing AI in HF and explores its various applications.
Affiliation(s)
- Minjae Yoon
- Division of Cardiology, Department of Internal Medicine, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Korea
| | - Jin Joo Park
- Division of Cardiology, Department of Internal Medicine, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Korea
| | - Taeho Hur
- Division of Cardiology, Department of Internal Medicine, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Korea
- Department of Computer Science and Engineering, Kyung Hee University, Yongin, Korea
| | - Cam-Hao Hua
- Department of Computer Science and Engineering, Kyung Hee University, Yongin, Korea
| | - Musarrat Hussain
- Department of Computer Science and Engineering, Kyung Hee University, Yongin, Korea
| | - Sungyoung Lee
- Department of Computer Science and Engineering, Kyung Hee University, Yongin, Korea
| | - Dong-Ju Choi
- Division of Cardiology, Department of Internal Medicine, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Korea
| |
|
32
|
Grzenda A, Widge AS. Electronic health records and stratified psychiatry: bridge to precision treatment? Neuropsychopharmacology 2024; 49:285-290. [PMID: 37667021 PMCID: PMC10700348 DOI: 10.1038/s41386-023-01724-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 08/24/2023] [Accepted: 08/27/2023] [Indexed: 09/06/2023]
Abstract
The use of a stratified psychiatry approach that combines electronic health records (EHR) data with machine learning (ML) is one potentially fruitful path toward rapidly improving precision treatment in clinical practice. This strategy, however, requires confronting pervasive methodological flaws as well as deficiencies in transparency and reporting in the current conduct of ML-based studies for treatment prediction. EHR data shares many of the same data quality issues as other types of data used in ML prediction, plus some unique challenges. To fully leverage EHR data's power for patient stratification, increased attention to data quality and collection of patient-reported outcome data is needed.
Affiliation(s)
- Adrienne Grzenda
- Department of Psychiatry & Biobehavioral Sciences, David Geffen School of Medicine, University of California-Los Angeles, Los Angeles, CA, USA.
- Olive View-UCLA Medical Center, Sylmar, CA, USA.
| | - Alik S Widge
- Department of Psychiatry & Behavioral Sciences, University of Minnesota, Minneapolis, MN, USA
| |
|
33
|
Murray JD, Lange JJ, Bennett-Lenane H, Holm R, Kuentz M, O'Dwyer PJ, Griffin BT. Advancing algorithmic drug product development: Recommendations for machine learning approaches in drug formulation. Eur J Pharm Sci 2023; 191:106562. [PMID: 37562550 DOI: 10.1016/j.ejps.2023.106562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 07/09/2023] [Accepted: 08/07/2023] [Indexed: 08/12/2023]
Abstract
Artificial intelligence is a rapidly expanding area of research, with the disruptive potential to transform traditional approaches in the pharmaceutical industry, from drug discovery and development to clinical practice. Machine learning, a subfield of artificial intelligence, has fundamentally transformed in silico modelling and has the capacity to streamline clinical translation. This paper reviews data-driven modelling methodologies with a focus on drug formulation development. Despite recent advances, there is limited modelling guidance specific to drug product development and a trend towards suboptimal modelling practices, resulting in models that may not give reliable predictions in practice. There is an overwhelming focus on benchtop experimental outcomes obtained for a specific modelling aim, leaving the capabilities of data scraping or the use of combined modelling approaches yet to be fully explored. Moreover, the preference for high accuracy can lead to a reliance on black box methods over interpretable models. This further limits the widespread adoption of machine learning as black boxes yield models that cannot be easily understood for the purposes of enhancing product performance. In this review, recommendations for conducting machine learning research for drug product development to ensure trustworthiness, transparency, and reliability of the models produced are presented. Finally, possible future directions on how research in this area might develop are discussed to aim for models that provide useful and robust guidance to formulators.
Affiliation(s)
- Jack D Murray
- School of Pharmacy, University College Cork, Cork, Ireland
| | - Justus J Lange
- School of Pharmacy, University College Cork, Cork, Ireland; Roche Pharmaceutical Research & Early Development, Pre-Clinical CMC, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Grenzacherstrasse 124, Basel, Switzerland
| | - René Holm
- Department of Physics, Chemistry and Pharmacy, University of Southern Denmark, Campusvej 55, Odense 5230, Denmark
| | - Martin Kuentz
- School of Life Sciences, University of Applied Sciences and Arts Northwestern Switzerland, Muttenz CH 4132, Switzerland
| |
|
34
|
Ferri P, Romero-Garcia N, Badenes R, Lora-Pablos D, Morales TG, Gómez de la Cámara A, García-Gómez JM, Sáez C. Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 242:107803. [PMID: 37703700 DOI: 10.1016/j.cmpb.2023.107803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 08/28/2023] [Accepted: 09/05/2023] [Indexed: 09/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Reusing Electronic Health Records (EHRs) for Machine Learning (ML) often leads to extremely incomplete and sparse tabular datasets, which can hinder model development and limit performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very limited number of data are complete, as opposed to the usual case of having a reduced number of missing values. METHODS We used a case study including full blood count laboratory data, demographic and survival data in the context of COVID-19 hospital admissions and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron. RESULTS Our results suggest that in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance, in terms of area under curve. CONCLUSIONS Based on our findings, we recommend the consideration of this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHR.
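To illustrate what "considering informative missingness" can look like in practice, the sketch below mean-imputes a small numeric table and appends a 0/1 missing-mask column per feature, so a downstream classifier can still see which values were absent. This is not the authors' pipeline (their translation and encoding method is not detailed in the abstract); the function name `impute_with_mask` and the toy data are assumptions made for illustration.

```python
from statistics import mean


def impute_with_mask(rows):
    """Mean-impute None entries and append one 0/1 missing indicator per column.

    Hypothetical helper for illustration; not taken from the cited study.
    """
    n_cols = len(rows[0])
    # Column means over observed values only (0.0 if a column is entirely missing).
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)

    out = []
    for r in rows:
        values = [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        mask = [1 if r[j] is None else 0 for j in range(n_cols)]
        out.append(values + mask)  # original features followed by mask features
    return out


# Toy table with two features; None marks a missing laboratory value.
toy = [[4.0, None], [6.0, 10.0], [None, 14.0]]
imputed = impute_with_mask(toy)
```

Each output row holds the imputed features followed by the mask bits, which is one simple way to let tree ensembles split on the fact that a value was missing.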
Affiliation(s)
- Pablo Ferri
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain.
| | - Rafael Badenes
- Departament de Cirugia, Universitat de València, Spain; Instituto INCLIVA, Hospital Clínico Universitario de Valencia, Spain; Department Anesthesiology, Surgical-Trauma Intensive Care and Pain Clinic, Hospital Clínic Universitari, Valencia, Spain
| | - David Lora-Pablos
- Instituto de Investigación imas12, Hospital 12 de Octubre, Madrid, Spain; Facultad de Estudios Estadísticos, Universidad Complutense de Madrid, Spain
| | - Juan M García-Gómez
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain
| | - Carlos Sáez
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain
| |
|
35
|
Li Q, He Y, Pan J. CrossFuse-XGBoost: accurate prediction of the maximum recommended daily dose through multi-feature fusion, cross-validation screening and extreme gradient boosting. Brief Bioinform 2023; 25:bbad511. [PMID: 38216539 PMCID: PMC10786712 DOI: 10.1093/bib/bbad511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 12/04/2023] [Accepted: 12/13/2023] [Indexed: 01/14/2024] Open
Abstract
In the drug development process, approximately 30% of failures are attributed to drug safety issues. In particular, the first-in-human (FIH) trial of a new drug represents one of the highest safety risks, and initial dose selection is crucial for ensuring safety in clinical trials. With traditional dose estimation methods, which extrapolate data from animals to humans, catastrophic events have occurred during Phase I clinical trials due to interspecies differences in compound sensitivity and unknown molecular mechanisms. To address this issue, this study proposes a CrossFuse-extreme gradient boosting (XGBoost) method that can directly predict the maximum recommended daily dose of a compound based on existing human research data, providing a reference for FIH dose selection. This method not only integrates multiple features, including molecular representations, physicochemical properties and compound-protein interactions, but also improves feature selection based on cross-validation. The results demonstrate that the CrossFuse-XGBoost method not only improves prediction accuracy compared to that of existing local weighted methods [k-nearest neighbor (k-NN) and variable k-NN (v-NN)] but also solves the low prediction coverage issue of v-NN, achieving full coverage of the external validation set and enabling more reliable predictions. Furthermore, this study offers a high level of interpretability by identifying the importance of different features in model construction. The 241 features with the most significant impact on the maximum recommended daily dose were selected, providing references for optimizing the structure of new compounds and guiding experimental research. The datasets and source code are freely available at https://github.com/cqmu-lq/CrossFuse-XGBoost.
Affiliation(s)
- Qiang Li
- Basic Medicine Research and Innovation Center for Novel Target and Therapeutic Intervention, Ministry of Education, Institute of Life Sciences, Chongqing Medical University, Chongqing 400016, China
| | - Yu He
- Basic Medicine Research and Innovation Center for Novel Target and Therapeutic Intervention, Ministry of Education, Institute of Life Sciences, Chongqing Medical University, Chongqing 400016, China
| | - Jianbo Pan
- Basic Medicine Research and Innovation Center for Novel Target and Therapeutic Intervention, Ministry of Education, Institute of Life Sciences, Chongqing Medical University, Chongqing 400016, China
| |
|
36
|
Francoeur PG, Koes DR. Expanding Training Data for Structure-Based Receptor-Ligand Binding Affinity Regression through Imputation of Missing Labels. ACS OMEGA 2023; 8:41680-41688. [PMID: 37970017 PMCID: PMC10634251 DOI: 10.1021/acsomega.3c05931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 10/10/2023] [Accepted: 10/17/2023] [Indexed: 11/17/2023]
Abstract
The success of machine learning is, in part, due to a large volume of data available to train models. However, the amount of training data for structure-based molecular property prediction remains limited. The previously described CrossDocked2020 data set expanded the available training data for binding pose classification in a molecular docking setting but did not address expanding the amount of receptor-ligand binding affinity data. We present experiments demonstrating that imputing binding affinity labels for complexes without experimentally determined binding affinities is a viable approach to expanding training data for structure-based models of receptor-ligand binding affinity. In particular, we demonstrate that utilizing imputed labels from a convolutional neural network trained only on the affinity data present in CrossDocked2020 results in a small improvement in the binding affinity regression performance, despite the additional sources of noise that such imputed labels add to the training data. The code, data splits, and imputation labels utilized in this paper are freely available at https://github.com/francoep/ImputationPaper.
Affiliation(s)
- Paul G. Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - David R. Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| |
|
37
|
Kim P, Serov N, Falchevskaya A, Shabalkin I, Dmitrenko A, Kladko D, Vinogradov V. Quantifying the Efficacy of Magnetic Nanoparticles for MRI and Hyperthermia Applications via Machine Learning Methods. SMALL (WEINHEIM AN DER BERGSTRASSE, GERMANY) 2023; 19:e2303522. [PMID: 37563807 DOI: 10.1002/smll.202303522] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/26/2023] [Revised: 07/16/2023] [Indexed: 08/12/2023]
Abstract
Magnetic nanoparticles are a promising class of materials for use in biomedicine as agents for magnetic resonance imaging (MRI) and hyperthermia treatment. However, synthesis of nanoparticles with high efficacy is resource-intensive experimental work. In turn, machine learning (ML) methods are becoming useful in materials design and offer a promising approach to designing nanomagnets for biomedicine. In this work, for the first time, an ML-based approach is developed for the prediction of the main parameters of material efficacy, i.e., specific absorption rate (SAR) for hyperthermia and r1/r2 relaxivities in MRI, with parameters of nanoparticles as well as experimental conditions as descriptors. For that, a unique database with more than 980 magnetic nanoparticles collected from scientific articles is assembled. Using these data, several tree-based ensemble models are trained to predict SAR, r1 and r2 relaxivity. After hyperparameter optimization, models reach performances of R² = 0.86, R² = 0.78, and R² = 0.75, respectively. Testing the models on samples unseen during training shows no performance drops. Finally, DiMag, an open access resource created to guide synthesis of novel nanosized magnets for MRI and hyperthermia treatment with machine learning and boost development of new biomedical agents, is developed.
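The R² values quoted in this abstract are coefficients of determination. As a reminder of how that metric is computed, here is a minimal sketch; the function name `r2_score` and the toy numbers are assumptions and have no connection to the DiMag data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy regression targets and predictions (illustrative values only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score = r2_score(y_true, y_pred)
```

A score of 1 means perfect predictions, while 0 means the model does no better than predicting the mean of the targets.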
Affiliation(s)
- Pavel Kim
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Nikita Serov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Aleksandra Falchevskaya
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Ilia Shabalkin
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Andrei Dmitrenko
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Daniil Kladko
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| | - Vladimir Vinogradov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, St. Petersburg, 191002, Russian Federation
| |
|
38
|
Spence C, Shah OA, Cebula A, Tucker K, Sochart D, Kader D, Asopa V. Machine learning models to predict surgical case duration compared to current industry standards: scoping review. BJS Open 2023; 7:zrad113. [PMID: 37931236 PMCID: PMC10630142 DOI: 10.1093/bjsopen/zrad113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2023] [Revised: 09/21/2023] [Accepted: 09/21/2023] [Indexed: 11/08/2023] Open
Abstract
BACKGROUND Surgical waiting lists have risen dramatically across the UK as a result of the COVID-19 pandemic. The effective use of operating theatres by optimal scheduling could help mitigate this, but this requires accurate case duration predictions. Current standards for predicting the duration of surgery are inaccurate. Artificial intelligence (AI) offers the potential for greater accuracy in predicting surgical case duration. This study aimed to investigate whether there is evidence to support that AI is more accurate than current industry standards at predicting surgical case duration, with a secondary aim of analysing whether the implementation of the models used produced efficiency savings. METHOD PubMed, Embase, and MEDLINE libraries were searched through to July 2023 to identify appropriate articles. PRISMA extension for scoping reviews and the Arksey and O'Malley framework were followed. Study quality was assessed using a modified version of the reporting guidelines for surgical AI papers by Farrow et al. Algorithm performance was reported using evaluation metrics. RESULTS The search identified 2593 articles: 14 were suitable for inclusion and 13 reported on the accuracy of AI algorithms against industry standards, with seven demonstrating a statistically significant improvement in prediction accuracy (P < 0.05). The larger studies demonstrated the superiority of neural networks over other machine learning techniques. Efficiency savings were identified in an RCT. Significant methodological limitations were identified across most studies. CONCLUSION The studies suggest that machine learning and deep learning models are more accurate at predicting the duration of surgery; however, further research is required to determine the best way to implement this technology.
Affiliation(s)
- Christopher Spence
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - Owais A Shah
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - Anna Cebula
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - Keith Tucker
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - David Sochart
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - Deiary Kader
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| | - Vipin Asopa
- Academic Surgical Unit, South West London Elective Orthopaedic Centre, Epsom, Surrey, UK
| |
|
39
|
Zieliński K, Drabczyk D, Kunicki M, Drzyzga D, Kloska A, Rumiński J. Evaluating the risk of endometriosis based on patients' self-assessment questionnaires. Reprod Biol Endocrinol 2023; 21:102. [PMID: 37898817 PMCID: PMC10612251 DOI: 10.1186/s12958-023-01156-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 10/30/2023] Open
Abstract
BACKGROUND Endometriosis is a condition that significantly affects the quality of life of about 10% of reproductive-aged women. It is characterized by the presence of tissue similar to the uterine lining (endometrium) outside the uterus, which can lead to scarring, adhesions, pain, and fertility issues. While numerous factors associated with endometriosis are documented, a wide range of symptoms may still be undiscovered. METHODS In this study, we employed machine learning algorithms to predict endometriosis based on the patient symptoms extracted from 13,933 questionnaires. We compared the results of feature selection obtained from various algorithms (i.e., Boruta algorithm, Recursive Feature Selection) with experts' decisions. As a benchmark model architecture, we utilized a LightGBM algorithm, along with Multivariate Imputation by Chained Equations (MICE) and k-nearest neighbors (KNN), for missing data imputation. Our primary objective was to assess the model's performance and feature importance compared to existing studies. RESULTS We identified the top 20 predictors of endometriosis, uncovering previously overlooked features such as Cesarean section, ovarian cysts, and hernia. Notably, the model's performance metrics were maximized when utilizing a combination of multiple feature selection methods. Specifically, the final model achieved an area under the receiver operating characteristic curve (AUC) of 0.85 on the training dataset and an AUC of 0.82 on the testing dataset. CONCLUSIONS The application of machine learning in diagnosing endometriosis has the potential to significantly impact clinical practice, streamlining the diagnostic process and enhancing efficiency. Our questionnaire-based prediction approach empowers individuals with endometriosis to proactively identify potential symptoms, facilitating informed discussions with healthcare professionals about diagnosis and treatment options.
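The AUC figures reported here can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC directly from that pairwise definition; the function name `roc_auc` and the toy labels and scores are assumptions for illustration and are unrelated to the questionnaire data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of samples, so suited to small examples only.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two negatives, two positives, one pair ranked wrongly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Comparing this value on training and held-out data, as the abstract does, gives a quick indication of how much a model overfits.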
Affiliation(s)
- Krystian Zieliński
- INVICTA, Research and Development Center, Sopot, Poland.
- Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland.
| | - Anna Kloska
- INVICTA, Research and Development Center, Sopot, Poland.
- Department of Medical Biology and Genetics, Faculty of Biology, University of Gdańsk, Gdańsk, Poland.
| | - Jacek Rumiński
- Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
| |
|
40
|
Gan Q, Gong L, Hu D, Jiang Y, Ding X. A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset. SENSORS (BASEL, SWITZERLAND) 2023; 23:8678. [PMID: 37960379 PMCID: PMC10650138 DOI: 10.3390/s23218678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 10/07/2023] [Accepted: 10/18/2023] [Indexed: 11/15/2023]
Abstract
Batch process monitoring datasets usually contain missing data, which decreases the performance of data-driven modeling for fault identification and optimal control. Many methods have been proposed to impute missing data; however, they do not fulfill the need for data quality, especially in sensor datasets with different types of missing data. We propose a hybrid missing data imputation method for batch process monitoring datasets with multi-type missing data. In this method, the missing data is first classified into five categories based on the continuous missing duration and the number of variables missing simultaneously. Then, different categories of missing data are step-by-step imputed considering their unique characteristics. A combination of three single-dimensional interpolation models is employed to impute transient isolated missing values. An iterative imputation based on a multivariate regression model is designed for imputing long-term missing variables, and a combination model based on single-dimensional interpolation and multivariate regression is proposed for imputing short-term missing variables. The Long Short-Term Memory (LSTM) model is utilized to impute both short-term and long-term missing samples. Finally, a series of experiments for different categories of missing data were conducted based on a real-world batch process monitoring dataset. The results demonstrate that the proposed method achieves higher imputation accuracy than other comparative methods.
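The first two steps described above, grouping missing values by how long they persist and interpolating the short gaps, can be sketched for a single sensor channel as follows. This is a simplified stand-in and not the authors' hybrid method: it uses plain linear interpolation rather than their combination of three interpolation models, and the names `find_gaps`, `fill_short_gaps` and the `max_len` threshold are assumptions.

```python
def find_gaps(series):
    """Return (start_index, length) for each run of consecutive None values."""
    gaps, i = [], 0
    while i < len(series):
        if series[i] is None:
            start = i
            while i < len(series) and series[i] is None:
                i += 1
            gaps.append((start, i - start))
        else:
            i += 1
    return gaps


def fill_short_gaps(series, max_len=1):
    """Linearly interpolate interior gaps of at most max_len points.

    Longer gaps and gaps touching either end of the series are left as None,
    on the assumption that they need a different (e.g. model-based) imputer.
    """
    filled = list(series)
    for start, length in find_gaps(series):
        end = start + length  # index of the first observed value after the gap
        if length > max_len or start == 0 or end >= len(series):
            continue
        left, right = series[start - 1], series[end]
        step = (right - left) / (length + 1)
        for k in range(length):
            filled[start + k] = left + step * (k + 1)
    return filled


# One isolated missing value and one longer outage in a toy sensor trace.
signal = [1.0, None, 3.0, None, None, None, 7.0]
gaps = find_gaps(signal)
short_filled = fill_short_gaps(signal, max_len=1)
```

Separating gaps by duration first makes it straightforward to route each category to a different imputation model, which is the general strategy the abstract describes.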
Affiliation(s)
- Qihong Gan
- Informatization Construction and Management Office, Sichuan University, Chengdu 610065, China
- Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
| | - Lang Gong
- Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Dasha Hu
- Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Yuming Jiang
- Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Xuefeng Ding
- Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- College of Computer Science, Sichuan University, Chengdu 610065, China
| |
|
41
|
Shadbahr T, Roberts M, Stanczuk J, Gilbey J, Teare P, Dittmer S, Thorpe M, Torné RV, Sala E, Lió P, Patel M, Preller J, Rudd JHF, Mirtti T, Rannikko AS, Aston JAD, Tang J, Schönlieb CB. The impact of imputation quality on machine learning classifiers for datasets with missing values. COMMUNICATIONS MEDICINE 2023; 3:139. [PMID: 37803172 PMCID: PMC10558448 DOI: 10.1038/s43856-023-00356-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/18/2022] [Accepted: 09/13/2023] [Indexed: 10/08/2023] Open
Abstract
BACKGROUND Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance. METHODS We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. RESULTS The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. CONCLUSIONS It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.
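For readers unfamiliar with the sliced Wasserstein distance mentioned in the methods, the sketch below gives a basic Monte Carlo estimate between two equally sized samples: project both onto random directions and average the one-dimensional Wasserstein distances. It shows the standard textbook construction only, not the specific discrepancy scores introduced in the paper; the function names, the number of projections and the toy data are assumptions.

```python
import math
import random


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1D samples.

    For equal sample sizes this is the mean absolute difference of sorted values.
    """
    a, b = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Average 1D Wasserstein distance over random unit directions.

    X and Y are lists of points (lists of floats) with the same length
    and the same dimension; unequal sample sizes are not handled here.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        # A normalised Gaussian vector is uniformly distributed on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        px = [sum(xi * di for xi, di in zip(x, d)) for x in X]
        py = [sum(yi * di for yi, di in zip(y, d)) for y in Y]
        total += wasserstein_1d(px, py)
    return total / n_projections


# Toy check: compare a small 2D sample with itself and with a shifted copy.
original = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]]
shifted = [[x + 1.0, y] for x, y in original]
same = sliced_wasserstein(original, original)
moved = sliced_wasserstein(original, shifted)
```

In an imputation setting, one sample would be the fully observed data and the other the imputed data, so a larger value indicates a poorer match between the two distributions.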
Affiliation(s)
- Tolou Shadbahr
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Michael Roberts
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK.
- Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK.
| | - Jan Stanczuk
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
| | - Julian Gilbey
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
| | - Philip Teare
- Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
| | - Sören Dittmer
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- ZeTeM, University of Bremen, Bremen, Germany
| | - Matthew Thorpe
- Department of Mathematics, University of Manchester, Manchester, UK
| | - Ramon Viñas Torné
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Evis Sala
- Department of Radiology, University of Cambridge, Cambridge, UK
| | - Pietro Lió
- Department of Mathematics, University of Manchester, Manchester, UK
| | - Mishal Patel
- Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
- Clinical Pharmacology & Safety Sciences, AstraZeneca, Cambridge, UK
| | - Jacobus Preller
- Addenbrooke's Hospital, Cambridge University Hospitals NHS Trust, Cambridge, UK
| | - James H F Rudd
- Department of Medicine, University of Cambridge, Cambridge, UK
| | - Tuomas Mirtti
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Department of Pathology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland
| | - Antti Sakari Rannikko
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland
- Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
| | - John A D Aston
- Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
| | - Jing Tang
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Carola-Bibiane Schönlieb
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
| |
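The imputation-quality abstract above mentions discrepancy scores based on the sliced Wasserstein distance without giving a formula. As a rough illustration only, the sketch below estimates a sliced Wasserstein-1 distance between an original sample and an imputed sample by projecting both onto random directions and averaging the one-dimensional distances. The function names, the use of Wasserstein-1, the equal-sample-size restriction, and the number of projections are all assumptions made for this sketch; it is not the authors' implementation.

```python
import math
import random


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    return sum(abs(u - v) for u, v in zip(a_sorted, b_sorted)) / len(a_sorted)


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere (normalised Gaussians)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(x, y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance.

    x, y: equal-length lists of points, each point a list of floats.
    n_projections and seed are illustrative defaults (assumptions).
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        direction = random_unit_vector(dim, rng)
        # Project every point onto the random direction.
        px = [sum(c * w for c, w in zip(point, direction)) for point in x]
        py = [sum(c * w for c, w in zip(point, direction)) for point in y]
        total += wasserstein_1d(px, py)
    return total / n_projections
```

A score of this kind compares whole distributions rather than individual imputed values, which is the property the abstract contrasts with commonly used per-value measures.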
|
42
|
Tumusiime AG, Eyobu OS, Mugume I, Oyana TJ. A weather features dataset for prediction of short-term rainfall quantities in Uganda. Data Brief 2023; 50:109613. [PMID: 37808539 PMCID: PMC10551829 DOI: 10.1016/j.dib.2023.109613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 09/11/2023] [Accepted: 09/19/2023] [Indexed: 10/10/2023] Open
Abstract
Weather data is of great importance to the development of weather prediction models. However, the availability and quality of this data remain a significant challenge for most researchers around the world. In Uganda, obtaining observational weather data is very challenging due to the sparse distribution of weather stations and inconsistent data records. This has created critical gaps in the data needed to develop and run efficient weather prediction models. To bridge this gap, we obtained country-specific hourly observational weather time series. The data cover 2020 to 2022 for 11 weather stations distributed across the four regions of Uganda. The data was accessed from the Ogimet data repository using the "climate" R package. Automated procedures in the R programming environment allowed us to download user-defined data at time resolutions ranging from hourly to annual. However, the raw data cannot be used directly to learn rainfall patterns because it includes duplicates and non-uniform records. Therefore, this article presents a prepared and cleaned dataset that can be used for the prediction of short-term rainfall quantities in Uganda.
Affiliation(s)
| | - Odongo Steven Eyobu
- College of Computing and IS, Makerere University, P.O. Box 7062, Kampala, Uganda
| | - Isaac Mugume
- College of Agricultural and Environmental Sciences, Makerere University, P.O. Box 7062, Kampala, Uganda
| | - Tonny J. Oyana
- College of Computing and IS, Makerere University, P.O. Box 7062, Kampala, Uganda
| |
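The weather dataset abstract above says the raw records contain duplicates and non-uniform data but does not describe the cleaning steps. The sketch below shows one plausible way such cleaning could look: duplicate readings for the same station and hour are dropped, and each station is placed on a uniform hourly grid with gaps marked as None. The record layout (station, timestamp, rainfall in mm), the station label, and the choice to snap timestamps to the hour are hypothetical and chosen only for illustration; they do not describe the published dataset or the authors' R workflow.

```python
from datetime import datetime, timedelta


def clean_hourly(records):
    """Deduplicate readings and place each station on a uniform hourly grid.

    records: iterable of (station, timestamp, rainfall_mm) tuples
             (a hypothetical layout, assumed for this sketch).
    Returns a list of (station, hour, rainfall_mm or None), sorted by
    station and time. The first reading seen for a station-hour is kept.
    """
    first_seen = {}
    for station, timestamp, rainfall_mm in records:
        # Snap irregular timestamps to the start of their hour (assumption).
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        first_seen.setdefault((station, hour), rainfall_mm)

    cleaned = []
    for station in sorted({s for s, _ in first_seen}):
        hours = sorted(h for s, h in first_seen if s == station)
        current = hours[0]
        while current <= hours[-1]:
            # Hours with no reading are kept as explicit gaps (None).
            cleaned.append((station, current, first_seen.get((station, current))))
            current += timedelta(hours=1)
    return cleaned


raw = [
    ("STATION_A", datetime(2020, 1, 1, 0, 0), 0.0),
    ("STATION_A", datetime(2020, 1, 1, 0, 0), 0.0),   # exact duplicate
    ("STATION_A", datetime(2020, 1, 1, 2, 30), 1.2),  # off-grid timestamp
]
for row in clean_hourly(raw):
    print(row)
```

Keeping gaps explicit rather than silently dropping them leaves the choice of imputation method to the downstream model builder.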
|
43
|
Altuhaifa FA, Win KT, Su G. Predicting lung cancer survival based on clinical data using machine learning: A review. Comput Biol Med 2023; 165:107338. [PMID: 37625260 DOI: 10.1016/j.compbiomed.2023.107338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 07/31/2023] [Accepted: 08/07/2023] [Indexed: 08/27/2023]
Abstract
Machine learning has gained popularity in predicting survival time in the medical field. This review examines studies utilizing machine learning and data-mining techniques to predict lung cancer survival using clinical data. A systematic literature review searched MEDLINE, Scopus, and Google Scholar databases, following reporting guidelines and using the COVIDENCE system. Studies published from 2000 to 2023 employing machine learning for lung cancer survival prediction were included. Risk of bias was assessed with the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Thirty studies were reviewed, with 13 (43.3%) using the Surveillance, Epidemiology, and End Results (SEER) database. Missing data handling was addressed in 12 (40%) studies, primarily through data transformation and conversion. Feature selection algorithms were used in 19 (63.3%) studies, with age, sex, and N stage being the most frequently chosen features. Random forest was the predominant machine learning model, used in 17 (56.7%) studies. While the number of lung cancer survival prediction studies is limited, the use of machine learning models based on clinical data has grown since 2012. Consideration of diverse patient cohorts and data pre-processing are crucial. Notably, most studies did not account for missing data, normalization, scaling, or standardization, potentially introducing bias. Therefore, a comprehensive study on lung cancer survival prediction using clinical data is needed, addressing these challenges.
Affiliation(s)
- Fatimah Abdulazim Altuhaifa
- School of Computing and Information Technology, University of Wollongong, NSW, 2500, Australia; Saudi Arabia Ministry of Higher Education, Riyadh, Saudi Arabia.
| | - Khin Than Win
- School of Computing and Information Technology, University of Wollongong, NSW, 2500, Australia
| | - Guoxin Su
- School of Computing and Information Technology, University of Wollongong, NSW, 2500, Australia
| |
|
44
|
Liu R, Wang Z, Qiu J, Wang X. Assigning channel weights using an attention mechanism: an EEG interpolation algorithm. Front Neurosci 2023; 17:1251677. [PMID: 37811329 PMCID: PMC10552919 DOI: 10.3389/fnins.2023.1251677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 09/06/2023] [Indexed: 10/10/2023] Open
Abstract
During the acquisition of electroencephalographic (EEG) signals, various factors can influence the data and lead to the presence of one or multiple bad channels. Bad channel interpolation uses data from good channels to reconstruct bad channels, thereby maintaining the original dimensions of the data for subsequent analysis tasks. The mainstream interpolation algorithm assigns weights to channels based on the physical distance between electrodes and does not take into account the effect of physiological factors on the EEG signal. The algorithm proposed in this study utilizes an attention mechanism to allocate channel weights (AMACW). The model learns the correlations among channels from good channel data. Interpolation assigns weights based on the learned correlations without the need for electrode location information, which overcomes the inability of traditional methods to interpolate bad channels at unknown locations. To avoid an overly concentrated weight distribution when the model generates data, we designed a channel masking (CM) method. This method spreads attention and allows the model to utilize data from multiple channels. We evaluate the reconstruction performance of the model using EEG data with 1 to 5 bad channels. With EEGLAB's interpolation method as a performance reference, tests show that the AMACW models can effectively reconstruct bad channels.
Affiliation(s)
| | - Zaijun Wang
- Key Laboratory of Flight Techniques and Flight Safety Research Base, Civil Aviation Flight University of China, Guanghan, China
| | | | | |
|
45
|
Kostekci YE, Bakırarar B, Okulu E, Erdeve O, Atasay B, Arsan S. An Early Prediction Model for Estimating Bronchopulmonary Dysplasia in Preterm Infants. Neonatology 2023; 120:709-717. [PMID: 37725910 DOI: 10.1159/000533299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 07/22/2023] [Indexed: 09/21/2023]
Abstract
INTRODUCTION Accurate assessment of the risk for bronchopulmonary dysplasia (BPD) is critical to determine the prognosis and identify infants who will benefit from preventive therapies. Clinical prediction models can support the identification of high-risk patients. In this study, we investigated the potential risk factors for BPD and compared machine learning models for predicting the outcome of BPD/death on days 1, 7, 14, and 28 in preterm infants. We also developed a local BPD estimator. METHODS This study involved 124 infants. We evaluated the composite outcome of BPD/death at a postmenstrual age of 36 weeks and identified risk factors that would improve BPD/death prediction. SPSS for Windows Version 11.5 and Weka 3.9 software were used for the data analysis. RESULTS To evaluate the combined effect of all variables, all risk factors were taken into consideration. Gestational age, birth weight, mode of respiratory support, intraventricular hemorrhage, necrotizing enterocolitis, surfactant requirement, and late-onset sepsis were risk factors on postnatal days 7, 14, and 28. In a comparison of four different time points (postnatal days 1, 7, 14, and 28), the day 7 model provided the best prediction. According to this model, when a patient was diagnosed with BPD/death, the accuracy rate was 89.5%. CONCLUSION The postnatal day 7 model was the best predictor of BPD or death. Future validation studies will help identify infants who may benefit from preventive therapies and develop individualized care.
Affiliation(s)
- Yasemin Ezgi Kostekci
- Division of Neonatology, Department of Pediatrics, Ankara University Faculty of Medicine, Ankara, Turkey
| | - Batuhan Bakırarar
- Department of Biostatistics, Ankara University Faculty of Medicine, Ankara, Turkey
| | - Emel Okulu
- Division of Neonatology, Department of Pediatrics, Ankara University Faculty of Medicine, Ankara, Turkey
| | - Omer Erdeve
- Division of Neonatology, Department of Pediatrics, Ankara University Faculty of Medicine, Ankara, Turkey
| | - Begum Atasay
- Division of Neonatology, Department of Pediatrics, Ankara University Faculty of Medicine, Ankara, Turkey
| | - Saadet Arsan
- Division of Neonatology, Department of Pediatrics, Ankara University Faculty of Medicine, Ankara, Turkey
| |
|
46
|
Abnoosian K, Farnoosh R, Behzadi MH. Prediction of diabetes disease using an ensemble of machine learning multi-classifier models. BMC Bioinformatics 2023; 24:337. [PMID: 37697283 PMCID: PMC10496262 DOI: 10.1186/s12859-023-05465-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 09/04/2023] [Indexed: 09/13/2023] Open
Abstract
BACKGROUND AND OBJECTIVE Diabetes is a life-threatening chronic disease with a growing global prevalence, necessitating early diagnosis and treatment to prevent severe complications. Machine learning has emerged as a promising approach for diabetes diagnosis, but challenges such as limited labeled data, frequent missing values, and dataset imbalance hinder the development of accurate prediction models. Therefore, a novel framework is required to address these challenges and improve performance. METHODS In this study, we propose an innovative pipeline-based multi-classification framework to predict diabetes in three classes: diabetic, non-diabetic, and prediabetes, using the imbalanced Iraqi Patient Dataset of Diabetes. Our framework incorporates various pre-processing techniques, including duplicate sample removal, attribute conversion, missing value imputation, data normalization and standardization, feature selection, and k-fold cross-validation. Furthermore, we implement multiple machine learning models, such as k-NN, SVM, DT, RF, AdaBoost, and GNB, and introduce a weighted ensemble approach based on the Area Under the Receiver Operating Characteristic Curve (AUC) to address dataset imbalance. Performance optimization is achieved through grid search and Bayesian optimization for hyper-parameter tuning. RESULTS Our proposed model outperforms other machine learning models, including k-NN, SVM, DT, RF, AdaBoost, and GNB, in predicting diabetes. The model achieves high average accuracy, precision, recall, F1-score, and AUC values of 0.9887, 0.9861, 0.9792, 0.9851, and 0.999, respectively. CONCLUSION Our pipeline-based multi-classification framework demonstrates promising results in accurately predicting diabetes using an imbalanced dataset of Iraqi diabetic patients. The proposed framework addresses the challenges associated with limited labeled data, missing values, and dataset imbalance, leading to improved prediction performance. 
This study highlights the potential of machine learning techniques in diabetes diagnosis and management, and the proposed framework can serve as a valuable tool for accurate prediction and improved patient care. Further research can build upon our work to refine and optimize the framework and explore its applicability in diverse datasets and populations.
Affiliation(s)
- Karlo Abnoosian
- Department of Statistics, Science and Research Branch, Islamic Azad University, Tehran, Iran
| | - Rahman Farnoosh
- School of Mathematics, Iran University of Science and Technology, Tehran, Iran.
| | - Mohammad Hassan Behzadi
- Department of Statistics, Science and Research Branch, Islamic Azad University, Tehran, Iran
| |
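The diabetes abstract above describes a weighted ensemble based on AUC but does not state how the AUC values are turned into weights. One simple possibility, shown here purely as an assumption, is to normalise each model's validation AUC so the weights sum to one and then average the models' class probabilities with those weights. The model names, AUC values, and probabilities in the usage lines are made-up illustration values, not results from the paper.

```python
def auc_weights(aucs):
    """Turn per-model AUC scores into weights that sum to one.

    aucs: dict mapping model name -> validation AUC.
    Proportional normalisation is an assumption of this sketch.
    """
    total = sum(aucs.values())
    return {name: auc / total for name, auc in aucs.items()}


def weighted_vote(probas, weights):
    """Combine per-model class probabilities for one sample.

    probas: dict mapping model name -> list of class probabilities.
    Returns the weight-averaged probability for each class.
    """
    n_classes = len(next(iter(probas.values())))
    combined = [0.0] * n_classes
    for name, class_probs in probas.items():
        for k in range(n_classes):
            combined[k] += weights[name] * class_probs[k]
    return combined


def predict(probas, weights):
    """Return the index of the class with the highest combined probability."""
    combined = weighted_vote(probas, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Illustrative values only: three classes (non-diabetic, prediabetes, diabetic).
example_aucs = {"model_a": 0.90, "model_b": 0.60}
example_probas = {
    "model_a": [0.2, 0.7, 0.1],
    "model_b": [0.6, 0.3, 0.1],
}
example_weights = auc_weights(example_aucs)
print(predict(example_probas, example_weights))
```

Because the better-scoring model receives the larger weight, its vote dominates when the models disagree, which is the general idea behind performance-weighted soft voting.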
|
47
|
Drosouli I, Voulodimos A, Mastorocostas P, Miaoulis G, Ghazanfarpour D. A Spatial-Temporal Graph Convolutional Recurrent Network for Transportation Flow Estimation. SENSORS (BASEL, SWITZERLAND) 2023; 23:7534. [PMID: 37687992 PMCID: PMC10490678 DOI: 10.3390/s23177534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Revised: 08/25/2023] [Accepted: 08/28/2023] [Indexed: 09/10/2023]
Abstract
Accurate estimation of transportation flow is a challenging task in Intelligent Transportation Systems (ITS). Because transportation data has dynamic spatial-temporal dependencies, transportation flow forecasting is a significant issue for operational planning, managing passenger flow, and arranging individual travel in a smart city. The task is challenging due to the composite spatial dependency on transportation networks and the non-linear temporal dynamics, with mobility conditions changing over time. To address these challenges, we propose a Spatial-Temporal Graph Convolutional Recurrent Network (ST-GCRN) that learns from both the spatial station network data and the time series of historical mobility changes in order to estimate transportation flow at a future time. The model is based on Graph Convolutional Networks (GCN) and Long Short-Term Memory (LSTM) in order to further improve the accuracy of transportation flow estimation. Extensive experiments on two real-world datasets of transportation flow, the New York bike-sharing system and the Hangzhou metro system, demonstrate the effectiveness of the proposed model. Compared to the current state-of-the-art baselines, it decreases the estimation error by 98% in the metro system and 63% in the bike-sharing system.
Affiliation(s)
- Ifigenia Drosouli
- Department of Informatics and Computer Engineering, University of West Attica, 12243 Egaleo, Greece
- Department of Informatics, University of Limoges, 87032 Limoges, France
| | - Athanasios Voulodimos
- School of Electrical and Computer Engineering, National Technical University of Athens, 15773 Athens, Greece
| | - Paris Mastorocostas
- Department of Informatics and Computer Engineering, University of West Attica, 12243 Egaleo, Greece
| | - Georgios Miaoulis
- Department of Informatics and Computer Engineering, University of West Attica, 12243 Egaleo, Greece
| | | |
|
48
|
Habenicht R, Fehrmann E, Blohm P, Ebenbichler G, Fischer-Grote L, Kollmitzer J, Mair P, Kienbacher T. Machine Learning Based Linking of Patient Reported Outcome Measures to WHO International Classification of Functioning, Disability, and Health Activity/Participation Categories. J Clin Med 2023; 12:5609. [PMID: 37685676 PMCID: PMC10488436 DOI: 10.3390/jcm12175609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/06/2023] [Accepted: 08/23/2023] [Indexed: 09/10/2023] Open
Abstract
BACKGROUND In the primary and secondary medical health sector, patient reported outcome measures (PROMs) are widely used to assess a patient's disease-related functional health state. However, the World Health Organization (WHO), in its recently adopted resolution on "strengthening rehabilitation in all health systems", encourages all health sectors, not only the rehabilitation sector, to classify a patient's functioning and health state according to the International Classification of Functioning, Disability and Health (ICF). AIM This research sought to optimize machine learning (ML) methods that fully and automatically link information collected from PROMs in persons with unspecific chronic low back pain (cLBP) to limitations in activities and restrictions in participation that are listed in the WHO core set categories for LBP. The study also aimed to identify the minimal set of PROMs necessary for linking without compromising performance. METHODS A total of 806 patients with cLBP completed a comprehensive set of validated PROMs and were interviewed by clinical psychologists who assessed patients' activity limitations and participation restrictions according to the ICF brief core set for low back pain (LBP). The information collected was then utilized to further develop random forest (RF) methods that classified the presence or absence of a problem within each of the activity and participation categories of the ICF core set for LBP. Further analyses identified those PROM items relevant to the linking process and validated the respective linking performance that utilized a minimal subset of items. RESULTS Compared to a recently developed ML linking method, area under the receiver operating characteristic curve (ROC-AUC) values for the novel RF methods showed overall improved performance, with AUC values ranging from 0.73 for the ICF category d850 to 0.81 for the ICF category d540. 
Variable importance measurements revealed that minimal subsets of either 24 or 15 important PROM variables (out of 80 items included in the full set of PROMs) would show similar linking performance. CONCLUSIONS Findings suggest that our optimized ML based methods more accurately predict the presence or absence of limitations and restrictions listed in ICF core categories for cLBP. In addition, this accurate performance would not suffer if the list of PROM items was reduced to a minimum of 15 out of 80 items assessed.
Affiliation(s)
- Richard Habenicht
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
| | - Elisabeth Fehrmann
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
- Department of Psychology, Karl Landsteiner University of Health Sciences, 3500 Krems, Austria
| | - Peter Blohm
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
| | - Gerold Ebenbichler
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
- Department of Physical Medicine, Rehabilitation and Occupational Medicine, Medical University of Vienna, 1090 Vienna, Austria
| | - Linda Fischer-Grote
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
| | - Josef Kollmitzer
- Department of Biomedical Engineering, TGM College for Higher Vocational Education, 1200 Vienna, Austria
| | - Patrick Mair
- Department of Psychology, Harvard University, Cambridge, MA 02138, USA
| | - Thomas Kienbacher
- Karl-Landsteiner-Institute of Outpatient Rehabilitation Research, 1230 Vienna, Austria
| |
|
49
|
Chaumeil M, Guglielmetti C, Qiao K, Tiret B, Ozen M, Krukowski K, Nolan A, Paladini MS, Lopez C, Rosi S. Hyperpolarized 13C metabolic imaging detects long-lasting metabolic alterations following mild repetitive traumatic brain injury. RESEARCH SQUARE 2023:rs.3.rs-3166656. [PMID: 37645937 PMCID: PMC10462249 DOI: 10.21203/rs.3.rs-3166656/v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/31/2023]
Abstract
Career athletes, active military, and head trauma victims are at increased risk for mild repetitive traumatic brain injury (rTBI), a condition that contributes to the development of epilepsy and neurodegenerative diseases. Standard clinical imaging fails to identify rTBI-induced lesions, and novel non-invasive methods are needed. Here, we evaluated whether hyperpolarized 13C magnetic resonance spectroscopic imaging (HP 13C MRSI) could detect long-lasting changes in brain metabolism 3.5 months post-injury in an rTBI mouse model. Our results show that this metabolic imaging approach can detect changes in cortical metabolism at that timepoint, whereas multimodal MR imaging did not detect any structural or contrast alterations. Using machine learning, we further show that HP 13C MRSI parameters can help classify rTBI vs. sham and predict long-term rTBI-induced behavioral outcomes. Altogether, our study demonstrates the potential of metabolic imaging to improve detection, classification and outcome prediction of previously undetected rTBI.
Affiliation(s)
| | | | - Kai Qiao
- University of California, San Francisco
| | | | | | | | | | | | | | | |
|
50
|
Timilsina M, Fey D, Buosi S, Janik A, Costabello L, Carcereny E, Abreu DR, Cobo M, Castro RL, Bernabé R, Minervini P, Torrente M, Provencio M, Nováček V. Synergy between imputed genetic pathway and clinical information for predicting recurrence in early stage non-small cell lung cancer. J Biomed Inform 2023; 144:104424. [PMID: 37352900 DOI: 10.1016/j.jbi.2023.104424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 06/06/2023] [Accepted: 06/11/2023] [Indexed: 06/25/2023]
Abstract
OBJECTIVE Lung cancer exhibits unpredictable recurrence in low-stage tumors and variable responses to different therapeutic interventions. Predicting relapse in early-stage lung cancer can facilitate precision medicine and improve patient survivability. While existing machine learning models rely on clinical data, incorporating genomic information could enhance their efficiency. This study aims to impute and integrate specific types of genomic data with clinical data to improve the accuracy of machine learning models for predicting relapse in early-stage, non-small cell lung cancer patients. METHODS The study utilized a publicly available TCGA lung cancer cohort and imputed genetic pathway scores into the Spanish Lung Cancer Group (SLCG) data, specifically in 1348 early-stage patients. Initially, tumor recurrence was predicted without imputed pathway scores. Subsequently, the SLCG data were augmented with pathway scores imputed from TCGA. The integrative approach aimed to enhance relapse risk prediction performance. RESULTS The integrative approach achieved improved relapse risk prediction with the following evaluation metrics: an area under the precision-recall curve (PR-AUC) of 0.75, an area under the ROC curve (ROC-AUC) of 0.80, an F1 score of 0.61, and a precision of 0.80. The prediction explanation model SHAP (SHapley Additive exPlanations) was employed to explain the machine learning model's predictions. CONCLUSION We conclude that our explainable predictive model is a promising tool for oncologists that addresses an unmet clinical need for post-treatment patient stratification based on relapse risk, while also improving predictive power by incorporating proxy genomic data not available for specific patients.
Collapse
Affiliation(s)
- Mohan Timilsina
- Data Science Institute, Insight Centre for Data Analytics, University of Galway, Ireland.
| | - Dirk Fey
- Systems Biology Ireland, University College Dublin, Ireland.
| | - Samuele Buosi
- Data Science Institute, Insight Centre for Data Analytics, University of Galway, Ireland.
| | | | | | - Enric Carcereny
- Catalan Institute of Oncology, Hospital Universitari Germans Trias i Pujol, B-ARGO, IGTP, Badalona, Spain.
| | | | - Manuel Cobo
- Medical Oncology Intercenter Unit, Regional and Virgen de la Victoria University Hospitals, IBIMA, Málaga, Spain.
| | | | - Reyes Bernabé
- Hospital Universitario Virgen del Rocio, Sevilla, Spain.
| | | | - Maria Torrente
- Medical Oncology Department, Hospital Universitario Puerta de Hierro Majadahonda, Madrid, Spain.
| | - Mariano Provencio
- Medical Oncology Department, Hospital Universitario Puerta de Hierro Majadahonda, Madrid, Spain.
| | - Vít Nováček
- Data Science Institute, Insight Centre for Data Analytics, University of Galway, Ireland; Faculty of Informatics, Masaryk University Brno, Czech Republic; Masaryk Memorial Cancer Institute, Brno, Czech Republic.
| |
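The lung cancer recurrence abstract above reports an F1 score of 0.61 and a precision of 0.80 but no recall. Since F1 = 2PR / (P + R), rearranging gives R = F1 * P / (2P - F1), so the reported pair implies a recall of roughly 0.49. This only holds under the assumption that both figures refer to the same class and decision threshold, which the abstract does not confirm; the function name below is hypothetical.

```python
def recall_from_f1(f1, precision):
    """Solve F1 = 2 * P * R / (P + R) for the recall R.

    Rearranged: R = F1 * P / (2 * P - F1).
    Assumes f1 and precision describe the same class and threshold.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# Values taken from the abstract: F1 = 0.61, precision = 0.80.
implied_recall = recall_from_f1(0.61, 0.80)
print(round(implied_recall, 2))
```

A recall near one half alongside a precision of 0.80 would suggest a model that flags relapse conservatively: most flagged patients do relapse, but a substantial share of relapses go unflagged.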
|