1
Oh MY, Kim HS, Jung YM, Lee HC, Lee SB, Lee SM. Machine Learning-Based Explainable Automated Nonlinear Computation Scoring System for Health Score and an Application for Prediction of Perioperative Stroke: Retrospective Study. J Med Internet Res 2025; 27:e58021. [PMID: 40106818 PMCID: PMC11966079 DOI: 10.2196/58021] [Received: 03/03/2024] [Revised: 03/24/2024] [Accepted: 10/30/2024] [Indexed: 03/22/2025] Open
Abstract
BACKGROUND Machine learning (ML) has the potential to enhance performance by capturing nonlinear interactions. However, ML-based models have some limitations in terms of interpretability. OBJECTIVE This study aimed to develop and validate a more comprehensible and efficient ML-based scoring system using SHapley Additive exPlanations (SHAP) values. METHODS We developed and validated the Explainable Automated nonlinear Computation scoring system for Health (EACH) framework score. We developed a CatBoost-based prediction model, identified key features, and automatically detected the top 5 steepest slope change points based on SHAP plots. Subsequently, we developed a scoring system (EACH) and normalized the score. Finally, the EACH score was used to predict perioperative stroke. We developed the EACH score using data from the Seoul National University Hospital cohort and validated it using data from the Boramae Medical Center, which was geographically and temporally different from the development set. RESULTS When applied for perioperative stroke prediction among 38,737 patients undergoing noncardiac surgery, the EACH score achieved an area under the curve (AUC) of 0.829 (95% CI 0.753-0.892). In the external validation, the EACH score demonstrated superior predictive performance with an AUC of 0.784 (95% CI 0.694-0.871) compared with a traditional score (AUC=0.528, 95% CI 0.457-0.619) and another ML-based scoring generator (AUC=0.564, 95% CI 0.516-0.612). CONCLUSIONS The EACH score is a more precise, explainable ML-based risk tool, proven effective in real-world data. The EACH score outperformed the traditional scoring system and other prediction models based on different ML techniques in predicting perioperative stroke.
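The abstract mentions detecting the steepest slope change points on SHAP plots but does not give the procedure. One plausible reading can be sketched in Python as follows. This is an assumption, not the authors' implementation: the function name `steepest_slope_changes`, the piecewise-linear treatment of the curve, and the toy numbers are all hypothetical.

```python
def steepest_slope_changes(xs, ys, k=5):
    """Return the k interior x-values where the piecewise slope changes most.

    xs: strictly increasing feature values (assumed already sorted).
    ys: a summary SHAP value at each x (hypothetical input format).
    """
    # Slope of each segment between consecutive points.
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    # Absolute change in slope at each interior point.
    changes = [(abs(slopes[i] - slopes[i - 1]), xs[i]) for i in range(1, len(slopes))]
    changes.sort(key=lambda c: c[0], reverse=True)
    # Keep the k largest changes and report them in ascending x order.
    return sorted(x for _, x in changes[:k])


# Toy curve: flat, then rising, then flat again (made-up values).
ages = [40, 50, 60, 70, 80]
shap_means = [0.0, 0.0, 0.1, 0.5, 0.5]
cut_points = steepest_slope_changes(ages, shap_means, k=2)
```

Cut points found this way could then serve as bin edges for an integer score, which is one way to turn a nonlinear SHAP curve into a bedside-style scoring table.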
Affiliation(s)
- Mi-Young Oh
- Department of Neurology, Sejong General Hospital, Bucheon-si, Republic of Korea
- Hee-Soo Kim
- Department of Medical Informatics, School of Medicine, Keimyung University, Daegu, Republic of Korea
- Young Mi Jung
- Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Hyung-Chul Lee
- Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Seung-Bo Lee
- Department of Medical Informatics, School of Medicine, Keimyung University, Daegu, Republic of Korea
- Seung Mi Lee
- Department of Obstetrics and Gynecology, College of Medicine, Seoul National University, Seoul, Republic of Korea
- Department of Obstetrics and Gynecology, Seoul National University Hospital, Seoul, Republic of Korea
- Innovative Medical Technology Research Institute, Seoul National University Hospital, Seoul, Republic of Korea
- Institute of Reproductive Medicine and Population & Medical Big Data Research Center, Seoul National University, Seoul, Republic of Korea
2
Roozbeh N, Montazeri F, Farashah MV, Mehrnoush V, Darsareh F. Proposing a machine learning-based model for predicting nonreassuring fetal heart. Sci Rep 2025; 15:7812. [PMID: 40050357 PMCID: PMC11885418 DOI: 10.1038/s41598-025-92810-2] [Received: 01/30/2025] [Accepted: 03/03/2025] [Indexed: 03/09/2025] Open
Abstract
The capacity to forecast nonreassuring fetal heart (NFH) is essential for minimizing perinatal complications; therefore, this research aims to establish whether a machine learning (ML) model can predict NFH. This was a retrospective analysis of information gathered from singleton cases over the gestational age of 28 weeks that sought vaginal delivery between January 2020 and January 2022. The information was acquired from the "Iranian Maternal and Neonatal Network." A predictive model was built using four statistical ML models (decision tree classification, random forest classification, extreme gradient boost classification, and permutation feature classification with k-nearest neighbors). Because of the limited studies on the identification of NFH predictors, we used the chi-square test to compare demographic, obstetric, maternal, and neonatal factors to identify NFH predictors. Then, all variables with p-values less than 0.05 were considered potential NFH predictors. The area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, and F1 score were measured to evaluate diagnostic performance. The incidence of NFH in our study population was 9.2%. Based on our findings, NFH was more common in cases of intrauterine growth restriction, late-term, post-term, and preterm births, preeclampsia, placental abruption, primiparity, induced labor, and male fetus, and less common in births with doula support. Random forest classification (AUROC: 0.77), decision tree classification and extreme gradient boost classification (AUROC: 0.76), and permutation feature classification with k-nearest neighbors (AUROC: 0.77) all showed good performance in predicting NFH. The highest performance belonged to random forest classification, with an accuracy of 0.77 and precision of 0.72. Although this study found that the classification tree models performed well in predicting NFH, more research is needed to make a better conclusion on the performance of ML models in predicting NFH.
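The accuracy, precision, recall, and F1 score reported above all derive from a binary confusion matrix. A minimal Python sketch of those definitions follows; the helper name `binary_metrics` and the labels are made up for illustration and do not reproduce the study's data or code.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only (1 = NFH, 0 = reassuring).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With an outcome incidence near 9%, accuracy alone can look high for a model that rarely predicts the positive class, which is why precision and recall are reported alongside it.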
Affiliation(s)
- Nasibeh Roozbeh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Farideh Montazeri
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Vahid Mehrnoush
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Fatemeh Darsareh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
3
Nejadshamsi S, Karami V, Ghourchian N, Armanfard N, Bergman H, Grad R, Wilchesky M, Khanassov V, Vedel I, Abbasgholizadeh Rahimi S. Development and Feasibility Study of HOPE Model for Prediction of Depression Among Older Adults Using Wi-Fi-based Motion Sensor Data: Machine Learning Study. JMIR Aging 2025; 8:e67715. [PMID: 40053734 PMCID: PMC11914842 DOI: 10.2196/67715] [Received: 10/18/2024] [Revised: 12/12/2024] [Accepted: 12/19/2024] [Indexed: 03/09/2025] Open
Abstract
BACKGROUND Depression, characterized by persistent sadness and loss of interest in daily activities, greatly reduces quality of life. Early detection is vital for effective treatment and intervention. While many studies use wearable devices to classify depression based on physical activity, these often rely on intrusive methods. Additionally, most depression classification studies involve large participant groups and use single-stage classifiers without explainability. OBJECTIVE This study aims to assess the feasibility of classifying depression from nonintrusive Wi-Fi-based motion sensor data using a novel machine learning model on a limited number of participants. We also conduct an explainability analysis to interpret the model's predictions and identify key features associated with depression classification. METHODS In this study, we recruited adults aged 65 years and older through web-based and in-person methods, supported by a McGill University health care facility directory. Participants provided consent, and we collected 6 months of activity and sleep data via nonintrusive Wi-Fi-based sensors, along with Edmonton Frailty Scale and Geriatric Depression Scale data. For depression classification, we proposed a HOPE (Home-Based Older Adults' Depression Prediction) machine learning model with feature selection, dimensionality reduction, and classification stages, evaluating various model combinations using accuracy, sensitivity, precision, and F1-score. SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME) were used to explain the model's predictions. RESULTS A total of 6 participants were enrolled in this study; however, 2 participants withdrew later due to internet connectivity issues. Among the 4 remaining participants, 3 participants were classified as not having depression, while 1 participant was identified as having depression. The most accurate classification model, which combined sequential forward selection for feature selection, principal component analysis for dimensionality reduction, and a decision tree for classification, achieved an accuracy of 87.5%, sensitivity of 90%, and precision of 88.3%, effectively distinguishing individuals with and those without depression. The explainability analysis revealed that the most influential features in depression classification, in order of importance, were "average sleep duration," "total number of sleep interruptions," "percentage of nights with sleep interruptions," "average duration of sleep interruptions," and "Edmonton Frailty Scale." CONCLUSIONS The findings from this preliminary study demonstrate the feasibility of using Wi-Fi-based motion sensors for depression classification and highlight the effectiveness of our proposed HOPE machine learning model, even with a small sample size. These results suggest the potential for further research with a larger cohort for more comprehensive validation. Additionally, the nonintrusive data collection method and model architecture proposed in this study offer promising applications in remote health monitoring, particularly for older adults who may face challenges in using wearable devices. Furthermore, the importance of sleep patterns identified in our explainability analysis aligns with findings from previous research, emphasizing the need for more in-depth studies on the role of sleep in mental health, as suggested in the explainable machine learning study.
Affiliation(s)
- Shayan Nejadshamsi
- Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Vania Karami
- Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Narges Armanfard
- Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
- Department of Electrical and Computer Engineering, Faculty of Engineering, McGill University, Montreal, QC, Canada
- Howard Bergman
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Roland Grad
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Machelle Wilchesky
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Donald Berman Maimonides Centre for Research in Aging, Montreal, QC, Canada
- Vladimir Khanassov
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Isabelle Vedel
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Samira Abbasgholizadeh Rahimi
- Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
- Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, Canada
4
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. Nat Commun 2025; 16:1878. [PMID: 39987196 PMCID: PMC11846867 DOI: 10.1038/s41467-025-57157-2] [Received: 07/29/2024] [Accepted: 02/10/2025] [Indexed: 02/24/2025] Open
Abstract
Single-cell RNA sequencing maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena, and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, ON, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, ON, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, ON, Canada
5
Safarzadeh S, Ardabili NS, Farashah MV, Roozbeh N, Darsareh F. Predicting mother and newborn skin-to-skin contact using a machine learning approach. BMC Pregnancy Childbirth 2025; 25:182. [PMID: 39966775 PMCID: PMC11837404 DOI: 10.1186/s12884-025-07313-9] [Received: 11/04/2023] [Accepted: 02/10/2025] [Indexed: 02/20/2025] Open
Abstract
BACKGROUND Despite the known benefits of skin-to-skin contact (SSC), limited data exist on its implementation, especially its influencing factors. The current study was designed to use machine learning (ML) to identify the predictors of SSC. METHODS This study implemented predictive SSC approaches based on the data obtained from the "Iranian Maternal and Neonatal Network (IMaN Net)" from January 2020 to January 2022. A predictive model was built using nine statistical learning models (linear regression, logistic regression, decision tree classification, random forest classification, deep learning feedforward, extreme gradient boost model, light gradient boost model, support vector machine, and permutation feature classification with k-nearest neighbors). Demographic, obstetric, and maternal and neonatal clinical factors were considered as potential predicting factors and were extracted from the patients' medical records. The area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, and F1 score were measured to evaluate the diagnostic performance. RESULTS Of 8031 eligible mothers, 3759 (46.8%) experienced SSC. The algorithms created by deep learning (AUROC: 0.81, accuracy: 0.75, precision: 0.67, recall: 0.77, and F1 score: 0.73) and linear regression (AUROC: 0.80, accuracy: 0.75, precision: 0.66, recall: 0.75, and F1 score: 0.71) had the highest performance in predicting SSC. Doula support, neonatal weight, gestational age, attending childbirth classes, and maternal age were the critical predictors for SSC based on the top two algorithms with superior performance. CONCLUSIONS Although this study found that the ML model performed well in predicting SSC, more research is needed to make a better conclusion about its performance.
Affiliation(s)
- Sanaz Safarzadeh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Student research committee, Department of midwifery, School of Nursing and Midwifery, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Nasibeh Roozbeh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Fatemeh Darsareh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
6
Bianchi A, Cano Marchal P, Martínez Gila DM, Mencarelli F, Gámez García J. Assessment of fruity aroma intensity in olive oils from different Spanish regions using a portable electronic nose. Journal of the Science of Food and Agriculture 2025; 105:1448-1455. [PMID: 38017697 PMCID: PMC11726602 DOI: 10.1002/jsfa.13179] [Received: 09/15/2023] [Revised: 11/22/2023] [Accepted: 11/29/2023] [Indexed: 11/30/2023]
Abstract
BACKGROUND The organoleptic profile of an olive oil is a fundamental quality parameter obtained by human sensory panels. In this work, a portable electronic nose was employed to predict the fruity aroma intensity of 199 olive oil samples from different Spanish regions and cultivar varieties ('Picual', 'Arbequina', and 'Cornicabra'), with special emphasis on testing the robustness of the predictions against cultivar variability. The primary data given by the electronic nose were used to obtain two different feature vectors that were employed to fit ridge and lasso regression models to two datasets: one consisting of all the samples and another of only the cv. Picual samples. RESULTS The results obtained showed mean absolute error (MAE) values below 0.88 in all cases, with an MAE of 0.67 for the 'Picual' model. These MAE values and the similarities in the model parameters fitted for the different data folds are in agreement with the results obtained in previous studies. CONCLUSION The large number of samples analyzed and the results obtained show the robustness of the approach and the applicability of the methods. Also, the results suggest that better performance can be obtained when specific models are fitted for particular cultivars. Overall, the proposed methods are capable of providing useful information for a fast screening of the fruity aroma intensity of olive oils. © 2023 The Authors. Journal of The Science of Food and Agriculture published by John Wiley & Sons Ltd on behalf of Society of Chemical Industry.
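To make the ridge penalty and the MAE metric concrete, here is a minimal Python sketch on a single feature with no intercept. It is an illustration only: the function names `ridge_slope` and `mean_absolute_error`, the penalty values, and the toy data are assumptions, and the study's electronic-nose feature vectors are not modeled here. MAE is computed as the mean absolute error.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge solution for one feature and no intercept.

    Minimizes sum((y - w*x)^2) + alpha * w^2, giving w = Sxy / (Sxx + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical sensor reading vs. panel score).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_unpenalized = ridge_slope(xs, ys, alpha=0.0)   # ordinary least squares
w_penalized = ridge_slope(xs, ys, alpha=30.0)    # shrunk toward zero
mae_penalized = mean_absolute_error(ys, [w_penalized * x for x in xs])
```

A larger `alpha` shrinks the coefficient and, on the training data, raises the error; the payoff is expected on held-out folds, which is where the abstract's MAE values come from.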
Affiliation(s)
- Alessandro Bianchi
- Department of Agriculture, Food and Environment, University of Pisa, Pisa, Italy
- Pablo Cano Marchal
- University Institute of Research on Olive and Olive Oils (INUO), Electronics and Systems Engineering Department, University of Jaén, Jaén, Spain
- Diego M. Martínez Gila
- University Institute of Research on Olive and Olive Oils (INUO), Electronics and Systems Engineering Department, University of Jaén, Jaén, Spain
- Fabio Mencarelli
- Department of Agriculture, Food and Environment, University of Pisa, Pisa, Italy
- Javier Gámez García
- University Institute of Research on Olive and Olive Oils (INUO), Electronics and Systems Engineering Department, University of Jaén, Jaén, Spain
7
Jiang L, Huang YL, Fan J, Hunt CL, Eldrige JS. Development and Implementation of Automated Referral Triaging System for Spinal Cord Stimulation Procedure in Pain Medicine. J Med Syst 2025; 49:14. [PMID: 39833558 DOI: 10.1007/s10916-025-02148-5] [Received: 10/04/2024] [Accepted: 01/10/2025] [Indexed: 01/22/2025]
Abstract
Effective referral triaging enhances patient service outcomes, experience, and access to care, especially for specialized procedures. This study presents the development and implementation of an automated triaging system to predict which patients would benefit from a Spinal Cord Stimulation (SCS) procedure for their pain management. The proposed triage system aims to improve the triage process by reducing unnecessary appointments before SCS assessment, ensuring appropriate pain management care. It compares various machine learning techniques for the prediction while addressing the class imbalance and overlap challenges inherent in the data. Both data-level and algorithm-level approaches were explored. Two years of patient data were collected, including patient characteristics, diagnosis history, pain symptoms, appointment history, medication history, and concepts from clinical notes extracted using Natural Language Processing. EasyEnsemble with the AdaBoost method, an algorithm-level approach, showed the most promising results. Tenfold validation yielded an average area under the curve of 0.82, a true positive rate (TPR) of 77.3%, and a true negative rate (TNR) of 73.0%. The probability threshold was adjusted to 0.575 to meet the practice expectation of a false positive rate (FPR) of 15% or less. The implementation pipeline for the selected model was designed to be applicable to real clinical settings. The one-year implementation results showed a TPR of 64.7% and a TNR of 87.2%, that is, an FPR of 12.8% (within the 15% target), at the cost of a TPR 12.6 percentage points lower than in validation. The trade-off was acceptable to the practice. The proposed triage system demonstrated promising accuracy, leading to the enhancement of scheduling systems, patient care, and the reduction of unnecessary appointments in a pain medicine setting.
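The threshold adjustment described above trades TPR for a lower FPR. A minimal Python sketch of that trade-off follows, assuming a classifier that outputs a probability per referral; the helper names and the scores are hypothetical, and this is not the study's pipeline. It also uses the identity FPR = 1 − TNR, so a TNR of 87.2% corresponds to an FPR of 12.8%.

```python
def fpr_from_tnr(tnr):
    """False positive rate is the complement of the true negative rate."""
    return 1.0 - tnr


def rates_at_threshold(scores, labels, threshold):
    """TPR and FPR when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


def lowest_threshold_within_fpr(scores, labels, max_fpr):
    """Smallest candidate threshold whose FPR does not exceed max_fpr.

    FPR never increases as the threshold rises, so the first threshold that
    satisfies the cap also keeps the TPR as high as the cap allows.
    Returns None if no candidate meets the cap.
    """
    for t in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(scores, labels, t)
        if fpr <= max_fpr:
            return t, tpr, fpr
    return None


# Made-up predicted probabilities and outcomes (1 = benefited from SCS).
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1]
chosen = lowest_threshold_within_fpr(scores, labels, max_fpr=0.15)
```

On this toy data the cap forces a fairly high threshold and half of the true positives are missed, which mirrors the kind of TPR loss the practice judged acceptable.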
Affiliation(s)
- Lan Jiang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, 55905, USA
- Yu-Li Huang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, 55905, USA
- Jungwei Fan
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, 55905, USA
- Christy L Hunt
- Department of Pain Medicine, Mayo Clinic, Jacksonville, FL, 32224, USA
- Jason S Eldrige
- Department of Pain Medicine, Mayo Clinic, Jacksonville, FL, 32224, USA
8
Du Y, Ahmed KA, Hasan MR, Hossain MZ. Investigating the Impact of Antibiotics on Environmental Microbiota Through Machine Learning Models. IET Syst Biol 2025; 19:e70009. [PMID: 40150863 PMCID: PMC11949845 DOI: 10.1049/syb2.70009] [Received: 11/19/2024] [Revised: 01/21/2025] [Accepted: 02/20/2025] [Indexed: 03/29/2025] Open
Abstract
Antibiotic pollution in the environment can significantly impact soil microorganisms, for example by altering the soil microbial community or promoting the emergence of antibiotic-resistant bacteria. We propose three machine learning (ML) methods to investigate antibiotics' impact on microorganisms and predict microbial abundance. We examined the microbial abundances of various environmental soil samples treated with antibiotics. We developed 3 ML models: (Model 1) for predicting the most abundant bacterial classes in a specific treatment group; (Model 2) for predicting antibiotic treatment effects based on bacterial abundances; and (Model 3) for using data from short-term incubations to predict community structure after stabilisation. In Model 1, the Random Forest model achieved the highest average accuracy, with a mean Coefficient of Variation of 0.05 and 0.14 in the training and test sets, respectively. In Model 2, the Random Forest and SVM models had the highest accuracy (nearly 0.90). Model 3 demonstrates that the Random Forest can use data from short-term incubations to predict the abundance of bacterial communities after long-term stabilisation. This study highlights the potential of ML models as powerful tools for understanding microbial dynamics in response to antibiotic treatments. The code is publicly available at https://github.com/DeweyYihengDu/ML_on_Microbiota.
Affiliation(s)
- Yiheng Du
- Australian National University, Canberra, Australia
- Md Zakir Hossain
- Australian National University, Canberra, Australia
- Curtin University, Bentley, Australia
9
Guimarães P, Keller A, Böhm M, Lauder L, Fehlmann T, Ruilope LM, Vinyoles E, Gorostidi M, Segura J, Ruiz-Hurtado G, Staplin N, Williams B, de la Sierra A, Mahfoud F. Artificial Intelligence-Derived Risk Prediction: A Novel Risk Calculator Using Office and Ambulatory Blood Pressure. Hypertension 2025; 82:46-56. [PMID: 38660828 DOI: 10.1161/hypertensionaha.123.22529] [Received: 12/06/2023] [Accepted: 04/07/2024] [Indexed: 04/26/2024]
Abstract
BACKGROUND Quantification of total cardiovascular risk is essential for individualizing hypertension treatment. This study aimed to develop and validate a novel, machine-learning-derived model to predict cardiovascular mortality risk using office blood pressure (OBP) and ambulatory blood pressure (ABP). METHODS The performance of the novel risk score was compared with existing risk scores, and the possibility of predicting ABP phenotypes utilizing clinical variables was assessed. Using data from 59 124 patients enrolled in the Spanish ABP Monitoring registry, machine-learning approaches (logistic regression, gradient-boosted decision trees, and deep neural networks) and stepwise forward feature selection were used. RESULTS For the prediction of cardiovascular mortality, deep neural networks yielded the highest clinical performance. The novel mortality prediction models using OBP and ABP outperformed other risk scores. The area under the curve achieved by the novel approach, even when using only OBP variables, was significantly higher than the area under the curve of the Framingham risk score, Systemic Coronary Risk Estimation 2, and Atherosclerotic Cardiovascular Disease score. However, the prediction of cardiovascular mortality with ABP instead of OBP data significantly increased the area under the curve (0.870 versus 0.865; P=3.61×10⁻²⁸), accuracy, and specificity. The prediction of ABP phenotypes (ie, white-coat, ambulatory, and masked hypertension) using clinical characteristics was limited. CONCLUSIONS The receiver operating characteristic curves for cardiovascular mortality using ABP and OBP with deep neural network models outperformed all other risk metrics, indicating the potential for improving current risk scores by applying state-of-the-art machine learning approaches. The prediction of cardiovascular mortality using ABP data led to a significant increase in area under the curve and performance metrics.
Affiliation(s)
- Pedro Guimarães
- Chair for Clinical Bioinformatics, Saarland University, Saarbrücken, Germany (P.G., A.K., T.F.)
- University of Coimbra, Coimbra Institute for Biomedical Imaging and Translational Research, Institute for Nuclear Sciences Applied to Health, Portugal (P.G.)
- Andreas Keller
- Chair for Clinical Bioinformatics, Saarland University, Saarbrücken, Germany (P.G., A.K., T.F.)
- Department of Neurology and Neurological Sciences, Stanford University, CA (A.K.)
- Michael Böhm
- Department of Internal Medicine III, Cardiology, Angiology, Intensive Care Medicine, Universitätsklinikum des Saarlandes, Saarland University, Homburg/Saar, Germany (M.B., L.L., F.M.)
- Lucas Lauder
- Department of Internal Medicine III, Cardiology, Angiology, Intensive Care Medicine, Universitätsklinikum des Saarlandes, Saarland University, Homburg/Saar, Germany (M.B., L.L., F.M.)
- Tobias Fehlmann
- Chair for Clinical Bioinformatics, Saarland University, Saarbrücken, Germany (P.G., A.K., T.F.)
- Luis M Ruilope
- Cardiorenal Translational Laboratory and Hypertension Unit, Institute of Research i+12, (L.M.R., J.S., G.R.-H.), Hospital Universitario 12 de Octubre, Madrid, Spain
- Centro de Investigación Biomédica en Red Enfermedades Cardiovasculares (L.M.R., G.R.-H.), Hospital Universitario 12 de Octubre, Madrid, Spain
- Faculty of Sport Sciences, European University of Madrid, Spain (L.M.R.)
- Ernest Vinyoles
- La Mina Primary Care Center, University of Barcelona, Spain (E.V.)
- IDIAP Jordi Gol, Barcelona, Spain (E.V.)
- Manuel Gorostidi
- Department of Nephrology, Hospital Universitario Central de Asturias, REDinREN, Oviedo, Spain (M.G.)
- Julián Segura
- Cardiorenal Translational Laboratory and Hypertension Unit, Institute of Research i+12, (L.M.R., J.S., G.R.-H.), Hospital Universitario 12 de Octubre, Madrid, Spain
- Gema Ruiz-Hurtado
- Cardiorenal Translational Laboratory and Hypertension Unit, Institute of Research i+12, (L.M.R., J.S., G.R.-H.), Hospital Universitario 12 de Octubre, Madrid, Spain
- Centro de Investigación Biomédica en Red Enfermedades Cardiovasculares (L.M.R., G.R.-H.), Hospital Universitario 12 de Octubre, Madrid, Spain
- Natalie Staplin
- Medical Research Council Population Health Research Unit, Clinical Trial Service Unit and Epidemiological Studies Unit, Nuffield Department of Population Health, University of Oxford, United Kingdom (N.S.)
- Bryan Williams
- University College London (UCL), Institute of Cardiovascular Science, National Institute for Health Research, UCL Hospitals Biomedical Research Centre, United Kingdom (B.W.)
- Alejandro de la Sierra
- Department of Internal Medicine, Hospital Universitario Mútua Terrasa, Universidad de Barcelona, Spain (A.d.l.S.)
| | - Felix Mahfoud
- Department of Internal Medicine III, Cardiology, Angiology, Intensive Care Medicine, Universitätsklinikum des Saarlandes, Saarland University, Homburg/Saar, Germany (M.B., L.L., F.M.)
- Harvard-MIT Biomedical Engineering Center, Institute for Medical Engineering and Science, MIT, Cambridge, MA (F.M.)
- Department of Cardiology, University Heart Center, Basel University Hospital, Petersgraben 4, 4031 Basel (F.M.)
| |
|
10
|
Kim S, Lee HC, Sim JE, Park SJ, Oh HH. Bacterial profile-based body fluid identification using a machine learning approach. Genes Genomics 2025; 47:87-98. [PMID: 39503932 DOI: 10.1007/s13258-024-01594-8] [Received: 08/30/2024] [Accepted: 10/23/2024] [Indexed: 01/16/2025]
Abstract
BACKGROUND Identifying the origins of biological traces is critical for the reconstruction of crime scenes in forensic investigations. Traditional methods for body fluid identification rely on chemical, enzymatic, immunological, and spectroscopic techniques, which can be sample-consuming and depend on simple color-change reactions. However, these methods have limitations when residual samples are insufficient after DNA extraction. OBJECTIVE This study aimed to develop a method for body fluid identification by leveraging bacterial DNA profiling to overcome the limitations of the conventional approaches. METHODS Bacterial profiles were determined by sequencing the hypervariable region of the 16S rRNA gene, using DNA metabarcoding of evidence collected from criminal cases. Amplicon sequence variants (ASVs) were analyzed to identify significant microbial patterns in different body fluid samples. RESULTS The bacterial profile-based method demonstrated high discriminatory power with a machine learning model trained using the naïve Bayes algorithm, achieving an accuracy of over 98% in classifying samples into one of four body fluid types: blood, saliva, vaginal secretion, and mixture traces of vaginal secretions and semen. CONCLUSION Bacterial profiling enhances the accuracy and robustness of body fluid identification in forensic analysis, providing a valuable alternative to traditional methods by utilizing DNA and microbial community data despite the uncontrollable conditions. This approach offers significant improvements in the classification accuracy and practical applicability in forensic investigations.
Affiliation(s)
- Sungmin Kim
- Forensic Genetics and Chemistry Division, Supreme Prosecutors' Office, 157 Banpo daero, Seocho gu, Seoul, 06590, Republic of Korea.
| | - Han Chul Lee
- Forensic Genetics and Chemistry Division, Supreme Prosecutors' Office, 157 Banpo daero, Seocho gu, Seoul, 06590, Republic of Korea
| | - Jeong Eun Sim
- Forensic Genetics and Chemistry Division, Supreme Prosecutors' Office, 157 Banpo daero, Seocho gu, Seoul, 06590, Republic of Korea
| | - Su Jeong Park
- Forensic Genetics and Chemistry Division, Supreme Prosecutors' Office, 157 Banpo daero, Seocho gu, Seoul, 06590, Republic of Korea
| | - Hye Hyun Oh
- Forensic Genetics and Chemistry Division, Supreme Prosecutors' Office, 157 Banpo daero, Seocho gu, Seoul, 06590, Republic of Korea
| |
|
11
|
Pahal S, Pahal V, Chaudhary A. From data to discovery: Neuroinformatics in understanding Alzheimer's disease. J Biosci 2024; 50:2. [PMID: 39703103 DOI: 10.1007/s12038-024-00486-z] [Received: 03/07/2024] [Accepted: 08/20/2024] [Indexed: 01/03/2025]
|
12
|
Georgescu AL, Cummins N, Molimpakis E, Giacomazzi E, Rodrigues Marczyk J, Goria S. Screening for Depression and Anxiety Using a Nonverbal Working Memory Task in a Sample of Older Brazilians: Observational Study of Preliminary Artificial Intelligence Model Transferability. JMIR Form Res 2024; 8:e55856. [PMID: 39727020 DOI: 10.2196/55856] [Received: 02/01/2024] [Revised: 10/22/2024] [Accepted: 10/29/2024] [Indexed: 12/28/2024]
Abstract
Background Anxiety and depression represent prevalent yet frequently undetected mental health concerns within the older population. The challenge of identifying these conditions presents an opportunity for artificial intelligence (AI)-driven, remotely available tools capable of screening and monitoring mental health. A critical criterion for such tools is their cultural adaptability to ensure effectiveness across diverse populations. Objective This study aims to illustrate the preliminary transferability of two established AI models designed to detect high depression and anxiety symptom scores. The models were initially trained on data from a nonverbal working memory game (1- and 2-back tasks) in a dataset by thymia, a company that develops AI solutions for mental health and well-being assessments, encompassing over 6000 participants from the United Kingdom, United States, Mexico, Spain, and Indonesia. We seek to validate the models' performance by applying them to a new dataset comprising older Brazilian adults, thereby exploring their transferability and generalizability across different demographics and cultures. Methods A total of 69 Brazilian participants aged 51-92 years were recruited with the help of Laços Saúde, a company specializing in nurse-led, holistic home care. Participants received a link to the thymia dashboard every Monday and Thursday for 6 months. The dashboard had a set of activities assigned to them that would take 10-15 minutes to complete, which included a 5-minute game with two levels of the n-back tasks. Two Random Forest models trained on thymia data to classify depression and anxiety based on thresholds defined by scores of the Patient Health Questionnaire (8 items) (PHQ-8) ≥10 and those of the Generalized Anxiety Disorder Assessment (7 items) (GAD-7) ≥10, respectively, were subsequently tested on the Laços Saúde patient cohort. Results The depression classification model exhibited robust performance, achieving an area under the receiver operating characteristic curve (AUC) of 0.78, a specificity of 0.69, and a sensitivity of 0.72. The anxiety classification model showed an initial AUC of 0.63, with a specificity of 0.58 and a sensitivity of 0.64. This performance surpassed a benchmark model using only age and gender, which had AUCs of 0.47 for PHQ-8 and 0.53 for GAD-7. After recomputing the AUC scores on a cross-sectional subset of the data (the first n-back game session), we found AUCs of 0.79 for PHQ-8 and 0.76 for GAD-7. Conclusions This study successfully demonstrates the preliminary transferability of two AI models trained on a nonverbal working memory task, one for depression and the other for anxiety classification, to a novel sample of older Brazilian adults. Future research could seek to replicate these findings in larger samples and other cultural contexts.
Affiliation(s)
- Alexandra Livia Georgescu
- thymia, International House, 64 Nile Street, London, N1 7SR, United Kingdom, 44 7477285252
- Department of Psychology, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
| | - Nicholas Cummins
- thymia, International House, 64 Nile Street, London, N1 7SR, United Kingdom, 44 7477285252
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
- CAMHS Digital Lab - Department of Child and Adolescent Psychiatry, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
| | - Emilia Molimpakis
- thymia, International House, 64 Nile Street, London, N1 7SR, United Kingdom, 44 7477285252
| | | | | | - Stefano Goria
- thymia, International House, 64 Nile Street, London, N1 7SR, United Kingdom, 44 7477285252
| |
|
13
|
Maurer-Alcalá XX, Kim E. TIdeS: A Comprehensive Framework for Accurate Open Reading Frame Identification and Classification in Eukaryotic Transcriptomes. Genome Biol Evol 2024; 16:evae252. [PMID: 39570867 PMCID: PMC11631190 DOI: 10.1093/gbe/evae252] [Accepted: 11/15/2024] [Indexed: 12/12/2024]
Abstract
Studying fundamental aspects of eukaryotic biology through genetic information can face numerous challenges, including contamination and intricate biotic interactions, which are particularly pronounced when working with uncultured eukaryotes. However, existing tools for predicting open reading frames (ORFs) from transcriptomes are limited in these scenarios. Here we introduce Transcript Identification and Selection (TIdeS), a framework designed to address these nontrivial challenges associated with current 'omics approaches. Using transcriptomes from 32 taxa, representing the breadth of eukaryotic diversity, TIdeS outperforms most conventional ORF-prediction methods (i.e. TransDecoder), identifying a greater proportion of complete and in-frame ORFs. Additionally, TIdeS accurately classifies ORFs using minimal input data, even in the presence of "heavy contamination". This built-in flexibility extends to previously unexplored biological interactions, offering a robust single-stop solution for precise ORF predictions and subsequent decontamination. Beyond applications in phylogenomic-based studies, TIdeS provides a robust means to explore biotic interactions in eukaryotes (e.g. host-symbiont, prey-predator) and for reproducible dataset curation from transcriptomes and genomes.
Affiliation(s)
- Xyrus X Maurer-Alcalá
- Division of Invertebrate Zoology and Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
| | - Eunsoo Kim
- Division of Invertebrate Zoology and Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
- Division of EcoScience, Ewha Womans University, Seoul, South Korea
| |
|
14
|
Gonçalves RS, Payne J, Tan A, Benitez C, Haddock J, Gentleman R. The text2term tool to map free-text descriptions of biomedical terms to ontologies. Database (Oxford) 2024; 2024:baae119. [PMID: 39607847 PMCID: PMC11604108 DOI: 10.1093/database/baae119] [Received: 07/02/2024] [Revised: 10/03/2024] [Accepted: 11/07/2024] [Indexed: 11/30/2024]
Abstract
There is an ongoing need for scalable tools to aid researchers in both retrospective and prospective standardization of discrete entity types-such as disease names, cell types, or chemicals-that are used in metadata associated with biomedical data. When metadata are not well-structured or precise, the associated data are harder to find and are often burdensome to reuse, analyze, or integrate with other datasets due to the upfront curation effort required to make the data usable-typically through retrospective standardization and cleaning of the (meta)data. With the goal of facilitating the task of standardizing metadata-either in bulk or in a one-by-one fashion, e.g. to support autocompletion of biomedical entities in forms-we have developed an open-source tool called text2term that maps free-text descriptions of biomedical entities to controlled terms in ontologies. The tool is highly configurable and can be used in multiple ways that cater to different users and expertise levels-it is available on Python Package Index and can be used programmatically as any Python package; it can also be used via a command-line interface or via our hosted, graphical user interface-based web application or by deploying a local instance of our interactive application using Docker. Database URL: https://pypi.org/project/text2term.
Affiliation(s)
- Rafael S Gonçalves
- Stanford Center for Biomedical Informatics Research, Stanford University, 3180 Porter Dr, Palo Alto, CA 94304, United States
| | - Jason Payne
- Center for Computational Biomedicine, Harvard Medical School, 10 Shattuck St, Boston, MA 02115, United States
| | - Amelia Tan
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA 02115, United States
| | - Carmen Benitez
- Department of Mathematics, Harvey Mudd College, 301 Platt Blvd, Claremont, CA 91711, United States
| | - Jamie Haddock
- Department of Mathematics, Harvey Mudd College, 301 Platt Blvd, Claremont, CA 91711, United States
| | - Robert Gentleman
- Center for Computational Biomedicine, Harvard Medical School, 10 Shattuck St, Boston, MA 02115, United States
| |
|
15
|
Patiyal S, Dhall A, Kumar N, Raghava GPS. HLA-DR4Pred2: An improved method for predicting HLA-DRB1*04:01 binders. Methods 2024; 232:18-28. [PMID: 39433152 DOI: 10.1016/j.ymeth.2024.10.007] [Received: 07/22/2024] [Revised: 09/27/2024] [Accepted: 10/15/2024] [Indexed: 10/23/2024]
Abstract
HLA-DRB1*04:01 is associated with numerous diseases, including sclerosis, arthritis, diabetes, and COVID-19, emphasizing the need to scan for binders in the antigens to develop immunotherapies and vaccines. Current prediction methods are often limited by their reliance on small datasets. This study presents HLA-DR4Pred2, developed on a large dataset containing 12,676 binders and an equal number of non-binders. It is an improved version of HLA-DR4Pred, which was trained on a small dataset containing 576 binders and an equal number of non-binders. All models were trained, optimized, and tested on 80% of the data using five-fold cross-validation and evaluated on the remaining 20%. A range of machine learning techniques was employed, achieving maximum AUROC of 0.90 and 0.87, using composition and binary profile features, respectively. The performance of the composition-based model increased to 0.93 when combined with BLAST search. Additionally, models developed on the realistic dataset, containing 12,676 binders and 86,300 non-binders, achieved a maximum AUROC of 0.99. Our proposed method outperformed existing methods when we compared the performance of our best model to that of existing methods on the independent dataset. Finally, we developed a standalone tool and a webserver for HLA-DR4Pred2, enabling the prediction, design, and virtual scanning of HLA-DRB1*04:01 binding peptides, and we also released a Python package available on the Python Package Index (https://webs.iiitd.edu.in/raghava/hladr4pred2/; https://github.com/raghavagps/hladr4pred2; https://pypi.org/project/hladr4pred2/).
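As an illustration of the two feature types named in this abstract, the sketch below encodes a peptide as an amino acid composition vector and as a binary (one-hot) profile. It assumes the conventional definitions of these features; the abstract does not spell them out, and the function names `aa_composition` and `binary_profile` are hypothetical, not part of the hladr4pred2 package.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order (an assumed convention).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(peptide):
    """Return the percentage of each standard amino acid in the peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    return [100.0 * counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def binary_profile(peptide):
    """One-hot encode every position: 20 values per residue."""
    profile = []
    for residue in peptide.upper():
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile


if __name__ == "__main__":
    print(aa_composition("AAC")[:2])    # share of A and C in the peptide
    print(len(binary_profile("AAC")))   # 3 residues x 20 values
```

The composition vector has a fixed length of 20 regardless of peptide length, whereas the binary profile grows with the peptide, so it is usually applied to peptides of a fixed length.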
Affiliation(s)
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India.
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India.
| | - Nishant Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India.
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India.
| |
|
16
|
Zhang L, Chen Q, Zeng S, Deng Z, Liu Z, Li X, Hou Q, Zhou R, Bao S, Hou D, Weng S, He J, Huang Z. Succeed to culture a novel lineage symbiotic bacterium of Mollicutes which widely found in arthropods intestine uncovers the potential double-edged sword ecological function. Front Microbiol 2024; 15:1458382. [PMID: 39493855 PMCID: PMC11527720 DOI: 10.3389/fmicb.2024.1458382] [Received: 07/02/2024] [Accepted: 09/20/2024] [Indexed: 11/05/2024]
Abstract
Symbiotic gut bacteria play a crucial role in host health. Symbionts are widely distributed in arthropod intestines, but their ecological functions are poorly understood due to the inability to cultivate them. Members of Candidatus Bacilliplasma (CB) are widely distributed in crustacean intestines and may be commensals with their hosts, but the paucity of pure cultures has limited further insights into their physiologies and functions. Here, four strains of representative CB bacteria in shrimp intestine were successfully isolated and identified as members of a novel Order in the Phylum Mycoplasmatota. Through genome assembly, the circular genome maps of the four strains were obtained, and the number of coding genes ranged from 1,886 to 1,980. Genomic analysis suggested that the bacteria were missing genes for many critical pathways, including the TCA cycle and biosynthesis pathways for amino acids and coenzyme factors. The analysis of 16S amplification data showed that Shewanella, Pseudomonas, and CB were the dominant genera in the intestine of Penaeus vannamei. Ecological functional experiments revealed that the strains were symbionts and colonized shrimp intestines. Our findings enhance understanding of, and provide new insights into, the potentially significant role of uncultured symbiotic bacteria in modulating host health.
Affiliation(s)
- Lingyu Zhang
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Qi Chen
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Shenzheng Zeng
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Zhixuan Deng
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Zhongcheng Liu
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Xuanting Li
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Qilu Hou
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Renjun Zhou
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Shicheng Bao
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Dongwei Hou
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Shaoping Weng
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Jianguo He
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Zhijian Huang
- State Key Laboratory of Biocontrol, School of Life Sciences, School of Marine Sciences, Sun Yat-sen University, Guangzhou, China
- Southern Marine Sciences and Engineering Guangdong Laboratory (Zhuhai), School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| |
|
17
|
Lones MA. Avoiding common machine learning pitfalls. PATTERNS (NEW YORK, N.Y.) 2024; 5:101046. [PMID: 39569205 PMCID: PMC11573893 DOI: 10.1016/j.patter.2024.101046] [Indexed: 11/22/2024]
Abstract
Mistakes in machine learning practice are commonplace and can result in loss of confidence in the findings and products of machine learning. This tutorial outlines common mistakes that occur when using machine learning and what can be done to avoid them. While it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to make rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
Affiliation(s)
- Michael A Lones
- School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
| |
|
18
|
Kim JI, Manuele A, Maguire F, Zaheer R, McAllister TA, Beiko RG. Identification of key drivers of antimicrobial resistance in Enterococcus using machine learning. Can J Microbiol 2024; 70:446-460. [PMID: 39079170 DOI: 10.1139/cjm-2024-0049] [Indexed: 10/03/2024]
Abstract
With antimicrobial resistance (AMR) rapidly evolving in pathogens, quick and accurate identification of genetic determinants of phenotypic resistance is essential for improving surveillance, stewardship, and clinical mitigation. Machine learning (ML) models show promise for AMR prediction in diagnostics but require a deep understanding of internal processes to use effectively. Our study utilised AMR gene, pangenomic, and predicted plasmid features from 647 Enterococcus faecium and Enterococcus faecalis genomes across the One Health continuum, along with corresponding resistance phenotypes, to develop interpretive ML classifiers. Vancomycin resistance could be predicted with 99% accuracy with AMR gene features, 98% with pangenome features, and 96% with plasmid clusters. Top pangenome features overlapped with the resistance genes of the vanA operon, which are often laterally transmitted via plasmids. Doxycycline resistance prediction achieved approximately 92% accuracy with pangenome features, with the top feature being elements of Tn916 conjugative transposon, a tet(M) carrier. Erythromycin resistance prediction models achieved about 90% accuracy, but top features were negatively correlated with resistance due to the confounding effect of population structure. This work demonstrates the importance of reviewing ML models' features to discern biological relevance even when achieving high-performance metrics. Our workflow offers the potential to propose hypotheses for experimental testing, enhancing the understanding of AMR mechanisms, which are crucial for combating the AMR crisis.
Affiliation(s)
- Jee In Kim
- Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
- Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
| | - Alexander Manuele
- Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
| | - Finlay Maguire
- Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
- Department of Community Health and Epidemiology, Dalhousie University, Faculty of Medicine, Halifax, NS, Canada
| | - Rahat Zaheer
- Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
| | | | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
| |
|
19
|
Abood EA, Abdallah MH, Alsaadi M, Imran H, Bernardo LFA, De Domenico D, Henedy SN. Machine Learning-Based Prediction Models for Punching Shear Strength of Fiber-Reinforced Polymer Reinforced Concrete Slabs Using a Gradient-Boosted Regression Tree. MATERIALS (BASEL, SWITZERLAND) 2024; 17:3964. [PMID: 39203141 PMCID: PMC11355707 DOI: 10.3390/ma17163964] [Received: 06/24/2024] [Revised: 07/31/2024] [Accepted: 08/06/2024] [Indexed: 09/03/2024]
Abstract
Fiber-reinforced polymers (FRPs) are increasingly being used as a composite material in concrete slabs due to their high strength-to-weight ratio and resistance to corrosion. However, FRP-reinforced concrete slabs, similar to traditional systems, are susceptible to punching shear failure, a critical design concern. Existing empirical models and design provisions for predicting the punching shear strength of FRP-reinforced concrete slabs often exhibit significant bias and dispersion. These errors highlight the need for more reliable predictive models. This study aims to develop gradient-boosted regression tree (GBRT) models to accurately predict the shear strength of FRP-reinforced concrete panels and to address the limitations of existing empirical models. A comprehensive database of 238 sets of experimental results for FRP-reinforced concrete slabs has been compiled from the literature. Different machine learning algorithms were considered, and the performance of GBRT models was evaluated against these algorithms. The dataset was divided into training and testing sets to verify the accuracy of the model. The results indicated that the GBRT model achieved the highest prediction accuracy, with root mean square error (RMSE) of 64.85, mean absolute error (MAE) of 42.89, and coefficient of determination (R²) of 0.955. Comparative analysis with existing experimental models showed that the GBRT model outperformed these traditional approaches. The SHapley Additive exPlanation (SHAP) method was used to interpret the GBRT model, providing insight into the contribution of each input variable to the prediction of punching shear strength. The analysis emphasized the importance of variables such as slab thickness, FRP reinforcement ratio, and critical section perimeter. This study demonstrates the effectiveness of the GBRT model in predicting the punching shear strength of FRP-reinforced concrete slabs with high accuracy. SHAP analysis elucidates key factors that influence model predictions and provides valuable insights for future research and design improvements.
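For readers unfamiliar with the three accuracy measures quoted in this abstract, the following minimal sketch computes RMSE, MAE, and the coefficient of determination from their standard textbook definitions. The function name `regression_metrics` and the toy values are illustrative assumptions; they are not the authors' code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


if __name__ == "__main__":
    # Toy values for illustration only.
    print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

RMSE and MAE carry the units of the predicted quantity, while R² is unitless, which is why the abstract's RMSE and MAE values cannot be judged without knowing the scale of the measured strengths.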
Affiliation(s)
- Emad A. Abood
- Department of Material Engineering, College of Engineering, Al-Shatrah University, Al-Shatrah 64007, Iraq;
| | - Marwa Hameed Abdallah
- Department of Civil Engineering, Najaf Technical Institute, Al-Furat Al-Awsat Technical University, Najaf Munazira Str., Najaf 54003, Iraq;
| | - Mahmood Alsaadi
- Department of Computer Science, Al-Maarif University College, Al-Anbar 31001, Iraq;
| | - Hamza Imran
- Department of Construction and Project, Al-Karkh University of Science, Baghdad 10081, Iraq;
| | - Luís Filipe Almeida Bernardo
- GeoBioTec, Department of Civil Engineering and Architecture, University of Beira Interior, 6201-001 Covilhã, Portugal
| | - Dario De Domenico
- Department of Engineering, University of Messina, Villaggio S. Agata, 98166 Messina, Italy;
| | - Sadiq N. Henedy
- Department of Civil Engineering, Mazaya University College, Nasiriyah 64001, Iraq;
| |
|
20
|
Eminaga O, Lau H, Shkolyar E, Wardelmann E, Abbas M. Deep learning identifies histopathologic changes in bladder cancers associated with smoke exposure status. PLoS One 2024; 19:e0305135. [PMID: 39083547 PMCID: PMC11290674 DOI: 10.1371/journal.pone.0305135] [Received: 02/10/2023] [Accepted: 05/23/2024] [Indexed: 08/02/2024]
Abstract
Smoke exposure is associated with bladder cancer (BC). However, little is known about whether the histologic changes of BC can predict the status of smoke exposure. Given this knowledge gap, the current study investigated the potential association between histology images and smoke exposure status. A total of 483 whole-slide histology images of 285 unique cases of BC were available from multiple centers for BC diagnosis. A deep learning model was developed to predict the smoke exposure status and externally validated on BC cases. The development set consisted of 66 cases from two centers. The external validation consisted of 94 cases from remaining centers for patients who either never smoked cigarettes or were active smokers at the time of diagnosis. The threshold for binary categorization was fixed to the median confidence score (65) of the development set. On external validation, AUC was used to assess the randomness of predicted smoke status; we utilized latent feature presentation to determine common histologic patterns for smoke exposure status and mixed effect logistic regression models determined the parameter independence from BC grade, gender, time to diagnosis, and age at diagnosis. We used 2,000-times bootstrap resampling to estimate the 95% Confidence Interval (CI) on the external validation set. The results showed an AUC of 0.67 (95% CI: 0.58-0.76), indicating non-randomness of model classification, with a specificity of 51.2% and sensitivity of 82.2%. Multivariate analyses revealed that our model provided an independent predictor for smoke exposure status derived from histology images, with an odds ratio of 1.710 (95% CI: 1.148-2.54). Common histologic patterns of BC were found in active or never smokers. In conclusion, deep learning reveals histopathologic features of BC that are predictive of smoke exposure and, therefore, may provide valuable information regarding smoke exposure status.
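The bootstrap confidence interval described in this abstract can be illustrated with a small sketch. It assumes a percentile bootstrap over cases and a pairwise (Mann-Whitney) AUC; the abstract states only that 2,000 resamples were used, so the interval method, the function names, and the toy data are assumptions for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    auc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined for a single-class resample
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy labels and confidence scores, for illustration only.
    y = [0, 0, 1, 1, 0, 1, 1, 0]
    s = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.55]
    print(auc(y, s), bootstrap_auc_ci(y, s))
```

An interval whose lower bound stays above 0.5, as reported in the abstract, is what supports the claim that the classifier performs better than chance.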
Affiliation(s)
- Okyaz Eminaga
- AI Vobis, Palo Alto, California, United States of America
| | - Hubert Lau
- Department of Pathology and Laboratory Medicine, Veterans Affairs Palo Alto Health Care System, Palo Alto, California, United States of America
- Department of Pathology, Stanford University School of Medicine, Palo Alto, California, United States of America
| | - Eugene Shkolyar
- Department of Urology, Stanford University School of Medicine, Palo Alto, California, United States of America
| | - Eva Wardelmann
- Department of Pathology, University Hospital of Muenster, Münster, Germany
| | - Mahmoud Abbas
- Department of Pathology, University Hospital of Muenster, Münster, Germany
| |
|
21
|
Martínez-Trespalacios JA, Polo-Herrera DE, Félix-Massa TY, Hernandez-Rivera SP, Hernandez-Fernandez J, Colpas-Castillo F, Castro-Suarez JR. QCL Infrared Spectroscopy Combined with Machine Learning as a Useful Tool for Classifying Acetaminophen Tablets by Brand. Molecules 2024; 29:3562. [PMID: 39124967 PMCID: PMC11313707 DOI: 10.3390/molecules29153562] [Received: 06/15/2024] [Revised: 07/18/2024] [Accepted: 07/23/2024] [Indexed: 08/12/2024] Open
Abstract
The development of new methods of identification of active pharmaceutical ingredients (API) is a subject of paramount importance for research centers, the pharmaceutical industry, and law enforcement agencies. Here, a system for identifying and classifying pharmaceutical tablets containing acetaminophen (AAP) by brand has been developed. In total, 15 tablets from each of 11 brands (165 samples) were analyzed. Mid-infrared vibrational spectroscopy with multivariate analysis was employed. Quantum cascade lasers (QCLs) were used as mid-infrared sources. IR spectra in the spectral range 980-1600 cm⁻¹ were recorded. Five different classification methods were used. First, a spectral search through correlation indices was performed. Second, machine learning algorithms such as principal component analysis (PCA), support vector classification (SVC), decision tree classifier (DTC), and artificial neural network (ANN) were employed to classify tablets by brand. SNV and first derivative were used as preprocessing to improve the spectral information. Precision, recall, specificity, F1-score, and accuracy were used as criteria to evaluate the best SVC, DTC, and ANN classification models obtained. The IR spectra of the tablets show characteristic vibrational signals of AAP and other APIs present. Spectral classification by spectral search and PCA showed limitations in differentiating between brands, particularly for tablets containing AAP as the only API. Machine learning models, specifically SVC, achieved high accuracy in classifying AAP tablets according to their brand, even for brands containing only AAP.
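The two preprocessing steps named in this abstract, SNV (standard normal variate) and a first derivative, can be sketched as follows. This is only an illustration: the abstract does not say how the derivative was computed, so the simple forward difference used here is an assumption (smoothed derivatives such as Savitzky-Golay are also common), and the function names are hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre one spectrum on its own mean and
    scale it by its own sample standard deviation."""
    n = len(spectrum)
    if n < 2:
        raise ValueError("need at least two points")
    mean = sum(spectrum) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in spectrum) / (n - 1))
    if std == 0:
        raise ValueError("flat spectrum cannot be scaled")
    return [(x - mean) / std for x in spectrum]


def first_derivative(spectrum, step=1.0):
    """Forward finite difference (an assumed choice); `step` is the
    wavenumber spacing. The result is one point shorter than the input."""
    return [(b - a) / step for a, b in zip(spectrum, spectrum[1:])]
```

SNV is applied to each spectrum separately, so it removes per-sample offset and scale differences before the classifier sees the data.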
Affiliation(s)
- José A. Martínez-Trespalacios
- Mechanical Engineering Program, School of Engineering, Universidad Tecnológica de Bolívar, Parque Industrial y Tecnológico Carlos Vélez Pombo, Cartagena 130001, Colombia; (J.A.M.-T.); (J.H.-F.)
| | - Daniel E. Polo-Herrera
- Chemistry Program, Department of Natural and Exact Sciences, San Pablo Campus, University of Cartagena, Cartagena 130015, Colombia; (D.E.P.-H.); (F.C.-C.)
| | - Tamara Y. Félix-Massa
- Center for Chemical Sensors and Chemical Imaging and Surface Analysis Center, Department of Chemistry, University of Puerto Rico, Mayaguez, PR 00681, USA; (T.Y.F.-M.); (S.P.H.-R.)
| | - Samuel P. Hernandez-Rivera
- Center for Chemical Sensors and Chemical Imaging and Surface Analysis Center, Department of Chemistry, University of Puerto Rico, Mayaguez, PR 00681, USA; (T.Y.F.-M.); (S.P.H.-R.)
| | - Joaquín Hernandez-Fernandez
- Mechanical Engineering Program, School of Engineering, Universidad Tecnológica de Bolívar, Parque Industrial y Tecnológico Carlos Vélez Pombo, Cartagena 130001, Colombia; (J.A.M.-T.); (J.H.-F.)
- Chemistry Program, Department of Natural and Exact Sciences, San Pablo Campus, University of Cartagena, Cartagena 130015, Colombia; (D.E.P.-H.); (F.C.-C.)
- Department of Natural and Exact Science, Universidad de la Costa, Barranquilla 080002, Colombia
| | - Fredy Colpas-Castillo
- Chemistry Program, Department of Natural and Exact Sciences, San Pablo Campus, University of Cartagena, Cartagena 130015, Colombia; (D.E.P.-H.); (F.C.-C.)
| | - John R. Castro-Suarez
- Área Básicas Exactas, Universidad del Sinú, Seccional Cartagena, Cartagena 130015, Colombia
| |
|
22
|
Cong Y, LaCroix AN, Lee J. Clinical efficacy of pre-trained large language models through the lens of aphasia. Sci Rep 2024; 14:15573. [PMID: 38971898 PMCID: PMC11227580 DOI: 10.1038/s41598-024-66576-y] [Received: 01/25/2024] [Accepted: 07/01/2024] [Indexed: 07/08/2024] Open
Abstract
The rapid development of large language models (LLMs) motivates us to explore how such state-of-the-art natural language processing systems can inform aphasia research. What kind of language indices can we derive from a pre-trained LLM? How do they differ from or relate to the existing language features in aphasia? To what extent can LLMs serve as an interpretable and effective diagnostic and measurement tool in a clinical context? To investigate these questions, we constructed predictive and correlational models, which utilize mean surprisals from LLMs as predictor variables. Using AphasiaBank archived data, we validated our models' efficacy in aphasia diagnosis, measurement, and prediction. We find that LLM surprisals can effectively detect the presence of aphasia and different natures of the disorder; that LLMs in conjunction with the existing language indices improve models' efficacy in subtyping aphasia; and that LLM surprisals can capture common agrammatic deficits at both the word and sentence level. Overall, LLMs have the potential to advance automatic and precise aphasia prediction. A natural language processing pipeline can benefit greatly from integrating LLMs, enabling us to refine models of existing language disorders, such as aphasia.
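The predictor variable mentioned in this abstract, mean surprisal, is the average negative log-probability a language model assigns to each token given its context. A minimal sketch follows; the per-token probabilities would come from an LLM, which is not modelled here, and the base-2 logarithm (bits) and the function name are assumptions for illustration.

```python
import math


def mean_surprisal(token_probs):
    """Average surprisal in bits over one utterance.
    `token_probs` holds the model probability of each observed token given
    its preceding context (hypothetical input; no model is queried here)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)
```

Under this definition, tokens the model finds unlikely raise the mean, so an utterance that departs from what the model expects scores higher.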
Affiliation(s)
- Yan Cong
- School of Languages and Cultures, Purdue University, West Lafayette, USA.
| | - Arianna N LaCroix
- Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, USA
| | - Jiyeon Lee
- Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, USA
| |
|
23
|
Seixas Feio JA, de Oliveira ECL, de Sales CDS, da Costa KS, e Lima AHL. Investigating molecular descriptors in cell-penetrating peptides prediction with deep learning: Employing N, O, and hydrophobicity according to the Eisenberg scale. PLoS One 2024; 19:e0305253. [PMID: 38870192 PMCID: PMC11175476 DOI: 10.1371/journal.pone.0305253] [Received: 12/11/2023] [Accepted: 05/27/2024] [Indexed: 06/15/2024] Open
Abstract
Cell-penetrating peptides comprise a group of molecules that can naturally cross the lipid bilayer membrane that protects cells, sharing physicochemical and structural properties, and having several pharmaceutical applications, particularly in drug delivery. Investigations of molecular descriptors have provided not only an improvement in the performance of classifiers but also less computational complexity and an enhanced understanding of membrane permeability. Furthermore, the employment of new technologies, such as the construction of deep learning models using overfitting treatment, offers advantages in tackling this problem. In this study, the descriptors nitrogen, oxygen, and hydrophobicity on the Eisenberg scale were investigated, using the proposed ConvBoost-CPP composed of an improved convolutional neural network with overfitting treatment and an XGBoost model with adjusted hyperparameters. The results favored ConvBoost-CPP with nitrogen, oxygen, and hydrophobicity as input together with ten other descriptors previously investigated in this research line, showing an increase in accuracy from 88% to 91.2% in cross-validation and from 82.6% to 91.3% in the independent test.
Affiliation(s)
- Juliana Auzier Seixas Feio
- Laboratório de Inteligência Computacional e Pesquisa Operacional, Campus Belém, Instituto de Tecnologia, Universidade Federal do Pará, Pará, Brazil
| | - Ewerton Cristhian Lima de Oliveira
- Laboratório de Inteligência Computacional e Pesquisa Operacional, Campus Belém, Instituto de Tecnologia, Universidade Federal do Pará, Pará, Brazil
- Instituto Tecnológico Vale, Belém, Pará, Brazil
| | - Claudomiro de Souza de Sales
- Laboratório de Inteligência Computacional e Pesquisa Operacional, Campus Belém, Instituto de Tecnologia, Universidade Federal do Pará, Pará, Brazil
| | - Kauê Santana da Costa
- Laboratório de Simulação Computacional, Campus Marechal Rondom, Instituto de Biodiversidade, Universidade Federal do Oeste do Pará, Santarém, Pará, Brazil
| | - Anderson Henrique Lima e Lima
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Pará, Brazil
| |
|
24
|
Huang J, Liu J, Ye Y, Jiang Y, Lai Y, Qin X, Zhang L, Jiang Y. Mapping Soil Properties in the Haihun River Sub-Watershed, Yangtze River Basin, China, by Integrating Machine Learning and Variable Selection. SENSORS (BASEL, SWITZERLAND) 2024; 24:3784. [PMID: 38931566 PMCID: PMC11207289 DOI: 10.3390/s24123784] [Received: 04/08/2024] [Revised: 06/01/2024] [Accepted: 06/05/2024] [Indexed: 06/28/2024]
Abstract
Mapping soil properties in sub-watersheds is critical for agricultural productivity, land management, and ecological security. Machine learning has been widely applied to digital soil mapping due to a rapidly increasing number of environmental covariates. However, the inclusion of many environmental covariates in machine learning models leads to the problem of multicollinearity, with poorly understood consequences for prediction performance. Here, we explored the effects of variable selection on the prediction performance of two machine learning models for multiple soil properties in the Haihun River sub-watershed, Jiangxi Province, China. Surface soils (0-20 cm) were collected from a total of 180 sample points in 2022. The optimal covariates were selected from 40 environmental covariates using a recursive feature elimination algorithm. Compared to all-variable models, the random forest (RF) and extreme gradient boosting (XGBoost) models with variable selection improved in prediction accuracy. The R2 values of the RF and XGBoost models increased by 0.34 and 0.47 for the soil organic carbon, by 0.67 and 0.62 for the total phosphorus, and by 0.43 and 0.62 for the available phosphorus, respectively. The models with variable selection presented reduced global uncertainty, and the overall uncertainty of the RF model was lower than that of the XGBoost model. The soil properties showed high spatial heterogeneity based on the models with variable selection. Remote sensing covariates (particularly principal component 2) were the major factors controlling the distribution of the soil organic carbon. Human activity covariates (mainly land use) and organism covariates (mainly potential evapotranspiration) played a predominant role in driving the distribution of the soil total and soil available phosphorus, respectively. This study indicates the importance of variable selection for predicting multiple soil properties and mapping their spatial distribution in sub-watersheds.
Affiliation(s)
- Jun Huang
- Basic Geological Survey Institute of Jiangxi Geological Survey and Exploration Institute (Jiangxi Nonferrous Geological Mineral Exploration and Development Institute), Nanchang 330045, China; (J.H.); (Y.L.); (X.Q.); (L.Z.)
| | - Jia Liu
- College of Land Resources and Environment, Jiangxi Agricultural University, Nanchang 330045, China; (J.L.); (Y.Y.); (Y.J.)
| | - Yingcong Ye
- College of Land Resources and Environment, Jiangxi Agricultural University, Nanchang 330045, China; (J.L.); (Y.Y.); (Y.J.)
| | - Yameng Jiang
- College of Land Resources and Environment, Jiangxi Agricultural University, Nanchang 330045, China; (J.L.); (Y.Y.); (Y.J.)
| | - Yuying Lai
- Basic Geological Survey Institute of Jiangxi Geological Survey and Exploration Institute (Jiangxi Nonferrous Geological Mineral Exploration and Development Institute), Nanchang 330045, China; (J.H.); (Y.L.); (X.Q.); (L.Z.)
| | - Xianbing Qin
- Basic Geological Survey Institute of Jiangxi Geological Survey and Exploration Institute (Jiangxi Nonferrous Geological Mineral Exploration and Development Institute), Nanchang 330045, China; (J.H.); (Y.L.); (X.Q.); (L.Z.)
| | - Lin Zhang
- Basic Geological Survey Institute of Jiangxi Geological Survey and Exploration Institute (Jiangxi Nonferrous Geological Mineral Exploration and Development Institute), Nanchang 330045, China; (J.H.); (Y.L.); (X.Q.); (L.Z.)
| | - Yefeng Jiang
- College of Land Resources and Environment, Jiangxi Agricultural University, Nanchang 330045, China; (J.L.); (Y.Y.); (Y.J.)
| |
|
25
|
Kuhnen G, Class LC, Badekow S, Hanisch KL, Rohn S, Kuballa J. Python workflow for the selection and identification of marker peptides-proof-of-principle study with heated milk. Anal Bioanal Chem 2024; 416:3349-3360. [PMID: 38607384 PMCID: PMC11106092 DOI: 10.1007/s00216-024-05286-w] [Received: 02/13/2024] [Revised: 03/26/2024] [Accepted: 04/02/2024] [Indexed: 04/13/2024]
Abstract
The analysis of almost holistic food profiles has developed considerably over the last years. This has also led to larger amounts of data and the ability to obtain more information about health-beneficial and adverse constituents in food than ever before. Especially in the field of proteomics, software is used for evaluation, but such tools do not provide specific approaches for unique monitoring questions. An additional and more comprehensive evaluation can be done with the programming language Python. It offers broad possibilities through a large ecosystem for mass spectrometric data analysis, but needs to be tailored to specific sets of features and the research questions behind them. It also allows various machine-learning approaches to be applied. The aim of the present study was to develop an algorithm for selecting and identifying potential marker peptides from mass spectrometric data. The workflow is divided into three steps: (I) feature engineering, (II) chemometric data analysis, and (III) feature identification. The first step is the transformation of the mass spectrometric data into a structure, which enables the application of existing data analysis packages in Python. The second step is the data analysis for selecting single features. These features are further processed in the third step, which is the feature identification. The data used as an example in this proof-of-principle approach came from a study on the influence of a heat treatment on the milk proteome/peptidome.
Affiliation(s)
- Gesine Kuhnen
- GALAB Laboratories GmbH, Am Schleusengraben 7, 21029, Hamburg, Germany
- Department of Food Chemistry and Analysis, Institute of Food Technology and Food Chemistry, Technical University Berlin, Gustav Meyer Allee 25, 13355, Berlin, Germany
| | - Lisa-Carina Class
- GALAB Laboratories GmbH, Am Schleusengraben 7, 21029, Hamburg, Germany
- Hamburg School of Food Science, Institute of Food Chemistry, University of Hamburg, Grindelallee 117, 20146, Hamburg, Germany
| | - Svenja Badekow
- GALAB Laboratories GmbH, Am Schleusengraben 7, 21029, Hamburg, Germany
| | - Kim Lara Hanisch
- GALAB Laboratories GmbH, Am Schleusengraben 7, 21029, Hamburg, Germany
| | - Sascha Rohn
- Department of Food Chemistry and Analysis, Institute of Food Technology and Food Chemistry, Technical University Berlin, Gustav Meyer Allee 25, 13355, Berlin, Germany
| | - Jürgen Kuballa
- GALAB Laboratories GmbH, Am Schleusengraben 7, 21029, Hamburg, Germany.
| |
|
26
|
Uesugi F, Wen Y, Hashimoto A, Ishii M. Prediction of nanocomposite properties and process optimization using persistent homology and machine learning. Micron 2024; 183:103664. [PMID: 38820861 DOI: 10.1016/j.micron.2024.103664] [Received: 01/10/2024] [Revised: 05/18/2024] [Accepted: 05/22/2024] [Indexed: 06/02/2024]
Abstract
Physical property prediction and synthesis process optimization are key targets in material informatics. In this study, we propose a machine learning approach that utilizes ridge regression to predict the oxygen permeability at fuel cell electrode surfaces and determine the optimal process temperature. These predictions are based on a persistence diagram derived from tomographic images captured using transmission electron microscopy (TEM). Through machine learning analysis of the complex structures present in the Pt/CeO2 nanocomposites, we discovered that l2 regularization considering diverse structural elements is more appropriate than l1 regularization (sparse modeling). Notably, our model successfully captured the activation energy of oxygen permeability, a phenomenon that could not be solely explained by the geometric feature of the Betti numbers, as demonstrated in a previous study. The correspondence between the ridge regression coefficient and persistence diagram revealed the formation process of the local and three-dimensional structures of CeO2 and their contributions to pre-exponential factor and activation energies. This analysis facilitated the determination of the annealing temperature required to achieve the optimal structure and accurately predict the physical properties.
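For readers unfamiliar with the two regularization schemes contrasted in this abstract, the standard textbook objectives are written out below. The notation is an assumption made here, not taken from the paper: X is the feature matrix (in this study it would be derived from the persistence diagram), y the target property, β the coefficients, and λ the penalty weight. The last line gives the usual Arrhenius form, which the terms "pre-exponential factor" and "activation energy" suggest but which the abstract does not state explicitly.

```latex
% Ridge regression (l2 penalty): shrinks all coefficients, keeps them non-zero
\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2

% Lasso (l1 penalty, sparse modeling): drives many coefficients exactly to zero
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

% Assumed Arrhenius form for a thermally activated property P at temperature T,
% with pre-exponential factor P_0 and activation energy E_a
P(T) = P_0 \exp\!\left(-\frac{E_a}{k_B T}\right)
\quad\Longleftrightarrow\quad
\ln P = \ln P_0 - \frac{E_a}{k_B T}
```

Because the l2 penalty retains every feature with a reduced weight, it is consistent with the abstract's point that diverse structural elements contribute to the property.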
Affiliation(s)
- Fumihiko Uesugi
- National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan.
| | - Yu Wen
- National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan; University of Tsukuba, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
| | - Ayako Hashimoto
- National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan; University of Tsukuba, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
| | - Masashi Ishii
- National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan
| |
|
27
|
Böttcher B, Kienast SD, Leufken J, Eggers C, Sharma P, Leufken CM, Morgner B, Drexler HCA, Schulz D, Allert S, Jacobsen ID, Vylkova S, Leidel SA, Brunke S. A highly conserved tRNA modification contributes to C. albicans filamentation and virulence. Microbiol Spectr 2024; 12:e0425522. [PMID: 38587411 PMCID: PMC11064501 DOI: 10.1128/spectrum.04255-22] [Received: 10/18/2022] [Accepted: 01/18/2024] [Indexed: 04/09/2024] Open
Abstract
tRNA modifications play important roles in maintaining translation accuracy in all domains of life. Disruptions in the tRNA modification machinery, especially of the anticodon stem loop, can be lethal for many bacteria and lead to a broad range of phenotypes in baker's yeast. Very little is known about the function of tRNA modifications in host-pathogen interactions, where rapidly changing environments and stresses require fast adaptations. We found that two closely related fungal pathogens of humans, the highly pathogenic Candida albicans and its much less pathogenic sister species, Candida dubliniensis, differ in the function of a tRNA-modifying enzyme. This enzyme, Hma1, exhibits species-specific effects on the ability of the two fungi to grow in the hypha morphology, which is central to their virulence potential. We show that Hma1 has tRNA-threonylcarbamoyladenosine dehydratase activity, and its deletion alters ribosome occupancy, especially at 37°C, the body temperature of the human host. A C. albicans HMA1 deletion mutant also shows defects in adhesion to and invasion into human epithelial cells and shows reduced virulence in a fungal infection model. This links tRNA modifications to host-induced filamentation and virulence of one of the most important fungal pathogens of humans. IMPORTANCE Fungal infections are on the rise worldwide, and their global burden on human life and health is frequently underestimated. Among them, the human commensal and opportunistic pathogen, Candida albicans, is one of the major causative agents of severe infections. Its virulence is closely linked to its ability to change morphologies from yeasts to hyphae. Here, this ability is linked (to our knowledge for the first time) to modifications of tRNA and translational efficiency. One tRNA-modifying enzyme, Hma1, plays a specific role in C. albicans and its ability to invade the host. This adds a so-far unknown layer of regulation to the fungal virulence program and offers new potential therapeutic targets to fight fungal infections.
Affiliation(s)
- Bettina Böttcher
- Department of Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
- Septomics Research Center, Friedrich Schiller University and Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| | - Sandra D. Kienast
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Research Group for Cellular RNA Biochemistry, Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
| | - Johannes Leufken
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Research Group for Cellular RNA Biochemistry, Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
| | - Cristian Eggers
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Research Group for Cellular RNA Biochemistry, Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
| | - Puneet Sharma
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Research Group for Cellular RNA Biochemistry, Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
| | - Christine M. Leufken
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
| | - Bianka Morgner
- Department of Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| | - Hannes C. A. Drexler
- Bioanalytical Mass Spectrometry Unit, Max Planck Institute for Molecular Biomedicine, Münster, Germany
| | - Daniela Schulz
- Department of Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| | - Stefanie Allert
- Department of Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| | - Ilse D. Jacobsen
- Research Group Microbial Immunology, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
- Institute of Microbiology, Friedrich Schiller University, Jena, Germany
| | - Slavena Vylkova
- Septomics Research Center, Friedrich Schiller University and Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| | - Sebastian A. Leidel
- Max Planck Research Group for RNA Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Research Group for Cellular RNA Biochemistry, Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
| | - Sascha Brunke
- Department of Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology – Hans Knoell Institute, Jena, Germany
| |
|
28
|
Xu G, Gan S, Guo B, Yang L. Application of clustering strategy for automatic segmentation of tissue regions in mass spectrometry imaging. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2024; 38:e9717. [PMID: 38389435 DOI: 10.1002/rcm.9717] [Received: 10/18/2023] [Revised: 01/19/2024] [Accepted: 01/21/2024] [Indexed: 02/24/2024]
Abstract
RATIONALE Mass spectrometry imaging (MSI) has been widely used in biomedical research fields. Each pixel in MSI consists of a mass spectrum that reflects the molecule feature of the tissue spot. Because MSI contains high-dimensional datasets, it is highly desired to develop computational methods for data mining and constructing tissue segmentation maps. METHODS To visualize different tissue regions based on mass spectrum features and improve the efficiency in processing enormous data, we proposed a computational strategy that consists of four procedures including preprocessing, data reduction, clustering, and quantitative validation. RESULTS In this study, we examined the combination of t-distributed stochastic neighbor embedding (t-SNE) and hierarchical clustering (HC) for MSI data analysis. Using publicly available MSI datasets, one dataset of mouse urinary bladder, and one dataset of human colorectal cancer, we demonstrated that the generated tissue segmentation maps from this combination were superior to other data reduction and clustering algorithms. Using the staining image as a reference, we assessed the performance of clustering algorithms with external and internal clustering validation measures, including purity, adjusted Rand index (ARI), Davies-Bouldin index (DBI), and spatial aggregation index (SAI). The result indicated that SAI delivered excellent performance for automatic segmentation of tissue regions in MSI. CONCLUSIONS We used a clustering algorithm to construct tissue automatic segmentation in MSI datasets. The performance was evaluated by comparing it with the stained image and calculating clustering validation indexes. The results indicated that SAI is important for automatic tissue segmentation in MSI, different from traditional clustering validation measures. Compared to the reports that used internal clustering validation measures such as DBI, our method offers more effective evaluation of clustering results for MSI segmentation. 
We envision that the proposed automatic image segmentation strategy can facilitate deep learning in molecular feature extraction and biomarker discovery for the biomedical applications of MSI.
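Two of the external validation measures named in this abstract, purity and the adjusted Rand index (ARI), compare a clustering against reference labels (here, regions taken from the stained image). The sketch below uses the standard definitions of both measures; the function names are hypothetical, and the spatial aggregation index (SAI) is not shown because the abstract does not define it.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_labels):
    """Fraction of pixels that belong to the majority reference class of
    the cluster they were assigned to."""
    per_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting agreement between two partitions, corrected for chance:
    1.0 for identical partitions, about 0 for random ones."""
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs that share both a reference class and a cluster
    sum_ij = sum(comb(v, 2) for v in Counter(zip(true_labels, cluster_labels)).values())
    # Pairs that share a reference class, and pairs that share a cluster
    sum_a = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_b = sum(comb(v, 2) for v in Counter(cluster_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Both functions ignore the actual label values, so a clustering that recovers the reference regions under different cluster numbers still scores 1.0. Purity alone rewards over-segmentation, which is one reason it is usually reported together with a chance-corrected measure such as ARI.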
Affiliation(s)
- Guang Xu
- College of Computer, Hubei University of Education, Wuhan, China
| | - Shengfeng Gan
- College of Computer, Hubei University of Education, Wuhan, China
| | - Bo Guo
- College of Computer, Hubei University of Education, Wuhan, China
| | - Li Yang
- College of Computer, Hubei University of Education, Wuhan, China
| |
|
29
|
Valdés-Albuernes JL, Díaz-Pico E, Alfaro S, Caballero J. Modeling of noncovalent inhibitors of the papain-like protease (PLpro) from SARS-CoV-2 considering the protein flexibility by using molecular dynamics and cross-docking. Front Mol Biosci 2024; 11:1374364. [PMID: 38601323 PMCID: PMC11004324 DOI: 10.3389/fmolb.2024.1374364] [Received: 01/22/2024] [Accepted: 03/11/2024] [Indexed: 04/12/2024] Open
Abstract
The papain-like protease (PLpro) found in coronaviruses that can be transmitted from animals to humans is a critical target in respiratory diseases linked to Severe Acute Respiratory Syndrome (SARS-CoV). Researchers have proposed designing PLpro inhibitors. In this study, a set of 89 compounds, including recently reported 2-phenylthiophenes with nanomolar inhibitory potency, were investigated as PLpro noncovalent inhibitors using advanced molecular modeling techniques. To develop the work with these inhibitors, multiple structures of the SARS-CoV-2 PLpro binding site were generated using a molecular sampling method. These structures were then clustered to select a group that represents the flexibility of the site. Subsequently, models of the protein-ligand complexes were created for the set of inhibitors within the chosen conformations. The quality of the complex models was assessed using LigRMSD software to verify similarities in the orientations of the congeneric series and interaction fingerprints to determine the recurrence of chemical interactions. With the multiple models constructed, a protocol was established to choose one per ligand, optimizing the correlation between the calculated docking energy values and the biological activities while incorporating the effect of the binding site's flexibility. A strong correlation (R2 = 0.922) was found when employing this flexible docking protocol.
Affiliation(s)
| | | | | | - Julio Caballero
- Centro de Bioinformática, Simulación y Modelado (CBSM), Facultad de Ingeniería, Universidad de Talca, Talca, Chile
| |
|
30
|
Bazzani D, Heidrich V, Manghi P, Blanco-Miguez A, Asnicar F, Armanini F, Cavaliere S, Bertelle A, Dell'Acqua F, Dellasega E, Waldner R, Vicentini D, Bolzan M, Tomasi C, Segata N, Pasolli E, Ghensi P. Favorable subgingival plaque microbiome shifts are associated with clinical treatment for peri-implant diseases. NPJ Biofilms Microbiomes 2024; 10:12. [PMID: 38374114 PMCID: PMC10876967 DOI: 10.1038/s41522-024-00482-z] [Received: 06/01/2023] [Accepted: 01/31/2024] [Indexed: 02/21/2024] Open
Abstract
We performed a longitudinal shotgun metagenomic investigation of the plaque microbiome associated with peri-implant diseases in a cohort of 91 subjects with 320 quality-controlled metagenomes. Through recently improved taxonomic profiling methods, we identified the most discriminative species between healthy and diseased subjects at baseline, evaluated their change over time, and provided evidence that clinical treatment had a positive effect on plaque microbiome composition in patients affected by mucositis and peri-implantitis.
Affiliation(s)
| | | | - Paolo Manghi
- Department CIBIO, University of Trento, Trento, Italy
| | | | | | | | - Sara Cavaliere
- Department of Agricultural Sciences, University of Naples Federico II, Portici, Italy
| | | | | | | | | | | | | | - Cristiano Tomasi
- PreBiomics S.r.l., Trento, Italy
- Department of Periodontology, Institute of Odontology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy.
- Edoardo Pasolli
- Department of Agricultural Sciences, University of Naples Federico II, Portici, Italy.
- Paolo Ghensi
- PreBiomics S.r.l., Trento, Italy.
- Department CIBIO, University of Trento, Trento, Italy.
31
Chandrasekaran K, Kakani V, Kokkarachedu V, Abdulrahman Syedahamed HH, Palani S, Arumugam S, Shanmugam A, Kim S, Kim K. Toxicological assessment of divalent ion-modified ZnO nanomaterials through artificial intelligence and in vivo study. Aquatic Toxicology (Amsterdam, Netherlands) 2024; 267:106826. [PMID: 38219502 DOI: 10.1016/j.aquatox.2023.106826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 12/22/2023] [Accepted: 12/27/2023] [Indexed: 01/16/2024]
Abstract
The nanotechnology-driven industrial revolution widely relies on metal oxide-based nanomaterials (NMs). Zinc oxide (ZnO) production has rapidly increased globally due to its outstanding physical and chemical properties and versatile applications in industries including cement, rubber, paints, cosmetics, and more. Nevertheless, releasing Zn²⁺ ions into the environment can profoundly impact living systems and affect water-based ecosystems, including biological ones. In aquatic environments, Zn²⁺ ions can change water properties, directly influencing underwater ecosystems, especially fish populations. These ions can accumulate in fish tissues when fish are exposed to contaminated water and pose health risks to humans who consume them, leading to symptoms such as nausea, vomiting, and even organ damage. To address this issue, the safety of ZnO NMs should be enhanced without altering their nanoscale properties, thus preventing toxicity-related problems. In this study, an eco-friendly precipitation method was employed to prepare ZnO NMs. These NMs were found to reduce ZnO toxicity levels by incorporating elements such as Mg, Ca, Sr, and Ba. Structural, morphological, and optical properties of synthesized NMs were thoroughly investigated. In vitro tests demonstrated potential antioxidative properties of NMs with significant effects on free radical scavenging activities. In vivo toxicity tests were conducted using Oreochromis mossambicus fish and male Swiss Albino mice to compare toxicities of different ZnO NMs. Fish and mice exposed to these NMs exhibited biochemical changes and histological abnormalities. Notably, ZnCaO NMs demonstrated lower toxicity to fish and mice than other ZnO NMs. This was attributed to its Ca²⁺ ions, which could enhance body growth metabolism compared to other metals, thus improving material safety.
Furthermore, whether nanomaterials' surface roughness might contribute to their increased toxicity in biological systems was investigated utilizing computer vision (CV)-based AI tools to obtain SEM images of NMs, providing valuable image-based surface morphology data that could be correlated with relevant toxicology studies.
Affiliation(s)
- Vijay Kakani
- Integrated System Engineering, Inha University, Inha-ro, Incheon, 22212, Republic of Korea
- Varaprasad Kokkarachedu
- Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Lientur 1457, Concepción 4080871, Bio-Bio, Chile
- Suganthi Palani
- KIRND Institute of Research and Development Pvt Ltd, Tiruchirappalli, Tamil Nadu 620 020, India
- Stalin Arumugam
- Department of Zoology, National College (Affiliated to Bharathidasan University), Tiruchirappalli, Tamil Nadu, 620 001, India
- Achiraman Shanmugam
- Department of Environmental Biotechnology, School of Environmental Sciences, Bharathidasan University, Tiruchirappalli, India
- Sungjun Kim
- Department of Chemical & Biochemical Engineering, Dongguk University, Seoul, 04620, Republic of Korea
- Kyobum Kim
- Department of Chemical & Biochemical Engineering, Dongguk University, Seoul, 04620, Republic of Korea.
32
Park H, Joachimiak MP, Jungbluth SP, Yang Z, Riehl WJ, Canon RS, Arkin AP, Dehal PS. A bacterial sensor taxonomy across earth ecosystems for machine learning applications. mSystems 2024; 9:e0002623. [PMID: 38078749 PMCID: PMC10804942 DOI: 10.1128/msystems.00026-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2023] [Accepted: 10/23/2023] [Indexed: 01/24/2024] Open
Abstract
Microbial communities have evolved to colonize all ecosystems of the planet, from the deep sea to the human gut. Microbes survive by sensing, responding, and adapting to immediate environmental cues. This process is driven by signal transduction proteins such as histidine kinases, which use their sensing domains to bind or otherwise detect environmental cues and "transduce" signals to adjust internal processes. We hypothesized that an ecosystem's unique stimuli leave a sensor "fingerprint," able to identify and shed light on ecosystem conditions. To test this, we collected 20,712 publicly available metagenomes from Host-associated, Environmental, and Engineered ecosystems across the globe. We extracted and clustered the collection's nearly 18M unique sensory domains into 113,712 similar groupings with MMseqs2. We built gradient-boosted decision tree machine learning models and found we could classify the ecosystem type (accuracy: 87%) and predict the levels of different physical parameters (R² score: 83%) using the sensor cluster abundance as features. Feature importance enables identification of the most predictive sensors to differentiate between ecosystems, which can lead to mechanistic interpretations if the sensor domains are well annotated. To demonstrate this, a machine learning model was trained to predict patients' disease state and used to identify domains related to oxygen sensing present in a healthy gut but missing in patients with abnormal conditions. Moreover, since 98.7% of identified sensor domains are uncharacterized, importance ranking can be used to prioritize sensors to determine what ecosystem function they may be sensing. Furthermore, these new predictive sensors can function as targets for novel sensor engineering with applications in biotechnology, ecosystem maintenance, and medicine. IMPORTANCE Microbes infect, colonize, and proliferate due to their ability to sense and respond quickly to their surroundings.
In this research, we extract the sensory proteins from a diverse range of environmental, engineered, and host-associated metagenomes. We trained machine learning classifiers using sensors as features such that it is possible to predict the ecosystem for a metagenome from its sensor profile. We use the optimized model's feature importance to identify the most impactful and predictive sensors in different environments. We next use the sensor profile from human gut metagenomes to classify their disease states and explore which sensors can explain differences between diseases. The sensors most predictive of environmental labels here, most of which correspond to uncharacterized proteins, are a useful starting point for the discovery of important environment signals and the development of possible diagnostic interventions.
Affiliation(s)
- Helen Park
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
- EPSRC/BBSRC Future Biomanufacturing Research Hub, EPSRC Synthetic Biology Research Centre SYNBIOCHEM Manchester Institute of Biotechnology and School of Chemistry, The University of Manchester, Manchester, United Kingdom
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Marcin P. Joachimiak
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Sean P. Jungbluth
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Ziming Yang
- Computational Science Initiative, Brookhaven National Laboratory, Upton, New York, USA
- William J. Riehl
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- R. Shane Canon
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Adam P. Arkin
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Department of Bioengineering, University of California, Berkeley, California, USA
- Paramvir S. Dehal
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
33
Bibi I, Schaffert D, Blauth M, Lull C, von Ahnen JA, Gross G, Weigandt WA, Knitza J, Kuhn S, Benecke J, Leipe J, Schmieder A, Olsavszky V. Automated Machine Learning Analysis of Patients With Chronic Skin Disease Using a Medical Smartphone App: Retrospective Study. J Med Internet Res 2023; 25:e50886. [PMID: 38015608 PMCID: PMC10716771 DOI: 10.2196/50886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Revised: 09/18/2023] [Accepted: 09/19/2023] [Indexed: 11/29/2023] Open
Abstract
BACKGROUND Rapid digitalization in health care has led to the adoption of digital technologies; however, limited trust in internet-based health decisions and the need for technical personnel hinder the use of smartphones and machine learning applications. To address this, automated machine learning (AutoML) is a promising tool that can empower health care professionals to enhance the effectiveness of mobile health apps. OBJECTIVE We used AutoML to analyze data from clinical studies involving patients with chronic hand and/or foot eczema or psoriasis vulgaris who used a smartphone monitoring app. The analysis focused on itching, pain, Dermatology Life Quality Index (DLQI) development, and app use. METHODS After extensive data set preparation, which consisted of combining 3 primary data sets by extracting common features and by computing new features, a new pseudonymized secondary data set with a total of 368 patients was created. Next, multiple machine learning classification models were built during AutoML processing, with the most accurate models ultimately selected for further data set analysis. RESULTS Itching development for 6 months was accurately modeled using the light gradient boosted trees classifier model (log loss: 0.9302 for validation, 1.0193 for cross-validation, and 0.9167 for holdout). Pain development for 6 months was assessed using the random forest classifier model (log loss: 1.1799 for validation, 1.1561 for cross-validation, and 1.0976 for holdout). Then, the random forest classifier model (log loss: 1.3670 for validation, 1.4354 for cross-validation, and 1.3974 for holdout) was used again to estimate the DLQI development for 6 months. Finally, app use was analyzed using an elastic net blender model (area under the curve: 0.6567 for validation, 0.6207 for cross-validation, and 0.7232 for holdout). 
Influential feature correlations were identified, including BMI, age, disease activity, DLQI, and Hospital Anxiety and Depression Scale-Anxiety scores at follow-up. App use increased with BMI >35, was less common in patients aged >47 years and those aged 23 to 31 years, and was more common in those with higher disease activity. A Hospital Anxiety and Depression Scale-Anxiety score >8 had a slightly positive effect on app use. CONCLUSIONS This study provides valuable insights into the relationship between data characteristics and targeted outcomes in patients with chronic eczema or psoriasis, highlighting the potential of smartphone and AutoML techniques in improving chronic disease management and patient care.
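As a rough aid for reading the log loss values reported above, the following is a minimal sketch of multiclass log loss under its standard definition (the mean negative log of the probability assigned to the true class). The labels and probabilities are hypothetical and are not taken from the study, and the number of outcome classes per target is not stated in the abstract. One reference point that holds in general: a model that spreads probability uniformly over K classes scores ln K.

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


def uniform_baseline(n_classes):
    """Log loss of a model that predicts 1/K for every class."""
    return math.log(n_classes)


# Hypothetical three-class example (illustrative values only).
labels = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
example_loss = log_loss(labels, probs)
three_class_baseline = uniform_baseline(3)
```

Lower is better. Whether a value near 1 is good depends on how many classes the target has, so the uniform baseline for the relevant K is the natural comparison.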
Affiliation(s)
- Igor Bibi
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Daniel Schaffert
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Mara Blauth
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Christian Lull
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Jan Alwin von Ahnen
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Georg Gross
- Department of Medicine V, Division of Rheumatology, University Medical Centre and Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Wanja Alexander Weigandt
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Johannes Knitza
- Institute of Digital Medicine, Philipps-University Marburg and University Hospital of Giessen and Marburg, Marburg, Germany
- Sebastian Kuhn
- Institute of Digital Medicine, Philipps-University Marburg and University Hospital of Giessen and Marburg, Marburg, Germany
- Johannes Benecke
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
- Jan Leipe
- Department of Medicine V, Division of Rheumatology, University Medical Centre and Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Astrid Schmieder
- Department of Dermatology, Venereology, and Allergology, University Hospital Würzburg, Würzburg, Germany
- Victor Olsavszky
- Department of Dermatology, Venereology and Allergology, University Medical Center and Medical Faculty Mannheim, Center of Excellence in Dermatology, Heidelberg University, Mannheim, Germany
34
Ranjbar A, Montazeri F, Farashah MV, Mehrnoush V, Darsareh F, Roozbeh N. Machine learning-based approach for predicting low birth weight. BMC Pregnancy Childbirth 2023; 23:803. [PMID: 37985975 PMCID: PMC10662167 DOI: 10.1186/s12884-023-06128-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 11/14/2023] [Indexed: 11/22/2023] Open
Abstract
BACKGROUND Low birth weight (LBW) has been linked to infant mortality. Predicting LBW is a valuable preventative tool and predictor of newborn health risks. The current study employed a machine learning model to predict LBW. METHODS This study implemented predictive LBW models based on the data obtained from the "Iranian Maternal and Neonatal Network (IMaN Net)" from January 2020 to January 2022. Women with singleton pregnancies above the gestational age of 24 weeks were included. Exclusion criteria included multiple pregnancies and fetal anomalies. A predictive model was built using eight statistical learning models (logistic regression, decision tree classification, random forest classification, deep learning feedforward, extreme gradient boost model, light gradient boost model, support vector machine, and permutation feature classification with k-nearest neighbors). Expert opinion and prior observational cohorts were used to select candidate LBW predictors for all models. The area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, and F1 score were measured to evaluate their diagnostic performance. RESULTS We found 1280 women with a recorded LBW out of 8853 deliveries, for a frequency of 14.5%. Deep learning (AUROC: 0.86), random forest classification (AUROC: 0.79), and extreme gradient boost classification (AUROC: 0.79) all have higher AUROC and perform better than others. When the other performance parameters of the models mentioned above with higher AUROC were compared, the extreme gradient boost model was the best model to predict LBW with an accuracy of 0.79, precision of 0.87, recall of 0.69, and F1 score of 0.77. According to the feature importance rank, gestational age and prior history of LBW were the top critical predictors. 
CONCLUSIONS Although this study found that the extreme gradient boost model performed well in predicting LBW, more research is needed to make a better conclusion on the performance of ML models in predicting LBW.
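The reported figures for the extreme gradient boost model can be cross-checked with a short calculation. This sketch assumes the standard definition of the F1 score as the harmonic mean of precision and recall; the function name is illustrative and the inputs are the rounded values quoted in the abstract, so only agreement to two decimals should be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract for the extreme gradient boost model.
reported_f1 = f1_score(precision=0.87, recall=0.69)

# LBW frequency from the counts given in the RESULTS section.
lbw_frequency = 1280 / 8853
```

Both quantities agree with the abstract after rounding: the F1 score rounds to 0.77 and the LBW frequency to 14.5%.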
Affiliation(s)
- Amene Ranjbar
- Fertility and Infertility Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Farideh Montazeri
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Vahid Mehrnoush
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
- Fatemeh Darsareh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran.
- Nasibeh Roozbeh
- Mother and Child Welfare Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
35
Sun H, Zhang K, Lan W, Gu Q, Jiang G, Yang X, Qin W, Han D. An AI Dietitian for Type 2 Diabetes Mellitus Management Based on Large Language and Image Recognition Models: Preclinical Concept Validation Study. J Med Internet Res 2023; 25:e51300. [PMID: 37943581 PMCID: PMC10667983 DOI: 10.2196/51300] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/27/2023] [Revised: 09/18/2023] [Accepted: 10/06/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND Nutritional management for patients with diabetes in China is a significant challenge due to the low supply of registered clinical dietitians. To address this, an artificial intelligence (AI)-based nutritionist program that uses advanced language and image recognition models was created. This program can identify ingredients from images of a patient's meal and offer nutritional guidance and dietary recommendations. OBJECTIVE The primary objective of this study is to evaluate the competence of the models that support this program. METHODS The potential of an AI nutritionist program for patients with type 2 diabetes mellitus (T2DM) was evaluated through a multistep process. First, a survey was conducted among patients with T2DM and endocrinologists to identify knowledge gaps in dietary practices. ChatGPT and GPT 4.0 were then tested through the Chinese Registered Dietitian Examination to assess their proficiency in providing evidence-based dietary advice. ChatGPT's responses to common questions about medical nutrition therapy were compared with expert responses by professional dietitians to evaluate its proficiency. The model's food recommendations were scrutinized for consistency with expert advice. A deep learning-based image recognition model was developed for food identification at the ingredient level, and its performance was compared with existing models. Finally, a user-friendly app was developed, integrating the capabilities of language and image recognition models to potentially improve care for patients with T2DM. RESULTS Most patients (182/206, 88.4%) demanded more immediate and comprehensive nutritional management and education. Both ChatGPT and GPT 4.0 passed the Chinese Registered Dietitian examination. ChatGPT's food recommendations were mainly in line with best practices, except for certain foods like root vegetables and dry beans. 
Professional dietitians' reviews of ChatGPT's responses to common questions were largely positive, with 162 out of 168 providing favorable reviews. The multilabel image recognition model evaluation showed that the Dino V2 model achieved an average F1 score of 0.825, indicating high accuracy in recognizing ingredients. CONCLUSIONS The model evaluations were promising. The AI-based nutritionist program is now ready for a supervised pilot study.
Affiliation(s)
- Haonan Sun
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
- Kai Zhang
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
- Wei Lan
- Department of Pediatrics, Peking University Shenzhen Hospital, Shenzhen, China
- Qiufeng Gu
- Department of Pediatrics, Peking University Shenzhen Hospital, Shenzhen, China
- Guangxiang Jiang
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
- Xue Yang
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
- Wanli Qin
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
- Dongran Han
- School of Life Science, Beijing University of Chinese Medicine, Beijing, China
36
Vieluf S, Cantley S, Jackson M, Zhang B, Bosl WJ, Loddenkemper T. Development of a Multivariable Seizure Likelihood Assessment Based on Clinical Information and Short Autonomic Activity Recordings for Children With Epilepsy. Pediatr Neurol 2023; 148:118-127. [PMID: 37703656 DOI: 10.1016/j.pediatrneurol.2023.07.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 04/10/2023] [Accepted: 07/24/2023] [Indexed: 09/15/2023]
Abstract
BACKGROUND Predicting seizure likelihood for the following day would enable clinicians to extend or potentially schedule video-electroencephalography (EEG) monitoring when seizure risk is high. Combining standardized clinical data with short-term recordings of wearables to predict seizure likelihood could have high practical relevance as wearable data is easy and fast to collect. As a first step toward seizure forecasting, we classified patients based on whether they had seizures or not during the following recording. METHODS Pediatric patients admitted to the epilepsy monitoring unit wore a wearable that recorded the heart rate (HR), heart rate variability (HRV), electrodermal activity (EDA), and peripheral body temperature. We utilized short recordings from 9:00 to 9:15 pm and compared mean values between patients with and without impending seizures. In addition, we collected clinical data: age, sex, age at first seizure, generalized slowing, focal slowing, and spikes on EEG, magnetic resonance imaging findings, and antiseizure medication reduction. We used conventional machine learning techniques with cross-validation to classify patients with and without impending seizures. RESULTS We included 139 patients: 78 had no seizures and 61 had at least one seizure after 9 pm during the concurrent video-EEG and E4 recordings. HR (P < 0.01) and EDA (P < 0.01) were lower and HRV (P = 0.02) was higher for patients with than for patients without impending seizures. The average accuracy of group classification was 66%, and the mean area under the receiver operating characteristic curve was 0.72. CONCLUSIONS Short-term wearable recordings in combination with clinical data have great potential as an easy-to-use seizure likelihood assessment tool.
Affiliation(s)
- Solveig Vieluf
- Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts; Institute of Sports Medicine, Paderborn University, Paderborn, Germany.
- Sarah Cantley
- Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts
- Michele Jackson
- Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts
- Bo Zhang
- Department of Neurology, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts
- William J Bosl
- Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts; Health Informatics Program, University of San Francisco, San Francisco, California
- Tobias Loddenkemper
- Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts
37
Viljanen M, Minnema J, Wassenaar PNH, Rorije E, Peijnenburg W. What is the ecotoxicity of a given chemical for a given aquatic species? Predicting interactions between species and chemicals using recommender system techniques. SAR and QSAR in Environmental Research 2023; 34:765-788. [PMID: 37670728 DOI: 10.1080/1062936x.2023.2254225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/21/2023] [Accepted: 08/27/2023] [Indexed: 09/07/2023]
Abstract
Ecotoxicological safety assessment of chemicals requires toxicity data on multiple species, despite the general desire to minimize animal testing. Predictive models, specifically machine learning (ML) methods, are one of the tools capable of resolving this apparent contradiction, as they allow toxicity patterns to be generalized across chemicals and species. However, despite the availability of large public toxicity datasets, the data is highly sparse, complicating model development. The aim of this study is to provide insights into how ML can predict toxicity using a large but sparse dataset. We developed models to predict LC50-values, based on experimental LC50-data covering 2431 organic chemicals and 1506 aquatic species from the ECOTOX-database. Several well-known ML techniques were evaluated and a new ML model was developed, inspired by recommender systems. This new model involves a simple linear model that learns low-rank interactions between species and chemicals using factorization machines. We evaluated the predictive performances of the developed models based on two validation settings: 1) predicting unseen chemical-species pairs, and 2) predicting unseen chemicals. The results of this study show that ML models can accurately predict LC50-values in both validation settings. Moreover, we show that the novel factorization machine approach can match well-tuned, complex ML approaches.
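To illustrate the idea of low-rank species-chemical interactions, here is a minimal sketch of a second-order factorization machine prediction for the simplest case, where the only features are a one-hot species identifier and a one-hot chemical identifier. Under that assumption the prediction reduces to a global bias, a per-species bias, a per-chemical bias, and the dot product of two latent vectors. All parameter values, the chemical labels, and the function name are hypothetical; the model in the study may use additional features and is fitted to data rather than set by hand.

```python
def fm_predict(species, chemical, params):
    """Predict a (log-scale) LC50 value for one species-chemical pair.

    Second-order factorization machine restricted to two one-hot features:
    y = w0 + w_species + w_chemical + <v_species, v_chemical>.
    """
    v_s = params["species_factors"][species]
    v_c = params["chemical_factors"][chemical]
    # Low-rank interaction term: dot product of the two latent vectors.
    interaction = sum(a * b for a, b in zip(v_s, v_c))
    return (
        params["global_bias"]
        + params["species_bias"][species]
        + params["chemical_bias"][chemical]
        + interaction
    )


# Hypothetical, hand-set parameters with two latent dimensions.
params = {
    "global_bias": 1.0,
    "species_bias": {"Daphnia magna": -0.5, "Danio rerio": 0.2},
    "chemical_bias": {"chemical_A": 0.3, "chemical_B": -0.1},
    "species_factors": {"Daphnia magna": [0.5, -1.0], "Danio rerio": [0.1, 0.4]},
    "chemical_factors": {"chemical_A": [1.0, 0.5], "chemical_B": [-0.2, 0.3]},
}

prediction = fm_predict("Danio rerio", "chemical_B", params)
```

Because every species and chemical has its own latent vector, a pair that never appears together in the training data still receives a prediction, which is what makes this family of models suited to sparse species-chemical matrices.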
Affiliation(s)
- M Viljanen
- Department of Statistics, Data Science and Modelling, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- J Minnema
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- P N H Wassenaar
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- E Rorije
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- W Peijnenburg
- Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands
- Institute of Environmental Sciences (CML), Leiden University, Leiden, The Netherlands
38
Bhandari N, Walambe R, Kotecha K, Kaliya M. Integrative gene expression analysis for the diagnosis of Parkinson's disease using machine learning and explainable AI. Comput Biol Med 2023; 163:107140. [PMID: 37315380 DOI: 10.1016/j.compbiomed.2023.107140] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/21/2022] [Revised: 05/29/2023] [Accepted: 06/04/2023] [Indexed: 06/16/2023]
Abstract
Parkinson's disease (PD) is a progressive neurodegenerative disorder. Various symptoms and diagnostic tests are used in combination for the diagnosis of PD; however, accurate diagnosis at early stages is difficult. Blood-based markers can support physicians in the early diagnosis and treatment of PD. In this study, we used Machine Learning (ML) based methods for the diagnosis of PD by integrating gene expression data from different sources and applying explainable artificial intelligence (XAI) techniques to find the significant set of gene features contributing to diagnosis. We utilized the Least Absolute Shrinkage and Selection Operator (LASSO), and Ridge regression for the feature selection process. We utilized state-of-the-art ML techniques for the classification of PD cases and healthy controls. Logistic regression and Support Vector Machine showed the highest diagnostic accuracy. SHapley Additive exPlanations (SHAP) based global interpretable model-agnostic XAI method was utilized for the interpretation of the Support Vector Machine model. A set of significant biomarkers that contributed to the diagnosis of PD were identified. Some of these genes are associated with other neurodegenerative diseases. Our results suggest that the utilization of XAI can be useful in making early therapeutic decisions for the treatment of PD. The integration of datasets from different sources made this model robust. We believe that this research article will be of interest to clinicians as well as computational biologists in translational research.
Affiliation(s)
- Nikita Bhandari
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India; Symbiosis Center for Applied Artificial Intelligence (SCAAI), Symbiosis International Deemed University, Pune, Maharashtra, India
- Rahee Walambe
- Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, Maharashtra, India; Symbiosis Center for Applied Artificial Intelligence (SCAAI), Symbiosis International Deemed University, Pune, Maharashtra, India.
- Ketan Kotecha
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India; Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, Maharashtra, India.
- Mehul Kaliya
- Department of General Medicine, AIIMS, Rajkot, Gujarat, India
39
Veeramani A, Zhang AS, Blackburn AZ, Etzel CM, DiSilvestro KJ, McDonald CL, Daniels AH. An Artificial Intelligence Approach to Predicting Unplanned Intubation Following Anterior Cervical Discectomy and Fusion. Global Spine J 2023; 13:1849-1855. [PMID: 35132907 PMCID: PMC10556901 DOI: 10.1177/21925682211053593] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022] Open
Abstract
STUDY DESIGN Level III retrospective database study. OBJECTIVES The purpose of this study is to determine if machine learning algorithms are effective in predicting unplanned intubation following anterior cervical discectomy and fusion (ACDF). METHODS The National Surgical Quality Improvement Program (NSQIP) was queried to select patients who had undergone ACDF. Machine learning analysis was conducted in Python and multivariate regression analysis was conducted in R. C-statistic area under the curve (AUC) and prediction accuracy were used to measure the classifier's effectiveness in distinguishing cases. RESULTS In total, 54 502 patients met the study criteria. Of these patients, 0.51% underwent an unplanned re-intubation. Machine learning algorithms accurately classified 72%-100% of the test cases, with AUC values of 0.52-0.77. Multivariable regression indicated that the number of levels fused, male sex, COPD, American Society of Anesthesiologists (ASA) > 2, increased operating time, age > 65, pre-operative weight loss, dialysis, and disseminated cancer were associated with increased risk of unplanned intubation. CONCLUSIONS The models presented here achieved high accuracy in predicting risk factors for re-intubation following ACDF surgery. Machine learning analysis may be useful in identifying patients who are at a higher risk of unplanned post-operative re-intubation and their treatment plans can be modified to prophylactically prevent respiratory compromise and consequently unplanned re-intubation.
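When reading the accuracy range above, it may help to see what accuracy alone means at an event rate of about 0.51%. The sketch below is not the authors' analysis: it estimates the number of events from the rounded percentage in the abstract and evaluates a hypothetical baseline that predicts "no re-intubation" for every patient. The function name is illustrative.

```python
def accuracy_and_sensitivity(tp, fp, tn, fn):
    """Return (accuracy, sensitivity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


n_patients = 54502
event_rate = 0.0051  # rounded figure from the abstract
n_events = round(n_patients * event_rate)  # estimate, not a reported count

# Hypothetical baseline: every patient is predicted as "no re-intubation".
baseline_accuracy, baseline_sensitivity = accuracy_and_sensitivity(
    tp=0, fp=0, tn=n_patients - n_events, fn=n_events
)
```

Under these assumptions the baseline is correct for more than 99% of patients while identifying none of the events, which is why the AUC values carry more information than accuracy for an outcome this rare.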
Affiliation(s)
- Ashwin Veeramani
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Andrew S Zhang
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Amy Z. Blackburn
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Christine M. Etzel
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Kevin J. DiSilvestro
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Christopher L. McDonald
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Alan H. Daniels
- Department of Orthopedic Surgery, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, USA
40
Atwell S, Waibel DJE, Boushehri SS, Wiedenmann S, Marr C, Meier M. Label-free imaging of 3D pluripotent stem cell differentiation dynamics on chip. CELL REPORTS METHODS 2023; 3:100523. [PMID: 37533640 PMCID: PMC10391578 DOI: 10.1016/j.crmeth.2023.100523] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/26/2022] [Revised: 05/09/2023] [Accepted: 06/15/2023] [Indexed: 08/04/2023]
Abstract
Massive, parallelized 3D stem cell cultures for engineering in vitro human cell types require imaging methods with high time and spatial resolution to fully exploit technological advances in cell culture technologies. Here, we introduce a large-scale integrated microfluidic chip platform for automated 3D stem cell differentiation. To fully enable dynamic high-content imaging on the chip platform, we developed a label-free deep learning method called Bright2Nuc to predict in silico nuclear staining in 3D from confocal microscopy bright-field images. Bright2Nuc was trained and applied to hundreds of 3D human induced pluripotent stem cell cultures differentiating toward definitive endoderm on a microfluidic platform. Combined with existing image analysis tools, Bright2Nuc segmented individual nuclei from bright-field images, quantified their morphological properties, predicted stem cell differentiation state, and tracked the cells over time. Our methods are available in an open-source pipeline, enabling researchers to upscale image acquisition and phenotyping of 3D cell culture.
Affiliation(s)
- Scott Atwell
- Helmholtz Pioneer Campus, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
| | - Dominik Jens Elias Waibel
- Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- TUM School of Life Sciences, Technical University of Munich, Weihenstephan, Germany
| | - Sayedali Shetab Boushehri
- Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Department of Mathematics, Technical University of Munich, Munich, Germany
- Data & Analytics, Pharmaceutical Research and Early Development, Roche Innovation Center Munich (RICM), Penzberg, Germany
| | - Sandra Wiedenmann
- Helmholtz Pioneer Campus, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
| | - Carsten Marr
- Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
| | - Matthias Meier
- Helmholtz Pioneer Campus, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Center for Biotechnology and Biomedicine, University of Leipzig, Leipzig, Germany
41
Zhang W, Wu B, Ren Y, Yang G. Regionally Compatible Individual Tree Growth Model under the Combined Influence of Environment and Competition. PLANTS (BASEL, SWITZERLAND) 2023; 12:2697. [PMID: 37514311 PMCID: PMC10385731 DOI: 10.3390/plants12142697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 07/10/2023] [Accepted: 07/13/2023] [Indexed: 07/30/2023]
Abstract
To explore the effects of competition, site, and climate on the growth of Chinese fir individual tree diameter at breast height (DBH) and tree height (H), a regionally compatible individual tree growth model under the combined influence of environment and competition was constructed. Using continuous forest inventory (CFI) sample plot data from Fujian Province between 1993 and 2018, we constructed an individual tree DBH model and an H model based on re-parameterization (RP), BP neural network (BP), and random forest (RF), and compared the accuracy of the different modeling methods. The results showed that the inclusion of competition and environmental factors could improve the prediction accuracy of the model. Among the site factors, slope position (PW) had the most significant effect, followed by elevation (HB) and slope aspect (PX). Among the climate factors, the highest contribution was made by degree-days above 18 °C (DD18), followed by mean annual precipitation (MAP) and Hargreaves reference evaporation (Eref). The comparison results of the three modeling methods show that the RF model has the best fitting effect. The R² of the individual DBH model based on RF is 0.849, RMSE is 1.691 cm, and MAE is 1.267 cm. The R² of the individual H model based on RF is 0.845, RMSE is 1.267 m, and MAE is 1.153 m. The model constructed in this study has the advantages of environmental sensitivity, statistical reliability, and prediction efficiency. The results can provide theoretical support for management decision-making and harvest prediction of mixed uneven-aged forest.
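For readers less familiar with the three fit statistics quoted above, the following Python sketch shows how the coefficient of determination, RMSE, and MAE are conventionally computed from observed and predicted values. The toy numbers are hypothetical and are not data from the study; the function name is likewise an assumption made for illustration.

```python
import math


def fit_statistics(observed, predicted):
    """Return (r_squared, rmse, mae) for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    r_squared = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r_squared, rmse, mae


# Hypothetical diameters in cm, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
r2, rmse, mae = fit_statistics(observed, predicted)
print(f"R^2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

RMSE is never smaller than MAE for the same residuals, which offers a quick sanity check when reading reported values.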
Affiliation(s)
- Wenjie Zhang
- School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
- Key Laboratory of Quantitative Remote Sensing in Agriculture of Ministry of Agriculture and Rural Affairs, Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Baoguo Wu
- School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
- Forestry Information Research Institute, Beijing Forestry University, Beijing 100083, China
| | - Yi Ren
- Academy of Forestry Inventory and Planning, Beijing 100714, China
| | - Guijun Yang
- Key Laboratory of Quantitative Remote Sensing in Agriculture of Ministry of Agriculture and Rural Affairs, Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
42
Sajdeya R, Mardini MT, Tighe PJ, Ison RL, Bai C, Jugl S, Hanzhi G, Zandbiglari K, Adiba FI, Winterstein AG, Pearson TA, Cook RL, Rouhizadeh M. Developing and validating a natural language processing algorithm to extract preoperative cannabis use status documentation from unstructured narrative clinical notes. J Am Med Inform Assoc 2023; 30:1418-1428. [PMID: 37178155 PMCID: PMC10354766 DOI: 10.1093/jamia/ocad080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 05/03/2023] [Indexed: 05/15/2023] Open
Abstract
OBJECTIVE This study aimed to develop a natural language processing algorithm (NLP) using machine learning (ML) techniques to identify and classify documentation of preoperative cannabis use status. MATERIALS AND METHODS We developed and applied a keyword search strategy to identify documentation of preoperative cannabis use status in clinical documentation within 60 days of surgery. We manually reviewed matching notes to classify each documentation into 8 different categories based on context, time, and certainty of cannabis use documentation. We applied 2 conventional ML and 3 deep learning models against manual annotation. We externally validated our model using the MIMIC-III dataset. RESULTS The tested classifiers achieved classification results close to human performance with up to 93% and 94% precision and 95% recall of preoperative cannabis use status documentation. External validation showed consistent results with up to 94% precision and recall. DISCUSSION Our NLP model successfully replicated human annotation of preoperative cannabis use documentation, providing a baseline framework for identifying and classifying documentation of cannabis use. We add to NLP methods applied in healthcare for clinical concept extraction and classification, mainly concerning social determinants of health and substance use. Our systematically developed lexicon provides a comprehensive knowledge-based resource covering a wide range of cannabis-related concepts for future NLP applications. CONCLUSION We demonstrated that documentation of preoperative cannabis use status could be accurately identified using an NLP algorithm. This approach can be employed to identify comparison groups based on cannabis exposure for growing research efforts aiming to guide cannabis-related clinical practices and policies.
Affiliation(s)
- Ruba Sajdeya
- Department of Epidemiology, College of Public Health & Health Professions & College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Mamoun T Mardini
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Patrick J Tighe
- Department of Anesthesiology, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Ronald L Ison
- Department of Anesthesiology, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Chen Bai
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Sebastian Jugl
- Department of Pharmaceutical Outcomes & Policy, Center for Drug Evaluation and Safety (CoDES), University of Florida, Gainesville, Florida, USA
| | - Gao Hanzhi
- Department of Biostatistics, University of Florida, Gainesville, Florida, USA
| | - Kimia Zandbiglari
- Department of Pharmaceutical Outcomes & Policy, Center for Drug Evaluation and Safety (CoDES), University of Florida, Gainesville, Florida, USA
| | - Farzana I Adiba
- Department of Pharmaceutical Outcomes & Policy, Center for Drug Evaluation and Safety (CoDES), University of Florida, Gainesville, Florida, USA
| | - Almut G Winterstein
- Department of Pharmaceutical Outcomes & Policy, Center for Drug Evaluation and Safety (CoDES), University of Florida, Gainesville, Florida, USA
| | - Thomas A Pearson
- Department of Epidemiology, College of Public Health & Health Professions & College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Robert L Cook
- Department of Epidemiology, College of Public Health & Health Professions & College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Masoud Rouhizadeh
- Department of Pharmaceutical Outcomes & Policy, Center for Drug Evaluation and Safety (CoDES), University of Florida, Gainesville, Florida, USA
43
Oh SS, Kuang I, Jeong H, Song JY, Ren B, Moon JY, Park EC, Kawachi I. Predicting Fetal Alcohol Spectrum Disorders Using Machine Learning Techniques: Multisite Retrospective Cohort Study. J Med Internet Res 2023; 25:e45041. [PMID: 37463016 PMCID: PMC10394506 DOI: 10.2196/45041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/13/2022] [Revised: 05/22/2023] [Accepted: 06/18/2023] [Indexed: 07/21/2023] Open
Abstract
BACKGROUND Fetal alcohol syndrome (FAS) is a lifelong developmental disability that occurs among individuals with prenatal alcohol exposure (PAE). With improved prediction models, FAS can be diagnosed or treated early, if not completely prevented. OBJECTIVE In this study, we sought to compare different machine learning algorithms and their FAS predictive performance among women who consumed alcohol during pregnancy. We also aimed to identify which variables (eg, timing of exposure to alcohol during pregnancy and type of alcohol consumed) were most influential in generating an accurate model. METHODS Data from the collaborative initiative on fetal alcohol spectrum disorders from 2007 to 2017 were used to gather information about 595 women who consumed alcohol during pregnancy at 5 hospital sites around the United States. To obtain information about PAE, questionnaires or in-person interviews, as well as reviews of medical, legal, or social service records were used to gather information about alcohol consumption. Four different machine learning algorithms (logistic regression, XGBoost, light gradient-boosting machine, and CatBoost) were trained to predict the prevalence of FAS at birth, and model performance was measured by analyzing the area under the receiver operating characteristics curve (AUROC). Of the total cases, 80% were randomly selected for training, while 20% remained as test data sets for predicting FAS. Feature importance was also analyzed using Shapley values for the best-performing algorithm. RESULTS Overall, there were 20 cases of FAS within a total population of 595 individuals with PAE. Most of the drinking occurred in the first trimester only (n=491) or throughout all 3 trimesters (n=95); however, there were also reports of drinking in the first and second trimesters only (n=8), and 1 case of drinking in the third trimester only (n=1). 
The CatBoost method delivered the best performance in terms of AUROC (0.92) and area under the precision-recall curve (AUPRC 0.51), followed by the logistic regression method (AUROC 0.90; AUPRC 0.59), the light gradient-boosting machine (AUROC 0.89; AUPRC 0.52), and XGBoost (AUROC 0.86; AUPRC 0.45). Shapley values in the CatBoost model revealed that 12 variables were considered important in FAS prediction, with drinking throughout all 3 trimesters of pregnancy, maternal age, race, and type of alcoholic beverage consumed (eg, beer, wine, or liquor) scoring highly in overall feature importance. For most predictive measures, the best performance was obtained by the CatBoost algorithm, with an AUROC of 0.92, precision of 0.50, specificity of 0.29, F1 score of 0.29, and accuracy of 0.96. CONCLUSIONS Machine learning algorithms were able to identify FAS risk with a prediction performance higher than that of previous models among pregnant drinkers. For small training sets, which are common with FAS, boosting mechanisms like CatBoost may help alleviate certain problems associated with data imbalances and difficulties in optimization or generalization.
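Because the F1 score is the harmonic mean of precision and recall, the reported precision (0.50) and F1 score (0.29) together imply a recall value, provided both were measured at the same decision threshold on the same test set. That is an assumption, since the abstract does not say so. The Python sketch below performs the inversion; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2.0 * precision - f1)


# Values quoted in the abstract above.
precision = 0.50
f1 = 0.29

recall = implied_recall(precision, f1)
print(f"implied recall: {recall:.3f}")
print(f"round-trip F1:  {f1_score(precision, recall):.3f}")
```

Under that assumption the implied recall is about 0.20, that is, roughly one in five true FAS cases in the test split would be flagged.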
Affiliation(s)
- Sarah Soyeon Oh
- Department of Social and Behavioral Sciences, Harvard TH Chan School of Public Health, Boston, MA, United States
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Irene Kuang
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Hyewon Jeong
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Jin-Yeop Song
- Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Boyu Ren
- Department of Psychiatry, Harvard Medical School, Boston, MA, United States
| | - Jong Youn Moon
- Artificial Intelligence and Big-Data Convergence Center, Gil Medical Center, Gachon University College of Medicine, Incheon, Republic of Korea
| | - Eun-Cheol Park
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Ichiro Kawachi
- Department of Social and Behavioral Sciences, Harvard TH Chan School of Public Health, Boston, MA, United States
44
Castillo-Campos L, Velázquez-Libera JL, Caballero J. Computational study of the binding orientation and affinity of noncovalent inhibitors of the papain-like protease (PLpro) from SARS-CoV-1 considering the protein flexibility by using molecular dynamics and cross-docking. Front Mol Biosci 2023; 10:1215499. [PMID: 37426421 PMCID: PMC10326900 DOI: 10.3389/fmolb.2023.1215499] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/03/2023] [Accepted: 06/12/2023] [Indexed: 07/11/2023] Open
Abstract
The papain-like protease (PLpro) from zoonotic coronaviruses (CoVs) has been identified as a target with an essential role in viral respiratory diseases caused by Severe Acute Respiratory Syndrome-associated coronaviruses (SARS-CoVs). The design of PLpro inhibitors has been proposed as an alternative to developing potential drugs against this disease. In this work, 67 naphthalene-derived compounds as noncovalent PLpro inhibitors were studied using molecular modeling methods. Structural characteristics of the bioactive conformations of these inhibitors and their interactions at the SARS-CoV-1 PLpro binding site were reported here in detail, taking into account the flexibility of the protein residues. Firstly, a molecular docking protocol was used to obtain the orientations of the inhibitors. After this, the orientations were compared, and the recurrent interactions between the PLpro residues and ligand chemical groups were described (with LigRMSD and interaction fingerprints methods). In addition, efforts were made to find correlations between docking energy values and experimentally determined binding affinities. For this, the PLpro was sampled by using Gaussian Accelerated Molecular Dynamics (GaMD), generating multiple conformations of the binding site. Diverse protein conformations were selected and a cross-docking experiment was performed, yielding models of the 67 naphthalene-derived compounds adopting different binding modes. Representative complexes for each ligand were selected to obtain the highest correlation between docking energies and activities. A good correlation (R² = 0.948) was found when this flexible docking protocol was performed.
Affiliation(s)
| | | | - Julio Caballero
- Centro de Bioinformática, Simulación y Modelado (CBSM), Facultad de Ingeniería, Universidad de Talca, Talca, Chile
45
Bachelot G, Dhombres F, Sermondade N, Haj Hamid R, Berthaut I, Frydman V, Prades M, Kolanska K, Selleret L, Mathieu-D'Argent E, Rivet-Danon D, Levy R, Lamazière A, Dupont C. A Machine Learning Approach for the Prediction of Testicular Sperm Extraction in Nonobstructive Azoospermia: Algorithm Development and Validation Study. J Med Internet Res 2023; 25:e44047. [PMID: 37342078 DOI: 10.2196/44047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/04/2022] [Revised: 02/19/2023] [Accepted: 04/07/2023] [Indexed: 06/22/2023] Open
Abstract
BACKGROUND Testicular sperm extraction (TESE) is an essential therapeutic tool for the management of male infertility. However, it is an invasive procedure with a success rate up to 50%. To date, no model based on clinical and laboratory parameters is sufficiently powerful to accurately predict the success of sperm retrieval in TESE. OBJECTIVE The aim of this study is to compare a wide range of predictive models under similar conditions for TESE outcomes in patients with nonobstructive azoospermia (NOA) to identify the correct mathematical approach to apply, most appropriate study size, and relevance of the input biomarkers. METHODS We analyzed 201 patients who underwent TESE at Tenon Hospital (Assistance Publique-Hôpitaux de Paris, Sorbonne University, Paris), distributed in a retrospective training cohort of 175 patients (January 2012 to April 2021) and a prospective testing cohort (May 2021 to December 2021) of 26 patients. Preoperative data (according to the French standard exploration of male infertility, 16 variables) including urogenital history, hormonal data, genetic data, and TESE outcomes (representing the target variable) were collected. A TESE was considered positive if we obtained sufficient spermatozoa for intracytoplasmic sperm injection. After preprocessing the raw data, 8 machine learning (ML) models were trained and optimized on the retrospective training cohort data set: The hyperparameter tuning was performed by random search. Finally, the prospective testing cohort data set was used for the model evaluation. The metrics used to evaluate and compare the models were the following: sensitivity, specificity, area under the receiver operating characteristic curve (AUC-ROC), and accuracy. The importance of each variable in the model was assessed using the permutation feature importance technique, and the optimal number of patients to include in the study was assessed using the learning curve. 
RESULTS The ensemble models, based on decision trees, showed the best performance, especially the random forest model, which yielded the following results: AUC=0.90, sensitivity=100%, and specificity=69.2%. Furthermore, a study size of 120 patients seemed sufficient to properly exploit the preoperative data in the modeling process, since increasing the number of patients beyond 120 during model training did not bring any performance improvement. Furthermore, inhibin B and a history of varicoceles exhibited the highest predictive capacity. CONCLUSIONS An ML algorithm based on an appropriate approach can predict successful sperm retrieval in men with NOA undergoing TESE, with promising performance. However, although this study is consistent with the first step of this process, a subsequent formal prospective multicentric validation study should be undertaken before any clinical applications. As future work, we consider the use of recent and clinically relevant data sets (including seminal plasma biomarkers, especially noncoding RNAs, as markers of residual spermatogenesis in NOA patients) to improve our results even more.
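The reported specificity of 69.2% can be related back to the size of the prospective testing cohort. Assuming the figure was computed on that 26-patient cohort (the abstract suggests this but does not give the confusion matrix), the Python sketch below enumerates which whole-number fractions round to 69.2%. The helper name is hypothetical.

```python
def fractions_matching(percent: float, max_denominator: int):
    """List (numerator, denominator) pairs whose percentage rounds to `percent` (1 decimal)."""
    matches = []
    for denom in range(1, max_denominator + 1):
        for num in range(denom + 1):
            if round(100.0 * num / denom, 1) == percent:
                matches.append((num, denom))
    return matches


# Sensitivity is reported, so at least 1 of the 26 test patients was positive;
# the number of negative patients is therefore at most 25.
candidates = fractions_matching(69.2, max_denominator=25)
print(candidates)
```

Under these assumptions the only consistent reading is 9 true negatives out of 13 negative patients, which underlines how few cases the specificity estimate rests on.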
Affiliation(s)
- Guillaume Bachelot
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
- Laboratory in Medical Informatics and Knowledge Engineering in e-Health, L'Institut national de la santé et de la recherche médicale, Sorbonne University, Paris, France
| | - Ferdinand Dhombres
- Laboratory in Medical Informatics and Knowledge Engineering in e-Health, L'Institut national de la santé et de la recherche médicale, Sorbonne University, Paris, France
| | - Nathalie Sermondade
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Rahaf Haj Hamid
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Isabelle Berthaut
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Valentine Frydman
- Service d'Urologie, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Marie Prades
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Kamila Kolanska
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Gynécologie Obstétrique et Médecine de la Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Lise Selleret
- Service d'Urologie, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Emmanuelle Mathieu-D'Argent
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Gynécologie Obstétrique et Médecine de la Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Diane Rivet-Danon
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Rachel Levy
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Antonin Lamazière
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Département de Métabolomique Clinique, Hôpital Saint Antoine, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
| | - Charlotte Dupont
- Saint Antoine Research Center, L'Institut national de la santé et de la recherche médicale UMR 938, Sorbonne Université, Paris, France
- Service de Biologie de La Reproduction, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Sorbonne Université, Paris, France
46
Cascalheira CJ, Flinn RE, Zhao Y, Klooster D, Laprade D, Hamdi SM, Scheer JR, Gonzalez A, Lund EM, Gomez IN, Saha K, De Choudhury M. Models of Gender Dysphoria Using Social Media Data for Use in Technology-Delivered Interventions: Machine Learning and Natural Language Processing Validation Study. JMIR Form Res 2023; 7:e47256. [PMID: 37327053 DOI: 10.2196/47256] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/13/2023] [Revised: 04/28/2023] [Accepted: 05/15/2023] [Indexed: 06/17/2023] Open
Abstract
BACKGROUND The optimal treatment for gender dysphoria is medical intervention, but many transgender and nonbinary people face significant treatment barriers when seeking help for gender dysphoria. When untreated, gender dysphoria is associated with depression, anxiety, suicidality, and substance misuse. Technology-delivered interventions for transgender and nonbinary people can be used discreetly, safely, and flexibly, thereby reducing treatment barriers and increasing access to psychological interventions to manage distress that accompanies gender dysphoria. Technology-delivered interventions are beginning to incorporate machine learning (ML) and natural language processing (NLP) to automate intervention components and tailor intervention content. A critical step in using ML and NLP in technology-delivered interventions is demonstrating how accurately these methods model clinical constructs. OBJECTIVE This study aimed to determine the preliminary effectiveness of modeling gender dysphoria with ML and NLP, using transgender and nonbinary people's social media data. METHODS Overall, 6 ML models and 949 NLP-generated independent variables were used to model gender dysphoria from the text data of 1573 Reddit (Reddit Inc) posts created on transgender- and nonbinary-specific web-based forums. After developing a codebook grounded in clinical science, a research team of clinicians and students experienced in working with transgender and nonbinary clients used qualitative content analysis to determine whether gender dysphoria was present in each Reddit post (ie, the dependent variable). NLP (eg, n-grams, Linguistic Inquiry and Word Count, word embedding, sentiment, and transfer learning) was used to transform the linguistic content of each post into predictors for ML algorithms. A k-fold cross-validation was performed. Hyperparameters were tuned with random search. 
Feature selection was performed to demonstrate the relative importance of each NLP-generated independent variable in predicting gender dysphoria. Misclassified posts were analyzed to improve future modeling of gender dysphoria. RESULTS Results indicated that a supervised ML algorithm (ie, optimized extreme gradient boosting [XGBoost]) modeled gender dysphoria with a high degree of accuracy (0.84), precision (0.83), and speed (1.23 seconds). Of the NLP-generated independent variables, Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) clinical keywords (eg, dysphoria and disorder) were most predictive of gender dysphoria. Misclassifications of gender dysphoria were common in posts that expressed uncertainty, featured a stressful experience unrelated to gender dysphoria, were incorrectly coded, expressed insufficient linguistic markers of gender dysphoria, described past experiences of gender dysphoria, showed evidence of identity exploration, expressed aspects of human sexuality unrelated to gender dysphoria, described socially based gender dysphoria, expressed strong affective or cognitive reactions unrelated to gender dysphoria, or discussed body image. CONCLUSIONS Findings suggest that ML- and NLP-based models of gender dysphoria have significant potential to be integrated into technology-delivered interventions. The results contribute to the growing evidence on the importance of incorporating ML and NLP designs in clinical science, especially when studying marginalized populations.
Affiliation(s)
- Cory J Cascalheira
- Department of Counseling & Educational Psychology, New Mexico State University, Las Cruces, NM, United States
- Department of Psychology, Syracuse University, Syracuse, NY, United States
| | - Ryan E Flinn
- Augusta University, Augusta, GA, United States
- University of North Dakota, Grand Forks, ND, United States
| | - Yuxuan Zhao
- Department of Counseling & Educational Psychology, New Mexico State University, Las Cruces, NM, United States
| | | | - Danica Laprade
- Northern Arizona University, Flagstaff, AZ, United States
| | - Shah Muhammad Hamdi
- Department of Computer Science, Utah State University, Logan, UT, United States
| | - Jillian R Scheer
- Department of Psychology, Syracuse University, Syracuse, NY, United States
| | | | - Emily M Lund
- University of Alabama, Tuscaloosa, AL, United States
- Ewha Women's University, Seoul, Republic of Korea
| | - Ivan N Gomez
- Department of Counseling & Educational Psychology, New Mexico State University, Las Cruces, NM, United States
| | - Koustuv Saha
- University of Illinois at Urbana-Champaign, Champaign, IL, United States
47
Ren Y, Wu D, Tong Y, López-DeFede A, Gareau S. Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies. J Med Internet Res 2023; 25:e44081. [PMID: 37256674 PMCID: PMC10267797 DOI: 10.2196/44081] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/05/2022] [Revised: 03/02/2023] [Accepted: 04/04/2023] [Indexed: 06/01/2023] Open
Abstract
BACKGROUND Low birthweight (LBW) is a leading cause of neonatal mortality in the United States and a major causative factor of adverse health effects in newborns. Identifying high-risk patients early in prenatal care is crucial to preventing adverse outcomes. Previous studies have proposed various machine learning (ML) models for the LBW prediction task, but they were limited by small and imbalanced data sets. Some authors attempted to address this through different data rebalancing methods. However, most of their reported performances did not reflect the models' actual performance in real-life scenarios. To date, few studies have successfully benchmarked the performance of ML models in maternal health; thus, it is critical to establish benchmarks to advance ML use and subsequently improve birth outcomes. OBJECTIVE This study aimed to establish several key benchmarking ML models to predict LBW and systematically apply different rebalancing optimization methods to a large-scale and extremely imbalanced all-payer hospital record data set that connects mother and baby data at a state level in the United States. We also performed feature importance analysis to identify the features that contribute most to the LBW classification task, which can aid in targeted intervention. METHODS Our large data set consisted of 266,687 birth records across 6 years, and 8.63% (n=23,019) of records were labeled as LBW. To set up benchmarking ML models to predict LBW, we applied 7 classic ML models (ie, logistic regression, naive Bayes, random forest, extreme gradient boosting, adaptive boosting, multilayer perceptron, and sequential artificial neural network) while using 4 different data rebalancing methods: random undersampling, random oversampling, synthetic minority oversampling technique, and weight rebalancing.
Owing to ethical considerations, in addition to ML evaluation metrics, we primarily used recall to evaluate model performance, that is, the proportion of actual LBW cases that were correctly predicted, as false negative health care outcomes could be fatal. We further analyzed feature importance to explore the degree to which each feature contributed to ML model prediction among our best-performing models. RESULTS We found that extreme gradient boosting achieved the highest recall score (0.70) using the weight rebalancing method. Our results showed that various data rebalancing methods improved the prediction performance of the LBW group substantially. From the feature importance analysis, maternal race, age, payment source, sum of predelivery emergency department and inpatient hospitalizations, predelivery disease profile, and different social vulnerability index components were important risk factors associated with LBW. CONCLUSIONS Our findings establish useful ML benchmarks to improve birth outcomes in the maternal health domain. They are informative for identifying the minority class (ie, LBW) in an extremely imbalanced data set, which may guide the development of personalized LBW early prevention, clinical interventions, and statewide maternal and infant health policy changes.
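The recall metric and the idea of weight rebalancing described in this abstract can be illustrated with a short sketch. This is not the authors' code: the abstract does not say how class weights were chosen, so the negatives-to-positives ratio used here (the heuristic commonly passed to XGBoost as `scale_pos_weight`) is an assumption, and the toy labels are hypothetical. Only the record counts come from the abstract.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): share of actual positives predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def positive_class_weight(n_total, n_positive):
    """Negatives-to-positives ratio (assumed heuristic, not taken from the paper)."""
    return (n_total - n_positive) / n_positive


# Counts reported in the abstract: 266,687 records, 23,019 labeled LBW.
lbw_share = 23_019 / 266_687
weight = positive_class_weight(266_687, 23_019)

# Hypothetical toy labels (1 = LBW, 0 = not LBW) to show the metric.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
toy_recall = recall(y_true, y_pred)

print(f"LBW share: {lbw_share:.2%}")
print(f"positive-class weight: {weight:.2f}")
print(f"toy recall: {toy_recall:.2f}")
```

Under this heuristic, each LBW record would count roughly ten times as much as a non-LBW record during training, which pushes the model toward fewer false negatives at the cost of more false positives.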
Affiliation(s)
- Yang Ren
  - Department of Computer Science, University of South Carolina, Columbia, SC, United States
- Dezhi Wu
  - Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States
- Yan Tong
  - Department of Computer Science, University of South Carolina, Columbia, SC, United States
- Ana López-DeFede
  - The Institute of Families in Society, University of South Carolina, Columbia, SC, United States
- Sarah Gareau
  - The Institute of Families in Society, University of South Carolina, Columbia, SC, United States
48
Foufoulas Y, Zacharia E, Dimitropoulos H, Manola N, Ioannidis Y. DETEXA: declarative extensible text exploration and analysis through SQL. International Journal on Digital Libraries 2023:1-13. [PMID: 37361128 PMCID: PMC10170051 DOI: 10.1007/s00799-023-00358-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Revised: 03/13/2023] [Accepted: 03/16/2023] [Indexed: 06/28/2023]
Abstract
Metadata enrichment through text mining techniques is becoming one of the most significant tasks in digital libraries. Due to the exponential increase of open access publications, several new challenges have emerged. Raw data are usually big, unstructured, and come from heterogeneous data sources. In this paper, we introduce a text analysis framework implemented in extended SQL that exploits the scalability characteristics of modern database management systems. The purpose of this framework is to provide the opportunity to build performant end-to-end text mining pipelines which include data harvesting, cleaning, processing, and text analysis at once. SQL is selected due to its declarative nature which offers fast experimentation and the ability to build APIs so that domain experts can edit text mining workflows via easy-to-use graphical interfaces. Our experimental analysis demonstrates that the proposed framework is very effective and achieves significant speedup, up to three times faster, in common use cases compared to other popular approaches.
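The general idea of expressing a text-processing pipeline declaratively in SQL can be sketched with Python's built-in `sqlite3` module and user-defined functions. This is only an illustration of the approach, not DETEXA's actual interface: the table `docs` and the functions `normalize` and `count_term` are hypothetical names introduced here.

```python
import re
import sqlite3


def normalize(text):
    """Cleaning step: lowercase and keep only alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def count_term(text, term):
    """Analysis step: count whole-token occurrences of `term`."""
    return text.split().count(term)


conn = sqlite3.connect(":memory:")
# Register the Python functions so they can be called from SQL.
conn.create_function("normalize", 1, normalize)
conn.create_function("count_term", 2, count_term)

conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("Open access publications; OPEN data!",),
        ("Metadata enrichment through text mining.",),
        ("Open metadata, open science.",),
    ],
)

# Cleaning, analysis, filtering, and ranking in a single declarative query.
rows = conn.execute(
    """
    SELECT id, hits
    FROM (SELECT id, count_term(normalize(body), 'open') AS hits FROM docs)
    WHERE hits > 0
    ORDER BY hits DESC, id
    """
).fetchall()

print(rows)
```

Because the workflow is a query rather than a script, changing a step (for example, swapping the term or adding a filter) means editing the SQL text, which is the property the abstract attributes to a declarative design.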
Affiliation(s)
- Yannis Foufoulas
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, 15784 Ilisia, Greece
  - Athena Research Center, Artemidos 6 & Epidavrou, 15125 Marousi, Greece
- Eleni Zacharia
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, 15784 Ilisia, Greece
  - Athena Research Center, Artemidos 6 & Epidavrou, 15125 Marousi, Greece
- Harry Dimitropoulos
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, 15784 Ilisia, Greece
  - Athena Research Center, Artemidos 6 & Epidavrou, 15125 Marousi, Greece
- Natalia Manola
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, 15784 Ilisia, Greece
  - Athena Research Center, Artemidos 6 & Epidavrou, 15125 Marousi, Greece
- Yannis Ioannidis
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, 15784 Ilisia, Greece
  - Athena Research Center, Artemidos 6 & Epidavrou, 15125 Marousi, Greece
49
Wu M, Qi C, Chen Q, Liu H. Evaluating the metal recovery potential of coal fly ash based on sequential extraction and machine learning. Environmental Research 2023; 224:115546. [PMID: 36828251 DOI: 10.1016/j.envres.2023.115546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 02/14/2023] [Accepted: 02/21/2023] [Indexed: 06/18/2023]
Abstract
Given the depletion of metal resources and the potential leaching of toxic elements from solid waste, secondary recovery of metal from solid waste is essential to achieve coordinated development of resources and the environment. In this study, hybrid models combining the gradient boosting decision tree and particle swarm optimization algorithm were constructed and compared based on two different datasets. Additionally, a new, quantitative evaluation index for metal recovery potential (MRP) was proposed. The results showed that the model constructed using more elemental properties could more accurately predict metal fractions in coal fly ash (CFA), with an R² value of 0.88 achieved on the testing set. The MRP index revealed that the DAT sample had the greatest recovery potential (MRP = 43,311.70). Ca was easier to recover due to its high concentration and presence mostly in soluble fractions. Model post-analysis highlighted that the elemental properties and total concentrations generally exerted a greater influence on the metal fractions. The innovative evaluation strategy based on machine learning and sequential extraction presented in this work provides an important reference for maximizing metal recovery from CFA to achieve environmental and economic benefits with the goal of sustainable development.
Affiliation(s)
- Mengting Wu
  - School of Resources and Safety Engineering, Central South University, Changsha, 410083, China
- Chongchong Qi
  - School of Resources and Safety Engineering, Central South University, Changsha, 410083, China
  - School of Metallurgy and Environment, Central South University, Changsha, 410083, China
- Qiusong Chen
  - School of Resources and Safety Engineering, Central South University, Changsha, 410083, China
- Hui Liu
  - School of Metallurgy and Environment, Central South University, Changsha, 410083, China
50
Carrillo A, Betancort M. Differences of Training Structures on Stimulus Class Formation in Computational Agents. Multimodal Technologies and Interaction 2023. [DOI: 10.3390/mti7040039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/08/2023] Open
Abstract
Stimulus Equivalence (SE) is a behavioural phenomenon in which organisms respond functionally to stimuli without explicit training. SE provides a framework in the experimental analysis of behaviour to study language, symbolic behaviour, and cognition. It is also a frequently discussed matter in interdisciplinary research, linking behaviour analysis with linguistics and neuroscience. Previous research has attempted to replicate SE with computational agents, mostly based on Artificial Neural Network (ANN) models. The aim of this paper was to analyse the effect of three Training Structures (TSs) on stimulus class formation in a simulation with ANNs as computational agents performing a classification task, in a matching-to-sample procedure. Twelve simulations were carried out as a product of the implementation of four ANN architectures on the three TSs. SE was not achieved, but two agents showed an emergent response on half of the transitivity test pairs on linear sequence TSs and reflexivity on one member of the class. The results suggested that an ANN with a large enough number of units in a hidden layer can perform a limited number of emergent relations within specific experimental conditions: reflexivity on B and transitivity on AC, when pairs AB and BC are trained on a three-member stimulus class and tested in a classification task. Reinforcement learning is proposed as the framework for further simulations.
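The emergent relations named in this abstract can be made concrete with a small sketch that enumerates which untrained pairs would be tested when AB and BC are trained on a three-member class. It uses the usual definitions of reflexivity, symmetry, transitivity, and the combined equivalence test; the function name and output layout are hypothetical, the transitivity step covers only one-step chains (enough for three members), and this is not the authors' simulation code.

```python
def derived_relations(trained):
    """List untrained sample-comparison pairs implied by the trained pairs."""
    trained = set(trained)
    stimuli = sorted({s for pair in trained for s in pair})

    # Reflexivity: each stimulus is matched to itself (A-A, B-B, C-C).
    reflexivity = {(s, s) for s in stimuli}
    # Symmetry: trained pairs reversed (B-A, C-B).
    symmetry = {(b, a) for a, b in trained} - trained
    # Transitivity: chain two trained pairs that share a middle stimulus (A-C).
    transitivity = {
        (a, d)
        for a, b in trained
        for c, d in trained
        if b == c and a != d
    } - trained
    # Equivalence (combined test): transitivity reversed (C-A).
    equivalence = {(c, a) for a, c in transitivity}

    return {
        "reflexivity": sorted(reflexivity),
        "symmetry": sorted(symmetry),
        "transitivity": sorted(transitivity),
        "equivalence": sorted(equivalence),
    }


# Linear-sequence training structure from the abstract: AB and BC are trained.
relations = derived_relations([("A", "B"), ("B", "C")])
for name, pairs in relations.items():
    print(name, pairs)
```

In these terms, the result reported in the abstract corresponds to agents producing the pair ("B", "B") from the reflexivity set and ("A", "C") from the transitivity set, but not the full set of derived pairs.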
Affiliation(s)
- Alexis Carrillo
  - Departamento de Psicología Clínica, Psicobiología y Metodología, Campus de Guajara, Universidad de La Laguna, Apartado 456, 38200 San Cristóbal de La Laguna, Spain
- Moisés Betancort
  - Departamento de Psicología Clínica, Psicobiología y Metodología, Campus de Guajara, Universidad de La Laguna, Apartado 456, 38200 San Cristóbal de La Laguna, Spain