1
Prediction and analysis of risk factors for diabetic retinopathy based on machine learning and interpretable models. Heliyon 2024; 10:e29497. [PMID: 38699007 PMCID: PMC11064081 DOI: 10.1016/j.heliyon.2024.e29497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 04/09/2024] [Accepted: 04/09/2024] [Indexed: 05/05/2024] Open
Abstract
Objective Diabetic retinopathy is one of the major complications of diabetes. In this study, a diabetic retinopathy risk prediction model integrating machine learning models and SHAP was established to increase the accuracy of risk prediction for diabetic retinopathy, explain the rationale behind the model's predictions and improve the reliability of the prediction results. Methods Data were preprocessed for missing values and outliers, features were selected through information gain, a diabetic retinopathy risk prediction model was established using CatBoost, and the outputs of the model were interpreted using SHAP. Results One thousand early warning records of diabetes complications, taken from the diabetes complication early warning dataset of the National Clinical Medical Sciences Data Center, were used in this study. The CatBoost-based model for diabetic retinopathy prediction performed best in the comparative model test. ALB_CR, HbA1c, UPR_24, NEPHROPATHY and SCR were positively correlated with diabetic retinopathy, while CP, HB, ALB, DBILI and CRP were negatively correlated with diabetic retinopathy. The relationships between the HEIGHT, WEIGHT and ESR features and diabetic retinopathy were not significant. Conclusion The risk factors for diabetic retinopathy include poor renal function, elevated blood glucose level, liver disease, hematonosis and dysarteriotony, among others. Diabetic retinopathy can be prevented by monitoring and effectively controlling the relevant indices. In this study, the relationships of influence between the features were also analyzed to further explore the potential factors of diabetic retinopathy, which can provide new methods and new ideas for the early prevention and clinical diagnosis of diabetic retinopathy.
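The feature-selection step named in the Methods (information gain) can be illustrated with a minimal sketch. It assumes discrete feature values and uses hypothetical toy data; the function names and the example features are illustrative assumptions, not the authors' code or variables.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels within each feature value, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = retinopathy, 0 = no retinopathy.
labels = [1, 1, 0, 0]
features = {
    "informative": ["high", "high", "low", "low"],  # separates the classes
    "uninformative": ["a", "b", "a", "b"],          # carries no class signal
}

# Rank features from most to least informative.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
```

Continuous laboratory values would have to be binned before a discrete measure like this applies; how the study handled that is not stated in the abstract.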
2
Predicting the catalytic activities of transition metal (Cr, Fe, Co, Ni) complexes towards ethylene polymerization by machine learning. J Comput Chem 2024; 45:798-803. [PMID: 38126933 DOI: 10.1002/jcc.27291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 12/02/2023] [Accepted: 12/09/2023] [Indexed: 12/23/2023]
Abstract
The study aims to apply machine learning (ML) methods to build an intelligent prediction system for the catalytic activities of a relatively large dataset of 1056 transition metal complex precatalysts in ethylene polymerization. Among 14 different algorithms, the CatBoost ensemble model provides the best prediction, with correlation coefficient (R²) values of 0.999 for the training set and 0.834 for the external test set. The interpretation of the obtained model indicates that the catalytic activity is highly correlated with the number of atoms, the degree of conjugation in the ligand framework, and the charge distributions. Correspondingly, 10 novel complexes are designed and predicted to have higher catalytic activities. This work shows the potential application of the ML method as a high-precision tool for designing advanced catalysts for ethylene polymerization.
3
Machine learning model for predicting stroke recurrence in adult stroke patients with moyamoya disease and factors of stroke recurrence. Clin Neurol Neurosurg 2024; 242:108308. [PMID: 38733759 DOI: 10.1016/j.clineuro.2024.108308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2023] [Revised: 04/09/2024] [Accepted: 04/27/2024] [Indexed: 05/13/2024]
Abstract
OBJECT This study aimed to build an effective machine learning model for predicting stroke recurrence in adult stroke patients with moyamoya disease (MMD) and to analyze the factors associated with stroke recurrence. METHODS The data for this retrospective study came from the database of the JiangXi Province Medical Big Data Engineering & Technology Research Center. In addition, information on MMD patients admitted to the Second Affiliated Hospital of Nanchang University from January 1st, 2007 to December 31st, 2019 was acquired. A total of 661 patients from the period January 1st, 2007 to February 28th, 2017 formed the training set, while the external validation set comprised 284 patients from the period March 1st, 2017 to December 31st, 2019. First, the characteristics of all subjects were compared between the training set and the external validation set. The key influencing variables were then screened using the Lasso regression algorithm. Next, models for predicting stroke recurrence within 1, 2, and 3 years after the initial stroke were built with five different machine learning algorithms, and all models were externally validated and compared. Lastly, the CatBoost model, which had the best performance, was explained using the SHapley Additive exPlanations (SHAP) interpretation model. RESULT In total, 945 patients with MMD were included, and the recurrence rates of acute stroke within 1, 2, and 3 years after the initial stroke were 11.43% (108/945), 18.94% (179/945), and 23.17% (219/945), respectively. The CatBoost models showed the best prediction performance among all models; their areas under the curve (AUC) for predicting stroke recurrence within 1, 2, and 3 years were 0.794 (0.787, 0.801), 0.813 (0.807, 0.818), and 0.789 (0.783, 0.795), respectively.
As indicated by the SHAP interpretation model, a high Suzuki stage, young adulthood (aged 18-44 years), no surgical treatment, and the presence of an aneurysm were likely to be significantly correlated with stroke recurrence in adult stroke patients with MMD. CONCLUSION In adult stroke patients with MMD, the CatBoost model was effective in predicting stroke recurrence, yielding accurate and reliable predictions. A high Suzuki stage, young adulthood (aged 18-44 years), no surgical treatment, and the presence of an aneurysm are likely to be significantly correlated with stroke recurrence in adult stroke patients with MMD.
4
A machine learning model for predicting acute exacerbation of in-home chronic obstructive pulmonary disease patients. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2024; 246:108005. [PMID: 38354578 DOI: 10.1016/j.cmpb.2023.108005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2023] [Revised: 12/16/2023] [Accepted: 12/31/2023] [Indexed: 02/16/2024]
Abstract
PURPOSE This study utilized intelligent devices to remotely monitor patients with chronic obstructive pulmonary disease (COPD), aiming to construct and evaluate machine learning (ML) models that predict the probability of acute exacerbations of COPD (AECOPD). METHODS Patients diagnosed with COPD Group C/D at our hospital between March 2019 and June 2021 were enrolled in this study. The diagnosis of COPD Group C/D and AECOPD was based on the GOLD 2018 guidelines. We developed a series of machine learning (ML)-based models, including XGBoost, LightGBM, and CatBoost, to predict AECOPD events. These models utilized data collected from portable spirometers and electronic stethoscopes within a five-day time window. The area under the ROC curve (AUC) was used to assess the effectiveness of the models. RESULTS A total of 66 patients were enrolled in COPD groups C/D, with 32 in group C and 34 in group D. Using observational data within a five-day time window, the ML models effectively predict AECOPD events, achieving high AUC scores. Among these models, the CatBoost model exhibited superior performance, boasting the highest AUC score (0.9721, 95 % CI: 0.9623-0.9810). Notably, the boosting tree methods significantly outperformed the time-series based methods, thanks to our feature engineering efforts. A post-hoc analysis of the CatBoost model reveals that features extracted from the electronic stethoscope (e.g., max/min vibration energy) hold more importance than those from the portable spirometer. CONCLUSIONS The tree-based boosting models prove to be effective in predicting AECOPD events in our study. Consequently, these models have the potential to enhance remote monitoring, enable early risk assessment, and inform treatment decisions for homebound patients with chronic COPD.
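The evaluation metric used in this abstract, the area under the ROC curve, can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition in plain Python; the function name and the toy scores are illustrative assumptions and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = exacerbation event, 0 = no event.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cases, which is acceptable for small cohorts; library implementations use a sorting-based equivalent.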
5
Effects of non-landslide sampling strategies on machine learning models in landslide susceptibility mapping. Sci Rep 2024; 14:7201. [PMID: 38532140 DOI: 10.1038/s41598-024-57964-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 03/23/2024] [Indexed: 03/28/2024] Open
Abstract
This study aims to explore the effects of different non-landslide sampling strategies on machine learning models in landslide susceptibility mapping. Non-landslide samples are inherently uncertain, and the selection of non-landslide samples may suffer from issues such as noisy or insufficient regional representations, which can affect the accuracy of the results. In this study, a positive-unlabeled (PU) bagging semi-supervised learning method was introduced for non-landslide sample selection. In addition, buffer control sampling (BCS) and K-means (KM) clustering were applied for comparative analysis. Based on landslide data from Qiaojia County, Yunnan Province, China, collected in 2014, three machine learning models, namely, random forest, support vector machine, and CatBoost, were used for landslide susceptibility mapping. The results show that the quality of samples selected using different non-landslide sampling strategies varies significantly. Overall, the quality of non-landslide samples selected using the PU bagging method is superior, and this method performs best when combined with CatBoost for predicting (AUC = 0.897) landslides in very high and high susceptibility zones (82.14%). Additionally, the KM results indicated overfitting, displaying high accuracy for validation but poor statistical outcomes for zoning. The BCS results were the worst.
6
Predicting superconducting transition temperature through advanced machine learning and innovative feature engineering. Sci Rep 2024; 14:3965. [PMID: 38368476 PMCID: PMC10874381 DOI: 10.1038/s41598-024-54440-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Accepted: 02/13/2024] [Indexed: 02/19/2024] Open
Abstract
Superconductivity is a remarkable phenomenon in condensed matter physics, which comprises a fascinating array of properties expected to revolutionize energy-related technologies and pertinent fundamental research. However, the field faces the challenge of achieving superconductivity at room temperature. In recent years, Artificial Intelligence (AI) approaches have emerged as a promising tool for predicting properties such as the transition temperature (Tc), enabling the rapid screening of large databases to discover new superconducting materials. This study employs the SuperCon dataset, the largest superconducting materials dataset. We then perform various data pre-processing steps to derive the clean DataG dataset, containing 13,022 compounds. In another stage of the study, we apply the novel CatBoost algorithm to predict the transition temperatures of novel superconducting materials. In addition, we developed a package called Jabir, which generates 322 atomic descriptors. We also designed an innovative hybrid method called the Soraya package to select the most critical features from the feature space. These yield R² and RMSE values (0.952 and 6.45 K, respectively) superior to those previously reported in the literature. Finally, as a novel contribution to the field, a web application was designed for predicting and determining the Tc values of superconducting materials.
7
Ensemble Learning Method for the Continuous Decoding of Hand Joint Angles. SENSORS (BASEL, SWITZERLAND) 2024; 24:660. [PMID: 38276352 DOI: 10.3390/s24020660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2023] [Revised: 01/16/2024] [Accepted: 01/18/2024] [Indexed: 01/27/2024]
Abstract
Human-machine interface technology is fundamentally constrained by the dexterity of motion decoding. Simultaneous and proportional control can greatly improve the flexibility and dexterity of smart prostheses. In this research, a new model using ensemble learning to solve the angle decoding problem is proposed. Ultimately, seven models for angle decoding from surface electromyography (sEMG) signals are designed. The kinematics of five angles of the metacarpophalangeal (MCP) joints are estimated using the sEMG recorded during functional tasks. The estimation performance was evaluated through the Pearson correlation coefficient (CC). In this research, the comprehensive model, which combines CatBoost and LightGBM, is the best model for this task, whose average CC value and RMSE are 0.897 and 7.09. The mean of the CC and the mean of the RMSE for all the test scenarios of the subjects' dataset outperform the results of the Gaussian process model, with significant differences. Moreover, the research proposed a whole pipeline that uses ensemble learning to build a high-performance angle decoding system for the hand motion recognition task. Researchers or engineers in this field can quickly find the most suitable ensemble learning model for angle decoding through this process, with fewer parameters and fewer training data requirements than traditional deep learning models. In conclusion, the proposed ensemble learning approach has the potential for simultaneous and proportional control (SPC) of future hand prostheses.
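The two scores reported in this abstract, the Pearson correlation coefficient (CC) and the RMSE between decoded and measured joint angles, can be computed as in the following minimal sketch. The function names and the sample angle values are hypothetical; the abstract does not describe the authors' implementation.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical measured and decoded angles for one joint.
measured = [10.0, 20.0, 30.0, 40.0]
decoded = [12.0, 18.0, 33.0, 41.0]

cc = pearson_cc(measured, decoded)
error = rmse(measured, decoded)
```

The two scores are complementary: CC measures how well the decoded trajectory follows the shape of the measured one, while RMSE also penalizes offset and scale errors.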
8
Diurnal Pain Classification in Critically Ill Patients using Machine Learning on Accelerometry and Analgesic Data. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE WORKSHOPS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2023; 2023:2207-2212. [PMID: 38463539 PMCID: PMC10923604 DOI: 10.1109/bibm58861.2023.10385764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Quantifying pain in patients admitted to intensive care units (ICUs) is challenging due to the increased prevalence of communication barriers in this patient population. Previous research has posited a positive correlation between pain and physical activity in critically ill patients. In this study, we advance this hypothesis by building machine learning classifiers to examine the ability of accelerometer data collected from daily wearables to predict self-reported pain levels experienced by patients in the ICU. We trained multiple Machine Learning (ML) models, including Logistic Regression, CatBoost, and XG-Boost, on statistical features extracted from the accelerometer data combined with previous pain measurements and patient demographics. Following previous studies that showed a change in pain sensitivity in ICU patients at night, we performed the task of pain classification separately for daytime and nighttime pain reports. In the pain versus no-pain classification setting, logistic regression gave the best classifier in daytime (AUC: 0.72, F1-score: 0.72), and CatBoost gave the best classifier at nighttime (AUC: 0.82, F1-score: 0.82). Performance of logistic regression dropped to 0.61 AUC, 0.62 F1-score (mild vs. moderate pain, nighttime), and CatBoost's performance was similarly affected with 0.61 AUC, 0.60 F1-score (moderate vs. severe pain, daytime). The inclusion of analgesic information benefited the classification between moderate and severe pain. SHAP analysis was conducted to find the most significant features in each setting. It assigned the highest importance to accelerometer-related features on all evaluated settings but also showed the contribution of the other features such as age and medications in specific contexts. 
In conclusion, accelerometer data combined with patient demographics and previous pain measurements can be used to screen painful from painless episodes in the ICU and can be combined with analgesic information to provide moderate classification between painful episodes of different severities.
9
An ensemble learning approach for diabetes prediction using boosting techniques. Front Genet 2023; 14:1252159. [PMID: 37953921 PMCID: PMC10639159 DOI: 10.3389/fgene.2023.1252159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/16/2023] [Indexed: 11/14/2023] Open
Abstract
Introduction: Diabetes is considered one of the leading healthcare concerns affecting millions worldwide. Taking appropriate action at the earliest stages of the disease depends on early diabetes prediction and identification. To support healthcare providers for better diagnosis and prognosis of diseases, machine learning has been explored in the healthcare industry in recent years. Methods: To predict diabetes, this research has conducted experiments on five boosting algorithms on the Pima diabetes dataset. The dataset was obtained from the University of California, Irvine (UCI) machine learning repository, which contains several important clinical features. Exploratory data analysis was used to identify the characteristics of the dataset. Moreover, upsampling, normalisation, feature selection, and hyperparameter tuning were employed for predictive analytics. Results: The results were analysed using various statistical/machine learning metrics and k-fold cross-validation techniques. Gradient boosting achieved the greatest accuracy rate of 92.85% among all the classifiers. Precision, recall, f1-score, and receiver operating characteristic (ROC) curves were used to further validate the model. Discussion: The suggested model outperformed the current studies in terms of prediction accuracy, demonstrating its applicability to other diseases with similar predicate indications.
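The k-fold cross-validation mentioned in the Results splits the data into k parts and uses each part once as the held-out set. A minimal sketch of the index bookkeeping follows; it assumes contiguous, unshuffled folds, and the function name is an illustrative assumption (the abstract does not state the value of k or whether folds were shuffled or stratified).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_indices = list(range(start, start + size))
        train_indices = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_indices, test_indices
        start += size


# Example: 10 samples split into 3 folds of sizes 4, 3 and 3.
folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one test fold, so averaging a metric over the folds uses every sample for evaluation once.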
10
A comparative study of machine learning algorithms for predicting domestic violence vulnerability in Liberian women. BMC Womens Health 2023; 23:542. [PMID: 37848839 PMCID: PMC10583348 DOI: 10.1186/s12905-023-02701-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 10/10/2023] [Indexed: 10/19/2023] Open
Abstract
Domestic violence against women is prevalent in Liberia, with nearly half of women reporting physical violence. However, research on the biosocial factors contributing to this issue remains limited. This study aims to predict women's vulnerability to domestic violence using a machine learning approach, leveraging data from the Liberian Demographic and Health Survey (LDHS) conducted in 2019-2020. We employed seven machine learning algorithms to achieve this goal: ANN, KNN, RF, DT, XGBoost, LightGBM, and CatBoost. Our analysis revealed that the LightGBM and RF models achieved the highest accuracy in predicting women's vulnerability to domestic violence in Liberia, with 81% and 82% accuracy rates, respectively. One of the key features identified across multiple algorithms was the number of people who had experienced emotional violence. These findings offer important insights into the underlying characteristics and risk factors associated with domestic violence against women in Liberia. By utilizing machine learning techniques, we can better predict and understand this complex issue, ultimately contributing to the development of more effective prevention and intervention strategies.
11
Machine learning algorithms for prediction of entrapment efficiency in nanomaterials. Methods 2023; 218:133-140. [PMID: 37595853 DOI: 10.1016/j.ymeth.2023.08.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/20/2023] [Revised: 08/13/2023] [Accepted: 08/15/2023] [Indexed: 08/20/2023] Open
Abstract
Exploitation of machine learning in predicting the performance of nanomaterials is a rapidly growing, dynamic area of research. For instance, incorporation of therapeutic cargoes into nanovesicles (i.e., entrapment efficiency) is one of the critical parameters that ensures proper entrapment of drugs in the developed nanosystems. Several factors affect the entrapment efficiency of drugs, and thus multiple assessments are required to ensure drug retention and to reduce cost and time. Supervised machine learning can allow for the construction of algorithms that mine data available from earlier studies to predict the performance of specific types of nanoparticles. Comparative studies that utilize multiple regression algorithms to predict entrapment efficiency in nanomaterials are scarce. Herein, we report on a detailed methodology for prediction of entrapment efficiency in nanomaterials (e.g., niosomes) using different regression algorithms (i.e., CatBoost, linear regression, support vector regression and artificial neural network) to select the model that demonstrates the best performance for estimation of entrapment efficiency. The study concluded that the CatBoost algorithm demonstrated the best performance, with the highest R² score (0.98) and a mean square error below 10⁻⁴. Among the various parameters that play a role in the entrapment efficiency of drugs in niosomes, the results obtained from the CatBoost model revealed that the drug:lipid ratio is the major contributing factor affecting entrapment efficiency, followed by the lipid:surfactant molar ratio. Hence, supervised machine learning may be applied for future selection of the components of niosomes that achieve high entrapment efficiency of drugs while minimizing experimental procedures and cost.
12
A machine learning model for predicting blood concentration of quetiapine in patients with schizophrenia and depression based on real-world data. Br J Clin Pharmacol 2023; 89:2714-2725. [PMID: 37005382 DOI: 10.1111/bcp.15734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/26/2023] [Accepted: 03/25/2023] [Indexed: 04/04/2023] Open
Abstract
AIMS This study aimed to establish a prediction model of quetiapine concentration in patients with schizophrenia and depression, based on real-world data via machine learning techniques, to assist clinical regimen decisions. METHODS A total of 650 cases of quetiapine therapeutic drug monitoring (TDM) data from 483 patients at the First Hospital of Hebei Medical University from 1 November 2019 to 31 August 2022 were included in the study. Univariate analysis and sequential forward selection (SFS) were implemented to screen the important variables influencing quetiapine TDM. After 10-fold cross validation, the algorithm with the optimal model performance was selected for predicting quetiapine TDM among nine models. SHapley Additive exPlanation was applied for model interpretation. RESULTS Four variables (daily dose of quetiapine, type of mental illness, sex and CYP2D6 competitive substrates) were selected through univariate analysis (P < .05) and SFS to establish the models. The CatBoost algorithm, which had the best predictive ability (mean [SD] R² = 0.63 ± 0.02, RMSE = 137.39 ± 10.56, MAE = 103.24 ± 7.23), was chosen for predicting quetiapine TDM among nine models. The mean (SD) accuracy of the predicted TDM within ±30% of the actual TDM was 49.46 ± 3.00%, and that of the recommended therapeutic range (200-750 ng mL⁻¹) was 73.54 ± 8.3%. Compared with the PBPK model in a previous study, the CatBoost model shows slightly higher accuracy within ±100% of the actual value. CONCLUSIONS This work is the first real-world study to predict the blood concentration of quetiapine in patients with schizophrenia and depression using artificial intelligence techniques, which is of significance and value for clinical medication guidance.
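The "accuracy within ±30% of the actual TDM" reported above can be read as the share of predictions whose relative error is at most 30%. The sketch below implements that reading; the interpretation, the function name and the sample concentrations are assumptions made for illustration, since the abstract does not give the exact definition used.

```python
def fraction_within_relative_error(actual, predicted, tolerance=0.30):
    """Share of predictions whose absolute error is at most tolerance * |actual|."""
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


# Hypothetical measured and predicted concentrations (ng/mL).
actual = [100.0, 200.0, 400.0, 500.0]
predicted = [120.0, 300.0, 390.0, 340.0]

# Errors are 20, 100, 10 and 160 against limits of 30, 60, 120 and 150,
# so two of the four predictions fall inside the +/-30% band.
accuracy_30 = fraction_within_relative_error(actual, predicted)
```

Passing `tolerance=1.0` gives the ±100% band that the abstract uses for the comparison with the PBPK model.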
13
Prediction of local tumor progression after microwave ablation for early-stage hepatocellular carcinoma with machine learning. J Cancer Res Ther 2023; 19:978-987. [PMID: 37675726 DOI: 10.4103/jcrt.jcrt_319_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/08/2023]
Abstract
Objectives Local tumor progression (LTP) is a major constraint for achieving technical success in microwave ablation (MWA) for the treatment of early-stage hepatocellular carcinoma (EHCC). This study aims to develop machine learning (ML)-based predictive models for LTP after initial MWA in EHCC. Materials and Methods A total of 607 treatment-naïve EHCC patients (mean ± standard deviation [SD] age, 57.4 ± 10.8 years) with 934 tumors according to the Milan criteria who subsequently underwent MWA between August 2009 and January 2016 were enrolled. During the same period, 299 patients were assigned to the external validation datasets. To identify risk factors of LTP after MWA, clinicopathological data and ablation parameters were collected. Predictive models were developed according to 21 variables using four ML algorithms and evaluated based on the area under the receiver operating characteristic curve (AUC) with 95% confidence intervals (CIs). Results After a median follow-up time of 28.7 months (range, 7.6-110.5 months), 6.9% (42/607) of patients had confirmed LTP in the training dataset. The tumor size and number were significantly related to LTP. The AUCs of the four models ranged from 0.791 to 0.898. The best performance (AUC: 0.898, 95% CI: 0.842-0.954; SD: 0.028) occurred when nine variables were introduced to the CatBoost algorithm. According to the feature selection algorithms, the top six predictors were tumor number, albumin, alpha-fetoprotein, tumor size, age, and international normalized ratio. Conclusions Out of the four ML models, the CatBoost model performed best, and reasonable and precise ablation protocols will significantly reduce LTP.
14
Developing a nomogram for predicting depression in diabetic patients after COVID-19 using machine learning. Front Public Health 2023; 11:1150818. [PMID: 37533521 PMCID: PMC10390766 DOI: 10.3389/fpubh.2023.1150818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Accepted: 04/10/2023] [Indexed: 08/04/2023] Open
Abstract
Objective This study identified major risk factors for depression in community diabetic patients using machine learning techniques and developed predictive models for predicting the high-risk group for depression in diabetic patients based on multiple risk factors. Methods This study analyzed 26,829 adults living in the community who were diagnosed with diabetes by a doctor. The prevalence of a depressive disorder was the dependent variable in this study. This study developed a model for predicting diabetic depression using multiple logistic regression, which corrected all confounding factors in order to identify the relationship (influence) of predictive factors for diabetic depression by entering the top nine variables with high importance, which were identified in CatBoost. Results The prevalence of depression was 22.4% (n = 6,001). This study calculated the importance of factors related to depression in diabetic patients living in South Korean community using CatBoost to find that the top nine variables with high importance were gender, smoking status, changes in drinking before and after the COVID-19 pandemic, changes in smoking before and after the COVID-19 pandemic, subjective health, concern about economic loss due to the COVID-19 pandemic, changes in sleeping hours due to the COVID-19 pandemic, economic activity, and the number of people you can ask for help in a disaster situation such as COVID-19 infection. Conclusion It is necessary to identify the high-risk group for diabetes and depression at an early stage, while considering multiple risk factors, and to seek a personalized psychological support system at the primary medical level, which can improve their mental health.
15
Predicting plasma concentration of quetiapine in patients with depression using machine learning techniques based on real-world evidence. Expert Rev Clin Pharmacol 2023; 16:741-750. [PMID: 37466101 DOI: 10.1080/17512433.2023.2238604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 06/19/2023] [Accepted: 07/13/2023] [Indexed: 07/20/2023]
Abstract
OBJECTIVES We develop a model for predicting quetiapine levels in patients with depression, using machine learning to support decisions on clinical regimens. METHODS Inpatients diagnosed with depression at the First Hospital of Hebei Medical University from 1 November 2019, to 31 August were enrolled. The ratio of training cohort to testing cohort was fixed at 80%:20% for the whole dataset. Univariate analysis was executed on all information to screen the important variables influencing quetiapine TDM. The prediction abilities of nine machine learning and deep learning algorithms were compared. The prediction model was created using an algorithm with better model performance, and the model's interpretation was done using the SHapley Additive exPlanation. RESULTS There were 333 individuals and 412 cases of quetiapine TDM included in the study. Six significant variables were selected to establish the individualized medication model. A quetiapine concentration prediction model was created through CatBoost. In the testing cohort, the projected TDM's accuracy was 61.45%. The prediction accuracy of quetiapine concentration within the effective range (200-750 ng/mL) was 75.47%. CONCLUSIONS This study predicts the plasma concentration of quetiapine in depression patients by machine learning, which is meaningful for the clinical medication guidance.
16
On the Applicability of Quantum Machine Learning. ENTROPY (BASEL, SWITZERLAND) 2023; 25:992. [PMID: 37509939 PMCID: PMC10377777 DOI: 10.3390/e25070992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 06/22/2023] [Accepted: 06/26/2023] [Indexed: 07/30/2023]
Abstract
In this article, we investigate the applicability of quantum machine learning for classification tasks using two quantum classifiers from the Qiskit Python environment: the variational quantum circuit (VQC) and the quantum kernel estimator (QKE). We provide a first evaluation of the performance of these classifiers when using a hyperparameter search on six widely known and publicly available benchmark datasets and analyze how their performance varies with the number of samples on two artificially generated test classification datasets. As quantum machine learning is based on unitary transformations, this paper explores data structures and application fields that could be particularly suitable for quantum advantages. To this end, this paper introduces a novel dataset based on concepts from quantum mechanics using the exponential map of a Lie algebra. This dataset will be made publicly available and contributes to the empirical evaluation of quantum supremacy. We further compared the performance of VQC and QKE on six widely applicable datasets to contextualize our results. Our results demonstrate that the VQC and QKE perform better than basic machine learning algorithms, such as advanced linear regression models (Ridge and Lasso). However, they do not match the accuracy and runtime performance of sophisticated modern boosting classifiers such as XGBoost, LightGBM, or CatBoost. Therefore, we conclude that while quantum machine learning algorithms have the potential to surpass classical machine learning methods in the future, especially when physical quantum infrastructure becomes widely available, they currently lag behind classical approaches. Our investigations also show that classical machine learning approaches have superior performance in classifying datasets based on group structures, compared to quantum approaches that particularly use unitary processes.
Furthermore, our findings highlight the significant impact of different quantum simulators, feature maps, and quantum circuits on the performance of the employed quantum estimators. This observation emphasizes the need for researchers to provide detailed explanations of their hyperparameter choices for quantum machine learning algorithms, as this aspect is currently overlooked in many studies within the field. To facilitate further research in this area and ensure the transparency of our study, we have made the complete code available in a linked GitHub repository.
|
17
|
Evidence-Based Prediction of Cellular Toxicity for Amorphous Silica Nanoparticles. ACS NANO 2023. [PMID: 37254442 DOI: 10.1021/acsnano.2c11968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Developing a generalized model for a robust prediction of nanotoxicity is critical for designing safe nanoparticles. However, complex toxicity mechanisms of nanoparticles in biological environments, such as biomolecular corona formation, prevent a reliable nanotoxicity prediction. This is exacerbated by the potential evaluation bias caused by internal validation, which is not fully appreciated. Herein, we propose an evidence-based prediction method for distinguishing between cytotoxic and noncytotoxic nanoparticles at a given condition by uniting literature data mining and machine learning. We illustrate the proposed method for amorphous silica nanoparticles (SiO2-NPs). SiO2-NPs are currently considered a safety concern; however, they are still widely produced and used in various consumer products. We generated the most diverse attributes of SiO2-NP cellular toxicity to date, using >100 publications, and built predictive models, with algorithms ranging from linear to nonlinear (deep neural network, kernel, and tree-based) classifiers. These models were validated using internal (4124-sample) and external (905-sample) data sets. The resultant categorical boosting (CatBoost) model outperformed other algorithms. We then identified 13 key attributes, including concentration, serum, cell, size, time, surface, and assay type, which can explain SiO2-NP toxicity, using the Shapley Additive exPlanation values in the CatBoost model. The serum attribute underscores the importance of nanoparticle-corona complexes for nanotoxicity prediction. We further show that internal validation does not guarantee generalizability. In general, safe SiO2-NPs can be obtained by modifying their surfaces and using low concentrations. Our work provides a strategy for predicting and explaining the toxicity of any type of engineered nanoparticles in real-world practice.
|
18
|
Sugarbeet Seed Germination Prediction Using Hyperspectral Imaging Information Fusion. APPLIED SPECTROSCOPY 2023:37028231171908. [PMID: 37246428 DOI: 10.1177/00037028231171908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
Germination rate is important for seed selection, planting, and quality. In this study, hyperspectral imaging technology integrated with germination tests was applied for feature association analysis and germination performance prediction of sugarbeet seeds, and a nondestructive prediction method for sugarbeet seed germination was proposed. Hyperspectral imaging (HSI), combined with binarization, morphology, and contour extraction, was applied as a nondestructive and accurate technique to achieve single-seed image segmentation. After a comparative analysis of nine spectral pretreatment methods, SNV + 1D was used to process the average spectrum of sugarbeet seeds. Fourteen characteristic wavelengths were obtained by the Kullback-Leibler (KL) divergence as the spectral characteristics of sugarbeet seeds. Principal component analysis (PCA) and material properties verified the validity of the extracted characteristic wavelengths. Six image features of the hyperspectral image of a single seed were extracted based on the gray-level co-occurrence matrix (GLCM). The spectral features, image features, and fusion features were each used to establish partial least squares discriminant analysis (PLS-DA), CatBoost, and support vector machine radial-basis function (SVM-RBF) models to predict germination. The results showed that the prediction effect of fusion features was better than that of spectral features or image features alone. Compared with the other models, the CatBoost model achieved a prediction accuracy of up to 93.52%. The results indicated that, based on HSI and fusion features, the prediction of sugarbeet seed germination was more accurate and nondestructive.
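The "SNV + 1D" pretreatment named in this abstract can be illustrated with a short sketch. It assumes that SNV means the standard normal variate transform and that 1D means a first derivative, approximated here by a simple finite difference; the abstract does not say how the derivative was computed (a smoothed derivative is also common), and the reflectance values below are invented.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale it by
    its own standard deviation."""
    mean = statistics.fmean(spectrum)
    # Sample standard deviation; this choice is an assumption.
    std = statistics.stdev(spectrum)
    return [(value - mean) / std for value in spectrum]


def first_derivative(spectrum):
    """First-order finite difference between neighbouring wavelengths."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def snv_then_1d(spectrum):
    """Apply SNV first, then the first derivative."""
    return first_derivative(snv(spectrum))


# Invented average reflectance values for one seed.
reflectance = [0.10, 0.20, 0.40, 0.70]
print(snv_then_1d(reflectance))
```

SNV removes per-spectrum offset and scale differences, while the derivative emphasises changes in slope, so the combination targets both scatter effects and baseline drift.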
|
19
|
m6Aminer: Predicting the m6Am Sites on mRNA by Fusing Multiple Sequence-Derived Features into a CatBoost-Based Classifier. Int J Mol Sci 2023; 24:ijms24097878. [PMID: 37175594 PMCID: PMC10177809 DOI: 10.3390/ijms24097878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 04/20/2023] [Accepted: 04/24/2023] [Indexed: 05/15/2023] Open
Abstract
As one of the most important post-transcriptional modifications, m6Am plays a fairly important role in conferring mRNA stability and in the progression of cancers. The accurate identification of the m6Am sites is critical for explaining its biological significance and developing its application in the medical field. However, conventional experimental approaches are time-consuming and expensive, making them unsuitable for the large-scale identification of the m6Am sites. To address this challenge, we exploit a CatBoost-based method, m6Aminer, to identify the m6Am sites on mRNA. For feature extraction, nine different feature-encoding schemes (pseudo electron-ion interaction potential, hash decimal conversion method, dinucleotide binary encoding, nucleotide chemical properties, pseudo k-tuple composition, dinucleotide numerical mapping, K monomeric units, series correlation pseudo trinucleotide composition, and K-spaced nucleotide pair frequency) were utilized to form the initial feature space. To obtain the optimized feature subset, the ExtraTreesClassifier algorithm was adopted to perform feature importance ranking, and the top 300 features were selected as the optimal feature subset. With different performance assessment methods, 10-fold cross-validation and independent test, m6Aminer achieved average AUC of 0.913 and 0.754, demonstrating a competitive performance with the state-of-the-art models m6AmPred (0.905 and 0.735) and DLm6Am (0.897 and 0.730). The prediction model developed in this study can be used to identify the m6Am sites in the whole transcriptome, laying a foundation for the functional research of m6Am.
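One of the nine encodings listed above, K-spaced nucleotide pair frequency, can be sketched in a few lines. The version below follows a common formulation (counting ordered nucleotide pairs separated by exactly k positions and normalising by the number of available pairs); the abstract does not give the exact variant used in m6Aminer, so the function name, alphabet, and normalisation are assumptions.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # RNA alphabet assumed for mRNA sequences


def kspaced_pair_frequency(sequence, k):
    """Frequencies of ordered nucleotide pairs separated by k positions.

    Returns a dict over all 16 ordered pairs; counts are normalised by
    the number of pair positions available in the sequence.
    """
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(sequence) - k - 1
    if total <= 0:
        raise ValueError("sequence too short for this k")
    for i in range(total):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # skip ambiguous symbols such as N
            counts[pair] += 1
    return {pair: count / total for pair, count in counts.items()}


# Invented toy sequence; real inputs would be windows around candidate sites.
features = kspaced_pair_frequency("ACGUACGU", k=1)
print(features["AG"], features["CU"])
```

Concatenating the 16-dimensional vectors for several values of k gives one block of the initial feature space, which a ranking step can then prune.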
|
20
|
Age Classification of Rice Seeds in Japan Using Gradient-Boosting and ANFIS Algorithms. SENSORS (BASEL, SWITZERLAND) 2023; 23:2828. [PMID: 36905032 PMCID: PMC10007270 DOI: 10.3390/s23052828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 02/28/2023] [Accepted: 03/03/2023] [Indexed: 06/18/2023]
Abstract
The rapidly changing climate affects an extensive spectrum of human-centered environments. The food industry is one of the affected industries due to rapid climate change. Rice is a staple food and an important cultural key point for Japanese people. As Japan is a country in which natural disasters continuously occur, using aged seeds for cultivation has become a regular practice. It is a well-known truth that seed quality and age highly impact germination rate and successful cultivation. However, a considerable research gap exists in the identification of seeds according to age. Hence, this study aims to implement a machine-learning model to identify Japanese rice seeds according to their age. Since agewise datasets are unavailable in the literature, this research implements a novel rice seed dataset with six rice varieties and three age variations. The rice seed dataset was created using a combination of RGB images. Image features were extracted using six feature descriptors. The proposed algorithm used in this study is called Cascaded-ANFIS. A novel structure for this algorithm is proposed in this work, combining several gradient-boosting algorithms such as XGBoost, CatBoost, and LightGBM. The classification was conducted in two steps. First, the seed variety was identified. Then, the age was predicted. As a result, seven classification models were implemented. The performance of the proposed algorithm was evaluated against 13 state-of-the-art algorithms. Overall, the proposed algorithm has a higher accuracy, precision, recall, and F1-score than the others. For the classification of variety, the proposed algorithm scored 0.7697, 0.7949, 0.7707, and 0.7862, respectively. The results of this study confirm that the proposed algorithm can be employed in the successful age classification of seeds.
|
21
|
An Improved CatBoost-Based Classification Model for Ecological Suitability of Blueberries. SENSORS (BASEL, SWITZERLAND) 2023; 23:1811. [PMID: 36850409 PMCID: PMC9961688 DOI: 10.3390/s23041811] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 01/30/2023] [Accepted: 01/31/2023] [Indexed: 06/18/2023]
Abstract
Selecting the best planting area for blueberries is an essential issue in agriculture. To improve the effectiveness of blueberry cultivation, this paper proposes, for the first time, a machine learning-based classification model for blueberry ecological suitability and validates it using multi-source environmental feature data. The sparrow search algorithm (SSA) was adopted to optimize the CatBoost model and classify the ecological suitability of blueberries based on the selected data features. First, the Borderline-SMOTE algorithm was used to balance the number of positive and negative samples. The Variance Inflation Factor and information gain methods were applied to screen the factors affecting the growth of blueberries. Subsequently, the processed data were fed into CatBoost for training, and the CatBoost parameters were optimized using SSA to obtain the optimal model. Finally, the SSA-CatBoost model was adopted to classify the ecological suitability of blueberries and output the suitability types. In a case study of a blueberry plantation in Majiang County, Guizhou Province, China, the AUC value of the SSA-CatBoost-based blueberry ecological suitability model is 0.921, which is 2.68% higher than that of CatBoost (AUC = 0.897) and significantly higher than those of Logistic Regression (AUC = 0.855), Support Vector Machine (AUC = 0.864), and Random Forest (AUC = 0.875). Furthermore, the ecological suitability of blueberries in Majiang County is mapped according to the classification results of the different models. Compared with the actual blueberry cultivation situation in Majiang County, the classification results of the SSA-CatBoost model proposed in this paper match best, which gives the model a high reference value for the selection of blueberry cultivation sites.
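The 2.68% figure quoted above is consistent with a relative difference, (0.921 - 0.897) / 0.897, rather than an absolute gap of 0.024 AUC points. The short check below recomputes it from the AUC values in the abstract; the function and variable names are illustrative.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# AUC values as reported in the abstract.
auc_ssa_catboost = 0.921
baselines = {
    "CatBoost": 0.897,
    "Logistic Regression": 0.855,
    "Support Vector Machine": 0.864,
    "Random Forest": 0.875,
}

for name, auc in baselines.items():
    gain = relative_gain_percent(auc_ssa_catboost, auc)
    print(f"{name}: {gain:.2f}% relative gain")
```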
|
22
|
Volumetric Properties and Stiffness Modulus of Asphalt Concrete Mixtures Made with Selected Quarry Fillers: Experimental Investigation and Machine Learning Prediction. MATERIALS (BASEL, SWITZERLAND) 2023; 16:1017. [PMID: 36770022 PMCID: PMC9918211 DOI: 10.3390/ma16031017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 01/17/2023] [Accepted: 01/19/2023] [Indexed: 06/18/2023]
Abstract
In recent years, the attention of many researchers in the field of pavement engineering has focused on the search for alternative fillers that could replace Portland cement and traditional limestone in the production of asphalt mixtures. In addition, from a Czech perspective, there was the need to determine the quality of asphalt mixtures prepared with selected fillers provided by different local quarries and suppliers. This paper discusses an experimental investigation and a machine learning modeling carried out by a decision tree CatBoost approach, based on experimentally determined volumetric and mechanical properties of fine-grained asphalt concretes prepared with selected quarry fillers used as an alternative to traditional limestone and Portland cement. Air voids content and stiffness modulus at 15 °C were predicted on the basis of seven input variables, including bulk density, a categorical variable distinguishing the aggregates' quarry of origin, and five main filler-oxide contents determined by means of X-ray fluorescence spectrometry. All mixtures were prepared by fixing the filler content at 10% by mass, with a bitumen content of 6% (PG 160/220), and with roughly the same grading curve. Model predictive performance was evaluated in terms of six different evaluation metrics with Pearson correlation and coefficient of determination always higher than 0.96 and 0.92, respectively. Based on the results obtained, this study could represent a forward feasibility study on the mathematical prediction of the asphalt mixtures' mechanical behavior on the basis of its filler mineralogical composition.
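The Pearson correlation used above to judge agreement between measured and predicted properties can be computed as in the sketch below. The stiffness values are invented for illustration and are not taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented measured vs. predicted stiffness moduli (MPa).
measured = [4100.0, 4550.0, 5020.0, 5600.0]
predicted = [4180.0, 4490.0, 5100.0, 5530.0]
print(round(pearson_r(measured, predicted), 4))
```

Pearson correlation measures linear association only, so it is usually reported together with the coefficient of determination, which also penalises systematic offsets.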
|
23
|
Machine learning approaches to predict the photocatalytic performance of bismuth ferrite-based materials in the removal of malachite green. JOURNAL OF HAZARDOUS MATERIALS 2023; 442:130031. [PMID: 36179629 DOI: 10.1016/j.jhazmat.2022.130031] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 07/29/2022] [Revised: 09/05/2022] [Accepted: 09/17/2022] [Indexed: 06/16/2023]
Abstract
This study focuses on the potential capability of numerous machine learning models, namely CatBoost, GradientBoosting, HistGradientBoosting, ExtraTrees, XGBoost, DecisionTree, Bagging, light gradient boosting machine (LGBM), GaussianProcess, artificial neural network (ANN), and light long short-term memory (LightLSTM). These models were investigated to predict the photocatalytic degradation of malachite green from wastewater using various NM-BiFeO3 composites. A comprehensive databank of 1200 data points was generated under various experimental conditions. The ten input variables selected were the catalyst type, reaction time, light intensity, initial concentration, catalyst loading, solution pH, humic acid concentration, anions, surface area, and pore volume of various photocatalysts. The MG dye degradation efficiency was selected as the output variable. An evaluation of the performance metrics suggested that the CatBoost model, with the highest test coefficient of determination (0.99) and lowest mean absolute error (0.64) and root-mean-square error (1.34), outperformed all other models. The CatBoost model showed that the photocatalytic reaction conditions were more important than the material properties. The modeling results suggested that the optimized process conditions were a light intensity of 105 W, catalyst loading of 1.5 g/L, initial MG dye concentration of 5 mg/L and solution pH of 7. Finally, the implications and drawbacks of the current study were stated in detail.
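The three metrics used above to rank the models (coefficient of determination, mean absolute error, and root-mean-square error) can be computed as in this sketch. The degradation-efficiency values are invented and only show how the quantities relate to one another.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Invented degradation efficiencies (%) for illustration.
observed = [62.0, 75.5, 88.0, 93.5]
predicted = [61.0, 76.0, 87.0, 95.0]
print(regression_metrics(observed, predicted))
```

RMSE is never smaller than MAE, and a large gap between the two points to a few large errors rather than uniformly small ones.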
|
24
|
Alternative Fillers in Asphalt Concrete Mixtures: Laboratory Investigation and Machine Learning Modeling towards Mechanical Performance Prediction. MATERIALS (BASEL, SWITZERLAND) 2023; 16:807. [PMID: 36676543 PMCID: PMC9861159 DOI: 10.3390/ma16020807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Revised: 01/10/2023] [Accepted: 01/11/2023] [Indexed: 06/17/2023]
Abstract
In recent years, due to the reduction in available natural resources, the attention of many researchers has been focused on the reuse of recycled materials and industrial waste in common engineering applications. This paper discusses the feasibility of using seven different materials as alternative fillers instead of ordinary Portland cement (OPC) in road pavement base layers: namely rice husk ash (RHA), brick dust (BD), marble dust (MD), stone dust (SD), fly ash (FA), limestone dust (LD), and silica fume (SF). To exclusively evaluate the effect that selected fillers had on the mechanical performance of asphalt mixtures, we carried out Marshall, indirect tensile strength, moisture susceptibility, and Cantabro abrasion loss tests on specimens in which only the filler type and its percentage varied while keeping constant all the remaining design parameters. Experimental findings showed that all mixtures, except those prepared with 4% RHA or MD, met the requirements of Indian standards with respect to air voids, Marshall stability and quotient. LD and SF mixtures provided slightly better mechanical strength and durability than OPC ones, proving they can be successfully recycled as filler in asphalt mixtures. Furthermore, a Machine Learning methodology based on laboratory results was developed. A decision tree Categorical Boosting approach allowed the main mechanical properties of the investigated mixtures to be predicted on the basis of the main compositional variables, with a mean Pearson correlation and a mean coefficient of determination equal to 0.9724 and 0.9374, respectively.
|
25
|
A machine learning strategy with clustering under sampling of majority instances for predicting drug target interactions. Mol Inform 2022; 42:e2200102. [PMID: 36411246 DOI: 10.1002/minf.202200102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 11/11/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022]
Abstract
Drug Target Interactions (DTIs) are crucial in drug discovery, as identifying them narrows the range of candidate searches and speeds up the drug screening process. Because in vitro and in vivo experiments are time-consuming and costly, there has been a surge in computational techniques, especially ML methods, for DTI prediction. This study therefore presents a methodology that uses molecular structures and amino acid sequences to generate PubChem fingerprints for drugs and PSSMs for targets, respectively. The proposed work uses a novel technique, NearestCUS, to handle the class imbalance problem of the benchmark datasets. We use Isomap Embedding to extract features from PSSMs. Feature selection is performed using ANOVA. CatBoost is used for predicting the interaction between drugs and targets for the first time. To quantify the efficacy of NearestCUS, we compared it with other sampling techniques. We found that the proposed methodology performed better than state-of-the-art approaches.
|
26
|
An Improved Diagnostic of the Mycobacterium tuberculosis Drug Resistance Status by Applying a Decision Tree to Probabilities Assigned by the CatBoost Multiclassifier of Matrix Metalloproteinases Biomarkers. Diagnostics (Basel) 2022; 12:diagnostics12112847. [PMID: 36428907 PMCID: PMC9688965 DOI: 10.3390/diagnostics12112847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 11/10/2022] [Accepted: 11/14/2022] [Indexed: 11/19/2022] Open
Abstract
In this work, we discuss an opportunity to use a set of the matrix metalloproteinases MMP-1, MMP-8, and MMP-9 and the tissue inhibitor TIMP, the concentrations of which can be easily obtained via a blood test from patients suffering from tuberculosis, as the biomarker for a fast diagnosis of the drug resistance status of Mycobacterium tuberculosis. The diagnostic approach is based on machine learning with the CatBoost system, which has been supplied with additional postprocessing. The latter refers not only to the simple probabilities of ML-predicted outcomes but also to the decision tree-like procedure, which takes into account the presence of strict zeros in the primary set of probabilities. It is demonstrated that this procedure significantly elevates the accuracy of distinguishing between sensitive, multi-, and extremely drug-resistant strains.
|
27
|
An individualized medication model of sodium valproate for patients with bipolar disorder based on machine learning and deep learning techniques. Front Pharmacol 2022; 13:890221. [PMID: 36339624 PMCID: PMC9627622 DOI: 10.3389/fphar.2022.890221] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/07/2022] [Accepted: 09/29/2022] [Indexed: 07/20/2023] Open
Abstract
Valproic acid/sodium valproate (VPA) is a widely used anticonvulsant drug for maintenance treatment of bipolar disorders. In order to balance the efficacy and adverse events of VPA treatment, an individualized dose regimen is necessary. This study aimed to establish an individualized medication model of VPA for patients with bipolar disorder based on machine learning and deep learning techniques. The sequential forward selection (SFS) algorithm was applied for selecting a feature subset, and random forest was used for interpolating missing values. Then, we compared nine models using XGBoost, LightGBM, CatBoost, random forest, GBDT, SVM, logistic regression, ANN, and TabNet, and CatBoost was chosen to establish the individualized medication model with the best performance (accuracy = 0.85, AUC = 0.91, sensitivity = 0.85, and specificity = 0.83). Three important variables that correlated with VPA daily dose included VPA TDM value, antipsychotics, and indirect bilirubin. SHapley Additive exPlanations was applied to visually interpret their impacts on VPA daily dose. Last, the confusion matrix presented that predicting a daily dose of 0.5 g VPA had a precision of 55.56% and recall rate of 83.33%, and predicting a daily dose of 1 g VPA had a precision of 95.83% and a recall rate of 85.19%. In conclusion, the individualized medication model of VPA for patients with bipolar disorder based on CatBoost had a good prediction ability, which provides guidance for clinicians to propose the optimal medication regimen.
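The per-class precision and recall quoted above follow directly from a confusion matrix. The sketch below uses hypothetical counts, chosen only because they reproduce the reported percentages (5/9 = 55.56%, 5/6 = 83.33%, 23/24 = 95.83%, 23/27 = 85.19%); the actual matrix is not given in the abstract, and the function name is illustrative.

```python
def precision_recall(confusion, labels):
    """Per-class (precision, recall) from a square confusion matrix.

    confusion[i][j] is the number of cases whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    results = {}
    for idx, label in enumerate(labels):
        true_positive = confusion[idx][idx]
        predicted_as_label = sum(row[idx] for row in confusion)
        actually_label = sum(confusion[idx])
        results[label] = (
            true_positive / predicted_as_label,
            true_positive / actually_label,
        )
    return results


labels = ["0.5 g", "1 g"]
# Hypothetical counts consistent with the reported percentages.
confusion = [
    [5, 1],   # true 0.5 g: 5 predicted 0.5 g, 1 predicted 1 g
    [4, 23],  # true 1 g: 4 predicted 0.5 g, 23 predicted 1 g
]

for label, (precision, recall) in precision_recall(confusion, labels).items():
    print(f"{label}: precision {precision:.2%}, recall {recall:.2%}")
```

The low precision for the 0.5 g class alongside its high recall is typical of a small class: a handful of misassigned majority cases is enough to dilute its predictions.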
|
28
|
Genetic Association Study and Machine Learning to Investigate Differences in Platelet Reactivity in Patients with Acute Ischemic Stroke Treated with Aspirin. Biomedicines 2022; 10:biomedicines10102564. [PMID: 36289824 PMCID: PMC9599820 DOI: 10.3390/biomedicines10102564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Revised: 10/04/2022] [Accepted: 10/08/2022] [Indexed: 11/17/2022] Open
Abstract
Aspirin resistance (AR) is a pressing problem in current ischemic stroke care. Although the role of genetic variations is widely considered, the data still remain controversial. Our aim was to investigate the contribution of genetic features to laboratory AR measured through platelet aggregation with arachidonic acid (AA) and adenosine diphosphate (ADP) in ischemic stroke patients. A total of 461 patients were enrolled. Platelet aggregation was measured via light transmission aggregometry. Eighteen single-nucleotide polymorphisms (SNPs) in ITGB3, GPIBA, TBXA2R, ITGA2, PLA2G7, HMOX1, PTGS1, PTGS2, ADRA2A, ABCB1 and PEAR1 genes and the intergenic 9p21.3 region were determined using low-density biochips. We found an association of rs1330344 in the PTGS1 gene with AR and AA-induced platelet aggregation. Rs4311994 in ADRA2A gene also affected AA-induced aggregation, and rs4523 in the TBXA2R gene and rs12041331 in the PEAR1 gene influenced ADP-induced aggregation. Furthermore, the effect of rs1062535 in the ITGA2 gene on NIHSS dynamics during 10 days of treatment was found. The best machine learning (ML) model for AR based on clinical and genetic factors was characterized by AUC = 0.665 and F1-score = 0.628. In conclusion, the association study showed that PTGS1, ADRA2A, TBXA2R and PEAR1 polymorphisms may affect laboratory AR. However, the ML model demonstrated the predominant influence of clinical features.
|
29
|
Exploring the risk factors of impaired fasting glucose in middle-aged population living in South Korean communities by using categorical boosting machine. Front Endocrinol (Lausanne) 2022; 13:1013162. [PMID: 36246911 PMCID: PMC9556903 DOI: 10.3389/fendo.2022.1013162] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/06/2022] [Accepted: 09/16/2022] [Indexed: 11/13/2022] Open
Abstract
Objective This epidemiological study (1) identified factors associated with impaired fasting glucose using 3,019 subjects (≥30 years old and <60 years old) without diabetes mellitus from national survey data and (2) developed a nomogram that could predict groups vulnerable to impaired fasting glucose by using machine learning. Methods This study analyzed 3,019 adults between 30 and 65 years old who completed blood tests, physical measurements, blood pressure measurements, and health surveys. Impaired fasting glucose, a dependent variable, was classified into normal blood glucose (glycated hemoglobin<5.7% and fasting blood glucose ≤ 100mg/dl) and impaired fasting glucose (glycated hemoglobin is 5.7-6.4% and fasting blood glucose is 100-125mg/dl). Explanatory variables included socio-demographic factors, health habit factors, anthropometric factors, dietary habit factors, and cardiovascular disease risk factors. This study developed a model for predicting impaired fasting glucose by using logistic nomogram and categorical boosting (CatBoost). Results In this study, the top eight variables with a high impact on CatBoost model output were age, high cholesterol, WHtR, BMI, drinking more than one shot per month for the past year, marital status, hypertension, and smoking. Conclusion It is necessary to improve lifestyle and continuously monitor subjects at the primary medical care level so that we can detect non-diabetics vulnerable to impaired fasting glucose living in the community at an early stage and manage their blood glucose.
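The outcome definition and two of the anthropometric predictors named in this abstract can be written out as a small sketch. The thresholds are those stated in the Methods; the function names, the return labels, and the handling of cases that match neither definition (returning None) are assumptions, and BMI and WHtR use their standard definitions.

```python
def glucose_status(hba1c_percent, fasting_glucose_mg_dl):
    """Classify a subject using the thresholds quoted in the abstract.

    Returns None when neither definition applies; how the study treated
    such cases is not stated, so this is an assumption.
    """
    if hba1c_percent < 5.7 and fasting_glucose_mg_dl <= 100:
        return "normal"
    if 5.7 <= hba1c_percent <= 6.4 and 100 <= fasting_glucose_mg_dl <= 125:
        return "impaired fasting glucose"
    return None


def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2


def whtr(waist_cm, height_cm):
    """Waist-to-height ratio."""
    return waist_cm / height_cm


# Invented example subject.
print(glucose_status(6.0, 110))
print(round(bmi(70.0, 1.75), 1), round(whtr(84.0, 175.0), 2))
```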
|
30
|
A pseudo-Siamese framework for circRNA-RBP binding sites prediction integrating BiLSTM and soft attention mechanism. Methods 2022; 207:57-64. [PMID: 36113743 DOI: 10.1016/j.ymeth.2022.09.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2022] [Revised: 08/24/2022] [Accepted: 09/09/2022] [Indexed: 11/20/2022] Open
Abstract
Circular RNAs (circRNAs) are widely expressed in tissues and play a key role in diseases through interacting with RNA binding proteins (RBPs). Owing to the high cost of traditional techniques, computational methods have been developed to identify the binding sites between circRNAs and RBPs. Unfortunately, these methods suffer from insufficient learning of features and a single classification of the output. To address these limitations, we propose a novel method named circ-pSBLA, which constructs a pseudo-Siamese framework integrating a bi-directional long short-term memory (BiLSTM) network and a soft attention mechanism for circRNA-RBP binding site prediction. A softmax function and CatBoost are adopted as separate classifiers to construct the pseudo-Siamese framework, and circ-pSBLA combines their outputs to obtain the final output. To validate the effectiveness of circ-pSBLA, we compare it with other state-of-the-art methods and carry out an ablation experiment on 17 sub-datasets. Moreover, we perform motif analysis on 3 sub-datasets. The results show that circ-pSBLA achieves superior performance and outperforms other methods. All supporting source code can be downloaded from https://github.com/gyj9811/circ-pSBLA.
|
31
|
Product pricing solutions using hybrid machine learning algorithm. INNOVATIONS IN SYSTEMS AND SOFTWARE ENGINEERING 2022:1-12. [PMID: 35910813 PMCID: PMC9309595 DOI: 10.1007/s11334-022-00465-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Accepted: 07/03/2022] [Indexed: 06/15/2023]
Abstract
E-commerce platforms have been around for over two decades now, and their popularity among buyers and sellers alike has been increasing. With the COVID-19 pandemic, there has been a boom in online shopping, with many sellers moving their businesses towards e-commerce platforms. Product pricing is quite difficult at this increased scale of online shopping, considering the number of products being sold online. For instance, clothes show strong seasonal pricing trends, and brand names seem to sway prices heavily. Electronics, on the other hand, have product specification-based pricing, which keeps fluctuating. This work aims to help business owners price their products competitively, using the reviews and the statistical and categorical features of similar products being sold on e-commerce platforms. A hybrid algorithm, X-NGBoost, combining extreme gradient boost (XGBoost) with natural gradient boost (NGBoost), is proposed to predict the price. The proposed model is compared with ensemble models such as XGBoost, LightBoost and CatBoost, and it outperforms these existing ensemble boosting algorithms.
|
32
|
Revisiting the Risk Factors for Endometriosis: A Machine Learning Approach. J Pers Med 2022; 12:jpm12071114. [PMID: 35887611 PMCID: PMC9317820 DOI: 10.3390/jpm12071114] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/20/2022] [Revised: 05/16/2022] [Accepted: 07/05/2022] [Indexed: 11/30/2022] Open
Abstract
Endometriosis is a condition characterized by implants of endometrial tissues into extrauterine sites, mostly within the pelvic peritoneum. Endometriosis is under-diagnosed, and its prevalence is estimated at 5-10% of all women of reproductive age. The goal of this study was to develop a model for endometriosis based on the UK Biobank (UKB) and re-assess the contribution of known risk factors to endometriosis. We partitioned the data into those diagnosed with endometriosis (5924; ICD-10: N80) and a control group (142,723). We included over 1000 variables from the UKB covering personal information about female health, lifestyle, self-reported data, genetic variants, and medical history prior to endometriosis diagnosis. We applied machine learning algorithms to train an endometriosis prediction model. The optimal prediction was achieved with the gradient boosting algorithm CatBoost for the data-combined model, with an area under the ROC curve (ROC-AUC) of 0.81. The same results were obtained for women from a mixed ethnicity population of the UKB (7112; ICD-10: N80). We discovered that, prior to being diagnosed with endometriosis, affected women had significantly more ICD-10 diagnoses than the average unaffected woman. We used SHAP, an explainable AI tool, to estimate the marginal impact of a feature, given all other features. The informative features ranked by SHAP values included irritable bowel syndrome (IBS) and the length of the menstrual cycle. We conclude that the rich population-based retrospective data from the UKB are valuable for developing unified machine learning endometriosis models despite the limitations of missing data, noisy medical input, and participant age. The informative features of the model may improve clinical utility for endometriosis diagnosis.
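As an illustration of what "the marginal impact of a feature, given all other features" means, the following minimal sketch computes exact Shapley values for a tiny toy model by averaging each feature's marginal contribution over every feature ordering. The function `shapley_values`, the model `toy_model`, and the all-zero baseline are hypothetical and are not taken from the study; the SHAP library itself uses faster, model-specific estimators rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of input x relative to a baseline input.

    A feature counts as "absent" when it holds its baseline value.
    Cost grows factorially with the number of features, so this is
    only practical for very small toy examples.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i on
            new = predict(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]


# Hypothetical toy score: features 0 and 1 interact, feature 2 is ignored.
def toy_model(features):
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * b + 0.0 * c


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two interacting features.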
|
33
|
Prediction of hospital mortality in mechanically ventilated patients with congestive heart failure using machine learning approaches. Int J Cardiol 2022; 358:59-64. [PMID: 35483478 DOI: 10.1016/j.ijcard.2022.04.063] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/27/2022] [Revised: 03/14/2022] [Accepted: 04/22/2022] [Indexed: 12/29/2022]
Abstract
BACKGROUND Mechanically ventilated patients with congestive heart failure (CHF) are at high risk of mortality. We aimed to develop and validate a prediction model based on machine learning (ML) algorithms to predict hospital mortality in mechanically ventilated patients with CHF. METHODS Least absolute shrinkage and selection operator (LASSO) regression was used to identify the key features. Hyperparameter optimization (HPO) was conducted to tune the prediction model. The area under the receiver operating characteristic curve (AUC), accuracy, calibration curve and decision curve analysis were used to evaluate prediction performance. The final model was validated using an external validation set from another database. The prediction results were represented by a nomogram. RESULTS A total of 4530 qualified patients were included. Among 11 ML algorithms, CatBoost showed the best prediction performance (AUC = 0.833). Ten key features (10/63) were selected based on the LASSO regression. After HPO, the prediction performance of the CatBoost model based on the key features was significantly improved (AUCs: 0.805 vs. 0.821). Additionally, the CatBoost model showed satisfactory prediction performance in the external validation set (AUC = 0.806). CONCLUSION The present study developed and validated a CatBoost model that could accurately predict hospital mortality in mechanically ventilated patients with CHF.
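The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function `roc_auc` and the labels and scores are made-up illustrations, not data from the study, and production code would normally use a library routine instead.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = died, 0 = survived) and predicted risks.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6.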
|
34
|
Predicting South Korean adolescents vulnerable to obesity after the COVID-19 pandemic using categorical boosting and shapley additive explanation values: A population-based cross-sectional survey. Front Pediatr 2022; 10:955339. [PMID: 36210956 PMCID: PMC9532523 DOI: 10.3389/fped.2022.955339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2022] [Accepted: 09/05/2022] [Indexed: 12/05/2022] Open
Abstract
OBJECTIVE This study identified factors related to adolescent obesity during the COVID-19 pandemic by using machine learning techniques and, based on the result, developed a model for predicting high-risk obesity groups among South Korean adolescents. MATERIALS AND METHODS This study analyzed 50,858 subjects (male: 26,535; female: 24,323) between 12 and 18 years old. The outcome variable was classified into two classes (normal or obesity) based on body mass index (BMI). The explanatory variables included demographic factors, mental health factors, life habit factors, exercise factors, and academic factors. To understand the relationship between predictors of South Korean adolescent obesity, this study developed a prediction model using multiple logistic regression adjusted for all confounding factors, with the seven variables with the highest Shapley values found in categorical boosting (CatBoost) as inputs. RESULTS The top seven variables with a high impact on model output (based on SHAP values in CatBoost) were gender, mean sitting hours per day, the number of days of strength training in the past seven days, academic performance, the number of days of drinking soda in the past seven days, the number of days of moderate-intensity physical activity for 60 min or more per day in the past seven days, and subjective stress perception level. CONCLUSION To prevent obesity in adolescents, it is necessary to detect adolescents vulnerable to obesity early and to monitor them continuously to manage their physical health.
|
35
|
Insights into Factors Affecting Traffic Accident Severity of Novice and Experienced Drivers: A Machine Learning Approach. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:ijerph182312725. [PMID: 34886451 PMCID: PMC8656871 DOI: 10.3390/ijerph182312725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Revised: 11/24/2021] [Accepted: 11/30/2021] [Indexed: 11/16/2022]
Abstract
Traffic accidents have significant financial and social impacts. Reducing the losses caused by traffic accidents has always been one of the most important issues. This paper presents an effort to investigate the factors affecting accident severity for drivers with different levels of driving experience. Special focus was placed on the combined effect of driving experience and age. Based on our dataset (traffic accidents that occurred between 2005 and 2021 in Shaanxi, China), a CatBoost model was applied to handle categorical features, and the SHAP (Shapley Additive exPlanations) model was used to interpret the output. Results show that accident cause, age, visibility, light condition, season, road alignment, and terrain are the key factors affecting accident severity for both novice and experienced drivers. Age has opposite effects on fatal accidents for novice and experienced drivers. Novice drivers younger than 30 or older than 55 are prone to fatal accidents, whereas for experienced drivers the risk of a fatal accident decreases when they are young and increases when they are old. These findings fill the research gap regarding the combined effect of driving experience and age on accident severity. They can also provide useful insights for practitioners to improve traffic safety for novice and experienced drivers.
|
36
|
Comparing the performance of machine learning algorithms for remote and in situ estimations of chlorophyll-a content: A case study in the Tri An Reservoir, Vietnam. WATER ENVIRONMENT RESEARCH : A RESEARCH PUBLICATION OF THE WATER ENVIRONMENT FEDERATION 2021; 93:2941-2957. [PMID: 34547152 DOI: 10.1002/wer.1643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2021] [Revised: 09/01/2021] [Accepted: 09/14/2021] [Indexed: 06/13/2023]
Abstract
Chlorophyll-a (Chl-a) is one of the most important indicators of the trophic status of inland waters, and its continued monitoring is essential. The recently operational Sentinel-2 MSI satellite offers high spatial resolution images for remote water quality monitoring. In this study, we tested the performance of three well-known machine learning (ML) models (random forest [RF], support vector machine [SVM], and Gaussian process [GP]) and two novel ML models (extreme gradient boost [XGB] and CatBoost [CB]) for estimating a wide range of Chl-a concentrations (10.1-798.7 μg/L) using Sentinel-2 MSI data and in situ water quality measurements in the Tri An Reservoir (TAR), Vietnam. GP was the most reliable model for predicting Chl-a from water quality parameters (R² = 0.85, root-mean-square error [RMSE] = 56.65 μg/L, Akaike's information criterion [AIC] = 575.10, and Bayesian information criterion [BIC] = 595.24). With water surface reflectance as the model input, CB was the superior model for Chl-a retrieval (R² = 0.84, RMSE = 46.28 μg/L, AIC = 229.18, and BIC = 238.50). Our results indicated that GP and CB are the two best models for the prediction of Chl-a in TAR. Overall, Sentinel-2 MSI coupled with ML algorithms is a reliable, inexpensive, and accurate instrument for monitoring Chl-a in inland waters. PRACTITIONER POINTS: Machine learning algorithms were used for both remote sensing data and in situ water quality measurements. The performance of five well-known machine learning models was tested. Gaussian process was the most reliable model for predicting Chl-a from water quality parameters. CatBoost was the best model for Chl-a retrieval from water surface reflectance.
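The four comparison metrics named above can be sketched for a generic regression as follows. This is a minimal illustration that assumes the common least-squares forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n); the abstract does not state which AIC/BIC formulation the authors used, and the function name `regression_metrics`, the parameter count, and the data are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred, n_params):
    """Return R^2, RMSE, and least-squares AIC/BIC for one fitted model.

    n_params is the number of fitted parameters k (an assumption the
    caller must supply; it is not well defined for every ML model).
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {
        "r2": 1.0 - rss / tss,
        "rmse": math.sqrt(rss / n),
        "aic": n * math.log(rss / n) + 2 * n_params,
        "bic": n * math.log(rss / n) + n_params * math.log(n),
    }


# Made-up observed and predicted concentrations (arbitrary units).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
m = regression_metrics(y_true, y_pred, n_params=2)
print(m)
```

Higher R² and lower RMSE, AIC, and BIC indicate a better fit; AIC and BIC additionally penalize models with more parameters, with BIC penalizing more heavily once n exceeds about 7.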
|
37
|
Development of quantitative model of a local lymph node assay for evaluating skin sensitization potency applying machine learning CatBoost. Regul Toxicol Pharmacol 2021; 125:105019. [PMID: 34311055 DOI: 10.1016/j.yrtph.2021.105019] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/15/2020] [Revised: 06/13/2021] [Accepted: 07/21/2021] [Indexed: 11/21/2022]
Abstract
The estimated concentration for a stimulation index of 3 (EC3) in the murine local lymph node assay (LLNA) is an important quantitative value for determining the strength of skin sensitization to chemicals, including cosmetic ingredients. However, animal testing bans on cosmetics in Europe necessitate the development of alternative testing methods to LLNA. A machine learning-based prediction method can predict complex toxicity risks from multiple variables. Therefore, we developed an LLNA EC3 regression model using CatBoost, a new gradient boosting decision tree method, based on the reliable Cosmetics Europe database, which included data for 119 substances. We found that a model using in chemico/in vitro tests, physical properties, and chemical information associated with key events of the skin sensitization adverse outcome pathway as variables showed the best performance, with a coefficient of determination (R²) of 0.75. In addition, this model can report variable importance as an interpretation of the model, and the most important variable was associated with the human cell line activation test, which evaluates dendritic cell activation. The good performance and interpretability of our LLNA EC3 regression model suggest that it could serve as a useful approach for quantitative assessment of skin sensitization.
|
38
|
m6AGE: A Predictor for N6-Methyladenosine Sites Identification Utilizing Sequence Characteristics and Graph Embedding-Based Geometrical Information. Front Genet 2021; 12:670852. [PMID: 34122525 PMCID: PMC8191635 DOI: 10.3389/fgene.2021.670852] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/22/2021] [Accepted: 04/29/2021] [Indexed: 11/30/2022] Open
Abstract
N6-methyladenosine (m6A) is one of the most prevalent RNA post-transcriptional modifications and is involved in various vital biological processes such as mRNA splicing, export, and stability. Identifying m6A sites contributes to understanding the functional mechanism and biological significance of m6A. The existing biological experimental methods for identifying m6A sites are time-consuming and costly. Thus, developing a high-confidence computational method is important for exploring the intrinsic characteristics of m6A. In this study, we propose a predictor called m6AGE, which utilizes sequence-derived and graph embedding features. To the best of our knowledge, our predictor is the first to combine sequence-derived features and graph embeddings for m6A site prediction. Comparison results show that our proposed predictor achieved the best performance among the compared predictors on four public datasets across three species. On the A101 dataset, our predictor outperformed the compared predictors by 1.34% (accuracy), 0.0227 (Matthews correlation coefficient), 5.63% (specificity), and 0.0081 (AUC), which indicates that m6AGE is a useful tool for m6A site prediction. The source code of m6AGE is available at https://github.com/bokunoBike/m6AGE.
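For readers unfamiliar with the comparison metrics listed above, the sketch below computes accuracy, specificity, and the Matthews correlation coefficient from binary confusion-matrix counts. The function names and the counts are invented for illustration and are not results from the m6AGE paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1].

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up counts: 50 true sites and 50 non-sites.
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn), specificity(tn, fp),
      round(matthews_corrcoef(tp, tn, fp, fn), 4))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it more informative when the two classes are imbalanced.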
|
39
|
CatBoost for big data: an interdisciplinary review. JOURNAL OF BIG DATA 2020; 7:94. [PMID: 33169094 PMCID: PMC7610170 DOI: 10.1186/s40537-020-00369-8] [Citation(s) in RCA: 133] [Impact Index Per Article: 33.3] [Received: 08/05/2020] [Accepted: 10/19/2020] [Indexed: 05/25/2023]
Abstract
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
|