1
Lyu C, Joehanes R, Huan T, Levy D, Li Y, Wang M, Liu X, Liu C, Ma J. Enhancing selection of alcohol consumption-associated genes by random forest. Br J Nutr 2024; 131:2058-2067. [PMID: 38606596 PMCID: PMC11216877 DOI: 10.1017/s0007114524000795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/13/2024]
Abstract
Machine learning methods have been used in identifying omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood-derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. By using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC values of the transcripts selected by the Boruta method were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC values of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, we observed that alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1, that DOCK4 and SORT1 were positively associated with obesity, and that IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
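The Bonferroni threshold quoted in this abstract (P < 6·7e-4) is consistent with dividing a family-wise significance level by the number of tests, 25 transcripts × 3 risk factors. The short sketch below reproduces that arithmetic. The significance level of 0.05 is an assumption, since the abstract does not state it, and the function name is illustrative only.

```python
# Bonferroni threshold for 25 transcripts x 3 CVD risk factors.
# ASSUMPTION: family-wise alpha of 0.05 (not stated in the abstract).
ALPHA = 0.05
N_TRANSCRIPTS = 25
N_RISK_FACTORS = 3


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_tests = N_TRANSCRIPTS * N_RISK_FACTORS          # 75 tests in total
threshold = bonferroni_threshold(ALPHA, n_tests)  # 0.05 / 75
print(f"{n_tests} tests, per-test threshold = {threshold:.1e}")
```

Under that assumption the per-test threshold rounds to 6.7e-04, matching the value given in the abstract.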
Affiliation(s)
- Chenglin Lyu
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
  - Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA
- Roby Joehanes
  - Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA
- Tianxiao Huan
  - Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA
- Daniel Levy
  - Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA
- Yi Li
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Mengyao Wang
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Xue Liu
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Chunyu Liu
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Jiantao Ma
  - Nutrition Epidemiology and Data Science, Friedman School of Nutrition Science and Policy, Tufts University, Boston, MA
2
Ahammad I, Lamisa AB, Bhattacharjee A, Jamal TB, Arefin MS, Chowdhury ZM, Hossain MU, Das KC, Keya CA, Salimullah M. AITeQ: a machine learning framework for Alzheimer's prediction using a distinctive five-gene signature. Brief Bioinform 2024; 25:bbae291. [PMID: 38877887 PMCID: PMC11179120 DOI: 10.1093/bib/bbae291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 05/23/2024] [Accepted: 06/04/2024] [Indexed: 06/18/2024]
Abstract
Neurodegenerative diseases, such as Alzheimer's disease, pose a significant global health challenge with their complex etiology and elusive biomarkers. In this study, we used ribonucleic acid-sequencing (RNA-seq) data to develop the Alzheimer's Identification Tool (AITeQ), a machine learning (ML) model based on an optimized ensemble algorithm for the identification of Alzheimer's disease from RNA-seq data. Analysis of RNA-seq data from several studies identified 87 differentially expressed genes. This was followed by an ML protocol involving feature selection, model training, performance evaluation, and hyperparameter tuning. The feature selection process, which combined four different methodologies, yielded a compact yet impactful set of five genes. Twelve diverse ML models were trained and tested using these five genes (CNKSR1, EPHA2, CLSPN, OLFML3, and TARBP1). Performance metrics, including precision, recall, F1 score, accuracy, Matthews correlation coefficient, and area under the receiver operating characteristic curve, were assessed for the final selected model. Overall, the ensemble model consisting of logistic regression, naive Bayes classifier, and support vector machine with optimized hyperparameters was identified as the best and was used to develop AITeQ. AITeQ is available at: https://github.com/ishtiaque-ahammad/AITeQ.
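Several of the performance metrics named in this abstract (precision, recall, F1 score, accuracy, and the Matthews correlation coefficient) follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. It is not the authors' code; the function name and the counts passed to it are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}


# HYPOTHETICAL counts, for illustration only (not taken from the study).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells symmetrically, which is one reason it is often reported alongside accuracy when classes are imbalanced.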
Affiliation(s)
- Ishtiaque Ahammad
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Anika Bushra Lamisa
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Arittra Bhattacharjee
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Tabassum Binte Jamal
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Md Shamsul Arefin
  - Department of Biochemistry and Microbiology, North South University, Bashundhara, Dhaka 1229, Bangladesh
- Zeshan Mahmud Chowdhury
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Mohammad Uzzal Hossain
  - Bioinformatics Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Keshob Chandra Das
  - Molecular Biotechnology Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
- Chaman Ara Keya
  - Department of Biochemistry and Microbiology, North South University, Bashundhara, Dhaka 1229, Bangladesh
- Md Salimullah
  - Molecular Biotechnology Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka 1349, Bangladesh
3
Gebeye LG, Dessie EY, Yimam JA. Predictors of micronutrient deficiency among children aged 6-23 months in Ethiopia: a machine learning approach. Front Nutr 2024; 10:1277048. [PMID: 38249594 PMCID: PMC10796776 DOI: 10.3389/fnut.2023.1277048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 12/12/2023] [Indexed: 01/23/2024]
Abstract
Introduction: Micronutrient (MN) deficiencies are a major public health problem in developing countries, including Ethiopia, leading to childhood morbidity and mortality. Effective implementation of programs aimed at reducing MN deficiencies requires an understanding of the important drivers of suboptimal MN intake. Therefore, this study aimed to identify important predictors of MN deficiency among children aged 6-23 months in Ethiopia using machine learning algorithms.
Methods: This study employed data from the 2019 Ethiopia Mini Demographic and Health Survey (2019 EMDHS) and included a sample of 1,455 children aged 6-23 months for analysis. Machine learning (ML) methods, including Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Neural Network (NN), and Naïve Bayes (NB), were used to prioritize risk factors for MN deficiency prediction. Performance metrics including accuracy, sensitivity, specificity, and area under the receiver operating characteristic (AUROC) curve were used to evaluate model prediction performance.
Results: The RF model was the best-performing ML model in predicting child MN deficiency, with an AUROC of 80.01% and an accuracy of 72.41% in the test data. The RF algorithm identified the eastern region of Ethiopia, poorest wealth index, no maternal education, lack of media exposure, home delivery, and younger child age as the top prioritized risk factors, in order of importance, for MN deficiency prediction.
Conclusion: The RF algorithm outperformed other ML algorithms in predicting child MN deficiency in Ethiopia. Based on the findings of this study, improving women's education, increasing exposure to mass media, introducing MN-rich foods in early childhood, enhancing access to health services, and targeted intervention in the eastern region are strongly recommended to significantly reduce child MN deficiency.
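The AUROC used to compare models in this abstract can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is a plain-Python illustration, not the study's implementation, and the labels and scores are made-up toy values.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# TOY data, for illustration only: 1 = deficient, 0 = not deficient.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = auroc(toy_labels, toy_scores)
print(f"AUROC = {toy_auc:.3f}")
```

The pairwise form is quadratic in the sample size, so it suits small illustrations; rank-based implementations give the same value more efficiently on larger test sets.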
Affiliation(s)
- Leykun Getaneh Gebeye
  - Department of Statistics, College of Natural Science, Wollo University, Dessie, Ethiopia
- Eskezeia Yihunie Dessie
  - Department of Pediatrics, Cincinnati Children’s Hospital Medical Center, University of Cincinnati, College of Medicine, Cincinnati, OH, United States
- Jemal Ayalew Yimam
  - Department of Statistics, College of Natural Science, Wollo University, Dessie, Ethiopia