1
Taylor B, Hobensack M, Niño de Rivera S, Zhao Y, Masterson Creber R, Cato K. Identifying Depression Through Machine Learning Analysis of Omics Data: Scoping Review. JMIR Nurs 2024; 7:e54810. [PMID: 39028994 DOI: 10.2196/54810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 04/16/2024] [Accepted: 04/22/2024] [Indexed: 07/21/2024] Open
Abstract
BACKGROUND Depression is one of the most common mental disorders, affecting >300 million people worldwide. There is a shortage of providers trained in the provision of mental health care, and the nursing workforce is essential in filling this gap. The diagnosis of depression relies heavily on self-reported symptoms and clinical interviews, which are subject to implicit biases. The omics methods, including genomics, transcriptomics, epigenomics, and microbiomics, are novel methods for identifying the biological underpinnings of depression. Machine learning is used to analyze genomic data that includes large, heterogeneous, and multidimensional data sets.
OBJECTIVE This scoping review examines the existing literature on machine learning methods for omics data analysis to identify individuals with depression, with the goal of providing alternative, objective, and data-driven insights into the diagnostic process for depression.
METHODS This scoping review was reported following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. Searches were conducted in 3 databases to identify relevant publications. A total of 3 independent researchers performed screening, and discrepancies were resolved by consensus. Critical appraisal was performed using the Joanna Briggs Institute Critical Appraisal Checklist for Analytical Cross-Sectional Studies.
RESULTS The screening process identified 15 relevant papers. The omics methods included genomics, transcriptomics, epigenomics, multiomics, and microbiomics, and machine learning methods included random forest, support vector machine, k-nearest neighbor, and artificial neural network.
CONCLUSIONS The findings of this scoping review indicate that the omics methods had similar performance in identifying omics variants associated with depression. All machine learning methods performed well based on their performance metrics. When variants in omics data are associated with an increased risk of depression, the important next step is for clinicians, especially nurses, to assess individuals for symptoms of depression and provide a diagnosis and any necessary treatment.
Affiliation(s)
- Brittany Taylor
  - School of Nursing, Columbia University, New York, NY, United States
- Mollie Hobensack
  - Brookdale Department of Geriatrics and Palliative Care, Icahn School of Medicine, Mount Sinai Health System, New York, NY, United States
- Yihong Zhao
  - School of Nursing, Columbia University, New York, NY, United States
- Kenrick Cato
  - School of Nursing, University of Pennsylvania, Philadelphia, PA, United States
2
Li Z, Pei S, Wang Y, Zhang G, Lin H, Dong S. Advancing predictive markers in lung adenocarcinoma: A machine learning-based immunotherapy prognostic prediction signature. Environmental Toxicology 2024. [PMID: 38591820 DOI: 10.1002/tox.24284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2024] [Revised: 03/19/2024] [Accepted: 03/31/2024] [Indexed: 04/10/2024]
Abstract
The prognosis of lung adenocarcinoma (LUAD) is generally poor. Immunotherapy has emerged as a promising therapeutic modality, demonstrating remarkable potential for substantially prolonging the overall survival of individuals with LUAD. However, there is currently a lack of reliable signatures for identifying patients who would benefit from immunotherapy. We conducted a comparative analysis of two immunotherapy cohorts (OAK and POPLAR) and used univariate Cox regression to identify genes that significantly affect the prognosis of LUAD. Based on the TCGA-LUAD dataset, we employed a combination of 101 machine learning algorithms to construct candidate models and selected the optimal one. The model was validated on five GEO datasets and compared with 144 previously published signatures to assess its performance. Subsequently, we explored the underlying biological mechanisms through tumor mutation burden analysis, enrichment analysis, and immune infiltration analysis. An immunotherapy prognostic prediction signature (IPPS) was constructed based on 13 genes, showing robust performance in the TCGA-LUAD dataset. IPPS exhibited consistent predictive accuracy in the validation cohorts. Compared with the 144 previously published signatures, IPPS consistently ranked among the top in terms of C-index values. Further exploration revealed differences between the high- and low-IPPS groups in tumor mutation burden, pathway enrichment, and immune infiltration. IPPS demonstrates strong predictive capabilities for the prognosis of LUAD patients, offering the potential to identify suitable candidates for immunotherapy and contribute to precision treatment strategies for LUAD.
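The C-index used above to rank signatures can be illustrated with a minimal sketch of Harrell's concordance index. The function name, the toy data, and the simple pairwise loop are assumptions for illustration only; the abstract does not state which C-index variant or software the authors used, and pairs with tied survival times are simply skipped here.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times  : observed follow-up times
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : model risk scores, where a higher score means worse prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has the shorter time
            # and that shorter time ended in an observed event.
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk, earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: four patients, one censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(toy_times, toy_events, toy_risks))  # 4 of 5 comparable pairs agree: 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why signatures can be ordered by this single number across cohorts.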
Affiliation(s)
- Zhongyan Li
  - Department of Geriatric Medicine, The Affiliated Huai'an Hospital of Yangzhou University
- Shengbin Pei
  - Department of Breast Surgical Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yanjuan Wang
  - Department of Gastroenterology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
- Ge Zhang
  - Department of Cardiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Haoran Lin
  - Department of Thoracic Surgery, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
- Shiyang Dong
  - Department of Thoracic Surgery, Fuyang Tumor Hospital, Fuyang, China
3
Miller MI, Shih LC, Kolachalama VB. Machine Learning in Clinical Trials: A Primer with Applications to Neurology. Neurotherapeutics 2023; 20:1066-1080. [PMID: 37249836 PMCID: PMC10228463 DOI: 10.1007/s13311-023-01384-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Accepted: 04/21/2023] [Indexed: 05/31/2023] Open
Abstract
We review foundational concepts in artificial intelligence (AI) and machine learning (ML) and discuss ways in which these methodologies may be employed to enhance progress in clinical trials and research, with particular attention to applications in the design, conduct, and interpretation of clinical trials for neurologic diseases. We discuss ways in which ML may help to accelerate the pace of subject recruitment, provide realistic simulation of medical interventions, and enhance remote trial administration via novel digital biomarkers and therapeutics. Lastly, we provide a brief overview of the technical, administrative, and regulatory challenges that must be addressed as ML achieves greater integration into clinical trial workflows.
Affiliation(s)
- Matthew I Miller
  - Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, 72 E. Concord Street, Evans 636, Boston, MA, 02118, USA
- Ludy C Shih
  - Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, 02118, USA
- Vijaya B Kolachalama
  - Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, 72 E. Concord Street, Evans 636, Boston, MA, 02118, USA
  - Department of Computer Science and Faculty of Computing & Data Sciences, Boston University, Boston, MA, 02115, USA
4
Sokhansanj BA, Rosen GL. Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning. Comput Biol Med 2022; 149:105969. [PMID: 36041271 PMCID: PMC9384346 DOI: 10.1016/j.compbiomed.2022.105969] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/06/2022] [Revised: 07/11/2022] [Accepted: 08/13/2022] [Indexed: 11/17/2022]
Abstract
Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies that do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes "patient status" metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper proposes a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrate that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
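One common way to write a mixed effects boosting model of the kind described above, for a binary severity outcome with country as the grouping variable, is sketched below. The symbols $F$, $b_c$, and $\sigma^2$ and the logit link are assumptions for illustration; the abstract does not give the exact likelihood or parameterization used.

```latex
\[
\operatorname{logit} \Pr\left(y_{ic} = 1 \mid x_{ic}, b_c\right) = F(x_{ic}) + b_c,
\qquad b_c \sim \mathcal{N}\left(0, \sigma^2\right)
\]
```

Here $y_{ic}$ indicates severe disease for sequence record $i$ from country $c$, $x_{ic}$ collects the fixed-effect features (spike protein mutations, age, sex), $F$ is the tree ensemble learned by boosting, and $b_c$ is a country-level random intercept. A fixed effect model would instead pass country to $F$ as an ordinary feature, which lets the trees memorize country-specific reporting patterns.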
Affiliation(s)
- Bahrad A Sokhansanj
  - Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America
- Gail L Rosen
  - Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America
5
He J, Li J, Jiang S, Cheng W, Jiang J, Xu Y, Yang J, Zhou X, Chai C, Wu C. Application of machine learning algorithms in predicting HIV infection among men who have sex with men: Model development and validation. Front Public Health 2022; 10:967681. [PMID: 36091522 PMCID: PMC9452878 DOI: 10.3389/fpubh.2022.967681] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/13/2022] [Accepted: 08/02/2022] [Indexed: 01/25/2023] Open
Abstract
Background The continuously growing HIV incidence among men who have sex with men (MSM), together with the low rate of HIV testing among MSM in China, demonstrates a need for innovative strategies to improve the implementation of HIV prevention. Machine learning algorithms are increasingly used in disease diagnosis prediction. We aimed to develop and validate machine learning models for predicting HIV infection among MSM that can identify individuals at increased risk of HIV acquisition for transmission-reduction interventions.
Methods We extracted data from MSM sentinel surveillance in Zhejiang province from 2018 to 2020. Univariate logistic regression was used to select significant variables in the 2018-2019 data (P < 0.05). After data processing and feature selection, we divided the model development data into two groups by stratified random sampling: training data (70%) and testing data (30%). The Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance. Model performance was evaluated using accuracy, precision, recall, F-measure, and the area under the receiver operating characteristic curve (AUC). We then compared three commonly used machine learning algorithms, decision tree (DT), support vector machine (SVM), and random forest (RF), with logistic regression (LR). Finally, the four models were validated prospectively with 2020 data from Zhejiang province.
Results A total of 6,346 MSM were included in the model development data, 372 of whom were diagnosed with HIV. In feature selection, 12 variables were selected as model predictors. In SMOTE-processed data, DT, SVM, and RF improved classification performance over LR, with AUCs of 0.778 (LR), 0.856 (DT), 0.887 (SVM), and 0.942 (RF). RF was the best-performing algorithm (accuracy = 0.871, precision = 0.960, recall = 0.775, F-measure = 0.858, and AUC = 0.942). The RF model still performed well on prospective validation (AUC = 0.846).
Conclusion Machine learning models performed substantially better than the conventional LR model, and RF should be considered in prediction tools for HIV infection in Chinese MSM. Further studies are needed to optimize and promote these algorithms and evaluate their impact on HIV prevention among MSM.
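The metrics reported above can be cross-checked with a short sketch. It computes the F-measure as the harmonic mean of precision and recall (the reported 0.858 is consistent with that definition), the share of HIV-positive cases in the development data, and one SMOTE-style synthetic sample. The function names and the toy feature vectors are hypothetical, and the SMOTE step is simplified: full SMOTE first searches for the k nearest minority-class neighbors, which is skipped here.

```python
import random


def f_measure(precision, recall):
    """F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def smote_point(sample, neighbor, rng):
    """Create one synthetic minority sample on the line segment between
    a minority-class sample and one of its minority-class neighbors."""
    gap = rng.random()  # interpolation weight in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Reported RF precision and recall reproduce the reported F-measure.
print(round(f_measure(0.960, 0.775), 3))  # 0.858

# Class imbalance in the development data: 372 positives among 6,346 MSM.
print(round(372 / 6346, 4))  # 0.0586

# Hypothetical two-feature minority samples, used only to show the interpolation.
rng = random.Random(0)
synthetic = smote_point([1.0, 2.0], [3.0, 6.0], rng)
print(synthetic)
```

With roughly 6% positives, a classifier that always predicts "negative" would already reach about 94% accuracy, which is why oversampling and the recall and F-measure columns matter more than accuracy in this setting.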
Affiliation(s)
- Jiajin He
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Jinhua Li
  - School of Software Technology, Zhejiang University, Ningbo, China
- Siqing Jiang
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Wei Cheng
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China
- Jun Jiang
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China
- Yun Xu
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China
- Jiezhe Yang
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China
- Xin Zhou
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China
- Chengliang Chai
  - Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China. Correspondence: Chengliang Chai
- Chao Wu
  - School of Public Affairs, Zhejiang University, Hangzhou, China
6
Ueki M, Tamiya G. Smooth-threshold multivariate genetic prediction incorporating gene–environment interactions. G3 Genes|Genomes|Genetics 2021; 11:6343458. [PMID: 34849749 PMCID: PMC8664495 DOI: 10.1093/g3journal/jkab278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2021] [Accepted: 07/12/2021] [Indexed: 11/17/2022]
Abstract
We propose a genetic prediction modeling approach for genome-wide association study (GWAS) data that can include not only marginal gene effects but also gene–environment (GxE) interaction effects—i.e., multiplicative effects of environmental factors with genes rather than merely additive effects of each. The proposed approach is a straightforward extension of our previous multiple regression-based method, STMGP (smooth-threshold multivariate genetic prediction), with the new feature being that genome-wide test statistics from a GxE interaction analysis are used to weight the corresponding variants. We develop a simple univariate regression approximation to the GxE interaction effect that allows a direct fit of the STMGP framework without modification. The sparse nature of our model automatically removes irrelevant predictors (including variants and GxE combinations), and the model is able to simultaneously incorporate multiple environmental variables. Simulation studies to evaluate the proposed method in comparison with other modeling approaches demonstrate its superior performance under the presence of GxE interaction effects. We illustrate the usefulness of our prediction model through application to real GWAS data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
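The distinction between additive and multiplicative (GxE) effects drawn above can be written out for a single variant and a single environmental variable. This is a generic illustrative regression model with assumed notation, not the STMGP estimator itself, whose weighting and thresholding steps the abstract does not specify.

```latex
\[
y_i = \beta_0 + \beta_G\, g_i + \beta_E\, e_i + \beta_{GE}\, g_i e_i + \varepsilon_i
\]
```

Here $y_i$ is the phenotype of individual $i$, $g_i$ the genotype at one variant, $e_i$ the environmental exposure, and $\varepsilon_i$ the residual. A purely additive model sets $\beta_{GE} = 0$; a nonzero $\beta_{GE}$ means the effect of the genotype on $y_i$ changes with the level of the exposure.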
Affiliation(s)
- Masao Ueki
  - School of Information and Data Sciences, Nagasaki University, Nagasaki 852-8521, Japan
- Gen Tamiya
  - Tohoku University Graduate School of Medicine, Sendai, Miyagi 980-8575, Japan
  - Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo 103-0027, Japan
  - Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 980-8573, Japan