1
Sun P, Wang X, Wang S, Jia X, Feng S, Chen J, Fang Y. Bipolar disorder: Construction and analysis of a joint diagnostic model using random forest and feedforward neural networks. IBRO Neurosci Rep 2024; 17:145-153. [PMID: 39206162 PMCID: PMC11350441 DOI: 10.1016/j.ibneur.2024.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 07/22/2024] [Accepted: 07/30/2024] [Indexed: 09/04/2024] Open
Abstract
Background: To construct a diagnostic model for Bipolar Disorder (BD) depressive phase using peripheral tissue RNA data from patients and combining Random Forest with Feedforward Neural Network methods. Methods: Datasets GSE23848, GSE39653, and GSE69486 were selected, and differential gene expression analysis was conducted using the limma package in R. Key genes from the differentially expressed genes were identified using the Random Forest method. These key genes' expression levels in each sample were used to train a Feedforward Neural Network model. Techniques like L1 regularization, early stopping, and dropout layers were employed to prevent model overfitting. Model performance was then validated, followed by GO, KEGG, and protein-protein interaction network analyses. Results: The final model was a Feedforward Neural Network with two hidden layers and two dropout layers, comprising 2345 trainable parameters. Model performance on the validation set, assessed through 1000 bootstrap resampling iterations, demonstrated a specificity of 0.769 (95% CI 0.571-1.000), sensitivity of 0.818 (95% CI 0.533-1.000), AUC value of 0.832 (95% CI 0.642-0.979), and accuracy of 0.792 (95% CI 0.625-0.958). Enrichment analysis of key genes indicated no significant enrichment in any known pathways. Conclusion: Key genes with biological significance were identified based on the decrease in Gini coefficient within the Random Forest model. The combined use of Random Forest and Feedforward Neural Network to establish a diagnostic model showed good classification performance in Bipolar Disorder.
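The abstract above reports validation metrics with 95% confidence intervals from 1000 bootstrap resampling iterations. The following is a minimal sketch of how a percentile bootstrap interval for such threshold metrics could be computed. It is not the authors' code: the function names, the percentile method, and the toy labels are all assumptions made for illustration, and the data are not the study's.

```python
import random


def sensitivity(y_true, y_pred):
    """True-positive rate; NaN when the sample contains no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    return tp / pos if pos else float("nan")


def specificity(y_true, y_pred):
    """True-negative rate; NaN when the sample contains no negatives."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    return tn / neg if neg else float("nan")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed choice of interval type)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # NaN != NaN, so undefined resamples are dropped
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


# Hypothetical toy labels (NOT the study's data): 10 cases, 10 controls.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 7 + [1] * 3

for name, metric in [("sensitivity", sensitivity),
                     ("specificity", specificity),
                     ("accuracy", accuracy)]:
    point = metric(y_true, y_pred)
    lo, hi = bootstrap_ci(y_true, y_pred, metric)
    print(f"{name}: {point:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With a validation set this small, the intervals are wide, which is consistent with the broad ranges quoted in the abstract. An AUC interval would need predicted scores rather than hard labels and is not covered by this sketch.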
Affiliation(s)
- Ping Sun
  - Qingdao Mental Health Center, Shandong 266034, China
  - Clinical Research Center, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200030, China
- Xiangwen Wang
  - Qingdao Mental Health Center, Shandong 266034, China
  - School of Mental Health, Research Institute of Mental Health, Jining Medical University, Shandong 272002, China
- Shenghai Wang
  - Qingdao Mental Health Center, Shandong 266034, China
- Xueyu Jia
  - Department of Medicine, Qingdao University, Shandong 266000, China
- Shunkang Feng
  - Qingdao Mental Health Center, Shandong 266034, China
- Jun Chen
  - Clinical Research Center, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200030, China
  - Department of Psychiatry & Affective Disorders Center, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
  - Shanghai Key Laboratory of Psychotic Disorders, Shanghai 201108, China
- Yiru Fang
  - Clinical Research Center, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200030, China
  - Department of Psychiatry & Affective Disorders Center, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
  - Shanghai Key Laboratory of Psychotic Disorders, Shanghai 201108, China
  - State Key Laboratory of Neuroscience, Shanghai Institutes for Biological Sciences, CAS, Shanghai 200031, China
2
Seki T, Takiguchi T, Akagi Y, Ito H, Kubota K, Miyake K, Okada M, Kawazoe Y. Iterative random forest-based identification of a novel population with high risk of complications post non-cardiac surgery. Sci Rep 2024; 14:26741. [PMID: 39500963 PMCID: PMC11538396 DOI: 10.1038/s41598-024-78482-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Accepted: 10/31/2024] [Indexed: 11/08/2024] Open
Abstract
Assessing the risk of postoperative cardiovascular events before performing non-cardiac surgery is clinically important. The current risk score systems for preoperative evaluation may not adequately represent a small subset of high-risk populations. Accordingly, this study aimed at applying iterative random forest to analyze combinations of factors that could potentially be clinically valuable in identifying these high-risk populations. To this end, we used the Japan Medical Data Center database, which includes claims data from Japan between January 2005 and April 2021, and employed iterative random forests to extract factor combinations that influence outcomes. The analysis demonstrated that a combination of a prior history of stroke and extremely low LDL-C levels was associated with a high non-cardiac postoperative risk. The incidence of major adverse cardiovascular events in the population characterized by the incidence of previous stroke and extremely low LDL-C levels was 15.43 events per 100 person-30 days [95% confidence interval, 6.66-30.41] in the test data. At this stage, the results only show correlation rather than causation; however, these findings may offer valuable insights for preoperative risk assessment in non-cardiac surgery.
Affiliation(s)
- Tomohisa Seki
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Toru Takiguchi
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Yu Akagi
  - Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiromasa Ito
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Kazumi Kubota
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Kana Miyake
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Masafumi Okada
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
- Yoshimasa Kawazoe
  - Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
  - Artificial Intelligence and Digital Twin in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
3
Wang Z, Whipp AM, Heinonen-Guzejev M, Foraster M, Júlvez J, Kaprio J. The association between urban land use and depressive symptoms in young adulthood: a FinnTwin12 cohort study. JOURNAL OF EXPOSURE SCIENCE & ENVIRONMENTAL EPIDEMIOLOGY 2024; 34:770-779. [PMID: 38081942 PMCID: PMC11446816 DOI: 10.1038/s41370-023-00619-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 11/20/2023] [Accepted: 11/22/2023] [Indexed: 10/04/2024]
Abstract
BACKGROUND: Depressive symptoms lead to a serious public health burden and are considerably affected by the environment. Land use, describing the urban living environment, influences mental health, but complex relationship assessment is rare. OBJECTIVE: We aimed to examine the complicated association between urban land use and depressive symptoms among young adults with differential land use environments, by applying multiple models. METHODS: We included 1804 individual twins from the FinnTwin12 cohort, living in urban areas in 2012. There were eight types of land use exposures in three buffer radii. The depressive symptoms were assessed through the General Behavior Inventory (GBI) in young adulthood (mean age: 24.1). First, K-means clustering was performed to distinguish participants with differential land use environments. Then, linear elastic net penalized regression and eXtreme Gradient Boosting (XGBoost) were used to reduce dimensions or prioritize for importance and examine the linear and nonlinear relationships. RESULTS: Two clusters were identified: one is more typical of city centers and another of suburban areas. A heterogeneous pattern in results was detected from the linear elastic net penalized regression model among the overall sample and the two separated clusters. Agricultural residential land use in a 100 m buffer contributed to GBI most (coefficient: 0.097) in the "suburban" cluster among 11 selected exposures after adjustment with demographic covariates. In the "city center" cluster, none of the land use exposures was associated with GBI, even after further adjustment with social indicators. From the XGBoost models, we observed that ranks of the importance of land use exposures on GBI and their nonlinear relationships are also heterogeneous in the two clusters. IMPACT: This study examined the complex relationship between urban land use and depressive symptoms among young adults in Finland. Based on the FinnTwin12 cohort, two distinct clusters of participants were identified with different urban land use environments at first. We then employed two pluralistic models, elastic net penalized regression and XGBoost, and revealed both linear and nonlinear relationships between urban land use and depressive symptoms, which also varied in the two clusters. The findings suggest that analyses, involving land use and the broader environmental profile, should consider aspects such as population heterogeneity and linearity for comprehensive assessment in the future.
Affiliation(s)
- Zhiyang Wang
  - Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
- Alyce M Whipp
  - Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
  - Department of Public Health, University of Helsinki, Helsinki, Finland
- Maria Foraster
  - PHAGEX Research Group, Blanquerna School of Health Science, Universitat Ramon Llull (URL), Barcelona, Spain
  - ISGlobal-Instituto de Salud Global de Barcelona Campus MAR, Parc de Recerca Biomèdica de Barcelona (PRBB), Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - CIBER Epidemiología y Salud Pública (CIBEREsp), Madrid, Spain
- Jordi Júlvez
  - ISGlobal-Instituto de Salud Global de Barcelona Campus MAR, Parc de Recerca Biomèdica de Barcelona (PRBB), Barcelona, Spain
  - Clinical and Epidemiological Neuroscience (NeuroÈpia), Institut d'Investigació Sanitària Pere Virgili (IISPV), Reus, Spain
- Jaakko Kaprio
  - Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
  - Department of Public Health, University of Helsinki, Helsinki, Finland
4
Mallioris P, Luiken REC, Tobias T, Vonk J, Wagenaar JA, Stegeman A, Mughini-Gras L. Risk factors for antimicrobial use in Dutch pig farms: A cross-sectional study. Res Vet Sci 2024; 174:105307. [PMID: 38781817 DOI: 10.1016/j.rvsc.2024.105307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 05/04/2024] [Accepted: 05/13/2024] [Indexed: 05/25/2024]
Abstract
BACKGROUND: Antimicrobial use (AMU) has decreased significantly in Dutch pig farms since 2009. However, this decrease has stagnated recently, with relatively high AMU levels persisting mainly among weaners. The aim of this study was to identify farm-level characteristics associated with: i) total AMU and ii) use of specific antimicrobial classes. METHODS: In 2020, cross-sectional data from 154 Dutch pig farms were collected, including information on AMU and farm characteristics. A mixed-effects conditional Random Forest analysis was applied to select the subset of features that was best associated with AMU. RESULTS: The main risk factors for total AMU in weaners were vaccination for PRRS in sucklings, being a conventional farm (vs. not), high within-farm density, and early weaning. The main protective factors for total AMU in sows/sucklings were E. coli vaccination in sows and having boars for estrus detection from own production. Regarding antimicrobial class-specific outcomes, several risk factors overlapped for weaners and sows/sucklings, such as farmer's non-tertiary education, not having free-sow systems during lactation, and conventional farming. An additional risk factor for weaners was having fully slatted floors. For fatteners, the main risk factor for total AMU was PRRS vaccination in sucklings. CONCLUSIONS: Several factors were found here to be associated with AMU. Some were known but others were novel, such as farmer's tertiary education, low pig aggression and free-sow systems, which were all associated with lower AMU. These factors provide targets for developing tailor-made interventions, as well as an evidence-based selection of features for further causal assessment and mediation analysis.
Affiliation(s)
- Panagiotis Mallioris
  - Division of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
- Roosmarijn E C Luiken
  - Division of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
- Tijs Tobias
  - Department of Population Health Sciences, Farm Animal Health unit, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands; Swine Health Department, Royal GD, Deventer, the Netherlands
- John Vonk
  - John Vonk DVM, BSc Agriculture, De Varkenspraktijk, Obrechtstraat 2, 5344 AT, Oss, the Netherlands
- Jaap A Wagenaar
  - Division of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands; Wageningen Bioveterinary Research, Lelystad, the Netherlands
- Arjan Stegeman
  - Department of Population Health Sciences, Farm Animal Health unit, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
- Lapo Mughini-Gras
  - Institute for Risk Assessment Sciences, Utrecht University, Utrecht, the Netherlands; National Institute for Public Health and the Environment, Centre for Infectious Disease Control, Bilthoven, the Netherlands
5
Cheek CL, Lindner P, Grigorenko EL. Statistical and Machine Learning Analysis in Brain-Imaging Genetics: A Review of Methods. Behav Genet 2024; 54:233-251. [PMID: 38336922 DOI: 10.1007/s10519-024-10177-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Accepted: 01/24/2024] [Indexed: 02/12/2024]
Abstract
Brain-imaging-genetic analysis is an emerging field of research that aims at aggregating data from neuroimaging modalities, which characterize brain structure or function, and genetic data, which capture the structure and function of the genome, to explain or predict normal (or abnormal) brain performance. Brain-imaging-genetic studies offer great potential for understanding complex brain-related diseases/disorders of genetic etiology. Still, a combined brain-wide genome-wide analysis is difficult to perform as typical datasets fuse multiple modalities, each with high dimensionality, unique correlational landscapes, and often low statistical signal-to-noise ratios. In this review, we outline the progress in brain-imaging-genetic methodologies starting from early massive univariate to current deep learning approaches, highlighting each approach's strengths and weaknesses and elongating it with the field's development. We conclude by discussing selected remaining challenges and prospects for the field.
Affiliation(s)
- Connor L Cheek
  - Texas Institute for Evaluation, Measurement, and Statistics, University of Houston, Houston, TX, USA
  - Department of Physics, University of Houston, Houston, TX, USA
- Peggy Lindner
  - Texas Institute for Evaluation, Measurement, and Statistics, University of Houston, Houston, TX, USA
  - Department of Information Science Technology, University of Houston, Houston, TX, USA
- Elena L Grigorenko
  - Texas Institute for Evaluation, Measurement, and Statistics, University of Houston, Houston, TX, USA
  - Department of Psychology, University of Houston, Houston, TX, USA
  - Baylor College of Medicine, Houston, TX, USA
  - Sirius University of Science and Technology, Sochi, Russia
6
Bramer LM, Dixon HM, Rohlman D, Scott RP, Miller RL, Kincl L, Herbstman JB, Waters KM, Anderson KA. PM2.5 Is Insufficient to Explain Personal PAH Exposure. GEOHEALTH 2024; 8:e2023GH000937. [PMID: 38344245 PMCID: PMC10858395 DOI: 10.1029/2023gh000937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 01/19/2024] [Accepted: 01/22/2024] [Indexed: 10/28/2024]
Abstract
To understand how chemical exposure can impact health, researchers need tools that capture the complexities of personal chemical exposure. In practice, fine particulate matter (PM2.5) air quality index (AQI) data from outdoor stationary monitors and Hazard Mapping System (HMS) smoke density data from satellites are often used as proxies for personal chemical exposure, but do not capture total chemical exposure. Silicone wristbands can quantify more individualized exposure data than stationary air monitors or smoke satellites. However, it is not understood how these proxy measurements compare to chemical data measured from wristbands. In this study, participants wore daily wristbands, carried a phone that recorded locations, and answered daily questionnaires for a 7-day period in multiple seasons. We gathered publicly available daily PM2.5 AQI data and HMS data. We analyzed wristbands for 94 organic chemicals, including 53 polycyclic aromatic hydrocarbons. Wristband chemical detections and concentrations, behavioral variables (e.g., time spent indoors), and environmental conditions (e.g., PM2.5 AQI) significantly differed between seasons. Machine learning models were fit to predict personal chemical exposure using PM2.5 AQI only, HMS only, and a multivariate feature set including PM2.5 AQI, HMS, and other environmental and behavioral information. On average, the multivariate models increased predictive accuracy by approximately 70% compared to either the AQI model or the HMS model for all chemicals modeled. This study provides evidence that PM2.5 AQI data alone or HMS data alone is insufficient to explain personal chemical exposures. Our results identify additional key predictors of personal chemical exposure.
Affiliation(s)
- Lisa M. Bramer
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Holly M. Dixon
  - Department of Environmental and Molecular Toxicology, Food Safety and Environmental Stewardship Program, Oregon State University, Corvallis, OR, USA
- Diana Rohlman
  - College of Health, Oregon State University, Corvallis, OR, USA
- Richard P. Scott
  - Department of Environmental and Molecular Toxicology, Food Safety and Environmental Stewardship Program, Oregon State University, Corvallis, OR, USA
- Rachel L. Miller
  - Division of Clinical Immunology, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
- Laurel Kincl
  - College of Health, Oregon State University, Corvallis, OR, USA
- Julie B. Herbstman
  - Department of Environmental Health Sciences, Columbia Center for Children's Environmental Health, Mailman School of Public Health, Columbia University, New York City, NY, USA
- Katrina M. Waters
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
  - Department of Environmental and Molecular Toxicology, Food Safety and Environmental Stewardship Program, Oregon State University, Corvallis, OR, USA
- Kim A. Anderson
  - Department of Environmental and Molecular Toxicology, Food Safety and Environmental Stewardship Program, Oregon State University, Corvallis, OR, USA
7
Wade B, Pindale R, Camprodon J, Luccarelli J, Li S, Meisner R, Seiner S, Henry M. Individual Prediction of Optimal Treatment Allocation Between Electroconvulsive Therapy or Ketamine using the Personalized Advantage Index. RESEARCH SQUARE 2023:rs.3.rs-3682009. [PMID: 38077094 PMCID: PMC10705694 DOI: 10.21203/rs.3.rs-3682009/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Introduction: Electroconvulsive therapy (ECT) and ketamine are two effective treatments for depression with similar efficacy; however, individual patient outcomes may be improved by models that predict optimal treatment assignment. Here, we adapt the Personalized Advantage Index (PAI) algorithm using machine learning to predict optimal treatment assignment between ECT and ketamine using medical record data from a large, naturalistic patient cohort. We hypothesized that patients who received a treatment predicted to be optimal would have significantly better outcomes following treatment compared to those who received a non-optimal treatment. Methods: Data on 2526 ECT and 235 mixed IV ketamine and esketamine patients from McLean Hospital were aggregated. Depressive symptoms were measured using the Quick Inventory of Depressive Symptomatology (QIDS) before and during acute treatment. Patients were matched between treatments on pretreatment QIDS, age, inpatient status, and psychotic symptoms using a 1:1 ratio yielding a sample of 470 patients (n=235 per treatment). Random forest models were trained to predict differential patientwise minimum QIDS scores achieved during acute treatment (min-QIDS) for ECT and ketamine using pretreatment patient measures. Analysis of Shapley Additive exPlanations (SHAP) values identified predictors of differential outcomes between treatments. Results: Twenty-seven percent of patients with the largest PAI scores who received a treatment predicted optimal had significantly lower min-QIDS scores compared to those who received a non-optimal treatment (mean difference=1.6, t=2.38, q<0.05, Cohen's d=0.36). Analysis of SHAP values identified prescriptive pretreatment measures. Conclusions: Patients assigned to a treatment predicted to be optimal had significantly better treatment outcomes. Our model identified pretreatment patient factors captured in medical records that can provide interpretable and actionable guidelines for treatment selection.
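The abstract above describes comparing, for each patient, the predicted outcome under ECT with the predicted outcome under ketamine, and then contrasting observed outcomes between patients who did and did not receive the predicted-optimal treatment. A minimal sketch of that bookkeeping follows. It assumes the per-treatment predictions already exist (in the study they come from random forest models, which are not reproduced here); the sign convention, the tie-break rule, the `Patient` record, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Patient:
    received: str         # treatment actually given: "ECT" or "ketamine"
    pred_ect: float       # predicted min-QIDS under ECT (placeholder for a model output)
    pred_ketamine: float  # predicted min-QIDS under ketamine (placeholder)
    observed: float       # observed min-QIDS during acute treatment


def pai(p):
    """Predicted min-QIDS under ECT minus under ketamine (assumed sign convention)."""
    return p.pred_ect - p.pred_ketamine


def predicted_optimal(p):
    # Lower min-QIDS is better, so a positive difference favours ketamine.
    # Ties go to ECT here, which is an arbitrary choice for this sketch.
    return "ketamine" if pai(p) > 0 else "ECT"


def compare_allocations(patients):
    """Mean observed outcome for optimally vs non-optimally allocated patients."""
    optimal = [p.observed for p in patients if p.received == predicted_optimal(p)]
    non_optimal = [p.observed for p in patients if p.received != predicted_optimal(p)]
    return mean(optimal), mean(non_optimal)


# Hypothetical patients, for illustration only.
patients = [
    Patient("ECT", 6.0, 9.0, 5.0),
    Patient("ketamine", 7.0, 10.0, 11.0),
    Patient("ketamine", 12.0, 8.0, 7.0),
    Patient("ECT", 11.0, 6.0, 12.0),
]

opt_mean, non_mean = compare_allocations(patients)
print(f"optimal allocation mean min-QIDS: {opt_mean}")
print(f"non-optimal allocation mean min-QIDS: {non_mean}")
```

The magnitude of the difference, `abs(pai(p))`, can be used to rank patients by how strongly one treatment is favoured, which is one plausible reading of the "largest PAI scores" subgroup mentioned in the results.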
Affiliation(s)
- Benjamin Wade
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Ryan Pindale
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Joan Camprodon
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- James Luccarelli
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Shuang Li
  - Department of Psychiatry, McLean Hospital, Belmont, MA, USA
- Robert Meisner
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
  - Department of Psychiatry, McLean Hospital, Belmont, MA, USA
- Stephen Seiner
  - Department of Psychiatry, McLean Hospital, Belmont, MA, USA
- Michael Henry
  - Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
8
Heinrich F, Lange TM, Kircher M, Ramzan F, Schmitt AO, Gültas M. Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet Sel Evol 2023; 55:78. [PMID: 37946104 PMCID: PMC10634161 DOI: 10.1186/s12711-023-00853-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Accepted: 11/02/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND: The ever-increasing availability of high-density genomic markers in the form of single nucleotide polymorphisms (SNPs) enables genomic prediction, i.e. the inference of phenotypes based solely on genomic data, in the field of animal and plant breeding, where it has become an important tool. However, given the limited number of individuals, the abundance of variables (SNPs) can reduce the accuracy of prediction models due to overfitting or irrelevant SNPs. Feature selection can help to reduce the number of irrelevant SNPs and increase the model performance. In this study, we investigated an incremental feature selection approach based on ranking the SNPs according to the results of a genome-wide association study that we combined with random forest as a prediction model, and we applied it to several animal and plant datasets. RESULTS: Applying our approach to different datasets yielded a wide range of outcomes, i.e. from a substantial increase in prediction accuracy in a few cases to minor improvements when only a fraction of the available SNPs were used. Compared with models using all available SNPs, our approach was able to achieve comparable performances with a considerably reduced number of SNPs in several cases. Our approach showcased state-of-the-art efficiency and performance while having a faster computation time. CONCLUSIONS: The results of our study suggest that our incremental feature selection approach has the potential to improve prediction accuracy substantially. However, this gain seems to depend on the genomic data used. Even for datasets where the number of markers is smaller than the number of individuals, feature selection may still increase the performance of the genomic prediction. Our approach is implemented in R and is available at https://github.com/FelixHeinrich/GP_with_IFS/ .
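The abstract above describes ranking SNPs by genome-wide association results and then adding them incrementally to a prediction model. The general idea can be sketched as below. This is an illustrative sketch only, not the authors' R implementation linked above: the function name, the `step` parameter, the toy p-values, and the toy scoring function are all hypothetical. In practice `evaluate` would be something like the cross-validated accuracy of a random forest trained on the given subset.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Grow the feature set along a fixed ranking and keep the best-scoring prefix.

    ranked_features: feature names ordered from most to least promising.
    evaluate: callable mapping a list of features to a score (higher is better).
    step: number of features added per iteration; if it does not divide the
          number of features, the trailing remainder is never evaluated.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Hypothetical association p-values (smaller means a stronger association).
p_values = {"snp_a": 1e-8, "snp_b": 3e-5, "snp_c": 0.02, "snp_d": 0.4, "snp_e": 0.9}
ranked = sorted(p_values, key=p_values.get)


def toy_evaluate(subset):
    # Stand-in for model performance: informative SNPs help, others add noise.
    informative = {"snp_a", "snp_b", "snp_c"}
    gain = sum(1.0 for s in subset if s in informative)
    penalty = 0.5 * sum(1 for s in subset if s not in informative)
    return gain - penalty


subset, score = incremental_feature_selection(ranked, toy_evaluate)
print(subset, score)
```

Because only prefixes of the ranking are evaluated, the number of model fits grows linearly with the number of features divided by `step`, rather than exponentially as in an exhaustive subset search.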
Affiliation(s)
- Felix Heinrich
  - Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Thomas Martin Lange
  - Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Magdalena Kircher
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Bünteweg 17p, 30559, Hannover, Germany
- Faisal Ramzan
  - Institute of Animal and Dairy Sciences, University of Agriculture Faisalabad, Jail Road, 38000, Faisalabad, Pakistan
- Armin Otto Schmitt
  - Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
  - Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
- Mehmet Gültas
  - Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
  - Faculty of Agriculture, South Westphalia University of Applied Sciences, 59494, Soest, Germany
9
Xu W, Sampson M. Prenatal and Childbirth Risk Factors of Postpartum Pain and Depression: A Machine Learning Approach. Matern Child Health J 2023; 27:286-296. [PMID: 36526882 DOI: 10.1007/s10995-022-03532-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Accepted: 09/08/2022] [Indexed: 12/23/2022]
Abstract
OBJECTIVES: About 74.91% of U.S. mothers experience postpartum pain at 6 to 10 weeks postpartum, and one in seven U.S. mothers suffer from postpartum depression. We used machine learning to explore physical, psychological, and social factors during pregnancy and childbirth and identify the most important predictors of postpartum pain and depression. METHODS: Data were from the Listening To Mothers III survey (2012), a nationally representative sample of postpartum mothers. We randomly split the dataset into a training set (N = 1467) and a test set (N = 723). The final models included 34 risk factors identified from previous literature. Postpartum pain was measured as "to what extent the pain interferes with mothers' daily life". PHQ2 scores measured depression. We used the random forest model, an aggregate of many regression trees, to accommodate potential nonlinear/interaction effects. RESULTS: In the test data set, our models explained 15.8% of the variance in pain and 27.1% of the variance in depression. The model's strongest predictors for postpartum pain were Cesarean delivery, holding back while communicating with providers, non-use of pain relief medications, and perceived discrimination. For depression scores, the model's strongest predictors included needing help for depression during pregnancy, perceived discrimination, holding back, gestational diabetes, and pain. CONCLUSIONS FOR PRACTICE: Mental and physical health are intertwined and should be considered integratively in the perinatal period. In addition, practitioners should be aware of the importance of the patient-provider relationship, which predicts postpartum health both independently and in interaction with other risk factors.
Affiliation(s)
- Wen Xu
  - Graduate College of Social Work, University of Houston, Houston, USA
- McClain Sampson
  - Graduate College of Social Work, University of Houston, Houston, USA
10
Scornet E. Trees, forests, and impurity-based variable importance in regression. ANNALES DE L'INSTITUT HENRI POINCARÉ, PROBABILITÉS ET STATISTIQUES 2023. [DOI: 10.1214/21-aihp1240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Affiliation(s)
- Erwan Scornet
  - Centre de Mathématiques Appliquées, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris, Palaiseau, France
11
Hapfelmeier A, Hornung R, Haller B. Efficient permutation testing of variable importance measures by the example of random forests. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2022.107689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/08/2023]
12
Zhang B, Wang H. Exploring the advantages of the maximum entropy model in calibrating cellular automata for urban growth simulation: a comparative study of four methods. GISCIENCE & REMOTE SENSING 2022; 59:71-95. [DOI: 10.1080/15481603.2021.2016240] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/14/2021] [Accepted: 11/25/2021] [Indexed: 09/01/2023]
Affiliation(s)
- Bin Zhang
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
- Haijun Wang
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
  - Key Laboratory of Geographic Information System of MOE, Wuhan University, Wuhan, China
13
Mallioris P, Teunis G, Lagerweij G, Joosten P, Dewulf J, Wagenaar JA, Stegeman A, Mughini-Gras L. Biosecurity and antimicrobial use in broiler farms across nine European countries: toward identifying farm-specific options for reducing antimicrobial usage. Epidemiol Infect 2022; 151:e13. [PMID: 36573356 PMCID: PMC9990406 DOI: 10.1017/s0950268822001960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Revised: 12/05/2022] [Accepted: 12/18/2022] [Indexed: 12/28/2022] Open
Abstract
Broiler chickens are among the main livestock sectors worldwide. With individual treatments being inapplicable, contrary to many other animal species, the need for antimicrobial use (AMU) is relatively high. AMU in animals is known to drive the emergence and spread of antimicrobial resistance (AMR). High farm biosecurity is a cornerstone for animal health and welfare, as well as food safety, as it protects animals from the introduction and spread of pathogens and therefore the need for AMU. The goal of this study was to identify the main biosecurity practices associated with AMU in broiler farms and to develop a statistical model that produces customised recommendations as to which biosecurity measures could be implemented on a farm to reduce its AMU, including a cost-effectiveness analysis of the recommended measures. AMU and biosecurity data were obtained cross-sectionally in 2014 from 181 broiler farms across nine European countries (Belgium, Bulgaria, Denmark, France, Germany, Italy, the Netherlands, Poland and Spain). Using mixed-effects random forest analysis (Mix-RF), recursive feature elimination was implemented to determine the biosecurity measures that best predicted AMU at the farm level. Subsequently, an algorithm was developed to generate AMU reduction scenarios based on the implementation of these measures. In the final Mix-RF model, 21 factors were present: 10 about internal biosecurity, 8 about external biosecurity and 3 about farm size and productivity, with the latter showing the largest (Gini) importance. Other AMU predictors, in order of importance, were the number of depopulation steps, compliance with a vaccination protocol for non-officially controlled diseases, and requiring visitors to check in before entering the farm. K-means clustering on the proximity matrix of the final Mix-RF model revealed that several measures interacted with each other, indicating that high AMU levels can arise for various reasons depending on the situation. 
The algorithm utilised the AMU predictive power of biosecurity measures while accounting also for their interactions, representing a first step toward aiding the decision-making process of veterinarians and farmers who are in need of implementing on-farm biosecurity measures to reduce their AMU.
Affiliation(s)
- Panagiotis Mallioris
- Division of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
| | - Gijs Teunis
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, the Netherlands
| | - Giske Lagerweij
- National Institute for Public Health and the Environment, Centre for Infectious Disease Control, Bilthoven, the Netherlands
| | - Philip Joosten
- Veterinary Epidemiology Unit, Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Ghent, Belgium
| | - Jeroen Dewulf
- Veterinary Epidemiology Unit, Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Ghent, Belgium
| | - Jaap A. Wagenaar
- Division of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
| | - Arjan Stegeman
- Division of Farm Animal Health, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
| | - Lapo Mughini-Gras
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, the Netherlands
- National Institute for Public Health and the Environment, Centre for Infectious Disease Control, Bilthoven, the Netherlands
| |
|
14
|
Jardillier R, Koca D, Chatelain F, Guyon L. Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. Genes (Basel) 2022; 13:2275. [PMID: 36553544 PMCID: PMC9777708 DOI: 10.3390/genes13122275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 11/18/2022] [Accepted: 11/23/2022] [Indexed: 12/12/2022] Open
Abstract
(1) Background: tumor profiling enables patient survival prediction. The two essential parameters to be calibrated when designing a study based on tumor profiles from a cohort are the sequencing depth of RNA-seq technology and the number of patients. This calibration is carried out under cost constraints, and a compromise has to be found. In the context of survival data, the goal of this work is to benchmark the impact of the number of patients and of the sequencing depth of miRNA-seq and mRNA-seq on the predictive capabilities for both the Cox model with elastic net penalty and random survival forest. (2) Results: we first show that the Cox model and random survival forest provide comparable prediction capabilities, with significant differences for some cancers. Second, we demonstrate that miRNA and/or mRNA data improve prediction over clinical data alone. mRNA-seq data leads to slightly better prediction than miRNA-seq, with the notable exception of lung adenocarcinoma for which the tumor miRNA profile shows higher predictive power. Third, we demonstrate that the sequencing depth of RNA-seq data can be reduced for most of the investigated cancers without degrading the prediction abilities, allowing the creation of independent validation sets at a lower cost. Finally, we show that the number of patients in the training dataset can be reduced for the Cox model and random survival forest, allowing the use of different models on different patient subgroups.
Affiliation(s)
- Rémy Jardillier
- Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
- Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
| | - Dzenis Koca
- Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
| | - Florent Chatelain
- Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
| | - Laurent Guyon
- Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
| |
|
15
|
Heindel P, Dey T, Feliz JD, Hentschel DM, Bhatt DL, Al-Omran M, Belkin M, Ozaki CK, Hussain MA. Predicting radiocephalic arteriovenous fistula success with machine learning. NPJ Digit Med 2022; 5:160. [PMID: 36280681 PMCID: PMC9592575 DOI: 10.1038/s41746-022-00710-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/20/2022] [Accepted: 10/10/2022] [Indexed: 11/09/2022] Open
Abstract
After creation of a new arteriovenous fistula (AVF), assessment of readiness for use is an important clinical task. Accurate prediction of successful use is challenging, and augmentation of the physical exam with ultrasound has become routine. Herein, we propose a point-of-care tool based on machine learning to enhance prediction of successful unassisted radiocephalic arteriovenous fistula (AVF) use. Our analysis includes pooled patient-level data from 704 patients undergoing new radiocephalic AVF creation, eligible for hemodialysis, and enrolled in the 2014-2019 international multicenter PATENCY-1 or PATENCY-2 randomized controlled trials. The primary outcome being predicted is successful unassisted AVF use within 1-year, defined as 2-needle cannulation for hemodialysis for ≥90 days without preceding intervention. Logistic, penalized logistic (lasso and elastic net), decision tree, random forest, and boosted tree classification models were built with a training, tuning, and testing paradigm using a combination of baseline clinical characteristics and 4-6 week ultrasound parameters. Performance assessment includes receiver operating characteristic curves, precision-recall curves, calibration plots, and decision curves. All modeling approaches except the decision tree have similar discrimination performance and comparable net-benefit (area under the ROC curve 0.78-0.81, accuracy 69.1-73.6%). Model performance is superior to Kidney Disease Outcome Quality Initiative and University of Alabama at Birmingham ultrasound threshold criteria. The lasso model is presented as the final model due to its parsimony, retaining only 3 covariates: larger outflow vein diameter, higher flow volume, and absence of >50% luminal stenosis. A point-of-care online calculator is deployed to facilitate AVF assessment in the clinic.
Affiliation(s)
- Patrick Heindel
- Division of Vascular and Endovascular Surgery, Department of Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Center for Surgery and Public Health, Brigham and Women’s Hospital, Boston, MA, USA
| | - Tanujit Dey
- Center for Surgery and Public Health, Brigham and Women’s Hospital, Boston, MA, USA
| | - Jessica D. Feliz
- Division of Vascular and Endovascular Surgery, Department of Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Center for Surgery and Public Health, Brigham and Women’s Hospital, Boston, MA, USA
| | - Dirk M. Hentschel
- Division of Renal Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
| | - Deepak L. Bhatt
- Division of Cardiovascular Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
| | - Mohammed Al-Omran
- Division of Vascular Surgery and Li Ka Shing Knowledge Institute, St. Michael’s Hospital, University of Toronto, Toronto, ON, Canada
- Department of Surgery, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
| | - Michael Belkin
- Division of Vascular and Endovascular Surgery, Department of Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
| | - C. Keith Ozaki
- Division of Vascular and Endovascular Surgery, Department of Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
| | - Mohamad A. Hussain
- Division of Vascular and Endovascular Surgery, Department of Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Center for Surgery and Public Health, Brigham and Women’s Hospital, Boston, MA, USA
| |
|
16
|
Culture and COVID-19-related mortality: a cross-sectional study of 50 countries. J Public Health Policy 2022; 43:413-430. [PMID: 35995942 PMCID: PMC9395903 DOI: 10.1057/s41271-022-00363-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 07/26/2022] [Indexed: 11/26/2022]
Abstract
Using a cross-sectional sample of 50 countries we investigate the influence of Hofstede’s six-dimensions of culture on COVID-19 related mortality. A multivariable regression model was fitted that controls for health-related, economic- and policy-related variables that have been found to be associated with mortality. We included the percentage of population aged 65 and above, the prevalence of relevant co-morbidities, and tobacco use as health-related variables. Economic variables were GDP, and the connectedness of a country. As policy variables, the Oxford Stringency Index as well as stringency speed, and the Global Health Security Index were used. We also describe the importance of the variables by means of a random forest model. The results suggest that individualistic societies are associated with lower COVID-19-related mortality rates. This finding contradicts previous studies that supported the popular narrative that collectivistic societies with an obedient population are better positioned to manage the pandemic.
|
17
|
Hornung R, Boulesteix AL. Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
|
18
|
|
19
|
A Novel Algorithm to Estimate the Significance Level of a Feature Interaction Using the Extreme Gradient Boosting Machine. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:2338. [PMID: 35206527 PMCID: PMC8871671 DOI: 10.3390/ijerph19042338] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/04/2022] [Revised: 02/16/2022] [Accepted: 02/17/2022] [Indexed: 02/04/2023]
Abstract
Recent studies have revealed the importance of the interaction effect in cardiac research. An analysis would lead to an erroneous conclusion when the approach failed to tackle a significant interaction. Regression models deal with interaction by adding the product of the two interactive variables. Thus, statistical methods could evaluate the significance and contribution of the interaction term. However, machine learning strategies could not provide the p-value of specific feature interaction. Therefore, we propose a novel machine learning algorithm to assess the p-value of a feature interaction, named the extreme gradient boosting machine for feature interaction (XGB-FI). The first step incorporates the concept of statistical methodology by stratifying the original data into four subgroups according to the two interactive features. The second step builds four XGB machines with cross-validation techniques to avoid overfitting. The third step calculates a newly defined feature interaction ratio (FIR) for all possible combinations of predictors. Finally, we calculate the empirical p-value according to the FIR distribution. Computer simulation studies compared the XGB-FI with the multiple regression model with an interaction term. The results showed that the type I error of XGB-FI is valid under the nominal level of 0.05 when there is no interaction effect. The power of XGB-FI is consistently higher than the multiple regression model in all scenarios we examined. In conclusion, the new machine learning algorithm outperforms the conventional statistical model when searching for an interaction.
|
20
|
Productivity-Based Land Suitability and Management Sensitivity Analysis: The Eucalyptus E. urophylla × E. grandis Case. FORESTS 2022. [DOI: 10.3390/f13020340] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Eucalyptus plantations are productive and short rotation forests prevalent in tropical areas that experience fast expansion and face controversies in ecological issues. In this study, we perform a systematic analysis of factors influencing eucalyptus growth through plot records from the National Forest Inventories and satellite images. We find primary restricting factors for eucalyptus growth via machine learning algorithms with random forests and accumulated local effects plots, as conventional forest growth models are inadequate to calculate the causal effect with the large number of environmental and socioeconomic factors. As a result, despite common belief that temperature affects eucalyptus growth the most, we find that precipitation is the most evident restricting factor for eucalyptus growth. We then identify and rank key factors that affect timber growth, such as tree density, rotation period, and wood ownership. Finally, we suggest optimal management and planting strategies for local farmers and policymakers to facilitate eucalyptus growth.
|
21
|
Nasejje JB, Mbuvha R, Mwambi H. Use of a deep learning and random forest approach to track changes in the predictive nature of socioeconomic drivers of under-5 mortality rates in sub-Saharan Africa. BMJ Open 2022; 12:e049786. [PMID: 35177443 PMCID: PMC8860054 DOI: 10.1136/bmjopen-2021-049786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2021] [Accepted: 01/13/2022] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVES We used machine learning algorithms to track how the ranks of importance and the survival outcome of four socioeconomic determinants (place of residence, mother's level of education, wealth index and sex of the child) of under-5 mortality rate (U5MR) in sub-Saharan Africa have evolved. SETTINGS This work consists of multiple cross-sectional studies. We analysed data from the Demographic Health Surveys (DHS) collected from four countries; Uganda, Zimbabwe, Chad and Ghana, each randomly selected from the four subregions of sub-Saharan Africa. PARTICIPANTS Each country has multiple DHS datasets and a total of 11 datasets were selected for analysis. A total of n=85 688 children were drawn from the eleven datasets. PRIMARY AND SECONDARY OUTCOMES The primary outcome variable is U5MR; the secondary outcomes were to obtain the ranks of importance of the four socioeconomic factors over time and to compare the two machine learning models, the random survival forest (RSF) and the deep survival neural network (DeepSurv) in predicting U5MR. RESULTS Mother's education level ranked first in five datasets. Wealth index ranked first in three, place of residence ranked first in two and sex of the child ranked last in most of the datasets. The four factors showed a favourable survival outcome over time, confirming that past interventions targeting these factors are yielding positive results. The DeepSurv model has a higher predictive performance with mean concordance indexes (between 67% and 80%), above 50% compared with the RSF model. CONCLUSIONS The study reveals that children under the age of 5 in sub-Saharan Africa have favourable survival outcomes associated with the four socioeconomic factors over time. It also shows that deep survival neural network models are efficient in predicting U5MR and should, therefore, be used in the big data era to draft evidence-based policies to achieve the third sustainable development goal.
Affiliation(s)
- Justine B Nasejje
- Statistics and Actuarial Science, University of the Witwatersrand, Johannesburg-Braamfontein, South Africa
| | - Rendani Mbuvha
- Statistics and Actuarial Science, University of the Witwatersrand, Johannesburg-Braamfontein, South Africa
| | - Henry Mwambi
- School of Mathematics, Statistics and Computer Science, University of Kwazulu-Natal, Pietermaritzburg, South Africa
| |
|
22
|
Walakira A, Ocira J, Duroux D, Fouladi R, Moškon M, Rozman D, Van Steen K. Detecting gene-gene interactions from GWAS using diffusion kernel principal components. BMC Bioinformatics 2022; 23:57. [PMID: 35105309 PMCID: PMC8805268 DOI: 10.1186/s12859-022-04580-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/30/2021] [Accepted: 01/18/2022] [Indexed: 11/10/2022] Open
Abstract
Genes and gene products do not function in isolation but as components of complex networks of macromolecules through physical or biochemical interactions. Dependencies of gene mutations on genetic background (i.e., epistasis) are believed to play a role in understanding molecular underpinnings of complex diseases such as inflammatory bowel disease (IBD). However, the process of identifying such interactions is complex due to for instance the curse of high dimensionality, dependencies in the data and non-linearity. Here, we propose a novel approach for robust and computationally efficient epistasis detection. We do so by first reducing dimensionality, per gene via diffusion kernel principal components (kpc). Subsequently, kpc gene summaries are used for downstream analysis including the construction of a gene-based epistasis network. We show that our approach is not only able to recover known IBD associated genes but also additional genes of interest linked to this difficult gastrointestinal disease.
Affiliation(s)
- Andrew Walakira
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Junior Ocira
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
| | - Diane Duroux
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
| | - Ramouna Fouladi
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
| | - Miha Moškon
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | - Damjana Rozman
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Kristel Van Steen
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- BIO3 - Laboratory for Systems Medicine, Department of Human Genetics, KU Leuven, Leuven, Belgium
| |
|
23
|
Zhang L, Wang Y, Chen J, Chen J. RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered Signals. Front Genet 2022; 12:749573. [PMID: 35140735 PMCID: PMC8819960 DOI: 10.3389/fgene.2021.749573] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2021] [Accepted: 11/09/2021] [Indexed: 12/31/2022] Open
Abstract
Random forest is considered as one of the most successful machine learning algorithms, which has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test using the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated or not with the outcome variables.
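The permutation-test idea in this abstract (prediction error as the test statistic, labels shuffled to build the null distribution) can be sketched as follows. This is a minimal illustration and not the RFtest implementation: to stay dependency-free it swaps in the leave-one-out error of a nearest-centroid classifier where RFtest uses the generalization error of a random forest, and the names `loo_error` and `permutation_test` are hypothetical.

```python
import random


def loo_error(X, y):
    """Leave-one-out misclassification rate of a nearest-centroid classifier.

    ASSUMPTION: stand-in for the random forest generalization error that
    RFtest uses as its test statistic.
    """
    n, p = len(X), len(X[0])
    errors = 0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue  # hold out sample i
            s = sums.setdefault(y[j], [0.0] * p)
            for k in range(p):
                s[k] += X[j][k]
            counts[y[j]] = counts.get(y[j], 0) + 1
        best_label, best_dist = None, float("inf")
        for label, s in sums.items():
            # squared Euclidean distance from sample i to the class centroid
            dist = sum((X[i][k] - s[k] / counts[label]) ** 2 for k in range(p))
            if dist < best_dist:
                best_label, best_dist = label, dist
        if best_label != y[i]:
            errors += 1
    return errors / n


def permutation_test(X, y, statistic, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value).

    Small prediction error is evidence of association, so the p-value counts
    permutations whose error is at most the observed error.
    """
    rng = random.Random(seed)
    observed = statistic(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any link between features and outcome
        if statistic(X, y_perm) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (made up): two well-separated groups on a single feature.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5],
     [5.0], [5.1], [5.2], [5.3], [5.4], [5.5]]
y = [0] * 6 + [1] * 6
observed_error, p_value = permutation_test(X, y, loo_error)
```

The `+ 1` in the numerator and denominator is the usual correction that keeps a permutation p-value strictly positive.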
Affiliation(s)
- Lujun Zhang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, United States
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, China
| | - Yanshan Wang
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States
| | - Jingwen Chen
- Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China
- *Correspondence: Jingwen Chen; Jun Chen
| | - Jun Chen
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- *Correspondence: Jingwen Chen; Jun Chen
| |
|
24
|
Inglis A, Parnell A, Hurley CB. Visualizing Variable Importance and Variable Interaction Effects in Machine Learning Models. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2021.2007935] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Affiliation(s)
- Alan Inglis
- Hamilton Institute, Maynooth University, Maynooth, Ireland
| | - Andrew Parnell
- Hamilton Institute, Insight Centre for Data Analytics, Maynooth University, Maynooth, Ireland
| | - Catherine B. Hurley
- Department of Mathematics and Statistics, Maynooth University, Maynooth, Ireland
| |
|
25
|
Koch TK, Romero P, Stachl C. Age and gender in language, emoji, and emoticon usage in instant messages. COMPUTERS IN HUMAN BEHAVIOR 2022. [DOI: 10.1016/j.chb.2021.106990] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
|
26
|
Nguyen HKD, Fielding MW, Buettel JC, Brook BW. Predicting spatial and seasonal patterns of wildlife–vehicle collisions in high-risk areas. WILDLIFE RESEARCH 2022. [DOI: 10.1071/wr21018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
27
|
Ohanyan H, Portengen L, Huss A, Traini E, Beulens JWJ, Hoek G, Lakerveld J, Vermeulen R. Machine learning approaches to characterize the obesogenic urban exposome. ENVIRONMENT INTERNATIONAL 2022; 158:107015. [PMID: 34991269 DOI: 10.1016/j.envint.2021.107015] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/18/2021] [Revised: 11/26/2021] [Accepted: 11/29/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND Characteristics of the urban environment may contain upstream drivers of obesity. However, research is lacking that considers the combination of environmental factors simultaneously. OBJECTIVES We aimed to explore what environmental factors of the urban exposome are related to body mass index (BMI), and evaluated the consistency of findings across multiple statistical approaches. METHODS A cross-sectional analysis was conducted using baseline data from 14,829 participants of the Occupational and Environmental Health Cohort study. BMI was obtained from self-reported height and weight. Geocoded exposures linked to individual home addresses (using 6-digit postcode) of 86 environmental factors were estimated, including air pollution, traffic noise, green-space, built environmental and neighborhood socio-demographic characteristics. Exposure-obesity associations were identified using the following approaches: sparse group Partial Least Squares, Bayesian Model Averaging, penalized regression using the Minimax Concave Penalty, Generalized Additive Model-based boosting, Random Forest, Extreme Gradient Boosting, and Multiple Linear Regression, as the most conventional approach. The models were adjusted for individual socio-demographic variables. Environmental factors were ranked according to variable importance scores attributed by each approach and median ranks were calculated across these scores to identify the most consistent associations. RESULTS The most consistent environmental factors associated with BMI were the average neighborhood value of the homes, oxidative potential of particulate matter air pollution (OP), healthy food outlets in the neighborhood (5 km buffer), low-income neighborhoods, and one-person households in the neighborhood.
Higher BMI levels were observed in low-income neighborhoods, with lower average house values, lower share of one-person households, smaller amounts of healthy food retailers and higher OP levels. Across the approaches, we observed consistent patterns of results based on model's capacity to incorporate linear or nonlinear associations. DISCUSSION The pluralistic analysis on environmental obesogens strengthens the existing evidence on the role of neighborhood socioeconomic position, urbanicity and air pollution.
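The median-rank aggregation described in the methods of this abstract can be illustrated with a short sketch. The function names, the variable names, and the importance scores below are all hypothetical and are not taken from the study; ties are broken by input order here, which is an assumption.

```python
from statistics import median


def rank_by_importance(scores):
    """Map each variable to its rank within one approach (1 = most important)."""
    ordered = sorted(scores, key=lambda name: -scores[name])
    return {name: position for position, name in enumerate(ordered, start=1)}


def median_ranks(importance_by_method):
    """Median rank of every variable across all approaches."""
    per_method = [rank_by_importance(s) for s in importance_by_method.values()]
    variables = list(per_method[0])
    return {v: median(r[v] for r in per_method) for v in variables}


# Made-up importance scores from three hypothetical approaches. Scales differ
# between approaches, which is why ranks rather than raw scores are combined.
importance = {
    "method_a": {"house_value": 0.9, "air_pollution": 0.5, "noise": 0.1},
    "method_b": {"house_value": 3.0, "air_pollution": 4.0, "noise": 1.0},
    "method_c": {"house_value": 10, "air_pollution": 7, "noise": 2},
}
consensus = median_ranks(importance)
# consensus == {"house_value": 1, "air_pollution": 2, "noise": 3}
```

Working on ranks makes the aggregation insensitive to the different scales on which each approach reports importance.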
Affiliation(s)
- Haykanush Ohanyan
- Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, Noord-Holland, the Netherlands; Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands; Upstream Team, www.upstreamteam.nl. Amsterdam UMC, VU University Amsterdam, Amsterdam, Noord-Holland, the Netherlands.
| | - Lützen Portengen
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands
| | - Anke Huss
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands
| | - Eugenio Traini
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands
| | - Joline W J Beulens
- Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, Noord-Holland, the Netherlands; Upstream Team, www.upstreamteam.nl. Amsterdam UMC, VU University Amsterdam, Amsterdam, Noord-Holland, the Netherlands; Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, the Netherlands
| | - Gerard Hoek
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands
| | - Jeroen Lakerveld
- Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, Noord-Holland, the Netherlands; Upstream Team, www.upstreamteam.nl. Amsterdam UMC, VU University Amsterdam, Amsterdam, Noord-Holland, the Netherlands
| | - Roel Vermeulen
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Utrecht, the Netherlands
| |
|
28
|
Hornung R. Diversity Forests: Using Split Sampling to Enable Innovative Complex Split Procedures in Random Forests. SN Comput Sci 2021; 3:1. [PMID: 34723205 PMCID: PMC8533673 DOI: 10.1007/s42979-021-00920-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/26/2021] [Accepted: 10/02/2021] [Indexed: 11/24/2022]
Abstract
The diversity forest algorithm is an alternative candidate node split sampling scheme that makes innovative complex split procedures in random forests possible. While conventional univariable, binary splitting suffices for obtaining strong predictive performance, new complex split procedures can help tackling practically important issues. For example, interactions between features can be exploited effectively by bivariable splitting. With diversity forests, each split is selected from a candidate split set that is sampled in the following way: for $l = 1, \dots, \mathit{nsplits}$: (1) sample one split problem; (2) sample a single or few splits from the split problem sampled in (1) and add this or these splits to the candidate split set. The split problems are specifically structured collections of splits that depend on the respective split procedure considered. This sampling scheme makes innovative complex split procedures computationally tangible while avoiding overfitting. Important general properties of the diversity forest algorithm are evaluated empirically using univariable, binary splitting. Based on 220 data sets with binary outcomes, diversity forests are compared with conventional random forests and random forests using extremely randomized trees. It is seen that the split sampling scheme of diversity forests does not impair the predictive performance of random forests and that the performance is quite robust with regard to the specified nsplits value. The recently developed interaction forests are the first diversity forest method that uses a complex split procedure. Interaction forests allow modeling and detecting interactions between features effectively. Further potential complex split procedures are discussed as an outlook.
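The two-step candidate sampling described in this abstract can be sketched for the univariable, binary case. The sketch rests on several assumptions that the abstract does not confirm: a split problem is taken to be "all split points of one feature", exactly one split is drawn per iteration, sampling is with replacement, the outcome is coded 0/1, and the best candidate is chosen by Gini impurity decrease. All function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)


def split_points(values):
    """Midpoints between adjacent distinct values of one feature."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]


def sample_candidate_splits(X, nsplits, rng):
    """Sample the candidate split set as (feature index, threshold) pairs."""
    problems = []
    for j in range(len(X[0])):
        points = split_points([row[j] for row in X])
        if points:  # constant features offer no split problem
            problems.append((j, points))
    candidates = set()
    if not problems:
        return candidates
    for _ in range(nsplits):
        j, points = rng.choice(problems)         # (1) sample one split problem
        candidates.add((j, rng.choice(points)))  # (2) sample one split from it
    return candidates


def best_split(X, y, nsplits, seed=0):
    """Pick the sampled candidate with the largest Gini impurity decrease."""
    rng = random.Random(seed)
    parent = gini(y)
    best, best_gain = None, 0.0
    for j, threshold in sorted(sample_candidate_splits(X, nsplits, rng)):
        left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
        right = [yi for row, yi in zip(X, y) if row[j] > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best_gain:
            best, best_gain = (j, threshold), parent - child
    return best, best_gain


# Toy node (made up): feature 0 separates the classes, feature 1 is constant.
X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [0, 0, 1, 1]
split, gain = best_split(X, y, nsplits=50)
```

Because only `nsplits` candidates are evaluated per node, the cost of a split search no longer grows with the total number of possible splits, which is the property the abstract credits for making complex split procedures feasible.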
Affiliation(s)
- Roman Hornung
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany
|
29
|
Sykes AL, Silva GS, Holtkamp DJ, Mauch BW, Osemeke O, Linhares DCL, Machado G. Interpretable machine learning applied to on-farm biosecurity and porcine reproductive and respiratory syndrome virus. Transbound Emerg Dis 2021; 69:e916-e930. [PMID: 34719136 DOI: 10.1111/tbed.14369] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/14/2021] [Revised: 09/22/2021] [Accepted: 10/24/2021] [Indexed: 11/28/2022]
Abstract
Effective biosecurity practices in swine production are key in preventing the introduction and dissemination of infectious pathogens. Ideally, on-farm biosecurity practices should be chosen by their impact on bio-containment and bio-exclusion; however, quantitative supporting evidence is often unavailable. Therefore, the development of methodologies capable of quantifying and ranking biosecurity practices according to their efficacy in reducing disease risk has the potential to facilitate better-informed choices of biosecurity practices. Using survey data on biosecurity practices, farm demographics, and previous outbreaks from 139 herds, a set of machine learning algorithms was trained to classify farms by porcine reproductive and respiratory syndrome virus status, depending on their biosecurity practices and farm demographics, to produce a predicted outbreak risk. A novel interpretable machine learning toolkit, MrIML-biosecurity, was developed to benchmark farms and production systems by predicted risk and quantify the impact of biosecurity practices on disease risk at individual farms. When variable impact on predicted risk was quantified, 50% of 42 variables were associated with fomite spread while 31% were associated with local transmission. Machine learning interpretations gave similar results, showing substantial contributions to predicted outbreak risk from biosecurity practices relating to the turnover and number of employees, the surrounding density of swine premises and pigs, the sharing of haul trailers, distance from the public road, and farm production type. In addition, the development of individualized biosecurity assessments provides the opportunity to better guide biosecurity implementation on a case-by-case basis. Finally, the flexibility of the MrIML-biosecurity toolkit gives it the potential to be applied to wider areas of biosecurity benchmarking, to address biosecurity weaknesses in other livestock systems and industry-relevant diseases.
Affiliation(s)
- Abagael L Sykes
- Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, North Carolina, USA
| | - Gustavo S Silva
- Veterinary Diagnostic and Production Animal Medicine Department, College of Veterinary Medicine, Iowa State University, Ames, Iowa, USA
| | - Derald J Holtkamp
- Veterinary Diagnostic and Production Animal Medicine Department, College of Veterinary Medicine, Iowa State University, Ames, Iowa, USA
| | - Broc W Mauch
- Veterinary Diagnostic and Production Animal Medicine Department, College of Veterinary Medicine, Iowa State University, Ames, Iowa, USA
| | - Onyekachukwu Osemeke
- Veterinary Diagnostic and Production Animal Medicine Department, College of Veterinary Medicine, Iowa State University, Ames, Iowa, USA
| | - Daniel C L Linhares
- Veterinary Diagnostic and Production Animal Medicine Department, College of Veterinary Medicine, Iowa State University, Ames, Iowa, USA
| | - Gustavo Machado
- Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, North Carolina, USA
|
30
|
DiMucci D, Kon M, Segrè D. BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes. Front Mol Biosci 2021; 8:663532. [PMID: 34222331 PMCID: PMC8245782 DOI: 10.3389/fmolb.2021.663532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2021] [Accepted: 05/24/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
Affiliation(s)
- Demetrius DiMucci
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States.,Biological Design Center, Boston University, Boston, MA, United States
| | - Mark Kon
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States.,Department of Mathematics and Statistics, Boston University, Boston, MA, United States
| | - Daniel Segrè
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States.,Biological Design Center, Boston University, Boston, MA, United States.,Department of Biology, Boston University, Boston, MA, United States.,Department of Biomedical Engineering, Boston University, Boston, MA, United States.,Department of Physics, Boston University, Boston, MA, United States
|
31
|
Hamlet A, Ramos DG, Gaythorpe KAM, Romano APM, Garske T, Ferguson NM. Seasonality of agricultural exposure as an important predictor of seasonal yellow fever spillover in Brazil. Nat Commun 2021; 12:3647. [PMID: 34131128 PMCID: PMC8206143 DOI: 10.1038/s41467-021-23926-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/12/2019] [Accepted: 05/24/2021] [Indexed: 01/04/2023] Open
Abstract
Yellow fever virus (YFV) is a zoonotic arbovirus affecting both humans and non-human primates (NHPs) in Africa and South America. Previous descriptions of YF's seasonality have relied purely on climatic explanations, despite the high proportion of cases occurring in people involved in agriculture. We use a series of random forest classification models to predict the monthly occurrence of YF in humans and NHPs across Brazil, by fitting four classes of covariates related to the seasonality of climate and agriculture (planting and harvesting), crop output and host demography. We find that models captured seasonal YF reporting in humans and NHPs when they considered seasonality of agriculture rather than climate, particularly for monthly aggregated reports. These findings illustrate the seasonality of exposure, through agriculture, as a component of zoonotic spillover. Additionally, by highlighting crop types and anthropogenic seasonality, these results could directly identify areas at highest risk of zoonotic spillover.
Affiliation(s)
- Arran Hamlet
- MRC Centre for Global Infectious Disease Analysis; and the Abdul Latif Jameel Institute for Disease and Emergency Analytics, School of Public Health, Imperial College London, London, UK.
| | | | - Katy A M Gaythorpe
- MRC Centre for Global Infectious Disease Analysis; and the Abdul Latif Jameel Institute for Disease and Emergency Analytics, School of Public Health, Imperial College London, London, UK
| | | | - Tini Garske
- MRC Centre for Global Infectious Disease Analysis; and the Abdul Latif Jameel Institute for Disease and Emergency Analytics, School of Public Health, Imperial College London, London, UK
| | - Neil M Ferguson
- MRC Centre for Global Infectious Disease Analysis; and the Abdul Latif Jameel Institute for Disease and Emergency Analytics, School of Public Health, Imperial College London, London, UK
|
32
|
Askland KD, Strong D, Wright MN, Moore JH. The Translational Machine: A novel machine-learning approach to illuminate complex genetic architectures. Genet Epidemiol 2021; 45:485-536. [PMID: 33942369 DOI: 10.1002/gepi.22383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2020] [Revised: 03/05/2021] [Accepted: 03/23/2021] [Indexed: 11/08/2022]
Abstract
The Translational Machine (TM) is a machine learning (ML)-based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population substructure on outcome prediction. The TM consists of three main components. First, replicable but flexible feature engineering procedures translate genome-scale data into biologically informative features that appropriately contextualize simple variant calls/genotypes within biological and functional contexts. Second, model-free, nonparametric ML-based feature filtering procedures empirically reduce dimensionality and noise of both original genotype calls and engineered features. Third, a powerful ML algorithm for feature selection is used to differentiate risk variant contributions across variant frequency and functional prediction spectra. The TM simultaneously evaluates potential contributions of variants operative under polygenic and heterogeneous models of genetic architecture. Our TM enables integration of biological information (e.g., genomic annotations) within conceptual frameworks akin to geneset-/pathways-based and collapsing methods, but overcomes some of these methods' limitations. The full TM pipeline is executed in R. Our approach and initial findings from its application to a whole-exome schizophrenia case-control data set are presented. These TM procedures extend the findings of the primary investigation and yield novel results.
Affiliation(s)
- Kathleen D Askland
- Waypoint Centre for Mental Health Care Penetanguishene, University of Toronto, Toronto, Ontario, Canada
| | - David Strong
- Department of Family Medicine and Public Health, University of California San Diego, San Diego, California, USA
| | - Marvin N Wright
- Department Biometry and Data Management, Leibniz Institute for Prevention Research and Epidemiology - BIPS GmbH, Germany
| | - Jason H Moore
- Department of Biostatistics, Epidemiology, & Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
|
33
|
Epistasis Analysis: Classification Through Machine Learning Methods. Methods Mol Biol 2021. [PMID: 33733366 DOI: 10.1007/978-1-0716-0947-7_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Complex diseases differ from Mendelian disorders. Their development usually involves the interaction of multiple genes or the interaction between genes and the environment (i.e. epistasis). Although high-throughput sequencing technologies for complex diseases have produced a large amount of data, analyzing these data is extremely difficult because of the high feature dimensionality and the combinatorial nature of epistasis analysis. In this work, we introduce machine learning methods to effectively reduce the gene dimensionality, retain the key epistatic effects, and effectively characterize the relationship between epistatic effects and complex diseases.
|
34
|
García de la Garza Á, Blanco C, Olfson M, Wall MM. Identification of Suicide Attempt Risk Factors in a National US Survey Using Machine Learning. JAMA Psychiatry 2021; 78:398-406. [PMID: 33404590 PMCID: PMC7788508 DOI: 10.1001/jamapsychiatry.2020.4165] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Indexed: 12/11/2022]
Abstract
IMPORTANCE Because more than one-third of people making nonfatal suicide attempts do not receive mental health treatment, it is essential to extend suicide attempt risk factors beyond high-risk clinical populations to the general adult population. OBJECTIVE To identify future suicide attempt risk factors in the general population using a data-driven machine learning approach including more than 2500 questions from a large, nationally representative survey of US adults. DESIGN, SETTING, AND PARTICIPANTS Data came from wave 1 (2001 to 2002) and wave 2 (2004 to 2005) of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). NESARC is a face-to-face longitudinal survey conducted with a nationally representative sample of the noninstitutionalized civilian population 18 years and older in the US. The cumulative response rate across both waves was 70.2%, resulting in 34 653 wave 2 interviews. A balanced random forest was trained using cross-validation to develop a suicide attempt risk model. Out-of-fold model prediction was used to assess model performance, including the area under the receiver operating characteristic curve, sensitivity, and specificity. Survey design and nonresponse weights allowed estimates to be representative of the US civilian population based on the 2000 census. Analyses were performed between May 15, 2019, and June 10, 2020. MAIN OUTCOMES AND MEASURES Attempted suicide in the 3 years between wave 1 and wave 2 interviews. RESULTS Of 34 653 participants, 20 089 were female (weighted proportion, 52.1%). The weighted mean (SD) age was 45.1 (17.3) years at wave 1 and 48.2 (17.3) years at wave 2. Attempted suicide during the 3 years between wave 1 and wave 2 interviews was self-reported by 222 of 34 653 participants (0.6%). Using survey questions measured at wave 1, the suicide attempt risk model yielded a cross-validated area under the receiver operating characteristic curve of 0.857 with a sensitivity of 85.3% (95% CI, 79.8-89.7) and a specificity of 73.3% (95% CI, 72.8-73.8) at an optimized threshold. The model identified 1.8% of the US population to be at a 10% or greater risk of suicide attempt. The most important risk factors were 3 questions about previous suicidal ideation or behavior; 3 items from the 12-Item Short Form Health Survey, namely feeling downhearted, doing activities less carefully, or accomplishing less because of emotional problems; younger age; lower educational achievement; and recent financial crisis. CONCLUSIONS AND RELEVANCE In this study, after searching through more than 2500 survey questions, several well-known risk factors of suicide attempt were confirmed, such as previous suicidal behaviors and ideation, and new risks were identified, including functional impairment resulting from mental disorders and socioeconomic disadvantage. These results may help guide future clinical assessment and the development of new suicide risk scales.
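The sensitivity and specificity "at an optimized threshold" reported above can be illustrated with a small sketch. The abstract does not state how the threshold was optimized, so the use of Youden's J statistic here is an assumption, as are the function names and the toy scores. Survey weighting is not modeled.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Classify score >= threshold as positive and return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        positive = s >= threshold
        if y == 1 and positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative case are present.
    return tp / (tp + fn), tn / (tn + fp)

def youden_threshold(y_true, scores):
    """Pick the observed score that maximizes sensitivity + specificity - 1 (assumed criterion)."""
    def j_statistic(t):
        sens, spec = sensitivity_specificity(y_true, scores, t)
        return sens + spec - 1
    return max(sorted(set(scores)), key=j_statistic)

# Hypothetical out-of-fold risk scores for five participants.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
threshold = youden_threshold(y_true, scores)
sens, spec = sensitivity_specificity(y_true, scores, threshold)
```

With a rare outcome such as the 0.6% attempt rate reported here, the chosen threshold trades a modest specificity for high sensitivity, which is why both figures are reported together.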
Affiliation(s)
| | - Carlos Blanco
- Division of Epidemiology, Services and Prevention Research, National Institute on Drug Abuse, Bethesda, Maryland
| | - Mark Olfson
- Department of Psychiatry, New York State Psychiatric Institute, Columbia University Medical Center, New York
| | - Melanie M. Wall
- Department of Biostatistics, Columbia University, New York, New York,Department of Psychiatry, New York State Psychiatric Institute, Columbia University Medical Center, New York
|
35
|
Gola D, König IR. Empowering individual trait prediction using interactions for precision medicine. BMC Bioinformatics 2021; 22:74. [PMID: 33602124 PMCID: PMC7890638 DOI: 10.1186/s12859-021-04011-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/10/2019] [Accepted: 02/08/2021] [Indexed: 11/11/2022] Open
Abstract
Background One component of precision medicine is to construct prediction models with predictive ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal and machine learning methods are appealing to make individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One big class of algorithms to detect interacting features is based on the multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction-empowered individual prediction. Results Using a comprehensive simulation study we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR). Conclusions The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.
Affiliation(s)
- Damian Gola
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.
|
36
|
Orlenko A, Moore JH. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData Min 2021; 14:9. [PMID: 33514397 PMCID: PMC7847145 DOI: 10.1186/s13040-021-00243-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/01/2020] [Accepted: 01/13/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies where results of predictive modeling could be used to define the future direction of the research efforts. An alternative way to interpret the model is with a permutation feature importance metric, which employs a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, or with Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis. RESULTS To address this issue, and to improve interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. As a result, we detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions. CONCLUSIONS By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
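The permutation feature importance metric described in the abstract can be sketched as follows. This is a generic illustration and not the authors' code: the `predict` function is a hypothetical stand-in for a fitted Random Forest, accuracy is assumed as the performance measure, and the toy outcome is a pure XOR (non-additive) interaction between two features plus one uninformative feature.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the model reproduces the label."""
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats, rng):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: the outcome is feature 0 XOR feature 1; feature 2 is noise.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
y = [row[0] ^ row[1] for row in X]

def predict(row):
    # Hypothetical stand-in for a model that has learned the interaction.
    return row[0] ^ row[1]

importances = permutation_importance(predict, X, y, n_repeats=10, rng=random.Random(0))
```

In this toy setting neither interacting feature has a marginal effect on its own, yet both receive a positive importance because permuting either one degrades predictions, while the noise feature scores exactly zero.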
Affiliation(s)
- Alena Orlenko
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
|
37
|
Martins AS, Neves LA, de Faria PR, Tosta TAA, Longo LC, Silva AB, Roberto GF, do Nascimento MZ. A Hermite polynomial algorithm for detection of lesions in lymphoma images. Pattern Anal Appl 2020. [DOI: 10.1007/s10044-020-00927-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/29/2022]
|
38
|
Affiliation(s)
- Tim C. D. Lucas
- Big Data Institute, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
|
39
|
McWilliam A, Khalifa J, Vasquez Osorio E, Banfill K, Abravan A, Faivre-Finn C, van Herk M. Novel Methodology to Investigate the Effect of Radiation Dose to Heart Substructures on Overall Survival. Int J Radiat Oncol Biol Phys 2020; 108:1073-1081. [PMID: 32585334 DOI: 10.1016/j.ijrobp.2020.06.031] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Received: 01/22/2020] [Revised: 05/18/2020] [Accepted: 06/17/2020] [Indexed: 12/25/2022]
Abstract
PURPOSE For patients with lung cancer treated with radiation therapy, a dose to the heart is associated with excess mortality; however, it is often not feasible to spare the whole heart. Our aim is to define cardiac substructures and dose thresholds that optimally reduce early mortality. METHODS AND MATERIALS Fourteen cardiac substructures were delineated on 5 template patients with representative anatomies. One thousand one hundred sixty-one patients with non-small cell lung cancer were registered nonrigidly to these 5 template anatomies, and their radiation therapy doses were mapped. Mean and maximum dose to each substructure were extracted, and the means were evaluated as input to prediction models. The cohort was bootstrapped into 2 variable reduction techniques: elastic net least absolute shrinkage and selection operator and the random survival forest model. Each method was optimized to extract variables contributing most to overall survival, and model coefficients were evaluated to select these substructures. The most important variables common to both models were selected and evaluated in multivariable Cox proportional hazards models. A threshold dose was defined, and Kaplan-Meier survival curves were plotted. RESULTS Nine hundred seventy-eight patients remained after visual quality assurance of the registration. Ranking the model coefficients across the bootstraps selected the maximum dose to the right atrium, right coronary artery, and ascending aorta as the most important factors associated with survival. The maximum dose to the combined cardiac region was significant in the multivariable model, with a hazard ratio of 1.01/Gy and P = .03 after accounting for tumor volume (P < .001), N stage (P < .01), and performance status (P = .01). The optimal threshold for the maximum dose, equivalent dose in 2-Gy fractions, was 23 Gy. Kaplan-Meier survival curves showed a significant split (log-rank P = .008). CONCLUSIONS The maximum dose to the combined cardiac region encompassing the right atrium, right coronary artery, and ascending aorta was found to have the greatest effect on patient survival. A maximum equivalent dose in 2-Gy fractions of 23 Gy was identified for consideration as a dose limit in future studies.
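The final step, splitting patients at a dose threshold and comparing Kaplan-Meier curves, can be sketched with the standard product-limit estimator. The patient records below are invented for illustration, the function name is hypothetical, and whether the study grouped patients with ">=" or ">" at 23 Gy is an assumption.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each observed event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Process all subjects sharing this follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # event flag: 1 = death, 0 = censored
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Hypothetical records: (follow-up in months, death observed, max dose in Gy).
patients = [(6, 1, 30.0), (12, 1, 25.0), (18, 0, 28.0),
            (24, 1, 10.0), (30, 0, 15.0), (36, 0, 5.0)]
THRESHOLD_GY = 23.0  # threshold value taken from the abstract

high = [(t, e) for t, e, dose in patients if dose >= THRESHOLD_GY]
low = [(t, e) for t, e, dose in patients if dose < THRESHOLD_GY]
curve_high = kaplan_meier(*zip(*high))
curve_low = kaplan_meier(*zip(*low))
```

A log-rank test, as reported in the abstract, would then compare the two curves; it is omitted here to keep the sketch short.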
Affiliation(s)
- Alan McWilliam
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom.
| | - Jonathan Khalifa
- Department of Radiation Oncology, Institut Universitaire du Cancer de Toulouse, Toulouse, France
| | - Eliana Vasquez Osorio
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom
| | - Kathryn Banfill
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom
| | - Azadeh Abravan
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom
| | - Corinne Faivre-Finn
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom
| | - Marcel van Herk
- Division of Clinical Cancer Science, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Department of Radiotherapy Related Research, The Christie NHS Foundation Trust, Manchester, United Kingdom
|
40
|
Malten J, König IR. Modified entropy-based procedure detects gene-gene-interactions in unconventional genetic models. BMC Med Genomics 2020; 13:65. [PMID: 32326960 PMCID: PMC7181579 DOI: 10.1186/s12920-020-0703-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/19/2019] [Accepted: 03/13/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Since it is assumed that genetic interactions play an important role in understanding the mechanisms of complex diseases, different statistical approaches have been suggested in recent years for this task. One interesting approach is the entropy-based IGENT method by Kwon et al. that promises an efficient detection of main effects and interaction effects simultaneously. However, a modification is required if the aim is to only detect interaction effects. METHODS Based on the IGENT method, we present a modification that leads to a conditional mutual information based approach under the condition of linkage equilibrium. The modified estimator is investigated in a comprehensive simulation based on five genetic interaction models and applied to real data from the genome-wide association study by the North American Rheumatoid Arthritis Consortium (NARAC). RESULTS The presented modification of IGENT controls the type I error in all simulated constellations. Furthermore, it provides high power for detecting pure interactions specifically on unconventional genetic models both in simulation and real data. CONCLUSIONS The proposed method uses the IGENT software, which is freely available, simple, and fast, and detects pure interactions on unconventional genetic models. Our results demonstrate that this modification is an attractive complement to established analysis methods.
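For readers unfamiliar with conditional mutual information, the quantity can be estimated for discrete variables with a simple plug-in formula. The sketch below is generic and is not the modified IGENT estimator: the abstract does not state which variables are conditioned on, so the roles of `a`, `b`, and `y` (two discrete features and a conditioning variable) and the function name are assumptions.

```python
from collections import Counter
from math import log

def conditional_mutual_information(a, b, y):
    """Plug-in estimate of I(A; B | Y) in nats from three equal-length discrete sequences."""
    n = len(y)
    n_aby = Counter(zip(a, b, y))
    n_ay = Counter(zip(a, y))
    n_by = Counter(zip(b, y))
    n_y = Counter(y)
    cmi = 0.0
    for (ai, bi, yi), count in n_aby.items():
        # p(a,b,y) * log[ p(y) p(a,b,y) / (p(a,y) p(b,y)) ], written with raw counts.
        ratio = count * n_y[yi] / (n_ay[(ai, yi)] * n_by[(bi, yi)])
        cmi += (count / n) * log(ratio)
    return cmi

# Toy check: independent features give 0, identical features give log(2).
independent = conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
identical = conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0])
```

The estimate is zero when the two features carry no information about each other within each level of the conditioning variable, which is the kind of baseline a test for pure interaction needs.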
Affiliation(s)
- Jörg Malten
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany.
|
41
|
Gola D, Erdmann J, Müller‐Myhsok B, Schunkert H, König IR. Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status. Genet Epidemiol 2020; 44:125-138. [DOI: 10.1002/gepi.22279] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 05/29/2019] [Revised: 12/05/2019] [Accepted: 12/23/2019] [Indexed: 02/06/2023]
Affiliation(s)
- Damian Gola
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany
| | | | - Bertram Müller‐Myhsok
- Department of Translational Research in Psychiatry, Max Planck Institute of Psychiatry, Munich, Germany
| | - Heribert Schunkert
- Deutsches Herzzentrum München, Technische Universität München, München, Germany
| | - Inke R. König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany
|
42
|
Abstract
There has been considerable development in machine learning in recent years with some remarkable successes. Although there are many high-performance methods, the interpretation of learning models remains challenging. Understanding the underlying theory behind the specific prediction of various models is difficult. Various studies have attempted to explain the working principle behind learning models using techniques like feature importance, partial dependency, feature interaction, and the Shapley value. This study introduces a new feature interaction measure. While recent studies have measured feature interaction using partial dependency, this study redefines feature interaction in terms of prediction performance. The proposed measure is easy to interpret, faster than partial dependency-based measures, and useful to explain feature interaction, which affects prediction performance in both regression and classification models.
|
43
|
Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem is elusive and non-trivial.
Affiliation(s)
- Nastasiya F. Grinberg
- School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK
- Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK
| | | | - Ross D. King
- Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
|
44
|
Tietz T, Selinski S, Golka K, Hengstler JG, Gripp S, Ickstadt K, Ruczinski I, Schwender H. Identification of interactions of binary variables associated with survival time using survivalFS. Arch Toxicol 2019; 93:585-602. [PMID: 30694373 DOI: 10.1007/s00204-019-02398-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2018] [Accepted: 01/16/2019] [Indexed: 12/01/2022]
Abstract
Many medical studies aim to identify factors associated with a time to an event such as survival time or time to relapse. Often, in particular, when binary variables are considered in such studies, interactions of these variables might be the actual relevant factors for predicting, e.g., the time to recurrence of a disease. Testing all possible interactions is often not possible, so that procedures such as logic regression are required that avoid such an exhaustive search. In this article, we present an ensemble method based on logic regression that can cope with the instability of the regression models generated by logic regression. This procedure called survivalFS also provides measures for quantifying the importance of the interactions forming the logic regression models on the time to an event and for the assessment of the individual variables that take the multivariate data structure into account. In this context, we introduce a new performance measure, which is an adaptation of Harrell's concordance index. The performance of survivalFS and the proposed importance measures is evaluated in a simulation study as well as in an application to genotype data from a urinary bladder cancer study. Furthermore, we compare the performance of survivalFS and its importance measures for the individual variables with the variable importance measure used in random survival forests, a popular procedure for the analysis of survival data. These applications show that survivalFS is able to identify interactions associated with time to an event and to outperform random survival forests.
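For readers unfamiliar with the concordance index mentioned in this abstract, the following is a minimal sketch of the standard Harrell's C for right-censored data. It is not the adapted measure proposed by the authors, which the abstract does not specify. The function name `harrell_c_index` and its arguments are hypothetical, and pairs with tied observed times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Standard Harrell's C (illustrative sketch, not the survivalFS variant).

    times       -- observed times (event or censoring)
    events      -- 1/True if the event was observed, 0/False if censored
    risk_scores -- higher score means higher predicted risk (earlier event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that index a has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Assumption: skip tied times; skip pairs whose earlier time is censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier failure had the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(harrell_c_index(times, events, [0.9, 0.7, 0.4, 0.1]))
```

A value of 1 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0 means every comparable pair is ranked in the wrong order.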
Affiliation(s)
- Tobias Tietz
- Mathematical Institute, Heinrich Heine University Düsseldorf, 40225, Düsseldorf, Germany
| | - Silvia Selinski
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
| | - Klaus Golka
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
| | - Jan G Hengstler
- Leibniz Research Centre for Working Environment and Human Factors, TU Dortmund University, IfADo, 44139, Dortmund, Germany
| | - Stephan Gripp
- Department of Radiation Oncology, Heinrich Heine University Hospital, 44225, Düsseldorf, Germany
| | - Katja Ickstadt
- Faculty of Statistics, TU Dortmund University, 44221, Dortmund, Germany
| | - Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA
| | - Holger Schwender
- Mathematical Institute, Heinrich Heine University Düsseldorf, 40225, Düsseldorf, Germany.
|
45
|
Dorani F, Hu T, Woods MO, Zhai G. Ensemble learning for detecting gene-gene interactions in colorectal cancer. PeerJ 2018; 6:e5854. [PMID: 30397551 PMCID: PMC6211269 DOI: 10.7717/peerj.5854] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/05/2018] [Accepted: 09/28/2018] [Indexed: 11/20/2022] Open
Abstract
Colorectal cancer (CRC) has a high incidence rate in both men and women and affects millions of people every year. Genome-wide association studies (GWAS) on CRC have successfully revealed common single-nucleotide polymorphisms (SNPs) associated with CRC risk. However, they can only explain a very limited fraction of the disease heritability. One reason may be the common uni-variable analyses in GWAS where genetic variants are examined one at a time. Given the complexity of cancers, the non-additive interaction effects among multiple genetic variants have the potential to explain the missing heritability. In this study, we employed two powerful ensemble learning algorithms, random forests and gradient boosting machine (GBM), to search for SNPs that contribute to the disease risk through non-additive gene-gene interactions. We were able to find 44 possible susceptibility SNPs that were ranked most significant by both algorithms. Out of those 44 SNPs, 29 are in coding regions. The 29 genes include ARRDC5, DCC, ALK, and ITGA1, which have been found previously associated with CRC, and E2F3 and NID2, which are potentially related to CRC since they have known associations with other types of cancer. We performed pairwise and three-way interaction analysis on the 44 SNPs using information theoretical techniques and found 17 pairwise (p < 0.02) and 16 three-way (p ≤ 0.001) interactions among them. Moreover, functional enrichment analysis suggested 16 functional terms or biological pathways that may help us better understand the etiology of the disease.
Affiliation(s)
- Faramarz Dorani
- Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
| | - Ting Hu
- Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
| | - Michael O Woods
- Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
| | - Guangju Zhai
- Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
|
46
|
Wolf BJ, Ramos PS, Hyer JM, Ramakrishnan V, Gilkeson GS, Hardiman G, Nietert PJ, Kamen DL. An Analytic Approach Using Candidate Gene Selection and Logic Forest to Identify Gene by Environment Interactions (G × E) for Systemic Lupus Erythematosus in African Americans. Genes (Basel) 2018; 9:genes9100496. [PMID: 30326636 PMCID: PMC6211136 DOI: 10.3390/genes9100496] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/31/2018] [Revised: 09/27/2018] [Accepted: 10/03/2018] [Indexed: 12/17/2022] Open
Abstract
Development and progression of many human diseases, such as systemic lupus erythematosus (SLE), are hypothesized to result from interactions between genetic and environmental factors. Current approaches to identify and evaluate interactions are limited, most often focusing on main effects and two-way interactions. While higher order interactions associated with disease are documented, they are difficult to detect since expanding the search space to all possible interactions of p predictors means evaluating 2^p − 1 terms. For example, data with 150 candidate predictors requires considering over 10^45 main effects and interactions. In this study, we present an analytical approach involving selection of candidate single nucleotide polymorphisms (SNPs) and environmental and/or clinical factors and use of Logic Forest to identify predictors of disease, including higher order interactions, followed by confirmation of the association between those predictors and interactions identified with disease outcome using logistic regression. We applied this approach to a study investigating whether smoking and/or secondhand smoke exposure interacts with candidate SNPs resulting in elevated risk of SLE. The approach identified both genetic and environmental risk factors, with evidence suggesting potential interactions between exposure to secondhand smoke as a child and genetic variation in the ITGAM gene associated with increased risk of SLE.
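The size of the search space quoted in this abstract can be checked with a short calculation. The sketch below counts the non-empty subsets of p predictors (main effects plus interactions of every order), which is 2^p − 1; the function name `interaction_terms` is a hypothetical helper introduced only for illustration.

```python
def interaction_terms(p: int) -> int:
    """Number of non-empty subsets of p predictors.

    Each subset of size 1 is a main effect; each larger subset is an
    interaction of that order. The total is 2**p - 1.
    """
    return 2 ** p - 1


# Small case that can be verified by hand: 3 predictors give
# 3 main effects + 3 two-way + 1 three-way = 7 terms.
print(interaction_terms(3))

# The case from the abstract: 150 candidate predictors.
n_terms = interaction_terms(150)
print(f"{n_terms:.3e}")
print(n_terms > 10 ** 45)
```

Python integers have arbitrary precision, so the count for 150 predictors is exact and can be compared directly against 10^45.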
Affiliation(s)
- Bethany J Wolf
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Paula S Ramos
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
- Division of Rheumatology and Immunology, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - J Madison Hyer
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Viswanathan Ramakrishnan
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Gary S Gilkeson
- Division of Rheumatology and Immunology, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Gary Hardiman
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
- Center for Genomic Medicine, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
- Division of Nephrology, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
- School of Biological Sciences & Institute for Global Food Security, Queens University Belfast, Stranmillis Road, Belfast BT9 5AG, UK.
| | - Paul J Nietert
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA.
| | - Diane L Kamen
- Division of Rheumatology and Immunology, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA.
|
47
|
Ascorbic acid metabolites are involved in intraocular pressure control in the general population. Redox Biol 2018; 20:349-353. [PMID: 30391827 PMCID: PMC6223183 DOI: 10.1016/j.redox.2018.10.004] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 08/01/2018] [Revised: 09/07/2018] [Accepted: 10/08/2018] [Indexed: 02/07/2023] Open
Abstract
Elevated intraocular pressure (IOP) is an important risk factor for glaucoma. Mechanisms involved in its homeostasis are not well understood, but associations between metabolic factors and IOP have been reported. To investigate the relationship between levels of circulating metabolites and IOP, we performed a metabolome-wide association using a machine learning algorithm, and then employed Mendelian Randomization models to further explore the strength and directionality of effect of the metabolites on IOP. We show that O-methylascorbate, a circulating Vitamin C metabolite, has a significant IOP-lowering effect, consistent with previous knowledge of the anti-hypertensive and anti-oxidative role of ascorbate compounds. These results enhance understanding of IOP control and may potentially benefit future IOP treatment and reduce vision loss from glaucoma. Vitamin C and its metabolites are highly concentrated in human brain and eye tissues. Multi-omics analysis finds evidence for association between ascorbic acid metabolism and intraocular pressure. O-methylascorbate lowers intraocular pressure in the general population. O-methylascorbate's role in intraocular pressure regulation is likely mediated by its anti-photooxidative properties.
|
48
|
Darst BF, Malecki KC, Engelman CD. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 2018; 19:65. [PMID: 30255764 PMCID: PMC6157185 DOI: 10.1186/s12863-018-0633-8] [Citation(s) in RCA: 145] [Impact Index Per Article: 20.7] [Indexed: 11/10/2022] Open
Abstract
Background Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets. Results We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables. Conclusions Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
Affiliation(s)
- Burcu F Darst
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA
| | - Kristen C Malecki
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA
| | - Corinne D Engelman
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA.
|
49
|
Gonda X, Petschner P, Eszlari N, Baksa D, Edes A, Antal P, Juhasz G, Bagdy G. Genetic variants in major depressive disorder: From pathophysiology to therapy. Pharmacol Ther 2018; 194:22-43. [PMID: 30189291 DOI: 10.1016/j.pharmthera.2018.09.002] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Indexed: 12/11/2022]
Abstract
In spite of promising preclinical results there is a decreasing number of new registered medications in major depression. The main reason behind this fact is the lack of confirmation in clinical studies for the assumed, and in animals confirmed, therapeutic results. This suggests low predictive value of animal studies for central nervous system disorders. One solution for identifying new possible targets is the application of genetics and genomics, which may pinpoint new targets based on the effect of genetic variants in humans. The present review summarizes such research focusing on depression and its therapy. The inconsistency between most genetic studies in depression suggests, first of all, a significant role of environmental stress. Furthermore, effect of individual genes and polymorphisms is weak, therefore gene x gene interactions or complete biochemical pathways should be analyzed. Even genes encoding target proteins of currently used antidepressants remain non-significant in genome-wide case control investigations suggesting no main effect in depression, but rather an interaction with stress. The few significant genes in GWASs are related to neurogenesis, neuronal synapse, cell contact and DNA transcription and as being nonspecific for depression are difficult to harvest pharmacologically. Most candidate genes in replicable gene x environment interactions, on the other hand, are connected to the regulation of stress and the HPA axis and thus could serve as drug targets for depression subgroups characterized by stress-sensitivity and anxiety while other risk polymorphisms such as those related to prominent cognitive symptoms in depression may help to identify additional subgroups and their distinct treatment. Until these new targets find their way into therapy, the optimization of current medications can be approached by pharmacogenomics, where metabolizing enzyme polymorphisms remain prominent determinants of therapeutic success.
Affiliation(s)
- Xenia Gonda
- Department of Psychiatry and Psychotherapy, Kutvolgyi Clinical Centre, Semmelweis University, Budapest, Hungary; NAP-2-SE New Antidepressant Target Research Group, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary; MTA-SE Neuropsychopharmacology and Neurochemistry Research Group, Hungarian Academy of Sciences, Semmelweis University, Budapest, Hungary.
| | - Peter Petschner
- MTA-SE Neuropsychopharmacology and Neurochemistry Research Group, Hungarian Academy of Sciences, Semmelweis University, Budapest, Hungary; Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary
| | - Nora Eszlari
- NAP-2-SE New Antidepressant Target Research Group, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary; Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary
| | - Daniel Baksa
- Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary; SE-NAP 2 Genetic Brain Imaging Migraine Research Group, Hungarian Academy of Sciences, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary
| | - Andrea Edes
- Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary; SE-NAP 2 Genetic Brain Imaging Migraine Research Group, Hungarian Academy of Sciences, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary
| | - Peter Antal
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary
| | - Gabriella Juhasz
- Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary; SE-NAP 2 Genetic Brain Imaging Migraine Research Group, Hungarian Academy of Sciences, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary; Neuroscience and Psychiatry Unit, University of Manchester, Manchester Academic Health Sciences Centre, Manchester, UK
| | - Gyorgy Bagdy
- NAP-2-SE New Antidepressant Target Research Group, Hungarian Brain Research Program, Semmelweis University, Budapest, Hungary; MTA-SE Neuropsychopharmacology and Neurochemistry Research Group, Hungarian Academy of Sciences, Semmelweis University, Budapest, Hungary; Department of Pharmacodynamics, Faculty of Pharmacy, Semmelweis University, Budapest, Hungary.
|
50
|
Forester BR, Lasky JR, Wagner HH, Urban DL. Comparing methods for detecting multilocus adaptation with multivariate genotype-environment associations. Mol Ecol 2018; 27:2215-2233. [DOI: 10.1111/mec.14584] [Citation(s) in RCA: 267] [Impact Index Per Article: 38.1] [Received: 03/18/2017] [Revised: 03/16/2018] [Accepted: 03/19/2018] [Indexed: 12/18/2022]
Affiliation(s)
- Brenna R. Forester
- Nicholas School of the Environment; Duke University; Durham North Carolina
| | - Jesse R. Lasky
- Department of Biology; Pennsylvania State University; University Park Pennsylvania
| | - Helene H. Wagner
- Department of Biology; University of Toronto Mississauga; Mississauga ON Canada
| | - Dean L. Urban
- Nicholas School of the Environment; Duke University; Durham North Carolina
|