1
Shen Y, Domingo-Relloso A, Kupsco A, Kioumourtzoglou MA, Tellez-Plaza M, Umans JG, Fretts AM, Zhang Y, Schnatz PF, Casanova R, Martin LW, Horvath S, Manson JE, Cole SA, Wu H, Whitsel EA, Baccarelli AA, Navas-Acien A, Gao F. AESurv: autoencoder survival analysis for accurate early prediction of coronary heart disease. Brief Bioinform 2024; 25:bbae479. [PMID: 39323093 PMCID: PMC11424508 DOI: 10.1093/bib/bbae479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 08/17/2024] [Accepted: 09/12/2024] [Indexed: 09/27/2024]
Abstract
Coronary heart disease (CHD) is one of the leading causes of mortality and morbidity in the United States. Accurate time-to-event CHD prediction models with high-dimensional DNA methylation and clinical features may assist with early prediction and intervention strategies. We developed a state-of-the-art deep learning autoencoder survival analysis model (AESurv) to effectively analyze high-dimensional blood DNA methylation features and traditional clinical risk factors by learning a low-dimensional representation of participants for time-to-event CHD prediction. We demonstrated the utility of our model in two cohort studies: the Strong Heart Study (SHS), a prospective cohort studying cardiovascular disease and its risk factors among American Indian adults, and the Women's Health Initiative (WHI), a prospective cohort study including randomized clinical trials and an observational study to improve postmenopausal women's health, with cardiovascular disease as one of its main focuses. Our AESurv model effectively learned participant representations in a low-dimensional latent space and achieved better model performance (concordance index [C index] of 0.864 ± 0.009 and time-to-event mean area under the receiver operating characteristic curve [AUROC] of 0.905 ± 0.009) than other survival analysis models (Cox proportional hazard, Cox proportional hazard deep neural network survival analysis, random survival forest, and gradient boosting survival analysis models) in the SHS. We further validated the AESurv model in the WHI, where it also achieved the best model performance. The AESurv model can be used for accurate CHD prediction and can assist health care professionals and patients in carrying out early intervention strategies. We suggest using the AESurv model for future time-to-event CHD prediction based on DNA methylation features.
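The concordance index reported in this abstract can be illustrated with a short sketch. This is the textbook Harrell's C index for right-censored data, not the authors' code; the abstract does not say which implementation was used, and the function name and toy inputs here are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C index: share of comparable pairs that are ranked correctly.

    times       -- observed follow-up times
    events      -- 1 if the event (e.g. CHD) was observed, 0 if censored
    risk_scores -- model output; a higher score should mean an earlier event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the subject
        # with the shorter time actually had the event (was not censored).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event has the higher risk: correct
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.864 should be read.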
Affiliation(s)
- Yike Shen
  - Department of Earth and Environmental Sciences, University of Texas at Arlington, 500 Yates Street, Arlington, TX, 76019, USA
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Arce Domingo-Relloso
  - Department of Chronic Diseases Epidemiology, National Center for Epidemiology, Carlos III Health Institute, C. de Melchor Fernández Almagro, 5, Fuencarral-El Pardo, 5, Madrid, 28029, Spain
  - Department of Statistics and Operations Research, University of Valencia, Carrer del Dr. Moliner, 50, Valencia, 46100, Spain
  - Department of Biostatistics, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Allison Kupsco
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Marianthi-Anna Kioumourtzoglou
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Maria Tellez-Plaza
  - Department of Chronic Diseases Epidemiology, National Center for Epidemiology, Carlos III Health Institute, C. de Melchor Fernández Almagro, 5, Fuencarral-El Pardo, 5, Madrid, 28029, Spain
- Jason G Umans
  - Department of Medicine, Georgetown-Howard Universities Center for Clinical and Translational Science, 4000 Reservoir Road NW, Washington, DC, 20007, USA
- Amanda M Fretts
  - Department of Epidemiology, University of Washington, 3980 15th Ave NE, Seattle, WA, 98195, USA
- Ying Zhang
  - Center for American Indian Health Research, Department of Biostatistics and Epidemiology, The University of Oklahoma Health Sciences Center, 801 N.E. 13th Street, Oklahoma City, OK, 73104, USA
- Peter F Schnatz
  - Department of OB/GYN and Internal Medicine, Reading Hospital/Tower Health & Drexel University, 301 S 7th Ave, West Reading, PA, 19611, USA
- Ramon Casanova
  - Department of Biostatistics and Data Science, Wake Forest University School of Medicine, 475 Vine St, Winston Salem, NC, 27101, USA
- Lisa Warsinger Martin
  - Department of Medicine, Division of Cardiology, George Washington University, 2300 Eye Street, NW, Washington, DC, 20037, USA
- Steve Horvath
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles (UCLA), 695 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
  - Altos Lab Inc, Granta Park, Little Abington, Cambridge, CB21 6GQ, United Kingdom
- JoAnn E Manson
  - Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 900 Commonwealth Ave, Boston, MA, 02215, USA
- Shelley A Cole
  - Population Health Program, Texas Biomedical Research Institute, 8715 W. Military Dr., San Antonio, TX, 78227, USA
- Haotian Wu
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Eric A Whitsel
  - Department of Epidemiology, Gillings School of Global Public Health and Department of Medicine, School of Medicine, University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, NC, 27599, USA
- Andrea A Baccarelli
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
  - Harvard T.H. Chan School of Public Health, Harvard University, 677 Huntington Avenue, Boston, MA, 02115, USA
- Ana Navas-Acien
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, 722 West 168th Street, New York, NY, 10032, USA
- Feng Gao
  - Department of Environmental Health Sciences, Fielding School of Public Health, University of California Los Angeles (UCLA), 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
  - Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California, Los Angeles (UCLA), 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
2
von Borries K, Holmquist H, Kosnik M, Beckwith KV, Jolliet O, Goodman JM, Fantke P. Potential for Machine Learning to Address Data Gaps in Human Toxicity and Ecotoxicity Characterization. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:18259-18270. [PMID: 37914529 PMCID: PMC10666540 DOI: 10.1021/acs.est.3c05300] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/14/2023] [Revised: 10/12/2023] [Accepted: 10/13/2023] [Indexed: 11/03/2023]
Abstract
Machine Learning (ML) is increasingly applied to fill data gaps in assessments to quantify impacts associated with chemical emissions and chemicals in products. However, the systematic application of ML-based approaches to fill chemical data gaps is still limited, and their potential for addressing a wide range of chemicals is unknown. We prioritized chemical-related parameters for chemical toxicity characterization to inform ML model development based on two criteria: (1) each parameter's relevance to robustly characterize chemical toxicity described by the uncertainty in characterization results attributable to each parameter and (2) the potential for ML-based approaches to predict parameter values for a wide range of chemicals described by the availability of chemicals with measured parameter data. We prioritized 13 out of 38 parameters for developing ML-based approaches, while flagging another nine with critical data gaps. For all prioritized parameters, we performed a chemical space analysis to assess further the potential for ML-based approaches to predict data for diverse chemicals considering the structural diversity of available measured data, showing that ML-based approaches can potentially predict 8-46% of marketed chemicals based on 1-10% with available measured data. Our results can systematically inform future ML model development efforts to address data gaps in chemical toxicity characterization.
Affiliation(s)
- Kerstin von Borries
  - Quantitative Sustainability Assessment, Department of Environmental and Resource Engineering, Technical University of Denmark, Bygningstorvet 115, 2800 Kgs. Lyngby, Denmark
- Hanna Holmquist
  - IVL Swedish Environmental Research Institute, Aschebergsgatan 44, 411 33 Göteborg, Sweden
- Marissa Kosnik
  - Quantitative Sustainability Assessment, Department of Environmental and Resource Engineering, Technical University of Denmark, Bygningstorvet 115, 2800 Kgs. Lyngby, Denmark
- Katie V. Beckwith
  - Centre for Molecular Informatics, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Olivier Jolliet
  - Quantitative Sustainability Assessment, Department of Environmental and Resource Engineering, Technical University of Denmark, Bygningstorvet 115, 2800 Kgs. Lyngby, Denmark
- Jonathan M. Goodman
  - Centre for Molecular Informatics, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Peter Fantke
  - Quantitative Sustainability Assessment, Department of Environmental and Resource Engineering, Technical University of Denmark, Bygningstorvet 115, 2800 Kgs. Lyngby, Denmark
3
Zhang B, Hou H, Liu L, Huang Z, Zhao L. Spatial prediction and influencing factors identification of potential toxic element contamination in soil of different karst landform regions using integration model. CHEMOSPHERE 2023; 327:138404. [PMID: 36931406 DOI: 10.1016/j.chemosphere.2023.138404] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/06/2023] [Revised: 03/05/2023] [Accepted: 03/13/2023] [Indexed: 06/18/2023]
Abstract
The prediction of contamination distribution of potentially toxic elements (PTEs) in soils of Guangxi province, China and the identification of their controlling factors pose great challenges due to diverse bedrock types, intense leaching and weathering, and discontinuous terrain distributions. Herein, we integrated random forest (RF) and empirical Bayesian kriging (EBK) to interpret and predict the complex PTE contamination distribution in three different karst landform regions (fenglin, fengcong, isolated peak plain) in Guangxi province. The modeling results were compared with the commonly used ordinary kriging and regression-kriging. Our RF-EBK model combines the advantages of the RF and EBK models to improve prediction accuracy and efficiency. The integrated RF-EBK model performed well for Cd and As concentrations, with R² of 0.89 and 0.83, respectively. The average RMSE and MAE of the integrated RF-EBK model decreased by 39% and 44%, respectively, relative to regression-kriging, which had the second-highest accuracy. Furthermore, the modeling results showed that approximately 41.96% and 18.96% of the total area of Guangxi province were classified as polluted or above (Igeo > 0) for Cd and As, respectively. Higher Cd concentrations were observed in the soil of the fenglin and fengcong regions than in the isolated peak plain region due to secondary enrichment and parent rock inheritance, while the As concentration exhibited no significant difference among the three regions. The modeling results indicated that the elevated Cd concentration might be associated with soil CaO concentration and an alkaline soil environment, whereas As concentration tended to increase with rising Fe₂O₃ concentrations in a weakly acidic soil environment. These results confirmed the applicability and effectiveness of the integrated model in predicting complex spatial patterns of soil PTEs and identifying their controlling factors.
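For readers unfamiliar with the accuracy measures quoted in this abstract, the following sketch computes RMSE, MAE, and R² from paired observed and predicted values, plus the percentage decrease of an error measure relative to a baseline model. These are the standard definitions; the function names and any input values are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE and R^2 for paired observed/predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


def relative_decrease(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

Under these definitions, a statement such as "RMSE decreased by 39% relative to regression-kriging" corresponds to `relative_decrease(rmse_regression_kriging, rmse_rf_ebk)` being about 39.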
Affiliation(s)
- Bolun Zhang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China; School of Chemical & Environmental Engineering, China University of Mining and Technology-Beijing, Beijing, 100083, China
- Hong Hou
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Lingling Liu
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Zhanbin Huang
  - School of Chemical & Environmental Engineering, China University of Mining and Technology-Beijing, Beijing, 100083, China
- Long Zhao
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
4
Zhang B, Hou H, Huang Z, Zhao L. Estimation of heavy metal soil contamination distribution, hazard probability, and population at risk by machine learning prediction modeling in Guangxi, China. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2023; 330:121607. [PMID: 37031848 DOI: 10.1016/j.envpol.2023.121607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Revised: 03/20/2023] [Accepted: 04/07/2023] [Indexed: 05/27/2023]
Abstract
Due to the superposition of diverse pollution sources, soil heavy metal concentrations have been found to exceed the recommended maximum permissible levels in many areas of Guangxi province, China. However, the contamination distribution, hazard probability, and population at risk of heavy metals across the entire Guangxi province remain largely unclear. In this study, machine learning prediction models with different standard risk values determined according to land use types were used to identify high-risk areas and estimate populations at risk of Cr and Ni based on 658 topsoil samples from Guangxi province, China. Our results showed that soil Cr and Ni contamination derived from carbonate rocks was relatively serious in Guangxi province, and that their co-enrichment during soil formation was associated with Fe and Mn oxides and an alkaline soil environment. Our established model exhibited excellent performance in predicting contamination distribution (R² > 0.85) and hazard probability (AUC > 0.85). Pollution of Cr and Ni decreased gradually from the central-west areas to the surrounding areas, with the polluted area (Igeo > 0) of Cr and Ni accounting for approximately 24.46% and 29.24% of the total area of Guangxi province, respectively, but only 10.4% and 8.51% of the total area was classified as Cr and Ni high-risk regions. We estimated that approximately 1.44 and 1.47 million people were potentially exposed to the risk of Cr and Ni contamination, respectively, mainly in Nanning, Laibin, and Guigang. These are the main heavily populated agricultural regions in Guangxi, and thus heavy metal contamination localization and risk control in these regions are urgent and essential from the perspective of food safety.
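The Igeo > 0 threshold used in this abstract refers to the geoaccumulation index. The sketch below uses the commonly cited definition, Igeo = log2(C / (1.5 × B)), where C is the measured concentration and B the geochemical background value; the abstract does not state the background values used, so all numbers passed to these functions are hypothetical.

```python
import math


def geoaccumulation_index(concentration, background, factor=1.5):
    """Igeo = log2(C / (factor * B)); factor 1.5 is the conventional
    allowance for natural variation in the background value."""
    if concentration <= 0 or background <= 0:
        raise ValueError("concentration and background must be positive")
    return math.log2(concentration / (factor * background))


def polluted_fraction(concentrations, background):
    """Share of samples (or grid cells) with Igeo > 0."""
    flags = [geoaccumulation_index(c, background) > 0 for c in concentrations]
    return sum(flags) / len(flags)
```

With this definition a location is counted as polluted once its concentration exceeds 1.5 times the background, so a figure such as 24.46% of the total area is the share of predicted grid cells above that threshold.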
Affiliation(s)
- Bolun Zhang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China; School of Chemical & Environmental Engineering, China University of Mining and Technology-Beijing, Beijing, 100083, China
- Hong Hou
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Zhanbin Huang
  - School of Chemical & Environmental Engineering, China University of Mining and Technology-Beijing, Beijing, 100083, China
- Long Zhao
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
5
Shen Y, Zhao E, Zhang W, Baccarelli AA, Gao F. Predicting pesticide dissipation half-life intervals in plants with machine learning models. JOURNAL OF HAZARDOUS MATERIALS 2022; 436:129177. [PMID: 35643003 DOI: 10.1016/j.jhazmat.2022.129177] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/07/2022] [Revised: 05/04/2022] [Accepted: 05/15/2022] [Indexed: 06/15/2023]
Abstract
Pesticide dissipation half-life in plants is an important factor in assessing the environmental fate of pesticides and establishing pre-harvest intervals critical to good agricultural practices. However, empirically measured pesticide dissipation half-lives are highly variable, and accurate prediction with models is challenging. This study utilized a dataset of pesticide dissipation half-lives containing 1363 datapoints, 311 pesticides, 10 plant types, and 4 plant component classes. Novel dissipation half-life intervals were proposed and predicted to account for high variations in empirical data. Four machine learning models (i.e., gradient boosting regression tree [GBRT], random forest [RF], support vector classifier [SVC], and logistic regression [LR]) were developed to predict dissipation half-life intervals using extended connectivity fingerprints (ECFP), temperature, plant type, and plant component class as model inputs. GBRT-ECFP had the best model performance, with an F1-micro (binary) score of 0.698 ± 0.010 for the binary classification, compared with other machine learning models (e.g., LR-ECFP, F1-micro (binary) = 0.662 ± 0.009). Feature importance analysis of molecular structures in the binary classification identified aromatic rings, carbonyl groups, organophosphates, =C-H, and N-containing heterocyclic groups as important substructures related to pesticide dissipation half-lives. This study suggests the utility of machine learning models in assessing the environmental fate of pesticides in agricultural crops.
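The score reported in this abstract appears to be a micro-averaged F1 for the binary interval classification; the abstract does not spell out the computation, so the sketch below is an assumption that shows the standard micro-averaged F1, with a hypothetical function name and toy labels.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true positives, false positives and false
    negatives over all classes, then compute a single F1 from the totals."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1          # predicted this class and was right
            elif pred == label:
                fp += 1          # predicted this class but truth differs
            elif truth == label:
                fn += 1          # missed an instance of this class
    return 2 * tp / (2 * tp + fp + fn)
```

When every sample receives exactly one predicted class, as in a two-interval classification, micro-averaged F1 coincides with plain accuracy, which is a useful sanity check when comparing scores across models.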
Affiliation(s)
- Yike Shen
  - Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Ercheng Zhao
  - Institute of Plant Protection, Beijing Academy of Agricultural and Forestry Science, Beijing 100097, PR China
- Wei Zhang
  - Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, Michigan 48823, United States
- Andrea A Baccarelli
  - Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Feng Gao
  - Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States