1
|
Deonarine A, Batwara A, Wada R, Sharma P, Loscalzo J, Ojikutu B, Hall K. De Novo exposomic geospatial assembly of chronic disease regions with machine learning & network analysis. EBioMedicine 2025; 112:105575. [PMID: 39891994 PMCID: PMC11833148 DOI: 10.1016/j.ebiom.2025.105575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 12/26/2024] [Accepted: 01/16/2025] [Indexed: 02/03/2025] Open
Abstract
BACKGROUND Determining spatial relationships between diseases and the exposome is limited by available methodologies. aPEER (algorithm for Projection of Exposome and Epidemiological Relationships) uses machine learning (ML) and network analysis to find spatial relationships between diseases and the exposome in the United States. METHODS Using aPEER we examined the relationship between 12 chronic diseases and 186 pollutants. PCA, K-means clustering, and map projection produced clusters of counties derived from pollutants, and the Jaccard correlation between these clusters and chronic disease geography (defined as groups of counties with high chronic disease prevalence rates) was calculated. Disease-pollution correlation matrices were used together with network analysis to identify the strongest disease-pollution relationships. Results were compared to LISA, Moran's I, univariate, elastic net, and random forest regression. FINDINGS aPEER produced 68,820 human-interpretable maps with distinct pollution-derived regions, and acetaldehyde/benzo(a)pyrene was found to be strongly associated with hypertension (J = 0.5316, p = 3.89 × 10⁻²⁰⁸), stroke (J = 0.4517, p = 1.15 × 10⁻¹²⁷), and diabetes mellitus (J = 0.4425, p = 2.34 × 10⁻¹²⁷); formaldehyde/glycol ethers with COPD (J = 0.4545, p = 8.27 × 10⁻¹³¹); and acetaldehyde/formaldehyde with stroke mortality (J = 0.4445, p = 4.28 × 10⁻¹²⁵). Methanol, acetaldehyde, and formaldehyde formed distinct regions in the southeast United States (which correlated with both the Stroke and Diabetes Belts) that were strongly associated with multiple chronic diseases. Pollutants predicted chronic disease geography with similar or superior areas under the curve compared to SDOH and preventive healthcare models (determined with random forest and elastic net methods). Conventional geospatial analysis methods did not identify these geospatial relationships, highlighting aPEER's utility.
INTERPRETATION aPEER identified a pollution-defined geographical region associated with chronic disease, highlighting the role of aPEER in epidemiological and geospatial analysis, and exposomics in understanding chronic disease geography. FUNDING This work was primarily funded by the BPHC, NHLBI (R03 HL157890) and the CDC, and this work was funded in part by grants from the NIH (U01 HG007691, R01 HL155107, and HL166137), the American Heart Association (AHA24MERIT1185447), and the EU (HorizonHealth 2021 101057619) to JL.
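The J values reported above compare a pollutant-derived cluster of counties with a set of high-prevalence counties. The abstract does not spell out the computation, so the following is only a minimal sketch that assumes the standard Jaccard index, |A ∩ B| / |A ∪ B|. The function name and the county codes are hypothetical and chosen purely for illustration.

```python
def jaccard(a, b):
    """Standard Jaccard index between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the overlap as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical county identifiers (not taken from the study).
pollutant_cluster = {"01001", "01003", "13001", "28001"}
high_prevalence = {"01003", "13001", "28001", "45001", "47001"}

# 3 shared counties out of 6 distinct counties overall.
j = jaccard(pollutant_cluster, high_prevalence)
print(f"J = {j:.4f}")
```

A value of 1 would mean the two county sets coincide exactly, and 0 would mean they share no counties.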
Affiliation(s)
- Andrew Deonarine
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA; School of Population and Public Health, University of British Columbia, 2206 East Mall, Vancouver, BC V6T 1Z3, Canada; Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA.
- Ayushi Batwara
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA; University of California, Berkeley, 110 Sproul Hall #5800, Berkeley, CA 94720-5800, USA
- Roy Wada
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA
- Puneet Sharma
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA
- Joseph Loscalzo
  - Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Room 630M, Boston, MA 02115, USA; Brigham and Women's Hospital, Department of Medicine, 75 Francis Street, Boston, MA 02115, USA
- Bisola Ojikutu
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA; Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Room 630M, Boston, MA 02115, USA; Brigham and Women's Hospital, Department of Medicine, 75 Francis Street, Boston, MA 02115, USA
- Kathryn Hall
  - Boston Public Health Commission, 1010 Massachusetts Avenue, 6th Floor, Boston, MA 02118, USA; Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Room 630M, Boston, MA 02115, USA; Brigham and Women's Hospital, Department of Medicine, 75 Francis Street, Boston, MA 02115, USA; New York Academy of Medicine, 1216 5th Ave, New York, NY 10029, USA
|
2
|
Li L, Hu L, Ji J, Mckendrick K, Moreno J, Kelley AS, Mazumdar M, Aldridge M. Determinants of Total End-of-Life Health Care Costs of Medicare Beneficiaries: A Quantile Regression Forests Analysis. J Gerontol A Biol Sci Med Sci 2022; 77:1065-1071. [PMID: 34153101 PMCID: PMC9071433 DOI: 10.1093/gerona/glab176] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/13/2021] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND To identify and rank the importance of key determinants of end-of-life (EOL) health care costs, and to understand how the key factors impact different percentiles of the distribution of health care costs. METHOD We applied a principled, machine learning-based variable selection algorithm, using Quantile Regression Forests, to identify key determinants for predicting the 10th (low), 50th (median), and 90th (high) quantiles of EOL health care costs, including costs paid for by Medicare, Medicaid, Medicare Health Maintenance Organizations (HMOs), private HMOs, and patient's out-of-pocket expenditures. RESULTS Our sample included 7 539 Medicare beneficiaries who died between 2002 and 2017. The 10th, 50th, and 90th quantiles of EOL health care cost are $5 244, $35 466, and $87 241, respectively. Regional characteristics, specifically, the EOL-Expenditure Index, a measure for regional variation in Medicare spending driven by physician practice, and the number of total specialists in the hospital referral region were the top 2 influential determinants for predicting the 50th and 90th quantiles of EOL costs but were not determinants of the 10th quantile. Black race and Hispanic ethnicity were associated with lower EOL health care costs among decedents with lower total EOL health care costs but were associated with higher costs among decedents with the highest total EOL health care costs. CONCLUSIONS Factors associated with EOL health care costs varied across different percentiles of the cost distribution. Regional characteristics and decedent race/ethnicity exemplified factors that did not impact EOL costs uniformly across its distribution, suggesting the need to use a "higher-resolution" analysis for examining the association between risk factors and health care costs.
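To see why a factor can relate differently to the low and high ends of a cost distribution, it helps to compare groups quantile by quantile rather than by a single average. The sketch below is not Quantile Regression Forests and does not use the study's data; it only computes nearest-rank empirical percentiles for two invented cost samples (in thousands of dollars), and every name and number in it is a hypothetical illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank empirical percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented cost samples, in thousands of dollars (not study data).
group_a = [2, 3, 4, 30, 35, 40, 45, 90, 110, 130]
group_b = [6, 8, 10, 30, 35, 40, 45, 70, 80, 85]

for p in (10, 50, 90):
    a, b = percentile(group_a, p), percentile(group_b, p)
    print(f"{p}th percentile: group A = {a}, group B = {b}")
```

In this toy data, group A sits below group B at the 10th percentile, matches it at the median, and exceeds it at the 90th percentile, a pattern that a comparison of medians alone would miss.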
Affiliation(s)
- Lihua Li
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Institute for Healthcare Delivery Science, Mount Sinai Health System, New York, New York, USA
  - Tisch Cancer Institute, New York, New York, USA
- Liangyuan Hu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Institute for Healthcare Delivery Science, Mount Sinai Health System, New York, New York, USA
  - Tisch Cancer Institute, New York, New York, USA
- Jiayi Ji
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Institute for Healthcare Delivery Science, Mount Sinai Health System, New York, New York, USA
  - Tisch Cancer Institute, New York, New York, USA
- Karen Mckendrick
  - Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Jaison Moreno
  - Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Amy S Kelley
  - Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Madhu Mazumdar
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Institute for Healthcare Delivery Science, Mount Sinai Health System, New York, New York, USA
  - Tisch Cancer Institute, New York, New York, USA
- Melissa Aldridge
  - Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
|
3
|
Hu L, Joyce Lin JY, Ji J. Variable selection with missing data in both covariates and outcomes: Imputation and machine learning. Stat Methods Med Res 2021; 30:2651-2671. [PMID: 34696650 PMCID: PMC11181487 DOI: 10.1177/09622802211046385] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic. Parametric regression models are susceptible to misspecification, and as a result are sub-optimal for variable selection. Flexible machine learning methods mitigate the reliance on parametric assumptions, but do not provide a variable importance measure as naturally defined as the covariate effect native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning models and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach when combined with four tree-based machine learning methods (extreme gradient boosting, random forests, Bayesian additive regression trees, and conditional random forests) and two commonly used parametric methods (lasso and backward stepwise selection). Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings. There is no significant difference in the variable selection performance due to imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
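The two performance measures named above can be made concrete with a small sketch. The abstract does not give the exact definitions used in the simulations, so this assumes the common ones: the F1 score is the harmonic mean of precision and recall over the selected variables, and the Type I error is the fraction of truly irrelevant (noise) variables that were selected. The function name and variable labels are hypothetical.

```python
def selection_metrics(selected, truly_relevant, all_candidates):
    """Return (F1 score, Type I error) for one variable-selection result."""
    selected = set(selected)
    truly_relevant = set(truly_relevant)
    noise = set(all_candidates) - truly_relevant

    true_pos = len(selected & truly_relevant)   # relevant variables that were kept
    false_pos = len(selected & noise)           # noise variables that were kept

    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(truly_relevant) if truly_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Assumed definition: share of noise variables wrongly selected.
    type_i = false_pos / len(noise) if noise else 0.0
    return f1, type_i


# Hypothetical simulation setting: 10 candidates, 4 of them truly relevant.
candidates = [f"x{i}" for i in range(1, 11)]
relevant = ["x1", "x2", "x3", "x4"]
chosen = ["x1", "x2", "x3", "x7"]   # one relevant variable missed, one noise variable kept

f1, type_i = selection_metrics(chosen, relevant, candidates)
print(f"F1 = {f1:.3f}, Type I error = {type_i:.3f}")
```

In a simulation study, these two numbers would typically be averaged over many replicated data sets for each method being compared.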
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University School of Public Health, USA
- Jung-Yi Joyce Lin
  - Department of Population Health Science & Policy, Icahn School of Medicine at Mount Sinai, USA
- Jiayi Ji
  - Department of Population Health Science & Policy, Icahn School of Medicine at Mount Sinai, USA
|
4
|
Hu L, Lin JY, Sigel K, Kale M. Estimating heterogeneous survival treatment effects of lung cancer screening approaches: A causal machine learning analysis. Ann Epidemiol 2021; 62:36-42. [PMID: 34157399 PMCID: PMC8463451 DOI: 10.1016/j.annepidem.2021.06.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 04/08/2021] [Revised: 05/18/2021] [Accepted: 06/14/2021] [Indexed: 12/20/2022]
Abstract
The National Lung Screening Trial (NLST) found that low-dose computed tomography (LDCT) screening provided a lung cancer (LC) mortality benefit compared to chest radiography (CXR). Considerable research concerns identifying the differential treatment effects that may exist in certain subpopulations. We shed light on several important issues in existing research and highlight the need for further investigation of the heterogeneous comparative effect of LDCT versus CXR, using more flexible and rigorous statistical approaches. We used a high-performance Bayesian machine learning approach designed for censored survival data, the accelerated failure time Bayesian additive regression trees (AFT-BART) model, to flexibly capture the relationships between the failure time and predictors. We then used the counterfactual framework to draw Markov chain Monte Carlo samples of the individual treatment effect for each participant. Using these posterior samples, we explored the possible treatment effect heterogeneity via a stepwise binary tree approach. When re-analyzed with AFT-BART, LDCT did not have a statistically significant LC or overall mortality benefit compared to CXR. The Asian and Black (particularly those with pack-years ≥ 37 and without emphysema) NLST populations were shown to have a greater overall mortality benefit from LDCT than the population average. Although inconclusive for LC mortality benefit, Asians, Blacks and Whites with a history of chronic obstructive pulmonary disease showed a small trend towards benefit from LDCT. Causal inference with flexible machine learning modeling can provide valuable knowledge for informing treatment decisions and planning targeted clinical trials emphasizing personalized medicine approaches.
Affiliation(s)
- Liangyuan Hu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY; Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, NJ.
- Jung-Yi Lin
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY; Icahn School of Medicine at Mount Sinai, Institute for Health Care Delivery Science, New York, NY
- Keith Sigel
  - Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY
- Minal Kale
  - Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY
|
5
|
Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes. HEALTH SERVICES AND OUTCOMES RESEARCH METHODOLOGY 2021. [DOI: 10.1007/s10742-020-00234-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/03/2023]
|
6
|
Hu L, Li L, Ji J, Sanderson M. Identifying and understanding determinants of high healthcare costs for breast cancer: a quantile regression machine learning approach. BMC Health Serv Res 2020; 20:1066. [PMID: 33228683 PMCID: PMC7684910 DOI: 10.1186/s12913-020-05936-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/30/2020] [Accepted: 11/18/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND To identify and rank the importance of key determinants of high medical expenses among breast cancer patients and to understand the underlying effects of these determinants. METHODS Data from the Oncology Care Model (OCM), developed by the Center for Medicare & Medicaid Innovation, were used. The OCM data provided to Mount Sinai on 2938 breast-cancer episodes included both baseline periods and three performance periods between Jan 1, 2012 and Jan 1, 2018. We included 11 variables representing information on treatment, demographics and socioeconomic status, in addition to episode expenditures. OCM data were collected from participating practices and payers. We applied a principled variable selection algorithm using a flexible tree-based machine learning technique, Quantile Regression Forests. RESULTS We found that the use of chemotherapy drugs (versus hormonal therapy) and the interval of days without chemotherapy predominantly affected medical expenses among high-cost breast cancer patients. The second-tier major determinants were comorbidities and age. Receipt of surgery or radiation, geographically adjusted relative cost and insurance type were also identified as important high-cost drivers. These factors had disproportionately larger effects on the high-cost patients. CONCLUSIONS Data-driven machine learning methods provide insights into the underlying web of factors driving up the costs of breast cancer care management. Results from our study may help inform population health management initiatives and allow policymakers to develop tailored interventions to meet the needs of those high-cost patients and to avoid wasting scarce resources.
Affiliation(s)
- Liangyuan Hu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA.
- Lihua Li
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA
- Jiayi Ji
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA
- Mark Sanderson
  - Department of Health System Design and Global Health, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
|