1
Chakraborty AK, Wang H, Ramazi P. From Policy to Prediction: Assessing Forecasting Accuracy in an Integrated Framework with Machine Learning and Disease Models. J Comput Biol 2024. [PMID: 39092497 DOI: 10.1089/cmb.2023.0377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024] Open
Abstract
To improve the forecasting accuracy of the spread of infectious diseases, a hybrid model was recently introduced in which the commonly assumed constant disease transmission rate was instead estimated from data on enforced mitigation policies by a machine learning (ML) model and then fed to an extended susceptible-infected-recovered model to forecast the number of infected cases. Because that work tested only one ML model, the gradient boosting model (GBM), it left open whether other ML models would perform better. Here, we compared GBMs, linear regressions, k-nearest neighbors, and Bayesian networks (BNs) in forecasting the number of COVID-19-infected cases in the United States and Canadian provinces based on policy indices for the following 35 days. There was no significant difference in the mean absolute percentage errors of these ML models over the combined dataset [H(3) = 3.10, p = 0.38]. In two provinces, a significant difference was observed [H(3) = 8.77 and H(3) = 8.07, p < 0.05], yet post hoc tests revealed no significant difference in pairwise comparisons. Nevertheless, BNs significantly outperformed the other models on most of the training datasets. The results suggest that the ML models have equal forecasting power overall and that BNs are best for data-fitting applications.
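The bracketed statistics compare the mean absolute percentage errors (MAPEs) of four models. An H statistic with 3 degrees of freedom is consistent with a Kruskal-Wallis rank test across four groups, although the abstract does not name the test. The following is a minimal sketch under that assumption, in plain Python; the function names `mape` and `kruskal_wallis_h` are illustrative, and the statistic is computed without a tie correction.

```python
from itertools import chain


def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction).

    `groups` is a list of lists, one list of values (e.g. MAPEs) per model.
    The degrees of freedom are len(groups) - 1.
    """
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j

    # H = 12 / (n (n + 1)) * sum(R_g^2 / n_g) - 3 (n + 1)
    weighted = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)
```

With four groups of MAPE values, the returned H would be compared against a chi-squared distribution with 3 degrees of freedom to obtain an approximate p-value.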
Affiliation(s)
- Amit K Chakraborty
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
- Hao Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
- Pouria Ramazi
  - Department of Mathematics and Statistics, Brock University, St. Catharines, Canada
2
Emmons S, Woods T, Cashman M, Devereux O, Noe G, Young J, Stranko S, Kilian J, Hanna K, Maloney K. Causal inference approaches reveal both positive and negative unintended effects of agricultural and urban management practices on instream biological condition. Journal of Environmental Management 2024; 361:121234. [PMID: 38805958 DOI: 10.1016/j.jenvman.2024.121234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Revised: 05/16/2024] [Accepted: 05/23/2024] [Indexed: 05/30/2024]
Abstract
Agricultural and urban management practices (MPs) are primarily designed and implemented to reduce nutrient and sediment concentrations in streams. However, there is growing interest in determining if MPs produce any unintended positive effects, or co-benefits, to instream biological and habitat conditions. Identifying co-benefits is challenging though because of confounding variables (i.e., those that affect both where MPs are applied and stream biota), which can be accounted for in novel causal inference approaches. Here, we used two causal inference approaches, propensity score matching (PSM) and Bayesian network learning (BNL), to identify potential MP co-benefits in the Chesapeake Bay watershed portion of Maryland, USA. Specifically, we examined how MPs may modify instream conditions that impact fish and macroinvertebrate indices of biotic integrity (IBI) and functional and taxonomic endpoints. We found evidence of positive unintended effects of MPs for both benthic macroinvertebrates and fish indicated by higher IBI scores and specific endpoints like the number of scraper macroinvertebrate taxa and lithophilic spawning fish taxa in a subset of regions. However, our results also suggest MPs have negative unintended effects, especially on sensitive benthic macroinvertebrate taxa and key instream habitat and water quality metrics like specific conductivity. Overall, our results suggest MPs offer co-benefits in some regions and catchments with largely degraded conditions but can have negative unintended effects in some regions, especially in catchments with good biological conditions. We suggest the number and types of MPs drove these mixed results and highlight carefully designed MP implementation that incorporates instream biological data at the catchment scale could facilitate co-benefits to instream biological conditions. 
Our study underscores the need for more research on identifying effects of individual MP types on instream biological and habitat conditions.
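Propensity score matching, one of the two approaches named above, pairs each treated unit (here, a catchment with MPs) with a control unit that had a similar estimated probability of receiving treatment, so that outcomes are compared between otherwise similar units. The sketch below is a generic illustration and not the authors' procedure: it assumes the propensity scores have already been estimated, uses greedy 1:1 nearest-neighbor matching without replacement, and the function names and the `caliper` default are hypothetical.

```python
def match_nearest(treated, control, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on propensity score.

    `treated` and `control` are lists of (propensity_score, outcome) tuples.
    A control unit is used at most once; pairs farther apart than `caliper`
    on the propensity score are discarded.
    """
    available = list(control)
    pairs = []
    # Match the highest-scoring treated units first (one common heuristic).
    for unit in sorted(treated, reverse=True):
        if not available:
            break
        best = min(available, key=lambda c: abs(c[0] - unit[0]))
        if abs(best[0] - unit[0]) <= caliper:
            pairs.append((unit, best))
            available.remove(best)
    return pairs


def mean_matched_difference(pairs):
    """Average treated-minus-control outcome over the matched pairs."""
    if not pairs:
        raise ValueError("no matched pairs within the caliper")
    return sum(t[1] - c[1] for t, c in pairs) / len(pairs)
```

The outcome could be, for example, an IBI score; a positive mean difference would then point toward a co-benefit among the matched units, subject to the usual assumption that the confounders are captured by the propensity model.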
Affiliation(s)
- Sean Emmons
  - U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, WV, USA
- Taylor Woods
  - U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, WV, USA
- Matthew Cashman
  - U.S. Geological Survey, Maryland/Delaware/District of Columbia Water Science Center, Baltimore, MD, USA
- Greg Noe
  - U.S. Geological Survey, Florence Bascom Geoscience Center, Reston, VA, USA
- John Young
  - U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, WV, USA
- Scott Stranko
  - Maryland Department of Natural Resources, Annapolis, MD, USA
- Jay Kilian
  - Maryland Department of Natural Resources, Annapolis, MD, USA
- Katherine Hanna
  - Maryland Department of Natural Resources, Annapolis, MD, USA
- Kelly Maloney
  - U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, WV, USA
3
Ozminkowski S, Solís‐Lemus C. Identifying microbial drivers in biological phenotypes with a Bayesian network regression model. Ecol Evol 2024; 14:e11039. [PMID: 38774136 PMCID: PMC11106058 DOI: 10.1002/ece3.11039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 01/29/2024] [Accepted: 02/03/2024] [Indexed: 05/24/2024] Open
Abstract
In Bayesian Network Regression models, networks are considered the predictors of continuous responses. These models have been successfully used in brain research to identify regions in the brain that are associated with specific human traits, yet their potential to elucidate microbial drivers in biological phenotypes for microbiome research remains unknown. In particular, microbial networks are challenging due to their high dimension and high sparsity compared to brain networks. Furthermore, unlike in brain connectome research, in microbiome research, it is usually expected that the presence of microbes has an effect on the response (main effects), not just the interactions. Here, we develop the first thorough investigation of whether Bayesian Network Regression models are suitable for microbial datasets on a variety of synthetic and real data under diverse biological scenarios. We test whether the Bayesian Network Regression model that accounts only for interaction effects (edges in the network) is able to identify key drivers (microbes) in phenotypic variability. We show that this model is indeed able to identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings, but we also identify scenarios where this method performs poorly, which allows us to provide practical advice for domain scientists aiming to apply these tools to their datasets. Bayesian Network Regression models provide a framework for microbiome researchers to identify connections between microbes and measured phenotypes. We facilitate the use of this statistical model by providing an easy-to-use implementation in a publicly available Julia package at https://github.com/solislemuslab/BayesianNetworkRegression.jl.
Affiliation(s)
- Samuel Ozminkowski
  - Department of Statistics and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Claudia Solís‐Lemus
  - Department of Plant Pathology and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
4
Rowland FE, Kotalik CJ, Marcot BG, Hinck JE, Walters DM. A novel approach to assessing natural resource injury with Bayesian networks. Integrated Environmental Assessment and Management 2024; 20:562-573. [PMID: 37664978 DOI: 10.1002/ieam.4836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Revised: 08/16/2023] [Accepted: 08/25/2023] [Indexed: 09/05/2023]
Abstract
Quantifying the effects of environmental stressors on natural resources is problematic because of complex interactions among environmental factors that influence endpoints of interest. This complexity, coupled with data limitations, propagates uncertainty that can make it difficult to causally associate specific environmental stressors with injury endpoints. The Natural Resource Damage Assessment and Restoration (NRDAR) regulations under the Comprehensive Environmental Response, Compensation, and Liability Act and Oil Pollution Act aim to restore natural resources injured by oil spills and hazardous substances released into the environment; exploration of alternative statistical methods to evaluate effects could help address NRDAR legal claims. Bayesian networks (BNs) are statistical tools that can be used to estimate the influence and interrelatedness of abiotic and biotic environmental variables on environmental endpoints of interest. We investigated the application of a BN for injury assessment using a hypothetical case study by simulating data of acid mine drainage (AMD) affecting a fictional stream-dwelling bird species. We compared the BN-generated probability estimates for injury with a more traditional approach using toxicity thresholds for water and sediment chemistry. Bayesian networks offered several distinct advantages over traditional approaches, including formalizing the use of expert knowledge, probabilistic estimates of injury using intermediate direct and indirect effects, and the incorporation of a more nuanced and ecologically relevant representation of effects. Given the potential that BNs have for natural resource injury assessment, more research and field-based application are needed to determine their efficacy in NRDAR. 
We expect the resulting methods will be of interest to many US federal, state, and tribal programs devoted to the evaluation, mitigation, remediation, and/or restoration of natural resources injured by releases or spills of contaminants. Published 2023. This article is a U.S. Government work and is in the public domain in the USA. Integrated Environmental Assessment and Management published by Wiley Periodicals LLC on behalf of Society of Environmental Toxicology & Chemistry (SETAC).
Affiliation(s)
- Freya E Rowland
  - US Geological Survey, Columbia Environmental Research Center, Columbia, Missouri, USA
- Christopher J Kotalik
  - US Geological Survey, Columbia Environmental Research Center, Columbia, Missouri, USA
- Bruce G Marcot
  - US Forest Service, Pacific Northwest Research Station, Portland, Oregon, USA
- Jo Ellen Hinck
  - US Geological Survey Natural Resource Damage Assessment and Restoration and Disaster Supplemental Science Coordinator, Natural Hazards Mission Area, Reston, Virginia, USA
- David M Walters
  - US Geological Survey, Columbia Environmental Research Center, Columbia, Missouri, USA
5
Roohi AM, Nazif S, Ramazi P. Tackling data challenges in forecasting effluent characteristics of wastewater treatment plants. Journal of Environmental Management 2024; 354:120324. [PMID: 38364537 DOI: 10.1016/j.jenvman.2024.120324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 01/21/2024] [Accepted: 02/08/2024] [Indexed: 02/18/2024]
Abstract
In wastewater treatment plants (WWTPs), the stochastic nature of influent wastewater and operational and weather conditions cause fluctuations in effluent quality. Data-driven models can forecast effluent quality a few hours ahead as a response to the influent characteristics, providing enough time to adjust system operations and avoid undesired consequences. However, existing data for training models are often incomplete and contain missing values. On the other hand, collecting additional data by installing new sensors is costly. The trade-off between using existing incomplete data and collecting costly new data results in three data challenges faced when developing data-driven WWTP effluent forecasters. These challenges are to determine important variables to be measured, the minimum number of required data instances, and the maximum percentage of tolerable missing values that do not impede the development of an accurate model. As these issues are not discussed in previous studies, in this research, for the first time, a comprehensive analysis is done to provide answers to these challenges. Another issue that arises in all data-driven modeling is how to select an appropriate forecasting model. This paper addresses these issues by first testing nine machine learning models on data collected from three wastewater treatment plants located in Iran, Australia, and Spain. The most accurate forecaster, Bayesian network, was then used to address the articulated challenges. Key variables in forecasting effluent characteristics were flow rate, total suspended solids, electrical conductivity, phosphorus compounds, wastewater temperature, and air temperature. A minimum of 250 samples was needed during the model training to achieve a great reduction in the forecasting error. Moreover, a steep increase in the error was observed should the portion of missing values exceed 10%. 
The results assist plant managers in estimating the necessary data collection effort to obtain an accurate forecaster, contributing to the quality of the effluent.
Affiliation(s)
- Ali Mohammad Roohi
  - School of Civil Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Sara Nazif
  - School of Civil Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Pouria Ramazi
  - Department of Mathematics and Statistics, Brock University, St. Catharines, ON, L2S 3A1, Canada
6
Greco T, Poole EM, Young AC, Alexander JK. Application of the Bayesian network theory in clinical trial data: Severity shift in spasticity numeric rating scale in patients with multiple sclerosis. Mult Scler Relat Disord 2024; 83:105466. [PMID: 38310831 DOI: 10.1016/j.msard.2024.105466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 11/07/2023] [Accepted: 01/20/2024] [Indexed: 02/06/2024]
Abstract
BACKGROUND Data digitization expands data collection opportunities, representing both a chance to understand interrelationships between variables and a challenge to identify the most appropriate clinical factors. The application of causal inference techniques to clinical trial data is becoming very attractive, especially with the intent to provide insights into the relationships between baseline characteristics and outcomes. Graphical representations of model structures and conditional probabilities can be powerful tools to illustrate relationships in a high-dimensional data setting. METHODS We review and apply Bayesian network theory to a clinical case study, presenting an analytical approach to investigating and visualizing causal relationships. We propose the use of the adherence score to compare data networks' patterns based on different variables' discretization. Data from adult patients with spasticity related to multiple sclerosis (MSS) from two randomized placebo-controlled clinical trials of nabiximols were used as analysis sets. The training and validation sets included 106 (53 treated, 53 placebo) and 155 (76 treated, 79 placebo) participants, respectively. The primary objective was to create a network and estimate the causal dependencies between participants' characteristics, changes in MSS severity as reflected by shifts in the patient-reported numeric rating scale (NRS), and changes in symptoms, functional abilities, and quality of life factors. RESULTS A causal network was identified between the key factors of assigned treatment, end of study spasticity NRS, and mental health/vitality subscales of the 36-Item Short Form Health Survey questionnaire (4 nodes and 3 edges; adherence score = 93%). In patients with mild spasticity, the impact of nabiximols on mental health or vitality subscales resulted in a probability ratio of 1.63.
The decomposed mediation effect of spasticity NRS was observed through a mediation analysis between treatment and mental health (99.4%) or vitality (93.7%) subscales. CONCLUSIONS The use of innovative methods such as causal networks is highly encouraged to identify dependent relationships among key factors in clinical trial data and drive insights for additional research.
Affiliation(s)
- Teresa Greco
  - Jazz Pharmaceuticals, Inc., Gentium S.P.A., Piazza XX Settembre, 2, Villa Guardia 22079, Italy
- Elizabeth M Poole
  - Jazz Pharmaceuticals, Inc., 2005 Market Street, Suite 2100, Philadelphia, PA 19103, USA
- Amy C Young
  - Jazz Pharmaceuticals, Inc., 3170 Porter Drive, Palo Alto, CA 94304, USA
7
Sujani S, White RR, Firkins JL, Wenner BA. Network analysis to evaluate complexities in relationships among fermentation variables measured within continuous culture experiments. J Anim Sci 2023; 101:skad085. [PMID: 37078886 PMCID: PMC10158529 DOI: 10.1093/jas/skad085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2023] [Accepted: 04/17/2023] [Indexed: 04/21/2023] Open
Abstract
The objective of this study was to leverage frequentist (ELN) and Bayesian learning (BLN) network analyses to summarize quantitative associations among variables measured in 4 previously published dual-flow continuous culture fermentation experiments. Experiments were originally designed to evaluate effects of nitrate, defaunation, yeast, and/or physiological shifts associated with pH or solids passage rates on rumen conditions. Measurements from these experiments that were used as nodes within the networks included concentrations of individual volatile fatty acids (mM) and nitrate (NO3-, %); outflows of non-ammonia nitrogen (NAN, g/d), bacterial N (BN, g/d), residual N (RN, g/d), and ammonia N (NH3-N, mg/dL); degradability of neutral detergent fiber (NDFd, %) and degradability of organic matter (OMd, %); dry matter intake (DMI, kg/d); urea in buffer (%); fluid passage rate (FF, L/d); total protozoa count (PZ, cells/mL); and methane production (CH4, mmol/d). A frequentist network (ELN), derived using a graphical LASSO (least absolute shrinkage and selection operator) technique with tuning parameters selected by Extended Bayesian Information Criteria (EBIC), and a BLN were constructed from these data. The illustrated associations in the ELN were unidirectional yet assisted in identifying prominent relationships within the rumen that were largely consistent with current understanding of fermentation mechanisms. Another advantage of the ELN approach was that it focused on understanding the role of individual nodes within the network. Such understanding may be critical in exploring candidates for biomarkers, indicator variables, model targets, or other measurement-focused explorations. As an example, acetate was highly central in the network, suggesting it may be a strong candidate as a rumen biomarker. Alternatively, the major advantage of the BLN was its unique ability to imply causal directionality in relationships.
Because the BLN identified directional, cascading relationships, this analytics approach was uniquely suited to exploring the edges within the network as a strategy to direct future work researching mechanisms of fermentation. For example, in the BLN acetate responded to treatment conditions such as the source of N used and the quantity of substrate provided, while acetate drove changes in the protozoal populations, non-NH3-N and residual N flows. In conclusion, the analyses exhibit complementary strengths in supporting inference on the connectedness and directionality of quantitative associations among fermentation variables that may be useful in driving future studies.
Affiliation(s)
- Sathya Sujani
  - School of Animal Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Robin R White
  - School of Animal Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Jeffrey L Firkins
  - Department of Animal Sciences, The Ohio State University, Columbus, OH 43210, USA
- Benjamin A Wenner
  - Department of Animal Sciences, The Ohio State University, Columbus, OH 43210, USA
8
Hagy JD, Kreakie BJ, Pelletier MC, Nojavan F, Kiddon JA, Oczkowski AJ. Quantifying coastal ecosystem trophic state at a macroscale using a Bayesian analytical framework. Ecological Indicators 2022; 142:1-12. [PMID: 36969322 PMCID: PMC10031516 DOI: 10.1016/j.ecolind.2022.109267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
One of the goals of coastal ecological research is to describe, quantify and predict human effects on coastal ecosystems. Broad cross-systems assessments to classify ecosystem status or condition have been developed, but are not updated frequently, likely because a lot of information and effort is needed to implement them. Such assessments could be more useful if the probability of being in a class indicating status or condition could be predicted using widely available data and information, providing a useful way to interpret changes in underlying predictors by considering their expected impact on ecosystem condition. To illustrate a possible approach, we used chlorophyll-a as an indicator of condition, in place of the intended comprehensive condition assessment. We demonstrated a predictive approach starting with a random forest model to inform variable selection, then used a Bayesian multilevel ordered categorical regression to quantify a coastal trophic state index and predict system status. We initially fit the model using non-informative priors to water quality data (total nitrogen and phosphorus, dissolved inorganic nitrogen and phosphorus, secchi depth) from 2010 and a regional factor. We then updated the model using prior distributions based on posterior parameter distributions from the initial fit and data from 2015. The Bayesian model demonstrates an intuitive way to update a model or analysis with new data while retaining the benefit of prior knowledge and maintaining flexibility to consider new kinds of information. To illustrate how the model could be used, we applied our developed trophic state index and classification to a time series of water quality data from Boston Harbor, a coastal ecosystem that has undergone significant changes in nutrient inputs. 
The analysis shows how water quality status and trends in Boston Harbor can be understood in the comparative ecological context provided by data from estuaries around the continental US and illustrates how the analytical approach could be used as an interpretive tool by non-practitioners of Bayesian statistics as well as a framework for further model development and analysis.
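The updating step described above, where posteriors from the 2010 fit become priors for the 2015 data, can be illustrated with a much simpler conjugate model. The sketch below swaps the paper's multilevel ordered categorical regression for a beta-binomial model, purely to show the mechanics; the counts are invented for illustration and the function name `update_beta` is hypothetical.

```python
def update_beta(prior, successes, failures):
    """Conjugate update of a Beta(a, b) prior with binomial data."""
    a, b = prior
    return (a + successes, b + failures)


def beta_mean(params):
    """Mean of a Beta(a, b) distribution."""
    a, b = params
    return a / (a + b)


# Flat Beta(1, 1) prior, standing in for a non-informative prior.
flat_prior = (1.0, 1.0)

# Hypothetical counts: sites classified as eutrophic vs. not, per survey year.
posterior_2010 = update_beta(flat_prior, successes=12, failures=28)

# The 2010 posterior becomes the prior for the 2015 data.
posterior_2015 = update_beta(posterior_2010, successes=18, failures=22)

# Fitting both surveys at once from the flat prior gives the same answer.
posterior_batch = update_beta(flat_prior, successes=30, failures=50)
```

In this conjugate case sequential and batch updating coincide exactly; with a sampled posterior, as in a multilevel regression, the new prior is typically an approximation to the earlier posterior, so the agreement would be approximate.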
Affiliation(s)
- James D Hagy
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
- Betty J Kreakie
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
- Marguerite C Pelletier
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
- Farnaz Nojavan
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
- John A Kiddon
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
- Autumn J Oczkowski
  - Atlantic Coastal Environmental Science Division, Center for Environmental Measurement and Modeling, Office of Research and Development, US Environmental Protection Agency. 27 Tarzwell Drive, Narragansett, RI 02882
9
Wang X, Wang H, Ramazi P, Nah K, Lewis M. A Hypothesis-Free Bridging of Disease Dynamics and Non-pharmaceutical Policies. Bull Math Biol 2022; 84:57. [PMID: 35394257 PMCID: PMC8991680 DOI: 10.1007/s11538-022-01012-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/26/2021] [Accepted: 03/08/2022] [Indexed: 11/22/2022]
Abstract
Accurate prediction of the number of daily or weekly confirmed cases of COVID-19 is critical to the control of the pandemic. Existing mechanistic models nicely capture the disease dynamics. However, to forecast the future, they require the transmission rate to be known, limiting their prediction power. Typically, a hypothesis is made on the form of the transmission rate with respect to time. Yet the real form is too complex to be mechanistically modeled due to the unknown dynamics of many influential factors. We tackle this problem by using a hypothesis-free machine-learning algorithm to estimate the transmission rate from data on non-pharmaceutical policies, and in turn forecast the confirmed cases using a mechanistic disease model. More specifically, we build a hybrid model consisting of a mechanistic ordinary differential equation (ODE) model and a gradient boosting model (GBM). To calibrate the parameters, we develop an "inverse method" that obtains the transmission rate inversely from the other variables in the ODE model and then feed it into the GBM to connect with the policy data. The resulting model forecasted the number of daily confirmed cases up to 35 days in the future in the USA with an averaged mean absolute percentage error of 27%. It can identify the most informative predictive variables, which can be helpful in designing improved forecasters as well as informing policymakers.
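The "inverse method" idea, recovering the transmission rate from the other model variables, can be sketched on a plain discrete-time SIR model. This is an assumption made for illustration: the abstract does not give the ODE system, and the paper's model is likely richer. Under the SIR update S[t+1] = S[t] - beta[t] S[t] I[t] / N, the rate follows as beta[t] = -(S[t+1] - S[t]) N / (S[t] I[t]). All names below are hypothetical.

```python
def simulate_sir(beta_series, gamma, s0, i0, n):
    """Forward-simulate a discrete-time SIR model with a time-varying beta."""
    s, i = [s0], [i0]
    for beta in beta_series:
        new_infections = beta * s[-1] * i[-1] / n
        recoveries = gamma * i[-1]
        s.append(s[-1] - new_infections)
        i.append(i[-1] + new_infections - recoveries)
    return s, i


def recover_beta(s, i, n):
    """Invert the S update to obtain beta[t] from the S and I trajectories."""
    return [
        -(s[t + 1] - s[t]) * n / (s[t] * i[t])
        for t in range(len(s) - 1)
    ]


# Round trip: a known beta series is simulated forward and then recovered.
true_beta = [0.30, 0.28, 0.25, 0.20, 0.22]
s_traj, i_traj = simulate_sir(true_beta, gamma=0.1, s0=9990.0, i0=10.0, n=10000.0)
estimated_beta = recover_beta(s_traj, i_traj, n=10000.0)
```

In a hybrid pipeline of the kind the abstract describes, a series like `estimated_beta` would serve as the regression target that a model such as a GBM learns to predict from policy data; that regression step is not shown here.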
Affiliation(s)
- Xiunan Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
  - Department of Mathematics, University of Tennessee at Chattanooga, Chattanooga, TN, 37403, USA
- Hao Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
- Pouria Ramazi
  - Department of Mathematics and Statistics, Brock University, St. Catharines, ON, L2S 3A1, Canada
- Kyeongah Nah
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
  - National Institute for Mathematical Sciences, Daejeon, 34047, Korea
- Mark Lewis
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
10
Ramazi P, Kunegel‐Lion M, Greiner R, Lewis MA. Predicting insect outbreaks using machine learning: A mountain pine beetle case study. Ecol Evol 2021; 11:13014-13028. [PMID: 34646449 PMCID: PMC8495826 DOI: 10.1002/ece3.7921] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/17/2020] [Revised: 06/29/2020] [Accepted: 07/20/2020] [Indexed: 11/09/2022] Open
Abstract
Planning forest management relies on predicting insect outbreaks such as mountain pine beetle, particularly in the intermediate-term future, e.g., 5 years ahead. Machine-learning algorithms are potential solutions to this challenging problem due to their many successes across a variety of prediction tasks. However, there are many subtle challenges in applying them: identifying the best learning models and the best subset of available covariates (including time lags) and properly evaluating the models to avoid misleading performance measures. We systematically address these issues in predicting the chance of a mountain pine beetle outbreak in the Cypress Hills area and seek models with the best performance at predicting future 1-, 3-, 5- and 7-year infestations. We train nine machine-learning models, including two generalized boosted regression trees (GBM) that predict future 1- and 3-year infestations with 92% and 88% AUC, and two novel mixed models that predict future 5- and 7-year infestations with 86% and 84% AUC, respectively. We also consider forming the train and test datasets by splitting the original dataset randomly rather than using the appropriate year-based approach and show that this may obtain models that score high on the test dataset but low in practice, resulting in inaccurate performance evaluations. For example, a k-nearest neighbor model with an actual performance of 68% AUC scores a misleadingly high 78% on a test dataset obtained from a random split, but a more accurate 66% on a year-based split. We then investigate how the prediction accuracy varies with respect to the provided history length of the covariates and find that neural network and naive Bayes predict more accurately as history length increases, particularly for future 1- and 3-year predictions, and roughly the same holds with GBM. Our approach is applicable to other invasive species.
The resulting predictors can be used in planning forest and pest management and planning sampling locations in field studies.
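The contrast between a random split and a year-based split can be made concrete with a small sketch. The code below is illustrative only and assumes each record is a dict with a `year` key; the function names and the example years are hypothetical, not taken from the study.

```python
import random


def year_based_split(rows, last_train_year):
    """Train on records up to `last_train_year`, test on strictly later years."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


def random_split(rows, n_test, seed=0):
    """Shuffle records and hold out `n_test` of them, ignoring time order."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical yearly records; the "infested" values are placeholders.
rows = [{"year": y, "infested": y % 2} for y in range(2006, 2016)]

train_by_year, test_by_year = year_based_split(rows, last_train_year=2012)
train_random, test_random = random_split(rows, n_test=3)
```

With the year-based split, every test record is later than every training record, which mirrors how a forecaster is used in practice. A random split can place later years in the training set and earlier years in the test set, which is one way the optimistic scores described above can arise.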
Affiliation(s)
- Pouria Ramazi
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
  - Department of Computing Science, University of Alberta, Edmonton, AB, Canada
- Russell Greiner
  - Department of Computing Science, University of Alberta, Edmonton, AB, Canada
  - Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Mark A. Lewis
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
11
Ramazi P, Haratian A, Meghdadi M, Mari Oriyad A, Lewis MA, Maleki Z, Vega R, Wang H, Wishart DS, Greiner R. Accurate long-range forecasting of COVID-19 mortality in the USA. Sci Rep 2021; 11:13822. [PMID: 34226584 PMCID: PMC8257700 DOI: 10.1038/s41598-021-91365-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/24/2021] [Accepted: 05/20/2021] [Indexed: 02/01/2023] Open
Abstract
The need for improved models that can accurately predict COVID-19 dynamics is vital to managing the pandemic and its consequences. We use machine learning techniques to design an adaptive learner that, based on epidemiological data available at any given time, produces a model that accurately forecasts the number of reported COVID-19 deaths and cases in the United States, up to 10 weeks into the future with a mean absolute percentage error of 9%. In addition to being the most accurate long-range COVID predictor so far developed, it captures the observed periodicity in daily reported numbers. Its effectiveness is based on three design features: (1) producing different model parameters to predict the number of COVID deaths (and cases) from each time and for a given number of weeks into the future, (2) systematically searching over the available covariates and their historical values to find an effective combination, and (3) training the model using "last-fold partitioning", where each proposed model is validated on only the last instance of the training dataset, rather than being cross-validated. Assessments against many other published COVID predictors show that this predictor is 19-48% more accurate.
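Design feature (3), "last-fold partitioning", validates each candidate on only the last instance of the training data. A minimal sketch of that selection rule follows; it is an interpretation of the one-sentence description above rather than the authors' code, and the two toy forecasters and all function names are hypothetical.

```python
def fit_mean(xs, ys):
    """Toy forecaster: always predicts the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_last_value(xs, ys):
    """Toy forecaster: always predicts the most recent training target."""
    last = ys[-1]
    return lambda x: last


def last_fold_select(xs, ys, fitters):
    """Fit each candidate on all but the last instance, score on the last one.

    `fitters` maps a name to a function (xs, ys) -> predictor.
    Returns the name and absolute error of the best candidate.
    """
    train_x, train_y = xs[:-1], ys[:-1]
    best_name, best_error = None, float("inf")
    for name, fit in fitters.items():
        predict = fit(train_x, train_y)
        error = abs(predict(xs[-1]) - ys[-1])
        if error < best_error:
            best_name, best_error = name, error
    return best_name, best_error


weeks = [1, 2, 3, 4, 5]
deaths = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up, steadily rising series
chosen, error = last_fold_select(
    weeks, deaths, {"mean": fit_mean, "last_value": fit_last_value}
)
```

Scoring on the most recent instance favors candidates that track current conditions, which is a plausible motivation for time series that drift; the trade-off is that a single validation point gives a noisy estimate of performance.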
Affiliation(s)
- Pouria Ramazi
  - Department of Mathematics and Statistics, Brock University, St. Catharines, ON, L2S 3A1, Canada
- Arezoo Haratian
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, 84156-83111, Isfahan, Iran
- Maryam Meghdadi
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, 84156-83111, Isfahan, Iran
- Arash Mari Oriyad
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, 84156-83111, Isfahan, Iran
- Mark A Lewis
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2E9, Canada
- Zeinab Maleki
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, 84156-83111, Isfahan, Iran
- Roberto Vega
  - Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
  - Alberta Machine Intelligence Institute, Edmonton, AB, T5J 3B1, Canada
- Hao Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
- David S Wishart
  - Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2E9, Canada
  - Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
- Russell Greiner
  - Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
  - Alberta Machine Intelligence Institute, Edmonton, AB, T5J 3B1, Canada