1
Kolluru V, John R, Saraf S, Chen J, Hankerson B, Robinson S, Kussainova M, Jain K. Gridded livestock density database and spatial trends for Kazakhstan. Sci Data 2023; 10:839. [PMID: 38030700 PMCID: PMC10687097 DOI: 10.1038/s41597-023-02736-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 11/08/2023] [Indexed: 12/01/2023]
Abstract
Livestock rearing is a major source of livelihood for food and income in dryland Asia. Increasing livestock density (LSKD) affects ecosystem structure and function, amplifies the effects of climate change, and facilitates disease transmission. Significant knowledge and data gaps exist regarding livestock density, spatial distribution, and changes over time, and these have not been explored beyond the county level. This is especially true regarding the unavailability of high-resolution gridded livestock data. Hence, we developed a gridded LSKD database of horses and small ruminants (i.e., sheep & goats) at high resolution (1 km) for Kazakhstan (KZ) from 2000 to 2019 using vegetation proxies, climatic, socioeconomic, topographic, and proximity forcing variables through random forest (RF) regression modeling. We found high-density livestock hotspots in the south-central and southeastern regions, and medium-density clusters in the northern and northwestern regions of KZ. Interestingly, population density, proximity to settlements, nighttime lights, and temperature contributed to the efficient downscaling of district-level censuses to gridded estimates. This database will benefit stakeholders, the research community, land managers, and policymakers at regional and national levels.
Affiliation(s)
- Venkatesh Kolluru
  - Department of Sustainability and Environment, University of South Dakota, Vermillion, SD, 57069, USA
- Ranjeet John
  - Department of Sustainability and Environment, University of South Dakota, Vermillion, SD, 57069, USA
  - Department of Biology, University of South Dakota, Vermillion, SD, 57069, USA
- Sakshi Saraf
  - Department of Biology, University of South Dakota, Vermillion, SD, 57069, USA
- Jiquan Chen
  - Department of Geography, Environment, and Spatial Sciences, Michigan State University, East Lansing, MI, 48823, USA
  - Center for Global Change and Earth Observations, Michigan State University, East Lansing, MI, 48823, USA
- Brett Hankerson
  - Leibniz Institute of Agricultural Development in Transition Economies (IAMO), Theodor-Lieser-Str. 2, 06120, Halle (Saale), Germany
- Sarah Robinson
  - Institute for Agricultural Policy and Market Research & Centre for International Development and Environmental Research (ZEU), Justus Liebig University, Giessen, Germany
- Maira Kussainova
  - Center for Global Change and Earth Observations, Michigan State University, East Lansing, MI, 48823, USA
  - Kazakh National Agrarian Research University, AgriTech Hub KazNARU, 8 Abay Avenue, Almaty, 050010, Kazakhstan
  - Kazakh-German University (DKU), Nazarbaev avenue, 173, 050010, Almaty, Kazakhstan
- Khushboo Jain
  - Department of Sustainability and Environment, University of South Dakota, Vermillion, SD, 57069, USA
2
Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance. Journal of Probability and Statistics 2022. [DOI: 10.1155/2022/2833537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Machine learning algorithms, especially random forests (RFs), have become an integral part of modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regression performance. We assessed the effect of types of predictors (100, 75, 50, and 20% continuous, and 100% categorical), the number of predictors (p = 8, 16, and 24), and the sample size (N = 50, 250, and 1250) on RF parameter settings. We also compared RF performance to that of classical generalized linear models (Poisson, negative binomial, and zero-inflated Poisson) and the linear model applied to log-transformed data. Two real datasets were analysed to demonstrate the usefulness of RF for overdispersed data modelling. Goodness-of-fit statistics such as root mean square error (RMSE) and biases were used to determine RF accuracy and validity. Results revealed that the number of variables to be randomly selected for each split, the proportion of samples to train the model, the minimal number of samples within each terminal node, and RF regression performance are not influenced by the sample size, number, and type of predictors. However, the ratio of observations to the number of predictors affects the stability of the best RF parameters. RF performs well for all types of covariates and different levels of dispersion. The magnitude of dispersion does not significantly influence RF predictive validity. In contrast, its predictive accuracy is significantly influenced by the magnitude of dispersion in the response variable, conditional on the explanatory variables. RF has performed almost as well as the models of the classical Poisson family in the presence of overdispersion. Given RF’s advantages, it is an appropriate statistical alternative for count data.
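The abstract above turns on overdispersion in count data. As a rough illustration only, the sketch below computes the sample variance-to-mean ratio, which is about 1 for Poisson-distributed counts and larger when the counts are overdispersed. This is the unconditional ratio, not the dispersion conditional on explanatory variables that the study examines, and the function name and sample values are assumptions invented for the example.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance-to-mean ratio of a sequence of counts.

    A ratio near 1 is what a Poisson model expects; a ratio well above 1
    suggests overdispersion. This ignores covariates entirely.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical count samples, invented for illustration only.
sample_a = [2, 3, 1, 4, 2, 3, 2, 3]    # tightly clustered counts
sample_b = [0, 0, 1, 0, 12, 0, 9, 2]   # many zeros plus a few large counts

if __name__ == "__main__":
    print("sample_a ratio:", dispersion_index(sample_a))
    print("sample_b ratio:", dispersion_index(sample_b))
```

A ratio well above 1, as for `sample_b`, is the kind of signal that would motivate a negative binomial or zero-inflated model, or an RF, over a plain Poisson fit.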
3
Krennmair P, Schmid T. Flexible domain prediction using mixed effects random forests. J R Stat Soc Ser C Appl Stat 2022. [DOI: 10.1111/rssc.12600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Patrick Krennmair
  - Institute of Statistics and Econometrics, Freie Universität Berlin, Berlin, Germany
- Timo Schmid
  - Institute of Statistics, Otto‐Friedrich‐Universität Bamberg, Bamberg, Germany
4
Viljanen M, Meijerink L, Zwakhals L, van de Kassteele J. A machine learning approach to small area estimation: predicting the health, housing and well-being of the population of Netherlands. Int J Health Geogr 2022; 21:4. [PMID: 35668432 PMCID: PMC9169293 DOI: 10.1186/s12942-022-00304-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/04/2022] [Accepted: 05/12/2022] [Indexed: 11/12/2022]
Abstract
Background: Local policymakers require information about public health, housing and well-being at small geographical areas. A municipality can for example use this information to organize targeted activities with the aim of improving the well-being of their residents. Surveys are often used to gather data, but many neighborhoods can have only few or even zero respondents. In that case, estimating the status of the local population directly from survey responses is prone to be unreliable.
Methods: Small Area Estimation (SAE) is a technique to provide estimates at small geographical levels with only few or even zero respondents. In classical individual-level SAE, a complex statistical regression model is fitted to the survey responses by using auxiliary administrative data for the population as predictors, the missing responses are then predicted and aggregated to the desired geographical level. In this paper we compare gradient boosted trees (XGBoost), a well-known machine learning technique, to a structured additive regression model (STAR) designed for the specific problem of estimating public health and well-being in the whole population of the Netherlands.
Results: We compare the accuracy and performance of these models using out-of-sample predictions with five-fold Cross Validation (5CV). We do this for three data sets of different sample sizes and outcome types. Compared to the STAR model, gradient boosted trees are able to improve both the accuracy of the predictions and the total time taken to get these predictions. Even though the models appear quite similar in overall accuracy, the small area predictions at neighborhood level sometimes differ significantly. It may therefore make sense to pursue slightly more accurate models for better predictions into small areas. However, one of the biggest benefits is that XGBoost does not require prior knowledge or model specification. Data preparation and modelling is much easier, since the method automatically handles missing data, non-linear responses, interactions and accounts for spatial correlation structures.
Conclusions: In this paper we provide new nationwide estimates of health, housing and well-being indicators at neighborhood level in the Netherlands, see ‘Online materials’. We demonstrate that machine learning provides a good alternative to complex statistical regression modelling for small area estimation in terms of accuracy, robustness, speed and data preparation. These results can be used to make appropriate policy decisions at a local level and make recommendations about which estimation methods are beneficial in terms of accuracy, time and budget constraints.
Affiliation(s)
- Markus Viljanen
  - National Institute for Public Health and the Environment - RIVM, PO Box 1, 3720BA, Bilthoven, Netherlands
- Lotta Meijerink
  - National Institute for Public Health and the Environment - RIVM, PO Box 1, 3720BA, Bilthoven, Netherlands
- Laurens Zwakhals
  - National Institute for Public Health and the Environment - RIVM, PO Box 1, 3720BA, Bilthoven, Netherlands
- Jan van de Kassteele
  - National Institute for Public Health and the Environment - RIVM, PO Box 1, 3720BA, Bilthoven, Netherlands
5
Szarka N, Biljecki F. Population estimation beyond counts-Inferring demographic characteristics. PLoS One 2022; 17:e0266484. [PMID: 35381028 PMCID: PMC8982831 DOI: 10.1371/journal.pone.0266484] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/13/2021] [Accepted: 03/21/2022] [Indexed: 11/18/2022]
Abstract
Mapping population distribution at a fine spatial scale is essential for urban studies and planning. Numerous studies, mainly supported by geospatial and statistical methods, have focused primarily on predicting population counts. However, estimating their socio-economic characteristics beyond population counts, such as average age, income, and gender ratio, remains unattended. We enhance traditional population estimation by predicting not only the number of residents in an area, but also their demographic characteristics: average age and the proportion of seniors. By implementing and comparing different machine learning techniques (Random Forest, Support Vector Machines, and Linear Regression) in administrative areas in Singapore, we investigate the use of point of interest (POI) and real estate data for this purpose. The developed regression model predicts the average age of residents in a neighbourhood with a mean error of about 1.5 years (the range of average resident age across Singaporean districts spans approx. 14 years). The results reveal that age patterns of residents can be predicted using real estate information rather than with amenities, which is in contrast to estimating population counts. Another contribution of our work in population estimation is the use of previously unexploited POI and real estate datasets for it, such as property transactions, year of construction, and flat types (number of rooms). Advancing the domain of population estimation, this study reveals the prospects of a small set of detailed and strong predictors that might have the potential of estimating other demographic characteristics such as income.
Affiliation(s)
- Noée Szarka
  - School of GeoSciences, University of Edinburgh, Edinburgh, United Kingdom
  - Department of Architecture, National University of Singapore, Singapore, Singapore
- Filip Biljecki
  - Department of Architecture, National University of Singapore, Singapore, Singapore
  - Department of Real Estate, National University of Singapore, Singapore, Singapore
6
Estimating small-area population density in Sri Lanka using surveys and Geo-spatial data. PLoS One 2020; 15:e0237063. [PMID: 32756580 PMCID: PMC7406065 DOI: 10.1371/journal.pone.0237063] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/07/2019] [Accepted: 07/20/2020] [Indexed: 11/24/2022]
Abstract
Country-level census data are typically collected once every 10 years. However, conflicts, migration, urbanization, and natural disasters can rapidly shift local population patterns. This study demonstrates the feasibility of a “bottom-up”-method to estimate local population density in the between-census years by combining household surveys with contemporaneous geo-spatial data, including village-area and satellite imagery-based indicators. We apply this technique to the case of Sri Lanka using Poisson regression models based on variables selected using the Least Absolute Shrinkage and Selection Operator (LASSO). The model is estimated in villages sampled in the 2012/13 Household Income and Expenditure Survey, and is employed to obtain out-of-sample density estimates in the non-surveyed villages. These estimates approximate the census density accurately and are more precise than other bottom-up studies using similar geo-spatial data. While most open-source population products redistribute census population “top-down” from higher to lower spatial units using areal interpolation and dasymetric mapping techniques, these products become less accurate as the census itself ages. Our method circumvents the problem of the aging census by relying instead on more up-to-date household surveys. The collective evidence suggests that our method is cost effective in tracking local population density with greater frequency in the between-census years.
7
Disaggregating Population Data and Evaluating the Accuracy of Modeled High-Resolution Population Distribution—The Case Study of Germany. Sustainability 2020. [DOI: 10.3390/su12103976] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
High-resolution population data are a necessary basis for identifying affected regions (e.g., natural disasters, accessibility of social infrastructures) and deriving recommendations for policy and planning, but municipalities are, as in Germany, regularly the smallest available reference unit for data. The article presents a dasymetric-based approach for modeling high-resolution population data based on urban density, dispersion, and land cover/use. In addition to common test statistics like MAE or MAPE, the Gini-coefficient and the local Moran’s I are applied and their added value for accuracy assessment is tested. With data on urban density, a relative deviation between the modeled and actual population of 14.1% is achieved. Data on land cover/use reduces the deviation to 12.4%. With 23.6%, the dispersion measure cannot improve distribution accuracy. Overall, the algorithms perform better for urban than for rural areas. Gini-coefficients show that same spatial concentration patterns are achieved as in the actual population distribution. According to local Moran’s I, there are statistically significant underestimations, especially in the highly-dense inner-urban areas. Overestimates are found in the transition to less urbanized areas and the core areas of peripheral cities. Overall, the additional test statistics can provide important insights into the data, which go beyond common methods for evaluation.
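The accuracy measures named in this abstract can be sketched in a few lines. The example below gives textbook definitions of MAE, MAPE and the Gini coefficient for comparing a modeled population distribution against the actual one. The article's exact formulas are not given in the abstract, so these definitions, the function names and the sample numbers are all assumptions made for illustration. The local Moran's I is omitted because it needs a spatial weights matrix.

```python
def mae(actual, modeled):
    """Mean absolute error between actual and modeled values."""
    return sum(abs(m - a) for a, m in zip(actual, modeled)) / len(actual)


def mape(actual, modeled):
    """Mean absolute percentage error; every actual value must be non-zero."""
    return 100.0 * sum(abs(m - a) / a for a, m in zip(actual, modeled)) / len(actual)


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical populations of four spatial units, invented for illustration.
actual_pop = [100, 200, 50, 400]
modeled_pop = [110, 180, 60, 400]

if __name__ == "__main__":
    print("MAE:", mae(actual_pop, modeled_pop))
    print("MAPE (%):", mape(actual_pop, modeled_pop))
    print("Gini actual:", gini(actual_pop))
    print("Gini modeled:", gini(modeled_pop))
```

Comparing the two Gini values shows whether the model reproduces the spatial concentration of the actual population, which MAE and MAPE alone do not reveal.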
8
Xiao F, Wang Y, Gao Y, Zhu Y, Zhao J. Continuous estimation of joint angle from electromyography using multiple time-delayed features and random forests. Biomed Signal Process Control 2018. [DOI: 10.1016/j.bspc.2017.08.015] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 11/26/2022]
9
Biljecki F, Arroyo Ohori K, Ledoux H, Peters R, Stoter J. Population Estimation Using a 3D City Model: A Multi-Scale Country-Wide Study in the Netherlands. PLoS One 2016; 11:e0156808. [PMID: 27254151 PMCID: PMC4890761 DOI: 10.1371/journal.pone.0156808] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/24/2016] [Accepted: 05/19/2016] [Indexed: 11/27/2022]
Abstract
The remote estimation of a region’s population has for decades been a key application of geographic information science in demography. Most studies have used 2D data (maps, satellite imagery) to estimate population avoiding field surveys and questionnaires. As the availability of semantic 3D city models is constantly increasing, we investigate to what extent they can be used for the same purpose. Based on the assumption that housing space is a proxy for the number of its residents, we use two methods to estimate the population with 3D city models in two directions: (1) disaggregation (areal interpolation) to estimate the population of small administrative entities (e.g. neighbourhoods) from that of larger ones (e.g. municipalities); and (2) a statistical modelling approach to estimate the population of large entities from a sample composed of their smaller ones (e.g. one acquired by a government register). Starting from a complete Dutch census dataset at the neighbourhood level and a 3D model of all 9.9 million buildings in the Netherlands, we compare the population estimates obtained by both methods with the actual population as reported in the census, and use it to evaluate the quality that can be achieved by estimations at different administrative levels. We also analyse how the volume-based estimation enabled by 3D city models fares in comparison to 2D methods using building footprints and floor areas, as well as how it is affected by different levels of semantic detail in a 3D city model. We conclude that 3D city models are useful for estimations of large areas (e.g. for a country), and that the 3D approach has clear advantages over the 2D approach.
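The two estimation directions described in this abstract can be illustrated with a minimal sketch that treats building volume as a proxy for residents. It assumes the simplest possible forms: proportional allocation for the disaggregation, and a pooled people-per-volume ratio for the sample-based estimate. The authors' actual statistical model is not specified in the abstract, and all names and numbers here are hypothetical.

```python
def disaggregate(total_population, volumes):
    """Split a known total across sub-areas in proportion to building volume."""
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("total building volume must be positive")
    return [total_population * v / total_volume for v in volumes]


def estimate_from_sample(sample_populations, sample_volumes, target_volume):
    """Estimate a large area's population from a sample of its sub-areas.

    Uses a pooled ratio (residents per unit of volume) from the sample and
    applies it to the building volume of the whole target area.
    """
    sample_volume = sum(sample_volumes)
    if sample_volume <= 0:
        raise ValueError("sample building volume must be positive")
    rate = sum(sample_populations) / sample_volume
    return rate * target_volume


# Hypothetical municipality with three neighbourhoods (volumes in cubic metres).
neighbourhood_volumes = [50_000, 30_000, 20_000]

if __name__ == "__main__":
    # Direction 1: municipality total known, neighbourhood values wanted.
    print(disaggregate(10_000, neighbourhood_volumes))
    # Direction 2: two neighbourhoods sampled, municipality total wanted.
    print(estimate_from_sample([500, 300], [50_000, 30_000], 100_000))
```

Replacing volume with footprint or floor area in the same functions gives the 2D counterpart that the abstract compares against.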
Affiliation(s)
- Filip Biljecki
  - 3D Geoinformation, Delft University of Technology, Delft, The Netherlands
- Ken Arroyo Ohori
  - 3D Geoinformation, Delft University of Technology, Delft, The Netherlands
- Hugo Ledoux
  - 3D Geoinformation, Delft University of Technology, Delft, The Netherlands
- Ravi Peters
  - 3D Geoinformation, Delft University of Technology, Delft, The Netherlands
- Jantien Stoter
  - 3D Geoinformation, Delft University of Technology, Delft, The Netherlands
10
Nicolas G, Robinson TP, Wint GRW, Conchedda G, Cinardi G, Gilbert M. Using Random Forest to Improve the Downscaling of Global Livestock Census Data. PLoS One 2016; 11:e0150424. [PMID: 26977807 PMCID: PMC4792414 DOI: 10.1371/journal.pone.0150424] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 07/03/2015] [Accepted: 02/12/2016] [Indexed: 11/23/2022]
Abstract
Large scale, high-resolution global data on farm animal distributions are essential for spatially explicit assessments of the epidemiological, environmental and socio-economic impacts of the livestock sector. This has been the major motivation behind the development of the Gridded Livestock of the World (GLW) database, which has been extensively used since its first publication in 2007. The database relies on a downscaling methodology whereby census counts of animals in sub-national administrative units are redistributed at the level of grid cells as a function of a series of spatial covariates. The recent upgrade of GLW1 to GLW2 involved automating the processing, improvement of input data, and downscaling at a spatial resolution of 1 km per cell (5 km per cell in the earlier version). The underlying statistical methodology, however, remained unchanged. In this paper, we evaluate new methods to downscale census data with a higher accuracy and increased processing efficiency. Two main factors were evaluated, based on sample census datasets of cattle in Africa and chickens in Asia. First, we implemented and evaluated Random Forest models (RF) instead of stratified regressions. Second, we investigated whether models that predicted the number of animals per rural person (per capita) could provide better downscaled estimates than the previous approach that predicted absolute densities (animals per km²). RF models consistently provided better predictions than the stratified regressions for both continents and species. The benefit of per capita over absolute density models varied according to the species and continent. In addition, different technical options were evaluated to reduce the processing time while maintaining their predictive power. Future GLW runs (GLW 3.0) will apply the new RF methodology with optimized modelling options. The potential benefit of per capita models will need to be further investigated with a better distinction between rural and agricultural populations.
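The contrast between absolute-density and per capita models can be made concrete with a small sketch. It assumes that some model has already produced a prediction for each grid cell (the prediction values below are invented, and no Random Forest is fitted here). It converts each kind of prediction to animals per cell and then rescales the cells so they sum to the census count of the administrative unit. That rescaling is a common mass-preserving step and an assumption of this example, not a description of the GLW code.

```python
def animals_from_density(pred_density_per_km2, cell_area_km2):
    """Raw animals per cell from an absolute-density prediction (animals per km^2)."""
    return [d * a for d, a in zip(pred_density_per_km2, cell_area_km2)]


def animals_from_per_capita(pred_animals_per_person, rural_population):
    """Raw animals per cell from a per capita prediction (animals per rural person)."""
    return [r * p for r, p in zip(pred_animals_per_person, rural_population)]


def rescale_to_census(raw_animals, census_count):
    """Scale raw cell estimates so that they sum to the census count of the unit."""
    total = sum(raw_animals)
    if total <= 0:
        raise ValueError("raw estimates must sum to a positive value")
    return [census_count * x / total for x in raw_animals]


# Hypothetical administrative unit made of four 1 km^2 grid cells.
cell_area_km2 = [1.0, 1.0, 1.0, 1.0]
rural_population = [100, 300, 0, 600]
census_count = 1000  # animals reported for the whole unit

# Hypothetical model outputs, invented for illustration only.
pred_density = [10.0, 20.0, 30.0, 40.0]   # animals per km^2
pred_per_capita = [0.5, 0.5, 0.5, 0.5]    # animals per rural person

density_cells = rescale_to_census(
    animals_from_density(pred_density, cell_area_km2), census_count)
per_capita_cells = rescale_to_census(
    animals_from_per_capita(pred_per_capita, rural_population), census_count)

if __name__ == "__main__":
    print("density model:", density_cells)
    print("per capita model:", per_capita_cells)
```

In this sketch the per capita variant places no animals in the cell without rural residents, while the density variant does, which is one way the two formulations can produce different maps from the same census count.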
Affiliation(s)
- Gaëlle Nicolas
  - Biological Control and Spatial Ecology, Université Libre de Bruxelles, Brussels, Belgium
  - Fonds National de la Recherche Scientifique, Brussels, Belgium
- Timothy P. Robinson
  - International Livestock Research Institute (ILRI), Livestock Systems and Environment (LSE), Nairobi, Kenya
- G. R. William Wint
  - Environmental Research Group Oxford (ERGO) - Department of Zoology, University of Oxford, Oxford, United Kingdom
- Giulia Conchedda
  - Animal Production and Health Division (AGA), Food and Agriculture Organization of the United Nations (FAO), Rome, Italy
- Giuseppina Cinardi
  - Animal Production and Health Division (AGA), Food and Agriculture Organization of the United Nations (FAO), Rome, Italy
- Marius Gilbert
  - Biological Control and Spatial Ecology, Université Libre de Bruxelles, Brussels, Belgium
  - Fonds National de la Recherche Scientifique, Brussels, Belgium