1
Gil-Marin JK, Shirazi M, Ivan JN. Assessing the Negative Binomial-Lindley model for crash hotspot identification: Insights from Monte Carlo simulation analysis. Accid Anal Prev 2024;199:107478. [PMID: 38458009] [DOI: 10.1016/j.aap.2024.107478] [Received: 05/26/2023; Revised: 12/27/2023; Accepted: 01/13/2024; Indexed: 03/10/2024]
Abstract
Identifying hazardous crash sites (or hotspots) is a crucial step in highway safety management. The Negative Binomial (NB) model is the most common model used in safety analyses and evaluations, including hotspot identification. The NB model, however, is not without limitations: it does not perform well when data are highly dispersed, include excess zero observations, or have a long tail. Recently, the Negative Binomial-Lindley (NB-L) model has been proposed as an alternative to the NB. The NB-L model overcomes several limitations of the NB, such as the issue of excess zero observations in highly dispersed data. However, it is not clear how the NB-L model performs for hotspot identification. In this paper, an innovative Monte Carlo simulation protocol was designed to generate a wide range of simulated data characterized by different means, dispersions, and percentages of zeros. Next, the NB-L model was written as a Full-Bayes hierarchical model and compared with the Full-Bayes NB model for hotspot identification using extensive simulation scenarios. Most previous studies focused on statistical fit and showed that the NB-L model fits the data better than the NB. In this research, however, we investigated the performance of the NB-L model in identifying hazardous sites. We showed that there is a trade-off between the NB-L and NB when it comes to hotspot identification. Multiple performance metrics were used for the assessment. Among those, the results show that the NB-L model provides better specificity in identifying hotspots, while the NB model provides better sensitivity, especially for highly dispersed data. In other words, while the NB model identifies more of the truly hazardous sites, the NB-L model is better at not flagging non-hazardous sites as hazardous, which is valuable when budgets are limited.
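The sensitivity/specificity trade-off described above can be illustrated with a toy simulation (not the paper's Full-Bayes protocol): sites receive latent NB means via a gamma-mixed Poisson, true hotspots are the sites with the largest latent means, and a naive rule flags the top-ranked observed counts. All function names and parameter values below are illustrative assumptions.

```python
import math
import random

random.seed(42)

def poisson(lam, rng=random):
    # Knuth's method; adequate for the small means used here
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def simulate_sites(n_sites=2000, mu=1.0, dispersion=0.5):
    """Each site gets a latent NB mean (gamma-mixed Poisson) and an observed count."""
    sites = []
    for _ in range(n_sites):
        lam = random.gammavariate(dispersion, mu / dispersion)  # NB as Poisson-gamma
        sites.append((lam, poisson(lam)))
    return sites

def sensitivity_specificity(sites, top_frac=0.05):
    """Flag the top 5% of sites by observed count; score against the latent means."""
    n_flag = int(top_frac * len(sites))
    by_truth = sorted(sites, key=lambda s: s[0], reverse=True)
    true_hot = set(id(s) for s in by_truth[:n_flag])
    by_obs = sorted(sites, key=lambda s: s[1], reverse=True)
    flagged = set(id(s) for s in by_obs[:n_flag])
    tp = len(true_hot & flagged)
    fp = n_flag - tp          # flagged but not truly hazardous
    fn = n_flag - tp          # truly hazardous but missed
    tn = len(sites) - n_flag - fp
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity(simulate_sites())
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

With highly dispersed data, specificity stays high almost by construction (few sites are flagged), while sensitivity degrades — the direction of the trade-off the abstract reports.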
Affiliation(s)
- Jhan Kevin Gil-Marin
- Department of Civil and Environmental Engineering, University of Maine, Orono, ME, 04469, USA.
- Mohammadali Shirazi
- Department of Civil and Environmental Engineering, University of Maine, Orono, ME, 04469, USA.
- John N Ivan
- Department of Civil and Environmental Engineering, University of Connecticut, Storrs, CT, 06269, USA.
2
Pelizzola M, Laursen R, Hobolth A. Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization. BMC Bioinformatics 2023;24:187. [PMID: 37158829] [PMCID: PMC10165836] [DOI: 10.1186/s12859-023-05304-1] [Received: 01/11/2023; Accepted: 04/25/2023; Indexed: 05/10/2023] [Open Access]
Abstract
BACKGROUND The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures, which can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures (the rank of the factorization). In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different ranks using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. RESULTS We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients, and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison in which we show that state-of-the-art methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. For the real data we describe a residual analysis to investigate and validate the model choice. CONCLUSIONS With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification, and more accurate than the available methods in the literature at finding the true number of signatures. Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS at https://github.com/MartaPelizzola/SigMoS.
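The overdispersion that motivates the Negative Binomial assumption can be checked directly from summary moments. A minimal sketch with hypothetical counts (not the paper's mutational catalogues), using the moment relation var = m + m²/k of the usual NB size parameterization:

```python
import statistics

# Hypothetical mutational counts for one patient across contexts (toy numbers).
counts = [0, 0, 1, 0, 3, 12, 0, 5, 0, 22, 1, 0, 7, 0, 2, 15, 0, 1, 4, 0]

m = statistics.mean(counts)
v = statistics.variance(counts)  # sample variance

# Under a Poisson model the variance should roughly equal the mean.
print(f"mean={m:.2f} variance={v:.2f} (variance/mean={v / m:.1f})")

# Method-of-moments Negative Binomial: var = m + m^2/k  =>  k = m^2 / (v - m)
if v > m:
    k_hat = m * m / (v - m)
    print(f"overdispersed; moment estimate of NB size k = {k_hat:.2f}")
```

A variance-to-mean ratio far above 1 is exactly the residual pattern the abstract says rules out the Poisson model.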
Affiliation(s)
- Marta Pelizzola
- Department of Mathematics, Aarhus University, Aarhus, Denmark.
- Asger Hobolth
- Department of Mathematics, Aarhus University, Aarhus, Denmark.
3
Tan YL, Yiew TH, Lau LS, Tan AL. Environmental Kuznets curve for biodiversity loss: evidence from South and Southeast Asian countries. Environ Sci Pollut Res Int 2022;29:64004-64021. [PMID: 35467185] [DOI: 10.1007/s11356-022-20090-8] [Received: 11/23/2021; Accepted: 04/01/2022; Indexed: 06/14/2023]
Abstract
This study aims to explore the income-biodiversity loss nexus in South and Southeast Asian countries over the period 2013 to 2019. Negative Binomial regression models are used to deal with the count nature of the dependent variable, with specific emphasis on different taxonomic groups of threatened species, namely mammal, bird, reptile, amphibian, fish, mollusk, other invertebrate, plant, and total threatened species. We find strong support for an inverted U-shaped relationship between income and biodiversity loss in all taxonomic groups of threatened species examined. Additionally, agricultural land has a significant and positive effect on biodiversity loss. Control of corruption and biodiversity loss are found to be negatively associated. The inverted U-shaped EKC suggests that South and Southeast Asian countries should identify policy priority areas that can sustain robust economic growth while reducing biodiversity loss. Our findings also provide valuable policy insights to help policymakers better cope with the problem of biodiversity loss via corruption control and agricultural land use.
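Under the quadratic-in-log-income specification commonly used in EKC count regressions, the turning point of the inverted U falls where the derivative with respect to log income vanishes. A small sketch with illustrative coefficients (assumed values, not the paper's estimates):

```python
import math

# Hypothetical NB regression coefficients on log-income and its square
# (illustrative values only; an inverted U requires b2 < 0).
b1, b2 = 2.4, -0.14

# Expected count: exp(b0 + b1*ln(y) + b2*ln(y)^2); setting the derivative
# with respect to ln(y) to zero gives ln(y*) = -b1 / (2*b2).
log_y_star = -b1 / (2 * b2)
y_star = math.exp(log_y_star)
print(f"EKC turning point at income level y* = exp({log_y_star:.2f}) = {y_star:,.0f}")
```

The sign test on b2 and the location of y* relative to the sample's income range are what determine whether the inverted-U interpretation holds.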
Affiliation(s)
- Yan-Ling Tan
- Faculty of Business and Management, Universiti Teknologi MARA Cawangan Johor Kampus Segamat, Segamat, Malaysia.
- Thian-Hee Yiew
- Faculty of Business and Finance, Universiti Tunku Abdul Rahman, Kampar, Malaysia.
- Lin-Sea Lau
- Faculty of Business and Finance, Universiti Tunku Abdul Rahman, Kampar, Malaysia.
- Ai-Lian Tan
- Faculty of Business and Finance, Universiti Tunku Abdul Rahman, Kampar, Malaysia.
4
Delgado R. Detecting target species: with how many samples? R Soc Open Sci 2022;9:220046. [PMID: 35958088] [PMCID: PMC9364006] [DOI: 10.1098/rsos.220046] [Received: 01/12/2022; Accepted: 07/20/2022; Indexed: 06/15/2023]
Abstract
The detection of target species is of paramount importance in ecological studies, with implications for environmental management and natural resource conservation planning. Detection is usually done by sampling the area: the species is detected if at least one individual is found in the samples. Green & Young (Green & Young 1993 Sampling to detect rare species. Ecol. Appl. 3, 351-356. (doi:10.2307/1941837)) introduced two models to determine the minimum number of samples n that ensures that the probability of failing to detect the species, if it is actually present in the area, does not exceed a fixed threshold: one based on the Poisson distribution and the other on the Negative Binomial. We generalize them to two scenarios, one considering the area size N to be finite, and the other allowing detectability errors with probability δ. The results in Green & Young are recovered by taking N → ∞ and δ = 0. Ignoring the finite size of the area, if known, leads to an overestimation of n, which is vital to avoid when sampling is expensive or difficult, while assuming that there are no detectability errors, when they really exist, produces an undesirable bias. Our generalization avoids both problems, for both the Poisson and the Negative Binomial models.
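Restricted to the Green & Young baseline recovered at N → ∞ and δ = 0, the minimum sample size follows from the per-sample probability of observing zero individuals. A sketch (function names are ours; the NB parameterization with mean m per sample and clumping parameter k is an assumption):

```python
import math

def n_poisson(mean_per_sample, alpha):
    """Smallest n with P(no detection in n samples) = exp(-n*m) <= alpha."""
    return math.ceil(-math.log(alpha) / mean_per_sample)

def n_negbin(mean_per_sample, k, alpha):
    """NB per-sample zero probability is (1 + m/k)**(-k); solve p0**n <= alpha."""
    p0 = (1 + mean_per_sample / k) ** (-k)
    return math.ceil(math.log(alpha) / math.log(p0))

# Example: mean 0.1 individuals per sample, at most 5% risk of missing the species
print(n_poisson(0.1, 0.05))        # Poisson (random spatial pattern)
print(n_negbin(0.1, 0.5, 0.05))    # clumped pattern (NB, k = 0.5) needs more samples
```

The clumped (NB) case always requires at least as many samples as the Poisson case for the same mean, which is the practical reason the choice of distribution matters.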
Affiliation(s)
- Rosario Delgado
- Department of Mathematics, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès 08193, Spain.
5
Goksuluk D, Zararsiz G, Korkmaz S, Eldem V, Zararsiz GE, Ozcetin E, Ozturk A, Karaagaoglu AE. MLSeq: Machine learning interface for RNA-sequencing data. Comput Methods Programs Biomed 2019;175:223-231. [PMID: 31104710] [DOI: 10.1016/j.cmpb.2019.04.007] [Received: 12/23/2018; Revised: 03/21/2019; Accepted: 04/08/2019; Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVE In the last decade, RNA-sequencing technology has become the method of choice, preferred over microarray technology for gene-expression-based classification and differential expression analysis since it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of algorithms and programs available for the classification of RNA-sequencing data is limited. For this reason, we developed MLSeq to bring together not only frequently used classification algorithms but also novel approaches, and to make them available for the classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the BIOCONDUCTOR network. METHODS Classification of RNA-sequencing data is not straightforward since raw data should be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) perform algorithms directly proposed for the RNA-sequencing data structure, or (ii) transform the RNA-sequencing data to bring it distributionally closer to the microarray data structure, and perform algorithms developed for microarray data. Moreover, we proposed novel algorithms through MLSeq, such as voom (an acronym for variance modelling at the observational level) based nearest shrunken centroids (voomNSC), diagonal linear discriminant analysis (voomDLDA), etc. MATERIALS Three real RNA-sequencing datasets (cervical cancer, lung cancer, and aging) were used to evaluate model performances. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC), and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. Each algorithm is compared using classification accuracies and sparsities on an independent test set. RESULTS The algorithms based on discrete distributions performed better on the cervical cancer and aging data, with accuracies above 0.92. On the lung cancer data, most algorithms performed similarly, with accuracies around 0.88, except that SVM achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, on the aging data, sparse classifiers were not able to select an optimal subset of the features. CONCLUSION MLSeq is a comprehensive and easy-to-use interface for the classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks through a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data.
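MLSeq itself is an R/Bioconductor package; as a language-neutral illustration of the discrete-distribution idea behind PLDA, here is a toy Poisson discriminant that scores a sample's counts under class-specific rates and picks the higher log-likelihood (all rates, counts, and class names are hypothetical):

```python
import math

def poisson_loglik(counts, rates):
    # Sum of Poisson log-pmfs: x*log(r) - r - log(x!)
    return sum(x * math.log(r) - r - math.lgamma(x + 1)
               for x, r in zip(counts, rates))

class_rates = {
    "tumor":  [12.0, 3.0, 8.0, 0.5],   # hypothetical per-gene mean counts
    "normal": [4.0, 6.0, 2.0, 2.5],
}

sample = [10, 2, 9, 1]  # a new sample's counts for the same four genes
pred = max(class_rates, key=lambda c: poisson_loglik(sample, class_rates[c]))
print(pred)
```

Real PLDA/NBLDA additionally estimate the per-class rates from training data with normalization and shrinkage; this sketch only shows the likelihood-comparison step.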
Affiliation(s)
- Dincer Goksuluk
- Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey.
- Gokmen Zararsiz
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey.
- Selcuk Korkmaz
- Department of Biostatistics, School of Medicine, Trakya University, 22030, Edirne, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey.
- Vahap Eldem
- Department of Biology, Faculty of Science, Istanbul University, 34452, Istanbul, Turkey.
- Gozde Erturk Zararsiz
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey.
- Erdener Ozcetin
- Department of Industrial Engineering, Faculty of Engineering, Hitit University, 19030, Corum, Turkey.
- Ahmet Ozturk
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey.
- Ahmet Ergun Karaagaoglu
- Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey.
6
Largajolli A, Beerahee M, Yang S. Bayesian approach to investigate a two-state mixed model of COPD exacerbations. J Pharmacokinet Pharmacodyn 2019;46:371-384. [PMID: 31197640] [PMCID: PMC6848253] [DOI: 10.1007/s10928-019-09643-6] [Received: 01/11/2019; Accepted: 06/05/2019; Indexed: 11/29/2022]
Abstract
Chronic obstructive pulmonary disease (COPD) is a chronic obstructive disease of the airways. An exacerbation of COPD is defined by shortness of breath, cough, and sputum production. New therapies for COPD exacerbations are frequently examined in clinical trials based on the number of exacerbations, which implies long-term studies due to the high variability in the occurrence and duration of the events. In this work, we expanded the two-state model developed by Cook et al., in which the patient transits from an asymptomatic (state 1) to a symptomatic state (state 2) and vice versa, by investigating different semi-Markov models in a Bayesian context using data from actual clinical trials. Of the four models tested, the log-logistic model was shown to adequately characterize the duration and number of COPD exacerbations. The patient's disease stage was found to be a significant covariate, with the effect of accelerating the transition from the asymptomatic to the symptomatic state. In addition, the best dropout model (log-logistic) was incorporated in the final two-state model to describe the dropout mechanism. Simulation-based diagnostics such as the posterior predictive check (PPC) and visual predictive check (VPC) were used to assess the behaviour of the model. The final model was applied to three clinical trial datasets to investigate its ability to detect the drug effect: the drug effect was captured in all three datasets and in both directions (from state 1 to state 2 and vice versa). A practical design investigation was also carried out and showed the limits of reducing the number of subjects and study length on drug-effect identification. Finally, clinical trial simulation confirmed that the model can potentially be used to predict medium-term (6-12 months) clinical trial outcomes using the first 3 months of data, but at the expense of showing a non-significant drug effect.
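The two-state alternating structure with log-logistic sojourn times can be sketched as a simple simulation (all parameters are hypothetical, not the fitted trial values; the inverse-CDF draw assumes the standard scale/shape log-logistic form):

```python
import random

random.seed(7)

def loglogistic(scale, shape, rng=random):
    """Inverse-CDF draw from F(t) = 1 / (1 + (t/scale)**(-shape))."""
    u = rng.random()
    return scale * (u / (1 - u)) ** (1 / shape)

def simulate_patient(days=365, scale_12=60.0, shape_12=1.5,
                     scale_21=10.0, shape_21=2.0):
    """Alternate asymptomatic (state 1) -> symptomatic (state 2) -> state 1 ...
    with log-logistic sojourn times; return the number of exacerbation onsets
    within the follow-up window. Hypothetical parameter values."""
    t, state, n_exac = 0.0, 1, 0
    while t < days:
        if state == 1:
            t += loglogistic(scale_12, shape_12)   # time until next exacerbation
            if t < days:
                n_exac += 1
            state = 2
        else:
            t += loglogistic(scale_21, shape_21)   # exacerbation duration
            state = 1
    return n_exac

counts = [simulate_patient() for _ in range(500)]
print(f"mean exacerbations per patient-year = {sum(counts) / len(counts):.1f}")
```

The heavy right tail of the log-logistic sojourns is what produces the high between-patient variability in exacerbation counts that the abstract cites as the reason trials must run long.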
Affiliation(s)
- Anna Largajolli
- GlaxoSmithKline, Research and Development, Uxbridge, UK; Certara Strategic Consulting, Via G.B. Pirelli 27, 20124, Milano, Italy.
- Shuying Yang
- GlaxoSmithKline, Research and Development, Uxbridge, UK; Clinical Pharmacology Modelling and Simulation, Quantitative Sciences, GlaxoSmithKline, Stockley Park West, 1-3 Ironbridge Road, Uxbridge, Middlesex, UB11 1BT, UK.
7
Shirazi M, Dhavala SS, Lord D, Geedipally SR. A methodology to design heuristics for model selection based on the characteristics of data: Application to investigate when the Negative Binomial Lindley (NB-L) is preferred over the Negative Binomial (NB). Accid Anal Prev 2017;107:186-194. [PMID: 28886410] [DOI: 10.1016/j.aap.2017.07.002] [Received: 02/23/2017; Revised: 05/25/2017; Accepted: 07/04/2017; Indexed: 06/07/2023]
Abstract
Safety analysts usually use post-modeling methods, such as Goodness-of-Fit statistics or the Likelihood Ratio Test, to decide between two or more competing distributions or models. Such metrics require all competing distributions to be fitted to the data before any comparisons can be made. Given the continuous growth in the number of new statistical distributions, choosing the best one using such post-modeling methods is not a trivial task, in addition to all the theoretical or numerical issues the analyst may face during the analysis. Furthermore, and most importantly, these measures or tests do not provide any intuition about why a specific distribution (or model) is preferred over another (Goodness-of-Logic). This paper addresses these issues by proposing a methodology to design heuristics for model selection based on the characteristics of the data, in terms of descriptive summary statistics, before fitting the models. The proposed methodology employs two analytic tools, (1) Monte Carlo simulations and (2) machine learning classifiers, to design easy heuristics to predict the label of the 'most-likely-true' distribution for analyzing the data. The proposed methodology was applied to investigate when the recently introduced Negative Binomial Lindley (NB-L) distribution is preferred over the Negative Binomial (NB) distribution. Heuristics were designed to select the 'most-likely-true' distribution between these two distributions, given a set of prescribed summary statistics of the data. The proposed heuristics were successfully compared against classical tests on several real or observed datasets. Not only are they easy to use and free of any post-modeling inputs, but, using these heuristics, the analyst can also gain useful insight into why the NB-L is preferred over the NB, or vice versa, when modeling data.
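The heuristics operate on pre-modeling descriptive summary statistics of the counts. A sketch of such a feature vector on toy data (the paper's actual feature set and decision rules come from its simulations and trained classifiers; these statistic names are ours):

```python
import statistics

# Toy crash-count data with many zeros and a long tail.
counts = [0] * 40 + [1] * 8 + [2] * 5 + [3, 4, 4, 6, 9, 15, 31]

m = statistics.mean(counts)
v = statistics.variance(counts)
features = {
    "mean": m,
    "variance-to-mean": v / m,                       # overdispersion
    "pct-zeros": counts.count(0) / len(counts),      # excess zeros
    "skewness": sum((x - m) ** 3 for x in counts) / (len(counts) * v ** 1.5),
}
for name, val in features.items():
    print(f"{name}: {val:.2f}")
```

A heuristic of the kind the paper designs would map such a feature vector to an "NB" or "NB-L" label before any model is fitted.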
Affiliation(s)
- Mohammadali Shirazi
- Zachry Department of Civil Engineering, Texas A&M University, College Station, TX 77843, United States.
- Dominique Lord
- Zachry Department of Civil Engineering, Texas A&M University, College Station, TX 77843, United States.
8
Cairns J, Lynch AG, Tavaré S. Quantifying the impact of inter-site heterogeneity on the distribution of ChIP-seq data. Front Genet 2014;5:399. [PMID: 25452765] [PMCID: PMC4231950] [DOI: 10.3389/fgene.2014.00399] [Received: 06/09/2014; Accepted: 10/29/2014; Indexed: 12/13/2022] [Open Access]
Abstract
Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) is a valuable tool for epigenetic studies. Analysis of the data arising from ChIP-seq experiments often requires implicit or explicit statistical modeling of the read counts. The simple Poisson model is attractive but does not provide a good fit to observed ChIP-seq data. Researchers therefore often either extend to a more general model (e.g., the Negative Binomial) and/or exclude regions of the genome that do not conform to the model. Since many modeling strategies employed for ChIP-seq data reduce to fitting a mixture of Poisson distributions, we explore the problem of inferring the optimal mixing distribution. We apply the Constrained Newton Method (CNM), which suggests the Negative Binomial-Negative Binomial (NB-NB) mixture model as a candidate for modeling ChIP-seq data. We illustrate fitting the NB-NB model with an accelerated EM algorithm on four data sets from three species. Zero-inflated models have been suggested as an approach to improve model fit for ChIP-seq data. We show that the NB-NB mixture model requires no zero-inflation and suggest that in some cases the need for zero-inflation is driven by the model's inability to cope with both artifactual large read counts and the frequently observed very low read counts. We see that the CNM-based approach is a useful diagnostic for the assessment of model fit and inference in ChIP-seq data and beyond. Use of the suggested NB-NB mixture model will be of value not only when calling peaks or otherwise modeling ChIP-seq data, but also when simulating data or constructing blacklists de novo.
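Since many of the modeling strategies reduce to Poisson mixtures, a minimal EM fit of a two-component Poisson mixture illustrates the building block (toy counts; this is not the Constrained Newton Method or the NB-NB model itself):

```python
import math

# Toy read counts: a low-signal background component mixed with an enriched one.
counts = [0, 1, 0, 2, 1, 0, 9, 11, 8, 12, 0, 1, 10, 2, 0, 13, 1, 9]

def pois_pmf(x, lam):
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

w, lam1, lam2 = 0.5, 1.0, 8.0   # initial guesses: weight and component means
for _ in range(200):
    # E-step: responsibility of component 1 for each count
    r = [w * pois_pmf(x, lam1) /
         (w * pois_pmf(x, lam1) + (1 - w) * pois_pmf(x, lam2))
         for x in counts]
    # M-step: update the mixture weight and the component means
    w = sum(r) / len(counts)
    lam1 = sum(ri * x for ri, x in zip(r, counts)) / sum(r)
    lam2 = sum((1 - ri) * x for ri, x in zip(r, counts)) / (len(counts) - sum(r))

print(f"w={w:.2f} lam1={lam1:.2f} lam2={lam2:.2f}")
```

Replacing each Poisson component with a Negative Binomial (i.e., letting each component's rate itself vary) gives a model of the NB-NB type; the mixing distribution is what the CNM infers nonparametrically.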
Affiliation(s)
- Jonathan Cairns
- Nuclear Dynamics Group, The Babraham Institute, Cambridge, UK; Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
- Andy G Lynch
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
- Simon Tavaré
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
9
Goh KCK, Currie G, Sarvi M, Logan D. Bus accident analysis of routes with/without bus priority. Accid Anal Prev 2014;65:18-27. [PMID: 24406378] [DOI: 10.1016/j.aap.2013.12.002] [Received: 10/15/2013; Revised: 11/21/2013; Accepted: 12/05/2013; Indexed: 06/03/2023]
Abstract
This paper summarises findings on road safety performance and bus-involved accidents in Melbourne along roads where bus priority measures had been applied. Results from an empirical analysis of the accident types revealed a significant reduction in the proportion of accidents involving buses hitting stationary objects and vehicles, suggesting that bus priority helps address manoeuvrability issues for buses. Mixed-effects negative binomial (MENB) regression and back-propagation neural network (BPNN) modelling of bus accidents, considering wider influences on accident rates at the route-section level, also revealed significant safety benefits when bus priority is provided. Sensitivity analyses on the BPNN model showed general agreement in the predicted accident frequency between the two models. The slightly better performance recorded by the MENB model suggests merit in adopting a mixed-effects modelling approach for accident-count prediction in practice, given its capability to account for unobserved location- and time-specific factors. A major implication of this research is that bus priority in Melbourne's context acts to improve road safety and should be a major consideration for road management agencies when implementing bus priority and road schemes.
Affiliation(s)
- Kelvin Chun Keong Goh
- Department of Civil Engineering, Building 60, Monash University, Clayton, VIC 3800, Australia.
- Graham Currie
- Department of Civil Engineering, Building 60, Monash University, Clayton, VIC 3800, Australia.
- Majid Sarvi
- Department of Civil Engineering, Building 60, Monash University, Clayton, VIC 3800, Australia.
- David Logan
- Monash University Accident Research Centre, Building 70, Monash University, Clayton, VIC 3800, Australia.