1
Lee KH, Pedroza C, Avritscher EBC, Mosquera RA, Tyson JE. Evaluation of negative binomial and zero-inflated negative binomial models for the analysis of zero-inflated count data: application to the telemedicine for children with medical complexity trial. Trials 2023; 24:613. [PMID: 37752579 PMCID: PMC10523642 DOI: 10.1186/s13063-023-07648-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Accepted: 09/12/2023] [Indexed: 09/28/2023] [Open access]
Abstract
BACKGROUND: Two characteristics of commonly used outcomes in medical research are zero inflation and non-negative integers; examples include the number of hospital admissions or emergency department visits, where the majority of patients will have zero counts. Zero-inflated regression models were devised to analyze this type of data. However, the performance of zero-inflated regression models or the properties of data best suited for these analyses have not been thoroughly investigated.
METHODS: We conducted a simulation study to evaluate the performance of two generalized linear models, negative binomial (NB) and zero-inflated negative binomial (ZINB), for analyzing zero-inflated (ZI) count data. Simulation scenarios assumed a randomized controlled trial design and varied the true underlying distribution, sample size, and rate of zero inflation. We compared the models in terms of bias, mean squared error, and coverage. Additionally, we used logistic regression to determine which data properties are most important for predicting the best-fitting model.
RESULTS: We first found that, regardless of the rate of zero inflation, there was little difference between the conventional negative binomial and its zero-inflated counterpart in terms of bias of the marginal treatment group coefficient. Second, even when the outcome was simulated from a zero-inflated distribution, a negative binomial model was favored above its ZI counterpart in terms of the Akaike Information Criterion. Third, the mean and skewness of the non-zero part of the data were stronger predictors of model preference than the percentage of zero counts. These results were not affected by the sample size, which ranged from 60 to 800.
CONCLUSIONS: We recommend that the rate of zero inflation and overdispersion in the outcome should not be the sole or main justification for choosing zero-inflated regression models. Investigators should also consider other data characteristics when choosing a model for count data. In addition, if the performance of the NB and ZINB regression models is reasonably comparable even with ZI outcomes, we advocate the use of the NB regression model due to its clear and straightforward interpretation of the results.
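To make the distinction between the two distributions in this abstract concrete, the following is a minimal sketch of their probability mass functions under the common mean and dispersion parameterization. It is not the authors' simulation code: the function names `nb_pmf` and `zinb_pmf` and the parameter values are illustrative assumptions, and no regression is fitted.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu and dispersion parameter size.

    Under this parameterization the variance is mu + mu**2 / size.
    Names and parameterization are illustrative assumptions.
    """
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, size, pi):
    """Zero-inflated NB pmf: a structural zero with probability pi,
    otherwise a draw from the NB distribution above."""
    base = (1.0 - pi) * nb_pmf(y, mu, size)
    return pi + base if y == 0 else base


# Hypothetical parameter values chosen only for illustration.
mu, size, pi = 2.0, 0.5, 0.3
p0_nb = nb_pmf(0, mu, size)          # zero probability under plain NB
p0_zinb = zinb_pmf(0, mu, size, pi)  # zero probability under ZINB
print(f"P(Y=0): NB={p0_nb:.3f}, ZINB={p0_zinb:.3f}")
```

With these illustrative values, the plain NB distribution already places a large share of its mass at zero when the dispersion parameter is small, which is one way to see why a high proportion of zeros does not by itself require a zero-inflated model.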
Affiliation(s)
- Kyung Hyun Lee
  - The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Claudia Pedroza
  - The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Elenir B C Avritscher
  - The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ricardo A Mosquera
  - The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Jon E Tyson
  - The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA
2
Lin BM, Cho H, Liu C, Roach J, Ribeiro AA, Divaris K, Wu D. BZINB Model-Based Pathway Analysis and Module Identification Facilitates Integration of Microbiome and Metabolome Data. Microorganisms 2023; 11:766. [PMID: 36985339 PMCID: PMC10056694 DOI: 10.3390/microorganisms11030766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 03/04/2023] [Accepted: 03/12/2023] [Indexed: 03/19/2023] [Open access]
Abstract
Integration of multi-omics data is a challenging but necessary step to advance our understanding of the biology underlying human health and disease processes. To date, investigations seeking to integrate multi-omics data (e.g., microbiome and metabolome) employ simple correlation-based network analyses; however, these methods are not always well-suited for microbiome analyses because they do not accommodate the excess zeros typically present in these data. In this paper, we introduce a bivariate zero-inflated negative binomial (BZINB) model-based network and module analysis method that addresses this limitation and improves microbiome-metabolome correlation-based model fitting by accommodating excess zeros. We use real and simulated data based on a multi-omics study of childhood oral health (ZOE 2.0; investigating early childhood dental caries, ECC) and find that the accuracy of the BZINB model-based correlation method is superior to that of Spearman's rank and Pearson correlations in terms of approximating the underlying relationships between microbial taxa and metabolites. The new method, BZINB-iMMPath, facilitates the construction of metabolite-species and species-species correlation networks using BZINB and identifies modules of correlated species by combining BZINB and similarity-based clustering. Perturbations in correlation networks and modules can be efficiently tested between groups (e.g., healthy and diseased study participants). Upon application of the new method to the ZOE 2.0 study microbiome-metabolome data, we find that several biologically relevant correlations of ECC-associated microbial taxa with carbohydrate metabolites differ between healthy and dental caries-affected participants.
In sum, we find that the BZINB model is a useful alternative to Spearman or Pearson correlations for estimating the underlying correlation of zero-inflated bivariate count data and thus is suitable for integrative analyses of multi-omics data such as those encountered in microbiome and metabolome studies.
Affiliation(s)
- Bridget M. Lin
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hunyong Cho
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Chuwen Liu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Jeff Roach
  - Research Computing, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Apoena Aguiar Ribeiro
  - Division of Diagnostic Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Kimon Divaris
  - Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Di Wu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Division of Oral and Craniofacial Health Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
3
Acharjee A, Singh U, Choudhury SP, Gkoutos GV. The diagnostic potential and barriers of microbiome based therapeutics. Diagnosis (Berl) 2022; 9:411-420. [PMID: 36000189 DOI: 10.1515/dx-2022-0052] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/28/2022] [Accepted: 08/03/2022] [Indexed: 02/07/2023]
Abstract
High-throughput technological innovations in the past decade have accelerated research into the trillions of commensal microbes in the gut. The 'omics' technologies used for microbiome analysis are constantly evolving, and large-scale datasets are being produced. Although much of the research is still in its early stages, specific microbial signatures have been associated with the promotion of cancer, as well as other diseases such as inflammatory bowel disease and neurodegenerative diseases. It has also been reported that the diversity of the gut microbiome influences the safety and efficacy of medicines. The availability and declining cost of sequencing have made RNA-based diagnostics more common in the microbiome field, necessitating improved data-analytical techniques that fully exploit the resulting rich biological datasets while accounting for their unique characteristics, such as their compositional nature, heterogeneity and sparsity. As a result, the gut microbiome is increasingly being demonstrated to be an important component of personalised medicine, since it not only plays a role in inter-individual variability in health and disease but also represents a potentially modifiable entity or feature that may be addressed by treatments in a personalised way. In this context, machine learning and artificial intelligence-based methods may be able to unveil new insights into biomedical analyses through the generation of models that may be used to predict category labels and continuous values. Furthermore, diagnostic aspects will add value in the identification of non-invasive markers in critical diseases such as cancer.
Affiliation(s)
- Animesh Acharjee
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University of Birmingham, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospital Birmingham, Birmingham, UK
  - MRC Health Data Research UK (HDR UK), Birmingham, UK
- Utpreksha Singh
  - Department of Health and Life Sciences, Coventry University, Coventry, UK
- Georgios V Gkoutos
  - Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University of Birmingham, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospital Birmingham, Birmingham, UK
  - MRC Health Data Research UK (HDR UK), Birmingham, UK
  - NIHR Experimental Cancer Medicine Centre, Birmingham, UK
4
Cappellato M, Baruzzo G, Di Camillo B. Investigating differential abundance methods in microbiome data: A benchmark study. PLoS Comput Biol 2022; 18:e1010467. [PMID: 36074761 PMCID: PMC9488820 DOI: 10.1371/journal.pcbi.1010467] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/23/2021] [Revised: 09/20/2022] [Accepted: 08/03/2022] [Indexed: 11/19/2022] [Open access]
Abstract
The development of increasingly efficient and cost-effective high-throughput DNA sequencing techniques has enhanced the possibility of studying complex microbial systems. Recently, researchers have shown great interest in studying the microorganisms that characterise different ecological niches. Differential abundance analysis aims to find the differences in the abundance of each taxon between two classes of subjects or samples, assigning a significance value to each comparison. Several bioinformatic methods have been specifically developed, taking into account the challenges of microbiome data, such as sparsity, differences in sequencing depth between samples, and compositionality. Differential abundance analysis has led to important conclusions in different fields, from health to the environment. However, the lack of a known biological truth makes it difficult to validate the results obtained. In this work we exploit metaSPARSim, a microbial sequencing count data simulator, to simulate data with differential abundance features between experimental groups. We perform a complete comparison of recently developed and established methods on a common benchmark, paying particular attention to the reliability of both the simulated scenarios and the evaluation metrics. The performance overview includes the investigation of numerous scenarios, studying the effect on the methods' results of the main covariates, such as sample size, percentage of differentially abundant features, sequencing depth, feature variability, normalisation approach and ecological niche. Mainly, we find that methods show good control of the type I error and, generally, also of the false discovery rate at high sample size, while recall seems to depend on the dataset and sample size.
The microbiota is the set of microorganisms that characterize an ecological environment or niche. Several studies have shown that the microbiota is involved in various biological mechanisms that affect the health or balance of the host organism or the ecosystem. New discoveries and insights have been possible thanks to increasingly efficient sequencing technologies together with the development of bioinformatic computational methods. One of the most interesting analyses in this landscape is the identification of microorganisms that show significantly different abundances when two groups of subjects are analysed. Although many computational methods have been developed, it is still unclear which one has the best performance. Therefore, we exploited a simulator of microbiome data to build a simulation framework that allowed us to carry out an extensive benchmarking of the known tools of differential abundance analysis. Our work is not only a starting point to guide analysts in the choice of tools, but also a first step towards a robust, reliable and fair simulation framework.
Affiliation(s)
- Marco Cappellato
  - Department of Information Engineering, University of Padova, Padova, Italy
- Giacomo Baruzzo
  - Department of Information Engineering, University of Padova, Padova, Italy
- Barbara Di Camillo
  - Department of Information Engineering, University of Padova, Padova, Italy
  - Department of Comparative Biomedicine and Food Science, University of Padova, Padova, Italy
5
Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, Tickle TL, Lu Y, Carmody RN, Franzosa EA, Janson L, Huttenhower C. A statistical model for describing and simulating microbial community profiles. PLoS Comput Biol 2021; 17:e1008913. [PMID: 34516542 PMCID: PMC8491899 DOI: 10.1371/journal.pcbi.1008913] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/23/2021] [Revised: 10/05/2021] [Accepted: 08/19/2021] [Indexed: 12/26/2022] [Open access]
Abstract
Many methods have been developed for statistical analysis of microbial community profiles, but due to the complex nature of typical microbiome measurements (e.g. sparsity, zero-inflation, non-independence, and compositionality) and of the associated underlying biology, it is difficult to compare or evaluate such methods within a single systematic framework. To address this challenge, we developed SparseDOSSA (Sparse Data Observations for the Simulation of Synthetic Abundances): a statistical model of microbial ecological population structure, which can be used to parameterize real-world microbial community profiles and to simulate new, realistic profiles of known structure for methods evaluation. Specifically, SparseDOSSA's model captures marginal microbial feature abundances as a zero-inflated log-normal distribution, with additional model components for absolute cell counts and the sequence read generation process, microbe-microbe, and microbe-environment interactions. Together, these allow fully known covariance structure between synthetic features (i.e. "taxa") or between features and "phenotypes" to be simulated for method benchmarking. Here, we demonstrate SparseDOSSA's performance for 1) accurately modeling human-associated microbial population profiles; 2) generating synthetic communities with controlled population and ecological structures; 3) spiking-in true positive synthetic associations to benchmark analysis methods; and 4) recapitulating an end-to-end mouse microbiome feeding experiment. Together, these represent the most common analysis types in assessment of real microbial community environmental and epidemiological statistics, thus demonstrating SparseDOSSA's utility as a general-purpose aid for modeling communities and evaluating quantitative methods. An open-source implementation is available at http://huttenhower.sph.harvard.edu/sparsedossa2.
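As a rough illustration of the zero-inflated log-normal marginal mentioned in this abstract, the sketch below draws values that are zero with a fixed probability and log-normal otherwise. It is not SparseDOSSA's implementation and ignores the other model components the abstract lists (cell counts, read generation, feature interactions); the function name `sample_zi_lognormal` and all parameter values are hypothetical.

```python
import random


def sample_zi_lognormal(n, pi_zero, mu, sigma, seed=0):
    """Draw n values from a zero-inflated log-normal distribution.

    Each value is 0 with probability pi_zero; otherwise it is
    exp(N(mu, sigma**2)). All names here are illustrative assumptions.
    """
    rng = random.Random(seed)  # local generator keeps the draws reproducible
    return [
        0.0 if rng.random() < pi_zero else rng.lognormvariate(mu, sigma)
        for _ in range(n)
    ]


# Hypothetical settings: 60% structural zeros, standard log-normal otherwise.
draws = sample_zi_lognormal(10_000, pi_zero=0.6, mu=0.0, sigma=1.0)
zero_fraction = sum(1 for x in draws if x == 0.0) / len(draws)
print(f"observed zero fraction: {zero_fraction:.3f}")
```

Because the log-normal part is strictly positive, every exact zero in the sample comes from the inflation component, so the observed zero fraction estimates `pi_zero` directly.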
Affiliation(s)
- Siyuan Ma
  - Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Boyu Ren
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Himel Mallick
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Yo Sup Moon
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Emma Schwager
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Sagun Maharjan
  - Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Timothy L. Tickle
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Yiren Lu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Rachel N. Carmody
  - Department of Human Evolutionary Biology, Harvard University, Cambridge, Massachusetts, United States of America
- Eric A. Franzosa
  - Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
- Lucas Janson
  - Department of Statistics, Harvard University, Cambridge, Massachusetts, United States of America
- Curtis Huttenhower
  - Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Broad Institute, Cambridge, Massachusetts, United States of America
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America