1
Li S, Li R, Lee JR, Zhao N, Ling W. ZINQ-L: a zero-inflated quantile approach for differential abundance analysis of longitudinal microbiome data. Front Genet 2025; 15:1494401. [PMID: 39944355 PMCID: PMC11814158 DOI: 10.3389/fgene.2024.1494401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Accepted: 12/10/2024] [Indexed: 02/16/2025]
Abstract
Background Identifying bacterial taxa associated with disease phenotypes or clinical treatments over time is critical for understanding the underlying biological mechanism. Association testing for microbiome data is already challenging due to its complex distribution that involves sparsity, over-dispersion, heavy tails, etc. The longitudinal nature of the data adds another layer of complexity - one needs to account for the within-subject correlations to avoid biased results. Existing longitudinal differential abundance approaches usually depend on strong parametric assumptions, such as zero-inflated normal or negative binomial. However, the complex microbiome data frequently violate these distributional assumptions, leading to inflated false discovery rates. In addition, the existing methods are mostly mean-based, unable to identify heterogeneous associations such as tail events or subgroup effects, which could be important biomedical signals. Methods We propose a zero-inflated quantile approach for longitudinal (ZINQ-L) microbiome differential abundance test. A mixed-effects quantile rank-score-based test was proposed for hypothesis testing, which consists of a test in mixed-effects logistic model for the presence-absence status of the investigated taxon, and a series of mixed-effects quantile rank-score tests adjusted for zero inflation given its presence. As a regression method with minimal distributional assumptions, it is robust to the complex microbiome data, controlling false discovery rate, and is flexible to adjust for important covariates. Its comprehensive examination of the abundance distribution enables the identification of heterogeneous associations, improving the testing power. Results Extensive simulation studies and an application to a real kidney transplant microbiome study demonstrate the improved power of ZINQ-L in detecting true signals while controlling false discovery rates. 
Conclusion ZINQ-L is a zero-inflated quantile-based approach for detecting individual taxa associated with outcomes or exposures in longitudinal microbiome studies, providing a robust and powerful option to improve and complement the existing methods in the field.
Affiliation(s)
- Shuai Li
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
- Runzhe Li
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
- John R. Lee
  - Division of Nephrology and Hypertension, Department of Medicine, Weill Medical College of Cornell University, New York, NY, United States
  - Department of Transplantation Medicine, New York Presbyterian Hospital–Weill Cornell Medical Center, New York, NY, United States
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
- Wodan Ling
  - Division of Biostatistics, Department of Population Health Sciences, Weill Medical College of Cornell University, New York, NY, United States
2
Casto AM, Song H, Xie H, Selke S, Roychoudhury P, Wu MC, Wald A, Greninger AL, Johnston C. Viral Genomic Variation and the Severity of Genital Herpes Simplex Virus-2 Infection as Quantified by Shedding Rate: A Viral Genome-Wide Association Study. J Infect Dis 2024; 230:1357-1366. [PMID: 38805234 PMCID: PMC11646587 DOI: 10.1093/infdis/jiae283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Revised: 05/16/2024] [Accepted: 05/23/2024] [Indexed: 05/29/2024]
Abstract
BACKGROUND The clinical severity of genital herpes simplex virus-2 (HSV-2) infection varies widely among infected persons with some experiencing frequent genital lesions while others are asymptomatic. The viral genital shedding rate is closely associated with, and has been established as, a surrogate marker of clinical severity. METHODS To assess the relationship between viral genetics and shedding, we assembled a set of 145 persons who had the severity of their genital herpes quantified through determination of their HSV genital shedding rate. An HSV-2 sample from each person was sequenced and biallelic variants among these genomes were identified. RESULTS We found no association between metrics of genome-wide variation in HSV-2 and shedding rate. A viral genome-wide association study identified the minor alleles of 3 individual unlinked variants as significantly associated with higher shedding rate (P < 8.4 × 10⁻⁵): C44973T (A512T), a nonsynonymous variant in UL22 (glycoprotein H); A74534G, a synonymous variant in UL36 (large tegument protein); and T119283C, an intergenic variant. We also found an association between the total number of minor alleles for the significant variants and shedding rate (P = 6.6 × 10⁻⁷). CONCLUSIONS These results add to a growing body of literature for HSV suggesting a connection between viral genetic variation and clinically important phenotypes of infection.
Affiliation(s)
- Amanda M Casto
  - Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington, Seattle, Washington, USA
  - Vaccine and Infectious Diseases Division, Fred Hutch Cancer Center, Seattle, Washington, USA
- Hoseung Song
  - Division of Industrial and Systems Engineering, Graduate School of Data Science, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
- Hong Xie
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
- Stacy Selke
  - Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington, Seattle, Washington, USA
- Pavitra Roychoudhury
  - Vaccine and Infectious Diseases Division, Fred Hutch Cancer Center, Seattle, Washington, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
- Michael C Wu
  - Public Health Sciences Division, Fred Hutch Cancer Center, Seattle, Washington, USA
- Anna Wald
  - Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington, Seattle, Washington, USA
  - Vaccine and Infectious Diseases Division, Fred Hutch Cancer Center, Seattle, Washington, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
  - Department of Epidemiology, University of Washington, Seattle, Washington, USA
- Alexander L Greninger
  - Vaccine and Infectious Diseases Division, Fred Hutch Cancer Center, Seattle, Washington, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
- Christine Johnston
  - Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington, Seattle, Washington, USA
  - Vaccine and Infectious Diseases Division, Fred Hutch Cancer Center, Seattle, Washington, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA
3
Feng C, Jia H, Wang H, Wang J, Lin M, Hu X, Yu C, Song H, Wang L. MicroNet-MIMRF: a microbial network inference approach based on mutual information and Markov random fields. Bioinformatics Advances 2024; 4:vbae167. [PMID: 39526038 PMCID: PMC11549015 DOI: 10.1093/bioadv/vbae167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Revised: 10/19/2024] [Accepted: 10/25/2024] [Indexed: 11/16/2024]
Abstract
Motivation The human microbiome comprises complex associations and communication networks among microbial communities, which are crucial for maintaining health. The construction of microbial networks is vital for elucidating these associations. However, existing microbial network inference methods cannot solve the issues of zero-inflation and non-linear associations. Novel methods are therefore needed to improve the accuracy of microbial network inference. Results In this study, we introduce the Microbial Network based on Mutual Information and Markov Random Fields (MicroNet-MIMRF) as a novel approach for inferring microbial networks. Abundance data of microbes are modeled through the zero-inflated Poisson distribution, and the discrete matrix is estimated for further calculation. Markov random fields based on mutual information are used to construct accurate microbial networks. MicroNet-MIMRF excels at estimating pairwise associations between microbes, effectively addressing zero-inflation and non-linear associations in microbial abundance data. It outperforms commonly used techniques in simulation experiments, achieving area under the curve values exceeding 0.75 for all parameters. A case study on inflammatory bowel disease data further demonstrates the method's ability to identify insightful associations. In conclusion, MicroNet-MIMRF is a powerful tool for microbial network inference that handles the biases caused by zero-inflation and overestimation of associations. Availability and implementation MicroNet-MIMRF is provided at https://github.com/Fionabiostats/MicroNet-MIMRF.
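The zero-inflated Poisson distribution named in this abstract can be illustrated with a minimal Python sketch. It mixes a point mass at zero with an ordinary Poisson count. The function name `zip_pmf` and the parameters `lam` and `pi_zero` are illustrative assumptions; they are not taken from the MicroNet-MIMRF code, which may parameterize the model differently.

```python
import math


def zip_pmf(k: int, lam: float, pi_zero: float) -> float:
    """P(X = k) under a zero-inflated Poisson.

    pi_zero is the assumed probability of a structural (excess) zero;
    lam is the Poisson rate for the remaining observations.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    structural_zero = pi_zero if k == 0 else 0.0
    return structural_zero + (1.0 - pi_zero) * poisson


# Probability of a zero count when 40% of observations are structural zeros.
p_zero = zip_pmf(0, lam=2.0, pi_zero=0.4)

# Summing over a long enough range of counts should give roughly one.
total = sum(zip_pmf(k, lam=2.0, pi_zero=0.4) for k in range(60))
```

The zero probability exceeds what a plain Poisson with the same rate would give, which is the excess-zero behaviour the abstract refers to.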
Affiliation(s)
- Chenqionglu Feng
  - Department of Epidemiology and Health Statistics, School of Public Health, China Medical University, Shenyang 110122, China
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Huiqun Jia
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Hui Wang
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Jiaojiao Wang
  - The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation Chinese Academy of Sciences, Beijing 100190, China
- Mengxuan Lin
  - The Academy of Military Medical Sciences, Academy of Military Science of Chinese People’s Liberation Army, Beijing 100071, China
- Xiaoyan Hu
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Chenjing Yu
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Hongbin Song
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
- Ligui Wang
  - Department of Infectious Disease Prevention and Control, Chinese PLA Center for Disease Control and Prevention, Beijing 100071, China
4
Wirbel J, Essex M, Forslund SK, Zeller G. A realistic benchmark for differential abundance testing and confounder adjustment in human microbiome studies. Genome Biol 2024; 25:247. [PMID: 39322959 PMCID: PMC11423519 DOI: 10.1186/s13059-024-03390-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Accepted: 09/06/2024] [Indexed: 09/27/2024]
Abstract
BACKGROUND In microbiome disease association studies, it is a fundamental task to test which microbes differ in their abundance between groups. Yet, consensus on suitable or optimal statistical methods for differential abundance testing is lacking, and it remains unexplored how these cope with confounding. Previous differential abundance benchmarks relying on simulated datasets did not quantitatively evaluate the similarity to real data, which undermines their recommendations. RESULTS Our simulation framework implants calibrated signals into real taxonomic profiles, including signals mimicking confounders. Using several whole meta-genome and 16S rRNA gene amplicon datasets, we validate that our simulated data resembles real data from disease association studies much more than in previous benchmarks. With extensively parametrized simulations, we benchmark the performance of nineteen differential abundance methods and further evaluate the best ones on confounded simulations. Only classic statistical methods (linear models, the Wilcoxon test, t-test), limma, and fastANCOM properly control false discoveries at relatively high sensitivity. When additionally considering confounders, these issues are exacerbated, but we find that adjusted differential abundance testing can effectively mitigate them. In a large cardiometabolic disease dataset, we showcase that failure to account for covariates such as medication causes spurious association in real-world applications. CONCLUSIONS Tight error control is critical for microbiome association studies. The unsatisfactory performance of many differential abundance methods and the persistent danger of unchecked confounding suggest these contribute to a lack of reproducibility among such studies. We have open-sourced our simulation and benchmarking software to foster a much-needed consolidation of statistical methodology for microbiome research.
Affiliation(s)
- Jakob Wirbel
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Morgan Essex
  - Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany
  - Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
  - Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany
- Sofia Kirke Forslund
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
  - Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany
  - Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
  - Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Berlin, Berlin, Germany
- Georg Zeller
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
  - Center for Infectious Diseases (LUCID), Leiden University, Leiden University Medical Center (LUMC), Leiden, Netherlands
  - Center for Microbiome Analyses and Therapeutics (CMAT), Leiden University Medical Center, Leiden, Netherlands
5
He M, Zhao N, Satten GA. MIDASim: a fast and simple simulator for realistic microbiome data. Microbiome 2024; 12:135. [PMID: 39039570 PMCID: PMC11264979 DOI: 10.1186/s40168-024-01822-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Accepted: 04/22/2024] [Indexed: 07/24/2024]
Abstract
BACKGROUND Advances in sequencing technology have led to the discovery of associations between the human microbiota and many diseases, conditions, and traits. With the increasing availability of microbiome data, many statistical methods have been developed for studying these associations. The growing number of newly developed methods highlights the need for simple, rapid, and reliable methods to simulate realistic microbiome data, which is essential for validating and evaluating the performance of these methods. However, generating realistic microbiome data is challenging due to the complex nature of microbiome data, which feature correlation between taxa, sparsity, overdispersion, and compositionality. Current methods for simulating microbiome data are deficient in their ability to capture these important features of microbiome data, or can require exorbitant computational time. METHODS We develop MIDASim (MIcrobiome DAta Simulator), a fast and simple approach for simulating realistic microbiome data that reproduces the distributional and correlation structure of a template microbiome dataset. MIDASim is a two-step approach. The first step generates correlated binary indicators that represent the presence-absence status of all taxa, and the second step generates relative abundances and counts for the taxa that are considered to be present in step 1, utilizing a Gaussian copula to account for the taxon-taxon correlations. In the second step, MIDASim can operate in both a nonparametric and parametric mode. In the nonparametric mode, the Gaussian copula uses the empirical distribution of relative abundances for the marginal distributions. In the parametric mode, a generalized gamma distribution is used in place of the empirical distribution. RESULTS We demonstrate improved performance of MIDASim relative to other existing methods using gut and vaginal data. MIDASim showed superior performance by PERMANOVA and in terms of alpha diversity and beta dispersion in either parametric or nonparametric mode. We also show how MIDASim in parametric mode can be used to assess the performance of methods for finding differentially abundant taxa in a compositional model. CONCLUSIONS MIDASim is easy to implement, flexible and suitable for most microbiome data simulation situations. MIDASim has three major advantages. First, MIDASim performs better in reproducing the distributional features of real data compared to other methods, at both the presence-absence level and the relative-abundance level. MIDASim-simulated data are more similar to the template data than competing methods, as quantified using a variety of measures. Second, MIDASim makes few distributional assumptions for the relative abundances, and thus can easily accommodate complex distributional features in real data. Third, MIDASim is computationally efficient and can be used to simulate large microbiome datasets.
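The two-step idea described in the METHODS portion can be sketched in Python for just two taxa. This is a heavily simplified illustration under stated assumptions, not the MIDASim implementation: the function names (`correlated_uniforms`, `simulate_two_taxa`), the single correlation parameter `rho`, and the use of thresholded latent normals for the presence-absence step are all hypothetical choices made here for brevity. Step 2 uses the nonparametric idea of mapping copula uniforms onto the empirical distribution of the template's nonzero values.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def correlated_uniforms(rng, rho):
    """Draw two uniforms linked by a Gaussian copula with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)


def simulate_two_taxa(template_a, template_b, rho, n_samples, seed=0):
    """Simulate relative abundances for two taxa from template columns."""
    rng = random.Random(seed)
    templates = (template_a, template_b)
    # Fraction of zeros and sorted nonzero values in each template taxon.
    zero_frac = [sum(v == 0 for v in t) / len(t) for t in templates]
    nonzero = [sorted(v for v in t if v > 0) for t in templates]
    simulated = []
    for _ in range(n_samples):
        # Step 1: correlated presence-absence indicators.
        presence_u = correlated_uniforms(rng, rho)
        # Step 2: correlated abundances for the taxa marked present.
        abundance_u = correlated_uniforms(rng, rho)
        row = []
        for j in range(2):
            if presence_u[j] > zero_frac[j]:
                values = nonzero[j]
                # Empirical quantile lookup on the nonzero template values.
                index = min(int(abundance_u[j] * len(values)), len(values) - 1)
                row.append(values[index])
            else:
                row.append(0.0)
        simulated.append(row)
    return simulated


sim = simulate_two_taxa([0, 0, 0.1, 0.3], [0.2, 0, 0.5, 0.4], rho=0.5, n_samples=50, seed=1)
```

Each simulated value is either zero or one of the nonzero values seen in the template, so the marginal distributions follow the template while `rho` carries the taxon-taxon dependence. The sketch returns relative abundances only and does not generate counts.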
Affiliation(s)
- Mengyu He
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, 30329, USA
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA
- Glen A Satten
  - Department of Gynecology and Obstetrics, Emory University, Atlanta, GA, 30329, USA
6
Chan LS, Li G. Zero is not absence: censoring-based differential abundance analysis for microbiome data. Bioinformatics 2024; 40:btae071. [PMID: 38331411 PMCID: PMC10885211 DOI: 10.1093/bioinformatics/btae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 01/29/2024] [Accepted: 02/05/2024] [Indexed: 02/10/2024]
Abstract
MOTIVATION Microbiome data analysis faces the challenge of sparsity, with many entries recorded as zeros. In differential abundance analysis, the presence of excessive zeros in data violates distributional assumptions and creates ties, leading to an increased risk of type I errors and reduced statistical power. RESULTS We developed a novel normalization method, called censoring-based analysis of microbiome proportions (CAMP), for microbiome data by treating zeros as censored observations, transforming raw read counts into tie-free time-to-event-like data. This enables the use of survival analysis techniques, like the Cox proportional hazards model, for differential abundance analysis. Extensive simulations demonstrate that CAMP achieves proper type I error control and high power. Applying CAMP to a human gut microbiome dataset, we identify 60 new differentially abundant taxa across geographic locations, showcasing its usefulness. CAMP overcomes sparsity challenges, enabling improved statistical analysis and providing valuable insights into microbiome data in various contexts. AVAILABILITY AND IMPLEMENTATION The R package is available at https://github.com/lapsumchan/CAMP.
Affiliation(s)
- Lap Sum Chan
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
- Gen Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
7
Kodalci L, Thas O. Simple and flexible sign and rank-based methods for testing for differential abundance in microbiome studies. PLoS One 2023; 18:e0292055. [PMID: 37751452 PMCID: PMC10522045 DOI: 10.1371/journal.pone.0292055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Accepted: 09/12/2023] [Indexed: 09/28/2023]
Abstract
Microbiome data obtained with amplicon sequencing are considered as compositional data. It has been argued that these data can be analysed after appropriate transformation to log-ratios, but ratios and logarithms cause problems with the many zeroes in typical microbiome experiments. We demonstrate that some well chosen sign and rank transformations also allow for valid inference with compositional data, and we show how logistic regression and probabilistic index models can be used for testing for differential abundance, while inheriting the flexibility of a statistical modelling framework. The results of a simulation study demonstrate that the new methods perform better than most other methods, and that it is comparable with ANCOM-BC. These methods are implemented in an R-package 'signtrans' and can be installed from Github (https://github.com/lucp9827/signtrans).
Affiliation(s)
- Leyla Kodalci
  - Data Science Institute and I-BioStat, Hasselt University, Diepenbeek, Belgium
- Olivier Thas
  - Data Science Institute and I-BioStat, Hasselt University, Diepenbeek, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - National Institute for Applied Statistics Research Australia (NIASRA), University of Wollongong, Wollongong, New South Wales, Australia
8
Song H, Ling W, Zhao N, Plantinga AM, Broedlow CA, Klatt NR, Hensley-McBain T, Wu MC. Accommodating multiple potential normalizations in microbiome associations studies. BMC Bioinformatics 2023; 24:22. [PMID: 36658484 PMCID: PMC9850542 DOI: 10.1186/s12859-023-05147-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 01/12/2023] [Indexed: 01/20/2023]
Abstract
BACKGROUND Microbial communities are known to be closely related to many diseases, such as obesity and HIV, and it is of interest to identify differentially abundant microbial species between two or more environments. Since the abundances or counts of microbial species usually have different scales and suffer from zero-inflation or over-dispersion, normalization is a critical step before conducting differential abundance analysis. Several normalization approaches have been proposed, but it is difficult to optimize the characterization of the true relationship between taxa and interesting outcomes. RESULTS To avoid the challenge of picking an optimal normalization and accommodate the advantages of several normalization strategies, we propose an omnibus approach. Our approach is based on a Cauchy combination test, which is flexible and powerful by aggregating individual p values. We also consider a truncated test statistic to prevent substantial power loss. We experiment with a basic linear regression model as well as recently proposed powerful association tests for microbiome data and compare the performance of the omnibus approach with individual normalization approaches. Experimental results show that, regardless of simulation settings, the new approach exhibits power that is close to the best normalization strategy, while controlling the type I error well. CONCLUSIONS The proposed omnibus test releases researchers from choosing among various normalization methods. It is an aggregated method whose power is close to that of the underlying optimal normalization, which would otherwise require tedious trial and error to identify. While the power may not exceed the best normalization, it is always much better than using a poor choice of normalization.
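The aggregation step mentioned in this abstract can be sketched with the standard form of the Cauchy combination statistic, in which each p value is mapped to a Cauchy-distributed quantity and the weighted sum is converted back to a p value. This is a minimal Python sketch under assumptions: the function name `cauchy_combination` and the equal default weights are illustrative, and the truncated test statistic that the authors also consider is not shown.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p values with the standard Cauchy combination formula.

    Weights are assumed to be nonnegative and to sum to one;
    equal weights are used when none are given.
    """
    if not p_values:
        raise ValueError("at least one p value is required")
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    # Each p value becomes tan((0.5 - p) * pi), which is standard Cauchy under the null.
    statistic = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy at the combined statistic.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p values for one taxon under three different normalizations.
combined = cauchy_combination([0.01, 0.5, 0.9])
```

In this usage the three inputs stand for the same association test run after three normalizations, so a single small p value can drive the combined result without the analyst choosing a normalization in advance.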
Affiliation(s)
- Hoseung Song
  - Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA USA
- Wodan Ling
  - Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA USA
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD USA
- Anna M. Plantinga
  - Department of Mathematics and Statistics, Williams College, Williamstown, MA USA
- Courtney A. Broedlow
  - Division of Surgical Outcomes and Precision Medicine Research, Department of Surgery, University of Minnesota School of Medicine, Minneapolis, MN USA
- Nichole R. Klatt
  - Division of Surgical Outcomes and Precision Medicine Research, Department of Surgery, University of Minnesota School of Medicine, Minneapolis, MN USA
- Michael C. Wu
  - Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA USA
9
Li M, Liu J, Zhu J, Wang H, Sun C, Gao NL, Zhao XM, Chen WH. Performance of Gut Microbiome as an Independent Diagnostic Tool for 20 Diseases: Cross-Cohort Validation of Machine-Learning Classifiers. Gut Microbes 2023; 15:2205386. [PMID: 37140125 PMCID: PMC10161951 DOI: 10.1080/19490976.2023.2205386] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 05/05/2023]
Abstract
Cross-cohort validation is essential for gut-microbiome-based disease stratification but was only performed for limited diseases. Here, we systematically evaluated the cross-cohort performance of gut microbiome-based machine-learning classifiers for 20 diseases. Using single-cohort classifiers, we obtained high predictive accuracies in intra-cohort validation (~0.77 AUC), but low accuracies in cross-cohort validation, except the intestinal diseases (~0.73 AUC). We then built combined-cohort classifiers trained on samples combined from multiple cohorts to improve the validation of non-intestinal diseases, and estimated the required sample size to achieve validation accuracies of >0.7. In addition, we observed higher validation performance for classifiers using metagenomic data than 16S amplicon data in intestinal diseases. We further quantified the cross-cohort marker consistency using a Marker Similarity Index and observed similar trends. Together, our results supported the gut microbiome as an independent diagnostic tool for intestinal diseases and revealed strategies to improve cross-cohort performance based on identified determinants of consistent cross-cohort gut microbiome alterations.
Affiliation(s)
- Min Li
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Jinxin Liu
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Jiaying Zhu
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Huarui Wang
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Na L Gao
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Xing-Ming Zhao
  - Department of Neurology, Zhongshan Hospital, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  - College of Life Science, Henan Normal University, Xinxiang, China
  - Institution of Medical Artificial Intelligence, Binzhou Medical University, Yantai, China
10
Li Q, Vehik K, Li C, Triplett E, Roesch L, Hu YJ, Krischer J. A robust and transformation-free joint model with matching and regularization for metagenomic trajectory and disease onset. BMC Genomics 2022; 23:661. [PMID: 36123651 PMCID: PMC9484160 DOI: 10.1186/s12864-022-08890-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2022] [Accepted: 09/14/2022] [Indexed: 11/22/2022]
Abstract
BACKGROUND To identify operational taxonomic units (OTUs) signaling disease onset in an observational study, a powerful strategy was selecting participants by matched sets and profiling temporal metagenomes, followed by trajectory analysis. Existing trajectory analyses modeled individual OTUs or the microbial community without adjusting for the within-community correlation and matched-set-specific latent factors. RESULTS We proposed a joint model with matching and regularization (JMR) to detect OTU-specific trajectories predictive of host disease status. The between- and within-matched-sets heterogeneity in OTU relative abundance and disease risk were modeled by nested random effects. The inherent negative correlation in microbiota composition was adjusted by incorporating and regularizing the top-correlated taxa as longitudinal covariates, pre-selected by Bray-Curtis distance and elastic net regression. We designed a simulation pipeline to generate true biomarkers for disease onset and the pseudo biomarkers caused by compositionality. We demonstrated that JMR effectively controlled the false discovery and pseudo biomarkers in a simulation study generating temporal high-dimensional metagenomic counts with random intercept or slope. Application of the competing methods in the simulated data and the TEDDY cohort showed that JMR outperformed the other methods and identified important taxa in infants' fecal samples with dynamics preceding host disease status. CONCLUSION Our method JMR is a robust framework that models taxon-specific trajectory and host disease status for matched participants without transformation of relative abundance, improving the power of detecting disease-associated microbial features in certain scenarios. JMR is available in R package mtradeR at https://github.com/qianli10000/mtradeR.
Affiliation(s)
- Qian Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, 38105, TN, USA
- Kendra Vehik
- Health Informatics Institute, University of South Florida, Tampa, 33620, FL, USA
- Cai Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, 38105, TN, USA
- Eric Triplett
- Department of Microbiology and Cell Science, University of Florida, Gainesville, 32611, FL, USA
- Luiz Roesch
- Department of Microbiology and Cell Science, University of Florida, Gainesville, 32611, FL, USA
- Yi-Juan Hu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, 30322, GA, USA
- Jeffrey Krischer
- Health Informatics Institute, University of South Florida, Tampa, 33620, FL, USA
|
11
|
Yang L, Chen J. A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions. Microbiome 2022; 10:130. [PMID: 35986393 PMCID: PMC9392415 DOI: 10.1186/s40168-022-01320-0] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 09/09/2021] [Accepted: 07/04/2022] [Indexed: 06/12/2023]
Abstract
BACKGROUND Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade addressing the special characteristics of microbiome data such as zero inflation and compositional effects. Disturbingly, different DAA tools can sometimes produce quite discordant results, opening up the possibility of cherry-picking the tool in favor of one's own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation, which covers as many biologically relevant scenarios as possible, is critically needed. RESULTS We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects, such as ANCOM-BC, ALDEx2, metagenomeSeq (fitFeatureModel), and DACOMP, did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power has been observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strength of the existing DAA methods. We show that ZicoSeq generally controlled for false positives across settings, and the power was among the highest. Application of DAA methods to a large collection of real datasets revealed a pattern similar to that observed in simulation studies. CONCLUSIONS Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.
Affiliation(s)
- Lu Yang
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Jun Chen
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
|
12
|
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 PMCID: PMC9064115 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022] Open
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also raises the question of whether it is the presence rather than the abundance of particular taxa that is relevant for discrimination purposes, an aspect that has so far been overlooked in the literature. In this paper, we aim to fill this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa that is important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identify differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies.
The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning approaches have been a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has usually been studied by considering taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is the presence rather than the relative abundance of microbial taxa that is relevant to maximize classification accuracy. This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes coming from 25 case-control studies, in which original relative abundance data are degraded to presence/absence profiles. Findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
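The step of degrading relative abundance data to presence/absence profiles can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `to_presence_absence`, the zero threshold, and the toy profiles are assumptions made here for illustration only.

```python
def to_presence_absence(profiles, threshold=0.0):
    """Map each relative abundance to 1 (present) or 0 (absent).

    `profiles` is a list of samples, each a list of per-taxon relative
    abundances. A taxon counts as present when its abundance is strictly
    greater than `threshold` (assumed to be 0 here, not taken from the paper).
    """
    return [[1 if value > threshold else 0 for value in sample]
            for sample in profiles]


# Toy data (hypothetical): 3 samples x 4 taxa, each row sums to 1.
profiles = [
    [0.70, 0.00, 0.30, 0.00],
    [0.00, 0.55, 0.45, 0.00],
    [0.10, 0.00, 0.00, 0.90],
]

presence = to_presence_absence(profiles)
for row in presence:
    print(row)
```

The binary matrix keeps the samples-by-taxa shape of the input, so it could be passed to a classifier in place of the abundance matrix for a side-by-side comparison.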
|