1
Geistlinger L, Mirzayi C, Zohra F, Azhar R, Elsafoury S, Grieve C, Wokaty J, Gamboa-Tuz SD, Sengupta P, Hecht I, Ravikrishnan A, Gonçalves RS, Franzosa E, Raman K, Carey V, Dowd JB, Jones HE, Davis S, Segata N, Huttenhower C, Waldron L. BugSigDB captures patterns of differential abundance across a broad range of host-associated microbial signatures. Nat Biotechnol 2024; 42:790-802. [PMID: 37697152 PMCID: PMC11098749 DOI: 10.1038/s41587-023-01872-y] [Received: 10/24/2022] [Accepted: 06/20/2023] [Indexed: 09/13/2023]
Abstract
The literature of human and other host-associated microbiome studies is expanding rapidly, but systematic comparisons among published results of host-associated microbiome signatures of differential abundance remain difficult. We present BugSigDB, a community-editable database of manually curated microbial signatures from published differential abundance studies accompanied by information on study geography, health outcomes, host body site and experimental, epidemiological and statistical methods using controlled vocabulary. The initial release of the database contains >2,500 manually curated signatures from >600 published studies on three host species, enabling high-throughput analysis of signature similarity, taxon enrichment, co-occurrence and coexclusion and consensus signatures. These data allow assessment of microbiome differential abundance within and across experimental conditions, environments or body sites. Database-wide analysis reveals experimental conditions with the highest level of consistency in signatures reported by independent studies and identifies commonalities among disease-associated signatures, including frequent introgression of oral pathobionts into the gut.
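As a rough illustration of what "signature similarity" can mean when signatures are treated as sets of taxa, the sketch below computes a Jaccard index between two signatures. The abstract does not say which similarity measure BugSigDB uses, so the Jaccard index is an assumption here, and the function name and taxon sets are hypothetical examples rather than database records.

```python
def jaccard(signature_a, signature_b):
    """Jaccard index between two signatures, each given as a collection of taxon names."""
    a, b = set(signature_a), set(signature_b)
    union = a | b
    # Two empty signatures share nothing; return 0.0 instead of dividing by zero.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical signatures (illustrative only, not taken from BugSigDB).
sig_a = {"Fusobacterium nucleatum", "Porphyromonas gingivalis", "Streptococcus salivarius"}
sig_b = {"Fusobacterium nucleatum", "Streptococcus salivarius", "Veillonella parvula"}

similarity = jaccard(sig_a, sig_b)  # 2 shared taxa out of 4 distinct taxa
```

A set-based measure like this ignores effect sizes and the direction of abundance change, so in practice one would likely compare signatures of increased and decreased abundance separately.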
Affiliation(s)
- Ludwig Geistlinger
- Center for Computational Biomedicine, Harvard Medical School, Boston, MA, USA
- Chloe Mirzayi
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Fatima Zohra
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Rimsha Azhar
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Shaimaa Elsafoury
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Clare Grieve
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Jennifer Wokaty
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Samuel David Gamboa-Tuz
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Pratyay Sengupta
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, Indian Institute of Technology (IIT) Madras, Chennai, India
- Centre for Integrative Biology and Systems mEdicine (IBSE), Indian Institute of Technology (IIT) Madras, Chennai, India
- Aarthi Ravikrishnan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
- Rafael S Gonçalves
- Center for Computational Biomedicine, Harvard Medical School, Boston, MA, USA
- Eric Franzosa
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Karthik Raman
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology (IIT) Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, Indian Institute of Technology (IIT) Madras, Chennai, India
- Centre for Integrative Biology and Systems mEdicine (IBSE), Indian Institute of Technology (IIT) Madras, Chennai, India
- Vincent Carey
- Channing Division of Network Medicine, Mass General Brigham, Harvard Medical School, Boston, MA, USA
- Jennifer B Dowd
- Leverhulme Centre for Demographic Science, University of Oxford, Oxford, UK
- Heidi E Jones
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Sean Davis
- Departments of Biomedical Informatics and Medicine, University of Colorado Anschutz School of Medicine, Denver, CO, USA
- Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy
- Istituto Europeo di Oncologia (IEO) IRCCS, Milan, Italy
- Curtis Huttenhower
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Levi Waldron
- Institute for Implementation Science in Population Health, City University of New York School of Public Health, New York, NY, USA
- Department of Epidemiology and Biostatistics, City University of New York School of Public Health, New York, NY, USA
- Department CIBIO, University of Trento, Trento, Italy
2
McGovern KC, Nixon MP, Silverman JD. Addressing erroneous scale assumptions in microbe and gene set enrichment analysis. PLoS Comput Biol 2023; 19:e1011659. [PMID: 37983251 PMCID: PMC10695402 DOI: 10.1371/journal.pcbi.1011659] [Received: 05/23/2023] [Revised: 12/04/2023] [Accepted: 11/04/2023] [Indexed: 11/22/2023]
Abstract
By applying Differential Set Analysis (DSA) to sequence count data, researchers can determine whether groups of microbes or genes are differentially enriched. Yet sequence count data suffer from a scale limitation: these data lack information about the scale (i.e., size) of the biological system under study, leading some authors to call these data compositional (i.e., proportional). In this article, we show that commonly used DSA methods that rely on normalization make strong, implicit assumptions about the unmeasured system scale. We show that even small errors in these scale assumptions can lead to positive predictive values as low as 9%. To address this problem, we take three novel approaches. First, we introduce a sensitivity analysis framework to identify when modeling results are robust to such errors and when they are suspect. Unlike standard benchmarking studies, this framework does not require ground-truth knowledge and can therefore be applied to both simulated and real data. Second, we introduce a statistical test that provably controls Type-I error at a nominal rate despite errors in scale assumptions. Finally, we discuss how the impact of scale limitations depends on a researcher's scientific goals and provide tools that researchers can use to evaluate whether their goals are at risk from erroneous scale assumptions. Overall, the goal of this article is to catalyze future research into the impact of scale limitations in analyses of sequence count data; to illustrate that scale limitations can lead to inferential errors in practice; yet to also show that rigorous and reproducible scale reliant inference is possible if done carefully.
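To make the notion of an implicit scale assumption concrete, the toy sketch below shows how a log fold change computed from proportions differs from the change in absolute abundance by exactly the log fold change in total scale, and how one might check whether a conclusion survives a range of assumed scale changes. This is an illustration under simplifying assumptions (noise-free data, a single sample per condition), not the sensitivity analysis framework or the test introduced in the article; all function names and numbers are hypothetical.

```python
import math


def proportions(abundances):
    """Convert absolute abundances to proportions (total-sum scaling)."""
    total = sum(abundances)
    return [value / total for value in abundances]


def lfc_from_proportions(prop_ctrl, prop_case, assumed_log2_scale_change=0.0):
    """log2 fold change in absolute abundance implied by two proportions
    plus an assumed log2 fold change in total system scale.
    Leaving the assumption at 0.0 mimics analysing proportions directly."""
    return math.log2(prop_case / prop_ctrl) + assumed_log2_scale_change


def sign_is_stable(prop_ctrl, prop_case, assumed_scale_changes):
    """True if the implied fold change keeps one sign across all assumed scale changes."""
    values = [lfc_from_proportions(prop_ctrl, prop_case, s) for s in assumed_scale_changes]
    return all(v > 0 for v in values) or all(v < 0 for v in values)


# Hypothetical absolute abundances: only the third taxon truly changes.
ctrl = [100.0, 100.0, 100.0]
case = [100.0, 100.0, 400.0]

p_ctrl, p_case = proportions(ctrl), proportions(case)
true_log2_scale_change = math.log2(sum(case) / sum(ctrl))  # total doubles

# Assuming no scale change makes the unchanged taxa look depleted.
naive = [lfc_from_proportions(a, b) for a, b in zip(p_ctrl, p_case)]
# Supplying the true scale change recovers the real fold changes.
corrected = [lfc_from_proportions(a, b, true_log2_scale_change) for a, b in zip(p_ctrl, p_case)]

# Sensitivity check over a hypothetical range of plausible scale changes.
plausible = [-0.5, 0.0, 0.5, 1.0, 1.5]
robust = [sign_is_stable(a, b, plausible) for a, b in zip(p_ctrl, p_case)]
```

In this toy case the first two taxa appear to decrease only because the total grew, and their apparent direction flips within the plausible range, whereas the third taxon's increase holds across the whole range.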
Affiliation(s)
- Kyle C. McGovern
- Program in Bioinformatics and Genomics, Pennsylvania State University, State College, Pennsylvania, United States of America
- Michelle Pistner Nixon
- College of Information Sciences and Technology, Pennsylvania State University, State College, Pennsylvania, United States of America
- Justin D. Silverman
- Program in Bioinformatics and Genomics, Pennsylvania State University, State College, Pennsylvania, United States of America
- College of Information Sciences and Technology, Pennsylvania State University, State College, Pennsylvania, United States of America
- Departments of Medicine and Statistics, Pennsylvania State University, State College, Pennsylvania, United States of America
- Institute for Computational and Data Science, Pennsylvania State University, State College, Pennsylvania, United States of America