1
Liu Y, Fachrul M, Inouye M, Méric G. Harnessing human microbiomes for disease prediction. Trends Microbiol 2024; 32:707-719. [PMID: 38246848 DOI: 10.1016/j.tim.2023.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 12/12/2023] [Accepted: 12/12/2023] [Indexed: 01/23/2024]
Abstract
The human microbiome has been increasingly recognized as having potential use for disease prediction. Predicting the risk, progression, and severity of diseases holds promise to transform clinical practice, empower patient decisions, and reduce the burden of various common diseases, as has been demonstrated for cardiovascular disease or breast cancer. Combining multiple modifiable and non-modifiable risk factors, including high-dimensional genomic data, has been traditionally favored, but few studies have incorporated the human microbiome into models for predicting the prospective risk of disease. Here, we review research into the use of the human microbiome for disease prediction with a particular focus on prospective studies as well as the modulation and engineering of the microbiome as a therapeutic strategy.
Affiliation(s)
- Yang Liu
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia; Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Muhamad Fachrul
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia; Human Genomics and Evolution Unit, St Vincent's Institute of Medical Research, Victoria, Australia; Melbourne Integrative Genomics, University of Melbourne, Parkville, Victoria, Australia; School of BioSciences, University of Melbourne, Parkville, Victoria, Australia
- Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK; British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Cardiometabolic Health, University of Melbourne, Melbourne, Victoria, Australia; Central Clinical School, Monash University, Melbourne, Victoria, Australia; Department of Medical Science, Molecular Epidemiology, Uppsala University, Uppsala, Sweden; Department of Cardiovascular Research, Translation, and Implementation, La Trobe University, Melbourne, Victoria, Australia.
2
Gao W, Lin W, Li Q, Chen W, Yin W, Zhu X, Gao S, Liu L, Li W, Wu D, Zhang G, Zhu R, Jiao N. Identification and validation of microbial biomarkers from cross-cohort datasets using xMarkerFinder. Nat Protoc 2024:10.1038/s41596-024-00999-9. [PMID: 38745111 DOI: 10.1038/s41596-024-00999-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Accepted: 03/05/2024] [Indexed: 05/16/2024]
Abstract
Microbial signatures have emerged as promising biomarkers for disease diagnostics and prognostics, yet their variability across different studies calls for a standardized approach to biomarker research. Therefore, we introduce xMarkerFinder, a four-stage computational framework for microbial biomarker identification with comprehensive validations from cross-cohort datasets, including differential signature identification, model construction, model validation and biomarker interpretation. xMarkerFinder enables the identification and validation of reproducible biomarkers for cross-cohort studies, along with the establishment of classification models and potential microbiome-induced mechanisms. Originally developed for gut microbiome research, xMarkerFinder's adaptable design makes it applicable to various microbial habitats and data types. Distinct from existing biomarker research tools that typically concentrate on a singular aspect, xMarkerFinder uniquely incorporates a sophisticated feature selection process, specifically designed to address the heterogeneity between different cohorts, extensive internal and external validations, and detailed specificity assessments. Execution time varies depending on the sample size, selected algorithm and computational resource. Accessible via GitHub (https://github.com/tjcadd2020/xMarkerFinder), xMarkerFinder supports users with diverse expertise levels through different execution options, including step-to-step scripts with detailed tutorials and frequently asked questions, a single-command execution script, a ready-to-use Docker image and a user-friendly web server (https://www.biosino.org/xmarkerfinder).
Affiliation(s)
- Wenxing Gao
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Weili Lin
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Qiang Li
- National Genomics Data Center & Bio-Med Big Data Center, Chinese Academy of Sciences Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of the Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, P. R. China
- Wanning Chen
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Wenjing Yin
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Xinyue Zhu
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Sheng Gao
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Lei Liu
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China
- Wenjie Li
- Shanghai Southgene Technology Co., Ltd., Shanghai, P. R. China
- Dingfeng Wu
- National Clinical Research Center for Child Health, the Children's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, P. R. China
- Guoqing Zhang
- National Genomics Data Center & Bio-Med Big Data Center, Chinese Academy of Sciences Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of the Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, P. R. China.
- Ruixin Zhu
- The Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, P. R. China.
- Na Jiao
- National Clinical Research Center for Child Health, the Children's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, P. R. China.
- State Key Laboratory of Genetic Engineering, Fudan Microbiome Center, School of Life Sciences, Fudan University, Shanghai, P. R. China.
3
Austin GI, Kav AB, Park H, Biermann J, Uhlemann AC, Korem T. Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models. bioRxiv 2024:2024.02.09.579716. [PMID: 38405914 PMCID: PMC10888995 DOI: 10.1101/2024.02.09.579716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Every step in common microbiome profiling protocols has variable efficiency for each microbe. For example, different DNA extraction kits may have different efficiency for Gram-positive and -negative bacteria. These variable efficiencies, combined with technical variation, create strong processing biases, which impede the identification of signals that are reproducible across studies and the development of generalizable and biologically interpretable prediction models. "Batch-correction" methods have been used to alleviate these issues computationally with some success. However, many make strong parametric assumptions which do not necessarily apply to microbiome data or processing biases, or require the use of an outcome variable, which risks overfitting. Lastly and importantly, existing transformations used to correct microbiome data are largely non-interpretable, and could, for example, introduce values to features that were initially mostly zeros. Altogether, processing bias currently compromises our ability to glean robust and generalizable biological insights from microbiome data. Here, we present DEBIAS-M (Domain adaptation with phenotype Estimation and Batch Integration Across Studies of the Microbiome), an interpretable framework for inference and correction of processing bias, which facilitates domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using benchmarks of HIV and colorectal cancer classification from gut microbiome data, and cervical neoplasia prediction from cervical microbiome data, we demonstrate that DEBIAS-M outperforms batch-correction methods commonly used in the field. Notably, we show that the inferred bias-correction factors are stable, interpretable, and strongly associated with specific experimental protocols. 
Overall, we show that DEBIAS-M allows for better modeling of microbiome data and identification of interpretable signals that are reproducible across studies.
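The per-microbe, per-batch correction factors described in this abstract can be pictured with a small sketch. The abstract does not state the functional form, so the multiplicative rescaling followed by renormalization below is an assumption made for illustration. The factors are hard-coded rather than learned, and `apply_bias_factors` is a hypothetical helper, not part of the DEBIAS-M software.

```python
import numpy as np

def apply_bias_factors(rel_abund, batch_ids, factors):
    """Rescale each taxon by its batch-specific factor, then renormalize.

    rel_abund : samples x taxa matrix of relative abundances
    batch_ids : integer batch index for each sample
    factors   : batches x taxa matrix of positive factors (assumed given)
    """
    rel_abund = np.asarray(rel_abund, dtype=float)
    factors = np.asarray(factors, dtype=float)
    # Pick the factor row that matches each sample's batch.
    scaled = rel_abund * factors[np.asarray(batch_ids)]
    # Renormalize so every sample sums to 1 again.
    return scaled / scaled.sum(axis=1, keepdims=True)

# Toy data: two samples from two batches, three taxa.
rel = np.array([[0.5, 0.5, 0.0],
                [0.2, 0.3, 0.5]])
batch_ids = [0, 1]
factors = np.array([[1.0, 2.0, 1.0],   # batch 0 under-detects taxon 1 (assumed)
                    [1.0, 1.0, 1.0]])  # batch 1 left unchanged
corrected = apply_bias_factors(rel, batch_ids, factors)
```

Under this assumed multiplicative form, a taxon that was zero in a sample stays zero after correction, which keeps the corrected values easy to interpret.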
Affiliation(s)
- George I. Austin
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Aya Brown Kav
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Heekuk Park
- Division of Infectious Diseases, Columbia University Irving Medical Center, New York, NY, USA
- Jana Biermann
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Medicine, Division of Hematology/Oncology, Columbia University Irving Medical Center, New York, NY, USA
- Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
- Anne-Catrin Uhlemann
- Division of Infectious Diseases, Columbia University Irving Medical Center, New York, NY, USA
- Tal Korem
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
4
Xia Y. Statistical normalization methods in microbiome data with application to microbiome cancer research. Gut Microbes 2023; 15:2244139. [PMID: 37622724 PMCID: PMC10461514 DOI: 10.1080/19490976.2023.2244139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 07/12/2023] [Accepted: 07/31/2023] [Indexed: 08/26/2023] Open
Abstract
Mounting evidence has shown that gut microbiome is associated with various cancers, including gastrointestinal (GI) tract and non-GI tract cancers. But microbiome data have unique characteristics and pose major challenges when using standard statistical methods causing results to be invalid or misleading. Thus, to analyze microbiome data, it not only needs appropriate statistical methods, but also requires microbiome data to be normalized prior to statistical analysis. Here, we first describe the unique characteristics of microbiome data and the challenges in analyzing them (Section 2). Then, we provide an overall review on the available normalization methods of 16S rRNA and shotgun metagenomic data along with examples of their applications in microbiome cancer research (Section 3). In Section 4, we comprehensively investigate how the normalization methods of 16S rRNA and shotgun metagenomic data are evaluated. Finally, we summarize and conclude with remarks on statistical normalization methods (Section 5). Altogether, this review aims to provide a broad and comprehensive view and remarks on the promises and challenges of the statistical normalization methods in microbiome data with microbiome cancer research examples.
Affiliation(s)
- Yinglin Xia
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Illinois Chicago, Chicago, USA
5
Zhou R, Ng SK, Sung JJY, Goh WWB, Wong SH. Data pre-processing for analyzing microbiome data - A mini review. Comput Struct Biotechnol J 2023; 21:4804-4815. [PMID: 37841330 PMCID: PMC10569954 DOI: 10.1016/j.csbj.2023.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/01/2023] [Accepted: 10/01/2023] [Indexed: 10/17/2023] Open
Abstract
The human microbiome is an emerging research frontier due to its profound impacts on health. High-throughput microbiome sequencing enables studying microbial communities but suffers from analytical challenges. In particular, the lack of dedicated preprocessing methods to improve data quality impedes effective minimization of biases prior to downstream analysis. This review aims to address this gap by providing a comprehensive overview of preprocessing techniques relevant to microbiome research. We outline a typical workflow for microbiome data analysis. Preprocessing methods discussed include quality filtering, batch effect correction, imputation of missing values, normalization, and data transformation. We highlight strengths and limitations of each technique to serve as a practical guide for researchers and identify areas needing further methodological development. Establishing robust, standardized preprocessing will be essential for drawing valid biological conclusions from microbiome studies.
Affiliation(s)
- Ruwen Zhou
- Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Siu Kin Ng
- Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Joseph Jao Yiu Sung
- Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Department of Gastroenterology and Hepatology, Tan Tock Seng Hospital, National Healthcare Group, 11 Jalan Tan Tock Seng, 308433, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, 636921, Singapore
- Sunny Hei Wong
- Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Department of Gastroenterology and Hepatology, Tan Tock Seng Hospital, National Healthcare Group, 11 Jalan Tan Tock Seng, 308433, Singapore
6
Papoutsoglou G, Tarazona S, Lopes MB, Klammsteiner T, Ibrahimi E, Eckenberger J, Novielli P, Tonda A, Simeon A, Shigdel R, Béreux S, Vitali G, Tangaro S, Lahti L, Temko A, Claesson MJ, Berland M. Machine learning approaches in microbiome research: challenges and best practices. Front Microbiol 2023; 14:1261889. [PMID: 37808286 PMCID: PMC10556866 DOI: 10.3389/fmicb.2023.1261889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 09/04/2023] [Indexed: 10/10/2023] Open
Abstract
Microbiome data predictive analysis within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. It is demonstrated that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, the multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
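As a concrete instance of the compositional transformations mentioned in this abstract, the centered log-ratio (CLR) transform is a common choice. The sketch below is a minimal illustration under assumptions: the abstract does not say which transformations were compared, the pseudocount of 0.5 is arbitrary, and `clr_transform` is a hypothetical helper name.

```python
import numpy as np

def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    # Add a pseudocount so zero counts do not produce log(0).
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtract each sample's mean log value (log of its geometric mean).
    return log_x - log_x.mean(axis=1, keepdims=True)

# Toy count table: two samples, three taxa, very different sequencing depths.
counts = np.array([[10, 0, 90],
                   [5, 5, 990]])
clr = clr_transform(counts)
```

Each transformed sample sums to zero, so the values express abundance relative to the sample's own geometric mean rather than to sequencing depth.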
Affiliation(s)
- Georgios Papoutsoglou
- Department of Computer Science, University of Crete, Heraklion, Greece
- JADBio Gnosis DA S.A., Science and Technology Park of Crete, Heraklion, Greece
- Sonia Tarazona
- Department of Applied Statistics and Operations Research and Quality, Polytechnic University of Valencia, Valencia, Spain
- Marta B. Lopes
- Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- Research and Development Unit for Mechanical and Industrial Engineering (UNIDEMI), Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Thomas Klammsteiner
- Department of Ecology, Universität Innsbruck, Innsbruck, Austria
- Department of Microbiology, Universität Innsbruck, Innsbruck, Austria
- Eliana Ibrahimi
- Department of Biology, University of Tirana, Tirana, Albania
- Julia Eckenberger
- School of Microbiology, University College Cork, Cork, Ireland
- APC Microbiome Ireland, Cork, Ireland
- Pierfrancesco Novielli
- Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy
- National Institute for Nuclear Physics, Bari Division, Bari, Italy
- Alberto Tonda
- UMR 518 MIA-PS, INRAE, Paris-Saclay University, Palaiseau, France
- Complex Systems Institute of Paris Ile-de-France (ISC-PIF) - UAR 3611 CNRS, Paris, France
- Andrea Simeon
- BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Stéphane Béreux
- MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France
- MaIAGE, INRAE, Paris-Saclay University, Jouy-en-Josas, France
- Giacomo Vitali
- MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France
- Sabina Tangaro
- Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy
- National Institute for Nuclear Physics, Bari Division, Bari, Italy
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Andriy Temko
- Department of Electrical and Electronic Engineering, University College Cork, Cork, Ireland
- Marcus J. Claesson
- School of Microbiology, University College Cork, Cork, Ireland
- APC Microbiome Ireland, Cork, Ireland
- Magali Berland
- MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France
7
Aizpurua O, Dunn RR, Hansen LH, Gilbert MTP, Alberdi A. Field and laboratory guidelines for reliable bioinformatic and statistical analysis of bacterial shotgun metagenomic data. Crit Rev Biotechnol 2023:1-19. [PMID: 37731336 DOI: 10.1080/07388551.2023.2254933] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/24/2023] [Accepted: 06/27/2023] [Indexed: 09/22/2023]
Abstract
Shotgun metagenomics is an increasingly cost-effective approach for profiling environmental and host-associated microbial communities. However, due to the complexity of both microbiomes and the molecular techniques required to analyze them, the reliability and representativeness of the results are contingent upon the field, laboratory, and bioinformatic procedures employed. Here, we consider 15 field and laboratory issues that critically impact downstream bioinformatic and statistical data processing, as well as result interpretation, in bacterial shotgun metagenomic studies. The issues we consider encompass intrinsic properties of samples, study design, and laboratory-processing strategies. We identify the links of field and laboratory steps with downstream analytical procedures, explain the means for detecting potential pitfalls, and propose mitigation measures to overcome or minimize their impact in metagenomic studies. We anticipate that our guidelines will assist data scientists in appropriately processing and interpreting their data, while aiding field and laboratory researchers to implement strategies for improving the quality of the generated results.
Affiliation(s)
- Ostaizka Aizpurua
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Robert R Dunn
- Department of Applied Ecology, North Carolina State University, Raleigh, NC, USA
- Lars H Hansen
- Department of Plant and Environmental Sciences, University of Copenhagen, Frederiksberg, Denmark
- M T P Gilbert
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- University Museum, NTNU, Trondheim, Norway
- Antton Alberdi
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
8
Wang Y, Lê Cao KA. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Brief Bioinform 2023; 24:6991121. [PMID: 36653900 PMCID: PMC10025448 DOI: 10.1093/bib/bbac622] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 10/19/2022] [Revised: 12/14/2022] [Accepted: 12/17/2022] [Indexed: 01/20/2023] Open
Abstract
Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to and obscure any factors of interest. Existing batch effect correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce new multivariate and non-parametric batch effect correction methods based on Partial Least Squares Discriminant Analysis (PLSDA). PLSDA-batch first estimates treatment and batch variation with latent components, then subtracts batch-associated components from the data. The resulting batch-effect-corrected data can then be input in any downstream statistical analysis. Two variants are proposed to handle unbalanced batch × treatment designs and to avoid overfitting when estimating the components via variable selection. We compare our approaches with popular methods managing batch effects, namely, removeBatchEffect, ComBat and Surrogate Variable Analysis, in simulated and three case studies using various visual and numerical assessments. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, especially for unbalanced batch × treatment designs. Our downstream analyses show selections of biologically relevant taxa. This work demonstrates that batch effect correction methods can improve microbiome research outputs. Reproducible code and vignettes are available on GitHub.
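The idea of estimating a batch-associated latent component and subtracting it from the data can be illustrated with a generic rank-one deflation. This is a minimal sketch and not the PLSDA-batch implementation: it assumes a single component, a two-level batch coded 0/1, and no treatment variable to protect. `remove_batch_component` is a hypothetical helper name.

```python
import numpy as np

def remove_batch_component(X, batch):
    """Subtract one batch-associated latent component from X (samples x features)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # column-center the data
    y = np.asarray(batch, dtype=float)
    y = y - y.mean()                     # center the batch indicator
    w = Xc.T @ y                         # direction of maximal covariance with batch
    w /= np.linalg.norm(w)
    t = Xc @ w                           # latent component scores per sample
    p = Xc.T @ t / (t @ t)               # loadings of that component
    return Xc - np.outer(t, p), t        # deflated data and the removed scores

# Simulated example: 20 samples, 5 features, batch 1 shifted upward (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
batch = np.repeat([0, 1], 10)
X[batch == 1] += 2.0
X_corr, t = remove_batch_component(X, batch)
```

After deflation the corrected matrix is orthogonal to the removed component. In a real analysis the treatment signal would have to be estimated and preserved first, which this sketch does not attempt.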
Affiliation(s)
- Yiwen Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, 97 Buxin Rd, Shenzhen, 518000, Guangdong, China
- Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, 30 Royal Parade, Melbourne, 3052, VIC, Australia
- Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, 30 Royal Parade, Melbourne, 3052, VIC, Australia
9
Ma S, Shungin D, Mallick H, Schirmer M, Nguyen LH, Kolde R, Franzosa E, Vlamakis H, Xavier R, Huttenhower C. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol 2022; 23:208. [PMID: 36192803 PMCID: PMC9531436 DOI: 10.1186/s13059-022-02753-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/23/2021] [Accepted: 08/19/2022] [Indexed: 01/19/2023] Open
Abstract
Microbiome studies of inflammatory bowel diseases (IBD) have achieved a scale for meta-analysis of dysbioses among populations. To enable microbial community meta-analyses generally, we develop MMUPHin for normalization, statistical meta-analysis, and population structure discovery using microbial taxonomic and functional profiles. Applying it to ten IBD cohorts, we identify consistent associations, including novel taxa such as Acinetobacter and Turicibacter, and additional exposure and interaction effects. A single gradient of dysbiosis severity is favored over discrete types to summarize IBD microbiome population structure. These results provide a benchmark for characterization of IBD and a framework for meta-analysis of any microbial communities.
Affiliation(s)
- Siyuan Ma
- Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Dmitry Shungin
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Himel Mallick
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Melanie Schirmer
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Long H. Nguyen
- Massachusetts General Hospital, Boston, MA, USA
- Raivo Kolde
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Eric Franzosa
- Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Hera Vlamakis
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ramnik Xavier
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Curtis Huttenhower
- Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
10
Ling W, Lu J, Zhao N, Lulla A, Plantinga AM, Fu W, Zhang A, Liu H, Song H, Li Z, Chen J, Randolph TW, Koay WLA, White JR, Launer LJ, Fodor AA, Meyer KA, Wu MC. Batch effects removal for microbiome data via conditional quantile regression. Nat Commun 2022; 13:5418. [PMID: 36109499 PMCID: PMC9477887 DOI: 10.1038/s41467-022-33071-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 09/03/2021] [Accepted: 08/29/2022] [Indexed: 11/10/2022] Open
Abstract
Batch effects in microbiome data arise from differential processing of specimens and can lead to spurious findings and obscure true signals. Strategies designed for genomic data to mitigate batch effects usually fail to address the zero-inflated and over-dispersed microbiome data. Most strategies tailored for microbiome data are restricted to association testing or specialized study designs, failing to allow other analytic goals or general designs. Here, we develop the Conditional Quantile Regression (ConQuR) approach to remove microbiome batch effects using a two-part quantile regression model. ConQuR is a comprehensive method that accommodates the complex distributions of microbial read counts by non-parametric modeling, and it generates batch-removed zero-inflated read counts that can be used in and benefit usual subsequent analyses. We apply ConQuR to simulated and real microbiome datasets and demonstrate its advantages in removing batch effects while preserving the signals of interest.
Affiliation(s)
- Wodan Ling
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA
- Jiuyao Lu
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, 21205, Baltimore, USA
- Ni Zhao
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, 21205, Baltimore, USA.
- Anju Lulla
- Nutrition Research Institute and Department of Nutrition, University of North Carolina, 500 Laureate Way, 28081, Kannapolis, USA
- Anna M Plantinga
- Department of Mathematics and Statistics, Williams College, 18 Hoxsey St, 01267, Williamstown, USA
- Weijia Fu
- Department of Biostatistics, School of Public Health, University of Washington, 1705 NE Pacific St, 98195, Seattle, USA
- Angela Zhang
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA
- Department of Biostatistics, School of Public Health, University of Washington, 1705 NE Pacific St, 98195, Seattle, USA
- Hongjiao Liu
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA
- Department of Biostatistics, School of Public Health, University of Washington, 1705 NE Pacific St, 98195, Seattle, USA
| | - Hoseung Song
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA
| | - Zhigang Li
- Department of Biostatistics, College of Public Health & Health Professions, College of Medicine, University of Florida, 2004 Mowry Rd, 32611, Gainesville, USA
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 First St SW, 55905, Rochester, USA
| | - Timothy W Randolph
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA
| | - Wei Li A Koay
- Children's National Hospital, 111 Michigan Ave NW, 20010, Washington DC, USA
- Department of Pediatrics, George Washington University, Ross Hall 2300 Eye St NW, 20037, Washington DC, USA
| | - James R White
- Resphera Biosciences, 1529 Lancaster St, 21231, Baltimore, USA
| | - Lenore J Launer
- Laboratory of Epidemiology and Population Science, NIA, NIH, 7201 Wisconsin Ave, 20814, Bethesda, USA
| | - Anthony A Fodor
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, 28223, Charlotte, USA
| | - Katie A Meyer
- Nutrition Research Institute and Department of Nutrition, University of North Carolina, 500 Laureate Way, 28081, Kannapolis, USA
| | - Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, 98109, Seattle, USA.
- Department of Biostatistics, School of Public Health, University of Washington, 1705 NE Pacific St, 98195, Seattle, USA.
| |
11
Xiao L, Zhang F, Zhao F. Large-scale microbiome data integration enables robust biomarker identification. Nature Computational Science 2022; 2:307-316. [PMID: 38177817 PMCID: PMC10766547 DOI: 10.1038/s43588-022-00247-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 09/22/2021] [Accepted: 04/12/2022] [Indexed: 01/06/2024]
Abstract
The close association between gut microbiota dysbiosis and human diseases is being increasingly recognized. However, contradictory results are frequently reported, as confounding effects exist. The lack of unbiased data integration methods is also impeding the discovery of disease-associated microbial biomarkers from different cohorts. Here we propose an algorithm, NetMoss, for assessing shifts of microbial network modules to identify robust biomarkers associated with various diseases. Compared to previous approaches, the NetMoss method shows better performance in removing batch effects. Through comprehensive evaluations on both simulated and real datasets, we demonstrate that NetMoss has great advantages in the identification of disease-related biomarkers. Based on analysis of pandisease microbiota studies, there is a high prevalence of multidisease-related bacteria in global populations. We believe that large-scale data integration will help in understanding the role of the microbiome from a more comprehensive perspective and that accurate biomarker identification will greatly promote microbiome-based medical diagnosis.
Affiliation(s)
- Liwen Xiao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Fengyi Zhang
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
- Fangqing Zhao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China.
  - University of Chinese Academy of Sciences, Beijing, China.
  - Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China.
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
12
Sung JJY, Wong SH. What is unknown in using microbiota as a therapeutic? J Gastroenterol Hepatol 2022; 37:39-44. [PMID: 34668228 DOI: 10.1111/jgh.15716] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2021] [Revised: 10/15/2021] [Accepted: 10/18/2021] [Indexed: 12/17/2022]
Abstract
Fecal microbiota transplantation (FMT) has been used extensively in the treatment of various gastrointestinal and extraintestinal conditions, although many gaps remain in our knowledge of the gut microbiota and its behavior. This article describes the unknowns in microbiota biology (undetected microbes, uncertain colonization, unclear mechanisms of action, uncertain indications, and unsure long-term efficacy or side effects). We discuss how these unknowns may affect the therapeutic uses of FMT, and the potential and caveats of other related microbiota-based therapies. When FMT is used as an experimental therapy or last resort in difficult conditions, caution should be taken against inadvertent complications. Clear documentation of post-treatment events should be made mandatory, and events should be classified and graded as in clinical trials. Further robust scientific experiments and properly designed clinical studies are needed.
Affiliation(s)
- Joseph J Y Sung
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Sunny H Wong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
13
Shi P, Zhou Y, Zhang AR. High-dimensional log-error-in-variable regression with applications to microbial compositional data analysis. Biometrika 2021. [DOI: 10.1093/biomet/asab020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022] Open
Abstract
Summary
In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. We introduce a surprisingly simple, interpretable and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method provides corrections on sequencing data with possible overdispersion and simultaneously avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
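For orientation, the classic log-contrast model that the abstract refers to is commonly written as below. The notation is an assumption introduced here, not taken from the abstract: $w_{ij}$ is the read count of taxon $j$ in sample $i$, $x_{ij}$ the resulting composition, $y_i$ the phenotype, and $p$ the number of taxa.

```latex
x_{ij} = \frac{w_{ij}}{\sum_{k=1}^{p} w_{ik}}, \qquad
y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i, \qquad
\text{subject to } \sum_{j=1}^{p} \beta_j = 0 .
```

Written this way, the zero-count issue is visible directly: any $w_{ij} = 0$ gives $x_{ij} = 0$, so $\log x_{ij}$ is undefined unless the zero is replaced or the model is reformulated.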
Affiliation(s)
- Pixu Shi
  - Department of Biostatistics & Bioinformatics, Duke University, 2424 Erwin Road, Durham, North Carolina 27710, U.S.A
- Yuchen Zhou
  - Department of Biostatistics & Bioinformatics, Duke University, 2424 Erwin Road, Durham, North Carolina 27710, U.S.A
- Anru R Zhang
  - Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, Wisconsin 53706, U.S.A
14
Moossavi S, Fehr K, Khafipour E, Azad MB. Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota. Microbiome 2021; 9:41. [PMID: 33568231 PMCID: PMC7877029 DOI: 10.1186/s40168-020-00998-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/20/2020] [Accepted: 12/29/2020] [Indexed: 06/12/2023]
Abstract
BACKGROUND: Quality control, including assessment of batch variabilities and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. RESULTS: In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. CONCLUSION: This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.
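The idea of using differential prevalence between batches can be illustrated with a minimal Python sketch. This is not the authors' procedure and not the decontam algorithm; the function names, the toy counts, and the `min_difference` threshold are all hypothetical and chosen only to show the kind of comparison the abstract describes.

```python
def prevalence(samples, taxon):
    """Fraction of samples in which the taxon has a non-zero read count."""
    return sum(1 for s in samples if s.get(taxon, 0) > 0) / len(samples)


def flag_batch_specific_taxa(batch_a, batch_b, min_difference=0.5):
    """Return taxa whose prevalence differs between two batches by at least
    min_difference (a hypothetical threshold, not a published cutoff)."""
    taxa = set().union(*batch_a, *batch_b)  # all taxon names seen in either batch
    flagged = {}
    for taxon in sorted(taxa):
        diff = abs(prevalence(batch_a, taxon) - prevalence(batch_b, taxon))
        if diff >= min_difference:
            flagged[taxon] = diff
    return flagged


# Hypothetical read counts: each dict is one sample, keyed by taxon name.
batch_2016 = [
    {"taxon_1": 120, "reagent_taxon": 30},
    {"taxon_1": 80, "reagent_taxon": 12},
    {"taxon_1": 95, "reagent_taxon": 0},
]
batch_2019 = [
    {"taxon_1": 110},
    {"taxon_1": 70},
    {"taxon_1": 0},
]

flagged = flag_batch_specific_taxa(batch_2016, batch_2019)
print(flagged)
```

In this toy data, `reagent_taxon` appears in two of three samples in one batch and in none in the other, so it is flagged as a candidate for manual review, while `taxon_1` is present in most samples of both batches and is not.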
Affiliation(s)
- Shirin Moossavi
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB, Canada.
  - Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
  - Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
  - Digestive Oncology Research Center, Digestive Disease Research Institute, Tehran University of Medical Sciences, Tehran, Iran.
  - Department of Physiology and Pharmacology & Mechanical and Manufacturing Engineering, University of Calgary, Calgary, AB, Canada.
- Kelsey Fehr
  - Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada
  - Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada
- Ehsan Khafipour
  - Department of Animal Science, University of Manitoba, Winnipeg, MB, Canada
  - Microbiome Research and Technical Support, Cargill Animal Nutrition, Diamond V brand, Cedar Rapids, USA
- Meghan B Azad
  - Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
  - Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
  - Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada.
15
Li Z, Tian L, O’Malley AJ, Karagas MR, Hoen AG, Christensen BC, Madan JC, Wu Q, Gharaibeh RZ, Jobin C, Li H. IFAA: Robust Association Identification and Inference for Absolute Abundance in Microbiome Analyses. J Am Stat Assoc 2021; 116:1595-1608. [PMID: 35241863 PMCID: PMC8890673 DOI: 10.1080/01621459.2020.1860770] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/10/2019] [Revised: 09/30/2020] [Accepted: 12/03/2020] [Indexed: 12/15/2022]
Abstract
The target of inference in microbiome analyses is usually relative abundance (RA) because RA in a sample (e.g., stool) can be considered as an approximation of RA in an entire ecosystem (e.g., gut). However, inference on RA suffers from the fact that RAs are calculated by dividing absolute abundances (AAs) by the common denominator (CD), the sum of all AAs (i.e., library size). Because of that, a perturbation in one taxon will result in a change in the CD and thus cause false changes in the RA of all other taxa, and those false changes could lead to false positive/negative findings. We propose a novel analysis approach (IFAA) to make robust inference on AA of an ecosystem that can circumvent the issues induced by the CD problem and the compositional structure of RA. IFAA can also address the issues of overdispersion and handle zero-inflated data structures. IFAA identifies microbial taxa associated with the covariates in Phase 1 and estimates the association parameters by employing an independent reference taxon in Phase 2. Two real data applications are presented, and extensive simulations show that IFAA outperforms other established approaches by a large margin in the presence of unbalanced library size. Supplementary materials for this article are available online.
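The common-denominator problem described in this abstract can be seen in a few lines of Python. This is only a toy illustration with made-up absolute abundances and hypothetical taxon names; it does not implement IFAA.

```python
def relative_abundance(absolute):
    """Divide each absolute abundance by the sample total (the common denominator)."""
    total = sum(absolute.values())
    return {taxon: count / total for taxon, count in absolute.items()}


# Hypothetical absolute abundances in two conditions.
# Only taxon_A truly changes (it doubles); taxon_B and taxon_C are untouched.
before = {"taxon_A": 100, "taxon_B": 50, "taxon_C": 50}
after = {"taxon_A": 200, "taxon_B": 50, "taxon_C": 50}

ra_before = relative_abundance(before)
ra_after = relative_abundance(after)

for taxon in before:
    print(f"{taxon}: RA {ra_before[taxon]:.3f} -> {ra_after[taxon]:.3f}")
```

Although the absolute abundances of `taxon_B` and `taxon_C` are identical in both conditions, their relative abundances drop because the denominator grew, which is the kind of false change the abstract warns about.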
Affiliation(s)
- Zhigang Li
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Lu Tian
  - Department of Biomedical Data Science, Stanford University, Palo Alto, CA
- A. James O’Malley
  - The Dartmouth Institute, Geisel School of Medicine at Dartmouth, Hanover, NH
- Margaret R. Karagas
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Anne G. Hoen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Juliette C. Madan
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Quran Wu
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Christian Jobin
  - Department of Medicine, University of Florida, Gainesville, FL
- Hongzhe Li
  - Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA
16
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in the analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and the future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
17
Wang Y, LêCao KA. Managing batch effects in microbiome data. Brief Bioinform 2019; 21:1954-1970. [PMID: 31776547 DOI: 10.1093/bib/bbz105] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 04/07/2019] [Revised: 07/24/2019] [Indexed: 12/20/2022] Open
Abstract
Microbial communities have been increasingly studied in recent years to investigate their role in ecological habitats. However, microbiome studies are difficult to reproduce or replicate as they may suffer from confounding factors that are unavoidable in practice and originate from biological, technical or computational sources. In this review, we define batch effects as unwanted variation introduced by confounding factors that are not related to any factors of interest. Computational and analytical methods are required to remove or account for batch effects. However, inherent microbiome data characteristics (e.g. sparse, compositional and multivariate) challenge the development and application of batch effect adjustment methods to either account or correct for batch effects. We present commonly encountered sources of batch effects that we illustrate in several case studies. We discuss the limitations of current methods, which often have assumptions that are not met due to the peculiarities of microbiome data. We provide practical guidelines for assessing the efficiency of the methods based on visual and numerical outputs and a thorough tutorial to reproduce the analyses conducted in this review.
Affiliation(s)
- Yiwen Wang
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3052, Australia
- Kim-Anh LêCao
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3052, Australia