51
Bai X, Ren J, Fan Y, Sun F. KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate. Bioinformatics 2021; 37:759-766. [PMID: 33119059 PMCID: PMC8599924 DOI: 10.1093/bioinformatics/btaa912] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2020] [Revised: 09/11/2020] [Accepted: 10/14/2020] [Indexed: 01/09/2023]
Abstract
MOTIVATION The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. RESULTS To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. AVAILABILITY AND IMPLEMENTATION Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xin Bai
  - Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA
- Jie Ren
  - Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA
- Yingying Fan
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA
- Fengzhu Sun
  - Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA

52
Murga-Garrido SM, Hong Q, Cross TWL, Hutchison ER, Han J, Thomas SP, Vivas EI, Denu J, Ceschin DG, Tang ZZ, Rey FE. Gut microbiome variation modulates the effects of dietary fiber on host metabolism. MICROBIOME 2021; 9:117. [PMID: 34016169 PMCID: PMC8138933 DOI: 10.1186/s40168-021-01061-6] [Citation(s) in RCA: 65] [Impact Index Per Article: 21.7] [Received: 10/19/2020] [Accepted: 03/24/2021] [Indexed: 05/11/2023]
Abstract
BACKGROUND There is general consensus that consumption of dietary fermentable fiber improves cardiometabolic health, in part by promoting mutualistic microbes and by increasing production of beneficial metabolites in the distal gut. However, human studies have reported variations in the observed benefits among individuals consuming the same fiber. Several factors likely contribute to this variation, including host genetic and gut microbial differences. We hypothesized that gut microbial metabolism of dietary fiber represents an important and differential factor that modulates how dietary fiber impacts the host. RESULTS We examined genetically identical gnotobiotic mice harboring two distinct complex gut microbial communities and exposed to four isocaloric diets, each containing different fibers: (i) cellulose, (ii) inulin, (iii) pectin, (iv) a mix of 5 fermentable fibers (assorted fiber). Gut microbiome analysis showed that each transplanted community preserved a core of common taxa across diets that differentiated it from the other community, but there were variations in richness and bacterial taxa abundance within each community among the different diet treatments. Host epigenetic, transcriptional, and metabolomic analyses revealed diet-directed differences between animals colonized with the two communities, including variation in amino acids and lipid pathways that were associated with divergent health outcomes. CONCLUSION This study demonstrates that interindividual variation in the gut microbiome is causally linked to differential effects of dietary fiber on host metabolic phenotypes and suggests that a one-fits-all fiber supplementation approach to promote health is unlikely to elicit consistent effects across individuals. Overall, the presented results underscore the importance of microbe-diet interactions on host metabolism and suggest that gut microbes modulate dietary fiber efficacy. Video abstract.
Affiliation(s)
- Sofia M Murga-Garrido
  - Department of Bacteriology, University of Wisconsin-Madison, 1550 Linden Dr., Madison, WI, 53706, USA
  - PECEM (MD/PhD), Facultad de Medicina, Universidad Nacional Autónoma de México, Coyoacán, Ciudad de México, México
- Qilin Hong
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 600 Highland Avenue, Madison, WI, 53792, USA
- Tzu-Wen L Cross
  - Department of Bacteriology, University of Wisconsin-Madison, 1550 Linden Dr., Madison, WI, 53706, USA
  - Present Address: Department of Nutrition Science, Purdue University, 700 W. State Street, Stone Hall 205, West Lafayette, IN, 47907, USA
- Evan R Hutchison
  - Department of Bacteriology, University of Wisconsin-Madison, 1550 Linden Dr., Madison, WI, 53706, USA
- Jessica Han
  - Wisconsin Institute for Discovery, Madison, WI, USA
- Eugenio I Vivas
  - Department of Bacteriology, University of Wisconsin-Madison, 1550 Linden Dr., Madison, WI, 53706, USA
- John Denu
  - Wisconsin Institute for Discovery, Madison, WI, USA
- Danilo G Ceschin
  - Unidad de Bioinformática Traslacional, Centro de Investigación en Medicina Traslacional Severo Amuchástegui, Instituto Universitario de Ciencias Biomédicas de Córdoba, Av. Naciones Unidas 420, 5000, Córdoba, CP, Argentina
- Zheng-Zheng Tang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 600 Highland Avenue, Madison, WI, 53792, USA
  - Wisconsin Institute for Discovery, Madison, WI, USA
- Federico E Rey
  - Department of Bacteriology, University of Wisconsin-Madison, 1550 Linden Dr., Madison, WI, 53706, USA

53
Uh HW, Klarić L, Ugrina I, Lauc G, Smilde AK, Houwing-Duistermaat JJ. Choosing proper normalization is essential for discovery of sparse glycan biomarkers. Mol Omics 2021; 16:231-242. [PMID: 32211690 DOI: 10.1039/c9mo00174c] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 01/01/2023]
Abstract
Rapid progress in high-throughput glycomics analysis enables the researchers to conduct large sample studies. Typically, the between-subject differences in total abundance of raw glycomics data are very large, and it is necessary to reduce the differences, making measurements comparable across samples. Essentially there are two ways to approach this issue: row-wise and column-wise normalization. In glycomics, the differences per subject are usually forced to be exactly zero, by scaling each sample having the sum of all glycan intensities equal to 100%. This total area (row-wise) normalization (TA) results in so-called compositional data, rendering many standard multivariate statistical methods inappropriate or inapplicable. Ignoring the compositional nature of the data, moreover, may lead to spurious results. Alternatively, a log-transformation to the raw data can be performed prior to column-wise normalization and implementing standard statistical tools. Until now, there is no clear consensus on the appropriate normalization method applied to glycomics data. Nor is systematic investigation of impact of TA on downstream analysis available to justify the choice of TA. Our motivation lies in efficient variable selection to identify glycan biomarkers with regard to accurate prediction as well as interpretability of the model chosen. Via extensive simulations we investigate how different normalization methods affect the performance of variable selection, and compare their performance. We also address the effect of various types of measurement error in glycans: additive, multiplicative and two-component error. We show that when sample-wise differences are not large row-wise normalization (like TA) can have deleterious effects on variable selection and prediction.
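The two normalization routes contrasted in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy intensity matrix, and the use of per-glycan standardization as the column-wise step are all assumptions made for the example.

```python
import numpy as np

def total_area_normalize(intensities):
    """Row-wise (TA): scale each sample so its glycan intensities sum to 100%."""
    x = np.asarray(intensities, dtype=float)
    return 100.0 * x / x.sum(axis=1, keepdims=True)

def log_column_standardize(intensities):
    """Log-transform the raw data, then centre and scale each glycan (column).

    Standardization is an assumed choice of column-wise normalization.
    """
    logged = np.log(np.asarray(intensities, dtype=float))
    return (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)

# Hypothetical raw intensities: 3 samples (rows) x 3 glycans (columns).
raw = np.array([[10.0, 30.0, 60.0],
                [40.0, 40.0, 120.0],
                [5.0, 20.0, 25.0]])

ta = total_area_normalize(raw)       # every row now sums to 100 (compositional)
z = log_column_standardize(raw)      # every column has mean 0 and unit variance
```

After TA, each row carries one fewer free dimension than before, which is the compositional constraint the abstract warns about; the log plus column-wise route leaves the rows unconstrained.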
Affiliation(s)
- Hae-Won Uh
  - Department of Biostatistics and Research Support, University Medical Center Utrecht, Utrecht, Netherlands
- Lucija Klarić
  - Genos Glycoscience Research Laboratory, Zagreb, Croatia and MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- Ivo Ugrina
  - Genos Glycoscience Research Laboratory, Zagreb, Croatia and University of Split, Faculty of Science, Split, Croatia and Intellomics Ltd, Croatia
- Gordan Lauc
  - Genos Glycoscience Research Laboratory, Zagreb, Croatia and Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- Age K Smilde
  - Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
- Jeanine J Houwing-Duistermaat
  - Department of Biostatistics and Research Support, University Medical Center Utrecht, Utrecht, Netherlands and Department of Statistics, University of Leeds, Leeds, UK

54
Fiksel J, Zeger S, Datta A. A transformation-free linear regression for compositional outcomes and predictors. Biometrics 2021; 78:974-987. [PMID: 33788259 DOI: 10.1111/biom.13465] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/17/2020] [Revised: 03/09/2021] [Accepted: 03/15/2021] [Indexed: 11/29/2022]
Abstract
Compositional data are common in many fields, both as outcomes and predictor variables. The inventory of models for the case when both the outcome and predictor variables are compositional is limited, and the existing models are often difficult to interpret in the compositional space, due to their use of complex log-ratio transformations. We develop a transformation-free linear regression model where the expected value of the compositional outcome is expressed as a single Markov transition from the compositional predictor. Our approach is based on estimating equations thereby not requiring complete specification of data likelihood and is robust to different data-generating mechanisms. Our model is simple to interpret, allows for 0s and 1s in both the compositional outcome and covariates, and subsumes several interesting subcases of interest. We also develop permutation tests for linear independence and equality of effect sizes of two components of the predictor. Finally, we show that despite its simplicity, our model accurately captures the relationship between compositional data using two datasets from education and medical research.
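The phrase "a single Markov transition" can be read as follows: if the predictor x lies on a simplex and B is a row-stochastic matrix, then the mean outcome x·B also lies on a simplex, with no log-ratio transformation needed. The sketch below only illustrates that reading. The function name, the matrix B, and the input values are hypothetical, and the estimating-equation fitting step described in the abstract is not shown.

```python
import numpy as np

def expected_outcome(x, B):
    """Mean compositional outcome as one Markov transition from x.

    x: composition of length p (non-negative, sums to 1).
    B: p x q matrix whose rows are themselves compositions (row-stochastic).
    """
    x = np.asarray(x, dtype=float)
    B = np.asarray(B, dtype=float)
    if (x < 0).any() or not np.isclose(x.sum(), 1.0):
        raise ValueError("x must be a composition")
    if (B < 0).any() or not np.allclose(B.sum(axis=1), 1.0):
        raise ValueError("each row of B must be a composition")
    return x @ B

# Hypothetical transition matrix: 2 predictor parts -> 3 outcome parts.
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

y_mean = expected_outcome([0.25, 0.75], B)   # array([0.25, 0.65, 0.1])
y_edge = expected_outcome([1.0, 0.0], B)     # zeros and ones in x are allowed
```

Row j of B is the expected outcome when the predictor is concentrated entirely on part j, which is one way to see why the coefficients are directly interpretable.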
Affiliation(s)
- Jacob Fiksel
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
- Scott Zeger
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA

55
Wu X, Liang R, Yang H. Penalized and constrained LAD estimation in fixed and high dimension. Stat Pap (Berl) 2021; 63:53-95. [PMID: 33814727 PMCID: PMC8009762 DOI: 10.1007/s00362-021-01229-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2020] [Revised: 02/27/2021] [Indexed: 11/26/2022]
Abstract
Recently, many studies have shown that prior information and structure in many application fields can be formulated as constraints on regression coefficients. Following this work, we propose an L1-penalized LAD estimation with some linear constraints in this paper. Unlike the constrained lasso, our estimation performs well when heavy-tailed errors or outliers are found in the response. In theory, we show that the proposed estimation enjoys the Oracle property with adjusted normal variance when the dimension of the estimated coefficients p is fixed. When p is much greater than the sample size n, the error bound of the proposed estimation is sharper than $\sqrt{k\log(p)/n}$. It is worth noting that the result is true for a wide range of noise distributions, even for the Cauchy distribution. Algorithmically, we not only consider a typical linear programming formulation to solve the proposed estimation in fixed dimension, but also present a nested alternating direction method of multipliers (ADMM) in high dimension. Simulation and application to real data also confirm that the proposed estimation is an effective alternative when the constrained lasso is unreliable.
Affiliation(s)
- Xiaofei Wu
  - College of Mathematics and Statistics, Chongqing University, Chongqing, 401331 People’s Republic of China
- Rongmei Liang
  - College of Mathematics and Statistics, Chongqing University, Chongqing, 401331 People’s Republic of China
- Hu Yang
  - College of Mathematics and Statistics, Chongqing University, Chongqing, 401331 People’s Republic of China

56
Shi P, Zhou Y, Zhang AR. High-dimensional log-error-in-variable regression with applications to microbial compositional data analysis. Biometrika 2021. [DOI: 10.1093/biomet/asab020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022]
Abstract
Summary
In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. We introduce a surprisingly simple, interpretable and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method provides corrections on sequencing data with possible overdispersion and simultaneously avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
Affiliation(s)
- Pixu Shi
  - Department of Biostatistics & Bioinformatics, Duke University, 2424 Erwin Road, Durham, North Carolina 27710, U.S.A
- Yuchen Zhou
  - Department of Biostatistics & Bioinformatics, Duke University, 2424 Erwin Road, Durham, North Carolina 27710, U.S.A
- Anru R Zhang
  - Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, Wisconsin 53706, U.S.A

57
Zhou F, He K, Li Q, Chapkin RS, Ni Y. Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization. Biostatistics 2021; 23:891-909. [PMID: 33634824 DOI: 10.1093/biostatistics/kxab002] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/17/2020] [Revised: 10/08/2020] [Accepted: 01/10/2021] [Indexed: 12/26/2022]
Abstract
High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore human gut microbiome and its relation to diseases. Microbiome data are compositional, sparse, noisy, and heterogeneous, which pose serious challenges for statistical modeling. We propose an identifiable Bayesian multinomial matrix factorization model to infer overlapping clusters on both microbes and hosts. The proposed method represents the observed over-dispersed zero-inflated count matrix as Dirichlet-multinomial mixtures on which latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is automatically determined and available information from a taxonomic rank tree of microbes is naturally incorporated, which greatly improves the interpretability of our findings. We demonstrate the utility of the proposed approach by comparing to alternative methods in simulations. An application to a human gut microbiome data set involving patients with inflammatory bowel disease reveals interesting clusters, which contain bacteria families Bacteroidaceae, Bifidobacteriaceae, Enterobacteriaceae, Fusobacteriaceae, Lachnospiraceae, Ruminococcaceae, Pasteurellaceae, and Porphyromonadaceae that are known to be related to the inflammatory bowel disease and its subtypes according to biological literature. Our findings can help generate potential hypotheses for future investigation of the heterogeneity of the human gut microbiome.
Affiliation(s)
- Fangting Zhou
  - Department of Statistics, Texas A&M University, College Station, TX, USA and Institute of Statistics and Big Data, Renmin University of China, Beijing, China
- Kejun He
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing, China
- Qiwei Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Dallas, TX, USA
- Robert S Chapkin
  - Department of Nutrition and Food Science, Texas A&M University, College Station, TX, USA

58
Zhang H, Chen J, Feng Y, Wang C, Li H, Liu L. Mediation effect selection in high-dimensional and compositional microbiome data. Stat Med 2021; 40:885-896. [PMID: 33205470 PMCID: PMC7855955 DOI: 10.1002/sim.8808] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/25/2020] [Revised: 08/31/2020] [Accepted: 10/16/2020] [Indexed: 01/08/2023]
Abstract
The microbiome plays an important role in human health by mediating the path from environmental exposures to health outcomes. The relative abundances of the high-dimensional microbiome data have a unit-sum restriction, rendering standard statistical methods in the Euclidean space invalid. To address this problem, we use the isometric log-ratio transformations of the relative abundances as the mediator variables. To select significant mediators, we consider a closed testing-based selection procedure with desirable confidence. Simulations are provided to verify the effectiveness of our method. As an illustrative example, we apply the proposed method to study the mediation effects of murine gut microbiome between subtherapeutic antibiotic treatment and body weight gain, and identify Coprobacillus and Adlercreutzia as two significant mediators.
Affiliation(s)
- Haixiang Zhang
  - Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Yang Feng
  - Department of Biostatistics, College of Global Public Health, New York University, New York, NY 10003, USA
- Chan Wang
  - Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
- Huilin Li
  - Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
- Lei Liu
  - Division of Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, USA

59
Yang F, Zou Q, Gao B. GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed. Brief Bioinform 2021; 22:6123951. [PMID: 33515036 DOI: 10.1093/bib/bbaa436] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/23/2020] [Revised: 12/17/2020] [Accepted: 12/26/2020] [Indexed: 02/07/2023]
Abstract
The compositionality of the microbiome data is well-known but often neglected. The compositional transformation pertains to the supervised learning of microbiome data and is a critical step that decides the performance and reliability of the disease classifiers. We value the excellent performance of the distal discriminative balance analysis (DBA) method, which selects distal balances of pairs and trios of bacteria, in addressing the classification of high-dimensional microbiome data. By applying this method to the species-level abundances of all the disease phenotypes in the GMrepo database, we build a balance-based model repository for the classification of human gut microbiome-related diseases. The model repository supports the prediction of disease risks for new sample(s). More importantly, we highlight the concept of balance-disease associations rather than the conventional microbe-disease associations and develop the human Gut Balance-Disease Association Database (GBDAD). Each predictable balance for each disease model indicates a potential biomarker-disease relationship and can be interpreted as a bacteria ratio positively or negatively correlated with the disease. Furthermore, by linking the balance-disease associations to the evidenced microbe-disease associations in MicroPhenoDB, we surprisingly found that most species-disease associations inferred from the shotgun metagenomic datasets can be validated by external evidence beyond MicroPhenoDB. The balance-based species-disease association inference will accelerate the generation of new microbe-disease association hypotheses in gastrointestinal microecology research and clinical trials. The model repository and the GBDAD database are deployed on the GutBalance server, which supports interactive visualization and systematic interrogation of the disease models, disease-related balances and disease-related species of interest.
Affiliation(s)
- Fenglong Yang
  - University of Electronic Science and Technology of China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou 571158, China
- Bo Gao
  - Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin 150001, China

60
Li Z, Tian L, O’Malley AJ, Karagas MR, Hoen AG, Christensen BC, Madan JC, Wu Q, Gharaibeh RZ, Jobin C, Li H. IFAA: Robust Association Identification and Inference for Absolute Abundance in Microbiome Analyses. J Am Stat Assoc 2021; 116:1595-1608. [PMID: 35241863 PMCID: PMC8890673 DOI: 10.1080/01621459.2020.1860770] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/10/2019] [Revised: 09/30/2020] [Accepted: 12/03/2020] [Indexed: 12/15/2022]
Abstract
The target of inference in microbiome analyses is usually relative abundance (RA) because RA in a sample (e.g., stool) can be considered as an approximation of RA in an entire ecosystem (e.g., gut). However, inference on RA suffers from the fact that RA are calculated by dividing absolute abundances (AAs) over the common denominator (CD), the summation of all AA (i.e., library size). Because of that, perturbation in one taxon will result in a change in the CD and thus cause false changes in RA of all other taxa, and those false changes could lead to false positive/negative findings. We propose a novel analysis approach (IFAA) to make robust inference on AA of an ecosystem that can circumvent the issues induced by the CD problem and compositional structure of RA. IFAA can also address the issues of overdispersion and handle zero-inflated data structures. IFAA identifies microbial taxa associated with the covariates in Phase 1 and estimates the association parameters by employing an independent reference taxon in Phase 2. Two real data applications are presented and extensive simulations show that IFAA outperforms other established existing approaches by a big margin in the presence of unbalanced library size. Supplementary materials for this article are available online.
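The common-denominator problem described above is easy to reproduce numerically. The following is only a toy illustration of that problem, not of the IFAA procedure itself; the function name and the abundance values are made up for the example.

```python
import numpy as np

def relative_abundance(absolute_abundance):
    """Divide each absolute abundance by the common denominator (their sum)."""
    aa = np.asarray(absolute_abundance, dtype=float)
    return aa / aa.sum()

# Hypothetical absolute abundances for three taxa.
aa_before = np.array([100.0, 200.0, 700.0])

# Perturb only taxon 0; taxa 1 and 2 are left untouched.
aa_after = aa_before.copy()
aa_after[0] = 1100.0

ra_before = relative_abundance(aa_before)   # array([0.1, 0.2, 0.7])
ra_after = relative_abundance(aa_after)     # array([0.55, 0.1, 0.35])
```

Taxa 1 and 2 have identical absolute abundances in both scenarios, yet their relative abundances are halved because the denominator grew from 1000 to 2000. An analysis on RA alone would flag them as changed.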
Affiliation(s)
- Zhigang Li
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Lu Tian
  - Department of Biomedical Data Science, Stanford University, Palo Alto, CA
- A. James O’Malley
  - The Dartmouth Institute, Geisel School of Medicine at Dartmouth, Hanover, NH
- Margaret R. Karagas
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Anne G. Hoen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Juliette C. Madan
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Quran Wu
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Christian Jobin
  - Department of Medicine, University of Florida, Gainesville, FL
- Hongzhe Li
  - Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA

61
Affiliation(s)
- Xuejun Ma
  - School of Mathematical Sciences, Soochow University, Suzhou, China
- Ping Zhang
  - School of Mathematical Sciences, Soochow University, Suzhou, China

62
Sun Z, Xu W, Cong X, Li G, Chen K. Log-contrast regression with functional compositional predictors: linking preterm infant's gut microbiome trajectories to neurobehavioral outcome. Ann Appl Stat 2020; 14:1535-1556. [PMID: 34163544 DOI: 10.1214/20-aoas1357] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
Abstract
The neonatal intensive care unit (NICU) experience is known to be one of the most crucial factors that drive preterm infant's neurodevelopmental and health outcome. It is hypothesized that stressful early life experience of very preterm neonate is imprinting gut microbiome by the regulation of the so-called brain-gut axis, and consequently, certain microbiome markers are predictive of later infant neurodevelopment. To investigate, a preterm infant study was conducted; infant fecal samples were collected during the infants' first month of postnatal age, resulting in functional compositional microbiome data, and neurobehavioral outcomes were measured when infants reached 36-38 weeks of post-menstrual age. To identify potential microbiome markers and estimate how the trajectories of gut microbiome compositions during early postnatal stage impact later neurobehavioral outcomes of the preterm infants, we innovate a sparse log-contrast regression with functional compositional predictors. The functional simplex structure is strictly preserved, and the functional compositional predictors are allowed to have sparse, smoothly varying, and accumulating effects on the outcome through time. Through a pragmatic basis expansion step, the problem boils down to a linearly constrained sparse group regression, for which we develop an efficient algorithm and obtain theoretical performance guarantees. Our approach yields insightful results in the preterm infant study. The identified microbiome markers and the estimated time dynamics of their impact on the neurobehavioral outcome shed light on the linkage between stress accumulation in early postnatal stage and neurodevelopmental process of infants.

63
Affiliation(s)
- Jacob Bien
  - Data Sciences and Operations, USC Marshall, Los Angeles, CA

64
Koslovsky MD, Hoffman KL, Daniel CR, Vannucci M. A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1354] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]

65
Srinivasan S, Chambers LC, Tapia KA, Hoffman NG, Munch MM, Morgan JL, Domogala D, Sylvan Lowens M, Proll S, Huang ML, Soge OO, Jerome KR, Golden MR, Hughes JP, Fredricks DN, Manhart LE. Urethral Microbiota in Men: Association of Haemophilus influenzae and Mycoplasma penetrans With Nongonococcal Urethritis. Clin Infect Dis 2020; 73:e1684-e1693. [PMID: 32750107 DOI: 10.1093/cid/ciaa1123] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 03/06/2020] [Accepted: 07/30/2020] [Indexed: 01/15/2023]
Abstract
BACKGROUND Nongonococcal urethritis (NGU) is a common syndrome with no known etiology in ≤50% of cases. We estimated associations between urethral bacteria and NGU in men who have sex with men (MSM) and men who have sex with women (MSW). METHODS Urine was collected from NGU cases (129 MSM, 121 MSW) and controls (70 MSM, 114 MSW) attending a Seattle STD clinic. Cases had ≥5 polymorphonuclear leukocytes on Gram stain plus symptoms or discharge; controls had <5 PMNs, no symptoms, no discharge. NGU was considered idiopathic when Neisseria gonorrhoeae, Chlamydia trachomatis, Mycoplasma genitalium, Trichomonas vaginalis, adenovirus, and herpes simplex virus were absent. The urethral microbiota was characterized using 16S rRNA gene sequencing. Compositional lasso analysis was conducted to identify associations between bacterial taxa and NGU and to select bacteria for targeted qPCR. RESULTS Among NGU cases, 45.2% were idiopathic. Based on compositional lasso analysis, we selected Haemophilus influenzae (HI) and Mycoplasma penetrans (MP) for targeted qPCR. Compared with 182 men without NGU, the 249 men with NGU were more likely to have HI (14% vs 2%) and MP (21% vs 1%) (both P ≤ .001). In stratified analyses, detection of HI was associated with NGU among MSM (12% vs 3%, P = .036) and MSW (17% vs 1%, P < .001), but MP was associated with NGU only among MSM (13% vs 1%, P = .004). Associations were stronger in men with idiopathic NGU. CONCLUSIONS HI and MP are potential causes of male urethritis. MP was more often detected among MSM than MSW with urethritis.
Affiliation(s)
- Sujatha Srinivasan
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Laura C Chambers
  - Department of Epidemiology, University of Washington, Seattle, Washington, USA
- Kenneth A Tapia
  - Department of Global Health, University of Washington, Seattle, Washington, USA
- Noah G Hoffman
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
- Matthew M Munch
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Jennifer L Morgan
  - Public Health-Seattle & King County HIV/STD Program, Seattle, Washington, USA
- Daniel Domogala
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- M Sylvan Lowens
  - Public Health-Seattle & King County HIV/STD Program, Seattle, Washington, USA
- Sean Proll
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Meei-Li Huang
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
- Olusegun O Soge
  - Department of Global Health, University of Washington, Seattle, Washington, USA
  - Department of Medicine, University of Washington, Seattle, Washington, USA
- Keith R Jerome
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
- Matthew R Golden
  - Public Health-Seattle & King County HIV/STD Program, Seattle, Washington, USA
  - Department of Medicine, University of Washington, Seattle, Washington, USA
- James P Hughes
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- David N Fredricks
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Medicine, University of Washington, Seattle, Washington, USA
- Lisa E Manhart
  - Department of Epidemiology, University of Washington, Seattle, Washington, USA
  - Department of Global Health, University of Washington, Seattle, Washington, USA

66
Jeon JJ, Kim Y, Won S, Choi H. Primal path algorithm for compositional data analysis. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2020.106958] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/20/2023]
|
67
|
Chen J, Zhang X, Hron K. Partial least squares regression with compositional response variables and covariates. J Appl Stat 2020; 48:3130-3149. [DOI: 10.1080/02664763.2020.1795813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/23/2022]
Affiliation(s)
- Jiajia Chen: School of Statistics, Shanxi University of Finance and Economics, Taiyuan, People's Republic of China
- Xiaoqin Zhang: School of Statistics, Shanxi University of Finance and Economics, Taiyuan, People's Republic of China
- Karel Hron: Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University, Olomouc, Czech Republic
|
68
|
Zhang L, Shi Y, Jenq RR, Do KA, Peterson CB. Bayesian compositional regression with structured priors for microbiome feature selection. Biometrics 2020; 77:824-838. [PMID: 32686846 DOI: 10.1111/biom.13335] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/07/2019] [Accepted: 07/13/2020] [Indexed: 01/10/2023]
Abstract
The microbiome plays a critical role in human health and disease, and there is a strong scientific interest in linking specific features of the microbiome to clinical outcomes. There are key aspects of microbiome data, however, that limit the applicability of standard variable selection methods. In particular, the observed data are compositional, as the counts within each sample have a fixed-sum constraint. In addition, microbiome features, typically quantified as operational taxonomic units, often reflect microorganisms that are similar in function, and may therefore have a similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we propose a variable selection technique with the following novel features: a generalized transformation and z-prior to handle the compositional constraint, and an Ising prior that encourages the joint selection of microbiome features that are closely related in terms of their genetic sequence similarity. We demonstrate that our proposed method outperforms existing penalized approaches for microbiome variable selection in both simulation and the analysis of real data exploring the relationship of the gut microbiome to body mass index.
Affiliation(s)
- Liangliang Zhang: Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
- Yushu Shi: Department of Statistics, University of Missouri, Columbia, Missouri
- Robert R Jenq: Department of Genomic Medicine, University of Texas MD Anderson Cancer Center, Houston, Texas
- Kim-Anh Do: Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
- Christine B Peterson: Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
|
69
|
Wang S, Cai TT, Li H. Hypothesis testing for phylogenetic composition: a minimum-cost flow perspective. Biometrika 2020; 108:17-36. [PMID: 33716568 DOI: 10.1093/biomet/asaa061] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/30/2019] [Indexed: 12/30/2022] Open
Abstract
Quantitative comparison of microbial composition from different populations is a fundamental task in various microbiome studies. We consider two-sample testing for microbial compositional data by leveraging phylogenetic information. Motivated by existing phylogenetic distances, we take a minimum-cost flow perspective to study such testing problems. We first show that multivariate analysis of variance with permutation using phylogenetic distances, one of the most commonly used methods in practice, is essentially a sum-of-squares type of test and has better power for dense alternatives. However, empirical evidence from real datasets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, we propose a new maximum type test, detector of active flow on a tree, and investigate its properties. We show that the proposed method is particularly powerful against sparse phylogenetic composition difference and enjoys certain optimality. The practical merit of the proposed method is demonstrated by simulation studies and an application to a human intestinal biopsy microbiome dataset on patients with ulcerative colitis.
Affiliation(s)
- Shulei Wang: Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.
- T Tony Cai: Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.
- Hongzhe Li: Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.
|
70
|
Regression Models for Compositional Data: General Log-Contrast Formulations, Proximal Optimization, and Microbiome Data Applications. Statistics in Biosciences 2020. [DOI: 10.1007/s12561-020-09283-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Abstract
Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks.
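As an illustration of the linear log-contrast model mentioned in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common formulation in which the predictor is a weighted sum of log parts with coefficients constrained to sum to zero; the function name `log_contrast` and the toy numbers are hypothetical. The zero-sum constraint is what makes the prediction identical for raw counts and for their normalized proportions.

```python
import math

def log_contrast(composition, beta, intercept=0.0):
    """Linear log-contrast predictor: intercept + sum_j beta_j * log(x_j).

    Assumes the zero-sum constraint sum_j beta_j = 0, which makes the
    predictor invariant to rescaling of the composition.
    """
    if len(composition) != len(beta):
        raise ValueError("composition and beta must have the same length")
    if abs(sum(beta)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    if any(x <= 0 for x in composition):
        raise ValueError("all parts must be strictly positive")
    return intercept + sum(b * math.log(x) for b, x in zip(beta, composition))

# Hypothetical toy data: raw counts and the same sample as proportions.
counts = [120.0, 30.0, 50.0]
total = sum(counts)
proportions = [c / total for c in counts]
beta = [1.0, -0.5, -0.5]  # illustrative coefficients, sum to zero

pred_counts = log_contrast(counts, beta)
pred_props = log_contrast(proportions, beta)
```

Because the coefficients sum to zero, the common scaling factor cancels inside the logarithms, so `pred_counts` and `pred_props` agree up to floating-point error.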
|
71
|
Liu T, Zhao H, Wang T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 2020; 21:225. [PMID: 32493208 PMCID: PMC7268703 DOI: 10.1186/s12859-020-03552-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/18/2019] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
Background Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances, and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data. Results Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure. Conclusions Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly-used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
Affiliation(s)
- Tiantian Liu: Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Hongyu Zhao: Department of Biostatistics, Yale University, 300 George Street, New Haven, 06511, USA; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Tao Wang: Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
|
72
|
Altenbuchinger M, Weihs A, Quackenbush J, Grabe HJ, Zacharias HU. Gaussian and Mixed Graphical Models as (multi-)omics data analysis tools. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2020; 1863:194418. [PMID: 31639475 PMCID: PMC7166149 DOI: 10.1016/j.bbagrm.2019.194418] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 05/31/2019] [Revised: 08/21/2019] [Accepted: 08/21/2019] [Indexed: 11/30/2022]
Abstract
Gaussian Graphical Models (GGMs) are tools to infer dependencies between biological variables. Popular applications are the reconstruction of gene, protein, and metabolite association networks. GGMs are an exploratory research tool that can be useful to discover interesting relations between genes (functional clusters) or to identify therapeutically interesting genes, but do not necessarily infer a network in the mechanistic sense. Although GGMs are well investigated from a theoretical and applied perspective, important extensions are not well known within the biological community. GGMs assume, for instance, multivariate normal distributed data. If this assumption is violated Mixed Graphical Models (MGMs) can be the better choice. In this review, we provide the theoretical foundations of GGMs, present extensions such as MGMs or multi-class GGMs, and illustrate how those methods can provide insight in biological mechanisms. We summarize several applications and present user-friendly estimation software. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks edited by Dr. Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.
Affiliation(s)
- Michael Altenbuchinger: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Antoine Weihs: Department of Psychiatry and Psychotherapy, University Medicine Greifswald, 17475 Greifswald, Germany
- John Quackenbush: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
- Hans Jörgen Grabe: Department of Psychiatry and Psychotherapy, University Medicine Greifswald, 17475 Greifswald, Germany; German Center for Neurodegenerative Diseases DZNE, Site Rostock/Greifswald, 17475 Greifswald, Germany
- Helena U Zacharias: Department of Psychiatry and Psychotherapy, University Medicine Greifswald, 17475 Greifswald, Germany
|
73
|
Susin A, Wang Y, Lê Cao KA, Calle ML. Variable selection in microbiome compositional data analysis. NAR Genom Bioinform 2020; 2:lqaa029. [PMID: 33575585 PMCID: PMC7671404 DOI: 10.1093/nargab/lqaa029] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 11/30/2019] [Revised: 03/13/2020] [Accepted: 04/29/2020] [Indexed: 12/25/2022] Open
Abstract
Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
Affiliation(s)
- Antoni Susin: Mathematical Department, UPC-Barcelona Tech, 08028 Barcelona, Spain
- Yiwen Wang: Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia
- Kim-Anh Lê Cao: Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia
- M Luz Calle: Biosciences Department, Faculty of Sciences and Technology, University of Vic - Central University of Catalonia, Carrer de la Laura, 13, 08500 Vic, Spain
|
74
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach.
Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia: Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
75
|
Louzada F, Shimizu TK, Suzuki AK. The Spike-and-Slab Lasso regression modeling with compositional covariates: An application on Brazilian children malnutrition data. Stat Methods Med Res 2020; 29:1434-1446. [PMID: 31333069 DOI: 10.1177/0962280219863817] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
There are considerable challenges in analyzing large-scale compositional data. In this paper, we introduce the Spike-and-Slab Lasso linear regression in the presence of compositional covariates for parameter estimation and variable selection. We consider the well-known isometric log-ratio (ilr) coordinates to avoid misleading statistical inference. The separable and non-separable (adaptative) Spike-and-Slab Lasso penalties are compared to verify the advantages of each approach. The proposed method is illustrated on simulated and on real Brazilian child malnutrition data.
Affiliation(s)
- Francisco Louzada: Department of Applied Mathematics & Statistics, ICMC, University of São Paulo, São Carlos, SP, Brazil
- Taciana Ko Shimizu: Department of Applied Mathematics & Statistics, ICMC, University of São Paulo, São Carlos, SP, Brazil
- Adriano K Suzuki: Department of Applied Mathematics & Statistics, ICMC, University of São Paulo, São Carlos, SP, Brazil
|
76
|
McGregor DE, Palarea-Albaladejo J, Dall PM, Hron K, Chastin S. Cox regression survival analysis with compositional covariates: Application to modelling mortality risk from 24-h physical activity patterns. Stat Methods Med Res 2020; 29:1447-1465. [PMID: 31342855 DOI: 10.1177/0962280219864125] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 11/15/2022]
Abstract
Survival analysis is commonly conducted in medical and public health research to assess the association of an exposure or intervention with a hard end outcome such as mortality. The Cox (proportional hazards) regression model is probably the most popular statistical tool used in this context. However, when the exposure includes compositional covariables (that is, variables representing a relative makeup such as a nutritional or physical activity behaviour composition), some basic assumptions of the Cox regression model and associated significance tests are violated. Compositional variables involve an intrinsic interplay between one another which precludes results and conclusions based on considering them in isolation as is ordinarily done. In this work, we introduce a formulation of the Cox regression model in terms of log-ratio coordinates which suitably deals with the constraints of compositional covariates, facilitates the use of common statistical inference methods, and allows for scientifically meaningful interpretations. We illustrate its practical application to a public health problem: the estimation of the mortality hazard associated with the composition of daily activity behaviour (physical activity, sitting time and sleep) using data from the U.S. National Health and Nutrition Examination Survey (NHANES).
Affiliation(s)
- D E McGregor: School of Health and Life Science, Glasgow Caledonian University, Glasgow, UK; Biomathematics and Statistics Scotland, Edinburgh, UK
- P M Dall: School of Health and Life Science, Glasgow Caledonian University, Glasgow, UK
- K Hron: Faculty of Science, Department of Mathematical Analysis and Applications of Mathematics, Palacký University Olomouc, Olomouc, Czech Republic
- Sfm Chastin: School of Health and Life Science, Glasgow Caledonian University, Glasgow, UK; Department of Movement and Sport Science, Ghent University, Ghent, Belgium
|
77
|
Interpretable Log Contrasts for the Classification of Health Biomarkers: a New Approach to Balance Selection. mSystems 2020; 5:e00230-19. [PMID: 32265314 PMCID: PMC7141889 DOI: 10.1128/msystems.00230-19] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 02/06/2023] Open
Abstract
High-throughput sequencing provides an easy and cost-effective way to measure the relative abundance of bacteria in any environmental or biological sample. When these samples come from humans, the microbiome signatures can act as biomarkers for disease prediction. However, because bacterial abundance is measured as a composition, the data have unique properties that make conventional analyses inappropriate. To overcome this, analysts often use cumbersome normalizations. This article proposes an alternative method that identifies pairs and trios of bacteria whose stoichiometric presence can differentiate between diseased and nondiseased samples. By using interpretable log contrasts called balances, we developed an entirely normalization-free classification procedure that reduces the feature space and improves the interpretability, without sacrificing classifier performance. Since the turn of the century, technological advances have made it possible to obtain the molecular profile of any tissue in a cost-effective manner. Among these advances are sophisticated high-throughput assays that measure the relative abundances of microorganisms, RNA molecules, and metabolites. While these data are most often collected to gain new insights into biological systems, they can also be used as biomarkers to create clinically useful diagnostic classifiers. How best to classify high-dimensional -omics data remains an area of active research. However, few explicitly model the relative nature of these data and instead rely on cumbersome normalizations. This report (i) emphasizes the relative nature of health biomarkers, (ii) discusses the literature surrounding the classification of relative data, and (iii) benchmarks how different transformations perform for regularized logistic regression across multiple biomarker types. We show how an interpretable set of log contrasts, called balances, can prepare data for classification. 
We propose a simple procedure, called discriminative balance analysis, to select groups of 2 and 3 bacteria that can together discriminate between experimental conditions. Discriminative balance analysis is a fast, accurate, and interpretable alternative to data normalization.
|
78
|
|
79
|
Bar HY, Booth JG, Wells MT. A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models. J Comput Graph Stat 2020; 29:535-546. [PMID: 38919169 PMCID: PMC11198964 DOI: 10.1080/10618600.2019.1706542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2018] [Revised: 04/20/2019] [Accepted: 12/11/2019] [Indexed: 10/25/2022]
Abstract
A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, in addition to treating the coefficients as random effects facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a Generalized Alternating Maximization algorithm which is scalable, and leads to significantly faster convergence compared with simulation-based fully Bayesian methods.
Affiliation(s)
- Haim Y Bar: Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
- James G Booth: Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853, USA
- Martin T Wells: Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853, USA
|
80
|
Lausser L, Szekely R, Klimmek A, Schmid F, Kestler HA. Constraining classifiers in molecular analysis: invariance and robustness. J R Soc Interface 2020; 17:20190612. [PMID: 32019472 PMCID: PMC7061712 DOI: 10.1098/rsif.2019.0612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/03/2020] [Accepted: 01/09/2020] [Indexed: 12/02/2022] Open
Abstract
Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Also, improper reference point choice and scaling pose additional challenges. Often model selection is somewhat guided by ad hoc simulations rather than by sophisticated considerations on the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We can further show that these concept classes also form a half-order of complexity classes in terms of Vapnik-Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.
Collapse
Affiliation(s)
- Ludwig Lausser
- Institute of Medical Systems Biology, Ulm University, Ulm, Germany
| | - Robin Szekely
- Institute of Medical Systems Biology, Ulm University, Ulm, Germany
| | - Attila Klimmek
- Institute of Medical Systems Biology, Ulm University, Ulm, Germany
| | - Florian Schmid
- Institute of Medical Systems Biology, Ulm University, Ulm, Germany
| | - Hans A. Kestler
- Institute of Medical Systems Biology, Ulm University, Ulm, Germany
- Leibniz Institute on Aging, Jena, Germany
|
81
|
Cao Y, Zhang A, Li H. Multisample estimation of bacterial composition matrices in metagenomics data. Biometrika 2019. [DOI: 10.1093/biomet/asz062] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 01/09/2023] Open
Abstract
Summary
Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.
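One building block named in the abstract above, Euclidean projection onto the simplex, can be illustrated with the standard sort-based procedure. This is a minimal sketch for a single composition vector, not the authors' implementation; the function name `project_onto_simplex` is an assumption, and the nuclear-norm penalty and proximal gradient steps are not shown.

```python
def project_onto_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}.

    Sort-based method: find the shift theta such that clipping
    v - theta at zero yields entries that sum to one.
    """
    u = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        running_sum += u_j
        candidate = (running_sum - 1.0) / j
        # The condition holds for a prefix of the sorted values;
        # the last index where it holds defines theta.
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


# Hypothetical input: an unconstrained gradient-step result for 3 taxa.
projected = project_onto_simplex([0.5, 0.5, 0.5])
print(projected)
```

Inside a projected or proximal gradient loop, such a projection would be applied to each sample's composition after every update so that the estimate stays a valid vector of proportions.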
Affiliation(s)
- Yuanpei Cao
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
| | - Anru Zhang
- Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, Wisconsin 53706, U.S.A
| | - Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
|
82
|
Wang Y, LêCao KA. Managing batch effects in microbiome data. Brief Bioinform 2019; 21:1954-1970. [PMID: 31776547 DOI: 10.1093/bib/bbz105] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 04/07/2019] [Revised: 07/24/2019] [Indexed: 12/20/2022] Open
Abstract
Microbial communities have been increasingly studied in recent years to investigate their role in ecological habitats. However, microbiome studies are difficult to reproduce or replicate as they may suffer from confounding factors that are unavoidable in practice and originate from biological, technical or computational sources. In this review, we define batch effects as unwanted variation introduced by confounding factors that are not related to any factors of interest. Computational and analytical methods are required to remove or account for batch effects. However, inherent microbiome data characteristics (e.g. sparse, compositional and multivariate) challenge the development and application of batch effect adjustment methods to either account or correct for batch effects. We present commonly encountered sources of batch effects that we illustrate in several case studies. We discuss the limitations of current methods, which often have assumptions that are not met due to the peculiarities of microbiome data. We provide practical guidelines for assessing the efficiency of the methods based on visual and numerical outputs and a thorough tutorial to reproduce the analyses conducted in this review.
Affiliation(s)
- Yiwen Wang
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3052, Australia
| | - Kim-Anh LêCao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3052, Australia
|
83
|
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 83] [Impact Index Per Article: 16.6] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022] Open
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Courtney R Armour
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Chenxiao Hu
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Meng Mei
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Chuan Tian
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Thomas J Sharpton
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Yuan Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
|
84
|
A multi-source data integration approach reveals novel associations between metabolites and renal outcomes in the German Chronic Kidney Disease study. Sci Rep 2019; 9:13954. [PMID: 31562371 PMCID: PMC6764972 DOI: 10.1038/s41598-019-50346-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/21/2019] [Accepted: 09/09/2019] [Indexed: 01/25/2023] Open
Abstract
Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To this end, omics data are integrated with other data types, e.g., clinical, phenotypic, and demographic parameters of categorical or continuous nature. We exemplify this data integration issue for a chronic kidney disease (CKD) study, comprising complex clinical, demographic, and one-dimensional 1H nuclear magnetic resonance metabolic variables. Routine analysis screens for associations of single metabolic features with clinical parameters while accounting for confounders typically chosen by expert knowledge. This knowledge can be incomplete or unavailable. We introduce a framework for data integration that intrinsically adjusts for confounding variables. We give its mathematical and algorithmic foundation, provide a state-of-the-art implementation, and evaluate its performance by sanity checks and predictive performance assessment on independent test data. Particularly, we show that discovered associations remain significant after variable adjustment based on expert knowledge. In contrast, we illustrate that associations discovered in routine univariate screening approaches can be biased by incorrect or incomplete expert knowledge. Our data integration approach reveals important associations between CKD comorbidities and metabolites, including novel associations of the plasma metabolite trimethylamine-N-oxide with cardiac arrhythmia and infarction in CKD stage 3 patients.
|
85
|
Abstract
High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups, and by extension, productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R² = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling.
Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships. IMPORTANCE A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.
|
86
|
Wang C, Hu J, Blaser MJ, Li H. Estimating and testing the microbial causal mediation effect with high-dimensional and compositional microbiome data. Bioinformatics 2019; 36:347-355. [PMID: 31329243 PMCID: PMC7867996 DOI: 10.1093/bioinformatics/btz565] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 01/10/2019] [Revised: 06/17/2019] [Accepted: 07/16/2019] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION Recent microbiome association studies have revealed important associations between microbiome and disease/health status. Such findings encourage scientists to dive deeper to uncover the causal role of microbiome in the underlying biological mechanism, and have led to applying statistical models to quantify causal microbiome effects and to identify the specific microbial agents. However, there are no existing causal mediation methods specifically designed to handle high dimensional and compositional microbiome data. RESULTS We propose a rigorous Sparse Microbial Causal Mediation Model (SparseMCMM) specifically designed for the high dimensional and compositional microbiome data in a typical three-factor (treatment, microbiome and outcome) causal study design. In particular, a linear log-contrast regression model and a Dirichlet regression model are proposed to estimate the causal direct effect of treatment and the causal mediation effects of microbiome at both the community and individual taxon levels. Regularization techniques are used to perform the variable selection in the proposed model framework to identify signature causal microbes. Two hypothesis tests on the overall mediation effect are proposed and their statistical significance is estimated by permutation procedures. Extensive simulated scenarios show that SparseMCMM has excellent performance in estimation and hypothesis testing. Finally, we showcase the utility of the proposed SparseMCMM method in a study in which the murine microbiome has been manipulated, by providing a clear and sensible causal path among antibiotic treatment, microbiome composition and mouse weight. AVAILABILITY AND IMPLEMENTATION https://sites.google.com/site/huilinli09/software and https://github.com/chanw0/SparseMCMM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chan Wang
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
| | - Jiyuan Hu
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
| | - Martin J Blaser
- Department of Medicine and Microbiology, Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854-8021, USA
| | - Huilin Li
- To whom correspondence should be addressed.
|
88
|
Yoon G, Gaynanova I, Müller CL. Microbial Networks in SPRING - Semi-parametric Rank-Based Correlation and Partial Correlation Estimation for Quantitative Microbiome Data. Front Genet 2019; 10:516. [PMID: 31244881 PMCID: PMC6563871 DOI: 10.3389/fgene.2019.00516] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/18/2019] [Accepted: 05/13/2019] [Indexed: 12/15/2022] Open
Abstract
High-throughput microbial sequencing techniques, such as targeted amplicon-based and metagenomic profiling, provide low-cost genomic survey data of microbial communities in their natural environment, ranging from marine ecosystems to host-associated habitats. While standard microbiome profiling data can provide sparse relative abundances of operational taxonomic units or genes, recent advances in experimental protocols give a more quantitative picture of microbial communities by pairing sequencing-based techniques with orthogonal measurements of microbial cell counts from the same sample. These tandem measurements provide absolute microbial count data albeit with a large excess of zeros due to limited sequencing depth. In this contribution we consider the fundamental statistical problem of estimating correlations and partial correlations from such quantitative microbiome data. To this end, we propose a semi-parametric rank-based approach to correlation estimation that can naturally deal with the excess zeros in the data. Combining this estimator with sparse graphical modeling techniques leads to the Semi-Parametric Rank-based approach for INference in Graphical model (SPRING). SPRING enables inference of statistical microbial association networks from quantitative microbiome data which can serve as high-level statistical summary of the underlying microbial ecosystem and can provide testable hypotheses for functional species-species interactions. Due to the absence of verified microbial associations we also introduce a novel quantitative microbiome data generation mechanism which mimics empirical marginal distributions of measured count data while simultaneously allowing user-specified dependencies among the variables. SPRING shows superior network recovery performance on a wide range of realistic benchmark problems with varying network topologies and is robust to misspecifications of the total cell count estimate. 
To highlight SPRING's broad applicability we infer taxon-taxon associations from the American Gut Project data and genus-genus associations from a recent quantitative gut microbiome dataset. We believe that, as quantitative microbiome profiling data will become increasingly available, the semi-parametric estimators for correlation and partial correlation estimation introduced here provide an important tool for reliable statistical analysis of quantitative microbiome data.
Affiliation(s)
- Grace Yoon
- Department of Statistics, Texas A&M University, College Station, TX, United States
| | - Irina Gaynanova
- Department of Statistics, Texas A&M University, College Station, TX, United States
| | - Christian L. Müller
- Center for Computational Mathematics, Flatiron Institute, New York, NY, United States
|
89
|
Tang ZZ, Chen G, Hong Q, Huang S, Smith HM, Shah RD, Scholz M, Ferguson JF. Multi-Omic Analysis of the Microbiome and Metabolome in Healthy Subjects Reveals Microbiome-Dependent Relationships Between Diet and Metabolites. Front Genet 2019; 10:454. [PMID: 31164901 PMCID: PMC6534069 DOI: 10.3389/fgene.2019.00454] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 01/29/2019] [Accepted: 04/30/2019] [Indexed: 12/22/2022] Open
Abstract
The human microbiome has been associated with health status, and risk of disease development. While the etiology of microbiome-mediated disease remains to be fully elucidated, one mechanism may be through microbial metabolism. Metabolites produced by commensal organisms, including in response to host diet, may affect host metabolic processes, with potentially protective or pathogenic consequences. We conducted multi-omic phenotyping of healthy subjects (N = 136), in order to investigate the interaction between diet, the microbiome, and the metabolome in a cross-sectional sample. We analyzed the nutrient composition of self-reported diet (3-day food records and food frequency questionnaires). We profiled the gut and oral microbiome (16S rRNA) from stool and saliva, and applied metabolomic profiling to plasma and stool samples in a subset of individuals (N = 75). We analyzed these multi-omic data to investigate the relationship between diet, the microbiome, and the gut and circulating metabolome. On a global level, we observed significant relationships, particularly between long-term diet, the gut microbiome and the metabolome. Intake of plant-derived nutrients as well as consumption of artificial sweeteners were associated with significant differences in circulating metabolites, particularly bile acids, which were dependent on gut enterotype, indicating that microbiome composition mediates the effect of diet on host physiology. Our analysis identifies dietary compounds and phytochemicals that may modulate bacterial abundance within the gut and interact with microbiome composition to alter host metabolism.
Affiliation(s)
- Zheng-Zheng Tang
- Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI, United States
- Wisconsin Institute for Discovery, Madison, WI, United States
| | - Guanhua Chen
- Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI, United States
| | - Qilin Hong
- Department of Statistics, University of Wisconsin–Madison, Madison, WI, United States
| | - Shi Huang
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Holly M. Smith
- Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Rachana D. Shah
- Division of Pediatric Endocrinology, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
| | - Matthew Scholz
- Vanderbilt Technologies for Advanced Genomics (VANTAGE), Vanderbilt University Medical Center, Nashville, TN, United States
| | - Jane F. Ferguson
- Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
- Vanderbilt Translational and Clinical Cardiovascular Research Center (VTRACC), Vanderbilt University Medical Center, Nashville, TN, United States
|
90
|
Zhang J, Lin W. Scalable estimation and regularization for the logistic normal multinomial model. Biometrics 2019; 75:1098-1108. [PMID: 31009062 DOI: 10.1111/biom.13071] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/16/2018] [Revised: 02/15/2019] [Accepted: 04/02/2019] [Indexed: 11/29/2022]
Abstract
Clustered multinomial data are prevalent in a variety of applications such as microbiome studies, where metagenomic sequencing data are summarized as multinomial counts for a large number of bacterial taxa per subject. Count normalization with ad hoc zero adjustment tends to result in poor estimates of abundances for taxa with zero or small counts. To account for heterogeneity and overdispersion in such data, we suggest using the logistic normal multinomial (LNM) model with an arbitrary correlation structure to simultaneously estimate the taxa compositions by borrowing information across subjects. We overcome the computational difficulties in high dimensions by developing a stochastic approximation EM algorithm with Hamiltonian Monte Carlo sampling for scalable parameter estimation in the LNM model. The ill-conditioning problem due to unstructured covariance is further mitigated by a covariance-regularized estimator with a condition number constraint. The advantages of the proposed methods are illustrated through simulations and an application to human gut microbiome data.
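The logistic normal multinomial model places a multivariate normal distribution on log-ratios of the taxon proportions. The sketch below shows only the mapping from a latent log-ratio vector to a composition, here the inverse additive log-ratio transform with the last taxon as reference. The choice of reference taxon and the function name `inverse_alr` are assumptions made for illustration; no estimation, EM step, or sampling is performed.

```python
import math


def inverse_alr(z):
    """Map latent log-ratios z (length K-1) to K proportions.

    Assumes z[k] = log(p[k] / p[K-1]), i.e. the last taxon is the
    reference. Subtracting the maximum keeps exp() from overflowing.
    """
    shift = max(list(z) + [0.0])
    weights = [math.exp(value - shift) for value in z]
    weights.append(math.exp(-shift))  # reference taxon has log-ratio 0
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical latent vector for a 3-taxon community.
composition = inverse_alr([math.log(2.0), 0.0])
print(composition)
```

Because every proportion produced this way is strictly positive, a taxon with a zero count in one subject still receives a non-zero estimated abundance, which is the behaviour the abstract contrasts with ad hoc zero adjustment.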
Affiliation(s)
- Jingru Zhang
- Center for Statistical Science, School of Mathematical Sciences, Peking University, Beijing, China
| | - Wei Lin
- Center for Statistical Science, School of Mathematical Sciences, Peking University, Beijing, China
|
91
|
Wang T, Yang C, Zhao H. Prediction analysis for microbiome sequencing data. Biometrics 2019; 75:875-884. [PMID: 30994187 DOI: 10.1111/biom.13061] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/25/2017] [Revised: 03/08/2019] [Accepted: 03/13/2019] [Indexed: 01/22/2023]
Abstract
One goal of human microbiome studies is to relate host traits with human microbiome compositions. The analysis of microbial community sequencing data presents great statistical challenges, especially when the samples have different library sizes and the data are overdispersed with many zeros. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR), to analyze microbiome sequencing data. Within this framework, an inverse regression model is developed for overdispersed microbiota counts given the trait, and then a prediction rule is constructed by taking advantage of the dimension-reduction structure in the model. An efficient Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation. The method is further generalized to accommodate other types of covariates. We demonstrate the advantages of PAMIR through simulations and two real data examples.
Affiliation(s)
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
- MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
| | - Hongyu Zhao
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Department of Biostatistics, Yale University, New Haven, Connecticut
|
92
|
Bates S, Tibshirani R. Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 2019; 75:613-624. [PMID: 30387139 DOI: 10.1111/biom.12995] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/05/2018] [Accepted: 10/16/2018] [Indexed: 11/28/2022]
Abstract
Positive-valued signal data is common in the biological and medical sciences, due to the prevalence of mass spectrometry and other imaging techniques. With such data, only the relative intensities of the raw measurements are meaningful. It is desirable to consider models consisting of the log-ratios of all pairs of the raw features, since log-ratios are the simplest meaningful derived features. In this case, however, the dimensionality of the predictor space becomes large, and computationally efficient estimation procedures are required. In this work, we introduce an embedding of the log-ratio parameter space into a space of much lower dimension and use this representation to develop an efficient penalized fitting procedure. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set, the proposed method fits a highly sparse model consisting of features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.
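The dimensionality problem described in the abstract above can be made concrete by expanding one sample into all pairwise log-ratios. This is a minimal sketch of the naive feature expansion only, not the low-dimensional embedding or the fitting procedure of the paper; the function name `pairwise_log_ratios` and the example intensities are assumptions.

```python
import math
from itertools import combinations


def pairwise_log_ratios(x):
    """Return {(i, j): log(x[i] / x[j])} for all feature pairs i < j.

    x must contain strictly positive raw intensities.
    """
    if any(value <= 0 for value in x):
        raise ValueError("all intensities must be strictly positive")
    return {
        (i, j): math.log(x[i]) - math.log(x[j])
        for i, j in combinations(range(len(x)), 2)
    }


# Hypothetical raw intensities for p = 4 features.
raw = [2.0, 4.0, 8.0, 1.0]
ratios = pairwise_log_ratios(raw)

# Rescaling the whole sample leaves every log-ratio unchanged,
# which is why only relative intensities matter.
rescaled = pairwise_log_ratios([10.0 * value for value in raw])

print(len(ratios))  # p * (p - 1) / 2 derived features
```

With p raw features the expansion has p(p - 1)/2 columns, so a few thousand raw features already give millions of candidate predictors.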
Affiliation(s)
| | - Robert Tibshirani
- Departments of Biomedical Data Science and Statistics, Stanford University
|
94
|
Wang H, Wang Z, Wang S. Sliced inverse regression method for multivariate compositional data modeling. Stat Pap (Berl) 2019. [DOI: 10.1007/s00362-019-01093-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
|
95
|
Towards Quantitative Microbiome Community Profiling Using Internal Standards. Appl Environ Microbiol 2019; 85:AEM.02634-18. [PMID: 30552195 DOI: 10.1128/aem.02634-18] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 10/31/2018] [Accepted: 12/10/2018] [Indexed: 12/16/2022] Open
Abstract
An inherent issue in high-throughput rRNA gene tag sequencing microbiome surveys is that they provide compositional data in relative abundances. This often leads to spurious correlations, making the interpretation of relationships to biogeochemical rates challenging. To overcome this issue, we quantitatively estimated the abundance of microorganisms by spiking in known amounts of internal DNA standards. Using a 3-year sample set of diverse microbial communities from the Western Antarctica Peninsula, we demonstrated that the internal standard method yielded community profiles and taxon cooccurrence patterns substantially different from those derived using relative abundances. We found that the method provided results consistent with the traditional CHEMTAX analysis of pigments and total bacterial counts by flow cytometry. Using the internal standard method, we also showed that chloroplast 16S rRNA gene data in microbial surveys can be used to estimate abundances of certain eukaryotic phototrophs such as cryptophytes and diatoms. In Phaeocystis, scatter in the 16S/18S rRNA gene ratio may be explained by physiological adaptation to environmental conditions. We conclude that the internal standard method, when applied to rRNA gene microbial community profiling, is quantitative and that its application will substantially improve our understanding of microbial ecosystems. IMPORTANCE High-throughput-sequencing-based marine microbiome profiling is rapidly expanding and changing how we study the oceans. Although powerful, the technique is not fully quantitative; it provides taxon counts only in relative abundances. In order to address this issue, we present a method to quantitatively estimate microbial abundances per unit volume of seawater filtered by spiking known amounts of internal DNA standards into each sample.
We validated this method by comparing the calculated abundances to other independent estimates, including chemical markers (pigments) and total bacterial cell counts by flow cytometry. The internal standard approach allows us to quantitatively estimate and compare marine microbial community profiles, with important implications for linking environmental microbiomes to quantitative processes such as metabolic and biogeochemical rates.
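The arithmetic behind a spike-in calculation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' published pipeline: it assumes that reads scale linearly with gene copies, that the standard is recovered with the same efficiency as sample DNA, and that all names (`absolute_abundance`, the argument names, the example numbers) are hypothetical.

```python
def absolute_abundance(taxon_reads, standard_reads,
                       standard_copies_added, volume_filtered_l):
    """Convert read counts to gene copies per litre using a spike-in.

    taxon_reads: dict mapping taxon name to read count in the sample.
    standard_reads: reads assigned to the internal standard.
    standard_copies_added: known copies of the standard spiked in.
    volume_filtered_l: litres of seawater filtered for the sample.
    """
    if standard_reads <= 0 or volume_filtered_l <= 0:
        raise ValueError("standard reads and volume must be positive")
    # Each read is taken to represent this many gene copies.
    copies_per_read = standard_copies_added / standard_reads
    return {
        taxon: reads * copies_per_read / volume_filtered_l
        for taxon, reads in taxon_reads.items()
    }


# Hypothetical sample: 1e6 standard copies added, 2 L filtered.
sample = {"taxonA": 300, "taxonB": 100}
per_litre = absolute_abundance(sample, standard_reads=50,
                               standard_copies_added=1e6,
                               volume_filtered_l=2.0)
print(per_litre)
```

Two samples with identical relative abundances can differ in these per-litre estimates whenever their standard read counts differ, which is the information that relative abundances alone do not carry.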
|
96
|
Sinha R, Ahsan H, Blaser M, Caporaso JG, Carmical JR, Chan AT, Fodor A, Gail MH, Harris CC, Helzlsouer K, Huttenhower C, Knight R, Kong HH, Lai GY, Hutchinson DLS, Le Marchand L, Li H, Orlich MJ, Shi J, Truelove A, Verma M, Vogtmann E, White O, Willett W, Zheng W, Mahabir S, Abnet C. Next steps in studying the human microbiome and health in prospective studies, Bethesda, MD, May 16-17, 2017. Microbiome 2018; 6:210. [PMID: 30477563 PMCID: PMC6257978 DOI: 10.1186/s40168-018-0596-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 05/22/2018] [Accepted: 11/15/2018] [Indexed: 06/09/2023]
Abstract
The National Cancer Institute (NCI) sponsored a 2-day workshop, "Next Steps in Studying the Human Microbiome and Health in Prospective Studies," in Bethesda, Maryland, May 16-17, 2017. The workshop brought together researchers in the field to discuss the challenges of conducting microbiome studies, including study design, collection and processing of samples, bioinformatics and statistical methods, publishing results, and ensuring reproducibility of published results. The presenters emphasized the great potential of microbiome research in understanding the etiology of cancer. This report summarizes the workshop and presents practical suggestions for conducting microbiome studies, from workshop presenters, moderators, and participants.
Affiliation(s)
- Rashmi Sinha
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, USA
- Habibul Ahsan
  - Comprehensive Cancer Center University of Chicago Medicine and Biological Sciences, Chicago, IL, 60615, USA
- Martin Blaser
  - Departments of Medicine and Microbiology, New York University Langone Medical Center, New York, NY, 10016, USA
- J Gregory Caporaso
  - Pathogen and Microbiome Institute and Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, 86011, USA
- Joseph Russell Carmical
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
- Andrew T Chan
  - Clinical and Translational Epidemiology Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA
  - Division of Gastroenterology, Massachusetts General Hospital, Boston, MA, 02114, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, and Harvard Medical School, Boston, MA, 02115, USA
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, 02142, USA
- Anthony Fodor
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Mitchell H Gail
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, USA
- Curtis C Harris
  - Laboratory of Human Carcinogenesis, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Kathy Helzlsouer
  - Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Curtis Huttenhower
  - Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, 02142, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Rob Knight
  - Center for Microbiome Innovation, and Departments of Pediatrics and Computer Science and Engineering, University of California San Diego, San Diego, CA, 92093, USA
- Heidi H Kong
  - Dermatology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Gabriel Y Lai
  - Environmental Epidemiology Branch, National Cancer Institute, Bethesda, MD, 20892, USA
- Diane Leigh Smith Hutchinson
  - Alkek Center for Metagenomics and Microbiome Research, Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
- Loic Le Marchand
  - Cancer Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA
- Michael J Orlich
  - School of Public Health and Department of Preventive Medicine, School of Medicine, Loma Linda University, Loma Linda, CA, 92350, USA
- Jianxin Shi
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, USA
- Mukesh Verma
  - Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Emily Vogtmann
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, USA
- Owen White
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Walter Willett
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, and Harvard Medical School, Boston, MA, 02115, USA
  - Departments of Epidemiology and Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Wei Zheng
  - Division of Epidemiology, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Somdat Mahabir
  - Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Christian Abnet
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, USA
|
97
|
Zacharias HU, Altenbuchinger M, Gronwald W. Statistical Analysis of NMR Metabolic Fingerprints: Established Methods and Recent Advances. Metabolites 2018; 8:E47. [PMID: 30154338 PMCID: PMC6161311 DOI: 10.3390/metabo8030047] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 07/02/2018] [Revised: 08/01/2018] [Accepted: 08/18/2018] [Indexed: 01/02/2023] Open
Abstract
In this review, we summarize established and recent bioinformatic and statistical methods for the analysis of NMR-based metabolomics. Data analysis of NMR metabolic fingerprints poses several challenges, including unwanted biases, high dimensionality, and typically low sample numbers. Common analysis tasks comprise the identification of differential metabolites and the classification of specimens. However, analysis results strongly depend on the preprocessing of the data, and there is no consensus yet on how to remove unwanted biases and experimental variance prior to statistical analysis. Here, we first review established and new preprocessing protocols and illustrate their pros and cons, including different data normalizations and transformations. Second, we give a brief overview of state-of-the-art statistical analysis in NMR-based metabolomics. Finally, we discuss a recent development in statistical data analysis, where data normalization becomes obsolete. This method, called zero-sum regression, builds metabolite signatures whose estimation as well as predictions are independent of prior normalization.
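Why a zero-sum constraint makes normalization unnecessary can be seen in a few lines. The sketch below assumes a linear signature on log-transformed intensities whose coefficients sum to zero; the helper name, the coefficient values, and the sample values are made up for illustration and are not taken from the review.

```python
import math


def linear_score(intensities, beta, intercept=0.0):
    """Linear signature evaluated on log-transformed feature intensities."""
    return intercept + sum(b * math.log(x) for b, x in zip(beta, intensities))


zero_sum_beta = [1.0, -0.5, -0.5]   # coefficients sum to zero
free_beta = [1.0, -0.5, 0.25]       # coefficients do not sum to zero

sample = [12.0, 3.0, 7.5]
# Rescaling a whole sample (e.g. a dilution or normalization factor)
# adds log(4.2) to every log-intensity.
rescaled = [4.2 * x for x in sample]

# With zero-sum coefficients the added constant cancels out.
print(linear_score(sample, zero_sum_beta), linear_score(rescaled, zero_sum_beta))
# Without the constraint the score shifts by log(4.2) * sum(free_beta).
print(linear_score(sample, free_beta), linear_score(rescaled, free_beta))
```

Multiplying a sample by a constant c shifts the score by log(c) times the sum of the coefficients, so the shift vanishes exactly when that sum is zero.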
Affiliation(s)
- Helena U Zacharias
  - Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany.
- Michael Altenbuchinger
  - Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Am Biopark 9, 93053 Regensburg, Germany.
- Wolfram Gronwald
  - Institute of Functional Genomics, University of Regensburg, Am Biopark 9, 93053 Regensburg, Germany.
|
98
|
Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, Maathuis MH, Moreau Y, Murphy SA, Przytycka TM, Rebhan M, Röst H, Schuppert A, Schwab M, Spang R, Stekhoven D, Sun J, Weber A, Ziemek D, Zupan B. From hype to reality: data science enabling personalized medicine. BMC Med 2018; 16:150. [PMID: 30145981 PMCID: PMC6109989 DOI: 10.1186/s12916-018-1122-7] [Citation(s) in RCA: 187] [Impact Index Per Article: 31.2] [Received: 03/28/2018] [Accepted: 07/09/2018] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named Artificial Intelligence in the mainstream media). While during recent years there has been a lot of enthusiasm about the potential of 'big data' and machine learning-based solutions, there exist only a few examples that impact current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future. CONCLUSIONS There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance more to provide direct benefit to clinical practice.
Affiliation(s)
- Holger Fröhlich
  - UCB Biosciences GmbH, Alfred-Nobel-Str. 10, 40789 Monheim, Germany
  - University of Bonn, Bonn-Aachen International Center for IT, Endenicher Allee 19c, 53115 Bonn, Germany
- Rudi Balling
  - University of Luxembourg, 6 avenue du Swing, 4367 Belvaux, Luxembourg
- Niko Beerenwinkel
  - Department of Biosciences and Engineering, ETH Zurich, Mattenstr. 26, 4058 Basel, Switzerland
- Oliver Kohlbacher
  - University of Tübingen, WSI/ZBIT, Sand 14, 72076 Tübingen, Germany
  - Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, 72076 Tübingen, Germany
  - Quantitative Biology Center, University of Tübingen, Auf der Morgenstelle 8, 72076 Tübingen, Germany
  - Institute for Translational Bioinformatics, University Medical Center Tübingen, Sand 14, 72076 Tübingen, Germany
- Santosh Kumar
  - Department of Computer Science, University of Memphis, 2222 Dunn Hall, Memphis, TN 38152 USA
- Thomas Lengauer
  - Max-Planck-Institute for Informatics, 66123 Saarbrücken, Germany
- Marloes H. Maathuis
  - ETH Zurich, Seminar für Statistik, Rämistrasse 101, 8092 Zurich, Switzerland
- Yves Moreau
  - University of Leuven, ESAT, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
- Susan A. Murphy
  - Harvard University, Science Center 400 Suite, Oxford Street, Cambridge, MA 02138-2901 USA
- Teresa M. Przytycka
  - National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894-6075 USA
- Michael Rebhan
  - Novartis Institutes for Biomedical Research, 4056 Basel, Switzerland
- Hannes Röst
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON M5S 3E1 Canada
- Andreas Schuppert
  - RWTH Aachen, Joint Research Center for Computational Biomedicine, Pauwelsstrasse 19, 52074 Aachen, Germany
- Matthias Schwab
  - Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Auerbachstrasse 112, 70376 Stuttgart, Germany
  - University of Tübingen, Departments of Clinical Pharmacology and of Pharmacy and Biochemistry, Tübingen, Germany
- Rainer Spang
  - University of Regensburg, Institute of Functional Genomics, Am BioPark 9, 93053 Regensburg, Germany
- Daniel Stekhoven
  - ETH Zurich, NEXUS Personalized Health Technol., Otto-Stern-Weg 7, 8093 Zurich, Switzerland
- Jimeng Sun
  - Georgia Tech University, 801 Atlantic Drive, Atlanta, GA 30332-0280 USA
- Andreas Weber
  - Institute for Computer Science, University of Bonn, Endenicher Allee 19a, 53115 Bonn, Germany
- Daniel Ziemek
  - Pfizer, Worldwide Research and Development, Linkstraße 10, 10785 Berlin, Germany
- Blaz Zupan
  - Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, Slovenia
|
99
|
Lu J, Shi P, Li H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics 2018; 75:235-244. [PMID: 30039859 DOI: 10.1111/biom.12956] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 11/01/2017] [Revised: 06/01/2018] [Accepted: 06/01/2018] [Indexed: 01/04/2023]
Abstract
Motivated by regression analysis for microbiome compositional data, this article considers generalized linear regression analysis with compositional covariates, where a group of linear constraints on regression coefficients is imposed to account for the compositional nature of the data and to achieve subcompositional coherence. A penalized likelihood estimation procedure using a generalized accelerated proximal gradient method is developed to efficiently estimate the regression coefficients. A de-biased procedure is developed to obtain asymptotically unbiased and normally distributed estimates, which leads to valid confidence intervals of the regression coefficients. Simulation results show the correctness of the coverage probability of the confidence intervals and smaller variances of the estimates when the appropriate linear constraints are imposed. The methods are illustrated by a microbiome study in order to identify bacterial species that are associated with inflammatory bowel disease (IBD) and to predict IBD using fecal microbiome.
Affiliation(s)
- Jiarui Lu
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Pixu Shi
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
|
100
|
Gaines BR, Kim J, Zhou H. Algorithms for Fitting the Constrained Lasso. J Comput Graph Stat 2018; 27:861-871. [PMID: 30618485 PMCID: PMC6320228 DOI: 10.1080/10618600.2018.1473777] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 11/01/2016] [Revised: 03/01/2018] [Indexed: 01/22/2023]
Abstract
We compare alternative computing strategies for solving the constrained lasso problem. As its name suggests, the constrained lasso extends the widely-used lasso to handle linear constraints, which allow the user to incorporate prior information into the model. In addition to quadratic programming, we employ the alternating direction method of multipliers (ADMM) and also derive an efficient solution path algorithm. Through both simulations and benchmark data examples, we compare the different algorithms and provide practical recommendations in terms of efficiency and accuracy for various sizes of data. We also show that, for an arbitrary penalty matrix, the generalized lasso can be transformed to a constrained lasso, while the converse is not true. Thus, our methods can also be used for estimating a generalized lasso, which has wide-ranging applications. Code for implementing the algorithms is freely available in both the Matlab toolbox SparseReg and the Julia package ConstrainedLasso. Supplementary materials for this article are available online.
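For orientation, the constrained lasso problem described in this abstract is commonly written in the form below. The abstract itself gives no formula, so the notation here (response y, design matrix X, coefficients β, tuning parameter ρ, and constraint pairs A, b and C, d) is assumed for illustration.

```latex
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \quad
  & \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \rho\,\lVert \beta \rVert_1 \\
\text{subject to} \quad
  & A\beta = b, \qquad C\beta \le d
\end{aligned}
```

As one example of how prior information enters through the constraints, taking A to be a single row of ones with b = 0 and no inequality constraints forces the coefficients to sum to zero.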
Affiliation(s)
- Brian R Gaines
  - Department of Statistics, North Carolina State University
- Juhyun Kim
  - Department of Biostatistics, University of California, Los Angeles (UCLA)
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles (UCLA)
|