1
Stephen MA, Burke CR, Pryce JE, Steele NM, Amer PR, Meier S, Phyn CVC, Garrick DJ. Comparison of methods for deriving phenotypes from incomplete observation data with an application to age at puberty in dairy cattle. J Anim Sci Biotechnol 2023; 14:119. [PMID: 37684681] [PMCID: PMC10492402] [DOI: 10.1186/s40104-023-00921-5]
Abstract
BACKGROUND Many phenotypes in animal breeding are derived from incomplete measures, especially if they are challenging or expensive to measure precisely. Examples include time-dependent traits such as reproductive status or lifespan. Incomplete measures for these traits result in phenotypes that are subject to left-, interval- and right-censoring, where phenotypes are known only to fall below an upper bound, between a lower and upper bound, or above a lower bound, respectively. Here we compare three methods for deriving phenotypes from incomplete data using age at first elevation (> 1 ng/mL) in blood plasma progesterone (AGEP4), which generally coincides with onset of puberty, as an example trait. METHODS We produced AGEP4 phenotypes from three blood samples collected at about 30-day intervals from approximately 5,000 Holstein-Friesian or Holstein-Friesian × Jersey crossbred dairy heifers managed in 54 seasonal-calving, pasture-based herds in New Zealand. We used these actual data to simulate seven different visit scenarios, increasing the extent of censoring by disregarding data from one or two of the three visits. Three methods for deriving phenotypes from these data were explored: 1) ordinal categorical variables, analysed using categorical threshold analysis; 2) continuous variables, with a penalty of 31 d assigned to right-censored phenotypes; and 3) continuous variables, sampled from within a lower and upper bound using a data augmentation approach. RESULTS Credibility intervals for heritability estimates overlapped across all methods and visit scenarios, but estimated heritabilities tended to be higher when left censoring was reduced. For sires with at least 5 daughters, the correlations between estimated breeding values (EBVs) from our three-visit scenario and each reduced-data scenario varied by method, ranging from 0.65 to 0.95. The estimated breed effects also varied by method, but breed differences were smaller as phenotype censoring increased.
CONCLUSION Our results indicate that using some methods, phenotypes derived from one observation per offspring for a time-dependent trait such as AGEP4 may provide comparable sire rankings to three observations per offspring. This has implications for the design of large-scale phenotyping initiatives where animal breeders aim to estimate variance parameters and estimated breeding values (EBVs) for phenotypes that are challenging to measure or prohibitively expensive.
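The data-augmentation approach of method 3 amounts to a single conditional draw: a censored phenotype is replaced by a sample from its conditional distribution truncated to the bounds implied by the sampling visits. A minimal sketch in Python, assuming a normal conditional; `mu` and `sigma` are illustrative placeholders, not estimates from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def augment_censored(mu, sigma, lower, upper):
    # Truncated-normal draw: interval censoring uses two finite bounds,
    # left censoring lower = -inf, right censoring upper = +inf.
    a, b = (lower - mu) / sigma, (upper - mu) / sigma
    return stats.truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)

# A heifer known to have reached puberty between visits at day 330 and day 360:
draw = augment_censored(mu=350.0, sigma=20.0, lower=330.0, upper=360.0)
```

Iterating draws like this inside a Gibbs sampler lets the censored records contribute to variance-component estimation as if they were fully observed.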
Affiliation(s)
- Melissa A Stephen
- DairyNZ Ltd, 605 Ruakura Road, Hamilton, 3240, New Zealand
- AL Rae Centre for Genetics and Breeding, Massey University, Ruakura, Hamilton, 3214, New Zealand
- Chris R Burke
- DairyNZ Ltd, 605 Ruakura Road, Hamilton, 3240, New Zealand
- Jennie E Pryce
- Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, Victoria, 3083, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, 3083, Australia
- Susanne Meier
- DairyNZ Ltd, 605 Ruakura Road, Hamilton, 3240, New Zealand
- Dorian J Garrick
- AL Rae Centre for Genetics and Breeding, Massey University, Ruakura, Hamilton, 3214, New Zealand
2
Li N, Zhu W. A Bayesian approach for subgroup analysis. Biom J 2023; 65:e2200231. [PMID: 36908004] [DOI: 10.1002/bimj.202200231]
Abstract
Several penalization approaches have been developed to identify homogeneous subgroups in subgroup analysis based on a regression model with subject-specific intercepts. These methods often apply concave penalty functions to pairwise differences of the intercepts, so that subjects with similar intercept values are assigned to the same group, a procedure closely analogous to penalization approaches for variable selection. Since Bayesian methods are commonly used in variable selection, it is worth considering corresponding approaches to subgroup analysis in the Bayesian framework. In this paper, a Bayesian hierarchical model with appropriate prior structures is developed for the pairwise differences of intercepts in a regression model with subject-specific intercepts, which can automatically detect and identify homogeneous subgroups. A Gibbs sampling algorithm is also provided to select the hyperparameter and simultaneously estimate the intercepts and covariate coefficients; for pairwise comparisons it is computationally more efficient than the time-consuming parameter-estimation procedures of the penalization methods (e.g., the alternating direction method of multipliers) when sample sizes are large. The effectiveness and usefulness of the proposed Bayesian method are evaluated through simulation studies and an analysis of the Cleveland heart disease dataset.
Affiliation(s)
- Nan Li
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, China
- Wensheng Zhu
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, China
3
Wei R, Wang J. Left-Censored Missing Value Imputation Approach for MS-Based Proteomics Data with GSimp. Methods Mol Biol 2023; 2426:119-129. [PMID: 36308687] [DOI: 10.1007/978-1-0716-1967-4_6]
Abstract
Missing values caused by the limit of detection or quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based omics studies and are recognized as missing not at random (MNAR). MNAR leads to biased statistical estimates and jeopardizes downstream analyses. Although a wide range of missing value imputation methods has been developed for omics studies, only a limited number are designed appropriately for MNAR. To facilitate MS-based omics studies, we introduce GSimp, a Gibbs sampler-based missing value imputation approach, to handle left-censored missing values in MS-based proteomics datasets. In this chapter, we explain MNAR and elucidate the usage of GSimp for MNAR in detail.
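The core conditional step of LOD-driven imputation can be sketched simply: a value known only to lie below the detection limit is drawn from a model-based distribution truncated above at the LOD. This is a simplified illustration assuming a known normal conditional, not the GSimp package's API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def impute_below_lod(lod, mu, sigma, size):
    # Left-censored values lie in (-inf, lod), so they are drawn
    # from N(mu, sigma) truncated to that interval.
    b = (lod - mu) / sigma
    return stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                               size=size, random_state=rng)

imputed = impute_below_lod(lod=2.0, mu=3.0, sigma=1.0, size=500)
```

In a full Gibbs scheme such as GSimp's, `mu` and `sigma` would themselves be updated from the other features at each iteration rather than held fixed.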
Affiliation(s)
- Runmin Wei
- The University of Texas MD Anderson Cancer Center, Department of Genetics, Houston, TX, USA
4
Wang Y, Wang W, Tang Y. A Bayesian semiparametric accelerate failure time mixture cure model. Int J Biostat 2022; 18:473-485. [PMID: 34592069] [DOI: 10.1515/ijb-2021-0012]
Abstract
The accelerated failure time mixture cure (AFTMC) model is widely used for survival data when a proportion of patients can be cured. In this paper, a Bayesian semiparametric method is proposed to estimate the parameters and density functions of both the cure probability and the survival distribution of the uncured patients in the AFTMC model. Specifically, the baseline error distribution of the uncured patients is modeled nonparametrically by a Dirichlet process mixture. Based on the stick-breaking formulation of the Dirichlet process and the techniques of retrospective and slice sampling, an efficient and easy-to-implement Gibbs sampler is developed for posterior computation. The proposed approach can be implemented easily in commonly used statistical software, and comprehensive simulation studies show that its performance is comparable to that of a fully parametric method. The approach is also applied to the analysis of data from a colorectal cancer clinical trial.
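The stick-breaking formulation mentioned above can be written in a few lines: mixture weights are built by repeatedly breaking off Beta(1, α)-distributed fractions of the remaining "stick". A truncated sketch (the paper's sampler additionally uses retrospective and slice sampling, which are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_atoms):
    # v_k ~ Beta(1, alpha); w_k = v_k * prod_{j<k} (1 - v_j).
    # Truncation at n_atoms leaves a small amount of unassigned mass.
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

w = stick_breaking(alpha=2.0, n_atoms=50)
```

Pairing each weight with an atom drawn from a base measure yields a draw from a (truncated) Dirichlet process, the building block of the nonparametric baseline error distribution.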
Affiliation(s)
- Yijun Wang
- School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, Zhejiang Province, 310018, People's Republic of China; Collaborative Innovation Center of Statistical Data Engineering, Technology & Application, Zhejiang Gongshang University, Hangzhou, Zhejiang Province, 310018, People's Republic of China
- Weiwei Wang
- School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, Zhejiang Province, 310018, People's Republic of China; Collaborative Innovation Center of Statistical Data Engineering, Technology & Application, Zhejiang Gongshang University, Hangzhou, Zhejiang Province, 310018, People's Republic of China
- Yincai Tang
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, School of Statistics, East China Normal University, Shanghai, 200062, People's Republic of China
5
Tang A, Duan X, Zhao Y. Bayesian Variable Selection and Estimation in Semiparametric Simplex Mixed-Effects Models with Longitudinal Proportional Data. Entropy (Basel) 2022; 24:1466. [PMID: 37420486] [DOI: 10.3390/e24101466]
Abstract
In the development of simplex mixed-effects models, the random effects are generally assumed to be normally distributed. This normality assumption may be violated when analysing skewed and multimodal longitudinal data. In this paper, we adopt the centered Dirichlet process mixture model (CDPMM) to specify the random effects in simplex mixed-effects models. Combining the block Gibbs sampler and the Metropolis-Hastings algorithm, we extend the Bayesian Lasso (BLasso) to simultaneously estimate the unknown parameters of interest and select important covariates with nonzero effects in semiparametric simplex mixed-effects models. Several simulation studies and a real example are used to illustrate the proposed methodologies.
Affiliation(s)
- Anmin Tang
- Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650091, China
- Xingde Duan
- Department of Mathematics and Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China
- Yuanying Zhao
- College of Mathematics and Information Science, Guiyang University, Guiyang 550005, China
6
Bazzi F, Mescam M, Diab A, Falou O, Amoud H, Basarab A, Kouamé D. Marmoset brain segmentation from deconvolved magnetic resonance images and estimated label maps. Magn Reson Med 2021; 86:2766-2779. [PMID: 34170032] [DOI: 10.1002/mrm.28881]
Abstract
PURPOSE The proposed method aims to create label maps that can be used for the segmentation of animal brain MR images without the need for a brain template. This is achieved by performing a joint deconvolution and segmentation of the brain MR images. METHODS The method models the image statistics locally using a generalized Gaussian distribution (GGD) and couples the deconvolved image with its corresponding label map through a GGD-Potts model. Because the resulting Bayesian estimators of the unknown model parameters are analytically intractable, a Gibbs sampler is used to generate samples from the desired posterior distribution. RESULTS The performance of the proposed algorithm is assessed on simulated and real MR images by segmenting enhanced marmoset brain images into their main compartments using the label maps created. Quantitative assessment showed that this method yields results comparable to those obtained with the classical approach of registering the volumes to a brain template. CONCLUSION Using label maps as prior information for brain segmentation provides similar or slightly better performance than the classical reference method based on a dedicated template.
Affiliation(s)
- Farah Bazzi
- Computer Science Research Institute of Toulouse (IRIT), Toulouse University UPS, CNRS, UMR, Toulouse, France; Centre de Recherche Cerveau et Cognition (CerCo), Université de Toulouse UPS, CNRS, UMR, Toulouse, France; Doctoral School of Sciences and Technology, AZM Center for Research in Biotechnology and Its Applications, Lebanese University, Beirut, Lebanon
- Muriel Mescam
- Centre de Recherche Cerveau et Cognition (CerCo), Université de Toulouse UPS, CNRS, UMR, Toulouse, France
- Ahmad Diab
- Doctoral School of Sciences and Technology, AZM Center for Research in Biotechnology and Its Applications, Lebanese University, Beirut, Lebanon
- Omar Falou
- Doctoral School of Sciences and Technology, AZM Center for Research in Biotechnology and Its Applications, Lebanese University, Beirut, Lebanon
- Hassan Amoud
- Doctoral School of Sciences and Technology, AZM Center for Research in Biotechnology and Its Applications, Lebanese University, Beirut, Lebanon
- Adrian Basarab
- Computer Science Research Institute of Toulouse (IRIT), Toulouse University UPS, CNRS, UMR, Toulouse, France
- Denis Kouamé
- Computer Science Research Institute of Toulouse (IRIT), Toulouse University UPS, CNRS, UMR, Toulouse, France
7
Kristjánsson ÓH, Gjerde B, Ødegård J, Lillehammer M. Quantitative Genetics of Growth Rate and Filet Quality Traits in Atlantic Salmon Inferred From a Longitudinal Bayesian Model for the Left-Censored Gaussian Trait Growth Rate. Front Genet 2020; 11:573265. [PMID: 33329713] [PMCID: PMC7734147] [DOI: 10.3389/fgene.2020.573265]
Abstract
In selective breeding programs for Atlantic salmon, test fish are slaughtered at an average body weight, at which growth rate and carcass traits such as filet fat (FF), filet pigment (FP) and visceral fat index (VF) are recorded. The objective of this study was to estimate genetic correlations between growth rate (GR) and the three carcass quality traits when fish from the same 206 families (offspring of 120 sires and 206 dams from 2 year-classes) were recorded both at the same age (SA) and at about the same body weight (SW). In the SW group, the largest fish were slaughtered at five different slaughter events and the remaining fish at a sixth slaughter event over 6 months. Estimates of genetic parameters were obtained from a Bayesian multivariate model for (potentially) truncated Gaussian traits through a Gibbs sampler procedure in which phantom GR values were obtained for the unslaughtered, and thus censored, SW-group fish at each slaughter event. The heritability estimates for the same trait in each group were similar: about 0.2 for FF, 0.15 for FP and 0.35 for VF and GR. The genetic correlation between the same trait in the two groups was high for growth rate (0.91 ± 0.05) and visceral fat index (0.86 ± 0.05), medium for filet fat (0.45 ± 0.17) and low for filet pigment (0.13 ± 0.27). Within the two groups, the genetic correlation between growth rate and filet fat changed from positive (0.59 ± 0.14) in the SA group to negative (-0.45 ± 0.17) in the SW group, while the genetic correlation between growth rate and filet pigment changed from negative (-0.33 ± 0.22) in the SA group to positive (0.62 ± 0.16) in the SW group. The genetic correlations of growth rate with FF and FP are therefore sensitive to whether the latter traits are measured at the same age or at the same body weight.
The results indicate that selection for increased growth rate is not expected to have a detrimental effect on the quality traits if increased growth potential is realized through a reduced production time.
Affiliation(s)
- Ólafur H Kristjánsson
- Stofnfiskur HF, Hafnarfjörður, Iceland; Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Ås, Norway
- Bjarne Gjerde
- Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Ås, Norway; Department of Breeding and Genetics, Nofima AS, Ås, Norway
- Jørgen Ødegård
- Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Ås, Norway; Department of Breeding and Genetics, Nofima AS, Ås, Norway
8
Abstract
Trees and their seeds regulate their germination, growth, and reproduction in response to environmental stimuli. These stimuli, through signal transduction, trigger transcription factors that alter the expression of various genes, leading to the unfolding of the genetic program. A regulon is conceptually defined as a set of target genes regulated by a transcription factor that physically binds to regulatory motifs to accomplish a specific biological function, such as the CO-FT regulon for flowering timing and fall growth cessation in trees. Only with a clear characterization of regulatory motifs can candidate target genes be experimentally validated, yet motif characterization remains the weakest aspect of regulon research, especially in tree genetics. I review here relevant experimental and bioinformatics approaches to characterizing transcription factors and their binding sites, outline problems in tree regulon research, and demonstrate how transcription factor databases can be used effectively to aid the characterization of tree regulons.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, Ottawa, ON K1N 6N5, Canada
- Ottawa Institute of Systems Biology, Ottawa, ON K1H 8M5, Canada
9
Milkevych V, Madsen P, Gao H, Jensen J. The relative effect of genomic information on efficiency of Bayesian analysis of the mixed linear model with unknown variance. J Anim Breed Genet 2020; 138:14-22. [PMID: 32729965] [DOI: 10.1111/jbg.12497]
Abstract
This work focuses on the effect of a variable amount of genomic information on the Bayesian estimation of unknown variance components in single-step genomic prediction. We propose a quantitative criterion for the amount of genomic information included in the model and use it to study the relative effect of genomic data on the efficiency of sampling from the posterior distribution of the parameters of the single-step model when conducting a Bayesian analysis with unknown variances. The rate of change of the estimated variances depended on the amount of genomic information involved in the analysis, but not on the Gibbs updating scheme used to sample realizations of the posterior distribution. Simulation revealed a gradual deterioration of convergence rates for the location parameters as genomic data were gradually added to the analysis. In contrast, the convergence of the variance components showed continuous improvement under the same conditions. Sampling efficiency increased in proportion to the amount of genomic information. In addition, the amount of genomic information in the variance-covariance matrix that guarantees the most computationally efficient analysis was found to correspond to a proportion of genotyped animals of approximately 0.8. The proposed criterion yields a characterization of the expected performance of the Gibbs sampler as the amount of genomic data is adjusted and can guide researchers on how large a proportion of animals should be genotyped to attain an efficient analysis.
Affiliation(s)
- Viktor Milkevych
- Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, Denmark
- Per Madsen
- Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, Denmark
- Hongding Gao
- Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, Denmark
- Just Jensen
- Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, Denmark
10
Varona L, Legarra A. GIBBSTHUR: Software for Estimating Variance Components and Predicting Breeding Values for Ranking Traits Based on a Thurstonian Model. Animals (Basel) 2020; 10:E1001. [PMID: 32521773] [DOI: 10.3390/ani10061001]
Abstract
Simple Summary This article describes new software (GIBBSTHUR) that provides Bayesian estimation of variance components and predictions of breeding values for ranking traits generated from equine competitions, based on a Thurstonian approach. The GIBBSTHUR software was developed in FORTRAN 90, can be executed in UNIX, OSX, or WINDOWS environments, and is freely available in a public repository (https://github.com/lvaronaunizar/Gibbsthur). Abstract (1) Background: Ranking traits are commonly used for breeding purposes in several equine populations; however, their implementation is complex because the position of a horse in a competition event is discontinuous and is influenced by the performance of its competitors. One approach to overcoming these limitations is to assume an underlying Gaussian liability that represents a horse's performance and dictates the observed classification in a competition event. That approach can be implemented with Markov chain Monte Carlo (MCMC) techniques using a procedure known as the Thurstonian model. (2) Methods: We have developed software (GIBBSTHUR) that analyses ranking traits along with other continuous or threshold traits. The software implements a Gibbs sampler scheme with a data-augmentation step for the liabilities of the ranking traits and provides estimates of the variance and covariance components as well as predictions of the breeding values and the average performance of the competitors in competition events. (3) Results: The results of a simple example are presented, showing that the procedure can recover the simulated variance and covariance components. In addition, the correlations between the simulated and predicted breeding values, and between the estimated event effects and the average additive genetic effect of the competitors, demonstrate the ability of the software to produce useful predictions for breeding purposes.
(4) Conclusions: The GIBBSTHUR software provides a useful tool for the breeding evaluation of ranking traits in horses and is freely available in a public repository (https://github.com/lvaronaunizar/Gibbsthur).
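The data-augmentation step for ranking liabilities can be sketched as follows: given an observed finishing order, each latent liability is redrawn from a normal distribution truncated so the sampled performances reproduce that order. This is a generic illustration of the Thurstonian augmentation, not GIBBSTHUR's actual code; `mu` holds illustrative linear predictors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def sweep_liabilities(order, mu, liab):
    # order: competitor indices, best finisher first. Each liability is
    # redrawn from N(mu[i], 1) truncated between the liabilities of the
    # neighbouring finishers, preserving the observed ranking.
    liab = liab.astype(float).copy()
    for rank, i in enumerate(order):
        upper = liab[order[rank - 1]] if rank > 0 else np.inf
        lower = liab[order[rank + 1]] if rank < len(order) - 1 else -np.inf
        a, b = lower - mu[i], upper - mu[i]
        liab[i] = mu[i] + stats.truncnorm.rvs(a, b, random_state=rng)
    return liab

order = [2, 0, 1]                  # horse 2 won, horse 1 finished last
mu = np.array([0.0, -0.5, 0.8])    # illustrative linear predictors
liab = sweep_liabilities(order, mu, np.array([0.0, -1.0, 1.0]))
```

Once liabilities are sampled, the remaining updates proceed as in an ordinary Gaussian mixed model, which is what makes the augmentation attractive.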
11
Choi JY, Hwang H. Bayesian generalized structured component analysis. Br J Math Stat Psychol 2020; 73:347-373. [PMID: 31049946] [DOI: 10.1111/bmsp.12166]
Abstract
Generalized structured component analysis (GSCA) is a component-based approach to structural equation modelling, which adopts components of observed variables as proxies for latent variables and examines directional relationships among latent and observed variables. GSCA has been extended to deal with a wider range of data types, including discrete, multilevel or intensive longitudinal data, as well as to accommodate a greater variety of complex analyses such as latent moderation analysis, the capturing of cluster-level heterogeneity, and regularized analysis. To date, however, there has been no attempt to generalize the scope of GSCA into the Bayesian framework. In this paper, a novel extension of GSCA, called BGSCA, is proposed that estimates parameters within the Bayesian framework. BGSCA can be more attractive than the original GSCA for various reasons. For example, it can infer the probability distributions of random parameters, account for error variances in the measurement model, provide additional fit measures for model assessment and comparison from the Bayesian perspectives, and incorporate external information on parameters, which may be obtainable from past research, expert opinions, subjective beliefs or knowledge on the parameters. We utilize a Markov chain Monte Carlo method, the Gibbs sampler, to update the posterior distributions for the parameters of BGSCA. We conduct a simulation study to evaluate the performance of BGSCA. We also apply BGSCA to real data to demonstrate its empirical usefulness.
Affiliation(s)
- Ji Yeh Choi
- Department of Psychology, National University of Singapore, Singapore
12
Zhang S, Chen Y, Liu Y. An improved stochastic EM algorithm for large-scale full-information item factor analysis. Br J Math Stat Psychol 2020; 73:44-71. [PMID: 30511445] [DOI: 10.1111/bmsp.12153]
Abstract
In this paper, we explore the use of the stochastic EM algorithm (Celeux & Diebolt (1985) Computational Statistics Quarterly, 2, 73) for large-scale full-information item factor analysis. Innovations have been made on its implementation, including an adaptive-rejection-based Gibbs sampler for the stochastic E step, a proximal gradient descent algorithm for the optimization in the M step, and diagnostic procedures for determining the burn-in size and the stopping of the algorithm. These developments are based on the theoretical results of Nielsen (2000, Bernoulli, 6, 457), as well as advanced sampling and optimization techniques. The proposed algorithm is computationally efficient and virtually tuning-free, making it scalable to large-scale data with many latent traits (e.g. more than five latent traits) and easy to use for practitioners. Standard errors of parameter estimation are also obtained based on the missing-information identity (Louis, 1982, Journal of the Royal Statistical Society, Series B, 44, 226). The performance of the algorithm is evaluated through simulation studies and an application to the analysis of the IPIP-NEO personality inventory. Extensions of the proposed algorithm to other latent variable models are discussed.
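The distinguishing feature of the stochastic E step is that latent quantities are *sampled* from their posterior rather than integrated out. A toy illustration on a two-component Gaussian mixture (sketching the Celeux-Diebolt scheme the paper builds on, not the authors' item factor analysis implementation):

```python
import numpy as np

rng = np.random.default_rng(11)

def stochastic_em_2gauss(x, iters=100):
    # Two-component Gaussian mixture with unit variances, equal weights.
    mu = np.array([x.min(), x.max()])            # crude initialisation
    for _ in range(iters):
        # Stochastic E step: sample component labels from their posterior
        # instead of carrying expected responsibilities.
        logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
        prob = np.exp(logp - logp.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        labels = rng.random(len(x)) < prob[:, 1]
        # M step: means maximise the completed-data likelihood.
        for k, mask in enumerate([~labels, labels]):
            if mask.any():
                mu[k] = x[mask].mean()
    return np.sort(mu)

x = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(5.0, 1.0, 1000)])
mu = stochastic_em_2gauss(x)
```

Because the M step sees complete data, it stays cheap; the paper's contribution is making the sampled E step and the M step scale to many latent traits.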
Affiliation(s)
- Siliang Zhang
- Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China
- Yunxiao Chen
- Department of Statistics, London School of Economics and Political Science, London, UK
- Yang Liu
- Department of Human Development and Quantitative Methodology, University of Maryland, College Park, MD, USA
13
Abstract
In metabolomics, glycomics, and mass spectrometry of structured small molecules, the combinatoric nature of the problem renders any database impossibly large, and thus de novo analysis is necessary. De novo analysis requires an alphabet of mass-difference values used to link peaks in fragmentation spectra when they differ by a mass in the alphabet divided by a charge. Often this alphabet is not known, prohibiting de novo analysis. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum in a collection. We then introduce a novel approach to efficiently find recurring substructures in the de novo graph results.
Affiliation(s)
- Patrick A Kreitzberg
- Department of Computer Science, University of Montana, Missoula, Montana 59801, United States
- Marshall Bern
- Protein Metrics, Inc., Cupertino, California 95014, United States
- Qingbo Shu
- Biodesign Institute, Arizona State University, Tempe, Arizona 85287, United States
- Fuquan Yang
- Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
- Oliver Serang
- Department of Computer Science, University of Montana, Missoula, Montana 59801, United States
14
Alhamzawi R, Alhamzawi A, Mohammad Ali HT. New Gibbs sampling methods for Bayesian regularized quantile regression. Comput Biol Med 2019; 110:52-65. [PMID: 31125847] [DOI: 10.1016/j.compbiomed.2019.05.011]
Abstract
In this paper, we propose new Bayesian hierarchical representations of lasso, adaptive lasso and elastic net quantile regression models. We explore these representations by observing that the lasso penalty function corresponds to a scale mixture of truncated normal distributions (with exponential mixing densities). We consider fully Bayesian treatments that lead to new Gibbs sampling methods with tractable full conditional posteriors. The new methods are illustrated with both simulated and real data. Results show that the new methods perform very well under a variety of settings, such as the presence of a moderately large number of predictors, collinearity and heterogeneity.
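Scale-mixture identities of this kind are what make quantile-regression Gibbs samplers tractable, and they can be checked numerically. The sketch below uses the well-known Kozumi-Kobayashi representation of asymmetric Laplace errors as a normal scale mixture with exponential mixing (a generic illustration, not the paper's exact hierarchy):

```python
import numpy as np

rng = np.random.default_rng(5)

def ald_errors_via_mixture(p, size):
    # Asymmetric Laplace error at quantile level p as a normal scale
    # mixture with exponential mixing:
    #   e = theta * w + tau * sqrt(w) * z,  w ~ Exp(1), z ~ N(0, 1),
    # where theta = (1 - 2p) / (p(1 - p)) and tau^2 = 2 / (p(1 - p)).
    theta = (1.0 - 2.0 * p) / (p * (1.0 - p))
    tau2 = 2.0 / (p * (1.0 - p))
    w = rng.exponential(1.0, size)
    z = rng.standard_normal(size)
    return theta * w + np.sqrt(tau2 * w) * z

p = 0.25
e = ald_errors_via_mixture(p, size=200_000)
```

By construction the p-th quantile of the error is zero, which is what ties the check loss of quantile regression to this working likelihood; conditioning on `w` restores a Gaussian model with closed-form updates.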
15
Tang N, Wang S, Ye G. A nonparametric Bayesian continual reassessment method in single-agent dose-finding studies. BMC Med Res Methodol 2018; 18:172. [PMID: 30563454] [PMCID: PMC6299663] [DOI: 10.1186/s12874-018-0604-9]
Abstract
BACKGROUND The main purpose of dose-finding studies in Phase I trials is to estimate the maximum tolerated dose (MTD), the highest test dose that can be assigned with an acceptable level of toxicity. Existing methods for single-agent dose-finding assume that the dose-toxicity relationship follows a specific parametric potency curve, an assumption that may lead to bias and unsafe dose escalations if the curve is misspecified. METHODS This paper relaxes the parametric assumption by imposing a Dirichlet process prior on the unknown dose-toxicity curve. A hybrid algorithm combining the Gibbs sampler with adaptive rejection Metropolis sampling (ARMS) is developed to estimate the dose-toxicity curve, and a two-stage Bayesian nonparametric adaptive design is presented to estimate the MTD. RESULTS For comparison, we consider two classical continual reassessment methods (CRMs), based on logistic and power models. Numerical results show the flexibility of the proposed method for single-agent dose-finding trials; it outperforms both classical CRMs under the scenarios considered. CONCLUSIONS The proposed dose-finding procedure is model-free and robust, and behaves satisfactorily even with small samples.
Affiliation(s)
- Niansheng Tang: Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming, 650091, People's Republic of China
- Songjian Wang: Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming, 650091, People's Republic of China
- Gen Ye: Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming, 650091, People's Republic of China
16.
Liang F, Jia B, Xue J, Li Q, Luo Y. An imputation-regularized optimization algorithm for high dimensional missing data problems and beyond. J R Stat Soc Series B Stat Methodol 2018; 80:899-926. PMID: 31130816; PMCID: PMC6533005; DOI: 10.1111/rssb.12279.
Abstract
Missing data are frequently encountered in high dimensional problems, but they are usually difficult to handle with standard algorithms, such as the expectation-maximization algorithm and its variants. Problem-specific algorithms have been developed in the literature, but a general algorithm has been lacking. This work fills that gap: we propose a general algorithm for high dimensional missing data problems. The algorithm iterates between an imputation step and a regularized optimization step. At the imputation step, the missing data are imputed conditionally on the observed data and the current parameter estimates; at the regularized optimization step, a consistent estimate is found via regularization for the minimizer of a Kullback-Leibler divergence defined on the pseudo-complete data. For high dimensional problems, the consistent estimate can be found under sparsity constraints. Consistency of the averaged estimate for the true parameter can be established under quite general conditions. The algorithm is illustrated using high dimensional Gaussian graphical models, high dimensional variable selection and a random-coefficient model.
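A toy sketch of the impute-then-optimize loop, under strong simplifying assumptions (a univariate normal mean with values missing completely at random, and a ridge-style shrinkage standing in for the regularized optimization step; this is not the authors' high dimensional estimator):

```python
import random

def imputation_regularization(y_obs, n_missing, n_iter=200, lam=0.01, seed=0):
    """Toy imputation/regularized-optimization loop for the mean of a
    normal sample with values missing completely at random.
    I-step: draw imputations from N(mu, sigma) at the current estimates;
    RO-step: ridge-shrunken mean and plug-in scale from the completed data."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n = len(y_obs) + n_missing
    trace = []
    for t in range(n_iter):
        y_mis = [rng.gauss(mu, sigma) for _ in range(n_missing)]  # I-step
        full = y_obs + y_mis
        mu = sum(full) / (n + lam)                # RO-step: shrunken mean
        sigma = (sum((v - mu) ** 2 for v in full) / n) ** 0.5
        trace.append(mu)
    # average the estimates from the second half of the iterations
    return sum(trace[n_iter // 2:]) / (n_iter - n_iter // 2)

obs = [1.9, 2.1, 2.0, 1.8, 2.2] * 10   # 50 observed values with mean 2.0
est = imputation_regularization(obs, n_missing=10)
```

Averaging the parameter trace over late iterations mirrors the paper's use of an averaged estimate, which smooths out the Monte Carlo noise injected by the imputation step.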
Affiliation(s)
- Qizhai Li: Chinese Academy of Sciences, Beijing, People's Republic of China
- Ye Luo: University of Florida, Gainesville, USA
17.
de Araujo Neto FR, Pegolo NT, Aspilcueta-Borquis RR, Pessoa MC, Bonifácio A, Lobo RB, de Oliveira HN. Study of the effect of genotype-environment interaction on age at first calving and production traits in Nellore cattle using multi-trait reaction norms and Bayesian inference. Anim Sci J 2018; 89:939-945. PMID: 29766602; DOI: 10.1111/asj.12994.
Abstract
This study investigated the effects of genotype-environment interaction on yearling weight, age at first calving and post-weaning weight gain in Nellore cattle using multi-trait reaction norm models. The environmental gradient was defined as a function of the mean yearling weight of the contemporary groups. A first-order random regression sire model with four classes of residual variance was used in the analyses and Bayesian methods were applied to estimate the (co)variance components. The heritability estimates ranged from 0.284 to 0.547, 0.222 to 0.316 and 0.256 to 0.522 for yearling weight, age at first calving and post-weaning weight gain, respectively. The lowest genetic correlations between environment groups for each trait were 0.38, 0.02 and 0.04 for yearling weight, age at first calving and post-weaning weight gain, respectively. Differences in the correlation estimates were observed between traits in the same environments, with the magnitude of the estimates tending toward zero as the environment improved. The results highlight the importance of including genotype-environment interactions in genetic evaluation programs considering the differences observed between environmental groups not only in terms of heritability, but also of genetic correlations.
18.
Diekmann Y, Smith D, Gerbault P, Dyble M, Page AE, Chaudhary N, Migliano AB, Thomas MG. Accurate age estimation in small-scale societies. Proc Natl Acad Sci U S A 2017; 114:8205-10. PMID: 28696282; DOI: 10.1073/pnas.1619583114.
Abstract
Precise estimation of age is essential in evolutionary anthropology, especially to infer population age structures and understand the evolution of human life history diversity. However, in small-scale societies, such as hunter-gatherer populations, time is often not referred to in calendar years, and accurate age estimation remains a challenge. We address this issue by proposing a Bayesian approach that accounts for age uncertainty inherent to fieldwork data. We developed a Gibbs sampling Markov chain Monte Carlo algorithm that produces posterior distributions of ages for each individual, based on a ranking order of individuals from youngest to oldest and age ranges for each individual. We first validate our method on 65 Agta foragers from the Philippines with known ages, and show that our method generates age estimations that are superior to previously published regression-based approaches. We then use data on 587 Agta collected during recent fieldwork to demonstrate how multiple partial age ranks coming from multiple camps of hunter-gatherers can be integrated. Finally, we exemplify how the distributions generated by our method can be used to estimate important demographic parameters in small-scale societies: here, age-specific fertility patterns. Our flexible Bayesian approach will be especially useful to improve cross-cultural life history datasets for small-scale societies for which reliable age records are difficult to acquire.
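The rank-plus-range construction can be sketched with a toy Gibbs sampler: conditional on the other ages, each individual's age is uniform on its own age range clipped by the ages of its rank neighbours. This is a minimal sketch under a flat prior, not the authors' full algorithm (which also integrates multiple partial rankings across camps):

```python
import random

def gibbs_ordered_ages(bounds, n_iter=2000, burn=200, seed=5):
    """Draw age vectors consistent with per-individual (low, high) ranges
    and a known youngest-to-oldest ranking: each full conditional is
    uniform on the individual's range clipped by the neighbouring ages."""
    rng = random.Random(seed)
    n = len(bounds)
    ages, cur = [], bounds[0][0]
    for lo, hi in bounds:   # feasible nondecreasing start
        cur = max(lo, cur)  # (assumes the ranges admit an ordered sequence)
        ages.append(cur)
    draws = []
    for t in range(n_iter):
        for i, (lo, hi) in enumerate(bounds):
            lo2 = max(lo, ages[i - 1]) if i > 0 else lo
            hi2 = min(hi, ages[i + 1]) if i < n - 1 else hi
            ages[i] = rng.uniform(lo2, hi2)
        if t >= burn:
            draws.append(list(ages))
    return draws

# three individuals, youngest to oldest, with overlapping age ranges
draws = gibbs_ordered_ages([(0.0, 10.0), (5.0, 8.0), (6.0, 20.0)])
means = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

Every retained draw respects both the ranking and the individual ranges, so the per-individual posterior means inherit the ordering automatically.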
19.
Fresnedo-Ramírez J, Famula TR, Gradziel TM. Application of a Bayesian ordinal animal model for the estimation of breeding values for the resistance to Monilinia fruticola (G. Winter) Honey in progenies of peach [Prunus persica (L.) Batsch]. Breed Sci 2017; 67:110-122. PMID: 28588387; PMCID: PMC5445959; DOI: 10.1270/jsbbs.16027.
Abstract
Fruit brown rot caused by Monilinia spp. is the most important fungal disease of stone fruits worldwide. Several phenotyping protocols to accurately characterize and evaluate brown rot infection have been proposed; however, the outcomes from those studies have not led to consistent advances in resistance breeding programs. Breeding for disease resistance is one of the most challenging objectives for crop improvement because disease expression is tetrahedral: it is simultaneously influenced by agent, host, environment, and human management. The present study presents a strategy based on Bayesian inference to analyze a peach breeding progeny for resistance to brown rot, evaluated using a polytomous ordinal scale. A pedigree containing two sources of resistance, one from peach and the other from almond, several commercial cultivars, and two segregating populations was analyzed to estimate the narrow-sense heritability (h²) and estimated breeding values (EBVs) for brown rot resistance in progenies. Results show promise for genetic improvement of disease resistance and other traits characterized by strong environmental interactions.
Affiliation(s)
- Thomas R. Famula: Department of Animal Science, University of California, 1 Shields Avenue, Davis, CA 95616, USA
- Thomas M. Gradziel: Department of Plant Sciences, University of California, 1 Shields Avenue, Davis, CA 95616, USA
20.
van Iterson M, van Zwet EW, Heijmans BT. Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biol 2017; 18:19. PMID: 28129774; PMCID: PMC5273857; DOI: 10.1186/s13059-016-1131-9.
Abstract
We show that epigenome- and transcriptome-wide association studies (EWAS and TWAS) are prone to significant inflation and bias of test statistics, an unrecognized phenomenon introducing spurious findings if left unaddressed. Neither GWAS-based methodology nor state-of-the-art confounder adjustment methods completely remove bias and inflation. We propose a Bayesian method to control bias and inflation in EWAS and TWAS based on estimation of the empirical null distribution. Using simulations and real data, we demonstrate that our method maximizes power while properly controlling the false positive rate. We illustrate the utility of our method in large-scale EWAS and TWAS meta-analyses of age and smoking.
Affiliation(s)
- Maarten van Iterson: Molecular Epidemiology section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
- Erik W. van Zwet: Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
- the BIOS Consortium: Molecular Epidemiology section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
- Bastiaan T. Heijmans: Molecular Epidemiology section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
21.
Shotwell ME, McFee WE, Slate EH. A Bayesian mixture model for missing data in marine mammal growth analysis. Environ Ecol Stat 2016; 23:585-603. PMID: 28503080; PMCID: PMC5425172; DOI: 10.1007/s10651-016-0355-x.
Abstract
Much of what is known about bottlenose dolphin (Tursiops truncatus) anatomy and physiology is based on necropsies from stranding events. Measurements of total body length, total body mass, and age are used to estimate growth. Because it is more feasible to retrieve and transport smaller animals than larger ones for total body mass measurement, sampling is systematically biased. Adverse weather events, volunteer availability, and other unforeseen circumstances also contribute to incomplete measurement. We have developed a Bayesian mixture model to describe growth in detected stranded animals using data from both those that are fully measured and those that are not. Our approach uses a shared random effect to link the missingness mechanism (i.e., full versus partial measurement) to distinct growth curves in the fully and partially measured populations, thereby enabling borrowing of strength for estimation. We use simulation to compare our model to complete-case analysis and two common multiple imputation methods according to model mean square error. Results indicate that our mixture model provides a better fit both when the two populations are present and when they are not. The feasibility and utility of the new method are demonstrated by application to South Carolina strandings data.
Affiliation(s)
- Mary E. Shotwell: Department of Computer Information Systems, Middle Tennessee State University, Murfreesboro, TN 37132, USA
- Wayne E. McFee: Center for Coastal Environmental Health and Behavioral Research, National Ocean Service, Fort Johnson, Charleston, SC 29412, USA
- Elizabeth H. Slate: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
22.
Seaman SR, Hughes RA. Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: The general location model. Stat Methods Med Res 2016; 27:1603-1614. PMID: 27597798; PMCID: PMC5496676; DOI: 10.1177/0962280216665872.
Abstract
Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional-specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that the two approaches yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they are equally robust to misspecification of the joint model. When the conditional models used by full-conditional-specification multiple imputation are linear, logistic and multinomial regressions, they are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional-specification multiple imputation, but this typically requires very strong associations between variables; when associations are weaker, the efficiency gain is small. Moreover, full-conditional-specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to misspecification of that model when there is substantial missingness in the outcome variable.
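The mechanics of full-conditional specification can be shown in a minimal chained-equations loop for two variables, each modelled by a simple linear regression with normal residual noise. This is a sketch of the FCS idea only, not the restricted general location model studied in the paper:

```python
import random

def fcs_impute(pairs, n_iter=50, seed=9):
    """Chained-equations imputation for two variables with missing
    entries (None): cycle between imputing y from a regression on x
    and x from a regression on y, each with normal residual noise."""
    rng = random.Random(seed)
    x = [p[0] for p in pairs]
    y = [p[1] for p in pairs]

    def fill_mean(v):                      # starting values: observed mean
        obs = [u for u in v if u is not None]
        m = sum(obs) / len(obs)
        return [m if u is None else u for u in v]

    def regress(a, b):                     # b ~ a: intercept, slope, resid sd
        n, ma, mb = len(a), sum(a) / len(a), sum(b) / len(b)
        sxx = sum((u - ma) ** 2 for u in a)
        sxy = sum((u - ma) * (w - mb) for u, w in zip(a, b))
        slope = sxy / sxx
        inter = mb - slope * ma
        s2 = sum((w - inter - slope * u) ** 2 for u, w in zip(a, b)) / max(n - 2, 1)
        return inter, slope, s2 ** 0.5

    xf, yf = fill_mean(x), fill_mean(y)
    for _ in range(n_iter):
        i0, s0, sd0 = regress(xf, yf)      # conditional model for y | x
        yf = [yk if yk is not None else rng.gauss(i0 + s0 * xf[k], sd0)
              for k, yk in enumerate(y)]
        i1, s1, sd1 = regress(yf, xf)      # conditional model for x | y
        xf = [xk if xk is not None else rng.gauss(i1 + s1 * yf[k], sd1)
              for k, xk in enumerate(x)]
    return list(zip(xf, yf))

base = [(float(k), 2.0 * k) for k in range(10)]   # exact line y = 2x
pairs = base + [(None, 12.0), (3.5, None)]
done = fcs_impute(pairs)
imp_x = done[10][0]   # imputed x where y = 12 was observed
imp_y = done[11][1]   # imputed y where x = 3.5 was observed
```

Here the two linear conditionals are compatible with a bivariate normal joint, so the chain settles near that joint; the paper's point is about efficiency and robustness when such a compatible joint model is, or is not, fitted directly.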
Affiliation(s)
- Shaun R Seaman: MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK
- Rachael A Hughes: School of Social and Community Medicine, University of Bristol, Bristol, UK
23.
Montesinos-López A, Montesinos-López OA, Crossa J, Burgueño J, Eskridge KM, Falconi-Castillo E, He X, Singh P, Cichy K. Genomic Bayesian prediction model for count data with genotype × environment interaction. G3 (Bethesda) 2016; 6:1165-77. PMID: 26921298; DOI: 10.1534/g3.116.028118.
Abstract
Genomic tools allow the study of the whole genome, and facilitate the study of genotype-environment combinations and their relationship with phenotype. However, most genomic prediction models developed so far are appropriate for Gaussian phenotypes. For this reason, appropriate genomic prediction models are needed for count data, since the conventional regression models used on count data with a large sample size (nT) and a small number of parameters (p) cannot be used for genomic-enabled prediction where the number of parameters (p) is larger than the sample size (nT). Here, we propose a Bayesian mixed-negative binomial (BMNB) genomic regression model for counts that takes into account genotype by environment (G×E) interaction. We also provide all the full conditional distributions to implement a Gibbs sampler. We evaluated the proposed model using a simulated data set, and a real wheat data set from the International Maize and Wheat Improvement Center (CIMMYT) and collaborators. Results indicate that our BMNB model provides a viable option for analyzing count data.
24.
Linderman SW, Johnson MJ, Wilson MA, Chen Z. A Bayesian nonparametric approach for uncovering rat hippocampal population codes during spatial navigation. J Neurosci Methods 2016; 263:36-47. PMID: 26854398; PMCID: PMC4801699; DOI: 10.1016/j.jneumeth.2016.01.022.
Abstract
BACKGROUND Rodent hippocampal population codes represent important spatial information about the environment during navigation. Computational methods have been developed to uncover the neural representation of spatial topology embedded in rodent hippocampal ensemble spike activity. NEW METHOD We extend our previous work and propose a novel Bayesian nonparametric approach to infer rat hippocampal population codes during spatial navigation. To tackle the model selection problem, we leverage a Bayesian nonparametric model: specifically, a hierarchical Dirichlet process-hidden Markov model (HDP-HMM), fitted with two Bayesian inference methods, one based on Markov chain Monte Carlo (MCMC) and the other on variational Bayes (VB). RESULTS The effectiveness of our Bayesian approaches is demonstrated on recordings from a freely behaving rat navigating an open field environment. COMPARISON WITH EXISTING METHODS The HDP-HMM outperforms the finite-state HMM in both simulated and experimental data. For the HDP-HMM, MCMC-based inference with Hamiltonian Monte Carlo (HMC) hyperparameter sampling is flexible and efficient, and outperforms VB and MCMC approaches with hyperparameters set by empirical Bayes. CONCLUSION The Bayesian nonparametric HDP-HMM method can efficiently perform model selection and identify model parameters, which can be used for modeling latent-state neuronal population dynamics.
Affiliation(s)
- Scott W Linderman: Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
- Matthew J Johnson: Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA; Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA
- Matthew A Wilson: Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Zhe Chen: Department of Psychiatry, Department of Neuroscience and Physiology, New York University School of Medicine, New York, NY 10016, USA
25.
Abstract
The Gibbs sampler has been used extensively in the statistics literature. It relies on iteratively sampling from a set of compatible conditional distributions and the sampler is known to converge to a unique invariant joint distribution. However, the Gibbs sampler behaves rather differently when the conditional distributions are not compatible. Such applications have seen increasing use in areas such as multiple imputation. In this paper, we demonstrate that what a Gibbs sampler converges to is a function of the order of the sampling scheme. Besides providing the mathematical background of this behavior, we also explain how that happens through a thorough analysis of the examples.
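The order dependence described in this abstract can be checked exactly for two binary variables with deliberately incompatible conditionals: the invariant distribution of a systematic-scan Gibbs sweep differs between the x-then-y and y-then-x scan orders. A small, self-contained illustration (the conditional probabilities below are arbitrary choices for the demonstration, not values from the paper):

```python
# Incompatible full conditionals on two binary variables x, y:
px1_given_y = {0: 0.9, 1: 0.2}   # P(x = 1 | y)
py1_given_x = {0: 0.3, 1: 0.8}   # P(y = 1 | x)

def kernel(order):
    """Exact transition matrix of one systematic-scan Gibbs sweep over
    (x, y); state index = 2*x + y, order is ('x', 'y') or ('y', 'x')."""
    P = [[0.0] * 4 for _ in range(4)]
    for s in range(4):
        probs = {(s >> 1, s & 1): 1.0}
        for var in order:
            nxt = {}
            for (x, y), p in probs.items():
                p1 = px1_given_y[y] if var == 'x' else py1_given_x[x]
                hi = (1, y) if var == 'x' else (x, 1)
                lo = (0, y) if var == 'x' else (x, 0)
                nxt[hi] = nxt.get(hi, 0.0) + p * p1
                nxt[lo] = nxt.get(lo, 0.0) + p * (1.0 - p1)
            probs = nxt
        for (x, y), p in probs.items():
            P[s][2 * x + y] = p
    return P

def stationary(P, n_iter=500):
    """Invariant distribution by power iteration (the chain is ergodic)."""
    pi = [0.25] * 4
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
    return pi

pi_xy = stationary(kernel(('x', 'y')))
pi_yx = stationary(kernel(('y', 'x')))
gap = max(abs(a - b) for a, b in zip(pi_xy, pi_yx))  # > 0: order matters
```

With compatible conditionals the two sweeps would share one invariant joint; here the incompatibility makes the invariant distribution a function of the scan order, which is exactly the behavior the paper analyzes.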
Affiliation(s)
- Shyh-Huei Chen: Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, NC 27157, USA
- Edward H Ip: Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, NC 27157, USA; Department of Social Sciences and Health Policy, Wake Forest University School of Medicine, Winston-Salem, NC 27157, USA
26.
Montesinos-López OA, Montesinos-López A, Crossa J, Burgueño J, Eskridge K. Genomic-enabled prediction of ordinal data with Bayesian logistic ordinal regression. G3 (Bethesda) 2015; 5:2113-26. PMID: 26290569; DOI: 10.1534/g3.115.021154.
Abstract
Most genomic-enabled prediction models developed so far assume that the response variable is continuous and normally distributed; the exception is the probit model, developed for ordered categorical phenotypes. Because the Bayesian probit ordinal regression (BPOR) model is easy to implement, Bayesian logistic ordinal regression (BLOR) is rarely used in the context of genomic-enabled prediction, where the sample size (n) is much smaller than the number of parameters (p). For this reason, in this paper we propose a BLOR model using the Pólya-Gamma data augmentation approach, which produces a Gibbs sampler with full conditional distributions similar to those of the BPOR model and has the advantage that the BPOR model is a particular case of the BLOR model. We evaluated the proposed model using simulation and two real data sets. Results indicate that our BLOR model is a good alternative for analyzing ordinal data in the context of genomic-enabled prediction with the probit or logit link.
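The Pólya-Gamma construction itself is involved; as a related, simpler illustration of data augmentation yielding tractable full conditionals, here is a sketch of the classical Albert-Chib sampler for a binary probit, the binary special case of the BPOR family the authors compare against (illustrative only, not the proposed BLOR sampler):

```python
import random
from statistics import NormalDist

def probit_gibbs(x, y, n_iter=2000, burn=500, tau2=100.0, seed=7):
    """Albert-Chib data augmentation for a binary probit model
    P(y_i = 1) = Phi(beta * x_i): latent z_i ~ N(beta * x_i, 1) truncated
    by the sign implied by y_i, then a conjugate normal update for beta."""
    rng = random.Random(seed)
    std = NormalDist()
    post_var = 1.0 / (sum(v * v for v in x) + 1.0 / tau2)  # N(0, tau2) prior
    beta, draws = 0.0, []
    for t in range(n_iter):
        z = []
        for xi, yi in zip(x, y):
            m = beta * xi
            cut = std.cdf(-m)                 # P(z <= 0) at the current mean
            u = rng.uniform(cut, 1.0) if yi == 1 else rng.uniform(0.0, cut)
            u = min(max(u, 1e-12), 1.0 - 1e-12)
            z.append(m + std.inv_cdf(u))      # inverse-CDF truncated draw
        beta = rng.gauss(post_var * sum(xi * zi for xi, zi in zip(x, z)),
                         post_var ** 0.5)
        if t >= burn:
            draws.append(beta)
    return sum(draws) / len(draws)

# synthetic data with true beta = 1
gen = random.Random(11)
std0 = NormalDist()
x = [-2.0 + 4.0 * i / 199 for i in range(200)]
y = [1 if gen.random() < std0.cdf(xi) else 0 for xi in x]
beta_hat = probit_gibbs(x, y)
```

The Pólya-Gamma trick plays the same structural role for the logit link that the truncated-normal latent variable plays here for the probit link: both restore conjugacy so every full conditional can be sampled directly.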
27.
Abstract
Meta-analysis of microarray studies to produce an overall gene list is relatively straightforward when complete data are available. When some studies lack complete information (providing, for example, only a ranked list of genes), it is common to reduce all studies to ranked lists prior to combining them. Since this entails a loss of information, we consider a hierarchical Bayes approach to meta-analysis that uses different types of information from different studies: the full data matrix, summary statistics, or ranks. The model uses an informative prior for the parameter of interest to aid the detection of differentially expressed genes. Simulations show that the new approach can give substantial power gains compared with classical meta-analysis and list aggregation methods. A meta-analysis of 11 published studies with different data types identifies genes known to be involved in ovarian cancer and shows significant enrichment.
Affiliation(s)
- Alix Zollinger: Ecole Polytechnique Fédérale de Lausanne, EPFL-FSB-MATHAA-STAT, Station 8, 1015 Lausanne, Switzerland
- Anthony C Davison: Ecole Polytechnique Fédérale de Lausanne, EPFL-FSB-MATHAA-STAT, Station 8, 1015 Lausanne, Switzerland
- Darlene R Goldstein: Ecole Polytechnique Fédérale de Lausanne, EPFL-FSB-MATHAA-STAT, Station 8, 1015 Lausanne, Switzerland
28.
de Castro M, Chen MH, Zhang Y. Bayesian path specific frailty models for multi-state survival data with applications. Biometrics 2015; 71:760-71. PMID: 25762198; DOI: 10.1111/biom.12298.
Abstract
Multi-state models can be viewed as generalizations of both the standard and competing risks models for survival data. Models for multi-state data have been the theme of many recent published works. Motivated by bone marrow transplant data, we propose a Bayesian model using the gap times between two successive events in a path of events experienced by a subject. Path specific frailties are introduced to capture the dependence structure of the gap times in the paths with two or more states. Under improper prior distributions for the parameters, we establish propriety of the posterior distribution. An efficient Gibbs sampling algorithm is developed for drawing samples from the posterior distribution. An extensive simulation study is carried out to examine the empirical performance of the proposed approach. A bone marrow transplant data set is analyzed in detail to further demonstrate the proposed methodology.
Affiliation(s)
- Mário de Castro: Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, São Carlos, SP, Brazil
- Ming-Hui Chen: Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
- Yuanye Zhang: Novartis Institutes for BioMedical Research, Inc., Cambridge, Massachusetts, USA
29.
Abstract
In this paper we propose a general class of gamma frailty transformation models for multivariate survival data. The transformation class includes the commonly used proportional hazards and proportional odds models. The proposed class also includes a family of cure rate models. Under an improper prior for the parameters, we establish propriety of the posterior distribution. A novel Gibbs sampling algorithm is developed for sampling from the observed data posterior distribution. A simulation study is conducted to examine the properties of the proposed methodology. An application to a data set from a cord blood transplantation study is also reported.
Affiliation(s)
- Mário de Castro: Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo
- John P Klein: Division of Biostatistics, Medical College of Wisconsin
30.
Abstract
In most statistical applications, the Gibbs sampler is the method of choice for inference regarding conditionally specified distributions that are compatible. Compatibility ensures that a unique Gibbs distribution exists. For machine learning of complex models such as dependency networks, the conditional models are sometimes incompatible. In this paper, we review an ensemble approach using the Gibbs sampler as the base procedure. A Gibbs ensemble consists of many joint distributions resulting from different scan orders of the same conditional model, and the solution is a weighted sum of the ensemble. The algorithm is scalable and can handle large data sets of high dimensionality. The proposed approach provides joint distributions that conform with the conditional specifications better than the solutions obtained by linear programming and by a fixed-scan Gibbs sampler alone. Owing to incompatibility, the invariant distribution of a Gibbs sampler is scan-order dependent. A Gibbs ensemble is the collection of joint distributions estimated from the Gibbs samples of different scan orders.
Affiliation(s)
- Shyh-Huei Chen: Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Edward H. Ip: Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Yuchung J. Wang: Department of Mathematical Sciences, Rutgers University, Camden, NJ, USA
31.
Abstract
To investigate interactions between parasite species in a host, a population of field voles was studied longitudinally, with presence or absence of six different parasites measured repeatedly. Although trapping sessions were regular, a different set of voles was caught at each session, leading to incomplete profiles for all subjects. We use a discrete time hidden Markov model for each disease with transition probabilities dependent on covariates via a set of logistic regressions. For each disease the hidden states for each of the other diseases at a given time point form part of the covariate set for the Markov transition probabilities from that time point. This allows us to gauge the influence of each parasite species on the transition probabilities for each of the other parasite species. Inference is performed via a Gibbs sampler, which cycles through each of the diseases, first using an adaptive Metropolis-Hastings step to sample from the conditional posterior of the covariate parameters for that particular disease given the hidden states for all other diseases and then sampling from the hidden states for that disease given the parameters. We find evidence for interactions between several pairs of parasites and of an acquired immune response for two of the parasites.
32.
Abstract
Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories: 1) sequence retrieval, editing, manipulation, and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot, and many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely available from http://dambe.bio.uottawa.ca (last accessed April 16, 2013).
Affiliation(s)
- Xuhua Xia
- Department of Biology and Center for Advanced Research in Environmental Genomics, University of Ottawa, Ottawa, Ontario, Canada
33
Chen H, Quandt SA, Grzywacz JG, Arcury TA. A Bayesian Multiple Imputation Method for Handling Longitudinal Pesticide Data with Values below the Limit of Detection. Environmetrics 2013; 24:132-142. [PMID: 23504271 PMCID: PMC3596170 DOI: 10.1002/env.2193] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/05/2023]
Abstract
Environmental and biomedical research often produces data below the limit of detection (LOD), or left-censored data. Imputing explicit values for values < LOD in a multivariate setting, such as with longitudinal data, is difficult using a likelihood-based approach. A Bayesian multiple imputation (MI) method is introduced to handle left-censored multivariate data. A Gibbs sampler, which uses an iterative process, is employed to simulate the target multivariate distribution within a Bayesian framework. Following convergence, multiple plausible data sets are generated for analysis by standard statistical methods outside of a Bayesian framework. With explicit imputed values available, variables can be analyzed as outcomes or predictors. We illustrate a practical application using longitudinal data from the Community Participatory Approach to Measuring Farmworker Pesticide Exposure (PACE3) study to evaluate the association between urinary acephate concentrations (indicating pesticide exposure) and self-reported potential pesticide poisoning symptoms. Additionally, a simulation study is used to evaluate the sampling property of the estimators for distributional parameters as well as regression coefficients estimated with the generalized estimating equation (GEE) approach. Results demonstrated that the Bayesian MI estimates performed well in most settings, and we recommend the use of this valid and feasible approach to analyze multivariate data with values < LOD.
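The Gibbs step at the heart of such a method — imputing each value below the LOD from the current model truncated above at the LOD, then re-estimating the model from the completed data — can be sketched for the simplest univariate-normal case. This is a toy sketch, not the paper's multivariate method; the rejection sampler, the univariate model, and all function names are assumptions for illustration.

```python
import random
import statistics

def sample_truncated_normal_below(mu, sigma, upper, rng):
    """Rejection-sample from N(mu, sigma^2) truncated to (-inf, upper).
    Adequate for a sketch; production code would use inverse-CDF sampling."""
    while True:
        x = rng.gauss(mu, sigma)
        if x < upper:
            return x

def gibbs_impute(observed, n_censored, lod, n_iter=200, seed=1):
    """Alternate between (a) imputing the n_censored values < LOD from the
    current normal model truncated at the LOD and (b) re-estimating the
    mean and SD from the completed data. Returns one completed data set;
    multiple imputation would retain several post-convergence draws."""
    rng = random.Random(seed)
    imputed = [lod / 2.0] * n_censored          # crude starting values
    for _ in range(n_iter):
        full = observed + imputed
        mu = statistics.fmean(full)
        sigma = statistics.stdev(full)
        imputed = [sample_truncated_normal_below(mu, sigma, lod, rng)
                   for _ in range(n_censored)]
    return observed + imputed
```

Each retained completed data set can then be analyzed by standard (e.g. GEE) methods, with results combined across imputations in the usual MI fashion.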
Affiliation(s)
- Haiying Chen
- Department of Biostatistical Sciences, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Center for Worker Health, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Correspondence to: Haiying Chen, Department of Biostatistical Sciences, Division of Public Health Sciences, Wake Forest School of Medicine, Medical Center Boulevard, Winston-Salem, NC 27157.
- Sara A. Quandt
- Department of Epidemiology and Prevention, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Center for Worker Health, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Joseph G. Grzywacz
- Department of Family and Community Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Center for Worker Health, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Thomas A. Arcury
- Department of Family and Community Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina
- Center for Worker Health, Wake Forest School of Medicine, Winston-Salem, North Carolina
34
Abstract
Real-world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks.
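The nested Chinese Restaurant Process used here to grow the hierarchy can be sketched in a few lines: a new actor descends the tree, at each level choosing an existing branch with probability proportional to how many earlier paths took it, or a fresh branch with probability proportional to a concentration parameter. This is a minimal sketch of the nCRP prior only, not the MSCB sampler; the function name, path encoding, and `gamma` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict

def ncrp_assign(paths, depth, gamma, rng):
    """Assign a new actor a root-to-leaf path of length `depth` via the
    nested Chinese Restaurant Process. `paths` holds the tuples already
    assigned to earlier actors; `gamma` is the concentration parameter."""
    path = ()
    for _ in range(depth):
        # Count how many existing paths take each branch below `path`
        counts = defaultdict(int)
        for p in paths:
            if p[:len(path)] == path and len(p) > len(path):
                counts[p[len(path)]] += 1
        total = sum(counts.values()) + gamma
        r = rng.random() * total
        acc, chosen = 0.0, None
        for branch, c in counts.items():
            acc += c
            if r < acc:
                chosen = branch
                break
        if chosen is None:                      # open a new branch
            chosen = max(counts.keys(), default=-1) + 1
        path = path + (chosen,)
    return path
```

In a collapsed Gibbs sampler, each actor's path would be repeatedly removed and resampled from the conditional posterior, which multiplies this prior by the likelihood of the observed network links.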
Affiliation(s)
- Qirong Ho
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217
35
Singh SP, Mishra BN. Prediction of MHC binding peptide using Gibbs motif sampler, weight matrix and artificial neural network. Bioinformation 2008; 3:150-5. [PMID: 19238237 PMCID: PMC2639663 DOI: 10.6026/97320630003150] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2008] [Accepted: 11/05/2008] [Indexed: 11/28/2022] Open
Abstract
The identification of MHC-restricted epitopes is an important goal in peptide-based vaccine and diagnostic development. As wet-lab experiments for identification of MHC-binding peptides are expensive and time consuming, in silico tools have been developed as fast alternatives, albeit with lower performance. In the present study, we used IEDB training and blind validation datasets for the prediction of peptide binding to fourteen human MHC class I and II molecules using Gibbs motif sampler, weight matrix and artificial neural network methods. Compared to the MHC class I predictors based on sequence weighting (Aroc = 0.95 and CC = 0.56) and an artificial neural network (Aroc = 0.73 and CC = 0.25), the MHC class II predictor based on the Gibbs sampler did not perform well (Aroc = 0.62 and CC = 0.19). The predictive accuracy of the Gibbs motif sampler in identifying the 9-mer cores of peptides binding to DRB1 alleles was also limited (40%), although above random prediction (14%). Therefore, the size of the dataset (training and validation) and the correct identification of the binding core are the two main factors limiting the performance of MHC class II binding peptide prediction. Overall, these data suggest that there is substantial room to improve the quality of the core predictions using novel approaches that capture distinct features of MHC-peptide interactions better than the current approaches.
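The core-identification step this abstract evaluates — locating the best-scoring 9-mer window in a longer peptide under a position weight matrix — can be sketched as follows. This is a generic PWM scan for illustration, not the authors' Gibbs motif sampler; the function name, the PWM encoding (a list of per-position log-odds dictionaries), and the toy weights in the usage note are assumptions.

```python
import math

def best_core(peptide, pwm, core_len=9):
    """Slide a core_len window over the peptide and return (start, score)
    for the highest-scoring window. pwm[i][aa] is the log-odds weight for
    amino acid aa at core position i; unlisted residues score 0.0."""
    best = (None, -math.inf)
    for start in range(len(peptide) - core_len + 1):
        window = peptide[start:start + core_len]
        score = sum(pwm[i].get(aa, 0.0) for i, aa in enumerate(window))
        if score > best[1]:
            best = (start, score)
    return best
```

A Gibbs motif sampler would learn the `pwm` itself, iteratively resampling each peptide's core position from the distribution implied by the matrix built from all the other peptides' current cores.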
Affiliation(s)
- Satarudra Prakash Singh
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Gomti Nagar, Lucknow-226010, India, Department of Biotechnology, Institute of Engineering and Technology, U.P. Technical University, Sitapur Road, Lucknow-226021, India