1
Xu H, Li X, Zhang Z, Grannis S. Variable selection for latent class analysis in the presence of missing data with application to record linkage. Stat Methods Med Res 2024:9622802241242317. [PMID: 38592341 DOI: 10.1177/09622802241242317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
The Fellegi-Sunter model is a latent class model widely used in probabilistic linkage to identify records that belong to the same entity. Record linkage practitioners typically employ all available matching fields in the model with the premise that more fields convey greater information about the true match status and hence result in improved match performance. In the context of model-based clustering, it is well known that such a premise is incorrect and the inclusion of noisy variables could compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly due to the ubiquity of the missing data in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for missing data common in record linkage. Through simulation studies, our proposed method is shown to select the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the use of our proposed selection procedure to identify informative matching fields for probabilistic record linkage algorithms.
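As a rough illustration of the latent class view of record linkage described in this abstract, the sketch below fits a two-class model to binary field-agreement vectors by EM and skips missing comparisons. It is not the authors' variable selection procedure: the function name `fit_two_class_model`, the starting values, the missing-at-random treatment, and the simulated data are all assumptions made for the example.

```python
import math
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 so logs stay finite


def clip(value):
    return min(max(value, EPS), 1.0 - EPS)


def fit_two_class_model(patterns, n_iter=100):
    """EM for a two-class latent class model on binary agreement fields.

    Each pattern holds 1 (agree), 0 (disagree) or None (missing) per field.
    Missing fields are skipped, which assumes they are missing at random.
    Returns (p, m, u): the match proportion and the per-field agreement
    probabilities among matches (m) and non-matches (u).
    """
    k = len(patterns[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # illustrative starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match.
        post = []
        for g in patterns:
            log_m, log_u = math.log(p), math.log(1.0 - p)
            for j, x in enumerate(g):
                if x is None:
                    continue  # a missing field contributes nothing
                log_m += math.log(m[j] if x else 1.0 - m[j])
                log_u += math.log(u[j] if x else 1.0 - u[j])
            diff = log_u - log_m
            post.append(0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff)))
        # M-step: weighted proportions over the observed values only.
        p = clip(sum(post) / len(post))
        for j in range(k):
            obs = [(w, g[j]) for w, g in zip(post, patterns) if g[j] is not None]
            w_match = sum(w for w, _ in obs)
            w_non = sum(1.0 - w for w, _ in obs)
            m[j] = clip(sum(w * x for w, x in obs) / max(w_match, EPS))
            u[j] = clip(sum((1.0 - w) * x for w, x in obs) / max(w_non, EPS))
    return p, m, u


# Simulated comparison vectors (all values below are made up for the demo).
random.seed(1)
true_m, true_u = [0.95, 0.9, 0.9, 0.85], [0.05, 0.1, 0.1, 0.2]
patterns = []
for _ in range(3000):
    probs = true_m if random.random() < 0.2 else true_u
    patterns.append([
        None if random.random() < 0.2 else int(random.random() < q)
        for q in probs
    ])

p_hat, m_hat, u_hat = fit_two_class_model(patterns)
print(round(p_hat, 3), [round(v, 2) for v in m_hat], [round(v, 2) for v in u_hat])
```

A field whose estimated `m` and `u` are nearly equal carries little information about match status, which is the kind of field a selection procedure would aim to drop.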
Affiliation(s)
- Huiping Xu
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA
- Xiaochun Li
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA
2
Pedone M, Argiento R, Stingo FC. Personalized treatment selection via product partition models with covariates. Biometrics 2024; 80:ujad003. [PMID: 38364806 DOI: 10.1093/biomtc/ujad003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 07/27/2023] [Accepted: 11/03/2023] [Indexed: 02/18/2024]
Abstract
Precision medicine is an approach for disease treatment that defines treatment strategies based on the individual characteristics of the patients. Motivated by an open problem in cancer genomics, we develop a novel model that flexibly clusters patients with similar predictive characteristics and similar treatment responses; this approach identifies, via predictive inference, which one among a set of treatments is better suited for a new patient. The proposed method is fully model based, avoiding uncertainty underestimation attained when treatment assignment is performed by adopting heuristic clustering procedures, and belongs to the class of product partition models with covariates, here extended to include the cohesion induced by the normalized generalized gamma process. The method performs particularly well in scenarios characterized by considerable heterogeneity of the predictive covariates in simulation studies. A cancer genomics case study illustrates the potential benefits in terms of treatment response yielded by the proposed approach. Finally, being model based, the approach allows estimating clusters' specific response probabilities and then identifying patients more likely to benefit from personalized treatment.
Affiliation(s)
- Matteo Pedone
- Department of Statistics, Computer Science and Applications, University of Florence, Florence, Italy, 50134
- Raffaele Argiento
- Department of Economics, University of Bergamo, Bergamo, Italy, 24121
- Francesco C Stingo
- Department of Statistics, Computer Science and Applications, University of Florence, Florence, Italy, 50134
3
Riggott C, Fairbrass KM, Black CJ, Gracie DJ, Ford AC. Novel symptom clusters predict disease impact and healthcare utilisation in inflammatory bowel disease: Prospective longitudinal follow-up study. Aliment Pharmacol Ther 2023; 58:1163-1174. [PMID: 37792347 DOI: 10.1111/apt.17735] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/16/2023] [Revised: 09/07/2023] [Accepted: 09/19/2023] [Indexed: 10/05/2023]
Abstract
BACKGROUND: Predicting adverse disease outcomes and high-volume users of healthcare amongst patients with inflammatory bowel disease (IBD) is difficult.
AIMS: The aim of this study is to use latent class analysis to create novel clusters of patients and to assess whether these predict outcomes during 6.5 years of longitudinal follow-up.
METHODS: Baseline demographic features, disease activity indices, anxiety, depression, and somatoform symptom-reporting scores were recorded for 692 adults. Faecal calprotectin (FC) was analysed at baseline in 348 (50.3%) patients (<250 mcg/g defined biochemical remission). Using baseline gastrointestinal and psychological symptoms, latent class analysis identified specific patient clusters. Rates of glucocorticosteroid prescription or flare, escalation, hospitalisation, or intestinal resection were compared between clusters using multivariate Cox regression.
RESULTS: A three-cluster model was the optimum solution; 132 (19.1%) patients had below-average gastrointestinal and psychological symptoms (cluster 1), 352 (50.9%) had average levels of gastrointestinal and psychological symptoms (cluster 2), and 208 (30.1%) had the highest levels of both gastrointestinal and psychological symptoms (cluster 3). Compared with cluster 1, cluster 3 had significantly increased risk of flare or glucocorticosteroid prescription (hazard ratio (HR): 2.13; 95% confidence interval (CI): 1.46-3.10), escalation (HR: 1.92; 95% CI: 1.34-2.76), a composite of escalation, hospitalisation, or intestinal resection (HR: 2.05; 95% CI: 1.45-2.88), or any of the endpoints of interest (HR: 2.06; 95% CI: 1.45-2.93). Healthcare utilisation was highest in cluster 3.
CONCLUSIONS: Novel model-based clusters identify patients with IBD at higher risk of adverse disease outcomes who are high-volume users of healthcare.
Affiliation(s)
- Christy Riggott
- Leeds Gastroenterology Institute, St. James's University Hospital, Leeds, UK
- Leeds Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- Keeley M Fairbrass
- Leeds Gastroenterology Institute, St. James's University Hospital, Leeds, UK
- Leeds Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- Christopher J Black
- Leeds Gastroenterology Institute, St. James's University Hospital, Leeds, UK
- Leeds Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- David J Gracie
- Leeds Gastroenterology Institute, St. James's University Hospital, Leeds, UK
- Leeds Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- Alexander C Ford
- Leeds Gastroenterology Institute, St. James's University Hospital, Leeds, UK
- Leeds Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
4
Jo B, Hastie TJ, Li Z, Youngstrom EA, Findling RL, Horwitz SM. Reorienting Latent Variable Modeling for Supervised Learning. Multivariate Behav Res 2023; 58:1057-1071. [PMID: 37229653 PMCID: PMC10674034 DOI: 10.1080/00273171.2023.2182753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
Despite its potential benefits, using prediction targets generated based on latent variable (LV) modeling is not a common practice in supervised learning, a dominating framework for developing prediction models. In supervised learning, it is typically assumed that the outcome to be predicted is clear and readily available, and therefore validating outcomes before predicting them is a foreign concept and an unnecessary step. The usual goal of LV modeling is inference, and therefore using it in supervised learning and in the prediction context requires a major conceptual shift. This study lays out methodological adjustments and conceptual shifts necessary for integrating LV modeling into supervised learning. It is shown that such integration is possible by combining the traditions of LV modeling, psychometrics, and supervised learning. In this interdisciplinary learning framework, generating practical outcomes using LV modeling and systematically validating them based on clinical validators are the two main strategies. In the example using the data from the Longitudinal Assessment of Manic Symptoms (LAMS) Study, a large pool of candidate outcomes is generated by flexible LV modeling. It is demonstrated that this exploratory situation can be used as an opportunity to tailor desirable prediction targets taking advantage of contemporary science and clinical insights.
5
Wang W, Sun Y, Wang HJ. Latent group detection in functional partially linear regression models. Biometrics 2023; 79:280-291. [PMID: 34482542 DOI: 10.1111/biom.13557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2020] [Revised: 08/06/2021] [Accepted: 08/19/2021] [Indexed: 11/28/2022]
Abstract
In this paper, we propose a functional partially linear regression model with latent group structures to accommodate the heterogeneous relationship between a scalar response and functional covariates. The proposed model is motivated by a salinity tolerance study of barley families, whose main objective is to detect salinity tolerant barley plants. Our model is flexible, allowing for heterogeneous functional coefficients while being efficient by pooling information within a group for estimation. We develop an algorithm in the spirit of the K-means clustering to identify latent groups of the subjects under study. We establish the consistency of the proposed estimator, derive the convergence rate and the asymptotic distribution, and develop inference procedures. We show by simulation studies that the proposed method has higher accuracy for recovering latent groups and for estimating the functional coefficients than existing methods. The analysis of the barley data shows that the proposed method can help identify groups of barley families with different salinity tolerant abilities.
Affiliation(s)
- Wu Wang
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- Ying Sun
- Statistics Program, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Huixia Judy Wang
- Department of Statistics, The George Washington University, Washington, DC, USA
6
Elayouty A, Abou-Ali H. Functional data analysis of the relationship between electricity consumption and climate change drivers. J Appl Stat 2022; 50:2267-2285. [PMID: 37434625 PMCID: PMC10332224 DOI: 10.1080/02664763.2022.2108773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2021] [Accepted: 07/26/2022] [Indexed: 10/15/2022]
Abstract
Climate change has become increasingly important in recent years. It is the outcome of the burning of fossil fuels that increased the concentration of atmospheric carbon dioxide (CO2) over the last century. Mitigating the impacts of climate change requires a better understanding and assessment of the countries' economic decisions on the amount of CO2 emissions. This paper assesses the variability between the different countries in the trends of CO2 emissions and electricity consumption from 1975 to 2014, while identifying clusters of countries of similar trends over time. The novel methodology applied in this paper enables us to assess long-debated issues in climate literature. The temporal dynamic effects of electricity consumption and economic growth on CO2 emissions across countries are studied using functional data analysis (FDA) methods. The latter have proven to be useful tools for visualising similarities and differences in the non-linear trends of CO2 emissions without forcing linear trends and stationary relationships which can be unrealistic and misleading. The results indicate the possibility of identifying changes in the trends of CO2 emissions and electricity consumption for a wide range of heterogeneous countries over the study period. The findings also reveal that economic growth puts a strain on the environment, where many high-income countries are still far from attaining economic-energy sustainability.
Affiliation(s)
- A. Elayouty
- Department of Statistics, Faculty of Economics and Political Science, Cairo University, Giza, Egypt
- H. Abou-Ali
- Department of Economics, Faculty of Economics and Political Science, Cairo University, Giza, Egypt
- Economic Research Forum, Giza, Egypt
7
D'Angelo L, Canale A, Yu Z, Guindani M. Bayesian nonparametric analysis for the detection of spikes in noisy calcium imaging data. Biometrics 2022. [PMID: 35191539 DOI: 10.1111/biom.13626] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/18/2021] [Accepted: 01/13/2022] [Indexed: 11/30/2022]
Abstract
Recent advancements in miniaturized fluorescence microscopy have made it possible to investigate neuronal responses to external stimuli in awake behaving animals through the analysis of intra-cellular calcium signals. An on-going challenge is deconvolving the temporal signals to extract the spike trains from the noisy calcium signals' time-series. In this manuscript, we propose a nested Bayesian finite mixture specification that allows the estimation of spiking activity and, simultaneously, reconstructing the distributions of the calcium transient spikes' amplitudes under different experimental conditions. The proposed model leverages two nested layers of random discrete mixture priors to borrow information between experiments and discover similarities in the distributional patterns of neuronal responses to different stimuli. Furthermore, the spikes' intensity values are also clustered within and between experimental conditions to determine the existence of common (recurring) response amplitudes. Simulation studies and the analysis of a data set from the Allen Brain Observatory show the effectiveness of the method in clustering and detecting neuronal activities.
Affiliation(s)
- Laura D'Angelo
- Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy
- Antonio Canale
- Department of Statistical Sciences, University of Padova, Padova, Italy
- Zhaoxia Yu
- Department of Statistics, University of California, Irvine, U.S.A.
- Michele Guindani
- Department of Statistics, University of California, Irvine, U.S.A.
8
Maleki M, Bidram H, Wraith D. Robust clustering of COVID-19 cases across U.S. counties using mixtures of asymmetric time series models with time varying and freely indexed covariates. J Appl Stat 2022; 50:2648-2662. [PMID: 37529575 PMCID: PMC10388823 DOI: 10.1080/02664763.2021.2019688] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/09/2020] [Accepted: 12/12/2021] [Indexed: 10/19/2022]
Abstract
In this paper, we develop a mixture of autoregressive (MoAR) process model with time varying and freely indexed covariates under the flexible class of two-piece distributions using the scale mixtures of normal (TP-SMN) family. This novel family of time series (TP-SMN-MoAR) models was used to examine flexible and robust clustering of reported cases of COVID-19 across 313 counties in the U.S. The TP-SMN distributions allow for symmetrical/asymmetrical distributions as well as heavy-tailed distributions, providing flexibility to handle outliers and complex data. Developing a suitable hierarchical representation of the TP-SMN family enabled the construction of a pseudo-likelihood function to derive the maximum pseudo-likelihood estimates via an EM-type algorithm.
Affiliation(s)
- Mohsen Maleki
- Department of Statistics, Faculty of Mathematics and Statistics, University of Isfahan, Isfahan, Iran
- Hamid Bidram
- Department of Statistics, Faculty of Mathematics and Statistics, University of Isfahan, Isfahan, Iran
- Darren Wraith
- School of Public Health & Social Work and Centre for Data Science, Queensland University of Technology (QUT), Brisbane, Australia
9
Kyoya S, Yamanishi K. Summarizing Finite Mixture Model with Overlapping Quantification. Entropy (Basel) 2021; 23:1503. [PMID: 34828201 PMCID: PMC8622449 DOI: 10.3390/e23111503] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/28/2021] [Revised: 11/07/2021] [Accepted: 11/08/2021] [Indexed: 11/18/2022]
Abstract
Finite mixture models are widely used for modeling and clustering data. When they are used for clustering, they are often interpreted by regarding each component as one cluster. However, this assumption may be invalid when the components overlap. It leads to the issue of analyzing such overlaps to correctly understand the models. The primary purpose of this paper is to establish a theoretical framework for interpreting the overlapping mixture models by estimating how they overlap, using measures of information such as entropy and mutual information. This is achieved by merging components to regard multiple components as one cluster and summarizing the merging results. First, we propose three conditions that any merging criterion should satisfy. Then, we investigate whether several existing merging criteria satisfy the conditions and modify them to fulfill more conditions. Second, we propose a novel concept named clustering summarization to evaluate the merging results. In it, we can quantify how overlapped and biased the clusters are, using mutual information-based criteria. Using artificial and real datasets, we empirically demonstrate that our methods of modifying criteria and summarizing results are effective for understanding the cluster structures. We therefore give a new view of interpretability/explainability for model-based clustering.
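One simple way to see how entropy can quantify component overlap, as this abstract suggests, is to average the Shannon entropy of the posterior component probabilities over a set of points. The sketch below does this for a one-dimensional two-component Gaussian mixture. It is only an illustration of the general idea, not the merging criteria or the clustering summarization proposed in the paper; the function names, parameter values, and evaluation grid are assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, mus, sigmas):
    """Posterior probability of each component given the point x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]


def mean_assignment_entropy(data, weights, mus, sigmas):
    """Average Shannon entropy (in nats) of the component posteriors."""
    total = 0.0
    for x in data:
        r = responsibilities(x, weights, mus, sigmas)
        total -= sum(p * math.log(p) for p in r if p > 0.0)
    return total / len(data)


data = [i * 0.1 for i in range(-30, 91)]  # grid of points from -3.0 to 9.0
weights, sigmas = [0.5, 0.5], [1.0, 1.0]
separated = mean_assignment_entropy(data, weights, [0.0, 6.0], sigmas)
overlapping = mean_assignment_entropy(data, weights, [0.0, 0.5], sigmas)
identical = mean_assignment_entropy(data, weights, [2.0, 2.0], sigmas)
print(separated, overlapping, identical)
```

Well-separated components give an average entropy near zero, while two identical components give log 2, the maximum for two classes. A high value is a hint that the two components may be better read as one cluster.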
Affiliation(s)
- Shunki Kyoya
- Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
10
Park J, Choi T, Chung Y. Nonparametric Bayesian functional two-part random effects model for longitudinal semicontinuous data analysis. Biom J 2021; 63:787-805. [PMID: 33554393 DOI: 10.1002/bimj.201900280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2019] [Revised: 04/23/2020] [Accepted: 07/17/2020] [Indexed: 11/08/2022]
Abstract
Longitudinal semicontinuous data, characterized by repeated measures with a large portion of zeros and continuous positive values, are frequently encountered in many applications including biomedical, epidemiological, and social science studies. Two-part random effects models (TPREM) have been used to investigate the association between such longitudinal semicontinuous data and covariates accounting for the within-subject correlation. The existing TPREM is, however, limited in its ability to incorporate a functional covariate, which is often available in a longitudinal study. Moreover, the existing TPREM typically assumes the normality of subject-specific random effects, which can be easily violated when there exists a subgroup structure. In this article, we propose a nonparametric Bayesian functional TPREM to assess the relationship between the longitudinal semicontinuous outcome and various types of covariates including a functional covariate. The proposed model also relaxes the normality assumption for the random effects through a Dirichlet process mixture of normals, which allows for identifying an underlying subgroup structure. The methodology is illustrated through an application to social insurance expenditure data collected by the Korean Welfare Panel Study and a simulation study.
Affiliation(s)
- Jinsu Park
- Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology, Daejeon, Korea
- Taeryon Choi
- Department of Statistics, Korea University, Seoul, Korea
- Yeonseung Chung
- Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology, Daejeon, Korea
11
Shirinkam S, Alaeddini A, Gross E. Identifying the number of components in Gaussian mixture models using numerical algebraic geometry. J Algebra Appl 2020; 19:2050204. [PMID: 33867617 PMCID: PMC8048412 DOI: 10.1142/s0219498820502047] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
Using Gaussian mixture models for clustering is a statistically mature method for clustering in data science with numerous successful applications in science and engineering. The parameters for a Gaussian mixture model are typically estimated from training data using the iterative expectation-maximization algorithm, which requires the number of Gaussian components a priori. In this study, we propose two algorithms rooted in numerical algebraic geometry, namely an area-based algorithm and a local maxima algorithm, to identify the optimal number of components. The area-based algorithm transforms several Gaussian mixture models with varying number of components into sets of equivalent polynomial regression splines. Next, it uses homotopy continuation methods for evaluating the resulting splines to identify the number of components that results in the best fit. The local maxima algorithm forms a set of polynomials by fitting a smoothing spline to a kernel density estimate of the data. Next, it uses numerical algebraic geometry to solve the system of the first derivatives for finding the local maxima of the resulting smoothing spline, which estimates the number of mixture components. The local maxima algorithm also identifies the location of the centers of Gaussian components. Using a real-world case study in automotive manufacturing and multiple simulations, we compare the performance of the proposed algorithms with that of Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are popular methods in the literature. We show the proposed algorithms are more robust than AIC and BIC when the Gaussian assumption is violated.
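The intuition behind the local maxima algorithm in this abstract, that the number of modes of a smoothed density estimate suggests the number of components, can be sketched with a much cruder tool. The example below builds a Gaussian kernel density estimate and counts its local maxima on a grid. It swaps in a plain grid search for the smoothing spline and the numerical algebraic geometry solver used in the paper; the bandwidth, grid size, and toy data are assumptions.

```python
import math


def kde(data, bandwidth):
    """Return a Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def find_modes(data, bandwidth, grid_size=400):
    """Locate local maxima of the KDE on an evenly spaced grid."""
    density = kde(data, bandwidth)
    lo, hi = min(data) - 3 * bandwidth, max(data) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    ys = [density(x) for x in xs]
    # A grid point is a mode if it rises from the left and does not rise to the right.
    return [xs[i] for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]]


# Deterministic toy data: two symmetric clumps centred at 0 and 8.
offsets = [-2, -1.5, -1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1, 1.5, 2]
data = [center + z for center in (0.0, 8.0) for z in offsets]
modes = find_modes(data, bandwidth=1.0)
print(len(modes), [round(v, 1) for v in modes])
```

The mode count depends heavily on the bandwidth: a small bandwidth produces spurious bumps and a large one merges real components, so a grid sketch like this is best read as a diagnostic rather than an estimator.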
Affiliation(s)
- Sara Shirinkam
- Department of Mathematics and Statistics, University of the Incarnate Word, 4301 Broadway, CPO 311, San Antonio, TX 78209, USA
- Adel Alaeddini
- Department of Mechanical Engineering, University of Texas at San Antonio, One UTSA Circle San Antonio, TX 78249, USA
- Elizabeth Gross
- Department of Mathematics, University of Hawai'i at Mānoa, 2565 McCarthy Mall, Honolulu, Hawaii 96822, USA
12
Abstract
This paper presents a new model-based generalized functional clustering method for discrete longitudinal data, such as multivariate binomial and Poisson distributed data. For this purpose, we propose a multivariate functional principal component analysis (MFPCA)-based clustering procedure that operates on a latent multivariate Gaussian process instead of on the original functional data directly. The main contribution of this study is two-fold: modeling of discrete longitudinal data with the latent multivariate Gaussian process and development of a clustering algorithm based on MFPCA coupled with the latent multivariate Gaussian process. Numerical experiments, including real data analysis and a simulation study, demonstrate the promising empirical properties of the proposed approach.
Affiliation(s)
- Yaeji Lim
- Department of Applied Statistics, Chung-Ang University, Seoul, Republic of Korea
- Hee-Seok Oh
- Department of Statistics, Seoul National University, Seoul, Republic of Korea
13
Jiang T, Lu Y, Duan H, Zhang W, Liu A. A model-based approach for clustering of multivariate semicontinuous data with application to dietary pattern analysis and intervention. Stat Med 2020; 39:16-25. [PMID: 31702055 DOI: 10.1002/sim.8391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2018] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 11/10/2022]
Abstract
Semicontinuous data, characterized by a sizable number of zeros and observations from a continuous distribution, are frequently encountered in health research concerning food consumption, physical activities, medical and pharmacy claims expenditures, and many others. In analyzing such semicontinuous data, it is imperative that the excessive zeros be adequately accounted for to obtain unbiased and efficient inference. Although many methods have been proposed in the literature for the modeling and analysis of semicontinuous data, little attention has been given to clustering of semicontinuous data to identify important patterns that could be indicative of certain health outcomes or intervention effects. We propose a Bernoulli-normal mixture model for clustering of multivariate semicontinuous data and demonstrate its accuracy as compared to the well-known clustering method with the conventional normal mixture model. The proposed method is illustrated with data from a dietary intervention trial to promote healthy eating behavior among children with type 1 diabetes. In the trial, certain diabetes-friendly foods (eg, total fruit, whole fruit, dark green and orange vegetables and legumes, whole grain) were only consumed by a proportion of study participants, yielding excessive zero values due to nonconsumption of the foods. Baseline food consumption data in the trial are used to explore preintervention dietary patterns among study participants. While the conventional normal mixture model approach fails to do so, the proposed Bernoulli-normal mixture model approach is shown to be able to identify a dietary profile that significantly differentiates the intervention effects from others, as measured by the popular healthy eating index at the end of the trial.
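To make the idea of a Bernoulli-normal mixture more concrete, the sketch below evaluates a component density in which each variable is zero with some probability and otherwise follows a normal distribution, then computes posterior cluster probabilities for one observation. It is a minimal reading of the model class named in this abstract, not the authors' specification: independence of variables within a cluster, the function names, and every parameter value are assumptions made for illustration.

```python
import math


def log_component_density(x, zero_prob, mean, sd):
    """Log density of one observation under a single cluster.

    Each variable is zero with probability zero_prob[j]; otherwise it is
    drawn from Normal(mean[j], sd[j]). Variables are treated as independent
    within the cluster (an assumption of this sketch).
    """
    total = 0.0
    for xj, pj, mj, sj in zip(x, zero_prob, mean, sd):
        if xj == 0:
            total += math.log(pj)
        else:
            z = (xj - mj) / sj
            total += math.log(1.0 - pj) - 0.5 * z * z - math.log(sj * math.sqrt(2.0 * math.pi))
    return total


def cluster_posteriors(x, weights, components):
    """Posterior cluster probabilities, computed stably in log space."""
    logs = [math.log(w) + log_component_density(x, *c) for w, c in zip(weights, components)]
    top = max(logs)
    scaled = [math.exp(v - top) for v in logs]
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Two hypothetical clusters over two foods: (zero_prob, mean, sd) per cluster.
components = [
    ([0.8, 0.1], [2.0, 2.0], [1.0, 1.0]),  # rarely consumes food 1
    ([0.1, 0.1], [2.0, 2.0], [1.0, 1.0]),  # usually consumes both foods
]
post = cluster_posteriors([0.0, 2.0], [0.5, 0.5], components)
print([round(p, 3) for p in post])
```

In this toy setup the two clusters differ only in how often food 1 is not consumed, so an observation with a zero for food 1 is assigned mostly to the first cluster. A plain normal mixture has no term for the zero probability and cannot separate the clusters on that basis.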
Affiliation(s)
- Tao Jiang
- School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China
- Yahui Lu
- School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China
- Huimin Duan
- School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China
- Wei Zhang
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
- Aiyi Liu
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
14
Saraiva EF, Suzuki AK, Milan LA, Pereira CDB. An Integrated Approach for Making Inference on the Number of Clusters in a Mixture Model. Entropy (Basel) 2019; 21:1063. [PMCID: PMC7514367 DOI: 10.3390/e21111063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/23/2019] [Accepted: 10/26/2019] [Indexed: 06/12/2023]
Abstract
This paper presents an integrated approach for the estimation of the parameters of a mixture model in the context of data clustering. The method is designed to estimate the unknown number of clusters from observed data. For this, we marginalize out the weights for getting allocation probabilities that depend on the number of clusters but not on the number of components of the mixture model. As an alternative to the stochastic expectation maximization (SEM) algorithm, we propose the integrated stochastic expectation maximization (ISEM) algorithm, which in contrast to SEM, does not need the specification, a priori, of the number of components of the mixture. Using this algorithm, one estimates the parameters associated with the clusters, with at least two observations, via local maximization of the likelihood function. In addition, at each iteration of the algorithm, there exists a positive probability of a new cluster being created by a single observation. Using simulated datasets, we compare the performance of the ISEM algorithm against both SEM and reversible jump (RJ) algorithms. The obtained results show that ISEM outperforms SEM and RJ algorithms. We also provide the performance of the three algorithms in two real datasets.
Affiliation(s)
- Adriano Kamimura Suzuki
- Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo, São Carlos 13566-590, Brazil
- Luis Aparecido Milan
- Departamento de Estatística, Universidade Federal de São Carlos, São Carlos 13565-905, Brazil
- Carlos Alberto de Bragança Pereira
- Instituto de Matemática, Universidade Federal de Mato Grosso do Sul, Campo Grande 79070-900, Brazil
- Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo 05508-090, Brazil
15
Shigyo N, Umeki K, Hirao T. Seasonal Dynamics of Soil Fungal and Bacterial Communities in Cool-Temperate Montane Forests. Front Microbiol 2019; 10:1944. [PMID: 31507559 PMCID: PMC6716449 DOI: 10.3389/fmicb.2019.01944] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 04/16/2019] [Accepted: 08/07/2019] [Indexed: 02/01/2023]
Abstract
Both fungal and bacterial communities in soils play key roles in driving forest ecosystem processes across multiple time scales, but how seasonal changes in environmental factors shape these microbial communities is not well understood. Here, we aimed to evaluate the importance of seasons, elevation, and soil depth in determining soil fungal and bacterial communities, given the influence of climate conditions, soil properties and plant traits. In this study, seasonal patterns of diversity and abundance did not synchronize between fungi and bacteria, where soil fertility explained the diversity and abundance of soil fungi but soil water content explained those of soil bacteria. Model-based clustering showed that seasonal changes in both abundant and rare taxonomic groups were different between soil fungi and bacteria. The cluster represented by ectomycorrhizal genus Lactarius was a dominant group across soil fungal communities and fluctuated seasonally. For soil bacteria, the clusters composed of dominant genera were seasonally stable but varied greatly depending on elevation and soil depth. Seasonally changing clusters of soil bacteria (e.g., Nitrospira and Pelosinus) were not dominant groups and were related to plant phenology. These findings suggest that the contribution of seasonal changes in climate conditions, soil fertility, and plant phenology to microbial communities might be equal to or greater than the effects of spatial heterogeneity of those factors. Our study identifies aboveground-belowground components as key factors explaining how microbial communities change during a year in forest soils at mid-to-high latitudes.
Affiliation(s)
- Nobuhiko Shigyo
- The University of Tokyo Chichibu Forest, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Chichibu, Japan
- Kiyoshi Umeki
- Graduate School of Horticulture, Chiba University, Matsudo, Japan
- Toshihide Hirao
- The University of Tokyo Chichibu Forest, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Chichibu, Japan
16
Paul S, Corwin EJ. Identifying clusters from multidimensional symptom trajectories in postpartum women. Res Nurs Health 2019; 42:119-127. [PMID: 30710373 DOI: 10.1002/nur.21935] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 07/09/2018] [Accepted: 01/01/2019] [Indexed: 12/15/2022]
Abstract
Depressive symptoms, stress, fatigue, and lack of sleep are often experienced by women in the perinatal period and are potential contributors to adverse maternal and child health outcomes. To explore the evolution of symptoms and identify groups of women of similar severity and patterns, we utilized clustering of multidimensional symptom trajectories. In an observational study, data were collected from pregnant women in the 3rd trimester (36 weeks prenatal) and in the postnatal period at weeks 1 and 2 as well as at 1-, 2-, 3-, and 6-months postpartum. Depressive symptoms and maternal stress were measured using the Edinburgh Postnatal Depression Scale (EPDS) and the Perceived Stress Scale (PSS), respectively. Self-reported duration of sleep and levels of fatigue also were collected. A model-based clustering approach was used to classify women by their symptom severity. The sample included 151 pregnant women with a 6-month follow-up. Two clusters were identified. Cluster 1 (n = 43) comprised women with fewer depressive symptoms, less perceived stress, lower likelihood of being fatigued, increased sleep duration and a negative trend in EPDS (β = -0.05, CI [-0.09, -0.001]), and PSS (β = -0.09, CI [-0.17, -0.01]). Cluster 2 (n = 108) comprised women with higher EPDS and PSS scores, increased likelihood of fatigue and lower sleep duration with a positive trend in sleep hours (β = 0.02, CI [0.01, 0.03]). Pro-inflammatory markers interleukin-6 and tumor necrosis factor-α were associated with longer sleep duration and fewer depressive symptoms, respectively. Using this methodology in maternal and child health research can potentially predict women's risk of developing severe symptoms and help clinicians provide timely interventions.
Affiliation(s)
- Sudeshna Paul
- Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, Georgia
- Elizabeth J Corwin
- Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, Georgia
17
Abstract
A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure. The method estimates ancestry proportions for each sample from a set of two-dimensional population layers, and, within each layer, estimates a rate at which relatedness decays with distance. This thereby explicitly addresses the "clines versus clusters" problem in modeling population genetic variation, and remedies some of the overfitting to which nonspatial models are prone. The method produces useful descriptions of structure in genetic relatedness in situations where separated, geographically distributed populations interact, as after a range expansion or secondary contact. We demonstrate the utility of this approach using simulations and by applying it to empirical datasets of poplars and black bears in North America.
Affiliation(s)
- Gideon S Bradburd
- Ecology, Evolutionary Biology, and Behavior Graduate Group, Department of Integrative Biology, Michigan State University, East Lansing, Michigan 48824
- Graham M Coop
- Center for Population Biology, Department of Evolution and Ecology, University of California, Davis, California 95616
- Peter L Ralph
- Institute of Ecology and Evolution, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97403
18
Molsberry SA, Cheng Y, Kingsley L, Jacobson L, Levine AJ, Martin E, Miller EN, Munro CA, Ragin A, Sacktor N, Becker JT. Neuropsychological phenotypes among men with and without HIV disease in the multicenter AIDS cohort study. AIDS 2018; 32:1679-1688. [PMID: 29762177 PMCID: PMC6082155 DOI: 10.1097/qad.0000000000001865] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/26/2022]
Abstract
OBJECTIVE: Mild forms of HIV-associated neurocognitive disorder (HAND) remain prevalent in the combination antiretroviral therapy (cART) era. This study's objective was to identify neuropsychological subgroups within the Multicenter AIDS Cohort Study (MACS) based on the participant-based latent structure of cognitive function and to identify factors associated with subgroups. DESIGN: The MACS is a four-site longitudinal study of the natural and treated history of HIV disease among gay and bisexual men. METHODS: Using neuropsychological domain scores, we used a cluster variable selection algorithm to identify the optimal subset of domains with cluster information. Latent profile analysis was applied using scores from identified domains. Exploratory and posthoc analyses were conducted to identify factors associated with cluster membership and the drivers of the observed associations. RESULTS: Cluster variable selection identified all domains as containing cluster information except for Working Memory. A three-profile solution produced the best fit for the data. Profile 1 performed below average on all domains; Profile 2 performed average on executive functioning, motor, and speed and below average on learning and memory; Profile 3 performed at or above average across all domains. Several demographic, cognitive, and social factors were associated with profile membership; these associations were driven by differences between Profile 1 and the other profiles. CONCLUSION: There is an identifiable pattern of neuropsychological performance among MACS members determined by all domains except Working Memory. Neither HIV nor HIV-related biomarkers were related to cluster membership, consistent with other findings that cognitive performance patterns do not map directly onto HIV serostatus.
Affiliation(s)
- Samantha A Molsberry
- Population Health Sciences Program, Graduate School of Arts and Sciences, Harvard University, Cambridge, Massachusetts
- Yu Cheng
- Department of Statistics
- Department of Psychiatry, University of Pittsburgh
- Lawrence Kingsley
- Department of Epidemiology
- Department of Infectious Diseases and Microbiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Lisa Jacobson
- Department of Epidemiology, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, Maryland
- Andrew J Levine
- Department of Neurology, David Geffen School of Medicine, UCLA, Los Angeles, California
- Eileen Martin
- Department of Psychiatry, Rush University School of Medicine, Chicago, Illinois
- Eric N Miller
- Department of Neurology, David Geffen School of Medicine, UCLA, Los Angeles, California
- Cynthia A Munro
- Department of Psychiatry
- Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, Maryland
- Ann Ragin
- Department of Radiology, Northwestern University, Evanston, Illinois
- Ned Sacktor
- Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, Maryland
- James T Becker
- Department of Psychiatry, University of Pittsburgh
- Department of Psychology
- Department of Neurology, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
19
Storlie CB, Myers SM, Katusic SK, Weaver AL, Voigt RG, Croarkin PE, Stoeckel RE, Port JD. Clustering and variable selection in the presence of mixed variable types and missing data. Stat Med 2018; 37:10.1002/sim.7697. [PMID: 29774571 PMCID: PMC6240391 DOI: 10.1002/sim.7697] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/27/2017] [Revised: 03/11/2018] [Accepted: 03/20/2018] [Indexed: 11/09/2022]
Abstract
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
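The way a Dirichlet process leaves the number of components open can be illustrated with the Chinese restaurant process, a standard way of drawing a partition from a Dirichlet process prior. The sketch below is a generic simulation of that prior only, not the authors' model (which also handles latent continuous variables, missing values, and variable selection); the function name, the concentration value `alpha`, and the seed are assumptions chosen for illustration.

```python
import random


def chinese_restaurant_process(n, alpha, seed=0):
    """Draw cluster labels for n items from a Chinese restaurant process.

    Item i joins existing cluster k with probability proportional to the
    cluster's current size, or opens a new cluster with probability
    proportional to alpha (alpha must be positive).
    """
    rng = random.Random(seed)
    assignments = []  # cluster label of each item, in arrival order
    sizes = []        # sizes[k] = number of items currently in cluster k
    for _ in range(n):
        # Weights for the existing clusters, plus one slot for a new cluster.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, sizes


# Illustrative run: 486 items (the sample size quoted in the abstract),
# with a hypothetical concentration parameter of 1.0.
labels, cluster_sizes = chinese_restaurant_process(486, alpha=1.0)
print(len(cluster_sizes), "clusters drawn from the prior")
```

The number of clusters is not fixed in advance; it emerges from the draw and tends to grow slowly with the number of items, which is the property the abstract relies on.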
Affiliation(s)
- S M Myers
- Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA
- R G Voigt
- Texas Children's Hospital, Houston, USA
20
Abstract
Finite mixture modeling provides a framework for cluster analysis based on parsimonious Gaussian mixture models. Variable or feature selection is of particular importance in situations where only a subset of the available variables provide clustering information. This enables the selection of a more parsimonious model, yielding more efficient estimates, a clearer interpretation and, often, improved clustering partitions. This paper describes the R package clustvarsel which performs subset selection for model-based clustering. An improved version of the Raftery and Dean (2006) methodology is implemented in the new release of the package to find the (locally) optimal subset of variables with group/cluster information in a dataset. Search over the solution space is performed using either a step-wise greedy search or a headlong algorithm. Adjustments for speeding up these algorithms are discussed, as well as a parallel implementation of the stepwise search. Usage of the package is presented through the discussion of several data examples.
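As a rough illustration of the comparison behind this kind of stepwise search, the original Raftery and Dean (2006) criterion can be written as a BIC difference. The notation is an assumption introduced here: S is the set of currently selected variables, j is a candidate variable, ν is the number of free parameters, and n is the number of observations. The improved version implemented in the package may differ in its details.

```latex
\mathrm{BIC}_{\mathrm{diff}}(j)
  = \mathrm{BIC}_{\mathrm{clust}}(S \cup \{j\})
  - \bigl[\, \mathrm{BIC}_{\mathrm{clust}}(S) + \mathrm{BIC}_{\mathrm{reg}}(j \mid S) \,\bigr],
\qquad
\mathrm{BIC} = 2 \log \hat{L} - \nu \log n
```

Under this sign convention, where a larger BIC is better, a positive difference is evidence that variable j carries clustering information beyond what S already explains, while a negative difference suggests j is adequately described by a regression on S.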
Affiliation(s)
- Luca Scrucca
- Department of Economics, Università degli Studi di Perugia, Via A. Pascoli, 20, 06123 Perugia, Italy, URL: http://www.stat.unipg.it/luca
- Adrian E Raftery
- Department of Statistics, University of Washington, Box 354320, Seattle, WA 98195-4320, United States of America, URL: http://www.stat.washington.edu/raftery/
21
Lee S, Liang F, Cai L, Xiao G. A two-stage approach of gene network analysis for high-dimensional heterogeneous data. Biostatistics 2018; 19:216-232. [PMID: 29036516 PMCID: PMC5862270 DOI: 10.1093/biostatistics/kxx033] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/08/2016] [Revised: 04/18/2017] [Accepted: 05/07/2017] [Indexed: 11/13/2022] Open
Abstract
Gaussian graphical models have been widely used to construct gene regulatory networks from gene expression data. Most existing methods for Gaussian graphical models are designed to model homogeneous data, assuming a single Gaussian distribution. In practice, however, data may consist of gene expression studies with unknown confounding factors, such as study cohort, microarray platforms, experimental batches, which produce heterogeneous data, and hence lead to false positive edges or low detection power in resulting network, due to those unknown factors. To overcome this problem and improve the performance in constructing gene networks, we propose a two-stage approach to construct a gene network from heterogeneous data. The first stage is to perform a clustering analysis in order to assign samples to a few clusters where the samples in each cluster are approximately homogeneous, and the second stage is to conduct an integrative analysis of networks from each cluster. In particular, we first apply a model-based clustering method using the singular value decomposition for high-dimensional data, and then integrate the networks from each cluster using the integrative $\psi$-learning method. The proposed method is based on an equivalent measure of partial correlation coefficients in Gaussian graphical models, which is computed with a reduced conditional set and thus it is useful for high-dimensional data. We compare the proposed two-stage learning approach with some existing methods in various simulation settings, and demonstrate the robustness of the proposed method. Finally, it is applied to integrate multiple gene expression studies of lung adenocarcinoma to identify potential therapeutic targets and treatment biomarkers.
Affiliation(s)
- Sangin Lee
- Department of Information and Statistics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Korea
- Faming Liang
- Department of Biostatistics, University of Florida, FL 32610, USA
- Ling Cai
- Quantitative Biomedical Research Center, Department of Clinical Sciences, and Children's Research Institute, University of Texas Southwestern Medical Center, 6000 Harry Hines Blvd, Dallas, TX 75390, USA
- Guanghua Xiao
- Quantitative Biomedical Research Center, Department of Clinical Sciences, Department of Bioinformatics, Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, 6000 Harry Hines Blvd, Dallas, TX 75390, USA
22
Lalonde A, Love T. Using the Seychelles Child Development Study to cluster multiple outcomes into domains to improve estimation of the overall effect of mercury on neurodevelopment. Math Appl 2018; 7:53-62. [PMID: 30636979 PMCID: PMC6329395 DOI: 10.13164/ma.2018.05] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/09/2023]
Abstract
Environmental exposure effects on human development can be small and difficult to detect due to the nature of observational data. In the Seychelles Child Development Study, researchers examined the effect of prenatal methylmercury exposure using a battery of tests measuring aspects of child development [23, 25]. We build a multiple outcomes model similar to that of the previous analyses (see [23, 25]); however, our multiple outcomes model makes no assumptions of relationships between the testing outcomes. Instead, the nesting of outcomes into domains is a clustering problem we address with a Dirichlet process mixture model implemented through a Bayesian MCMC approach [16]. This model provides inference for the methylmercury exposure effect as well as greater insight into the similarities and differences across the outcomes.
Affiliation(s)
- Amy Lalonde
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY
- Tanzy Love
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY
23
Ma T, Liang F, Tseng G. Biomarker detection and categorization in ribonucleic acid sequencing meta-analysis using Bayesian hierarchical models. J R Stat Soc Ser C Appl Stat 2016; 66:847-867. [PMID: 28785119 DOI: 10.1111/rssc.12199] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/01/2023]
Abstract
Meta-analysis combining multiple transcriptomic studies increases statistical power and accuracy in detecting differentially expressed genes. As next-generation sequencing experiments become mature and affordable, an increasing number of RNA-seq datasets are available in the public domain. The count-data based technology provides better experimental accuracy, reproducibility and ability to detect low-expressed genes. A naive approach to combining multiple RNA-seq studies is to apply differential analysis tools such as edgeR and DESeq to each study and then combine the summary statistics of p-values or effect sizes by conventional meta-analysis methods. Such a two-stage approach loses statistical power, especially for genes with short length or low expression abundance. In this paper, we propose a full Bayesian hierarchical model (namely, BayesMetaSeq) for RNA-seq meta-analysis by modelling count data, integrating information across genes and across studies, and modelling potentially heterogeneous differential signals across studies via latent variables. A Dirichlet process mixture (DPM) prior is further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies, facilitating improved interpretation and biological hypothesis generation. Simulations and a real application on multi-brain-region HIV-1 transgenic rats demonstrate improved sensitivity, accuracy and biological findings of the proposed method.
Affiliation(s)
- Tianzhou Ma
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
- Faming Liang
- Department of Biostatistics, University of Florida, Gainesville, FL 32611
- George Tseng
- Department of Biostatistics (primary appointment), Department of Human Genetics, Department of Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261
24
Abstract
We aim to estimate multiple networks in the presence of sample heterogeneity, where the independent samples (i.e. observations) may come from different and unknown populations or distributions. Specifically, we consider penalized estimation of multiple precision matrices in the framework of a Gaussian mixture model. A major innovation is to take advantage of the commonalities across the multiple precision matrices through possibly nonconvex fusion regularization, which for example makes it possible to achieve simultaneous discovery of unknown disease subtypes and detection of differential gene (dys)regulations in functional genomics. We embed in the EM algorithm one of two recently proposed methods for estimating multiple precision matrices in Gaussian graphical models. We demonstrate the feasibility and potential usefulness of the proposed methods in an application to glioblastoma subtype discovery and differential gene network analysis with a microarray gene expression data set. We also conduct realistic simulation studies to evaluate and compare the performance of various methods.
Affiliation(s)
- Chen Gao
- Division of Biostatistics, School of Public Health, University of Minnesota
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota
25
Abstract
Finite mixture models have been used to model population heterogeneity and to relax distributional assumptions. These models are also convenient tools for clustering and classification of complex data such as, for example, repeated-measurements data. The performance of model-based clustering algorithms is sensitive to influential and outlying observations. Methods for identifying outliers in a finite mixture model have been described in the literature. Approaches to identify influential observations are less common. In this paper, we apply local-influence diagnostics to a finite mixture model with known number of components. The methodology is illustrated on real-life data.
Affiliation(s)
- Geert Molenberghs
- I-BioStat, Universiteit Hasselt, Hasselt, Belgium; I-BioStat, Katholieke Universiteit Leuven, Leuven, Belgium
- Geert Verbeke
- I-BioStat, Universiteit Hasselt, Hasselt, Belgium; I-BioStat, Katholieke Universiteit Leuven, Leuven, Belgium
26
Flynt A, Daepp MIG. Diet-related chronic disease in the northeastern United States: a model-based clustering approach. Int J Health Geogr 2015; 14:25. [PMID: 26338084 PMCID: PMC4559302 DOI: 10.1186/s12942-015-0017-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/20/2015] [Accepted: 08/14/2015] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND: Obesity and diabetes are global public health concerns. Studies indicate a relationship between socioeconomic, demographic and environmental variables and the spatial patterns of diet-related chronic disease. In this paper, we propose a methodology using model-based clustering and variable selection to predict rates of obesity and diabetes. We test this method through an application in the northeastern United States. METHODS: We use model-based clustering, an unsupervised learning approach, to find latent clusters of similar US counties based on a set of socioeconomic, demographic, and environmental variables chosen through the process of variable selection. We then use Analysis of Variance and Post-hoc Tukey comparisons to examine differences in rates of obesity and diabetes for the clusters from the resulting clustering solution. RESULTS: We find access to supermarkets, median household income, population density and socioeconomic status to be important in clustering the counties of two northeastern states. The results of the cluster analysis can be used to identify two sets of counties with significantly lower rates of diet-related chronic disease than those observed in the other identified clusters. These relatively healthy clusters are distinguished by the large central and large fringe metropolitan areas contained in their component counties. However, the relationship of socio-demographic factors and diet-related chronic disease is more complicated than previous research would suggest. Additionally, we find evidence of low food access in two clusters of counties adjacent to large central and fringe metropolitan areas. While food access has previously been seen as a problem of inner-city or remote rural areas, this study offers preliminary evidence of declining food access in suburban areas.
CONCLUSIONS: Model-based clustering with variable selection offers a new approach to the analysis of socioeconomic, demographic, and environmental data for diet-related chronic disease prediction. In a test application to two northeastern states, this method allows us to identify two sets of metropolitan counties with significantly lower diet-related chronic disease rates than those observed in most rural and suburban areas. Our method could be applied to larger geographic areas or other countries with comparable data sets, offering a promising method for researchers interested in the global increase in diet-related chronic disease.
Affiliation(s)
- Abby Flynt
- Department of Mathematics, Bucknell University, 701 Moore Ave, 17837, Lewisburg, PA, USA.
- Madeleine I G Daepp
- Integrated Studies in Land and Food Systems, The University of British Columbia Vancouver, 2329 West Mall, V6T 1Z4, Vancouver, BC, Canada.
27
Abstract
The rescue and relief operations triggered by the September 11, 2001 attacks on the World Trade Center in New York City demanded collaboration among hundreds of organisations. To shed light on the response to the September 11, 2001 attacks and help to plan and prepare the response to future disasters, we study the inter-organisational network that emerged in response to the attacks. Studying the inter-organisational network can help to shed light on (1) whether some organisations dominated the inter-organisational network and facilitated communication and coordination of the disaster response; (2) whether the dominating organisations were supposed to coordinate disaster response or emerged as coordinators in the wake of the disaster; and (3) the degree of network redundancy and sensitivity of the inter-organisational network to disturbances following the initial disaster. We introduce a Bayesian framework which can answer the substantive questions of interest while being as simple and parsimonious as possible. The framework allows organisations to have varying propensities to collaborate, while taking covariates into account, and allows one to assess whether the inter-organisational network had network redundancy, in the form of transitivity, by using a test which may be regarded as a Bayesian score test. We discuss implications in terms of disaster management.
Affiliation(s)
- Duy Quang Vu
- Department of Mathematics and Statistics, University of Melbourne, Melbourne, Australia
28
Zhang K, Rood RB, Michailidis G, Oswald EM, Schwartz JD, Zanobetti A, Ebi KL, O'Neill MS. Comparing exposure metrics for classifying 'dangerous heat' in heat wave and health warning systems. Environ Int 2012; 46:23-9. [PMID: 22673187 PMCID: PMC3401591 DOI: 10.1016/j.envint.2012.05.001] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 10/11/2011] [Revised: 05/05/2012] [Accepted: 05/07/2012] [Indexed: 05/18/2023]
Abstract
Heat waves have been linked to excess mortality and morbidity, and are projected to increase in frequency and intensity with a warming climate. This study compares exposure metrics to trigger heat wave and health warning systems (HHWS), and introduces a novel multi-level hybrid clustering method to identify potential dangerously hot days. Two-level and three-level hybrid clustering analysis as well as common indices used to trigger HHWS, including spatial synoptic classification (SSC), and the 90th, 95th, and 99th percentiles of minimum and relative minimum temperature (using a 10 day reference period), were calculated using a summertime weather dataset in Detroit from 1976 to 2006. The days classified as 'hot' with hybrid clustering analysis, SSC, minimum and relative minimum temperature methods differed by method type. SSC tended to include the days with, on average, 2.5 °C lower daily minimum temperature and 5.3 °C lower dew point than days identified by other methods. These metrics were evaluated by comparing their performance in predicting excess daily mortality. The 99th percentile of minimum temperature was generally the most predictive, followed by the three-level hybrid clustering method, the 95th percentile of minimum temperature, SSC and others. Our proposed clustering framework has more flexibility and requires less substantial meteorological prior information than the synoptic classification methods. Comparison of these metrics in predicting excess daily mortality suggests that metrics thought to better characterize physiological heat stress by considering several weather conditions simultaneously may not be the same metrics that are better at predicting heat-related mortality, which has significant implications in HHWSs.
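A percentile-based trigger of the kind compared in this abstract can be sketched in a few lines. This is a minimal illustration only: the abstract does not say which percentile definition the study used, so the inclusive interpolation rule, the function names, and the synthetic temperature series are all assumptions made for the example.

```python
from statistics import quantiles


def percentile_threshold(values, pct):
    """Return the pct-th percentile (1..99) of values.

    Uses the 'inclusive' interpolation rule from the standard library;
    the study may have used a different percentile definition.
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]


def flag_hot_days(min_temps, pct=95):
    """Mark each day whose minimum temperature is at or above the threshold."""
    threshold = percentile_threshold(min_temps, pct)
    return [t >= threshold for t in min_temps]


# Synthetic daily minimum temperatures (hypothetical, not study data).
min_temps = [float(t) for t in range(1, 102)]  # 1.0, 2.0, ..., 101.0
flags = flag_hot_days(min_temps, pct=95)
print(percentile_threshold(min_temps, 95), sum(flags))
```

Raising `pct` from 95 to 99 narrows the set of flagged days, which is the trade-off the abstract evaluates against excess daily mortality.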
Affiliation(s)
- Kai Zhang
- Department of Environmental Health Sciences, University of Michigan, Ann Arbor, MI 48109-2029, USA.
29
Abstract
Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
Affiliation(s)
- Hongxia Yang
- Mathematical Sciences Department, Watson Research Center, IBM, Yorktown Heights, NY 10598
- Sean O’Brien
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710
- David B. Dunson
- Department of Statistical Science, Duke University, Durham, NC 27708
30
Kim S, Dahl DB, Vannucci M. Spiked Dirichlet Process Prior for Bayesian Multiple Hypothesis Testing in Random Effects Models. Bayesian Anal 2009; 4:707-732. [PMID: 23950766 PMCID: PMC3741668 DOI: 10.1214/09-ba426] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 06/02/2023]
Abstract
We propose a Bayesian method for multiple hypothesis testing in random effects models that uses Dirichlet process (DP) priors for a nonparametric treatment of the random effects distribution. We consider a general model formulation which accommodates a variety of multiple treatment conditions. A key feature of our method is the use of a product of spiked distributions, i.e., mixtures of a point-mass and continuous distributions, as the centering distribution for the DP prior. Adopting these spiked centering priors readily accommodates sharp null hypotheses and allows for the estimation of the posterior probabilities of such hypotheses. Dirichlet process mixture models naturally borrow information across objects through model-based clustering while inference on single hypotheses averages over clustering uncertainty. We demonstrate via a simulation study that our method yields increased sensitivity in multiple hypothesis testing and produces a lower proportion of false discoveries than other competitive methods. While our modeling framework is general, here we present an application in the context of gene expression from microarray experiments. In our application, the modeling framework allows simultaneous inference on the parameters governing differential expression and inference on the clustering of genes. We use experimental data on the transcriptional response to oxidative stress in mouse heart muscle and compare the results from our procedure with existing nonparametric Bayesian methods that provide only a ranking of the genes by their evidence for differential expression.
Affiliation(s)
- Sinae Kim
- Department of Biostatistics, University of Michigan, Ann Arbor, MI
- David B. Dahl
- Department of Statistics, Texas A&M University, College Station, TX
|
31
|
Heard NA, Holmes CC, Stephens DA, Hand DJ, Dimopoulos G. Bayesian coclustering of Anopheles gene expression time series: study of immune defense response to multiple experimental challenges. Proc Natl Acad Sci U S A 2005; 102:16939-44. [PMID: 16287981 PMCID: PMC1287961 DOI: 10.1073/pnas.0408393102] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Received: 11/11/2004] [Accepted: 08/30/2005] [Indexed: 11/18/2022]
Abstract
We present a method for Bayesian model-based hierarchical coclustering of gene expression data and use it to study the temporal transcription responses of an Anopheles gambiae cell line upon challenge with multiple microbial elicitors. The method fits statistical regression models to the gene expression time series for each experiment and performs coclustering on the genes by optimizing a joint probability model, characterizing gene coregulation between multiple experiments. We compute the model using a two-stage Expectation-Maximization-type algorithm, first fixing the cross-experiment covariance structure and using efficient Bayesian hierarchical clustering to obtain a locally optimal clustering of the gene expression profiles and then, conditional on that clustering, carrying out Bayesian inference on the cross-experiment covariance using Markov chain Monte Carlo simulation to obtain an expectation. For the problem of model choice, we use a cross-validatory approach to decide between individual experiment modeling and varying levels of coclustering. Our method successfully generates tightly coregulated clusters of genes that are implicated in related processes and therefore can be used for analysis of global transcript responses to various stimuli and prediction of gene functions.
Affiliation(s)
- Nicholas A Heard
- Department of Mathematics, Imperial College London, Huxley Building, 180 Queens Gate, London SW7 2AZ, United Kingdom.
|