1
Lu Z, Chandra NK. A sparse factor model for clustering high-dimensional longitudinal data. Stat Med 2024. [PMID: 38885953 DOI: 10.1002/sim.10151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 04/09/2024] [Accepted: 06/06/2024] [Indexed: 06/20/2024]
Abstract
Recent advances in engineering technologies have enabled the collection of a large number of longitudinal features. This wealth of information presents unique opportunities for researchers to investigate the complex nature of diseases and uncover underlying disease mechanisms. However, analyzing this kind of data can be difficult due to its high dimensionality, heterogeneity and computational challenges. In this article, we propose a Bayesian nonparametric mixture model for clustering high-dimensional mixed-type (eg, continuous, discrete and categorical) longitudinal features. We employ a sparse factor model on the joint distribution of random effects, and the key idea is to induce clustering at the latent factor level instead of the original data to escape the curse of dimensionality. The number of clusters is estimated through a Dirichlet process prior. An efficient Gibbs sampler is developed to estimate the posterior distribution of the model parameters. Analysis of real and simulated data is presented and discussed. Our study demonstrates that the proposed model serves as a useful analytical tool for clustering high-dimensional longitudinal data.
Affiliation(s)
- Zihang Lu
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
- Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada
- Noirrit Kiran Chandra
- Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, Texas, USA
2
Lu Z, Ahmadiankalati M, Tan Z. Joint clustering multiple longitudinal features: A comparison of methods and software packages with practical guidance. Stat Med 2023; 42:5513-5540. [PMID: 37789706 DOI: 10.1002/sim.9917] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/15/2022] [Revised: 06/07/2023] [Accepted: 09/13/2023] [Indexed: 10/05/2023]
Abstract
Clustering longitudinal features is a common goal in medical studies to identify distinct disease developmental trajectories. Compared to clustering a single longitudinal feature, integrating multiple longitudinal features allows additional information to be incorporated into the clustering process, which may reveal co-existing longitudinal patterns and generate deeper biological insight. Despite its increasing importance and popularity, there is limited practical guidance for implementing cluster analysis approaches for multiple longitudinal features and evaluating their comparative performance in medical datasets. In this paper, we provide an overview of several commonly used approaches to clustering multiple longitudinal features, with an emphasis on application and implementation through R software. These methods can be broadly grouped into two categories, namely model-based (including frequentist and Bayesian) approaches and algorithm-based approaches. To evaluate their performance, we compare these approaches using real-life and simulated datasets. These results provide practical guidance to applied researchers who are interested in applying these approaches for clustering multiple longitudinal features. Recommendations for applied researchers and suggestions for future research in this area are also discussed.
Affiliation(s)
- Zihang Lu
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
- Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada
- Zhiwen Tan
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
3
Romero-Moreno R, Márquez-González M, Gallego-Alberto L, Cabrera I, Vara-García C, Pedroso-Chaparro MDS, Barrera-Caballero S, Losada A. Guilt Focused Intervention for Family Caregivers. Preliminary Results of a Randomized Clinical Trial. Clin Gerontol 2022; 45:1304-1316. [PMID: 35286236 DOI: 10.1080/07317115.2022.2048287] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/03/2022]
Abstract
OBJECTIVES: A pilot randomized controlled trial was conducted to test the efficacy of a novel Guilt Focused Intervention (GFI), which was compared with a Cognitive Behavioral Intervention (CBI) for caregivers of people with dementia with high levels of guilt and distress. METHODS: Participants were 42 caregivers who were randomly assigned to the intervention conditions. RESULTS: Participants in the GFI showed significant reductions in depression, anxiety, and guilt at posttreatment and follow-up. Participants in the CBI presented reductions in anxiety and guilt at posttreatment and follow-up. Clinically significant change for guilt was found in 62.5% in the GFI and 9.09% in the CBI group at posttreatment. At follow-up, 58.33% in the GFI and 12.5% in the CBI group were recovered. CONCLUSIONS: The preliminary results of this pilot study suggest that caregivers with significant levels of guilt and distress might benefit from an intervention specifically designed to target guilt feelings. CLINICAL IMPLICATIONS: A novel and initial intervention approach specifically designed for targeting caregivers' feelings of guilt might have the potential to reduce caregivers' emotional distress.
Affiliation(s)
- María Márquez-González
- Department of Biological and Health Psychology, Universidad Autónoma de Madrid, Madrid, Spain
- Laura Gallego-Alberto
- Department of Biological and Health Psychology, Universidad Autónoma de Madrid, Madrid, Spain
- Isabel Cabrera
- Department of Biological and Health Psychology, Universidad Autónoma de Madrid, Madrid, Spain
- Andrés Losada
- Department of Psychology, Universidad Rey Juan Carlos, Alcorcón, Spain
4
Vávra J, Komárek A. Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database. Adv Data Anal Classif 2022. [DOI: 10.1007/s11634-022-00504-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
5
Mezzetti M, Borzelli D, d’Avella A. A Bayesian approach to model individual differences and to partition individuals: case studies in growth and learning curves. Stat Methods Appl 2022. [DOI: 10.1007/s10260-022-00625-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
Abstract
The first objective of the paper is to implement a two-stage Bayesian hierarchical nonlinear model for growth and learning curves, particular cases of longitudinal data with an underlying nonlinear time dependence. The aim is to model simultaneously individual trajectories over time, each with specific and potentially different characteristics, and a time-dependent behavior shared among individuals, including possible covariate effects. At the first stage, inter-individual differences are taken into account, while at the second stage we search for an average model. The second objective is to partition individuals into homogeneous groups when inter-individual parameters present a high level of heterogeneity. A new multivariate partitioning approach is proposed to cluster individuals according to the posterior distributions of the parameters describing the individual time-dependent behaviour. To assess the proposed methods, we present simulated data and two applications to real data, one related to growth curve modeling in agriculture and one related to learning curves for motor skills. Furthermore, a comparison with finite mixture analysis is shown.
6
LeBrun DG, Tran T, Wypij D, Kocher MS. Statistical Analysis of Dependent Observations in the Orthopaedic Sports Literature. Orthop J Sports Med 2019; 7:2325967118818410. [PMID: 30637265 PMCID: PMC6317150 DOI: 10.1177/2325967118818410] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022] Open
Abstract
Background: Orthopaedic research may involve multiple observations from the same patient because of bilateral joint involvement, multiple disease sites, or recurrent disease episodes. These situations violate statistical independence and need to be accounted for via appropriate statistical techniques. Failing to account for nonindependence may lead to biased and overly precise effect estimates. Purpose: To determine the degree to which orthopaedic sports medicine studies analyze dependent observations and the proportion of these failing to account for nonindependence. Study Design: Cross-sectional study. Methods: Clinical studies published in The American Journal of Sports Medicine from 2012 to 2017 were reviewed. Studies reporting nonindependent observations because of multiple extremity involvement or multiple disease episodes were identified. Methods to account for nonindependence were recorded. Studies violating the assumption of independence were identified and stratified by study design, level of evidence, body part involved, and inclusion of a statistician coauthor. Univariate logistic regression was used to determine whether these factors were associated with violations of statistical independence. Results: After screening 1016 articles, 886 clinical studies were reviewed. A total of 135 (15%) studies analyzed dependent observations, and 111 (82%) of these failed to account for nonindependence. Relative to the knee, studies of the hip (odds ratio [OR], 0.21; P = .02) and the thigh or leg (OR, 0.03; P = .004) were less likely to violate statistical independence. Study design (P = .03) was also associated with violations of statistical independence. Among studies that analyzed dependent observations, the median proportion of dependent observations relative to the total number of observations in each study was 0.07 (interquartile range, 0.04-0.12). 
Conclusion: The analysis of dependent observations is common in the orthopaedic sports literature, but most studies do not adjust for nonindependence in these situations. Investigators should be aware of incorrect inferences arising from nonindependence and how to statistically adjust for dependent data.
Affiliation(s)
- Drake G LeBrun
- Hospital for Special Surgery, New York, New York, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Tram Tran
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA; Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- David Wypij
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Mininder S Kocher
- Division of Sports Medicine, Boston Children's Hospital, Boston, Massachusetts, USA
7
Sun J, Herazo-Maya JD, Kaminski N, Zhao H, Warren JL. A Dirichlet process mixture model for clustering longitudinal gene expression data. Stat Med 2017; 36:3495-3506. [PMID: 28620908 PMCID: PMC5583037 DOI: 10.1002/sim.7374] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 02/03/2017] [Revised: 04/15/2017] [Accepted: 05/23/2017] [Indexed: 12/27/2022]
Abstract
Subgroup identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to define subgroups. Longitudinal gene expression profiles might provide more information on disease progression than is captured by baseline profiles alone. Therefore, subgroup identification could be more accurate and effective with the aid of longitudinal gene expression data. However, existing statistical methods are unable to fully utilize these data for patient clustering. In this article, we introduce a novel clustering method in the Bayesian setting based on longitudinal gene expression profiles. This method, called BClustLonG, adopts a linear mixed-effects framework to model the trajectory of genes over time, while clustering is jointly conducted based on the regression coefficients obtained from all genes. In order to account for the correlations among genes and alleviate the high dimensionality challenges, we adopt a factor analysis model for the regression coefficients. The Dirichlet process prior distribution is utilized for the means of the regression coefficients to induce clustering. Through extensive simulation studies, we show that BClustLonG has improved performance over other clustering methods. When applied to a dataset of severely injured (burn or trauma) patients, our model is able to identify interesting subgroups. Copyright © 2017 John Wiley & Sons, Ltd.
Affiliation(s)
- Jiehuan Sun
- Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
- Jose D Herazo-Maya
- Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, 06520, CT, U.S.A
- Naftali Kaminski
- Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, 06520, CT, U.S.A
- Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
- Joshua L Warren
- Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
8
9
Heinzl F, Tutz G. Clustering in linear-mixed models with a group fused lasso penalty. Biom J 2013; 56:44-68. [DOI: 10.1002/bimj.201200111] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 05/11/2012] [Revised: 12/04/2012] [Accepted: 08/18/2013] [Indexed: 11/06/2022]
Affiliation(s)
- Felix Heinzl
- Department of Statistics, Ludwig-Maximilians-University Munich, Akademiestr. 1, 80799 Munich, Germany
- Gerhard Tutz
- Department of Statistics, Ludwig-Maximilians-University Munich, Akademiestr. 1, 80799 Munich, Germany
10
Komárek A, Komárková L. Clustering for multivariate continuous and discrete longitudinal data. Ann Appl Stat 2013. [DOI: 10.1214/12-aoas580] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
11
Heinzl F, Tutz G. Clustering in linear mixed models with approximate Dirichlet process mixtures using EM algorithm. Stat Model 2013. [DOI: 10.1177/1471082x12471372] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
In linear mixed models, the assumption of normally distributed random effects is often inappropriate and unnecessarily restrictive. The proposed approximate Dirichlet process mixture assumes a hierarchical Gaussian mixture that is based on the truncated version of the stick-breaking representation of the Dirichlet process. In addition to weakening the distributional assumptions, the specification makes it possible to identify clusters of observations with a similar random effects structure. An Expectation-Maximization algorithm is given that solves the estimation problem and that, in certain respects, may exhibit advantages over Markov chain Monte Carlo approaches when modelling with Dirichlet processes. The method is evaluated in a simulation study and applied to the dynamics of unemployment in Germany as well as lung function growth data.
Affiliation(s)
- Felix Heinzl
- Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
- Gerhard Tutz
- Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
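The truncated stick-breaking construction mentioned in the abstract of this entry can be illustrated with a short sketch. This is only a minimal illustration of how truncated Dirichlet process mixture weights are drawn, not the authors' EM algorithm; the function name `truncated_stick_breaking` and the parameter names `alpha` (concentration) and `num_components` (truncation level) are hypothetical and chosen here for readability.

```python
import random


def truncated_stick_breaking(alpha, num_components, rng=None):
    """Draw mixture weights from a stick-breaking construction
    truncated at num_components (illustrative helper, not from the paper)."""
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(num_components - 1):
        # Break off a Beta(1, alpha) fraction of what is left of the stick.
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # The last component absorbs the leftover stick so the weights sum to 1.
    weights.append(remaining)
    return weights


# Example draw with an assumed concentration of 1.0 and 10 components.
weights = truncated_stick_breaking(alpha=1.0, num_components=10,
                                   rng=random.Random(0))
```

Smaller values of `alpha` tend to put most of the mass on the first few components, which is what lets a truncated mixture of this kind behave like a model with a small, data-driven number of clusters.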