1
Seemingly unrelated clusterwise linear regression for contaminated data. Stat Pap (Berl) 2022. [DOI: 10.1007/s00362-022-01344-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Clusterwise regression is an approach to regression analysis based on finite mixtures which is generally employed when sample observations come from a population composed of several unknown sub-populations. Whenever the response is continuous, Gaussian clusterwise linear regression models are usually employed. Such models have been recently robustified with respect to the possible presence of mild outliers in the sub-populations. However, in some fields of research, especially in the modelling of multivariate economic data or data from the social sciences, there may be prior information on the specific covariates to be considered in the linear term employed in the prediction of a certain response. As a consequence, covariates may not be the same for all responses. Thus, a novel class of multivariate Gaussian linear clusterwise regression models is proposed. This class provides an extension to mixture-based regression analysis for modelling multivariate and correlated responses in the presence of mild outliers that leaves the researcher free to use a different vector of covariates for each response. Details about the model identification and maximum likelihood estimation via an expectation-conditional maximisation algorithm are given. The performance of the new models is studied by simulation in comparison with other clusterwise linear regression models. A comparative evaluation of their effectiveness and usefulness is provided through the analysis of a real dataset.
2
Wang T, Yu L, Leurgans SE, Wilson RS, Bennett DA, Boyle PA. Conditional functional clustering for longitudinal data with heterogeneous nonlinear patterns. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Tianhao Wang: Rush Alzheimer’s Disease Center, Rush University Medical Center
- Lei Yu: Rush Alzheimer’s Disease Center, Rush University Medical Center
- Sue E. Leurgans: Rush Alzheimer’s Disease Center, Rush University Medical Center
3
Mou X, Zhang H, Arshad SH. Identifying intergenerational patterns of correlated methylation sites. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Xichen Mou: Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis
- Hongmei Zhang: Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis
4
Zhang T, Lin G. Generalized k-means in GLMs with applications to the outbreak of COVID-19 in the United States. Comput Stat Data Anal 2021; 159:107217. [PMID: 33723467 PMCID: PMC7943386 DOI: 10.1016/j.csda.2021.107217] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/09/2020] [Revised: 02/28/2021] [Accepted: 02/28/2021] [Indexed: 11/30/2022]
Abstract
Generalized k-means can be combined with any similarity or dissimilarity measure for clustering. Using the well-known likelihood ratio or F-statistic as the dissimilarity measure, a generalized k-means method is proposed to group generalized linear models (GLMs) for exponential family distributions. Given the number of clusters k, the proposed method is established by the uniform most powerful unbiased (UMPU) test statistic for the comparison between GLMs. If k is unknown, then the proposed method can be combined with the generalized information criterion (GIC) to automatically select the best k for clustering. Both AIC and BIC are investigated as special cases of GIC. Theoretical and simulation results show that the number of clusters can be correctly identified by BIC but not AIC. The proposed method is applied to the state-level daily COVID-19 data in the United States, and it identifies 6 clusters. A further study shows that the models between clusters are significantly different from each other, which confirms the result with 6 clusters.
Affiliation(s)
- Tonglin Zhang: Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066, USA
- Ge Lin: Department of Environmental and Occupational Health, University of Nevada Las Vegas, Las Vegas, NV 89154, USA
5
Novoa A, Richardson DM, Pyšek P, Meyerson LA, Bacher S, Canavan S, Catford JA, Čuda J, Essl F, Foxcroft LC, Genovesi P, Hirsch H, Hui C, Jackson MC, Kueffer C, Le Roux JJ, Measey J, Mohanty NP, Moodley D, Müller-Schärer H, Packer JG, Pergl J, Robinson TB, Saul WC, Shackleton RT, Visser V, Weyl OLF, Yannelli FA, Wilson JRU. Invasion syndromes: a systematic approach for predicting biological invasions and facilitating effective management. Biol Invasions 2020. [DOI: 10.1007/s10530-020-02220-w] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Indexed: 01/13/2023]
Abstract
Our ability to predict invasions has been hindered by the seemingly idiosyncratic context-dependency of individual invasions. However, we argue that robust and useful generalisations in invasion science can be made by considering “invasion syndromes” which we define as “a combination of pathways, alien species traits, and characteristics of the recipient ecosystem which collectively result in predictable dynamics and impacts, and that can be managed effectively using specific policy and management actions”. We describe this approach and outline examples that highlight its utility, including: cacti with clonal fragmentation in arid ecosystems; small aquatic organisms introduced through ballast water in harbours; large ranid frogs with frequent secondary transfers; piscivorous freshwater fishes in connected aquatic ecosystems; plant invasions in high-elevation areas; tall-statured grasses; and tree-feeding insects in forests with suitable hosts. We propose a systematic method for identifying and delimiting invasion syndromes. We argue that invasion syndromes can account for the context-dependency of biological invasions while incorporating insights from comparative studies. Adopting this approach will help to structure thinking, identify transferrable risk assessment and management lessons, and highlight similarities among events that were previously considered disparate invasion phenomena.
6
Han S, Zhang H, Sheng W, Arshad H. The nested joint clustering via Dirichlet process mixture model. J Stat Comput Sim 2019; 89:815-830. [DOI: 10.1080/00949655.2019.1572756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Shengtong Han: Joseph J. Zilber School of Public Health, University of Wisconsin, Milwaukee, WI, USA
- Hongmei Zhang: School of Public Health, University of Memphis, Memphis, TN, USA
- Wenhui Sheng: Department of Mathematics, Statistics and Computer Science, Marquette University, Milwaukee, WI, USA
- Hasan Arshad: Allergy and Clinical Immunology, Clinical and Experimental Sciences, University of Southampton, Southampton, UK
7
8
9
Han S, Zhang H, Karmaus W, Roberts G, Arshad H. Adjusting background noise in cluster analyses of longitudinal data. Comput Stat Data Anal 2017; 109:93-104. [PMID: 28603324 PMCID: PMC5464744 DOI: 10.1016/j.csda.2016.11.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Background noise in cluster analyses can potentially mask the true underlying patterns. To tease out patterns unique to certain populations, a Bayesian semi-parametric clustering method is presented. It infers and adjusts background noise. The method is built upon a mixture of the Dirichlet process and a point mass function. Simulations demonstrate the effectiveness of the proposed method. The method is then applied to analyze a longitudinal data set on allergic sensitization and asthma status.
Affiliation(s)
- Shengtong Han: School of Public Health, University of Memphis, Memphis, TN
- Hongmei Zhang: School of Public Health, University of Memphis, Memphis, TN
- Graham Roberts: Paediatric Allergy and Respiratory Medicine, University of Southampton, Southampton, UK
- Hasan Arshad: Allergy and Clinical Immunology, Clinical and Experimental Sciences, University of Southampton, Southampton, UK
10
Han S, Zhang H, Lockett GA, Mukherjee N, Holloway JW, Karmaus W. Identifying heterogeneous transgenerational DNA methylation sites via clustering in beta regression. Ann Appl Stat 2015. [DOI: 10.1214/15-aoas865] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
11
Yang S, Cui X, Fang Z. BCRgt: a Bayesian cluster regression-based genotyping algorithm for the samples with copy number alterations. BMC Bioinformatics 2014; 15:74. [PMID: 24629125 PMCID: PMC4003822 DOI: 10.1186/1471-2105-15-74] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2013] [Accepted: 03/10/2014] [Indexed: 11/17/2022] Open
Abstract
Background: Accurate genotype calling is a pre-requisite of a successful Genome-Wide Association Study (GWAS). Although most genotyping algorithms can achieve an accuracy rate greater than 99% for genotyping DNA samples without copy number alterations (CNAs), almost all of these algorithms are not designed for genotyping tumor samples that are known to have large regions of CNAs. Results: This study aims to develop a statistical method that can accurately genotype tumor samples with CNAs. The proposed method adds a Bayesian layer to a cluster regression model and is termed a Bayesian Cluster Regression-based genotyping algorithm (BCRgt). We demonstrate that high concordance rates with HapMap calls can be achieved without using reference/training samples, when CNAs do not exist. By adding a training step, we have obtained higher genotyping concordance rates, without requiring large sample sizes. When CNAs exist in the samples, accuracy can be dramatically improved in regions with DNA copy loss and slightly improved in regions with copy number gain, compared with the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM). Conclusions: We have demonstrated that BCRgt can provide accurate genotyping calls for tumor samples with CNAs.
12
13
Ng SK, McLachlan GJ, Wang K, Nagymanyoki Z, Liu S, Ng SW. Inference on differences between classes using cluster-specific contrasts of mixed effects. Biostatistics 2014; 16:98-112. [PMID: 24963011 DOI: 10.1093/biostatistics/kxu028] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022] Open
Abstract
The detection of differentially expressed (DE) genes, that is, genes whose expression levels vary between two or more classes representing different experimental conditions (say, diseases), is one of the most commonly studied problems in bioinformatics. For example, the identification of DE genes between distinct disease phenotypes is an important first step in understanding and developing treatment drugs for the disease. We present a novel approach to the problem of detecting DE genes that is based on a test statistic formed as a weighted (normalized) cluster-specific contrast in the mixed effects of the mixture model used in the first instance to cluster the gene profiles into a manageable number of clusters. The key factor in the formation of our test statistic is the use of gene-specific mixed effects in the cluster-specific contrast. It thus means that the (soft) assignment of a given gene to a cluster is not crucial. This is because in addition to class differences between the (estimated) fixed effects terms for a cluster, gene-specific class differences also contribute to the cluster-specific contributions to the final form of the test statistic. The proposed test statistic can be used where the primary aim is to rank the genes in order of evidence against the null hypothesis of no DE. We also show how a P-value can be calculated for each gene for use in multiple hypothesis testing where the intent is to control the false discovery rate (FDR) at some desired level. With the use of publicly available and simulated datasets, we show that the proposed contrast-based approach outperforms other methods commonly used for the detection of DE genes both in a ranking context with lower proportion of false discoveries and in a multiple hypothesis testing context with higher power for a specified level of the FDR.
Affiliation(s)
- Shu Kay Ng: School of Medicine, Griffith Health Institute, Griffith University, Meadowbrook, QLD 4131, Australia
- Geoffrey J McLachlan: Department of Mathematics, University of Queensland, Brisbane, QLD 4072, Australia
- Kui Wang: Department of Mathematics, University of Queensland, Brisbane, QLD 4072, Australia
- Zoltan Nagymanyoki: Laboratory of Gynecologic Oncology, Department of Obstetrics, Gynecology and Reproductive Biology, Brigham and Women's Hospital, Boston, MA 02115, USA
- Shubai Liu: Laboratory of Gynecologic Oncology, Department of Obstetrics, Gynecology and Reproductive Biology, Brigham and Women's Hospital, Boston, MA 02115, USA
- Shu-Wing Ng: Laboratory of Gynecologic Oncology, Department of Obstetrics, Gynecology and Reproductive Biology, Brigham and Women's Hospital, Boston, MA 02115, USA
14
Kim HJ, Luo J, Kim J, Chen HS, Feuer EJ. Clustering of trend data using joinpoint regression models. Stat Med 2014; 33:4087-103. [PMID: 24895073 DOI: 10.1002/sim.6221] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 05/28/2013] [Revised: 03/06/2014] [Accepted: 05/07/2014] [Indexed: 11/11/2022]
Abstract
In this paper, we propose methods to cluster groups of two-dimensional data whose mean functions are piecewise linear into several clusters with common characteristics such as the same slopes. To fit segmented line regression models with common features for each possible cluster, we use a restricted least squares method. In implementing the restricted least squares method, we estimate the maximum number of segments in each cluster by using both the permutation test method and the Bayes information criterion method and then propose to use the Bayes information criterion to determine the number of clusters. For a more effective implementation of the clustering algorithm, we propose a measure of the minimum distance worth detecting and illustrate its use in two examples. We summarize simulation results to study properties of the proposed methods and also prove the consistency of the cluster grouping estimated with a given number of clusters. The presentation and examples in this paper focus on the segmented line regression model with the ordered values of the independent variable, which has been the model of interest in cancer trend analysis, but the proposed method can be applied to a general model with design points either ordered or unordered.
Affiliation(s)
- Hyune-Ju Kim: Department of Mathematics, Syracuse University, Syracuse, NY, 13244, U.S.A
15
Coffey N, Hinde J, Holian E. Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data. Comput Stat Data Anal 2014. [DOI: 10.1016/j.csda.2013.04.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/26/2022]
16
Eng KH, Hanlon BM. Discrete mixture modeling to address genetic heterogeneity in time-to-event regression. Bioinformatics 2014; 30:1690-7. [PMID: 24532723 PMCID: PMC4058947 DOI: 10.1093/bioinformatics/btu065] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 01/21/2023]
Abstract
Motivation: Time-to-event regression models are a critical tool for associating survival time outcomes with molecular data. Despite mounting evidence that genetic subgroups of the same clinical disease exist, little attention has been given to exploring how this heterogeneity affects time-to-event model building and how to accommodate it. Methods able to diagnose and model heterogeneity should be valuable additions to the biomarker discovery toolset. Results: We propose a mixture of survival functions that classifies subjects with similar relationships to a time-to-event response. This model incorporates multivariate regression and model selection and can be fit with an expectation maximization algorithm, which we call Cox-assisted clustering (CAC). We illustrate a likely manifestation of genetic heterogeneity and demonstrate how it may affect survival models with little warning. An application to gene expression in ovarian cancer DNA repair pathways illustrates how the model may be used to learn new genetic subsets for risk stratification. We explore the implications of this model for censored observations and the effect on genomic predictors and diagnostic analysis. Availability and implementation: An R implementation of CAC using standard packages is available at https://gist.github.com/programeng/8620b85146b14b6edf8f. Data used in the analysis are publicly available.
Affiliation(s)
- Kevin H Eng: Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Elm and Carlton Streets, Buffalo, NY 14263, USA and Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53705, USA
- Bret M Hanlon: Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Elm and Carlton Streets, Buffalo, NY 14263, USA and Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53705, USA
17
Qin LX, Breeden L, Self SG. Finding gene clusters for a replicated time course study. BMC Res Notes 2014; 7:60. [PMID: 24460656 PMCID: PMC3906880 DOI: 10.1186/1756-0500-7-60] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/25/2013] [Accepted: 01/15/2014] [Indexed: 11/12/2022] Open
Abstract
Background: Finding genes that share similar expression patterns across samples is an important question that is frequently asked in high-throughput microarray studies. Traditional clustering algorithms such as K-means clustering and hierarchical clustering base gene clustering directly on the observed measurements and do not take into account the specific experimental design under which the microarray data were collected. A new model-based clustering method, the clustering of regression models method, takes into account the specific design of the microarray study and bases the clustering on how genes are related to sample covariates. It can find useful gene clusters for studies from complicated study designs such as replicated time course studies. Findings: In this paper, we applied the clustering of regression models method to data from a time course study of yeast on two genotypes, wild type and YOX1 mutant, each with two technical replicates, and compared the clustering results with K-means clustering. We identified gene clusters that have similar expression patterns in wild type yeast, two of which were missed by K-means clustering. We further identified gene clusters whose expression patterns were changed in YOX1 mutant yeast compared to wild type yeast. Conclusions: The clustering of regression models method can be a valuable tool for identifying genes that are coordinately transcribed by a common mechanism.
Affiliation(s)
- Li-Xuan Qin: Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA.
18
Shi J, Qin LX. CORM: An R Package Implementing the Clustering of Regression Models Method for Gene Clustering. Cancer Inform 2014; 13:11-3. [PMID: 25452684 PMCID: PMC4218679 DOI: 10.4137/cin.s13967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2014] [Revised: 07/21/2014] [Accepted: 07/21/2014] [Indexed: 11/05/2022] Open
Abstract
We report a new R package implementing the clustering of regression models (CORM) method for clustering genes using gene expression data and provide data examples illustrating each clustering function in the package. The CORM package is freely available at CRAN from http://cran.r-project.org.
Affiliation(s)
- Jiejun Shi: Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Li-Xuan Qin: Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
19
Komárek A, Komárková L. Clustering for multivariate continuous and discrete longitudinal data. Ann Appl Stat 2013. [DOI: 10.1214/12-aoas580] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
20
Robinson JF, Piersma AH. Toxicogenomic approaches in developmental toxicology testing. Methods Mol Biol 2013; 947:451-73. [PMID: 23138921 DOI: 10.1007/978-1-62703-131-8_31] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 12/17/2022]
Abstract
The emergence of toxicogenomic applications provides new tools to characterize, classify, and potentially predict teratogens. However, due to the vast number of experimental and statistical procedural steps, toxicogenomic studies are challenging. Here, we guide researchers through the basic framework of conducting toxicogenomic investigations in the field of developmental toxicology, providing examples of biological and technical factors that may influence response and interpretation. Furthermore, we review current, diverse applications of toxicogenomic-based approaches in teratology testing, including exposure-response characterization (dose and duration), chemical classification studies, and cross-model comparison study designs. This review is intended to guide scientists through the challenging and complex structure of conducting toxicogenomic analyses, while considering the many applications of using toxicogenomics in study designs and the future of these types of "omics" approaches in developmental toxicology.
Affiliation(s)
- Joshua F Robinson: Laboratory for Health Protection Research-National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands.
21
Wang K, Ng SK, McLachlan GJ. Clustering of time-course gene expression profiles using normal mixture models with autoregressive random effects. BMC Bioinformatics 2012; 13:300. [PMID: 23151154 PMCID: PMC3574839 DOI: 10.1186/1471-2105-13-300] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 01/30/2012] [Accepted: 11/07/2012] [Indexed: 11/26/2022] Open
Abstract
Background: Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expressions have been found inadequate to model the complexity of the time-course data, partly due to their ignoring the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Some simulations and real examples are given to demonstrate the usefulness of the proposed models. Results: We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models to provide more reliable and robust clustering of time-course data. Our model provides superior results when genetic profiles are correlated. It also gives comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of coregulated genes are obtained, which are supported by gene-function annotation databases. Conclusions: Our new model under our extension of the EMMIX-WIRE procedure is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters that enable improved modelling and consequent clustering of time-course data.
Affiliation(s)
- Kui Wang: Department of Mathematics, University of Queensland, Brisbane, QLD 4072, Australia
22
Tarpey T, Petkova E, Lu Y, Govindarajulu U. Optimal Partitioning for Linear Mixed Effects Models: Applications to Identifying Placebo Responders. J Am Stat Assoc 2012; 105:968-977. [PMID: 21494314 DOI: 10.1198/jasa.2010.ap08713] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
Abstract
A long-standing problem in clinical research is distinguishing drug-treated subjects that respond due to specific effects of the drug from those that respond to non-specific (or placebo) effects of the treatment. Linear mixed effect models are commonly used to model longitudinal clinical trial data. In this paper we present a solution to the problem of identifying placebo responders using an optimal partitioning methodology for linear mixed effects models. Since individual outcomes in a longitudinal study correspond to curves, the optimal partitioning methodology produces a set of prototypical outcome profiles. The optimal partitioning methodology can accommodate both continuous and discrete covariates. The proposed partitioning strategy is compared and contrasted with the growth mixture modelling approach. The methodology is applied to a two-phase depression clinical trial where subjects in a first phase were treated openly for 12 weeks with fluoxetine followed by a double-blind discontinuation phase where responders to treatment in the first phase were randomized to either stay on fluoxetine or switch to a placebo. The optimal partitioning methodology is applied to the first phase to identify prototypical outcome profiles. Using time to relapse in the second phase of the study, a survival analysis is performed on the partitioned data. The optimal partitioning results identify prototypical profiles that distinguish whether subjects relapse depending on whether or not they stay on the drug or are randomized to a placebo.
Affiliation(s)
- Thaddeus Tarpey: Professor in the Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435
23
Blackstock AJ, Manatunga AK, Park Y, Jones DP, Yu T. Clustering Based on Periodicity in High-Throughput Time Course Data. Stat Anal Data Min 2011; 4:579-589. [PMID: 23762213 DOI: 10.1002/sam.10137] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/04/2023]
Abstract
Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. Methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the parameter estimate should be accounted for in the clustering procedure.
Affiliation(s)
- Anna J Blackstock: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
24
Villarroel L, Marshall G, Barón AE. Cluster analysis using multivariate mixed effects models. Stat Med 2009; 28:2552-65. [PMID: 19536743 DOI: 10.1002/sim.3632] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023]
Abstract
A common situation in the biological and social sciences is to have data on one or more variables measured longitudinally on a sample of individuals. A problem of growing interest in these areas is the grouping of individuals into one of two or more clusters according to their longitudinal behavior. Recently, methods have been proposed to deal with cases where individuals are classified into clusters through a linear model of mixed univariate effects deriving from a longitudinally measured variable. The method proposed in the current work deals with the case of clustering and then classification based on two or more variables measured longitudinally, through the fitting of non-linear multivariate mixed effect models, and with consideration given to parameter estimation for balanced and unbalanced data using an EM algorithm. The application of the method is illustrated with an example in which the clusters are identified and the classification into clusters is compared with the true membership of individuals in one of two groups, which is known at the end of the follow-up period.
Affiliation(s)
- Luis Villarroel
- Departamento de Salud Publica, Facultad de Medicina, Pontificia Universidad Catolica de Chile, Santiago, Chile.
|
25
|
Li L, Lu Y, Qin LX, Bar-Joseph Z, Werner-Washburne M, Breeden LL. Budding yeast SSD1-V regulates transcript levels of many longevity genes and extends chronological life span in purified quiescent cells. Mol Biol Cell 2009; 20:3851-64. [PMID: 19570907 DOI: 10.1091/mbc.e09-04-0347] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 11/11/2022] Open
Abstract
Ssd1 is an RNA-binding protein that affects literally hundreds of different processes and is polymorphic in both wild and lab yeast strains. We have used transcript microarrays to compare mRNA levels in an isogenic pair of mutant (ssd1-d) and wild-type (SSD1-V) cells across the cell cycle. We find that 15% of transcripts are differentially expressed, but there is no correlation with those mRNAs bound by Ssd1. About 20% of cell cycle regulated transcripts are affected, and most show sharper amplitudes of oscillation in SSD1-V cells. Many transcripts whose gene products influence longevity are also affected, the largest class of which is involved in translation. Ribosomal protein mRNAs are globally down-regulated by SSD1-V. SSD1-V has been shown to increase replicative life span, and we show that SSD1-V also dramatically increases chronological life span (CLS). Using a new assay of CLS in pure populations of quiescent prototrophs, we find that the CLS for SSD1-V cells is twice that of ssd1-d cells.
Affiliation(s)
- Lihong Li
- Fred Hutchinson Cancer Research Center, Basic Sciences Division, Seattle, WA 98109, USA
|
26
|
Yuan Y, Li CT, Wilson R. Partial mixture model for tight clustering of gene expression time-course. BMC Bioinformatics 2008; 9:287. [PMID: 18564420 PMCID: PMC2492882 DOI: 10.1186/1471-2105-9-287] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 09/28/2007] [Accepted: 06/18/2008] [Indexed: 11/29/2022] Open
Abstract
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm demonstrates the suitability of combining the partial mixture model and the minimum distance estimator in this field. We show that tight clustering is not only capable of generating a deeper understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
Affiliation(s)
- Yinyin Yuan
- Department of Computer Science, University of Warwick, Coventry, UK.
|
27
|
Abstract
BACKGROUND: MicroRNAs are believed to play an important role in gene expression regulation. They have been shown to be involved in cell cycle regulation and cancer. MicroRNA expression profiling became available owing to recent technological advances. In some studies, both microRNA expression and mRNA expression are measured, which allows an integrated analysis of microRNA and mRNA expression. RESULTS: We demonstrated three aspects of an integrated analysis of microRNA and mRNA expression, through a case study of human cancer data. We showed that (1) microRNA expression efficiently sorts tumors from normal tissues regardless of tumor type, while gene expression does not; (2) many microRNAs are down-regulated in tumors and these microRNAs can be clustered in two ways: microRNAs similarly affected by cancer and microRNAs similarly interacting with genes; (3) taking let-7f as an example, target genes can be identified and they can be clustered based on their relationship with let-7f expression. DISCUSSION: Our findings in this paper were made using novel applications of existing statistical methods: hierarchical clustering was applied with a new distance measure, the co-clustering frequency, to identify sample clusters that are stable; microRNA-gene correlation profiles were subjected to hierarchical clustering to identify microRNAs that similarly interact with genes and hence are likely functionally related; the clustering of regression models method was applied to identify microRNAs similarly related to cancer while adjusting for tissue type and genes similarly related to microRNA while adjusting for disease status. These analytic methods are applicable to interrogate multiple types of -omics data in general.
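One plausible reading of the co-clustering frequency mentioned in this abstract is sketched below: given several cluster labelings of the same samples (for example, from repeated or resampled clustering runs), the frequency with which two samples share a cluster is converted into a distance as one minus that frequency. This definition, the function name, and the toy labelings are assumptions for illustration only; the abstract does not state how the authors compute the measure.

```python
def co_clustering_distance(labelings):
    """Return an n x n distance matrix from several labelings of n samples.

    distance[i][j] = 1 - (fraction of labelings in which samples i and j
    receive the same cluster label). Label values need not match across runs;
    only within-run equality is used.
    """
    n = len(labelings[0])
    runs = len(labelings)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for lab in labelings if lab[i] == lab[j])
            d = 1.0 - together / runs
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical labelings of four samples from four clustering runs.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 2, 2],
]
dist = co_clustering_distance(labelings)
```

A matrix of this kind could then be passed to any hierarchical clustering routine that accepts precomputed distances; pairs that co-cluster in most runs end up close together, which is one way to favor stable clusters.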
Affiliation(s)
- Li-Xuan Qin
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York, USA.
|
28
|
Testing the significance of cell-cycle patterns in time-course microarray data using nonparametric quadratic inference functions. Comput Stat Data Anal 2008. [DOI: 10.1016/j.csda.2007.03.018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
|
29
|
Turner HL, Bailey TC, Krzanowski WJ, Hemingway CA. Biclustering models for structured microarray data. IEEE/ACM Trans Comput Biol Bioinform 2005; 2:316-29. [PMID: 17044169 DOI: 10.1109/tcbb.2005.49] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 05/12/2023]
Abstract
Microarrays have become a standard tool for investigating gene function and more complex microarray experiments are increasingly being conducted. For example, an experiment may involve samples from several groups or may investigate changes in gene expression over time for several subjects, leading to large three-way data sets. In response to this increase in data complexity, we propose some extensions to the plaid model, a biclustering method developed for the analysis of gene expression data. This model-based method lends itself to the incorporation of any additional structure such as external grouping or repeated measures. We describe how the extended models may be fitted and illustrate their use on real data.
Affiliation(s)
- Heather L Turner
- Department of Mathematical Sciences, University of Exeter, Laver Building, North Park Rd., Exeter, Devon EX4 4QE, UK.
|