1
Rastaghi S, Saki A, Tabesh H. Modifying the false discovery rate procedure based on the information theory under arbitrary correlation structure and its performance in high-dimensional genomic data. BMC Bioinformatics 2024; 25:57. [PMID: 38317067 PMCID: PMC10840263 DOI: 10.1186/s12859-024-05678-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/20/2023] [Accepted: 01/26/2024] [Indexed: 02/07/2024] Open
Abstract
BACKGROUND: Controlling the False Discovery Rate (FDR) in Multiple Comparison Procedures (MCPs) has widespread applications in many scientific fields. Previous studies show that the correlation structure between test statistics increases the variance and bias of FDR. The objective of this study is to modify the effect of correlation in MCPs based on the information theory. We proposed three modified procedures (M1, M2, and M3) under strong, moderate, and mild assumptions based on the conditional Fisher Information of the consecutive sorted test statistics for controlling the false discovery rate under arbitrary correlation structure. The performance of the proposed procedures was compared with the Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) procedures in simulation study and real high-dimensional data of colorectal cancer gene expressions. In the simulation study, we generated 1000 differential multivariate Gaussian features with different levels of the correlation structure and screened the significance features by the FDR controlling procedures, with strong control on the Family Wise Error Rates.
RESULTS: When there was no correlation between 1000 simulated features, the performance of the BH procedure was similar to the three proposed procedures. In low to medium correlation structures the BY procedure is too conservative. The BH procedure is too liberal, and the mean number of screened features was constant at the different levels of the correlation between features. The mean number of screened features by proposed procedures was between BY and BH procedures and reduced when the correlations increased. Where the features are highly correlated the number of screened features by proposed procedures reached the Bonferroni (BF) procedure, as expected. In real data analysis the BY, BH, M1, M2, and M3 procedures were done to screen gene expressions of colorectal cancer. To fit a predictive model based on the screened features the Efficient Bayesian Logistic Regression (EBLR) model was used. The fitted EBLR models based on the screened features by M1 and M2 procedures have minimum entropies and are more efficient than BY and BH procedures.
CONCLUSION: The modified proposed procedures based on information theory, are much more flexible than BH and BY procedures for the amount of correlation between test statistics. The modified procedures avoided screening the non-informative features and so the number of screened features reduced with the increase in the level of correlation.
Affiliation(s)
- Sedighe Rastaghi
  - Department of Epidemiology and Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad, Iran
- Azadeh Saki
  - Department of Epidemiology and Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad, Iran
- Hamed Tabesh
  - Department of Medical Informatics, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
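For readers unfamiliar with the two baseline procedures named in the abstract above, the following is a minimal sketch of the standard Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) step-up rules. It does not implement the paper's M1, M2, or M3 procedures; the function name `step_up_rejections` and its parameters are illustrative assumptions.

```python
def step_up_rejections(pvals, alpha=0.05, method="bh"):
    """Return a list of booleans marking the hypotheses rejected by BH or BY.

    Hypothetical helper for illustration: 'bh' compares the i-th smallest
    p-value with alpha * i / m, while 'by' further divides that threshold
    by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m.
    """
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1)) if method == "by" else 1.0
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value meets its threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / (m * c_m):
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(step_up_rejections(pvals, 0.05, "bh")))  # number of BH rejections
print(sum(step_up_rejections(pvals, 0.05, "by")))  # number of BY rejections
```

Because BY shrinks every threshold by c(m), it can never reject more hypotheses than BH at the same level, which is consistent with the abstract's description of BY as the more conservative of the two.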
2
Hahn G, Novak T, Crawford JC, Randolph AG, Lange C. Longitudinal Analysis of Contrasts in Gene Expression Data. Genes (Basel) 2023; 14:1134. [PMID: 37372314 DOI: 10.3390/genes14061134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2023] [Revised: 05/19/2023] [Accepted: 05/21/2023] [Indexed: 06/29/2023] Open
Abstract
We are interested in detecting a departure from the baseline in a longitudinal analysis in the context of multiple organ dysfunction syndrome (MODS). In particular, we are given gene expression reads at two time points for a fixed number of genes and individuals. The individuals can be subdivided into two groups, denoted as groups A and B. Using the two time points, we compute a contrast of gene expression reads per individual and gene. The age of each individual is known and it is used to compute, for each gene separately, a linear regression of the gene expression contrasts on the individual's age. Looking at the intercept of the linear regression to detect a departure from the baseline, we aim to reliably single out those genes for which there is a difference in the intercept among those individuals in group A and not in group B. In this work, we develop testing methodology for this setting based on two hypothesis tests-one under the null and one under an appropriately formulated alternative. We demonstrate the validity of our approach using a dataset created by bootstrapping from a real data application in the context of multiple organ dysfunction syndrome (MODS).
Affiliation(s)
- Georg Hahn
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Tanya Novak
  - Critical Care Medicine, Department of Anesthesiology, Boston Children's Hospital, Boston, MA 02115, USA
- Adrienne G Randolph
  - Critical Care Medicine, Department of Anesthesiology, Boston Children's Hospital, Boston, MA 02115, USA
- Christoph Lange
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
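The per-gene regression step described in the abstract above (contrast regressed on age, with interest in the intercept) can be sketched with ordinary least squares. This is only an illustration of that single step under the usual homoscedastic-error assumption; it is not the authors' two-test methodology, and the function name `intercept_t_stat` and the toy data are hypothetical.

```python
from math import sqrt


def intercept_t_stat(age, contrast):
    """Fit contrast = b0 + b1 * age by OLS; return (b0, t statistic of b0).

    Hypothetical helper: 'age' and 'contrast' are equal-length sequences
    for one gene, with at least three individuals and non-constant ages.
    """
    n = len(age)
    mean_x = sum(age) / n
    mean_y = sum(contrast) / n
    sxx = sum((a - mean_x) ** 2 for a in age)
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(age, contrast))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom
    rss = sum((c - intercept - slope * a) ** 2 for a, c in zip(age, contrast))
    sigma2 = rss / (n - 2)
    # Standard error of the intercept in simple linear regression
    se_intercept = sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, intercept / se_intercept


b0, t0 = intercept_t_stat([1.0, 2.0, 3.0, 4.0], [2.6, 2.9, 3.6, 3.9])
print(b0, t0)
```

Under the stated assumptions the t statistic would be referred to a t distribution with n - 2 degrees of freedom; comparing intercepts between groups A and B would require fitting this separately, or jointly with a group term, which is left out of the sketch.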
3
Sarkar SK, Zhao Z. Local false discovery rate based methods for multiple testing of one-way classified hypotheses. Electron J Stat 2022. [DOI: 10.1214/22-ejs2080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Affiliation(s)
- Sanat K. Sarkar
  - Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA, 19122, USA
- Zhigen Zhao
  - Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA, 19122, USA
4
Du L, Guo X, Sun W, Zou C. False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1945459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Lilun Du
  - Department of ISOM, Hong Kong University of Science and Technology, ISOM, Kowloon, Hong Kong
- Xu Guo
  - Department of Mathematical Statistics, Beijing Normal University, Beijing, China
- Wenguang Sun
  - Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Changliang Zou
  - Department of Statistics and Data Sciences, Nankai University, Tianjin, China
5
Tian T, Cheng R, Wei Z. An empirical Bayes change-point model for transcriptome time-course data. Ann Appl Stat 2021. [DOI: 10.1214/20-aoas1403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Tian Tian
  - Department of Computer Science, New Jersey Institute of Technology
- Ruihua Cheng
  - Big Data Statistics Research Center, Tianjin University of Finance and Economics
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology
6
Fu L, Gang B, James GM, Sun W. Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1840992] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Luella Fu
  - Department of Mathematics, San Francisco State University, San Francisco, CA
- Bowen Gang
  - Department of Statistics, Fudan University, Shanghai, China
- Gareth M. James
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Wenguang Sun
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
7
Heller R, Rosset S. Optimal control of false discovery criteria in the two‐group model. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12403] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
Affiliation(s)
- Ruth Heller
  - Department of Statistics and Operations Research, Tel‐Aviv University, Tel‐Aviv, Israel
- Saharon Rosset
  - Department of Statistics and Operations Research, Tel‐Aviv University, Tel‐Aviv, Israel
8
9
Zhao H, Cui X. Constructing confidence intervals for selected parameters. Biometrics 2020; 76:1098-1108. [PMID: 31975369 DOI: 10.1111/biom.13222] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/22/2017] [Revised: 12/28/2019] [Accepted: 01/08/2020] [Indexed: 11/27/2022]
Abstract
In large-scale problems, it is common practice to select important parameters by a procedure such as the Benjamini and Hochberg procedure and construct confidence intervals (CIs) for further investigation while the false coverage-statement rate (FCR) for the CIs is controlled at a desired level. Although the well-known BY CIs control the FCR, they are uniformly inflated. In this paper, we propose two methods to construct shorter selective CIs. The first method produces shorter CIs by allowing a reduced number of selective CIs. The second method produces shorter CIs by allowing a prefixed proportion of CIs containing the values of uninteresting parameters. We theoretically prove that the proposed CIs are uniformly shorter than BY CIs and control the FCR asymptotically for independent data. Numerical results confirm our theoretical results and show that the proposed CIs still work for correlated data. We illustrate the advantage of the proposed procedures by analyzing the microarray data from a HIV study.
Affiliation(s)
- Haibing Zhao
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Xinping Cui
  - Department of Statistics, Center for Plant Cell Biology and Institute for Integrative Genome Biology, University of California, Riverside, CA
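The "BY CIs" that the abstract above uses as its baseline are the FCR-adjusted intervals of Benjamini and Yekutieli: after selecting R of m parameters, each selected parameter receives a marginal interval at level 1 - R*q/m. The sketch below illustrates only that baseline, assuming normally distributed estimates with known standard errors; it is not the shorter intervals proposed in the paper, and the function name and inputs are hypothetical.

```python
from statistics import NormalDist


def fcr_adjusted_intervals(estimates, std_errors, selected, q=0.05):
    """Return {index: (lower, upper)} for the selected parameters.

    Hypothetical helper: 'selected' holds the indices chosen by some
    selection rule (for example BH). Each interval uses marginal coverage
    1 - R*q/m, where R = len(selected) and m = len(estimates).
    """
    m = len(estimates)
    r = len(selected)
    if r == 0:
        return {}
    # Two-sided interval: split the adjusted error R*q/m across both tails
    tail = r * q / (2.0 * m)
    z = NormalDist().inv_cdf(1.0 - tail)
    return {
        i: (estimates[i] - z * std_errors[i], estimates[i] + z * std_errors[i])
        for i in selected
    }


est = [3.1, 0.2, -2.8, 0.1]
se = [1.0, 1.0, 1.0, 1.0]
print(fcr_adjusted_intervals(est, se, selected=[0, 2], q=0.05))
```

The fewer parameters are selected relative to m, the smaller R*q/m becomes and the wider each interval gets, which is the inflation the abstract refers to.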
10
Banerjee T, Mukherjee G, Sun W. Adaptive Sparse Estimation With Side Information. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1679639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Trambak Banerjee
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Gourab Mukherjee
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Wenguang Sun
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
11
Xia Y, Cai TT, Sun W. GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1611585] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
Affiliation(s)
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai, China
- T. Tony Cai
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
- Wenguang Sun
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
12
Bhattacharjee A, Vishwakarma GK. Time-course data prediction for repeatedly measured gene expression. Int J Biomath 2019. [DOI: 10.1142/s1793524519500335] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022]
Abstract
Variability in time course gene expression data is a natural phenomenon. The intention of this work is to predict the future time point data through observed sample data point. The Bayesian inference is carried to serve the objective. A total of 6 replicates 3 time point’s data of 218 genes expression is adopted to illustrate the method. The estimates are found consistent with HPD interval to predict the future time point gene expression value. This proposed method can be adopted in other gene expression data setup to predict the future time course data.
Affiliation(s)
- Atanu Bhattacharjee
  - Section of Biostatistics, Centre for Cancer Epidemiology, Tata Memorial Centre, Navi Mumbai 410210, India
- Gajendra K. Vishwakarma
  - Department of Applied Mathematics, Indian Institute of Technology (ISM), Dhanbad-826004, India
13
Cai TT, Sun W, Wang W. Covariate‐assisted ranking and screening for large‐scale two‐sample inference. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12304] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 01/17/2023]
Affiliation(s)
- Wenguang Sun
  - University of Southern California, Los Angeles, USA
- Weinan Wang
  - University of Southern California, Los Angeles, USA
14
Bogomolov M, Heller R. Assessing replicability of findings across two studies of multiple features. Biometrika 2018. [DOI: 10.1093/biomet/asy029] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022] Open
Affiliation(s)
- Marina Bogomolov
  - The William Davidson Faculty of Industrial Engineering and Management, Technion–Israel Institute of Technology, Technion City, Haifa 3200003, Israel
- Ruth Heller
  - Department of Statistics and Operations Research, Tel-Aviv University, P.O. Box 39040, Tel-Aviv 6997801, Israel
15
Sun J, Herazo-Maya JD, Kaminski N, Zhao H, Warren JL. A Dirichlet process mixture model for clustering longitudinal gene expression data. Stat Med 2017; 36:3495-3506. [PMID: 28620908 PMCID: PMC5583037 DOI: 10.1002/sim.7374] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 02/03/2017] [Revised: 04/15/2017] [Accepted: 05/23/2017] [Indexed: 12/27/2022]
Abstract
Subgroup identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to define subgroups. Longitudinal gene expression profiles might provide additional information on disease progression than what is captured by baseline profiles alone. Therefore, subgroup identification could be more accurate and effective with the aid of longitudinal gene expression data. However, existing statistical methods are unable to fully utilize these data for patient clustering. In this article, we introduce a novel clustering method in the Bayesian setting based on longitudinal gene expression profiles. This method, called BClustLonG, adopts a linear mixed-effects framework to model the trajectory of genes over time, while clustering is jointly conducted based on the regression coefficients obtained from all genes. In order to account for the correlations among genes and alleviate the high dimensionality challenges, we adopt a factor analysis model for the regression coefficients. The Dirichlet process prior distribution is utilized for the means of the regression coefficients to induce clustering. Through extensive simulation studies, we show that BClustLonG has improved performance over other clustering methods. When applied to a dataset of severely injured (burn or trauma) patients, our model is able to identify interesting subgroups. Copyright © 2017 John Wiley & Sons, Ltd.
Affiliation(s)
- Jiehuan Sun
  - Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
- Jose D Herazo-Maya
  - Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, 06520, CT, U.S.A
- Naftali Kaminski
  - Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, 06520, CT, U.S.A
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
- Joshua L Warren
  - Department of Biostatistics, Yale University, New Haven, 06520, CT, U.S.A
16
17
Zhao H, Zhang J. Weighted p-value procedures for controlling FDR of grouped hypotheses. J Stat Plan Inference 2014. [DOI: 10.1016/j.jspi.2014.04.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
18
Martini P, Sales G, Calura E, Cagnin S, Chiogna M, Romualdi C. timeClip: pathway analysis for time course data without replicates. BMC Bioinformatics 2014; 15 Suppl 5:S3. [PMID: 25077979 PMCID: PMC4095003 DOI: 10.1186/1471-2105-15-s5-s3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023] Open
Abstract
Background: Time-course gene expression experiments are useful tools for exploring biological processes. In this type of experiments, gene expression changes are monitored along time. Unfortunately, replication of time series is still costly and usually long time course do not have replicates. Many approaches have been proposed to deal with this data structure, but none of them in the field of pathway analysis. Pathway analyses have acquired great relevance for helping the interpretation of gene expression data. Several methods have been proposed to this aim: from the classical enrichment to the more complex topological analysis that gains power from the topology of the pathway. None of them were devised to identify temporal variations in time course data.
Results: Here we present timeClip, a topology based pathway analysis specifically tailored to long time series without replicates. timeClip combines dimension reduction techniques and graph decomposition theory to explore and identify the portion of pathways that is most time-dependent. In the first step, timeClip selects the time-dependent pathways; in the second step, the most time dependent portions of these pathways are highlighted. We used timeClip on simulated data and on a benchmark dataset regarding mouse muscle regeneration model. Our approach shows good performance on different simulated settings. On the real dataset, we identify 76 time-dependent pathways, most of which known to be involved in the regeneration process. Focusing on the 'mTOR signaling pathway' we highlight the timing of key processes of the muscle regeneration: from the early pathway activation through growth factor signals to the late burst of protein production needed for the fiber regeneration.
Conclusions: timeClip represents a new improvement in the field of time-dependent pathway analysis. It allows to isolate and dissect pathways characterized by time-dependent components. Furthermore, using timeClip on a mouse muscle regeneration dataset we were able to characterize the process of muscle fiber regeneration with its correct timing.
19
Li Y, Ghosh D. A two-step hierarchical hypothesis set testing framework, with applications to gene expression data on ordered categories. BMC Bioinformatics 2014; 15:108. [PMID: 24731138 PMCID: PMC4000433 DOI: 10.1186/1471-2105-15-108] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 07/25/2013] [Accepted: 04/09/2014] [Indexed: 11/10/2022] Open
Abstract
Background: In complex large-scale experiments, in addition to simultaneously considering a large number of features, multiple hypotheses are often being tested for each feature. This leads to a problem of multi-dimensional multiple testing. For example, in gene expression studies over ordered categories (such as time-course or dose-response experiments), interest is often in testing differential expression across several categories for each gene. In this paper, we consider a framework for testing multiple sets of hypothesis, which can be applied to a wide range of problems.
Results: We adopt the concept of the overall false discovery rate (OFDR) for controlling false discoveries on the hypothesis set level. Based on an existing procedure for identifying differentially expressed gene sets, we discuss a general two-step hierarchical hypothesis set testing procedure, which controls the overall false discovery rate under independence across hypothesis sets. In addition, we discuss the concept of the mixed-directional false discovery rate (mdFDR), and extend the general procedure to enable directional decisions for two-sided alternatives. We applied the framework to the case of microarray time-course/dose-response experiments, and proposed three procedures for testing differential expression and making multiple directional decisions for each gene. Simulation studies confirm the control of the OFDR and mdFDR by the proposed procedures under independence and positive correlations across genes. Simulation results also show that two of our new procedures achieve higher power than previous methods. Finally, the proposed methodology is applied to a microarray dose-response study, to identify 17 β-estradiol sensitive genes in breast cancer cells that are induced at low concentrations.
Conclusions: The framework we discuss provides a platform for multiple testing procedures covering situations involving two (or potentially more) sources of multiplicity. The framework is easy to use and adaptable to various practical settings that frequently occur in large-scale experiments. Procedures generated from the framework are shown to maintain control of the OFDR and mdFDR, quantities that are especially relevant in the case of multiple hypothesis set testing. The procedures work well in both simulations and real datasets, and are shown to have better power than existing methods.
Affiliation(s)
- Debashis Ghosh
  - Department of Statistics, Pennsylvania State University, University Park, State College, Pennsylvania 16802, USA
20
21
Zhao Z, Wang W, Wei Z. An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. Ann Appl Stat 2013. [DOI: 10.1214/13-aoas660] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
22
Benjamini Y, Bogomolov M. Selective inference on multiple families of hypotheses. J R Stat Soc Series B Stat Methodol 2013. [DOI: 10.1111/rssb.12028] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.1] [Indexed: 11/28/2022]
23
Wu S, Wu H. More powerful significant testing for time course gene expression data using functional principal component analysis approaches. BMC Bioinformatics 2013; 14:6. [PMID: 23323795 PMCID: PMC3617096 DOI: 10.1186/1471-2105-14-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 08/07/2012] [Accepted: 11/07/2012] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: One of the fundamental problems in time course gene expression data analysis is to identify genes associated with a biological process or a particular stimulus of interest, like a treatment or virus infection. Most of the existing methods for this problem are designed for data with longitudinal replicates. But in reality, many time course gene experiments have no replicates or only have a small number of independent replicates.
RESULTS: We focus on the case without replicates and propose a new method for identifying differentially expressed genes by incorporating the functional principal component analysis (FPCA) into a hypothesis testing framework. The data-driven eigenfunctions allow a flexible and parsimonious representation of time course gene expression trajectories, leaving more degrees of freedom for the inference compared to that using a prespecified basis. Moreover, the information of all genes is borrowed for individual gene inferences.
CONCLUSION: The proposed approach turns out to be more powerful in identifying time course differentially expressed genes compared to the existing methods. The improved performance is demonstrated through simulation studies and a real data application to the Saccharomyces cerevisiae cell cycle data.
Affiliation(s)
- Shuang Wu
  - Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Rochester, NY, 14642, USA
- Hulin Wu
  - Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Rochester, NY, 14642, USA
24
Wang K, Ng SK, McLachlan GJ. Clustering of time-course gene expression profiles using normal mixture models with autoregressive random effects. BMC Bioinformatics 2012; 13:300. [PMID: 23151154 PMCID: PMC3574839 DOI: 10.1186/1471-2105-13-300] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 01/30/2012] [Accepted: 11/07/2012] [Indexed: 11/26/2022] Open
Abstract
Background: Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expressions have been found not to be sufficiently adequate to model the complexity of the time-course data, partly due to their ignoring the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Some simulations and real examples are given to demonstrate the usefulness of the proposed models.
Results: We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models to provide more reliable and robust clustering of time-course data. Our model provides superior results when genetic profiles are correlated. It also gives comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of coregulated genes are obtained, which are supported by gene-function annotation databases.
Conclusions: Our new model under our extension of the EMMIX-WIRE procedure is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters that enables improved modelling and consequent clustering of time-course data.
Affiliation(s)
- Kui Wang
  - Department of Mathematics, University of Queensland, Brisbane, QLD 4072, Australia
25
Wei Z, Wang W, Hu P, Lyon GJ, Hakonarson H. SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res 2011; 39:e132. [PMID: 21813454 PMCID: PMC3201884 DOI: 10.1093/nar/gkr599] [Citation(s) in RCA: 176] [Impact Index Per Article: 13.5] [Indexed: 11/12/2022] Open
Abstract
We develop a statistical tool SNVer for calling common and rare variants in analysis of pooled or individual next-generation sequencing (NGS) data. We formulate variant calling as a hypothesis testing problem and employ a binomial-binomial model to test the significance of observed allele frequency against sequencing error. SNVer reports one single overall P-value for evaluating the significance of a candidate locus being a variant based on which multiplicity control can be obtained. This is particularly desirable because tens of thousands loci are simultaneously examined in typical NGS experiments. Each user can choose the false-positive error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to 'accept or reject the candidates' provided by most existing methods. We use both simulated data and real data to demonstrate the superior performance of our program in comparison with existing methods. SNVer runs very fast and can complete testing 300 K loci within an hour. This excellent scalability makes it feasible for analysis of whole-exome sequencing data, or even whole-genome sequencing data using high performance computing cluster. SNVer is freely available at http://snver.sourceforge.net/.
Affiliation(s)
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 08540, USA
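The abstract above frames variant calling as testing an observed allele count against sequencing error. As a rough illustration of that idea only, the sketch below computes a one-sided binomial tail probability for a single locus. This is a deliberately simplified stand-in, not SNVer's binomial-binomial model; the function name, the error rate, and the read counts are all hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Hypothetical helper: n is the read depth at a locus, k the number of
    reads carrying the alternative allele, and p an assumed per-read
    sequencing error rate. A small value suggests the alternative reads
    are unlikely to be explained by error alone.
    """
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 8 alternative reads out of 100 at an assumed 1% error rate
print(binomial_upper_tail(8, 100, 0.01))
```

One P-value per locus of this kind is what makes a multiplicity correction across many loci possible, which is the property the abstract emphasizes.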