1
|
Wang Y, Sun Z, Song D, Hero A. Kronecker-structured covariance models for multiway data. Statistics Surveys 2022. [DOI: 10.1214/22-ss139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Affiliation(s)
- Yu Wang
- University of Michigan, Ann Arbor, MI 48109
- Zeyu Sun
- University of Michigan, Ann Arbor, MI 48109
|
2
|
Niu L, Liu X, Zhao J. Robust estimator of the correlation matrix with sparse Kronecker structure for a high-dimensional matrix-variate. J Multivariate Anal 2020. [DOI: 10.1016/j.jmva.2020.104598] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/25/2022]
|
3
|
|
4
|
Data Wisdom in Computational Genomics Research. Statistics in Biosciences 2017. [DOI: 10.1007/s12561-016-9173-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
5
|
Yang H, Liu X. Studies on the Clustering Algorithm for Analyzing Gene Expression Data with a Bidirectional Penalty. J Comput Biol 2017; 24:689-698. [PMID: 28489418 DOI: 10.1089/cmb.2017.0051] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
This article reports a new clustering method, based on the k-means algorithm, for high-dimensional gene expression data. The proposed approach uses bidirectional penalties to constrain both the number of clusters and the cluster centroids, so that it simultaneously determines the unknown number of clusters and handles large amounts of noise in gene expression data. Numeric studies indicate that this algorithm not only performs better in clustering but is also comparable to other approaches in its ability to obtain the correct number of clusters and correct signal features. Finally, we apply the proposed approach to analyze two benchmark gene expression datasets. These analyses again indicate that the proposed algorithm performs well in clustering high-dimensional gene expression data with an unknown number of clusters.
Affiliation(s)
- Hu Yang
- School of Information, Central University of Finance and Economics, Beijing, China
- Xiaoqin Liu
- The National Center for Register-Based Research, Aarhus University, Aarhus, Denmark
|
6
|
Kim S, Lin CW, Tseng GC. MetaKTSP: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis. Bioinformatics 2016; 32:1966-73. [PMID: 27153719 DOI: 10.1093/bioinformatics/btw115] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/18/2015] [Accepted: 02/19/2016] [Indexed: 01/08/2023]
Abstract
Motivation: Supervised machine learning is widely applied to transcriptomic data to predict disease diagnosis, prognosis or survival. Robust and interpretable classifiers with high accuracy are usually favored for their clinical and translational potential. The top scoring pair (TSP) algorithm is an example that applies a simple rank-based algorithm to identify rank-altered gene pairs for classifier construction. Although many classification methods perform well in cross-validation of a single expression profile, performance usually drops greatly in cross-study validation (i.e. the prediction model is established in the training study and applied to an independent test study) for all machine learning methods, including TSP. The failure of cross-study validation has largely diminished the potential translational and clinical values of the models. The purpose of this article is to develop a meta-analytic top scoring pair (MetaKTSP) framework that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
Results: We proposed two frameworks, by averaging TSP scores or by combining P-values from individual studies, to select the top gene pairs for model construction. We applied the proposed methods in simulated data sets and three large-scale real applications in breast cancer, idiopathic pulmonary fibrosis and pan-cancer methylation. The results showed superior performance of cross-study validation accuracy and biomarker selection for the new meta-analytic framework. In conclusion, combining multiple omics data sets in the public domain increases robustness and accuracy of the classification model that will ultimately improve disease understanding and clinical treatment decisions to benefit patients.
Availability and implementation: An R package MetaKTSP is available online (http://tsenglab.biostat.pitt.edu/software.htm).
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- SungHwan Kim
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Statistics, Korea University, Seoul, South Korea
- Chien-Wei Lin
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Computational and Systems Biology; Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
|
7
|
Freytag S, Gagnon-Bartsch J, Speed TP, Bahlo M. Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinformatics 2015; 16:309. [PMID: 26403471 PMCID: PMC4583191 DOI: 10.1186/s12859-015-0745-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 05/11/2015] [Accepted: 09/16/2015] [Indexed: 12/31/2022]
Abstract
Background: In the past decade, the identification of gene co-expression has become a routine part of the analysis of high-dimensional microarray data. Gene co-expression, which is mostly detected via the Pearson correlation coefficient, has played an important role in the discovery of molecular pathways and networks. Unfortunately, the presence of systematic noise in high-dimensional microarray datasets corrupts estimates of gene co-expression. Removing systematic noise from microarray data is therefore crucial. Many cleaning approaches for microarray data exist; however, these methods are aimed at improving differential expression analysis, and their performance has been tested primarily for this application. To our knowledge, the performance of these approaches has never been systematically compared in the context of gene co-expression estimation.
Results: Using simulations, we demonstrate that standard cleaning procedures, such as background correction and quantile normalization, fail to adequately remove systematic noise that affects gene co-expression and at times further degrade true gene co-expression. Instead, we show that a global version of removal of unwanted variation (RUV), a data-driven approach, removes systematic noise but also allows the estimation of the true underlying gene-gene correlations. We compare the performance of all noise removal methods when applied to five large published datasets on gene expression in the human brain. RUV retrieves the highest gene co-expression values for sets of genes known to interact, but also provides the greatest consistency across all five datasets. We apply the method to prioritize epileptic encephalopathy candidate genes.
Conclusions: Our work raises serious concerns about the quality of many published gene co-expression analyses. RUV provides an efficient and flexible way to remove systematic noise from high-dimensional microarray datasets when the objective is gene co-expression analysis. The RUV method as applicable in the context of gene-gene correlation estimation is available as a Bioconductor package: RUVcorr.
Affiliation(s)
- Saskia Freytag
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia; Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia.
- Johann Gagnon-Bartsch
- Department of Statistics, University of California, 367 Evans Hall, Berkeley, 94720, USA.
- Terence P Speed
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia; Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia; Department of Statistics, University of California, 367 Evans Hall, Berkeley, 94720, USA.
- Melanie Bahlo
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia; Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia; Department of Medical Biology, University of Melbourne, Melbourne, 3010, Australia.
|
8
|
Wang YXR, Jiang K, Feldman LJ, Bickel PJ, Huang H. Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas792] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
|
9
|
Touloumis A, Tavaré S, Marioni JC. Testing the mean matrix in high-dimensional transposable data. Biometrics 2015; 71:157-166. [DOI: 10.1111/biom.12257] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/01/2013] [Revised: 09/01/2014] [Accepted: 09/01/2014] [Indexed: 11/29/2022]
Affiliation(s)
- Anestis Touloumis
- Cancer Research UK Cambridge Institute; University of Cambridge; Cambridge CB2 0RE U.K
- Simon Tavaré
- Cancer Research UK Cambridge Institute; University of Cambridge; Cambridge CB2 0RE U.K
- John C. Marioni
- EMBL-European Bioinformatics Institute; Hinxton CB10 1SD U.K
|
10
|
Wang YXR, Huang H. Review on statistical methods for gene network reconstruction using expression data. J Theor Biol 2014; 362:53-61. [PMID: 24726980 DOI: 10.1016/j.jtbi.2014.03.040] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Received: 01/26/2014] [Revised: 03/29/2014] [Accepted: 03/31/2014] [Indexed: 12/16/2022]
Abstract
Network modeling has proven to be a fundamental tool in analyzing the inner workings of a cell. It has revolutionized our understanding of biological processes and made significant contributions to the discovery of disease biomarkers. Much effort has been devoted to reconstructing various types of biochemical networks using functional genomic datasets generated by high-throughput technologies. This paper discusses statistical methods used to reconstruct gene regulatory networks using gene expression data. In particular, we highlight progress made and challenges yet to be met in estimating gene interactions, inferring causality and modeling temporal changes in regulation behaviors. As rapid advances in technologies have made available diverse, large-scale genomic data, we also survey methods of incorporating these additional data to achieve more accurate inference of gene networks.
Affiliation(s)
- Y X Rachel Wang
- Department of Statistics, University of California, Berkeley, CA 94720, USA.
- Haiyan Huang
- Department of Statistics, University of California, Berkeley, CA 94720, USA.
|
11
|
|
12
|
Abstract
Motivated by the analysis of gene expression data measured over different tissues or over time, we consider a matrix-valued random variable with a matrix-normal distribution, where the precision matrices have a graphical interpretation for genes and tissues, respectively. We present an ℓ1-penalized likelihood method and an efficient coordinate descent-based computational algorithm for model selection and estimation in such matrix normal graphical models (MNGMs). We provide theoretical results on the asymptotic distributions, the rates of convergence of the estimates and the sparsistency, allowing both the numbers of genes and tissues to diverge as the sample size goes to infinity. Simulation results demonstrate that the MNGMs can lead to better estimates of the precision matrices and better identification of the graph structures than the standard Gaussian graphical models. We illustrate the methods with an analysis of mouse gene expression data measured over ten different tissues.
Affiliation(s)
- Jianxin Yin
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Haidian District, Beijing 100872, China and Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA
- Hongzhe Li
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Haidian District, Beijing 100872, China and Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA
|
13
|
Allen GI, Tibshirani R. Inference with transposable data: modelling the effects of row and column correlations. J R Stat Soc Series B Stat Methodol 2012; 74:721-743. [DOI: 10.1111/j.1467-9868.2011.01027.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/01/2022]
|
14
|
Kim K, Jiang K, Teng SL, Feldman LJ, Huang H. Using biologically interrelated experiments to identify pathway genes in Arabidopsis. Bioinformatics 2012; 28:815-22. [PMID: 22271267 DOI: 10.1093/bioinformatics/bts038] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Motivation: Pathway genes are considered as a group of genes that work cooperatively in the same pathway, constituting a fundamental functional grouping in a biological process. Identifying pathway genes has been one of the major tasks in understanding biological processes. However, due to the difficulty in characterizing/inferring different types of biological gene relationships, as well as several computational issues arising from dealing with high-dimensional biological data, deducing genes in pathways remains challenging.
Results: In this work, we elucidate higher level gene-gene interactions by evaluating the conditional dependencies between genes, i.e. the relationships between genes after removing the influences of a set of previously known pathway genes. These previously known pathway genes serve as seed genes in our model and will guide the detection of other genes involved in the same pathway. The detailed statistical techniques involve the estimation of a precision matrix whose elements are known to be proportional to partial correlations (i.e. conditional dependencies) between genes under appropriate normality assumptions. Likelihood ratio tests on two forms of precision matrices are further performed to see if a candidate pathway gene is conditionally independent of all the previously known pathway genes. When used effectively, this is a promising approach to recover gene relationships that would have otherwise been missed by standard methods. The advantage of the proposed method is demonstrated using both simulation studies and real datasets. We also demonstrated the importance of taking into account experimental dependencies in the simulation and real data studies.
Affiliation(s)
- Kyungpil Kim
- Division of Biostatistics, University of California, Berkeley, CA, USA
|