1
|
Sottosanti A, Risso D. Co-clustering of spatially resolved transcriptomic data. Ann Appl Stat 2023; 17:1444-1468. [PMID: 37811520 PMCID: PMC10552783 DOI: 10.1214/22-aoas1677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Spatial transcriptomics is a groundbreaking technology that allows the measurement of the activity of thousands of genes in a tissue sample and maps where the activity occurs. This technology has enabled the study of the spatial variation of the genes across the tissue. Comprehending gene functions and interactions in different areas of the tissue is of great scientific interest, as it might lead to a deeper understanding of several key biological mechanisms, such as cell-cell communication or tumor-microenvironment interaction. To do so, one can group cells of the same type and genes that exhibit similar expression patterns. However, adequate statistical tools that exploit the previously unavailable spatial information to more coherently group cells and genes are still lacking. In this work, we introduce SpaRTaCo, a new statistical model that clusters the spatial expression profiles of the genes according to a partition of the tissue. This is accomplished by performing a co-clustering, i.e., inferring the latent block structure of the data and inducing two types of clustering: of the genes, using their expression across the tissue, and of the image areas, using the gene expression in the spots where the RNA is collected. Our proposed methodology is validated with a series of simulation experiments and its usefulness in responding to specific biological questions is illustrated with an application to a human brain tissue sample processed with the 10X-Visium protocol.
|
2
|
Hassler GW, Magee A, Zhang Z, Baele G, Lemey P, Ji X, Fourment M, Suchard MA. Data integration in Bayesian phylogenetics. Annual Review of Statistics and Its Application 2022; 10:353-377. [PMID: 38774036 PMCID: PMC11108065 DOI: 10.1146/annurev-statistics-033021-112532] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 05/24/2024]
Abstract
Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large and complex data sets from multiple different sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g. DNA), time, location (both continuous and discrete) and environmental covariates (e.g. social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances below, as well as unsolved problems and areas of active research.
Affiliation(s)
- Gabriel W Hassler
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
- Andrew Magee
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Zhenyu Zhang
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Guy Baele
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Philippe Lemey
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, USA, 70118
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo NSW, Australia, 2007
- Marc A Suchard
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
  - Department of Human Genetics, University of California, Los Angeles, USA, 90095
|
3
|
Wang Y, Sun Z, Song D, Hero A. Kronecker-structured covariance models for multiway data. Statistics Surveys 2022. [DOI: 10.1214/22-ss139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Affiliation(s)
- Yu Wang
  - University of Michigan, Ann Arbor, MI 48109
- Zeyu Sun
  - University of Michigan, Ann Arbor, MI 48109
|
4
|
|
5
|
Affiliation(s)
- Mathias Drton
  - Department of Mathematics, Technical University of Munich
- Peter Hoff
  - Department of Statistical Science, Duke University
|
6
|
Gerard D, Stephens M. Unifying and generalizing methods for removing unwanted variation based on negative controls. Stat Sin 2021; 31:1145-1166. [PMID: 38148787 PMCID: PMC10751021 DOI: 10.5705/ss.202018.0345] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2023]
Abstract
Unwanted variation, including hidden confounding, is a well-known problem in many fields, but particularly in large-scale gene expression studies. Recent proposals to use control genes, genes assumed to be unassociated with the covariates of interest, have led to new methods to deal with this problem. Several versions of these removing unwanted variation (RUV) methods have been proposed, including RUV1, RUV2, RUV4, RUVinv, RUVrinv, and RUVfun. Here, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify the connections between existing methods. In particular, we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework preserves an advantage of the RUV approaches, namely, their modularity, which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data, we found RUVB to be competitive with existing methods in terms of both power and calibration. However, providing a consistently reliable calibration among the data sets remains challenging.
Affiliation(s)
- David Gerard
  - Department of Mathematics and Statistics, American University, Washington, DC 20016, USA
- Matthew Stephens
  - Departments of Human Genetics and Statistics, University of Chicago, Chicago, IL 60637, USA
|
7
|
Tripathi S, Muhr D, Brunner M, Jodlbauer H, Dehmer M, Emmert-Streib F. Ensuring the Robustness and Reliability of Data-Driven Knowledge Discovery Models in Production and Manufacturing. Front Artif Intell 2021; 4:576892. [PMID: 34195608 PMCID: PMC8236533 DOI: 10.3389/frai.2021.576892] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/27/2020] [Accepted: 02/12/2021] [Indexed: 11/20/2022] Open
Abstract
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely accepted framework in production and manufacturing. This data-driven knowledge discovery framework provides an orderly partition of the often complex data mining processes to ensure a practical implementation of data analytics and machine learning models. However, the practical application of robust industry-specific data-driven knowledge discovery models faces multiple data- and model development-related issues. These issues need to be carefully addressed by allowing a flexible, customized and industry-specific knowledge discovery framework. For this reason, extensions of CRISP-DM are needed. In this paper, we provide a detailed review of CRISP-DM and summarize extensions of this model into a novel framework we call Generalized Cross-Industry Standard Process for Data Science (GCRISP-DS). This framework is designed to allow dynamic interactions between different phases to adequately address data- and model-related issues for achieving robustness. Furthermore, it also emphasizes the need for a detailed business understanding and the interdependencies with the developed models and data quality for fulfilling higher business objectives. Overall, such a customizable GCRISP-DS framework provides an enhancement for model improvements and reusability by minimizing robustness issues.
Affiliation(s)
- Shailesh Tripathi
  - Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria
- David Muhr
  - Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria
- Manuel Brunner
  - Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria
- Herbert Jodlbauer
  - Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria
- Matthias Dehmer
  - Department of Computer Science, Swiss Distance University of Applied Sciences, Brig, Switzerland
  - School of Science, Xian Technological University, Xian, China
  - Department of Biomedical Computer Science and Mechatronics, UMIT-The Health and Life Science University, Hall in Tyrol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
|
8
|
Park S, Wang X, Lim J. Estimating high-dimensional covariance and precision matrices under general missing dependence. Electron J Stat 2021. [DOI: 10.1214/21-ejs1892] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
Affiliation(s)
- Seongoh Park
  - Department of Statistics, Sungshin Women’s University, Seoul, Korea
- Xinlei Wang
  - Department of Statistical Science, Southern Methodist University, Dallas, TX, USA
- Johan Lim
  - Department of Statistics, Seoul National University, Seoul, Korea
|
9
|
Hassler G, Tolkoff MR, Allen WL, Ho LST, Lemey P, Suchard MA. Inferring Phenotypic Trait Evolution on Large Trees With Many Incomplete Measurements. J Am Stat Assoc 2020; 117:678-692. [PMID: 36060555 PMCID: PMC9438787 DOI: 10.1080/01621459.2020.1799812] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/30/2019] [Revised: 05/27/2020] [Accepted: 07/15/2020] [Indexed: 01/03/2023]
Abstract
Comparative biologists are often interested in inferring covariation between multiple biological traits sampled across numerous related taxa. To properly study these relationships, we must control for the shared evolutionary history of the taxa to avoid spurious inference. An additional challenge arises as obtaining a full suite of measurements becomes increasingly difficult with increasing taxa. This generally necessitates data imputation or integration, and existing control techniques typically scale poorly as the number of taxa increases. We propose an inference technique that integrates out missing measurements analytically and scales linearly with the number of taxa by using a post-order traversal algorithm under a multivariate Brownian diffusion (MBD) model to characterize trait evolution. We further exploit this technique to extend the MBD model to account for sampling error or non-heritable residual variance. We test these methods to examine mammalian life history traits, prokaryotic genomic and phenotypic traits, and HIV infection traits. We find computational efficiency increases that top two orders-of-magnitude over current best practices. While we focus on the utility of this algorithm in phylogenetic comparative methods, our approach generalizes to solve long-standing challenges in computing the likelihood for matrix-normal and multivariate normal distributions with missing data at scale.
Affiliation(s)
- Gabriel Hassler
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
- Max R Tolkoff
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States
- William L Allen
  - Department of Biosciences, Swansea University, Swansea, United Kingdom
- Lam Si Tung Ho
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Philippe Lemey
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
|
10
|
Kundu S, Risk BB. Scalable Bayesian matrix normal graphical models for brain functional networks. Biometrics 2020; 77:439-450. [PMID: 32569385 DOI: 10.1111/biom.13319] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/26/2019] [Accepted: 06/04/2020] [Indexed: 01/23/2023]
Abstract
Recently, there has been an explosive growth in graphical modeling approaches for estimating brain functional networks. In a detailed study, we show that surprisingly, standard graphical modeling approaches for fMRI data may not yield accurate estimates of the brain network due to the inability to suitably account for temporal correlations. We propose a novel Bayesian matrix normal graphical model that jointly models the temporal covariance and the brain network under a separable structure for the covariance to obtain improved estimates. The approach is implemented via an efficient optimization algorithm that computes the maximum-a-posteriori network estimates having desirable theoretical properties and which is scalable to high dimensions. The proposed method leads to substantial gains in network estimation accuracy compared to standard brain network modeling approaches as illustrated via extensive simulations. We apply the method to resting state fMRI data from the Human Connectome Project involving a large number of time scans and brain regions, to study the relationships between fluid intelligence and functional connectivity, where it is not computationally feasible to apply existing matrix normal graphical models. Our proposed approach led to the detection of differences in connectivity between high and low fluid intelligence groups, whereas these differences were less pronounced or absent using the graphical lasso.
Affiliation(s)
- Suprateek Kundu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia
- Benjamin B Risk
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia
|
11
|
Clavel J, Morlon H. Reliable Phylogenetic Regressions for Multivariate Comparative Data: Illustration with the MANOVA and Application to the Effect of Diet on Mandible Morphology in Phyllostomid Bats. Syst Biol 2020; 69:927-943. [DOI: 10.1093/sysbio/syaa010] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/11/2019] [Revised: 02/02/2020] [Accepted: 02/07/2020] [Indexed: 11/12/2022] Open
Abstract
Understanding what shapes species phenotypes over macroevolutionary timescales from comparative data often requires studying the relationship between phenotypes and putative explanatory factors or testing for differences in phenotypes across species groups. In phyllostomid bats, for example, is mandible morphology associated with diet preferences? Performing such analyses depends upon reliable phylogenetic regression techniques and associated tests (e.g., phylogenetic Generalized Least Squares, pGLS, and phylogenetic analyses of variance and covariance, pANOVA, pANCOVA). While these tools are well established for univariate data, their multivariate counterparts are lagging behind. This is particularly true for high-dimensional phenotypic data, such as morphometric data. Here, we implement much-needed likelihood-based multivariate pGLS, pMANOVA, and pMANCOVA, and use a recently developed penalized-likelihood framework to extend their application to the difficult case when the number of traits $p$ approaches or exceeds the number of species $n$. We then focus on the pMANOVA and use intensive simulations to assess the performance of the approach as $p$ increases, under various levels of phylogenetic signal and correlations between the traits, phylogenetic structure in the predictors, and under various types of phenotypic differences across species groups. We show that our approach outperforms available alternatives under all circumstances, with greater power to detect phenotypic differences across species groups when they exist, and a lower risk of improperly detecting nonexistent differences. Finally, we provide an empirical illustration of our pMANOVA on a geometric-morphometric data set describing mandible morphology in phyllostomid bats along with data on their diet preferences. Overall, our results show significant differences between ecological groups.
Our approach, implemented in the R package mvMORPH and illustrated in a tutorial for end-users, provides efficient multivariate phylogenetic regression tools for understanding what shapes phenotypic differences across species. [Generalized least squares; high-dimensional data sets; multivariate phylogenetic comparative methods; penalized likelihood; phenomics; phyllostomid bats; phylogenetic MANOVA; phylogenetic regression.]
Affiliation(s)
- Julien Clavel
  - Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, Paris Sciences et Lettres (PSL) Research University, CNRS UMR 8197, INSERM U1024, 46 rue d’Ulm, F-75005 Paris, France
  - Life Sciences Department, The Natural History Museum, Cromwell Road, London SW7 5BD, UK
  - Univ Lyon, Laboratoire d’Ecologie des Hydrosystèmes Naturels et Anthropisés, UMR CNRS 5023, Université Claude Bernard Lyon 1, ENTPE, Boulevard du 11 Novembre 1918 F-69622, Villeurbanne Cedex, France
- Hélène Morlon
  - Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, Paris Sciences et Lettres (PSL) Research University, CNRS UMR 8197, INSERM U1024, 46 rue d’Ulm, F-75005 Paris, France
|
12
|
Greenewald K, Zhou S, Hero A. Tensor graphical lasso (TeraLasso). J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12339] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
|
13
|
Lafaye de Micheaux P, Liquet B, Sutton M. PLS for Big Data: A unified parallel algorithm for regularised group PLS. Statistics Surveys 2019. [DOI: 10.1214/19-ss125] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
14
|
Clavel J, Aristide L, Morlon H. A Penalized Likelihood Framework for High-Dimensional Phylogenetic Comparative Methods and an Application to New-World Monkeys Brain Evolution. Syst Biol 2018; 68:93-116. [PMID: 29931145 DOI: 10.1093/sysbio/syy045] [Citation(s) in RCA: 65] [Impact Index Per Article: 9.3] [Received: 10/23/2017] [Accepted: 06/13/2018] [Indexed: 01/03/2023] Open
Abstract
Working with high-dimensional phylogenetic comparative data sets is challenging because likelihood-based multivariate methods suffer from low statistical performance as the number of traits $p$ approaches the number of species $n$ and because some computational complications occur when $p$ exceeds $n$. Alternative phylogenetic comparative methods have recently been proposed to deal with the large $p$, small $n$ scenario, but their use and performance are limited. Herein, we develop a penalized likelihood (PL) framework to deal with high-dimensional comparative data sets. We propose various penalizations and methods for selecting the intensity of the penalties. We apply this general framework to the estimation of parameters (the evolutionary trait covariance matrix and parameters of the evolutionary model) and model comparison for the high-dimensional multivariate Brownian motion (BM), Early-burst (EB), Ornstein-Uhlenbeck (OU), and Pagel's lambda models. We show using simulations that our PL approach dramatically improves the estimation of evolutionary trait covariance matrices and model parameters when $p$ approaches $n$, and allows for their accurate estimation when $p$ equals or exceeds $n$. In addition, we show that PL models can be efficiently compared using the generalized information criterion (GIC). We implement these methods, as well as the related estimation of ancestral states and the computation of phylogenetic principal component analysis, in the R packages RPANDA and mvMORPH. Finally, we illustrate the utility of the new proposed framework by evaluating the fit of evolutionary models, analyzing integration patterns, and reconstructing evolutionary trajectories for a high-dimensional 3D data set of brain shape in the New World monkeys. We find clear support for an EB model, suggesting an early diversification of brain morphology during the ecological radiation of the clade. PL offers an efficient way to deal with high-dimensional multivariate comparative data.
Affiliation(s)
- Julien Clavel
  - École Normale Supérieure, Paris Sciences et Lettres (PSL) Research University, Institut de Biologie de l'École Normale Supérieure (IBENS), CNRS UMR 8197, INSERM U1024, 46 rue d'Ulm, F-75005 Paris, France
- Leandro Aristide
  - École Normale Supérieure, Paris Sciences et Lettres (PSL) Research University, Institut de Biologie de l'École Normale Supérieure (IBENS), CNRS UMR 8197, INSERM U1024, 46 rue d'Ulm, F-75005 Paris, France
- Hélène Morlon
  - École Normale Supérieure, Paris Sciences et Lettres (PSL) Research University, Institut de Biologie de l'École Normale Supérieure (IBENS), CNRS UMR 8197, INSERM U1024, 46 rue d'Ulm, F-75005 Paris, France
|
15
|
|
16
|
An expectation–maximization algorithm for the matrix normal distribution with an application in remote sensing. J Multivariate Anal 2018. [DOI: 10.1016/j.jmva.2018.03.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/22/2022]
|
17
|
Molstad AJ, Rothman AJ. A Penalized Likelihood Method for Classification With Matrix-Valued Predictors. J Comput Graph Stat 2018. [DOI: 10.1080/10618600.2018.1476249] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/16/2022]
Affiliation(s)
- Aaron J. Molstad
  - Biostatistics Program, Fred Hutchinson Cancer Research Center, Seattle, WA
- Adam J. Rothman
  - School of Statistics, University of Minnesota, Minneapolis, MN
|
18
|
Hatfield LA, Zaslavsky AM. Separable covariance models for health care quality measures across years and topics. Stat Med 2018; 37:2053-2066. [PMID: 29609196 DOI: 10.1002/sim.7656] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/03/2017] [Revised: 01/17/2018] [Accepted: 02/05/2018] [Indexed: 11/10/2022]
Abstract
Public quality reports for Medicare Advantage health plans include 11 measures of patient experiences reported in the annual Consumer Assessment of Healthcare Providers and Systems surveys. Computing summaries at the health plan level (of multiple measures in multiple years) yields an array-structured random variable. To summarize associations among measures and years, we model the variance-covariance matrix governing the plan-level vectors of yearly quality measures as a Kronecker product of an across-measure matrix and an across-year matrix, or a sum of such Kronecker products. This approach extends separable covariance structure to Fay-Herriot models. In addition, we develop linear combinations of Kronecker products similar to principal components for array random variables. To each Kronecker-product term, we apply post hoc analyses suited to the corresponding dimension of the cross-classification: 1-way factor analysis for the across-measure factor and time-series analysis to the across-year factor. These methods draw out key patterns of variation in the quality measures over time and suggest new strategies for reporting quality information to consumers.
Affiliation(s)
- Laura A Hatfield
  - Department of Health Care Policy, Harvard Medical School, Boston, MA, 02115, USA
- Alan M Zaslavsky
  - Department of Health Care Policy, Harvard Medical School, Boston, MA, 02115, USA
|
19
|
Simon T, Valmadre J, Matthews I, Sheikh Y. Kronecker-Markov Prior for Dynamic 3D Reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017; 39:2201-2214. [PMID: 27992328 DOI: 10.1109/tpami.2016.2638904] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Recovering dynamic 3D structures from 2D image observations is highly under-constrained because of projection and missing data, motivating the use of strong priors to constrain shape deformation. In this paper, we empirically show that the spatiotemporal covariance of natural deformations is dominated by a Kronecker pattern. We demonstrate that this pattern arises as the limit of a spatiotemporal autoregressive process, and derive a Kronecker Markov Random Field as a prior distribution over dynamic structures. This distribution unifies shape and trajectory models of prior art and has the individual models as its marginals. The key assumption of the Kronecker MRF is that the spatiotemporal covariance is separable into the product of a temporal and a shape covariance, and can therefore be modeled using the matrix normal distribution. Analysis on motion capture data validates that this distribution is an accurate approximation with significantly fewer free parameters. Using the trace-norm, we present a convex method to estimate missing data from a single sequence when the marginal shape distribution is unknown. The Kronecker-Markov distribution, fit to a single sequence, outperforms state-of-the-art methods at inferring missing 3D data, and additionally provides covariance estimates of the uncertainty.
|
20
|
|
21
|
Gaynanova I, Booth JG, Wells MT. Penalized Versus Constrained Generalized Eigenvalue Problems. J Comput Graph Stat 2017. [DOI: 10.1080/10618600.2016.1172017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Irina Gaynanova
  - Department of Statistics, Texas A&M University, College Station, Texas
- James G. Booth
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York
- Martin T. Wells
  - Department of Statistical Science, Cornell University, Ithaca, New York
|
22
|
Ni Y, Stingo FC, Baladandayuthapani V. Sparse Multi-Dimensional Graphical Models: A Unified Bayesian Framework. J Am Stat Assoc 2017. [DOI: 10.1080/01621459.2016.1167694] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023]
Affiliation(s)
- Yang Ni
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX
- Francesco C. Stingo
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX
  - Dipartimento di Statistica, Informatica, Applicazioni “G.Parenti,” University of Florence, Florence, Italy
|
23
|
|
24
|
Xia Y, Li L. Hypothesis testing of matrix graph model with application to brain connectivity analysis. Biometrics 2016; 73:780-791. [DOI: 10.1111/biom.12633] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 11/01/2015] [Revised: 09/01/2016] [Accepted: 10/01/2016] [Indexed: 01/21/2023]
Affiliation(s)
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai 200433, China
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Lexin Li
  - Division of Biostatistics, University of California at Berkeley, Berkeley, California 94720, USA
|
25
|
Gaussian and robust Kronecker product covariance estimation: Existence and uniqueness. J Multivariate Anal 2016. [DOI: 10.1016/j.jmva.2016.04.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
26
|
Dahl A, Iotchkova V, Baud A, Johansson Å, Gyllensten U, Soranzo N, Mott R, Kranis A, Marchini J. A multiple-phenotype imputation method for genetic studies. Nat Genet 2016; 48:466-72. [PMID: 26901065 PMCID: PMC4817234 DOI: 10.1038/ng.3513] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Received: 09/14/2015] [Accepted: 01/25/2016] [Indexed: 12/15/2022]
Abstract
Genetic association studies have yielded a wealth of biological discoveries. However, these studies have mostly analyzed one trait and one SNP at a time, thus failing to capture the underlying complexity of the data sets. Joint genotype-phenotype analyses of complex, high-dimensional data sets represent an important way to move beyond simple genome-wide association studies (GWAS) with great potential. The move to high-dimensional phenotypes will raise many new statistical problems. Here we address the central issue of missing phenotypes in studies with any level of relatedness between samples. We propose a multiple-phenotype mixed model and use a computationally efficient variational Bayesian algorithm to fit the model. On a variety of simulated and real data sets from a range of organisms and trait types, we show that our method outperforms existing state-of-the-art methods from the statistics and machine learning literature and can boost signals of association.
Affiliation(s)
- Andrew Dahl
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Valentina Iotchkova
- Human Genetics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK; European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK
| | - Amelie Baud
- European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK
| | - Åsa Johansson
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory Uppsala, Uppsala University, Uppsala, Sweden
| | - Ulf Gyllensten
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory Uppsala, Uppsala University, Uppsala, Sweden
| | - Nicole Soranzo
- Human Genetics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
| | - Richard Mott
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Andreas Kranis
- Aviagen, Ltd., Newbridge, UK; Roslin Institute, University of Edinburgh, Midlothian, UK
| | - Jonathan Marchini
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; Department of Statistics, University of Oxford, Oxford, UK
| |
|
27
|
Existence and uniqueness of the maximum likelihood estimator for models with a Kronecker product covariance structure. J MULTIVARIATE ANAL 2016. [DOI: 10.1016/j.jmva.2015.05.019] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 11/17/2022]
|
28
|
Hero AO, Rajaratnam B. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining. PROCEEDINGS OF THE IEEE. INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS 2016; 104:93-110. [PMID: 27087700 PMCID: PMC4827453 DOI: 10.1109/jproc.2015.2494178] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
Affiliation(s)
- Alfred O Hero
- University of Michigan, Ann Arbor, MI 48109-2122, USA
| | | |
|
29
|
Abstract
Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed. We provide a test for such dependence using the framework of the matrix normal model, a type of multivariate normal distribution parameterized in terms of row- and column-specific covariance matrices. We develop a likelihood ratio test (LRT) for row and column dependence based on the observation of a single relational data matrix. We obtain a reference distribution for the LRT statistic, thereby providing an exact test for the presence of row or column correlations in a square relational data matrix. Additionally, we provide extensions of the test to accommodate common features of such data, such as undefined diagonal entries, a non-zero mean, multiple observations, and deviations from normality. Supplementary materials for this article are available online.
Affiliation(s)
| | - Peter D Hoff
- Departments of Statistics and Biostatistics, University of Washington
| |
|
30
|
Touloumis A, Tavaré S, Marioni JC. Testing the mean matrix in high-dimensional transposable data. Biometrics 2015; 71:157-166. [DOI: 10.1111/biom.12257] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 10/01/2013] [Revised: 09/01/2014] [Accepted: 09/01/2014] [Indexed: 11/29/2022]
Affiliation(s)
- Anestis Touloumis
- Cancer Research UK Cambridge Institute; University of Cambridge; Cambridge CB2 0RE U.K.
| | - Simon Tavaré
- Cancer Research UK Cambridge Institute; University of Cambridge; Cambridge CB2 0RE U.K.
| | - John C. Marioni
- EMBL-European Bioinformatics Institute; Hinxton CB10 1SD U.K.
| |
|
31
|
He S, Yin J, Li H, Wang X. Graphical model selection and estimation for high dimensional tensor data. J MULTIVARIATE ANAL 2014. [DOI: 10.1016/j.jmva.2014.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
32
|
Koyejo O, Lee C, Ghosh J. A constrained matrix-variate Gaussian process for transposable data. Mach Learn 2014. [DOI: 10.1007/s10994-014-5444-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
|
33
|
|
34
|
|
35
|
Abstract
Human mortality data sets can be expressed as multiway data arrays, the dimensions of which correspond to categories by which mortality rates are reported, such as age, sex, country and year. Regression models for such data typically assume an independent error distribution or an error model that allows for dependence along at most one or two dimensions of the data array. However, failing to account for other dependencies can lead to inefficient estimates of regression parameters, inaccurate standard errors and poor predictions. An alternative to assuming independent errors is to allow for dependence along each dimension of the array using a separable covariance model. However, the number of parameters in this model increases rapidly with the dimensions of the array and, for many arrays, maximum likelihood estimates of the covariance parameters do not exist. In this paper, we propose a submodel of the separable covariance model that estimates the covariance matrix for each dimension as having factor analytic structure. This model can be viewed as an extension of factor analysis to array-valued data, as it uses a factor model to estimate the covariance along each dimension of the array. We discuss properties of this model as they relate to ordinary factor analysis, describe maximum likelihood and Bayesian estimation methods, and provide a likelihood ratio testing procedure for selecting the factor model ranks. We apply this methodology to the analysis of data from the Human Mortality Database, and show in a cross-validation experiment how it outperforms simpler methods. Additionally, we use this model to impute mortality rates for countries that have no mortality data for several years. Unlike other approaches, our methodology is able to estimate similarities between the mortality rates of countries, time periods and sexes, and use this information to assist with the imputations.
Affiliation(s)
- Bailey K Fosdick
- Statistical and Applied Mathematical Sciences Institute and University of Washington
| | - Peter D Hoff
- Statistical and Applied Mathematical Sciences Institute and University of Washington
| |
|
36
|
Abstract
We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log likelihood. We apply an ℓ1 penalty to the means of the biclusters in order to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for bi-clustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression data set. This article has supplementary material online.
Affiliation(s)
- Kean Ming Tan
- Department of Biostatistics, University of Washington, Seattle, WA 98115
| | - Daniela M. Witten
- Department of Biostatistics, University of Washington, 1705 NE Pacific Street, Box 357232, F-649 Health Sciences Building, Seattle, WA 98195-7232
| |
|
37
|
Prabhakaran S, Adametz D, Metzner KJ, Böhm A, Roth V. Recovering networks from distance data. Mach Learn 2013. [DOI: 10.1007/s10994-013-5370-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
38
|
Affiliation(s)
- Chenlei Leng
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
| | - Cheng Yong Tang
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Business School, University of Colorado Denver
| |
|
39
|
Abstract
Motivated by analysis of gene expression data measured over different tissues or over time, we consider matrix-valued random variables and the matrix-normal distribution, where the precision matrices have a graphical interpretation for genes and tissues, respectively. We present an ℓ1 penalized likelihood method and an efficient coordinate descent-based computational algorithm for model selection and estimation in such matrix normal graphical models (MNGMs). We provide theoretical results on the asymptotic distributions, the rates of convergence of the estimates and the sparsistency, allowing both the numbers of genes and tissues to diverge as the sample size goes to infinity. Simulation results demonstrate that the MNGMs can lead to better estimates of the precision matrices and better identification of the graph structures than the standard Gaussian graphical models. We illustrate the methods with an analysis of mouse gene expression data measured over ten different tissues.
Affiliation(s)
- Jianxin Yin
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Haidian District, Beijing 100872, China and Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA
| | - Hongzhe Li
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Haidian District, Beijing 100872, China and Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA
| |
|
40
|
Allen GI, Tibshirani R. Inference with transposable data: modelling the effects of row and column correlations. J R Stat Soc Series B Stat Methodol 2012; 74:721-743. [DOI: 10.1111/j.1467-9868.2011.01027.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 12/01/2022]
|
41
|
Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Series B Stat Methodol 2011. [DOI: 10.1111/j.1467-9868.2011.00771.x] [Citation(s) in RCA: 1141] [Impact Index Per Article: 81.5] [Indexed: 01/22/2023]
|
42
|
Allen GI, Tibshirani R. TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION. Ann Appl Stat 2010; 4:764-790. [PMID: 26877823 PMCID: PMC4751046 DOI: 10.1214/09-aoas314] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Indexed: 01/04/2023]
Abstract
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
Affiliation(s)
- Genevera I Allen
- Department of Statistics, Stanford University, Stanford, California, 94305, USA
| | - Robert Tibshirani
- Department of Statistics, Stanford University, Stanford, California, 94305, USA
| |
|