1
|
Gudmundarson RL, Peters GW. Assessing portfolio diversification via two-sample graph kernel inference. A case study on the influence of ESG screening. PLoS One 2024; 19:e0301804. [PMID: 38626019 PMCID: PMC11020627 DOI: 10.1371/journal.pone.0301804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2023] [Accepted: 03/22/2024] [Indexed: 04/18/2024] Open
Abstract
In this work we seek to enhance the frameworks practitioners in asset management and wealth management may adopt to assess how different screening rules may influence the diversification benefits of portfolios. The problem arises naturally in the area of Environmental, Social, and Governance (ESG) based investing practices as practitioners often need to select subsets of the total available assets based on some ESG screening rule. Once a screening rule is identified, one constructs a dynamic portfolio which is usually compared with another dynamic portfolio to check if it satisfies or outperforms the risk and return profile set by the company. Our study proposes a novel method that tackles the problem of comparing diversification benefits of portfolios constructed under different screening rules. Each screening rule produces a sequence of graphs, where the nodes are assets and edges are partial correlations. To compare the diversification benefits of screening rules, we propose to compare the obtained graph sequences. The method proposed is based on a machine learning hypothesis testing framework called the kernel two-sample test whose objective is to determine whether the graphs come from the same distribution. If they come from the same distribution, then the risk and return profiles should be the same. The fact that the sample data points are graphs means that one needs to use graph testing frameworks. The problem is natural for kernel two-sample testing as one can use so-called graph kernels to work with samples of graphs. The null hypothesis of the two-sample graph kernel test is that the graph sequences were generated from the same distribution, while the alternative is that the distributions are different. A failure to reject the null hypothesis would indicate that ESG screening does not affect diversification while rejection would indicate that ESG screening does have an effect.
The article describes the graph kernel two-sample testing framework, and further provides a brief overview of different graph kernels. We then demonstrate the power of the graph two-sample testing framework under different realistic scenarios. Finally, the proposed methodology is applied to data within the S&P 500 to demonstrate the workflow one can use in asset management to test for structural differences in diversification of portfolios under different ESG screening rules.
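As a rough illustration of the kernel two-sample idea described in this abstract, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) and a permutation p-value. It is not the authors' implementation: the samples here are plain numeric vectors and the kernel is a Gaussian kernel, both stand-ins chosen for brevity, whereas the paper works with graph kernels on graph sequences. The function names and the `bandwidth`, `n_perm`, and `seed` parameters are assumptions made for this example.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy


def permutation_test(xs, ys, kernel, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) under H0: same distribution."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys, kernel)
    pooled = list(xs) + list(ys)
    m = len(xs)
    exceed = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if mmd2_biased(pooled[:m], pooled[m:], kernel) >= observed:
            exceed += 1
    # The +1 correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Swapping `gaussian_kernel` for any positive-definite kernel on graphs would leave the rest of the procedure unchanged, which is the reason kernel tests suit graph-valued samples.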
Affiliation(s)
- Ragnar L. Gudmundarson
  - Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, United Kingdom
  - Centre for Networks & Enterprise, Edinburgh Business School, Edinburgh, United Kingdom
- Gareth W. Peters
  - Department of Statistics & Applied Probability, University of California, Santa Barbara, Santa Barbara, California, United States of America
|
2
|
Zhang J, Merikangas KR, Li H, Shou H. Two-sample tests for multivariate repeated measurements of histogram objects with applications to wearable device data. Ann Appl Stat 2022; 16:2396-2416. [PMID: 38037595 PMCID: PMC10688324 DOI: 10.1214/21-aoas1596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract
Repeated observations have become increasingly common in biomedical research and longitudinal studies. For instance, wearable sensor devices are deployed to continuously track physiological and biological signals from each individual over multiple days. It remains of great interest to appropriately evaluate how the daily distribution of biosignals might differ across disease groups and demographics. Hence, these data could be formulated as multivariate complex object data, such as probability densities, histograms, and observations on a tree. Traditional statistical methods would often fail to apply, as they are sampled from an arbitrary non-Euclidean metric space. In this paper we propose novel, nonparametric, graph-based two-sample tests for object data with the same structure of repeated measures. We treat the repeatedly measured object data as multivariate object data, which requires the same number of repeated observations per individual but eliminates any assumptions on the errors of the repeated observations. A set of test statistics are proposed to capture various possible alternatives. We derive their asymptotic null distributions under the permutation null. These tests exhibit substantial power improvements over the existing methods while controlling the type I errors under finite samples as shown through simulation studies. The proposed tests are demonstrated to provide additional insights on the location, inter- and intra-individual variability of the daily physical activity distributions in a sample of studies for mood disorders.
Affiliation(s)
- Jingru Zhang
  - Division of Biostatistics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine
- Kathleen R. Merikangas
  - Genetic Epidemiology Research Branch, National Institute of Mental Health, National Institutes of Health
- Hongzhe Li
  - Division of Biostatistics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine
- Haochang Shou
  - Division of Biostatistics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine
|
3
|
Wang R, Fan W, Wang X. A hyperbolic divergence based nonparametric test for two‐sample multivariate distributions. Can J Stat 2022. [DOI: 10.1002/cjs.11736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Roulin Wang
  - Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, AH 230026, China
- Wei Fan
  - School of the Gifted Young, University of Science and Technology of China, Hefei, Anhui, AH 230026, China
- Xueqin Wang
  - Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, AH 230026, China
  - International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, AH 230026, China
|
4
|
Bibby JA, Agarwal D, Freiwald T, Kunz N, Merle NS, West EE, Singh P, Larochelle A, Chinian F, Mukherjee S, Afzali B, Kemper C, Zhang NR. Systematic single-cell pathway analysis to characterize early T cell activation. Cell Rep 2022; 41:111697. [PMID: 36417885 PMCID: PMC10704209 DOI: 10.1016/j.celrep.2022.111697] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 01/31/2022] [Revised: 07/06/2022] [Accepted: 10/31/2022] [Indexed: 11/23/2022] Open
Abstract
Pathway analysis is a key analytical stage in the interpretation of omics data, providing a powerful method for detecting alterations in cellular processes. We recently developed a sensitive and distribution-free statistical framework for multisample distribution testing, which we implement here in the open-source R package single-cell pathway analysis (SCPA). We demonstrate the effectiveness of SCPA over commonly used methods, generate a scRNA-seq T cell dataset, and characterize pathway activity over early cellular activation. This reveals regulatory pathways in T cells, including an intrinsic type I interferon system regulating T cell survival and a reliance on arachidonic acid metabolism throughout T cell activation. A systems-level characterization of pathway activity in T cells across multiple tissues also identifies alpha-defensin expression as a hallmark of bone-marrow-derived T cells. Overall, this work provides a widely applicable tool for single-cell pathway analysis and highlights regulatory mechanisms of T cells.
Affiliation(s)
- Jack A Bibby
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Divyansh Agarwal
  - Massachusetts General Hospital, Harvard Medical School, Boston, MA 02108, USA
- Tilo Freiwald
  - Immunoregulation Section, Kidney Diseases Branch, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), NIH, Bethesda, MD 20892, USA
- Natalia Kunz
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Nicolas S Merle
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Erin E West
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Parul Singh
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Andre Larochelle
  - Cellular and Molecular Therapeutics Branch, National Heart, Lung and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Fariba Chinian
  - Cellular and Molecular Therapeutics Branch, National Heart, Lung and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
- Somabha Mukherjee
  - Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Singapore
- Behdad Afzali
  - Immunoregulation Section, Kidney Diseases Branch, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), NIH, Bethesda, MD 20892, USA
- Claudia Kemper
  - Complement and Inflammation Research Section (CIRS), National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD 20892, USA
  - Institute for Systemic Inflammation Research, University of Lübeck, 23562 Lübeck, Germany
- Nancy R Zhang
  - Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104, USA
|
5
|
Imaizumi M, Ota H, Hamaguchi T. Hypothesis Test and Confidence Analysis with Wasserstein Distance on General Dimension. Neural Comput 2022; 34:1448-1487. [PMID: 35534006 DOI: 10.1162/neco_a_01501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Accepted: 02/01/2022] [Indexed: 11/04/2022]
Abstract
We develop a general framework for statistical inference with the 1-Wasserstein distance. Recently, the Wasserstein distance has attracted considerable attention and has been widely applied to various machine learning tasks because of its excellent properties. However, hypothesis tests and a confidence analysis for it have not been established in a general multivariate setting. This is because the limit distribution of the empirical distribution with the Wasserstein distance is unavailable without strong restriction. To address this problem, in this study, we develop a novel nonasymptotic Gaussian approximation for the empirical 1-Wasserstein distance. Using the approximation method, we develop a hypothesis test and confidence analysis for the empirical 1-Wasserstein distance. We also provide a theoretical guarantee and an efficient algorithm for the proposed approximation. Our experiments validate its performance numerically.
Affiliation(s)
- Masaaki Imaizumi
  - University of Tokyo, Meguro, Tokyo 153-0041, Japan
  - RIKEN Center for Advanced Intelligence Project, Chuo, Tokyo, 103-0027, Japan
- Hirofumi Ota
  - Rutgers University, Piscataway, NJ 08854, U.S.A.
|
6
|
Wang S. Self-Supervised Metric Learning in Multi-View Data: A Downstream Task Perspective. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2057317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Shulei Wang
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820
|
7
|
Liu L, Meng Y, Wu X, Ying Z, Zheng T. Log-Rank-Type Tests for Equality of Distributions in High-Dimensional Spaces. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2051530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Linxi Liu
  - Department of Statistics, University of Pittsburgh
- Yang Meng
  - Department of Statistics, Columbia University
- Tian Zheng
  - Department of Statistics, Columbia University
|
8
|
Tang L, Li J. Combining dependent tests based on data depth with applications to the two-sample problem for data of arbitrary types. J Nonparametr Stat 2022. [DOI: 10.1080/10485252.2021.2025371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Linli Tang
  - Department of Statistics, University of California, Riverside, CA, USA
- Jun Li
  - Department of Statistics, University of California, Riverside, CA, USA
|
9
|
Chen H, Xia Y. A Normality Test for High-dimensional Data Based on the Nearest Neighbor Approach. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1953507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Hao Chen
  - Department of Statistics, University of California at Davis, CA
- Yin Xia
  - Department of Statistics, School of Management, Fudan University
|
10
|
Deb N, Sen B. Multivariate Rank-Based Distribution-Free Nonparametric Testing Using Measure Transportation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1923508] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Nabarun Deb
  - Department of Statistics, Columbia University, New York, NY
|
11
|
Statistical and Machine Learning Link Selection Methods for Brain Functional Networks: Review and Comparison. Brain Sci 2021; 11:brainsci11060735. [PMID: 34073098 PMCID: PMC8227272 DOI: 10.3390/brainsci11060735] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/22/2021] [Revised: 05/24/2021] [Accepted: 05/28/2021] [Indexed: 11/28/2022] Open
Abstract
Network-based representations have introduced a revolution in neuroscience, expanding the understanding of the brain from the activity of individual regions to the interactions between them. This augmented network view comes at the cost of high dimensionality, which hinders both our capacity to decipher the main mechanisms behind pathologies and the significance of any statistical and/or machine learning task used in processing these data. A link selection method, which removes connections that are irrelevant in a given scenario, is an obvious solution that provides improved utilization of these network representations. In this contribution we review a large set of statistical and machine learning link selection methods and evaluate them on real brain functional networks. Results indicate that most methods perform in a qualitatively similar way, with NBS (Network Based Statistics) winning in terms of quantity of retained information, AnovaNet in terms of stability and ExT (Extra Trees) in terms of lower computational cost. While machine learning methods are conceptually more complex than statistical ones, they do not yield a clear advantage. At the same time, the high heterogeneity in the set of links retained by each method suggests that they are offering complementary views of the data. The implications of these results in neuroscience tasks are finally discussed.
|
12
|
Wang S, Cai TT, Li H. Optimal Estimation of Wasserstein Distance on A Tree with An Application to Microbiome Studies. J Am Stat Assoc 2021; 116:1237-1253. [PMID: 36860698 PMCID: PMC9974173 DOI: 10.1080/01621459.2019.1699422] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Abstract
The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation however shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in microbiome between the inactive Crohn's disease patients and the normal controls.
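To make the plug-in estimator mentioned in this abstract concrete, the sketch below computes the 1-Wasserstein distance between two empirical distributions on a tree as the sum, over edges, of edge length times the absolute difference in the probability mass lying below that edge. It normalizes raw counts to proportions, which is the plug-in step the abstract describes as potentially biased; it does not implement the authors' MET estimator. The tree encoding (a `parent` map and a `length` map keyed by child node) and the function names are assumptions made for this example.

```python
def subtree_mass(parent, mass):
    """Total mass in the subtree rooted at each node of a rooted tree.

    `parent` maps every node to its parent (None for the root);
    `mass` maps nodes to the mass placed directly on them.
    """
    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    total = {node: mass.get(node, 0.0) for node in parent}
    # Visit the deepest nodes first so each child is complete before its parent.
    for node in sorted(parent, key=depth, reverse=True):
        if parent[node] is not None:
            total[parent[node]] += total[node]
    return total


def tree_wasserstein(parent, length, counts_p, counts_q):
    """Plug-in 1-Wasserstein distance on a tree from two count vectors.

    `length[v]` is the length of the edge joining node v to its parent.
    """
    n_p, n_q = sum(counts_p.values()), sum(counts_q.values())
    p = {v: c / n_p for v, c in counts_p.items()}
    q = {v: c / n_q for v, c in counts_q.items()}
    below_p = subtree_mass(parent, p)
    below_q = subtree_mass(parent, q)
    return sum(
        length[v] * abs(below_p[v] - below_q[v])
        for v in parent
        if parent[v] is not None
    )
```

For example, on a tree where leaf `c` hangs off node `a` by an edge of length 0.5, `a` joins the root by an edge of length 1, and leaf `b` joins the root by an edge of length 2, putting all mass on `c` in one sample and all mass on `b` in the other gives a distance equal to the path length between them.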
Affiliation(s)
- Shulei Wang
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104
- T Tony Cai
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104
|
13
|
Chakraborty S, Zhang X. A new framework for distance and kernel-based metrics in high dimensions. Electron J Stat 2021. [DOI: 10.1214/21-ejs1889] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
14
|
|
15
|
Banerjee T, Bhattacharya BB, Mukherjee G. A nearest-neighbor based nonparametric test for viral remodeling in heterogeneous single-cell proteomic data. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
16
|
Kim I, Balakrishnan S, Wasserman L. Robust multivariate nonparametric tests via projection averaging. Ann Stat 2020. [DOI: 10.1214/19-aos1936] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 11/19/2022]
|
17
|
Chen H, Small DS. New multivariate tests for assessing covariate balance in matched observational studies. Biometrics 2020; 78:202-213. [PMID: 33074562 DOI: 10.1111/biom.13395] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/24/2019] [Accepted: 10/06/2020] [Indexed: 01/08/2023]
Abstract
We propose new tests for assessing whether covariates in a treatment group and matched control group are balanced in observational studies. The tests exhibit high power under a wide range of multivariate alternatives, some of which existing tests have little power for. The asymptotic permutation null distributions of the proposed tests are studied and the P-values calculated through the asymptotic results work well in simulation studies, facilitating the application of the test to large data sets. The tests are illustrated in a study of the effect of smoking on blood lead levels. The proposed tests are implemented in an R package BalanceCheck.
Affiliation(s)
- Hao Chen
  - Department of Statistics, University of California at Davis, Davis, California
- Dylan S Small
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania
|
18
|
Bhattacharya BB. Asymptotic distribution and detection thresholds for two-sample tests based on geometric graphs. Ann Stat 2020. [DOI: 10.1214/19-aos1913] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
19
|
Mukherjee S, Agarwal D, Zhang NR, Bhattacharya BB. Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1791131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Somabha Mukherjee
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
- Divyansh Agarwal
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
- Nancy R. Zhang
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
|
20
|
Mukhopadhyay S, Wang K. A nonparametric approach to high-dimensional k-sample comparison problems. Biometrika 2020. [DOI: 10.1093/biomet/asaa015] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022] Open
Abstract
Summary
High-dimensional $k$-sample comparison is a common task in applications. We construct a class of easy-to-implement distribution-free tests based on new nonparametric tools and unexplored connections with spectral graph theory. The test is shown to have various desirable properties and a characteristic exploratory flavour that has practical consequences for statistical modelling. Numerical examples show that the proposed method works surprisingly well across a broad range of realistic situations.
Affiliation(s)
- Subhadeep Mukhopadhyay
  - Department of Statistical Science, Temple University, Philadelphia, Pennsylvania 19122, U.S.A.
- Kaijun Wang
  - Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, Washington 98109, U.S.A.
|
21
|
Zhang Q, Mahdi G, Tinker J, Chen H. A graph-based multi-sample test for identifying pathways associated with cancer progression. Comput Biol Chem 2020; 87:107285. [PMID: 32521496 DOI: 10.1016/j.compbiolchem.2020.107285] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/06/2020] [Accepted: 05/09/2020] [Indexed: 11/28/2022]
Abstract
Cancer is in general not the result of an abnormality in a single gene but a consequence of changes in many genes; it is therefore of great importance to understand the roles of different oncogenic and tumor suppressor pathways in tumorigenesis. In recent years, many computational models have been developed to study the genetic alterations of different pathways in the evolutionary process of cancer. However, most of these methods are knowledge-based enrichment analyses and are too inflexible to analyze user-defined pathways or gene sets. In this paper, we develop a nonparametric and data-driven approach to testing for dynamic changes of pathways over the course of cancer progression. Our method is based on an expansion and refinement of the pathway being studied, followed by a graph-based multivariate test, which is very easy to implement in practice. The new test is applied to the rich Cancer Genome Atlas data to study the (epi)genetic alterations of 186 KEGG pathways in the development of serous ovarian cancer. To make use of the comprehensive data, we incorporate three data types in the analysis representing gene expression level, copy number and DNA methylation level. Our analysis suggests a list of nine pathways that are closely associated with serous ovarian cancer progression, including cell cycle, ERBB, JAK-STAT signaling and p53 signaling pathways. By pairwise tests, we found that most of the identified pathways contribute only to a particular transition step. For instance, the cell cycle and ERBB pathways play key roles in the early-stage transition, while the ECM receptor and apoptosis pathways contribute to the progression from stage III to stage IV. The proposed computational pipeline is powerful in detecting important pathways and gene sets that drive cancers at certain stage(s). It offers new insights into the molecular mechanisms of cancer initiation and progression.
Affiliation(s)
- Qingyang Zhang
  - Department of Mathematical Sciences, University of Arkansas, USA
- Ghadeer Mahdi
  - Department of Mathematical Sciences, University of Arkansas, USA
  - Department of Mathematics, College of Education, Baghdad University, Iraq
- Jian Tinker
  - Department of Mathematical Sciences, University of Arkansas, USA
- Hao Chen
  - Department of Statistics, University of California at Davis, USA
|
22
|
|
23
|
Ceyhan E. Domination number of an interval catch digraph family and its use for testing uniformity. STATISTICS-ABINGDON 2020. [DOI: 10.1080/02331888.2020.1720020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Elvan Ceyhan
  - Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
|
24
|
Li J. Asymptotic distribution-free change-point detection based on interpoint distances for high-dimensional data. J Nonparametr Stat 2020. [DOI: 10.1080/10485252.2019.1710505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Jun Li
  - Department of Statistics, University of California, Riverside, CA, USA
|
25
|
Chen Y, Markatou M. Kernel Tests for One, Two, and K-Sample Goodness-of-Fit: State of the Art and Implementation Considerations. STATISTICAL MODELING IN BIOMEDICAL RESEARCH 2020:309-337. [DOI: 10.1007/978-3-030-33416-1_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
|
26
|
|
27
|
Abstract
Summary
Fréchet mean and variance provide a way of obtaining a mean and variance for metric space-valued random variables, and can be used for statistical analysis of data objects that lie in abstract spaces devoid of algebraic structure and operations. Examples of such data objects include covariance matrices, graph Laplacians of networks and univariate probability distribution functions. We derive a central limit theorem for the Fréchet variance under mild regularity conditions, using empirical process theory, and also provide a consistent estimator of the asymptotic variance. These results lead to a test for comparing $k$ populations of metric space-valued data objects in terms of Fréchet means and variances. We examine the finite-sample performance of this novel inference procedure through simulation studies on several special cases that include probability distributions and graph Laplacians, leading to a test for comparing populations of networks. The proposed approach has good finite-sample performance in simulations for different kinds of random objects. We illustrate the proposed methods by analysing data on mortality profiles of various countries and resting-state functional magnetic resonance imaging data.
Affiliation(s)
- Paromita Dubey
  - Department of Statistics, University of California, One Shields Avenue, Davis, California, U.S.A.
- Hans-Georg Müller
  - Department of Statistics, University of California, One Shields Avenue, Davis, California, U.S.A.
|
28
|
Affiliation(s)
- Jialiang Mao
  - Department of Statistical Science, Duke University, Durham, NC
- Yuhan Chen
  - Department of Statistical Science, Duke University, Durham, NC
- Li Ma
  - Department of Statistical Science, Duke University, Durham, NC
|
29
|
Montero‐Manso P, Vilar JA. Two‐sample homogeneity testing: A procedure based on comparing distributions of interpoint distances. Stat Anal Data Min 2019. [DOI: 10.1002/sam.11417] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Affiliation(s)
- Pablo Montero‐Manso
  - Research group MODES, Department of Mathematics, Faculty of Computer Science, Universidade da Coruña, A Coruña, Spain
- José A. Vilar
  - Research group MODES, Department of Mathematics, Faculty of Computer Science, Universidade da Coruña, A Coruña, Spain
|
30
|
Zhang Q, Du Y. Model-free feature screening for categorical outcomes: Nonlinear effect detection and false discovery rate control. PLoS One 2019; 14:e0217463. [PMID: 31150453 PMCID: PMC6544247 DOI: 10.1371/journal.pone.0217463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2018] [Accepted: 05/13/2019] [Indexed: 11/19/2022] Open
Abstract
Feature screening has become a real prerequisite for the analysis of high-dimensional genomic data, as it is effective in reducing dimensionality and removing redundant features. However, existing methods for feature screening have been mostly relying on the assumptions of linear effects and independence (or weak dependence) between features, which might be inappropriate in real practice. In this paper, we consider the problem of selecting continuous features for a categorical outcome from high-dimensional data. We propose a powerful statistical procedure that consists of two steps, a nonparametric significance test based on edge count and a multiple testing procedure with dependence adjustment for false discovery rate control. The new method presents two novelties. First, the edge-count test directly targets distributional difference between groups, therefore it is sensitive to nonlinear effects. Second, we relax the independence assumption and adapt Efron's procedure to adjust for the dependence between features. The performance of the proposed procedure, in terms of statistical power and false discovery rate, is illustrated by simulated data. We apply the new method to three genomic datasets to identify genes associated with colon, cervical and prostate cancers.
Affiliation(s)
- Qingyang Zhang
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR, United States of America
- Yuchun Du
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR, United States of America
|
31
|
Bhattacharya BB. A general asymptotic framework for distribution-free graph-based two-sample tests. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12319] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/30/2022]
|
32
|
Chu L, Chen H. Asymptotic distribution-free change-point detection for multivariate and non-Euclidean data. Ann Stat 2019. [DOI: 10.1214/18-aos1691] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022]
|
33
|
|
34
|
Li J. Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 2018. [DOI: 10.1093/biomet/asy020] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022] Open
Affiliation(s)
- Jun Li
- Department of Statistics, University of California, 1337 Olmsted Hall, Riverside, California 92521, U.S.A
|
35
|
Sarkar S, Ghosh AK. On some high-dimensional two-sample tests based on averages of inter-point distances. Stat (Int Stat Inst) 2018. [DOI: 10.1002/sta4.187] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Affiliation(s)
- Soham Sarkar
- Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India
- Anil K. Ghosh
- Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India
|
36
|
Chen H, Chen X, Su Y. A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data. J Am Stat Assoc 2018. [DOI: 10.1080/01621459.2017.1307757] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 10/19/2022]
Affiliation(s)
- Hao Chen
- Department of Statistics at University of California, Davis, CA
- Xu Chen
- Department of Statistics at Duke University, Durham, NC
- Yi Su
- Department of Statistics at University of California, Davis, CA
|
37
|
Zhang Q. A powerful nonparametric method for detecting differentially co-expressed genes: distance correlation screening and edge-count test. BMC Syst Biol 2018; 12:58. [PMID: 29769129 PMCID: PMC5956795 DOI: 10.1186/s12918-018-0582-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/16/2017] [Accepted: 03/08/2018] [Indexed: 01/24/2023]
Abstract
Background: Differential co-expression analysis, as a complement to differential expression analysis, offers significant insights into the changes in molecular mechanisms of different phenotypes. A prevailing approach to detecting differentially co-expressed genes is to compare Pearson’s correlation coefficients in two phenotypes. However, due to the limitations of Pearson’s correlation measure, this approach lacks the power to detect nonlinear changes in gene co-expression, which are common in gene regulatory networks. Results: In this work, a new nonparametric procedure is proposed to search for differentially co-expressed gene pairs in different phenotypes from large-scale data. Our computational pipeline consists of two main steps, a screening step and a testing step. The screening step reduces the search space by filtering out all the independent gene pairs using the distance correlation measure. In the testing step, we compare the gene co-expression patterns in different phenotypes using a recently developed edge-count test. Both steps are distribution-free and target nonlinear relations. We illustrate the promise of the new approach by analyzing the Cancer Genome Atlas data and the METABRIC data for breast cancer subtypes. Conclusions: Compared with some existing methods, the new method is more powerful in detecting nonlinear types of differential co-expression. The distance correlation screening can greatly improve computational efficiency, facilitating its application to large data sets.
Affiliation(s)
- Qingyang Zhang
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
|