1
Li J. Finite sample t-tests for high-dimensional means. J MULTIVARIATE ANAL 2023; 196:105183. [PMID: 37780727 PMCID: PMC10538523 DOI: 10.1016/j.jmva.2023.105183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
When sample sizes are small, it becomes challenging for an asymptotic test requiring diverging sample sizes to maintain an accurate Type I error rate. In this paper, we consider one-sample, two-sample and ANOVA tests for mean vectors when data are high-dimensional but sample sizes are very small. We establish asymptotic t-distributions of the proposed U-statistics, which only require data dimensionality to diverge but sample sizes to be fixed and no less than 3. The proposed tests maintain accurate Type I error rates for a wide range of sample sizes and data dimensionality. Moreover, the tests are nonparametric and can be applied to data which are normally distributed or heavy-tailed. Simulation studies confirm the theoretical results for the tests. We also apply the proposed tests to an fMRI dataset to demonstrate the practical implementation of the methods.
Affiliation(s)
- Jun Li
- Department of Mathematical Sciences, Kent State University, Kent, OH 44242, USA
2
Chesnaye MA, Bell SL, Harte JM, Simonsen LB, Visram AS, Stone MA, Munro KJ, Simpson DM. Modified T2 Statistics for Improved Detection of Aided Cortical Auditory Evoked Potentials in Hearing-Impaired Infants. Trends Hear 2023; 27:23312165231154035. [PMID: 36847299 PMCID: PMC9974628 DOI: 10.1177/23312165231154035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 12/28/2022] [Accepted: 01/11/2023] [Indexed: 03/01/2023]
Abstract
The cortical auditory evoked potential (CAEP) is a change in neural activity in response to sound, and is of interest for audiological assessment of infants, especially those who use hearing aids. Within this population, CAEP waveforms are known to vary substantially across individuals, which makes detecting the CAEP through visual inspection a challenging task. It also means that some of the best automated CAEP detection methods used in adults are probably not suitable for this population. This study therefore evaluates and optimizes the performance of new and existing methods for aided (i.e., the stimuli are presented through subjects' hearing aid(s)) CAEP detection in infants with hearing loss. Methods include the conventional Hotelling's T2 test, various modified q-sample statistics, and two novel variants of T2 statistics, which were designed to exploit the correlation structure underlying the data. Various additional methods from the literature were also evaluated, including the previously best-performing methods for adult CAEP detection. Data for the assessment consisted of aided CAEPs recorded from 59 infant hearing aid users with mild to profound bilateral hearing loss, and simulated signals. The highest test sensitivities were observed for the modified T2 statistics, followed by the modified q-sample statistics, and lastly by the conventional Hotelling's T2 test, which showed low detection rates for ensemble sizes <80 epochs. The high test sensitivities at small ensemble sizes observed for the modified T2 and q-sample statistics are especially relevant for infant testing, as the time available for data collection tends to be limited in this population.
Affiliation(s)
- Michael Alexander Chesnaye
- Institute of Sound and Vibration Research, Faculty of Engineering and the Environment, University of Southampton, Southampton, UK
- Steven Lewis Bell
- Institute of Sound and Vibration Research, Faculty of Engineering and the Environment, University of Southampton, Southampton, UK
- James Michael Harte
- Interacoustics Research Unit, Technical University of Denmark, Lyngby, Denmark
- Eriksholm Research Centre, Snekkersten, Denmark
- Anisa Sadru Visram
- Manchester Centre for Audiology and Deafness, School of Health Sciences, University of Manchester, Manchester, UK
- Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- Michael Anthony Stone
- Manchester Centre for Audiology and Deafness, School of Health Sciences, University of Manchester, Manchester, UK
- Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- Kevin James Munro
- Manchester Centre for Audiology and Deafness, School of Health Sciences, University of Manchester, Manchester, UK
- Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- David Martin Simpson
- Institute of Sound and Vibration Research, Faculty of Engineering and the Environment, University of Southampton, Southampton, UK
3
Li W, Zhu J. CLT for spiked eigenvalues of a sample covariance matrix from high-dimensional Gaussian mean mixtures. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2022.105127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
4
Ouyang Y, Liu J, Tong T, Xu W. A rank-based high-dimensional test for equality of mean vectors. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
5
Zhang JT, Zhu T. A new normal reference test for linear hypothesis testing in high-dimensional heteroscedastic one-way MANOVA. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2021.107385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
6
Harrar SW, Kong X. Recent developments in high-dimensional inference for multivariate data: Parametric, semiparametric and nonparametric approaches. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2021.104855] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
7
Huang Y, Li C, Li R, Yang S. An overview of tests on high-dimensional means. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2021.104813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
8
Koh JH, Yoon SJ, Kim M, Cho S, Lim J, Park Y, Kim HS, Kwon SW, Kim WU. Lipidome profile predictive of disease evolution and activity in rheumatoid arthritis. Exp Mol Med 2022; 54:143-155. [PMID: 35169224 PMCID: PMC8894401 DOI: 10.1038/s12276-022-00725-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 08/31/2021] [Revised: 10/26/2021] [Accepted: 11/04/2021] [Indexed: 12/14/2022]
Abstract
Lipid mediators are crucial for the pathogenesis of rheumatoid arthritis (RA); however, global analyses have not been undertaken to systematically define the lipidome underlying the dynamics of disease evolution, activation, and resolution. Here, we performed untargeted lipidomics analysis of synovial fluid and serum from RA patients at different disease activities and clinical phases (preclinical phase to active phase to sustained remission). We found that the lipidome profile in RA joint fluid was severely perturbed and that this correlated with the extent of inflammation and severity of synovitis on ultrasonography. The serum lipidome profile of active RA, albeit less prominent than the synovial lipidome, was also distinguishable from that of RA in the sustained remission phase and from that of noninflammatory osteoarthritis. Of note, the serum lipidome profile at the preclinical phase of RA closely mimicked that of active RA. Specifically, alterations in a set of lysophosphatidylcholine, phosphatidylcholine, ether-linked phosphatidylethanolamine, and sphingomyelin subclasses correlated with RA activity, reflecting treatment responses to anti-rheumatic drugs when monitored serially. Collectively, these results suggest that analysis of lipidome profiles is useful for identifying biomarker candidates that predict the evolution of preclinical to definitive RA and could facilitate the assessment of disease activity and treatment outcomes.
Affiliation(s)
- Jung Hee Koh
- Division of Rheumatology, Department of Internal Medicine, the Catholic University of Korea, Seoul, 06591, Republic of Korea; Center for Integrative Rheumatoid Transcriptomics and Dynamics, the Catholic University of Korea, Seoul, 06591, Republic of Korea
- Sang Jun Yoon
- College of Pharmacy, Seoul National University, Seoul, 08826, Republic of Korea
- Mina Kim
- College of Pharmacy, Seoul National University, Seoul, 08826, Republic of Korea
- Seonghun Cho
- Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea
- Johan Lim
- Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea
- Youngjae Park
- Division of Rheumatology, Department of Internal Medicine, the Catholic University of Korea, Seoul, 06591, Republic of Korea
- Hyun-Sook Kim
- Department of Internal Medicine, Soonchunhyang University College of Medicine, Seoul, 04401, Republic of Korea
- Sung Won Kwon
- College of Pharmacy, Seoul National University, Seoul, 08826, Republic of Korea
- Wan-Uk Kim
- Division of Rheumatology, Department of Internal Medicine, the Catholic University of Korea, Seoul, 06591, Republic of Korea; Center for Integrative Rheumatoid Transcriptomics and Dynamics, the Catholic University of Korea, Seoul, 06591, Republic of Korea
9
Jin J, Wang Y. T2-DAG: a powerful test for differentially expressed gene pathways via graph-informed structural equation modeling. Bioinformatics 2022; 38:1005-1014. [PMID: 34755844 PMCID: PMC8796375 DOI: 10.1093/bioinformatics/btab770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Revised: 11/01/2021] [Accepted: 11/04/2021] [Indexed: 02/04/2023]
Abstract
MOTIVATION: A major task in genetic studies is to identify genes related to human diseases and traits to understand functional characteristics of genetic mutations and enhance patient diagnosis. Compared with marginal analyses of individual genes, identification of gene pathways, i.e. a set of genes with known interactions that collectively contribute to specific biological functions, can provide more biologically meaningful results. Such gene pathway analysis can be formulated into a high-dimensional two-sample testing problem. Given the typically limited sample size of gene expression datasets, most existing two-sample tests tend to have compromised powers because they ignore or only inefficiently incorporate the auxiliary pathway information on gene interactions. RESULTS: We propose T2-DAG, a Hotelling's T2-type test for detecting differentially expressed gene pathways, which efficiently leverages the auxiliary pathway information on gene interactions from existing pathway databases through a linear structural equation model. We further establish its asymptotic distribution under pertinent assumptions. Simulation studies under various scenarios show that T2-DAG outperforms several representative existing methods with well-controlled type-I error rates and substantially improved powers, even with incomplete or inaccurate pathway information or unadjusted confounding effects. We also illustrate the performance of T2-DAG in an application to detect differentially expressed KEGG pathways between different stages of lung cancer. AVAILABILITY AND IMPLEMENTATION: The R (R Development Core Team, 2021) package T2DAG, which implements the proposed T2-DAG test, is available on GitHub at https://github.com/Jin93/T2DAG. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jin Jin
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Yue Wang
- School of Mathematical and Natural Sciences, Arizona State University, Glendale, AZ 85306, USA
10
Zhang JT, Zhou B, Guo J. Linear hypothesis testing in high-dimensional heteroscedastic one-way MANOVA: A normal reference L2-norm based test. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2021.104816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
11
Huang D, Chowdhury S, Wang H, Savage SR, Ivey RG, Kennedy JJ, Whiteaker JR, Lin C, Hou X, Oberg AL, Larson MC, Eskandari N, Delisi DA, Gentile S, Huntoon CJ, Voytovich UJ, Shire ZJ, Yu Q, Gygi SP, Hoofnagle AN, Herbert ZT, Lorentzen TD, Calinawan A, Karnitz LM, Weroha SJ, Kaufmann SH, Zhang B, Wang P, Birrer MJ, Paulovich AG. Multiomic analysis identifies CPT1A as a potential therapeutic target in platinum-refractory, high-grade serous ovarian cancer. Cell Rep Med 2021; 2:100471. [PMID: 35028612 PMCID: PMC8714940 DOI: 10.1016/j.xcrm.2021.100471] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 12/18/2020] [Revised: 09/24/2021] [Accepted: 11/19/2021] [Indexed: 12/14/2022]
Abstract
Resistance to platinum compounds is a major determinant of patient survival in high-grade serous ovarian cancer (HGSOC). To understand mechanisms of platinum resistance and identify potential therapeutic targets in resistant HGSOC, we generated a data resource composed of dynamic (±carboplatin) protein, post-translational modification, and RNA sequencing (RNA-seq) profiles from intra-patient cell line pairs derived from 3 HGSOC patients before and after acquiring platinum resistance. These profiles reveal extensive responses to carboplatin that differ between sensitive and resistant cells. Higher fatty acid oxidation (FAO) pathway expression is associated with platinum resistance, and both pharmacologic inhibition and CRISPR knockout of carnitine palmitoyltransferase 1A (CPT1A), which represents a rate limiting step of FAO, sensitize HGSOC cells to platinum. The results are further validated in patient-derived xenograft models, indicating that CPT1A is a candidate therapeutic target to overcome platinum resistance. All multiomic data can be queried via an intuitive gene-query user interface (https://sites.google.com/view/ptrc-cell-line).
Affiliation(s)
- Dongqing Huang
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Shrabanti Chowdhury
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Hong Wang
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Sara R. Savage
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Richard G. Ivey
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Jacob J. Kennedy
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Jeffrey R. Whiteaker
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Chenwei Lin
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Xiaonan Hou
- Department of Oncology, Mayo Clinic, Rochester, MN 55905, USA
- Ann L. Oberg
- Department of Quantitative Health Sciences, Division of Computational Biology, Mayo Clinic, Rochester, MN 55905, USA
- Melissa C. Larson
- Department of Quantitative Health Sciences, Division of Clinical Trials and Biostatistics, Mayo Clinic, Rochester, MN 55905, USA
- Najmeh Eskandari
- Division of Hematology and Oncology, Department of Medicine, University of Illinois, Chicago, IL 60612, USA
- Davide A. Delisi
- Division of Hematology and Oncology, Department of Medicine, University of Illinois, Chicago, IL 60612, USA
- Saverio Gentile
- Division of Hematology and Oncology, Department of Medicine, University of Illinois, Chicago, IL 60612, USA
- Uliana J. Voytovich
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Zahra J. Shire
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Qing Yu
- Department of Cell Biology, Harvard Medical School, Boston, MA 02115, USA
- Steven P. Gygi
- Department of Cell Biology, Harvard Medical School, Boston, MA 02115, USA
- Andrew N. Hoofnagle
- Department of Lab Medicine, University of Washington, Seattle, WA 98195, USA
- Zachary T. Herbert
- Molecular Biology Core Facilities, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Travis D. Lorentzen
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Anna Calinawan
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- S. John Weroha
- Department of Oncology, Mayo Clinic, Rochester, MN 55905, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Pei Wang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Michael J. Birrer
- University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
- Amanda G. Paulovich
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
12
Bulut H. A robust Hotelling test statistic for one sample case in high dimensional data. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1996606] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Hasan Bulut
- Department of Statistics, Ondokuz Mayıs University, Samsun, Turkey
13
14
Cao M, He Y. A high-dimensional test on linear hypothesis of means under a low-dimensional factor model. METRIKA 2021. [DOI: 10.1007/s00184-021-00841-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
15
He Y, Zhang M, Zhang X, Zhou W. High-dimensional two-sample mean vectors test and support recovery with factor adjustment. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2020.107004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
16
Li H, Aue A, Paul D. High-dimensional general linear hypothesis tests via non-linear spectral shrinkage. BERNOULLI 2020. [DOI: 10.3150/19-bej1186] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
17
MAKİNDE O, OMOTOSO O. On some Diagonalized and Regularized Hotelling’s T^2 Tests of Location for High Dimensional Data. GAZI UNIVERSITY JOURNAL OF SCIENCE 2020. [DOI: 10.35378/gujs.642062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
18
Li H, Aue A, Paul D, Peng J, Wang P. An adaptable generalization of Hotelling’s $T^{2}$ test in high dimension. Ann Stat 2020. [DOI: 10.1214/19-aos1869] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
19
Liu Y, Sun W, Reiner AP, Kooperberg C, He Q. Statistical inference of genetic pathway analysis in high dimensions. Biometrika 2019; 106:651. [PMID: 31427824 DOI: 10.1093/biomet/asz033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/03/2017] [Indexed: 11/13/2022]
Abstract
Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension could be greater than the sample size. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.
Affiliation(s)
- Yang Liu
- Department of Mathematics and Statistics, Wright State University, 3640 Colonel Glenn Highway, Dayton, Ohio, U.S.A
- Wei Sun
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Alexander P Reiner
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Charles Kooperberg
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Qianchuan He
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
20
Niu Z, Hu J, Bai Z, Gao W. On LR simultaneous test of high-dimensional mean vector and covariance matrix under non-normality. Stat Probab Lett 2019. [DOI: 10.1016/j.spl.2018.10.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/26/2022]
21
Zhang Q, Hu J, Bai Z. Invariant test based on the modified correction to LRT for the equality of two high-dimensional covariance matrices. Electron J Stat 2019. [DOI: 10.1214/19-ejs1542] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
22
Zhang M, Zhou C, He Y, Zhang X. Adaptive test for mean vectors of high-dimensional time series data with factor structure. J Korean Stat Soc 2018. [DOI: 10.1016/j.jkss.2018.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
23
Hu Z, Tong T, Genton MG. Diagonal likelihood ratio test for equality of mean vectors in high-dimensional data. Biometrics 2018; 75:256-267. [PMID: 30325005 DOI: 10.1111/biom.12984] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/18/2017] [Accepted: 09/14/2018] [Indexed: 11/27/2022]
Abstract
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling's tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not need the requirement that the covariance matrices follow a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and readily applicable in practice. Simulation studies and a real data analysis are also carried out to demonstrate the advantages of our likelihood ratio test methods.
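The contrast between summing squared t-statistics and summing their log-transformed versions can be sketched for the one-sample case. This is a minimal illustration, not the authors' implementation. It assumes the diagonal Hotelling statistic is the plain sum of per-coordinate squared t-statistics, and that the likelihood ratio form is n Σ log(1 + t_j²/(n − 1)), which follows from the univariate normal likelihood ratio applied coordinate by coordinate under a diagonal covariance. The centering and scaling needed for the asymptotic normal calibration are omitted, and all function names are hypothetical.

```python
import math
from statistics import mean, variance


def squared_t_statistics(rows, mu0=0.0):
    """Per-coordinate squared one-sample t-statistics.

    rows: list of n observations, each a list of p floats.
    mu0:  hypothesized mean, assumed equal for every coordinate.
    """
    n = len(rows)
    columns = list(zip(*rows))  # transpose to p columns of length n
    # variance() is the unbiased sample variance (divides by n - 1)
    return [n * (mean(col) - mu0) ** 2 / variance(col) for col in columns]


def diagonal_hotelling(rows, mu0=0.0):
    """Direct summation of the squared t-statistics."""
    return sum(squared_t_statistics(rows, mu0))


def diagonal_lrt(rows, mu0=0.0):
    """Summation of log-transformed squared t-statistics (assumed form)."""
    n = len(rows)
    return n * sum(
        math.log1p(t2 / (n - 1)) for t2 in squared_t_statistics(rows, mu0)
    )
```

Because log(1 + x) grows slowly, a single coordinate with a very large t-statistic contributes far less to the log-transformed sum than to the direct sum.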
Affiliation(s)
- Zongliang Hu
- College of Mathematics and Statistics, Shenzhen University, Shenzhen, 518060, China
- Tiejun Tong
- Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Marc G Genton
- Statistics Program, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
24
25
Accuracy of regularized D-rule for binary classification. J Korean Stat Soc 2018. [DOI: 10.1016/j.jkss.2017.11.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
26
Zhao N, Zhan X, Guthrie KA, Mitchell CM, Larson J. Generalized Hotelling's test for paired compositional data with application to human microbiome studies. Genet Epidemiol 2018; 42:459-469. [PMID: 29737047 DOI: 10.1002/gepi.22127] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 01/25/2018] [Revised: 03/12/2018] [Accepted: 03/29/2018] [Indexed: 02/06/2023]
Abstract
The human microbiome is a dynamic system that changes due to diseases, medication, change in diet, etc. The paired design is a common approach to evaluate the microbial changes while controlling for the inherent differences between people. For example, microbiome data may be collected from the same individuals before and after a treatment. Two challenges exist in analyzing this type of data. First, microbiome data are compositional such that the reads for all taxa in each sample are constrained to sum to a constant. Second, the number of taxa can be much larger than the sample size. Few statistical methods exist to analyze such data besides methods that test one taxon at a time. In this paper, we propose to first conduct a log-ratio transformation of the compositions, and then develop a generalized Hotelling's test (GHT) to evaluate whether the average microbiome compositions are equivalent in the paired samples. We replace the sample covariance matrix in standard Hotelling's statistic by a shrinkage-based covariance, calculated as a weighted average of the sample covariance and a positive definite target matrix. The optimal weighting can be obtained for many commonly used target matrices. We develop a permutation procedure to assess the statistical significance. Extensive simulations show that our proposed method has well-controlled type I error and better power than a few ad hoc approaches. We apply our method to examine the vaginal microbiome changes in response to treatments for menopausal hot flashes. An R package "GHT" is freely available at https://github.com/zhaoni153/GHT.
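The shrinkage idea in this abstract can be sketched in a few lines. The code below is an illustration under stated assumptions, not the GHT package: the input is taken to be the paired differences after a log-ratio transformation, the target matrix is assumed to be the identity, the weight is a fixed user-supplied value rather than the optimal weighting the abstract refers to, and the permutation step is assumed to be random sign flips of each subject's difference vector (equivalent to swapping the two labels within a pair). All function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def shrinkage_t2(diffs, weight=0.5):
    """Hotelling-type statistic with a shrunken covariance (identity target)."""
    n, p = len(diffs), len(diffs[0])
    dbar = [sum(row[j] for row in diffs) / n for j in range(p)]
    cov = [[sum((row[j] - dbar[j]) * (row[k] - dbar[k]) for row in diffs) / (n - 1)
            for k in range(p)] for j in range(p)]
    # Weighted average of the sample covariance and the identity target
    shrunk = [[(1 - weight) * cov[j][k] + (weight if j == k else 0.0)
               for k in range(p)] for j in range(p)]
    z = solve(shrunk, dbar)
    return n * sum(dbar[j] * z[j] for j in range(p))


def sign_flip_pvalue(diffs, weight=0.5, n_perm=999, seed=0):
    """Permutation p-value from random sign flips of the paired differences."""
    rng = random.Random(seed)
    observed = shrinkage_t2(diffs, weight)
    hits = 0
    for _ in range(n_perm):
        flipped = []
        for row in diffs:
            sign = rng.choice((-1.0, 1.0))
            flipped.append([sign * v for v in row])
        if shrinkage_t2(flipped, weight) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With a positive weight on the identity target, the shrunken matrix stays invertible even when the number of taxa exceeds the number of pairs, which is the situation where the standard Hotelling statistic is undefined.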
Affiliation(s)
- Ni Zhao
- Departments of Biostatistics, Johns Hopkins University, Baltimore, Maryland, United States of America
- Xiang Zhan
- Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania, United States of America
- Katherine A Guthrie
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Caroline M Mitchell
- Vincent Center for Reproductive Biology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Joseph Larson
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
27
Wang C, Jiang B. On the dimension effect of regularized linear discriminant analysis. Electron J Stat 2018. [DOI: 10.1214/18-ejs1469] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
28
Wang X, Shojaie A, Zhang Y, Shelley D, Lampe PD, Levy L, Peters U, Potter JD, White E, Lampe JW. Exploratory plasma proteomic analysis in a randomized crossover trial of aspirin among healthy men and women. PLoS One 2017; 12:e0178444. [PMID: 28542447 PMCID: PMC5444835 DOI: 10.1371/journal.pone.0178444] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/25/2016] [Accepted: 05/12/2017] [Indexed: 12/21/2022]
Abstract
Long-term use of aspirin is associated with lower risk of colorectal cancer and other cancers; however, the mechanism of chemopreventive effect of aspirin is not fully understood. Animal studies suggest that COX-2, NFκB signaling and Wnt/β-catenin pathways may play a role, but no clinical trials have systematically evaluated the biological response to aspirin in healthy humans. Using a high-density antibody array, we assessed the difference in plasma protein levels after 60 days of regular dose aspirin (325 mg/day) compared to placebo in a randomized double-blinded crossover trial of 44 healthy non-smoking men and women, aged 21-45 years. The plasma proteome was analyzed on an antibody microarray with ~3,300 full-length antibodies, printed in triplicate. Moderated paired t-tests were performed on individual antibodies, and gene-set analyses were performed based on KEGG and GO pathways. Among the 3,000 antibodies analyzed, statistically significant differences in plasma protein levels were observed for nine antibodies after adjusting for false discoveries (FDR adjusted p-value<0.1). The most significant protein was succinate dehydrogenase subunit C (SDHC), a key enzyme complex of the mitochondrial tricarboxylic acid (TCA) cycle. The other statistically significant proteins (NR2F1, MSI1, MYH1, FOXO1, KHDRBS3, NFKBIE, LYZ and IKZF1) are involved in multiple pathways, including DNA base-pair repair, inflammation and oncogenic pathways. None of the 258 KEGG and 1,139 GO pathways was found to be statistically significant after FDR adjustment. This study suggests several chemopreventive mechanisms of aspirin in humans, which have previously been reported to play a role in anti- or pro-carcinogenesis in cell systems; however, larger, confirmatory studies are needed.
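The abstract reports FDR-adjusted p-values without naming the procedure. As a hedged illustration only, the sketch below assumes the Benjamini-Hochberg step-up adjustment, which is a common choice for this kind of per-antibody screening; the function name is hypothetical and the study may have used a different method.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure).

    Returns the adjusted values in the same order as the input.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An antibody would then be called significant at the abstract's threshold when its adjusted value falls below 0.1.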
Affiliation(s)
- Xiaoliang Wang
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Ali Shojaie
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Yuzheng Zhang
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- David Shelley
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Paul D. Lampe
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Lisa Levy
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Ulrike Peters
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- John D. Potter
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Emily White
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Johanna W. Lampe
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
29
|
|
30
|
High dimensional extension of the growth curve model and its application in genetics. Stat Methods Appl 2016. [DOI: 10.1007/s10260-016-0369-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/21/2022]
|
31
|
Feng L, Zou C, Wang Z. Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1035380] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 10/23/2022]
|
32
|
|
33
|
Dong K, Pang H, Tong T, Genton MG. Shrinkage-based diagonal Hotelling’s tests for high-dimensional small sample size data. J Multivariate Anal 2016. [DOI: 10.1016/j.jmva.2015.08.022] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/17/2022]
|
34
|
Jung K. Statistical Aspects in Proteomic Biomarker Discovery. Methods Mol Biol 2016; 1362:293-310. [PMID: 26519185 DOI: 10.1007/978-1-4939-3106-4_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
In the pursuit of a personalized medicine, i.e., the individual treatment of a patient, many medical decision problems are desired to be supported by biomarkers that can help to make a diagnosis, prediction, or prognosis. Proteomic biomarkers are of special interest since they can not only be detected in tissue samples but can also often be easily detected in diverse body fluids. Statistical methods play an important role in the discovery and validation of proteomic biomarkers. They are necessary in the planning of experiments, in the processing of raw signals, and in the final data analysis. This review provides an overview on the most frequent experimental settings including sample size considerations, and focuses on exploratory data analysis and classifier development.
Affiliation(s)
- Klaus Jung
- Department of Medical Statistics, Georg-August-University Göttingen, Humboldtallee 32, 37073, Göttingen, Germany.
|
35
|
Kruppa J, Jung K. Set-Based Test Procedures for the Functional Analysis of Protein Lists from Differential Analysis. Methods Mol Biol 2015; 1362:143-56. [PMID: 26519175 DOI: 10.1007/978-1-4939-3106-4_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
Abstract
The analysis of most high-throughput proteomics experiments involves the selection of differentially expressed proteins or peptides between two different sets of samples, e.g., from two experimental groups. As a result, a large list of selected features is reported, typically sorted by a measure for the expression fold change and a p-value from a statistical test. The biological interpretation of such a list is usually difficult since the features can typically be assigned to a large variety of biological classes. To facilitate the biological interpretation, set-based procedures focus on the analysis of feature subsets that all belong to the same biological class (e.g., same cellular component, biological process, molecular function, or pathway). Set-based procedures can roughly be divided into "enrichment methods" and "global test procedures," where the first involve all features of an experiment and the second only those features of a particular set. In this chapter we detail the working principle of these kinds of statistical methods and describe how features can be classified into molecular subsets. We illustrate the use of the methods on a data example from a proteomics Parkinson study.
Affiliation(s)
- Jochen Kruppa
  - Department of Medical Statistics, Georg-August-University Göttingen, Humboldtallee 32, 37073, Göttingen, Germany
- Klaus Jung
  - Department of Medical Statistics, Georg-August-University Göttingen, Humboldtallee 32, 37073, Göttingen, Germany
|
36
|
Liu J, Zhong W, Li R. A selective overview of feature screening for ultrahigh-dimensional data. Sci China Math 2015; 58:2033-2054. [PMID: 26779257 PMCID: PMC4711389 DOI: 10.1007/s11425-015-5062-9] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 05/21/2023]
Abstract
High-dimensional data have frequently been collected in many scientific areas including genomewide association study, biomedical imaging, tomography, tumor classifications, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental for high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While the penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of data may grow exponentially with the sample size. This has been called ultrahigh-dimensional data in the literature. This work aims to present a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening on specific models and motivation for the need of model-free feature screening procedures.
Affiliation(s)
- JingYuan Liu
  - Department of Statistics, School of Economics, Xiamen University, Xiamen 361005, China
  - Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China
  - Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China
- Wei Zhong
  - Department of Statistics, School of Economics, Xiamen University, Xiamen 361005, China
  - Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China
  - Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China
- RunZe Li
  - Department of Statistics and The Methodology Center, Pennsylvania State University, University Park, PA 16802-2111, USA
|
37
|
Lee S, Lim J, Sohn I, Jung SH, Park CK. Two sample test for high-dimensional partially paired data. J Appl Stat 2015. [DOI: 10.1080/02664763.2015.1014890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
38
|
An adaptive test for the mean vector in large-p-small-n problems. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2015.03.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
39
|
|
40
|
Estimation of variances and covariances for high-dimensional data: a selective review. WIREs Comput Stat 2014. [DOI: 10.1002/wics.1308] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 11/07/2022]
|
41
|
Zhong W, Zhu L. An iterative approach to distance correlation-based sure independence screening. J Stat Comput Simul 2014. [DOI: 10.1080/00949655.2014.928820] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 10/25/2022]
|
42
|
Chi YY, Gribbin MJ, Johnson JL, Muller KE. Power calculation for overall hypothesis testing with high-dimensional commensurate outcomes. Stat Med 2014; 33:812-27. [PMID: 24122945 PMCID: PMC4072336 DOI: 10.1002/sim.5986] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/24/2012] [Revised: 08/19/2013] [Accepted: 08/21/2013] [Indexed: 11/07/2022]
Abstract
The complexity of system biology means that any metabolic, genetic, or proteomic pathway typically includes so many components (e.g., molecules) that statistical methods specialized for overall testing of high-dimensional and commensurate outcomes are required. While many overall tests have been proposed, very few have power and sample size methods. We develop accurate power and sample size methods and software to facilitate study planning for high-dimensional pathway analysis. With an account of any complex correlation structure between high-dimensional outcomes, the new methods allow power calculation even when the sample size is less than the number of variables. We derive the exact (finite-sample) and approximate non-null distributions of the 'univariate' approach to repeated measures test statistic, as well as power-equivalent scenarios useful to generalize our numerical evaluations. Extensive simulations of group comparisons support the accuracy of the approximations even when the ratio of number of variables to sample size is large. We derive a minimum set of constants and parameters sufficient and practical for power calculation. Using the new methods and specifying the minimum set to determine power for a study of metabolic consequences of vitamin B6 deficiency helps illustrate the practical value of the new results. Free software implementing the power and sample size methods applies to a wide range of designs, including one group pre-intervention and post-intervention comparisons, multiple parallel group comparisons with one-way or factorial designs, and the adjustment and evaluation of covariate effects.
Affiliation(s)
- Yueh-Yun Chi
- Department of Biostatistics, University of Florida, Gainesville, FL, U.S.A
|
43
|
Jung K, Dihazi H, Bibi A, Dihazi GH, Beißbarth T. Adaption of the global test idea to proteomics data with missing values. Bioinformatics 2014; 30:1424-30. [PMID: 24489372 DOI: 10.1093/bioinformatics/btu062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Global test procedures are frequently used in gene expression analysis to study the relationship between a functional subset of RNA transcripts and an experimental group factor. However, these procedures have been rarely used for the analysis of high-throughput data from other sources, such as proteome expression data. The main difficulties in transferring global test procedures from genomics to proteomics data are the more complicated way of obtaining functional annotations and the handling of missing values in some types of proteomics data.
RESULTS: We propose a simple mixed linear model in combination with a permutation procedure and missing values imputation to conduct global tests in proteomics experiments. This new approach is motivated by protein expression data obtained by means of 2-D gel electrophoresis within a mouse experiment of our current research. A simulation study yielded that power and testing level of the mixed model alone can be affected by missing values in the dataset. Imputation of missing values was able to correct for a bias in some simulation settings. Our new approach provides the possibility to rank Gene Ontology (GO) terms associated with protein sets. It is also helpful in the case in which a specific protein is represented by multiple spots on a 2-D gel by considering these spots also as a protein set. Analysis of our data points at correlations between the deficiency of the protein 'calreticulin' and protein sets related to biological processes in the heart muscle.
AVAILABILITY AND IMPLEMENTATION: Our proposed approach is included in the R-package 'RepeatedHighDim', which already contains a global test procedure for gene expression data. The package can be retrieved from http://cran.r-project.org/.
CONTACT: klaus.jung@ams.med.uni-goettingen.de
Affiliation(s)
- Klaus Jung
- Department of Medical Statistics, and Department of Nephrology and Rheumatology, University Medical Center Göttingen, Göttingen 37099, Germany
|
44
|
Chen LS, Prentice RL, Wang P. A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation. Biometrics 2014; 70:312-22. [PMID: 24471933 DOI: 10.1111/biom.12149] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 05/01/2013] [Revised: 12/01/2013] [Accepted: 01/01/2014] [Indexed: 11/29/2022]
Abstract
Missing data rates could depend on the targeted values in many settings, including mass spectrometry-based proteomic profiling studies. Here, we consider mean and covariance estimation under a multivariate Gaussian distribution with non-ignorable missingness, including scenarios in which the dimension (p) of the response vector is equal to or greater than the number (n) of independent observations. A parameter estimation procedure is developed by maximizing a class of penalized likelihood functions that entails explicit modeling of missing data probabilities. The performance of the resulting "penalized EM algorithm incorporating missing data mechanism (PEMM)" estimation procedure is evaluated in simulation studies and in a proteomic data illustration.
Affiliation(s)
- Lin S Chen
- Department of Health Studies, University of Chicago, 5841 S Maryland Ave, Chicago, Illinois, U.S.A
|
45
|
Abstract
Classification and prediction problems using spectral data lead to high-dimensional data sets. Spectral data are, however, different from most other high-dimensional data sets in that information usually varies smoothly with wavelength, suggesting that fitted models should also vary smoothly with wavelength. Functional data analysis, widely used in the analysis of spectral data, meets this objective by changing perspective from the raw spectra to approximations using smooth basis functions. This paper explores linear regression and linear discriminant analysis fitted directly to the spectral data, imposing penalties on the values and roughness of the fitted coefficients, and shows by example that this can lead to better fits than existing standard methodologies.
|
46
|
Abstract
This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.
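As a rough illustration of the screening idea this abstract describes, the sketch below ranks predictors by their sample distance correlation with the response and keeps the top few. It is a minimal sketch under stated assumptions, not the authors' implementation: it handles only scalar predictors and a scalar response (the abstract also covers grouped predictors and multivariate responses), it uses the plain biased sample estimator, and the names `centered_distances`, `mean_product`, `distance_correlation`, `dc_sis` and the `keep` parameter are hypothetical.

```python
import math


def centered_distances(sample):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(sample)
    dist = [[abs(a - b) for b in sample] for a in sample]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(a, b):
    """Average of the entrywise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length scalar samples."""
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(mean_product(a, b), 0.0) / denom)


def dc_sis(predictors, response, keep):
    """Rank predictor columns by distance correlation; return top indices and all scores."""
    scores = [distance_correlation(col, response) for col in predictors]
    ranked = sorted(range(len(predictors)), key=lambda k: scores[k], reverse=True)
    return ranked[:keep], scores
```

No model is specified for the response anywhere in the sketch, which mirrors the model-free property the abstract emphasizes: a predictor that enters only through a nonlinear term (for example, its square) can still receive a high score, whereas a Pearson-correlation ranking could miss it.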
Affiliation(s)
- Runze Li
  - The Pennsylvania State University, Xiamen University & Shanghai University of Finance and Economics
- Wei Zhong
  - The Pennsylvania State University, Xiamen University & Shanghai University of Finance and Economics
- Liping Zhu
  - The Pennsylvania State University, Xiamen University & Shanghai University of Finance and Economics
|