1. Zhang Z, He J, Cao J, Li S. Maximum Decentral Projection Margin Classifier for High Dimension and Low Sample Size problems. Neural Netw 2023; 157:147-159. [DOI: 10.1016/j.neunet.2022.10.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 09/05/2022] [Accepted: 10/13/2022] [Indexed: 11/09/2022]
2. Zhu Y, Dai F, Maitra R. Fully Three-dimensional Radial Visualization. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2021.2020129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Yifan Zhu
  - Department of Statistics, Iowa State University, Ames, Iowa
- Fan Dai
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan
- Ranjan Maitra
  - Department of Statistics, Iowa State University, Ames, Iowa
3. Mojiri A, Khalili A, Zeinal Hamadani A. New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification. Electron J Stat 2022. [DOI: 10.1214/21-ejs1939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Arezou Mojiri
  - Department of Mathematical Sciences, Isfahan University of Technology, Isfahan, 19395-5746, Iran
- Abbas Khalili
  - Department of Mathematics and Statistics, McGill University, Montreal, H3A 0B9, Canada
- Ali Zeinal Hamadani
  - Department of Industrial and Systems Engineering, Isfahan University of Technology, Isfahan, 19395-5746, Iran
4. Laa U, Cook D, Lee S. Burning Sage: Reversing the Curse of Dimensionality in the Visualization of High-Dimensional Data. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1963264] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/20/2022]
Affiliation(s)
- Ursula Laa
  - School of Physics and Astronomy, Monash University, Melbourne, Australia
  - Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
  - Institute of Statistics, University of Natural Resources and Life Sciences, Vienna, Austria
- Dianne Cook
  - Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
- Stuart Lee
  - Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
  - Molecular Medicine Division, Walter and Eliza Hall Institute, Parkville, Australia
5. Baek S, Park H, Park J. A high‐dimensional classification rule using sample covariance matrix equipped with adjusted estimated eigenvalues. Stat (Int Stat Inst) 2021. [DOI: 10.1002/sta4.358] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Affiliation(s)
- Seungchul Baek
  - Department of Mathematics and Statistics, University of Maryland Baltimore County, Baltimore, Maryland 21250, USA
- Hoyoung Park
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
- Junyong Park
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
6. Vogelstein JT, Bridgeford EW, Tang M, Zheng D, Douville C, Burns R, Maggioni M. Supervised dimensionality reduction for big data. Nat Commun 2021; 12:2872. [PMID: 34001899 PMCID: PMC8129083 DOI: 10.1038/s41467-021-23102-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 08/09/2020] [Accepted: 03/26/2021] [Indexed: 11/25/2022]
Abstract
To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
Affiliation(s)
- Minh Tang
  - Johns Hopkins University, Baltimore, MD, USA
- Da Zheng
  - Johns Hopkins University, Baltimore, MD, USA
7. Nakayama Y. Robust support vector machine for high-dimensional imbalanced data. COMMUN STAT-SIMUL C 2021. [DOI: 10.1080/03610918.2019.1586922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Yugo Nakayama
  - Graduate School of Pure and Applied Sciences, University of Tsukuba, Tsukuba-shi, Ibaraki, Japan
9. Carmichael I, Marron JS. Geometric insights into support vector machine behavior using the KKT conditions. Electron J Stat 2021. [DOI: 10.1214/21-ejs1902] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
10. Chang W, Ahn J, Jung S. Double data piling leads to perfect classification. Electron J Stat 2021. [DOI: 10.1214/21-ejs1945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Woonyoung Chang
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Jeongyoun Ahn
  - Department of Industrial and Systems Engineering, KAIST, Daejeon 34141, South Korea
- Sungkyu Jung
  - Department of Statistics, Seoul National University, Seoul 08826, South Korea
11. Ahn J, Chung HC, Jeon Y. Trace Ratio Optimization for High-Dimensional Multi-Class Discrimination. J Comput Graph Stat 2020. [DOI: 10.1080/10618600.2020.1807352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Jeongyoun Ahn
  - Department of Statistics, University of Georgia, Athens, GA
- Yongho Jeon
  - Department of Applied Statistics, Yonsei University, Seoul, Republic of Korea
12. Shen L, Yin Q. Data maximum dispersion classifier in projection space for high-dimension low-sample-size problems. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105420] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/19/2022]
13. Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings. ANN I STAT MATH 2019. [DOI: 10.1007/s10463-019-00727-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/26/2022]
15. Bolivar-Cime A, Cordova-Rodriguez LM. Binary discrimination methods for high-dimensional data with a geometric representation. COMMUN STAT-THEOR M 2018. [DOI: 10.1080/03610926.2017.1342838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- A. Bolivar-Cime
  - Universidad Juárez Autónoma de Tabasco, División Académica de Ciencias Básicas, Cunduacán, Tabasco, México
- L. M. Cordova-Rodriguez
  - Universidad Juárez Autónoma de Tabasco, División Académica de Ciencias Básicas, Cunduacán, Tabasco, México
16. Ahn J, Lee MH, Lee JA. Distance-based outlier detection for high dimension, low sample size data. J Appl Stat 2018. [DOI: 10.1080/02664763.2018.1452901] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 10/17/2022]
Affiliation(s)
- Jeongyoun Ahn
  - Department of Statistics, University of Georgia, Athens, GA, USA
- Myung Hee Lee
  - Center for Global Health, Department of Medicine, Weill Cornell Medical College, New York City, NY, USA
- Jung Ae Lee
  - Agricultural Statistics Laboratory, University of Arkansas, Fayetteville, AR, USA
17. Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models. ANN I STAT MATH 2018. [DOI: 10.1007/s10463-018-0655-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 11/25/2022]
18. Aoshima M, Shen D, Shen H, Yata K, Zhou YH, Marron JS. A survey of high dimension low sample size asymptotics. AUST NZ J STAT 2018; 60:4-19. [PMID: 30197552 PMCID: PMC6124695 DOI: 10.1111/anzs.12212] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 10/17/2022]
Abstract
Peter Hall's work illuminated many aspects of statistical thought, some of which are very well known including the bootstrap and smoothing. However, he also explored many other lesser known aspects of mathematical statistics. This is a survey of one of those areas, initiated by a seminal paper in 2005, on high dimension low sample size asymptotics. An interesting characteristic of that first paper, and of many of the following papers, is that they contain deep and insightful concepts which are frequently surprising and counter-intuitive, yet have mathematical underpinnings which tend to be direct and not difficult to prove.
Affiliation(s)
- Makoto Aoshima
  - Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Dan Shen
  - Interdisciplinary Data Sciences Consortium, Department of Mathematics & Statistics, University of South Florida, FL 33620, USA
- Haipeng Shen
  - Innovation and Information Management, Faculty of Business and Economics, University of Hong Kong, Hong Kong
- Kazuyoshi Yata
  - Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Yi-Hui Zhou
  - Bioinformatics Research Center, Departments of Biological Sciences, North Carolina State University, USA
- J. S. Marron
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27514, USA
19. Lu Q, Qiao X. Sparse Fisher's linear discriminant analysis for partially labeled data. Stat Anal Data Min 2017. [DOI: 10.1002/sam.11367] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Affiliation(s)
- Qiyi Lu
  - Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, New York 13902‐6000
- Xingye Qiao
  - Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, New York 13902‐6000
20. Nakayama Y, Yata K, Aoshima M. Support vector machine and its bias correction in high-dimension, low-sample-size settings. J Stat Plan Inference 2017. [DOI: 10.1016/j.jspi.2017.05.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/26/2022]
21. Benito M, García‐Portugués E, Marron JS, Peña D. Distance‐weighted discrimination of face images for gender classification. Stat (Int Stat Inst) 2017. [DOI: 10.1002/sta4.151] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Affiliation(s)
- Mónica Benito
  - Department of Statistics, Carlos III University of Madrid, Madrid 28903, Spain
- J. S. Marron
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA
- Daniel Peña
  - Department of Statistics, Carlos III University of Madrid, Madrid 28903, Spain
  - Institute of Financial Big Data, Carlos III University of Madrid, Madrid 28903, Spain
22. Li G, Jung S. Incorporating covariates into integrated factor analysis of multi‐view data. Biometrics 2017; 73:1433-1442. [DOI: 10.1111/biom.12698] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/01/2016] [Revised: 02/01/2017] [Accepted: 03/01/2017] [Indexed: 12/19/2022]
Affiliation(s)
- Gen Li
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York 10032, U.S.A.
- Sungkyu Jung
  - Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, U.S.A.
26. Marron JS, Alonso AM. Overview of object oriented data analysis. Biom J 2014; 56:732-53. [PMID: 24421177 DOI: 10.1002/bimj.201300072] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Received: 04/22/2013] [Revised: 10/28/2013] [Accepted: 11/02/2013] [Indexed: 11/09/2022]
Abstract
Object oriented data analysis is the statistical analysis of populations of complex objects. In the special case of functional data analysis, these data objects are curves, where a variety of Euclidean approaches, such as principal components analysis, have been very successful. Challenges in modern medical image analysis motivate the statistical analysis of populations of more complex data objects that are elements of mildly non-Euclidean spaces, such as lie groups and symmetric spaces, or of strongly non-Euclidean spaces, such as spaces of tree-structured data objects. These new contexts for object oriented data analysis create several potentially large new interfaces between mathematics and statistics. The notion of object oriented data analysis also impacts data analysis, through providing a framework for discussion of the many choices needed in many modern complex data analyses, especially in interdisciplinary contexts.
Affiliation(s)
- J Steve Marron
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA
- Andrés M Alonso
  - Department of Statistics and INEACU, Universidad Carlos III de Madrid, Calle Madrid 126, 28903, Getafe, Spain
28. Bolivar-Cime A, Marron J. Comparison of binary discrimination methods for high dimension low sample size data. J MULTIVARIATE ANAL 2013. [DOI: 10.1016/j.jmva.2012.10.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
29. Feragen A, Owen M, Petersen J, Wille MMW, Thomsen LH, Dirksen A, de Bruijne M. Tree-space statistics and approximations for large-scale analysis of anatomical trees. INFORMATION PROCESSING IN MEDICAL IMAGING : PROCEEDINGS OF THE ... CONFERENCE 2013; 23:74-85. [PMID: 24683959 DOI: 10.1007/978-3-642-38868-2_7] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 01/01/2023]
Abstract
Statistical analysis of anatomical trees is hard to perform due to differences in the topological structure of the trees. In this paper we define statistical properties of leaf-labeled anatomical trees with geometric edge attributes by considering the anatomical trees as points in the geometric space of leaf-labeled trees. This tree-space is a geodesic metric space where any two trees are connected by a unique shortest path, which corresponds to a tree deformation. However, tree-space is not a manifold, and the usual strategy of performing statistical analysis in a tangent space and projecting onto tree-space is not available. Using tree-space and its shortest paths, a variety of statistical properties, such as mean, principal component, hypothesis testing and linear discriminant analysis can be defined. For some of these properties it is still an open problem how to compute them; others (like the mean) can be computed, but efficient alternatives are helpful in speeding up algorithms that use means iteratively, like hypothesis testing. In this paper, we take advantage of a very large dataset (N = 8016) to obtain computable approximations, under the assumption that the data trees parametrize the relevant parts of tree-space well. Using the developed approximate statistics, we illustrate how the structure and geometry of airway trees vary across a population and show that airway trees with Chronic Obstructive Pulmonary Disease come from a different distribution in tree-space than healthy ones. Software is available from http://image.diku.dk/aasa/software.php.
32. Zhang L, Lin X. Some considerations of classification for high dimension low-sample size data. Stat Methods Med Res 2011; 22:537-50. [PMID: 22116342 DOI: 10.1177/0962280211428387] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/17/2022]
Abstract
We review in this article several classification methods, especially for high-dimensional and low-sample size data. We discuss several desirable properties for classifiers in such settings, including predictability, consistency, generality, stability, robustness and sparsity. Specifically, a good classifier should have a small prediction error (predictability); converge to the Bayes-rule classifier asymptotically (consistency); be stable when adding/removing an observation (generality); be stable for different data sets of the same kind (stochastic stability); be stable when there are a small number of contaminated observations (robustness); and have a small number of variables in the classifier (interpretability or sparsity). Several simulation examples and real applications are used to illustrate the usefulness of the existing popular classifiers and compare their performance.
Affiliation(s)
- Lingsong Zhang
  - Department of Statistics, Purdue University, West Lafayette, IN, USA
33. Wei S, Nobel AB. Comment. J Am Stat Assoc 2011. [DOI: 10.1198/jasa.2011.tm11322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
34. Cherkassky V, Dhar S, Dai W. Practical Conditions for Effectiveness of the Universum Learning. IEEE Trans Neural Netw 2011; 22:1241-55. [DOI: 10.1109/tnn.2011.2157522] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 11/08/2022]
35. Qiao X, Zhang HH, Liu Y, Todd MJ, Marron JS. Weighted Distance Weighted Discrimination and Its Asymptotic Properties. J Am Stat Assoc 2010; 105:401-414. [PMID: 21152360 PMCID: PMC2996856 DOI: 10.1198/jasa.2010.tm08487] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.0] [Indexed: 11/21/2022]
Abstract
While Distance Weighted Discrimination (DWD) is an appealing approach to classification in high dimensions, it was designed for balanced datasets. In the case of unequal costs, biased sampling, or unbalanced data, there are major improvements available, using appropriately weighted versions of DWD (wDWD). A major contribution of this paper is the development of optimal weighting schemes for various nonstandard classification problems. In addition, we discuss several alternative criteria and propose an adaptive weighting scheme (awDWD) and demonstrate its advantages over nonadaptive weighting schemes under some situations. The second major contribution is a theoretical study of weighted DWD. Both high-dimensional low sample-size asymptotics and Fisher consistency of DWD are studied. The performance of weighted DWD is evaluated using simulated examples and two real data examples. The theoretical results are also confirmed by simulations.
Affiliation(s)
- Xingye Qiao
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599
- Hao Helen Zhang
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695
- Yufeng Liu
  - Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599
- Michael J. Todd
  - School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853
- J. S. Marron
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599