1
Ahfock D, Astle WJ, Richardson S. On randomized sketching algorithms and the Tracy-Widom law. Stat Comput 2023; 33:34. [PMID: 36691583; PMCID: PMC9852177; DOI: 10.1007/s11222-022-10148-5] [Received: 12/12/2021] [Accepted: 09/03/2022]
Abstract
There is an increasing body of work exploring the integration of random projection into algorithms for numerical linear algebra. The primary motivation is to reduce the overall computational cost of processing large datasets. A suitably chosen random projection can be used to embed the original dataset in a lower-dimensional space such that key properties of the original dataset are retained. These algorithms are often referred to as sketching algorithms, as the projected dataset can be used as a compressed representation of the full dataset. We show that random matrix theory, in particular the Tracy-Widom law, is useful for describing the operating characteristics of sketching algorithms in the tall-data regime, when the sample size n is much greater than the number of variables d. Asymptotic large-sample results are of particular interest as this is the regime where sketching is most useful for data compression. In particular, we develop asymptotic approximations for the success rate in generating random subspace embeddings and for the convergence probability of iterative sketching algorithms. We test a number of sketching algorithms on real large high-dimensional datasets and find that the asymptotic expressions give accurate predictions of the empirical performance. Supplementary information: The online version contains supplementary material available at 10.1007/s11222-022-10148-5.
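The subspace-embedding success criterion discussed in this abstract can be illustrated numerically. The sketch below is not the authors' code; the dimensions and the 0.5 threshold are arbitrary assumptions. It draws a Gaussian sketching matrix and checks how far the singular values of the sketched orthonormal basis deviate from 1, which is the standard ε-subspace-embedding condition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20000, 10, 500                      # tall data: n >> d; sketch size m

A = rng.standard_normal((n, d))               # original tall dataset
S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketching matrix

Q, _ = np.linalg.qr(A)                        # orthonormal basis of col(A)
sv = np.linalg.svd(S @ Q, compute_uv=False)

# S is an eps-subspace embedding for A exactly when every singular value of
# S @ Q lies in [1 - eps, 1 + eps]; the deviation concentrates near sqrt(d/m)
eps = max(sv.max() - 1.0, 1.0 - sv.min())
print(f"distortion eps = {eps:.3f}")
```

In the tall-data regime the extreme singular values fluctuate around 1 ± √(d/m), with Tracy-Widom behaviour at the edge, so increasing the sketch size m tightens the embedding.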
Affiliation(s)
- Daniel Ahfock
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
2
Egolf D, Barber Q, Zemp R. Single laser-shot super-resolution photoacoustic tomography with fast sparsity-based reconstruction. Photoacoustics 2021; 22:100258. [PMID: 33816111; PMCID: PMC8005825; DOI: 10.1016/j.pacs.2021.100258] [Received: 11/03/2020] [Revised: 02/24/2021] [Accepted: 03/01/2021]
Abstract
Recently, ℓ1-norm based reconstruction approaches have been used with linear array systems to improve photoacoustic resolution and demonstrate undersampled imaging when there is sufficient sparsity in some domain. However, such approaches have yet to beat the half-wavelength resolution limit. In this paper, the ability to beat the half-wavelength diffraction limit is demonstrated using a 5 MHz ring-array photoacoustic tomography system and ℓ1-norm based reconstruction approaches. We used the array system to image wire targets at ≈2-3 cm depth in both intralipid scattering solution and water. The minimum observable separation was estimated as 70 ± 10 μm, improving on the half-wavelength resolution limit of 145 μm. This improvement was demonstrated even when using a random projection transform to reduce data by 99%, enabling substantially faster reconstruction times. This is the first photoacoustic tomography approach capable of beating the half-wavelength resolution limit with a single laser shot.
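The ℓ1-norm reconstruction idea can be sketched in a toy setting. The example below is illustrative only: it uses a generic 1-D sparse signal and the ISTA solver, not the paper's ring-array system or its reconstruction code, and all sizes are assumptions. It recovers a sparse signal from a random projection that keeps only 30% of the samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 200, 60, 5                 # signal length, measurements, sparsity

x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random projection of the data
y = A @ x                                      # compressed measurements

# ISTA: iterative soft-thresholding for  min_z 0.5*||A z - y||^2 + lam*||z||_1
lam = 0.01
L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient
z = np.zeros(n)
for _ in range(2000):
    g = z - (A.T @ (A @ z - y)) / L            # gradient step
    z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold

rel_err = np.linalg.norm(z - x) / np.linalg.norm(x)
print(f"relative recovery error = {rel_err:.3f}")
```

The same ℓ1 principle underlies the paper's reconstruction: when the image is sparse in some domain, far fewer random measurements than Nyquist sampling would suggest suffice for accurate recovery.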
3
Abstract
Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson-Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.
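The single-pass sketched-regression setup described here can be sketched in code. This is a minimal illustration on synthetic data (sizes and coefficients are arbitrary assumptions), using the Clarkson-Woodruff (CountSketch) projection the abstract names: each observation is hashed into one of m buckets with a random sign, so the sketch is built in a single pass over the data, and ordinary least squares is then solved on the m-row surrogate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 100_000, 5, 2000          # huge sample, few covariates, sketch size

X = rng.standard_normal((n, d))
beta = np.arange(1.0, d + 1)        # true coefficients 1, 2, ..., 5
y = X @ beta + rng.standard_normal(n)

# Clarkson-Woodruff sketch: hash each row to one of m buckets with a random
# sign; the sketched data [SX | Sy] is accumulated in one pass
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
SX = np.zeros((m, d))
Sy = np.zeros(m)
np.add.at(SX, rows, signs[:, None] * X)
np.add.at(Sy, rows, signs * y)

beta_sketch, *_ = np.linalg.lstsq(SX, Sy, rcond=None)
max_err = np.abs(beta_sketch - beta).max()
print(f"max coefficient error = {max_err:.3f}")
```

Treating the m sketched rows as a random sample, as the abstract proposes, makes the extra estimation error of `beta_sketch` relative to full-data OLS amenable to standard distributional analysis.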
Affiliation(s)
- D. C. Ahfock
- MRC Biostatistics Unit, University of Cambridge, Robinson Way, Cambridge CB2 0SR, U.K
- W. J. Astle
- MRC Biostatistics Unit, University of Cambridge, Robinson Way, Cambridge CB2 0SR, U.K
- S. Richardson
- MRC Biostatistics Unit, University of Cambridge, Robinson Way, Cambridge CB2 0SR, U.K
4
Mirniaharikandehei S, Heidari M, Danala G, Lakshmivarahan S, Zheng B. Applying a random projection algorithm to optimize machine learning model for predicting peritoneal metastasis in gastric cancer patients using CT images. Comput Methods Programs Biomed 2021; 200:105937. [PMID: 33486339; PMCID: PMC7920928; DOI: 10.1016/j.cmpb.2021.105937] [Received: 08/25/2020] [Accepted: 01/06/2021]
Abstract
BACKGROUND AND OBJECTIVE Non-invasively predicting the risk of cancer metastasis before surgery can play an essential role in determining which patients can benefit from neoadjuvant chemotherapy. This study aims to investigate and test the advantages of applying a random projection algorithm to develop and optimize a radiomics-based machine learning model to predict peritoneal metastasis in gastric cancer patients using a small and imbalanced computed tomography (CT) image dataset. METHODS A retrospective dataset involving CT images acquired from 159 patients is assembled, including 121 and 38 cases with and without peritoneal metastasis, respectively. A computer-aided detection scheme is first applied to segment primary gastric tumor volumes and initially compute 315 image features. Then, five gradient boosting machine (GBM) models embedded with five feature selection methods (the random projection algorithm, principal component analysis, least absolute shrinkage and selection operator, maximum relevance and minimum redundancy, and recursive feature elimination), along with a synthetic minority oversampling technique, are built to predict the risk of peritoneal metastasis. All GBM models are trained and tested using a leave-one-case-out cross-validation method. RESULTS Results show that the GBM model embedded with the random projection algorithm yields a significantly higher prediction accuracy (71.2%) than the other four GBM models (p<0.05). The precision, sensitivity, and specificity of this optimal GBM model are 65.78%, 43.10%, and 87.12%, respectively. CONCLUSIONS This study demonstrates that CT images of primary gastric tumors contain discriminatory information to predict the risk of peritoneal metastasis, and that a random projection algorithm is a promising method to generate an optimal feature vector, improving the performance of machine-learning-based prediction models.
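The feature-reduction step can be mimicked on synthetic data. The sketch below is not the study's pipeline: the 315 "radiomics" features are simulated, and a simple nearest-centroid rule stands in for the GBM classifier. It shows the mechanics of projecting a small, wide feature matrix to a few dimensions with a Gaussian random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 159, 315, 10               # cases, radiomics features, projected dim

# synthetic stand-in for the study's radiomics features
labels = (np.arange(n) < 38).astype(int)       # 38 "positive" cases out of 159
X = rng.standard_normal((n, p)) + 2.0 * labels[:, None]

R = rng.standard_normal((p, k)) / np.sqrt(k)   # Gaussian random projection
Z = X @ R                                      # n x k reduced feature matrix

# Johnson-Lindenstrauss: pairwise distances are roughly preserved, so a
# distance-based rule still separates the classes in the reduced space
centroids = np.array([Z[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - centroids) ** 2).sum(axis=-1), axis=1)
acc = (pred == labels).mean()
print(f"training accuracy = {acc:.3f}")
```

Unlike PCA or LASSO, the projection matrix here is data-independent, which is one reason random projection is attractive when n is small relative to p.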
Affiliation(s)
- Morteza Heidari
- School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK 73019, USA
- Gopichandh Danala
- School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK 73019, USA
- Bin Zheng
- School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK 73019, USA
5
Abstract
Kernel canonical correlation analysis (KCCA) is a popular tool as a nonlinear extension of canonical correlation analysis. Consistency and the optimal convergence rate have been established in the literature. However, the time complexity of KCCA scales as O(n³) and is thus prohibitive when n is large. We propose an m-dimensional randomized sketches approach for KCCA with m ≪ n, based on the recent work on randomized sketches for kernel ridge regression (KRR). Technically, we establish our theoretical results by relying on an interesting connection between KCCA and KRR, utilizing a novel "duality tracking" device that alternates between the infinite-dimensional operator-theory-based view of KCCA and the finite-dimensional kernel-matrix-based view.
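The m-dimensional sketching idea can be sketched in the KRR setting that the paper builds on. This is a toy illustration with assumed kernel bandwidth, regularization, and sketch size, not the authors' KCCA procedure: the KRR coefficient vector is restricted to the range of an m×n Gaussian sketch, so only m-dimensional systems need to be solved:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, lam = 1000, 50, 1e-3           # sample size, sketch dimension, ridge

x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.1 * rng.standard_normal(n)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # Gaussian kernel matrix

# restrict the KRR coefficients to alpha = S.T @ b for an m x n Gaussian
# sketch S; the normal equations then involve only m x m matrices
S = rng.standard_normal((m, n)) / np.sqrt(m)
KS = K @ S.T                                        # n x m
b, *_ = np.linalg.lstsq(S @ K @ KS + lam * S @ KS, KS.T @ y, rcond=None)
f = KS @ b                                          # fitted values

rmse = np.sqrt(np.mean((f - np.sin(x)) ** 2))
print(f"RMSE against the true function = {rmse:.3f}")
```

The dominant cost drops from the O(n³) solve of exact KRR to forming and solving sketched systems, which is the computational saving the paper transfers to KCCA.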
Affiliation(s)
- Heng Lian
- Department of Mathematics, City University of Hong Kong, Hong Kong; City University of Hong Kong Shenzhen Research Institute, Shenzhen, China.
- Fode Zhang
- Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, 611130, China
- Wenqi Lu
- School of Management, Fudan University, Shanghai, China; Department of Mathematics, City University of Hong Kong, Hong Kong
6
Chen ZH, You ZH, Li LP, Wang YB, Qiu Y, Hu PW. Identification of self-interacting proteins by integrating random projection classifier and finite impulse response filter. BMC Genomics 2019; 20:928. [PMID: 31881833; PMCID: PMC6933882; DOI: 10.1186/s12864-019-6301-1] Open access.
Abstract
Background Identification of protein-protein interactions (PPIs) is crucial for understanding biological processes and investigating the cellular functions of genes. Self-interacting proteins (SIPs), in which two or more identical protein molecules can interact with each other, are a specific type of PPI. More and more researchers are drawing attention to SIP detection, and several prediction models have been proposed, but some problems remain. Hence, there is an urgent need for an efficient computational model for SIP prediction. Results In this study, we developed an effective model to predict SIPs, called RP-FIRF, which merges a Random Projection (RP) classifier and a Finite Impulse Response Filter (FIRF). More specifically, each protein sequence is first transformed into a Position Specific Scoring Matrix (PSSM) by exploiting Position Specific Iterated BLAST (PSI-BLAST). Then, to effectively extract discriminative SIP features and improve prediction performance, a FIRF method is applied to the PSSM. The RP classifier is then used to perform classification and predict novel SIPs. We evaluated the performance of the proposed RP-FIRF model and compared it with the state-of-the-art support vector machine (SVM) on human and yeast datasets, respectively. The proposed model achieves high average accuracies of 97.89% and 97.35% using five-fold cross-validation. To further evaluate its performance, we also compared it with six other existing methods; the experimental results demonstrate that the capacity of our model surpasses that of the previous approaches. Conclusion Experimental results show that self-interacting proteins are accurately predicted by the proposed model on the human and yeast datasets, respectively. Thus, the RP-FIRF model is an automatic decision support method that should provide useful insights into the recognition of SIPs.
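A heavily simplified stand-in for the RP-FIRF feature pipeline is sketched below. The filter taps, dimensions, and the synthetic PSSM are all illustrative assumptions: real PSSMs come from PSI-BLAST, and the paper's FIRF design and classifier differ. The sketch only shows the two mechanical steps the abstract names, FIR filtering of the PSSM followed by a random projection:

```python
import numpy as np

rng = np.random.default_rng(5)

def rp_firf_features(pssm, taps, R):
    """FIR-filter each PSSM column, then randomly project the flattened result."""
    filtered = np.apply_along_axis(
        lambda col: np.convolve(col, taps, mode="same"), 0, pssm)
    return filtered.ravel() @ R

L_seq, A, k = 50, 20, 32             # sequence length, alphabet size, feature dim
taps = np.ones(5) / 5.0              # illustrative moving-average FIR filter
R = rng.standard_normal((L_seq * A, k)) / np.sqrt(k)   # random projection

pssm = rng.standard_normal((L_seq, A))   # synthetic PSSM stand-in
feat = rp_firf_features(pssm, taps, R)
print(feat.shape)
```

The fixed-length projected vector `feat` is what a downstream classifier would consume, regardless of the original sequence length once a common PSSM size is fixed.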
Affiliation(s)
- Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Zhu-Hong You
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Li-Ping Li
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
- Yan-Bin Wang
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
- Yu Qiu
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China; University of Chinese Academy of Sciences, Beijing, 100049, China
7
Naseri A, Liu X, Tang K, Zhang S, Zhi D. RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome Biol 2019; 20:143. [PMID: 31345249; PMCID: PMC6659282; DOI: 10.1186/s13059-019-1754-8] [Received: 03/26/2019] [Accepted: 07/03/2019] Open access.
Abstract
While genetic relatedness, usually manifested as segments identical by descent (IBD), is ubiquitous in modern large biobanks, current IBD detection methods are not efficient at such a scale. Here, we describe an efficient method, RaPID, for detecting IBD segments in a panel with phased haplotypes. RaPID achieves time and space complexity linear in the input size and the number of reported IBDs. With simulation, we showed that RaPID is orders of magnitude faster than existing methods while offering competitive power and accuracy. In UK Biobank, RaPID identified 3,335,807 IBDs with a length ≥ 10 cM among 223,507 male X chromosomes in 11 min.
Affiliation(s)
- Ardalan Naseri
- Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
- Xiaoming Liu
- USF Genomics, College of Public Health, University of South Florida, Tampa, FL, 33612, USA
- Kecong Tang
- Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
- Degui Zhi
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Department of Epidemiology, Human Genetics & Environmental Sciences, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
8
Zoh RS, Sarkar A, Carroll RJ, Mallick BK. A Powerful Bayesian Test for Equality of Means in High Dimensions. J Am Stat Assoc 2018; 113:1733-1741. [PMID: 30739967; PMCID: PMC6364997; DOI: 10.1080/01621459.2017.1371024] [Received: 09/01/2015] [Revised: 08/01/2017]
Abstract
We develop a Bayes factor based testing procedure for comparing two population means in high dimensional settings. In 'large-p-small-n' settings, Bayes factors based on proper priors require eliciting a large and complex p×p covariance matrix, whereas Bayes factors based on the Jeffreys prior suffer the same impediment as the classical Hotelling T² test statistic, as they involve inversion of ill-conditioned sample covariance matrices. To circumvent this limitation, we propose that the Bayes factor be based on lower-dimensional random projections of the high-dimensional data vectors. We choose the prior under the alternative to maximize the power of the test for a fixed threshold level, yielding a restricted most powerful Bayesian test (RMPBT). The final test statistic is based on the ensemble of Bayes factors corresponding to multiple replications of randomly projected data. We show that the test is unbiased and, under mild conditions, is also locally consistent. We demonstrate the efficacy of the approach through simulated and real data examples.
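The role of the projections can be illustrated with a simplified frequentist analogue of the procedure (this is not the Bayes-factor construction; dimensions, shift size, and the number of replications are arbitrary assumptions): Hotelling's T² is computed on randomly projected data, where the pooled covariance is invertible, and averaged over replications:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, p, k = 30, 30, 500, 10       # large p, small n; project down to k dims

X = rng.standard_normal((n1, p))
Y = rng.standard_normal((n2, p)) + 0.5   # mean shift: the alternative holds

# ensemble over random projections: in k dimensions the pooled covariance is
# invertible, so Hotelling's T^2 is well defined on the projected data
stats = []
for _ in range(20):
    R = rng.standard_normal((p, k)) / np.sqrt(k)
    U, V = X @ R, Y @ R
    diff = U.mean(axis=0) - V.mean(axis=0)
    Sp = ((n1 - 1) * np.cov(U.T) + (n2 - 1) * np.cov(V.T)) / (n1 + n2 - 2)
    stats.append((n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(Sp, diff))

mean_t2 = float(np.mean(stats))
print(f"mean projected T^2 = {mean_t2:.1f}")
```

Averaging over projections reduces the sensitivity of any single projection to an unlucky direction, mirroring the paper's ensemble of Bayes factors.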
Affiliation(s)
- Roger S Zoh
- Department of Epidemiology & Biostatistics, Texas A&M University, 1266 TAMU, College Station, TX 77843-1266, USA
- Abhra Sarkar
- Department of Statistical Science, Duke University, Box 90251, Durham NC 27708-0251, USA
- Raymond J Carroll
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, USA; School of Mathematical Sciences, University of Technology, Sydney, Broadway NSW 2007, Australia
- Bani K Mallick
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, USA
9
Abstract
Principal component analysis (PCA) is a widespread technique for data analysis that relies on the covariance/correlation matrix of the analyzed data. However, to work properly with high-dimensional data sets, PCA poses severe mathematical constraints on the minimum number of different replicates, or samples, that must be included in the analysis. Generally, improper sampling is due to a small number of data points with respect to the number of degrees of freedom that characterize the ensemble. In the field of life sciences it is often important to have an algorithm that can accept poorly dimensioned data sets, including degenerate ones. Here a new random projection algorithm is proposed, in which a random symmetric matrix surrogates the covariance/correlation matrix of PCA while maintaining the data clustering capacity. We demonstrate that what is important for the clustering efficiency of PCA is not the exact form of the covariance/correlation matrix, but simply its symmetry.
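The central claim can be probed with a small experiment. This is a loose illustration, not the paper's algorithm: the cluster geometry and dimensions are arbitrary assumptions, and the covariance matrix is simply replaced by a random symmetric (Wigner-type) matrix whose leading eigenvectors define the projection:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 20                        # few samples relative to dimension

# two synthetic clusters, separated along every coordinate
X = rng.standard_normal((n, d))
X[n // 2:] += 4.0
X -= X.mean(axis=0)                  # center, as PCA would

# surrogate "PCA": eigenvectors of a random symmetric matrix replace the
# eigenvectors of the covariance matrix
M = rng.standard_normal((d, d))
W = (M + M.T) / 2.0                  # random symmetric (Wigner-type) matrix
_, vecs = np.linalg.eigh(W)
scores = X @ vecs[:, -2:]            # project onto two surrogate components

gap = np.abs(scores[: n // 2].mean(axis=0) - scores[n // 2:].mean(axis=0))
print("between-cluster separation along surrogate axes:", gap)
```

Because the eigenvectors of a symmetric matrix form an orthonormal basis, projections onto a few of them typically retain a share of any large between-cluster separation, which is the flavour of the symmetry argument in the abstract.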
Affiliation(s)
- Luigi Leonardo Palese
- University of Bari "Aldo Moro", Department of Basic Medical Sciences, Neurosciences and Sense Organs (SMBNOS), Bari 70124, Italy.
10
Abstract
Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically on the order of n³ where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.
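The linear-projection approach can be sketched for the conjugate Gaussian case (a toy illustration with assumed kernel and noise settings, not the authors' estimator): conditioning on Φy instead of y means the matrix to invert is m×m rather than n×n, while every data point still contributes to the compressed observations:

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, s2 = 500, 60, 0.05             # data points, projected dim, noise variance

x = np.linspace(0.0, 1.0, n)
f_true = np.sin(2 * np.pi * x)
y = f_true + np.sqrt(s2) * rng.standard_normal(n)

K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1**2))  # squared-exp kernel

# condition on Phi @ y instead of y: the matrix to invert is m x m, not n x n
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
A = Phi @ (K + s2 * np.eye(n)) @ Phi.T
f_hat = K @ Phi.T @ np.linalg.solve(A, Phi @ y)   # posterior mean given Phi y

rmse = np.sqrt(np.mean((f_hat - f_true) ** 2))
print(f"RMSE of the compressed GP fit = {rmse:.3f}")
```

Unlike subset-of-data schemes, no region of the input space is discarded outright; the projection mixes all n observations into the m compressed ones.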
Affiliation(s)
- Anjishnu Banerjee
- Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708-0251, U.S.A