1
Sen Puliparambil B, Tomal JH, Yan Y. A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data. Biology (Basel) 2022; 11:1495. [PMID: 36290397 PMCID: PMC9598401 DOI: 10.3390/biology11101495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from high-dimensional scRNA-seq data. Although multiple machine learning methods, such as penalized regression, appear suitable for feature selection, there has been no rigorous comparison of their performance across data sets, each of which poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. For the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) in terms of area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
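The grouping step described in this abstract (hierarchical clustering of genes so that SGL has group labels without domain knowledge) can be illustrated with a minimal sketch. The abstract does not state the linkage or distance the authors used, so average linkage on the distance 1 - |Pearson correlation| is an assumption here, and the names `pearson` and `cluster_genes` are hypothetical helpers, not part of the paper's software.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cluster_genes(expr, n_groups):
    """Average-linkage agglomerative clustering of genes (illustrative only).

    expr: dict mapping gene name -> list of expression values across cells.
    n_groups: number of gene groups to stop at (an assumed stopping rule).
    Returns a sorted list of sorted gene-name lists, one per group.
    """
    genes = list(expr)
    # Assumed distance: strongly (anti-)correlated genes are close.
    dist = {
        (a, b): 1.0 - abs(pearson(expr[a], expr[b]))
        for a in genes
        for b in genes
    }
    clusters = [[g] for g in genes]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise distance between two clusters.
                total = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy data: g2 tracks g1 and g4 tracks g3, so two groups are expected.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 1.0, 3.0, 2.0],
    "g4": [8.0, 2.0, 6.0, 4.0],
}
groups = cluster_genes(toy, 2)
```

The resulting group labels would then be passed to an SGL solver as the group structure. For real scRNA-seq matrices a library implementation of hierarchical clustering would be the practical choice, since this quadratic pure-Python loop is only meant to show the idea.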
Affiliation(s)
- Bhavithry Sen Puliparambil
  - Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
  - Correspondence:
- Jabed H. Tomal
  - Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Yan Yan
  - Department of Computing Science, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
2
Bai F, Puk KM, Liu J, Zhou H, Tao P, Zhou W, Wang S. Sparse group selection and analysis of function-related residue for protein-state recognition. J Comput Chem 2022; 43:1342-1354. [PMID: 35656889 PMCID: PMC9248267 DOI: 10.1002/jcc.26937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2021] [Revised: 03/23/2022] [Accepted: 05/08/2022] [Indexed: 11/08/2022]
Abstract
Machine learning methods have helped to advance a wide range of scientific and technological fields in recent years, including computational chemistry. Because chemical systems can be complex and high-dimensional, feature selection can be critical but challenging when developing reliable machine-learning-based prediction models, especially for proteins as bio-macromolecules. In this study, we applied the sparse group lasso (SGL) method as a general feature selection method to develop a classification model for an allosteric protein in different functional states. This results in a much improved model with comparable accuracy (Acc) and only 28 selected features, compared to 289 selected features in a previous study. The Acc reaches 91.50% with 1936 selected features, which is far higher than that of baseline methods. In addition, grouping protein amino acids into secondary structures provides additional interpretability of the selected features. The selected features are verified as associated with key allosteric residues through comparison with both experimental and computational works on the model protein, and demonstrate the effectiveness and necessity of applying rigorous feature selection and evaluation methods to complex chemical systems.
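For orientation, a commonly used form of the SGL objective is sketched below. The abstract does not state the exact objective or loss used in the study, so the symbols here are assumptions: $\ell$ is a generic loss (for example logistic loss for classification), $\beta^{(g)}$ is the coefficient sub-vector of group $g$ (here, plausibly the features of one secondary-structure group), $p_g$ is the size of that group, $\lambda \ge 0$ sets the overall penalty strength, and $\alpha \in [0, 1]$ balances the two penalties.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; x_i^{\top}\beta\right)
  \;+\; (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\bigl\lVert \beta^{(g)} \bigr\rVert_2
  \;+\; \alpha\,\lambda\,\lVert \beta \rVert_1
```

With $\alpha = 1$ this reduces to the lasso and with $\alpha = 0$ to the group lasso. Intermediate values can drop entire groups and also zero out individual features inside the groups that remain, which is the behaviour that yields a small selected feature set.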
Affiliation(s)
- Fangyun Bai
  - Department of Management Science and Engineering, Tongji University. Fangyun Bai and Kin Ming Puk contributed equally to this work
- Jin Liu
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center
- Hongyu Zhou
  - Department of Chemistry, Center for Scientific Computation, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University
- Peng Tao
  - Department of Chemistry, Center for Scientific Computation, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University
- Wenyong Zhou
  - Department of Management Science and Engineering, Tongji University
- Shouyi Wang
  - Corresponding author: Shouyi Wang, Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington.
3
Zhang J, Li Y. High-Dimensional Gaussian Graphical Regression Models with Covariates. J Am Stat Assoc 2022; 118:2088-2100. [PMID: 38143787 PMCID: PMC10746132 DOI: 10.1080/01621459.2022.2034632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2021] [Accepted: 01/20/2022] [Indexed: 10/19/2022]
Abstract
Though Gaussian graphical models have been widely used in many scientific fields, relatively limited progress has been made in linking graph structures to external covariates. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can determine how genetic variants and clinical conditions modulate the subject-level network structures, and recover both the population-level and subject-level gene networks. Our framework encourages sparsity of covariate effects on both the mean and the precision matrix. In particular, for the precision matrix, we stipulate simultaneous sparsity, i.e., group sparsity and element-wise sparsity, on effective covariates and their effects on network edges, respectively. We establish variable selection consistency first under the case with known mean parameters and then under a more challenging case with unknown means depending on external covariates, and establish in both cases the ℓ2 convergence rates and the selection consistency of the estimated precision parameters. The utility and efficacy of our proposed method are demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients.
Affiliation(s)
- Jingfei Zhang
  - Department of Management Science, University of Miami, Coral Gables, FL 33146
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109
4
Bai Y, Gong Y, Bai J, Liu J, Deng HW, Calhoun V, Wang YP. A Joint Analysis of Multi-Paradigm fMRI Data With Its Application to Cognitive Study. IEEE Trans Med Imaging 2021; 40:951-962. [PMID: 33284749 PMCID: PMC7925383 DOI: 10.1109/tmi.2020.3042786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/06/2023]
Abstract
With the development of neuroimaging techniques, a growing amount of multi-modal brain imaging data is being collected, facilitating comprehensive study of the brain. In this paper, we jointly analyzed functional magnetic resonance imaging (fMRI) data collected under different paradigms in order to understand the cognitive behaviors of an individual. To this end, we proposed a novel multi-view learning algorithm called structure-enforced collaborative regression (SCoRe) to extract co-expressed discriminative brain regions under the guidance of the anatomical structure of the brain. An advantage of SCoRe over its predecessor, collaborative regression (CoRe), lies in its incorporation of group structures in the brain imaging data, which makes the model biologically more meaningful. Results from real data analysis have confirmed that, by incorporating prior knowledge of brain structure, SCoRe can deliver better prediction performance and is less sensitive to hyper-parameters than CoRe. After validation with simulation experiments, we applied SCoRe to fMRI data collected from the Philadelphia Neurodevelopmental Cohort and adopted the scores from the wide range achievement test (WRAT) to evaluate an individual's cognitive skills. We located 14 relevant brain regions that can efficiently predict WRAT scores, and these brain regions were further confirmed by other independent studies.
5
Su T, Wang Y, Liu Y, Branton WG, Asahchop E, Power C, Jiang B, Kong L, Tang N. Sparse Multicategory Generalized Distance Weighted Discrimination in Ultra-High Dimensions. Entropy (Basel) 2020; 22:E1257. [PMID: 33287025 PMCID: PMC7712546 DOI: 10.3390/e22111257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2020] [Revised: 10/26/2020] [Accepted: 11/02/2020] [Indexed: 11/21/2022]
Abstract
Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Especially when various sparsity structures are assumed in these settings, variable selection in multicategory classification poses great challenges. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.
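The abstract mentions an algorithm based on the proximal operator without giving details. As a hedged illustration, the sketch below implements the standard closed-form proximal operator of a non-overlapping sparse group lasso penalty on a coefficient vector: element-wise soft-thresholding followed by group-wise shrinkage. It is not the MgDWD algorithm itself, which handles multicategory coefficients. The function names, the `sqrt(group size)` weighting, and the parameters `lam_l1` and `lam_group` are assumptions made for this example.

```python
from math import sqrt


def soft_threshold(v, t):
    """Scalar soft-thresholding: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_sparse_group_lasso(beta, groups, lam_l1, lam_group):
    """Proximal operator of lam_l1*||b||_1 + lam_group*sum_g sqrt(p_g)*||b_g||_2.

    beta: list of floats.
    groups: list of index lists that partition range(len(beta)) (no overlap).
    Returns a new list; the input is not modified.
    """
    out = [0.0] * len(beta)
    for idx in groups:
        # Step 1: lasso part, applied coordinate by coordinate.
        z = [soft_threshold(beta[j], lam_l1) for j in idx]
        # Step 2: group lasso part, shrinking the whole group toward zero.
        norm = sqrt(sum(v * v for v in z))
        thresh = lam_group * sqrt(len(idx))
        if norm > thresh:
            scale = 1.0 - thresh / norm
            for j, v in zip(idx, z):
                out[j] = scale * v
        # Otherwise the whole group stays at exactly zero.
    return out


# Toy example: the first group survives (shrunk), the second is removed.
beta = [4.0, 5.0, 0.5, -0.5]
groups = [[0, 1], [2, 3]]
result = prox_sparse_group_lasso(beta, groups, lam_l1=1.0, lam_group=2.5 / sqrt(2))
```

In a proximal gradient scheme this operator would be applied after each gradient step on the smooth loss, with both thresholds multiplied by the step size. Setting a whole group to zero while also zeroing individual entries within surviving groups is how the penalty preserves group structure during selection.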
Affiliation(s)
- Tong Su
  - Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China
- Yafei Wang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Yi Liu
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- William G. Branton
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Eugene Asahchop
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Christopher Power
  - Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
- Bei Jiang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Linglong Kong
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
- Niansheng Tang
  - Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China
6
Abstract
Mediation analysis attempts to determine whether the relationship between an independent variable (e.g., exposure) and an outcome variable can be explained, at least partially, by an intermediate variable, called a mediator. Most methods for mediation analysis focus on one mediator at a time, although multiple mediators can be jointly analyzed by structural equation models (SEMs) that account for correlations among the mediators. We extend the use of SEMs for the analysis of multiple mediators by creating a sparse group lasso penalized model such that the penalty considers the natural groupings of parameters that determine mediation, as well as encourages sparseness of the model parameters. This provides a way to simultaneously evaluate many mediators and select those that have the most impact, a feature of modern penalized models. Simulations are used to illustrate the benefits and limitations of our approach, and application to a study of DNA methylation and reactive cortisol stress following childhood trauma discovered two novel methylation loci that mediate the association of childhood trauma scores with reactive cortisol stress levels. Our new methods are incorporated into R software called regmed.
Affiliation(s)
- Daniel J Schaid
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Jason P Sinnwell
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
7
Abstract
Identification of genetic variants associated with complex traits is a critical step for improving plant resistance and breeding. Although the majority of existing methods for variant detection have good predictive performance in the average case, they cannot precisely identify the variants present in a small number of target genes. In this paper, we propose a weighted sparse group lasso (WSGL) method to select both common and low-frequency variants in groups. Under the biologically realistic assumption that complex traits are influenced by a few single loci in a small number of genes, our method uses a sparse group lasso approach to simultaneously select associated groups along with the loci within each group. To increase the probability of selecting low-frequency variants, biological prior information is introduced into the model by re-weighting the lasso regularization based on weights calculated from the input data. Experimental results from both simulated and real data on single nucleotide polymorphisms (SNPs) associated with Arabidopsis flowering traits demonstrate the superiority of WSGL over other competitive approaches for genetic variant detection.
Affiliation(s)
- Kai Che
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xi Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Maozu Guo
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xiaoyan Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
8
Li Y, Sun C, Li P, Zhao Y, Mensah GK, Xu Y, Guo H, Chen J. Hypernetwork Construction and Feature Fusion Analysis Based on Sparse Group Lasso Method on fMRI Dataset. Front Neurosci 2020; 14:60. [PMID: 32116508 PMCID: PMC7029661 DOI: 10.3389/fnins.2020.00060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/30/2019] [Accepted: 01/15/2020] [Indexed: 01/21/2023]
Abstract
Recent works have shown that the resting-state brain functional connectivity hypernetwork, in which multiple nodes can be connected, is an effective technique for brain disease diagnosis and classification research. In previous research, the lasso method was used to construct hypernetworks by solving sparse linear regression models. However, constructing a hypernetwork based on the lasso method simply selects a single variable, so it lacks the ability to interpret the grouping effect. Considering the group structure problem, a previous study proposed creating a hypernetwork based on the elastic net and the group lasso methods, and the results showed that the former method had the best classification performance. However, the highly correlated variables selected by the elastic net method were not necessarily in the active set in the group. Therefore, we extended our research to address this issue. Herein, we propose a new method that introduces the sparse group lasso method to improve the construction of the hypernetwork by solving the group structure problem of the brain regions. We used the traditional lasso, group lasso, and sparse group lasso methods to construct hypernetworks in patients with depression and normal subjects. Meanwhile, other clustering coefficients (clustering coefficients based on pairs of nodes) were also introduced to extract features alongside traditional clustering coefficients. Two types of features with significant differences obtained after feature selection were subjected to multi-kernel learning for feature fusion and classification using each method, respectively. The network topology results revealed differences among the three networks, where the hypernetwork using the lasso method was the strictest; the group lasso, most lenient; and the sgLasso method, moderate. The network topology of the sparse group lasso method was similar to that of the group lasso method but different from the lasso method.
The classification results show that the sparse group lasso method achieves the best classification accuracy by using multi-kernel learning, which indicates that better classification performance can be achieved when the group structure exists and is properly extended.
Affiliation(s)
- Yao Li
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
- Chao Sun
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
- Pengzu Li
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
- Yunpeng Zhao
  - College of Arts, Taiyuan University of Technology, Taiyuan, China
- Godfred Kim Mensah
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
- Yong Xu
  - Department of Psychiatry, First Hospital of Shanxi Medical University, Taiyuan, China
- Hao Guo
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
- Junjie Chen
  - College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
9
Guo Y, Wu C, Guo M, Zou Q, Liu X, Keinan A. Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits. Front Genet 2019; 10:271. [PMID: 31024614 PMCID: PMC6469383 DOI: 10.3389/fgene.2019.00271] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/17/2018] [Accepted: 03/12/2019] [Indexed: 11/13/2022]
Abstract
Genome-wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to consider confounding effects, such as those due to population structure, and to take groups of SNPs into account simultaneously due to the “polygenic” attribute of complex quantitative traits. In this paper, we propose a new approach, SGL-LMM, that combines sparse group lasso (SGL) and a linear mixed model (LMM) for multivariate associations of quantitative traits. LMM, as has often been used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets a fixed zero effect to learn the parameters of random effects using LMM, and then estimates fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show that SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.
Affiliation(s)
- Yingjie Guo
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Department of Computational Biology, Cornell University, Ithaca, NY, United States
- Chenxi Wu
  - Department of Mathematics, Rutgers University, Piscataway, NJ, United States
- Maozu Guo
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Xiaoyan Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Alon Keinan
  - Department of Computational Biology, Cornell University, Ithaca, NY, United States; Cornell Center for Comparative and Population Genomics, Center for Vertebrate Genomics, and Center for Enervating Neuroimmune Disease, Cornell University, Ithaca, NY, United States
10
Abstract
In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data, namely identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging data sets. Our proposed approach is implemented in the R package scalpel, which is available on CRAN.
Affiliation(s)
- Ashley Petersen
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Noah Simon
  - Department of Biostatistics, University of Washington, Seattle, Washington 98195, USA; Departments of Biostatistics and Statistics, University of Washington, Seattle, Washington 98195, USA
- Daniela Witten
  - Department of Biostatistics, University of Washington, Seattle, Washington 98195, USA; Departments of Biostatistics and Statistics, University of Washington, Seattle, Washington 98195, USA