1
Gorstein E, Aghdam R, Solís-Lemus C. HighDimMixedModels.jl: Robust high-dimensional mixed-effects models across omics data. PLoS Comput Biol 2025; 21:e1012143. [PMID: 39804942 PMCID: PMC11761659 DOI: 10.1371/journal.pcbi.1012143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 01/24/2025] [Accepted: 11/19/2024] [Indexed: 01/16/2025] Open
Abstract
High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm's behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
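For reference, the two penalties compared in this abstract can be written down in a few lines. The sketch below assumes the standard textbook definitions (LASSO as λ|β|, and SCAD in its usual three-piece form with the conventional default a = 3.7). The function names are illustrative and are not taken from HighDimMixedModels.jl.

```python
def lasso_penalty(beta: float, lam: float) -> float:
    """LASSO penalty: grows linearly in |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty in its standard three-piece form (requires a > 2).

    Assumption: a = 3.7 is the commonly used default, not a value
    stated in the abstract.
    """
    if lam < 0 or a <= 2:
        raise ValueError("need lam >= 0 and a > 2")
    t = abs(beta)
    if t <= lam:
        # Near zero: identical to the LASSO penalty.
        return lam * t
    if t <= a * lam:
        # Middle region: quadratic piece that tapers the slope to zero.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so no further shrinkage.
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    for beta in (0.5, 2.0, 10.0):
        print(beta, lasso_penalty(beta, 1.0), scad_penalty(beta, 1.0))
```

The two penalties agree for small coefficients, while SCAD levels off once |β| exceeds aλ, which is the usual explanation for its lower bias on large effects.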
Affiliation(s)
- Evan Gorstein
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Rosa Aghdam
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Claudia Solís-Lemus
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Department of Plant Pathology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
2
Chen S, Fang Z, Li Z, Liu X. A novel block-coordinate gradient descent algorithm for simultaneous grouped selection of fixed and random effects in joint modeling. Stat Med 2024; 43:4595-4613. [PMID: 39145573 DOI: 10.1002/sim.10193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Revised: 06/26/2024] [Accepted: 07/24/2024] [Indexed: 08/16/2024]
Abstract
Joint models for longitudinal and time-to-event data are receiving increasing attention owing to their ability to capture the possible association between these two types of data. Typically, a joint model consists of a longitudinal submodel for longitudinal processes and a survival submodel for the time-to-event response, and links the two submodels through common covariates that may carry both fixed and random effects. However, research gaps remain on how to simultaneously select fixed and random effects from the two submodels under the joint modeling framework efficiently and effectively. In this article, we propose a novel block-coordinate gradient descent (BCGD) algorithm to simultaneously select multiple longitudinal covariates that may carry fixed and random effects in the joint model. Specifically, for the multiple longitudinal processes, a linear mixed-effects model is adopted where random intercepts and slopes serve as essential covariates of the trajectories, and for the survival submodel, the popular proportional hazards model is employed. Penalized likelihood estimation is used to control the dimensionality of covariates in the joint model and estimate the unknown parameters, especially when estimating the covariance matrix of random effects. The proposed BCGD method can successfully capture the useful covariates of both fixed and random effects with excellent selection power, and efficiently provide a relatively accurate estimate of fixed and random effects empirically. The simulation results show excellent performance of the proposed method and support its effectiveness. The proposed BCGD method is further applied to two real data sets, and we examine the risk factors for the effects of different heart valves, differing in type of tissue, implanted in the aortic position and the risk factors for the diagnosis of primary biliary cholangitis.
Affiliation(s)
- Shuyan Chen
- School of Management, University of Science and Technology of China, Anhui, China
- Zhiqing Fang
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Zhong Li
- School of Insurance and Economics, University of International Business and Economics, Beijing, China
- Xin Liu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
3
Alamin M, Sultana MH, Lou X, Jin W, Xu H. Dissecting Complex Traits Using Omics Data: A Review on the Linear Mixed Models and Their Application in GWAS. Plants (Basel, Switzerland) 2022; 11:3277. [PMID: 36501317 PMCID: PMC9739826 DOI: 10.3390/plants11233277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 11/23/2022] [Accepted: 11/25/2022] [Indexed: 06/17/2023]
Abstract
Genome-wide association study (GWAS) is the most popular approach to dissecting complex traits in plants, humans, and animals. Numerous methods and tools have been proposed to discover the causal variants in GWAS data analysis. Among them, linear mixed models (LMMs) are widely used statistical methods for controlling confounding factors, including population structure, resulting in increased computational efficiency and statistical power in GWAS studies. Recently, more attention has been paid to pleiotropy, multi-trait, gene-gene interaction, gene-environment interaction, and multi-locus methods with the growing availability of large-scale GWAS data and relevant phenotype samples. In this review, we present the LMM-based methods available in the literature for GWAS. We briefly discuss the different LMM methods, software packages, and available open-source applications in GWAS. Then, we describe the advantages and weaknesses of LMMs in GWAS. Finally, we discuss future perspectives and conclude. The present review should help researchers quickly select appropriate LMM models and methods for GWAS data analysis and should benefit the scientific community.
Affiliation(s)
- Md. Alamin
- Institute of Bioinformatics, Zhejiang University, Hangzhou 310058, China
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Xiangyang Lou
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Wenfei Jin
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Haiming Xu
- Institute of Bioinformatics, Zhejiang University, Hangzhou 310058, China
4
Huang M, Lai H, Yu Y, Chen X, Wang T, Feng Q. Deep-gated recurrent unit and diet network-based genome-wide association analysis for detecting the biomarkers of Alzheimer's disease. Med Image Anal 2021; 73:102189. [PMID: 34343841 DOI: 10.1016/j.media.2021.102189] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/28/2021] [Revised: 05/30/2021] [Accepted: 07/16/2021] [Indexed: 01/01/2023]
Abstract
Genome-wide association analysis (GWAS) is a commonly used method to detect the potential biomarkers of Alzheimer's disease (AD). Most existing GWAS methods entail a high computational cost, disregard correlations among imaging data and correlations among genetic data, and ignore various associations between longitudinal imaging and genetic data. A novel GWAS method was proposed to identify potential AD biomarkers and address these problems. A network based on a gated recurrent unit was applied without imputing incomplete longitudinal imaging data to integrate the longitudinal data of variable lengths and extract an image representation. In this study, a modified diet network that can considerably reduce the number of parameters in the genetic network was proposed to perform GWAS between image representation and genetic data. Genetic representation can be extracted in this way. A link between genetic representation and AD was established to detect potential AD biomarkers. The proposed method was tested on a set of simulated data and a real AD dataset. Results of the simulated data showed that the proposed method can accurately detect relevant biomarkers. Moreover, the results of real AD dataset showed that the proposed method can detect some new risk-related genes of AD. Based on previous reports, no research has incorporated a deep-learning model into a GWAS framework to investigate the potential information on super-high-dimensional genetic data and longitudinal imaging data and create a link between imaging genetics and AD for detecting potential AD biomarkers. Therefore, the proposed method may provide new insights into the underlying pathological mechanism of AD.
Affiliation(s)
- Meiyan Huang
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China; Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China.
- Haoran Lai
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China.
- Yuwei Yu
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China.
- Xiumei Chen
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China.
- Tao Wang
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China.
- Qianjin Feng
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China; Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China.
5
Gu X, Chen Z, Wang D. Prediction of G Protein-Coupled Receptors With CTDC Extraction and MRMD2.0 Dimension-Reduction Methods. Front Bioeng Biotechnol 2020; 8:635. [PMID: 32671038 PMCID: PMC7329982 DOI: 10.3389/fbioe.2020.00635] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/21/2020] [Accepted: 05/26/2020] [Indexed: 11/13/2022] Open
Abstract
The G Protein-Coupled Receptor (GPCR) family consists of more than 800 different members. In this article, we attempt to use the physicochemical properties of Composition, Transition, Distribution (CTD) to represent GPCRs. The MRMD2.0 dimensionality reduction method filters out redundant physicochemical properties of the GPCRs. We plot the resulting coordinates with Matplotlib to distinguish GPCRs from other protein sequences. The plotted data show a clear separation, with a well-defined boundary between the two. The experimental results show that our method can predict GPCRs.
Affiliation(s)
- Xingyue Gu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Zhihua Chen
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
6
Che K, Chen X, Guo M, Wang C, Liu X. Genetic Variants Detection Based on Weighted Sparse Group Lasso. Front Genet 2020; 11:155. [PMID: 32194631 PMCID: PMC7063084 DOI: 10.3389/fgene.2020.00155] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/19/2019] [Accepted: 02/10/2020] [Indexed: 01/21/2023] Open
Abstract
Identification of genetic variants associated with complex traits is a critical step for improving plant resistance and breeding. Although the majority of existing methods for variant detection have good predictive performance in the average case, they cannot precisely identify the variants present in a small number of target genes. In this paper, we propose a weighted sparse group lasso (WSGL) method to select both common and low-frequency variants in groups. Under the biologically realistic assumption that complex traits are influenced by a few single loci in a small number of genes, our method involves a sparse group lasso approach to simultaneously select associated groups along with the loci within each group. To increase the probability of selecting low-frequency variants, biological prior information is introduced in the model by re-weighting the lasso regularization based on weights calculated from input data. Experimental results from both simulation and real data of single nucleotide polymorphisms (SNPs) associated with Arabidopsis flowering traits demonstrate the superiority of WSGL over other competitive approaches for genetic variant detection.
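The penalty structure described in this abstract can be illustrated with the generic sparse group lasso term, λ[(1 − α) Σ_g √p_g ‖β_g‖₂ + α Σ_j w_j |β_j|]. The sketch below is written under stated assumptions: the per-coefficient weights are hypothetical placeholders, because the abstract does not say how WSGL computes its weights from the input data, and the function name is illustrative rather than taken from the authors' code.

```python
import math
from typing import Optional, Sequence


def sparse_group_lasso_penalty(
    groups: Sequence[Sequence[float]],
    lam: float,
    alpha: float,
    weights: Optional[Sequence[Sequence[float]]] = None,
) -> float:
    """Generic (optionally re-weighted) sparse group lasso penalty.

    groups  : coefficients partitioned by group (e.g. loci within a gene)
    lam     : overall regularization strength
    alpha   : mix between the L1 term (alpha) and the group term (1 - alpha)
    weights : hypothetical per-coefficient L1 weights; all ones if omitted
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    group_term = 0.0
    l1_term = 0.0
    for g, coefs in enumerate(groups):
        w = weights[g] if weights is not None else [1.0] * len(coefs)
        if len(w) != len(coefs):
            raise ValueError("weights must match the group sizes")
        # Group-level term: sqrt(group size) times the group's L2 norm.
        group_term += math.sqrt(len(coefs)) * math.sqrt(sum(b * b for b in coefs))
        # Within-group term: weighted L1 norm of the coefficients.
        l1_term += sum(wj * abs(b) for wj, b in zip(w, coefs))
    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)
```

In this form, a smaller weight on a coefficient lowers its L1 cost, which is one plausible way that prior information could make a low-frequency variant easier to select.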
Affiliation(s)
- Kai Che
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xi Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Maozu Guo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xiaoyan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
7
Chen Z, Pang M, Zhao Z, Li S, Miao R, Zhang Y, Feng X, Feng X, Zhang Y, Duan M, Huang L, Zhou F. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics 2019; 36:1542-1552. [DOI: 10.1093/bioinformatics/btz763] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 05/24/2019] [Revised: 09/03/2019] [Accepted: 10/02/2019] [Indexed: 12/22/2022] Open
Abstract
Motivation
Deep neural network (DNN) algorithms have recently been used to predict various biomedical phenotypes and have demonstrated very good prediction performance without feature selection. This study proposed the hypothesis that DNN models may be further improved by feature selection algorithms.
Results
A comprehensive comparative study was carried out by evaluating 11 feature selection algorithms on three conventional DNN algorithms, i.e. convolutional neural network (CNN), deep belief network (DBN) and recurrent neural network (RNN), and three recent DNNs, i.e. MobilenetV2, ShufflenetV2 and Squeezenet. Five binary classification methylomic datasets were chosen to calculate the prediction performance of CNN/DBN/RNN models using features selected by the 11 feature selection algorithms. Seventeen binary classification transcriptome datasets and two multi-class transcriptome datasets were also used to evaluate how the hypothesis may generalize to different data types. The experimental data supported our hypothesis that feature selection algorithms may improve DNN models, and the DBN models using features selected by SVM-RFE usually achieved the best prediction accuracies on the five methylomic datasets.
Availability and implementation
All the algorithms were implemented and tested under the programming environment Python version 3.6.6.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zheng Chen
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Meng Pang
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Zixin Zhao
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Shuainan Li
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Rui Miao
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Yifan Zhang
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Xiaoyue Feng
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Xin Feng
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Yexian Zhang
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Meiyu Duan
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Lan Huang
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
- Fengfeng Zhou
- BioKnow Health Informatics Lab, College of Computer Science and Technology
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China