1
|
Li X, Feng X, Liu X. Heritability estimation for a linear combination of phenotypes via ridge regression. Bioinformatics 2022; 38:4687-4696. [PMID: 36053166 DOI: 10.1093/bioinformatics/btac587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Revised: 07/23/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The joint analysis of multiple phenotypes is important in many biological studies, such as plant and animal breeding. The heritability estimation for a linear combination of phenotypes is designed to account for correlation information. Existing methods for estimating heritability mainly focus on single phenotypes under random-effect models. These methods also require some stringent conditions, which calls for a more flexible and interpretable method for estimating heritability. Fixed-effect models emerge as a useful alternative. RESULTS In this paper, we propose a novel heritability estimator based on multivariate ridge regression for linear combinations of phenotypes, yielding accurate estimates in both sparse and dense cases. Under mild conditions in the high-dimensional setting, the proposed estimator appears to be consistent and asymptotically normally distributed. Simulation studies show that the proposed estimator is promising under different scenarios. Compared with independently combined heritability estimates in the case of multiple phenotypes, the proposed method significantly improves the performance by considering correlations among those phenotypes. We further demonstrate its application in heritability estimation and correlation analysis for the Oryza sativa rice dataset. AVAILABILITY AND IMPLEMENTATION An R package implementing the proposed method is available at https://github.com/xg-SUFE1/MultiRidgeVar, where covariance estimates are also given together with heritability estimates. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaoguang Li
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, 200433, China
- Xingdong Feng
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, 200433, China
- Xu Liu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, 200433, China
|
2
|
Asymptotic Normality in Linear Regression with Approximately Sparse Structure. MATHEMATICS 2022. [DOI: 10.3390/math10101657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
In this paper, we study the asymptotic normality in high-dimensional linear regression. We focus on the case where the covariance matrix of the regression variables has a KMS structure, in asymptotic settings where the number of predictors, p, is proportional to the number of observations, n. The main result of the paper is the derivation of the exact asymptotic distribution for the suitably centered and normalized squared norm of the product between the predictor matrix, X, and the outcome variable, Y, i.e., the statistic ∥X′Y∥₂², under rather unrestrictive assumptions for the model parameters βj. We employ the variance-gamma distribution in order to derive the results, which, along with the asymptotic results, allows us to easily define the exact distribution of the statistic. Additionally, we consider a specific case of approximate sparsity of the model parameter vector β and perform a Monte Carlo simulation study. The simulation results suggest that the statistic approaches the limiting distribution fairly quickly even under high variable multi-correlation and a relatively small number of observations, suggesting possible applications to the construction of statistical testing procedures for real-world data and related problems.
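As a rough illustration of the statistic described in this abstract, the following sketch simulates predictors with a KMS-type covariance and computes ∥X′Y∥₂². It assumes the common reading of "KMS structure" as the Kac-Murdock-Szegő matrix with entries ρ^|i−j|; the helper names, the sample sizes, the value of ρ, and the decaying coefficient vector are illustrative choices and are not taken from the paper. The centering and normalization used in the paper's limit theorem are not reproduced here.

```python
import numpy as np

def kms_matrix(p, rho):
    """Assumed KMS covariance: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def squared_norm_statistic(X, y):
    """Squared Euclidean norm of X'y (uncentered, unnormalized)."""
    v = X.T @ y
    return float(v @ v)

# Illustrative simulation settings (hypothetical, not from the paper).
rng = np.random.default_rng(0)
n, p, rho = 200, 100, 0.5

# Rows of X have covariance Sigma because X = Z L' with Sigma = L L'.
Sigma = kms_matrix(p, rho)
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T

# "Approximately sparse" coefficients: a decaying sequence (an assumption).
beta = 1.0 / np.arange(1, p + 1)
y = X @ beta + rng.standard_normal(n)

T = squared_norm_statistic(X, y)
print(f"||X'Y||_2^2 = {T:.2f}")
```

Repeating the simulation over many seeds would give an empirical distribution of the statistic that could be compared against a candidate limiting law.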
|
3
|
Gamarnik D, Zadik I. Sparse high-dimensional linear regression. Estimating squared error and a phase transition. Ann Stat 2022. [DOI: 10.1214/21-aos2130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
4
|
Bradic J, Fan J, Zhu Y. Testability of high-dimensional linear models with nonsparse structures. Ann Stat 2022; 50:615-639. [DOI: 10.1214/19-aos1932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Jelena Bradic
- Department of Mathematics and Halicioğlu Data Science Institute, University of California, San Diego
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University
- Yinchu Zhu
- Department of Economics and International Business School, Brandeis University
|
5
|
Chen HY, Li H, Argos M, Persky VW, Turyk ME. Statistical Methods for Assessing the Explained Variation of a Health Outcome by a Mixture of Exposures. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:2693. [PMID: 35270383 PMCID: PMC8910055 DOI: 10.3390/ijerph19052693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2021] [Revised: 02/13/2022] [Accepted: 02/18/2022] [Indexed: 12/04/2022]
Abstract
Exposures to environmental pollutants are often composed of mixtures of chemicals that can be highly correlated because of similar sources and/or chemical structures. The effect of an individual chemical on a health outcome can be weak and difficult to detect because of the relatively low level of exposures to many environmental pollutants. To tackle the challenging problem of assessing the health risk of exposure to a mixture of environmental pollutants, we propose a statistical approach to assessing the proportion of the variation of an outcome explained by a mixture of pollutants. The proposed approach avoids the difficult task of identifying specific pollutants that are responsible for the effects and may also be used to assess interactions among exposures. Extensive simulation results demonstrate that the proposed approach has very good performance. Application of the proposed approach is illustrated by investigating the main and interaction effects of the chemical pollutants on systolic and diastolic blood pressure in participants from the National Health and Nutrition Examination Survey.
Affiliation(s)
- Hua Yun Chen
- Division of Epidemiology & Biostatistics, School of Public Health, University of Illinois at Chicago, 1603 West Taylor Street, Chicago, IL 60612, USA; (H.L.); (M.A.); (V.W.P.); (M.E.T.)
|
6
|
Joubert BR, Kioumourtzoglou MA, Chamberlain T, Chen HY, Gennings C, Turyk ME, Miranda ML, Webster TF, Ensor KB, Dunson DB, Coull BA. Powering Research through Innovative Methods for Mixtures in Epidemiology (PRIME) Program: Novel and Expanded Statistical Methods. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:1378. [PMID: 35162394 PMCID: PMC8835015 DOI: 10.3390/ijerph19031378] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 12/22/2021] [Revised: 01/18/2022] [Accepted: 01/21/2022] [Indexed: 11/16/2022]
Abstract
Humans are exposed to a diverse mixture of chemical and non-chemical exposures across their lifetimes. Well-designed epidemiology studies as well as sophisticated exposure science and related technologies enable the investigation of the health impacts of mixtures. While existing statistical methods can address the most basic questions related to the association between environmental mixtures and health endpoints, there were gaps in our ability to learn from mixtures data in several common epidemiologic scenarios, including high correlation among health and exposure measures in space and/or time, the presence of missing observations, the violation of important modeling assumptions, and the presence of computational challenges incurred by current implementations. To address these and other challenges, NIEHS initiated the Powering Research through Innovative methods for Mixtures in Epidemiology (PRIME) program, to support work on the development and expansion of statistical methods for mixtures. Six independent projects supported by PRIME have been highly productive but their methods have not yet been described collectively in a way that would inform application. We review 37 new methods from PRIME projects and summarize the work across previously published research questions, to inform methods selection and increase awareness of these new methods. We highlight important statistical advancements considering data science strategies, exposure-response estimation, timing of exposures, epidemiological methods, the incorporation of toxicity/chemical information, spatiotemporal data, risk assessment, and model performance, efficiency, and interpretation. Importantly, we link to software to encourage application and testing on other datasets. This review can enable more informed analyses of environmental mixtures. We stress training for early career scientists as well as innovation in statistical methodology as an ongoing need. 
Ultimately, we direct efforts to the common goal of reducing harmful exposures to improve public health.
Affiliation(s)
- Bonnie R. Joubert
- Division of Extramural Research and Training, National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC 27709, USA;
- Marianthi-Anna Kioumourtzoglou
- Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, New York, NY 10032, USA;
- Toccara Chamberlain
- Division of Extramural Research and Training, National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC 27709, USA;
- Hua Yun Chen
- Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois Chicago, Chicago, IL 60612, USA; (H.Y.C.); (M.E.T.)
- Chris Gennings
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
- Mary E. Turyk
- Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois Chicago, Chicago, IL 60612, USA; (H.Y.C.); (M.E.T.)
- Marie Lynn Miranda
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, South Bend, IN 46556, USA;
- Thomas F. Webster
- Department of Environmental Health, Boston University School of Public Health, Boston, MA 02118, USA;
- David B. Dunson
- Department of Statistical Science, Duke University, Durham, NC 27710, USA;
- Brent A. Coull
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA;
|
7
|
Livne I, Azriel D, Goldberg Y. Improved estimators for semi-supervised high-dimensional regression model. Electron J Stat 2022. [DOI: 10.1214/22-ejs2070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Ilan Livne
- The Faculty of Industrial Engineering and Management, Technion, Israel
- David Azriel
- The Faculty of Industrial Engineering and Management, Technion, Israel
- Yair Goldberg
- The Faculty of Industrial Engineering and Management, Technion, Israel
|
8
|
Wang R, Xu X. A Bayesian-motivated test for high-dimensional linear regression models with fixed design matrix. Stat Pap (Berl) 2021. [DOI: 10.1007/s00362-020-01157-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
9
|
Comminges L, Collier O, Ndaoud M, Tsybakov AB. Adaptive robust estimation in sparse vector model. Ann Stat 2021. [DOI: 10.1214/20-aos2002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- L. Comminges
- CEREMADE, Université Paris-Dauphine, PSL and CREST
- O. Collier
- Modal’X, UPL, Université Paris Nanterre and CREST
|
10
|
Abstract
Summary
Genome-wide association studies have identified thousands of genetic variants that are associated with complex traits. Many complex traits are shown to share genetic etiology. Although various genetic correlation measures and their estimators have been developed, rigorous statistical analysis of their properties, including their robustness to model assumptions, is still lacking. We develop a method of moments estimator of genetic correlation between two traits in the framework of high-dimensional linear models. We show that the genetic correlation defined based on the regression coefficients and the linkage disequilibrium matrix can be decomposed into both the pleiotropic effects and correlations due to linkage disequilibrium between the causal loci of the two traits. The proposed estimator can be computed from summary association statistics when the raw genotype data are not available. Theoretical properties of the estimator in terms of consistency and asymptotic normality are provided. The proposed estimator is closely related to the estimator from the linkage disequilibrium score regression. However, our analysis reveals that the linkage disequilibrium score regression method does not make full use of the linkage disequilibrium information, and its jackknife variance estimate can be biased when the model assumptions are violated. Simulations and real data analysis results show that the proposed estimator is more robust and has better interpretability than the linkage disequilibrium score regression method under different genetic architectures.
Affiliation(s)
- Jianqiao Wang
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
|
11
|
Law M, Ritov Y. Inference without compatibility: Using exponential weighting for inference on a parameter of a linear model. BERNOULLI 2021. [DOI: 10.3150/20-bej1280] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Affiliation(s)
- Michael Law
- Department of Statistics, University of Michigan, Ann Arbor, USA
- Ya’acov Ritov
- Department of Statistics, University of Michigan, Ann Arbor, USA
|
12
|
Guo X, Cheng G. Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1893177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Xiao Guo
- International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, China
- Guang Cheng
- Department of Statistics, Purdue University, West Lafayette, IN
|
13
|
Mai TT, Turner P, Corander J. Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting. BMC Bioinformatics 2021; 22:164. [PMID: 33773584 PMCID: PMC8004405 DOI: 10.1186/s12859-021-04079-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/03/2021] [Accepted: 03/15/2021] [Indexed: 11/29/2022] Open
Abstract
Background Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. Results In this paper, we propose a generic strategy for heritability inference, termed “boosting heritability”, by combining the advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads in general to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Streptococcus pneumoniae, to demonstrate the attractive features of our inference strategy. Conclusions Boosting is shown to offer a reliable and practically useful tool for inference about heritability.
Affiliation(s)
- The Tien Mai
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway.
- Paul Turner
- Cambodia-Oxford Medical Research Unit, Angkor Hospital for Children, Siem Reap, Cambodia; Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Jukka Corander
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway; Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
|
14
|
Javanmard A, Lee JD. A flexible framework for hypothesis testing in high dimensions. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12373] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
|
15
|
Abstract
Summary
We propose a novel estimator of error variance and establish its asymptotic properties based on ridge regression and random matrix theory. The proposed estimator is valid under both low- and high-dimensional models, and performs well not only in nonsparse cases, but also in sparse ones. The finite-sample performance of the proposed method is assessed through an intensive numerical study, which indicates that the method is promising compared with its competitors in many interesting scenarios.
Affiliation(s)
- X Liu
- School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- S Zheng
- School of Mathematics & Statistics, Northeast Normal University, 5268 Renmin Street, Changchun 130024, China
- X Feng
- School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
|
16
|
Tony Cai T, Guo Z. Semisupervised inference for explained variance in high dimensional linear regression and its applications. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12357] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/29/2022]
Affiliation(s)
- T. Tony Cai
- University of Pennsylvania; Philadelphia USA
|
17
|
Azriel D, Schwartzman A. Estimation of linear projections of non-sparse coefficients in high-dimensional regression. Electron J Stat 2020. [DOI: 10.1214/19-ejs1656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
18
|
|
19
|
Abstract
Summary
Consider a high-dimensional linear regression problem, where the number of covariates is larger than the number of observations and the interest is in estimating the conditional variance of the response variable given the covariates. A conditional and an unconditional framework are considered, where conditioning is with respect to the covariates, which are ancillary to the parameter of interest. In recent papers, a consistent estimator was developed in the unconditional framework when the marginal distribution of the covariates is normal with known mean and variance. In the present work, a certain Bayesian hypothesis test is formulated under the conditional framework, and it is shown that the Bayes risk is a constant. This implies that no consistent estimator exists in the conditional framework. However, when the marginal distribution of the covariates is normal, the conditional error of the above consistent estimator converges to zero, with probability converging to one. It follows that even in the conditional setting, information about the marginal distribution of an ancillary statistic may have a significant impact on statistical inference. The practical implication in the context of high-dimensional regression models is that additional observations where only the covariates are given are potentially very useful and should not be ignored. This finding is most relevant to semi-supervised learning problems where covariate information is easy to obtain.
Affiliation(s)
- D Azriel
- Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Technion City, Haifa 3200003, Israel
|
20
|
|
21
|
Wang Z, Xue L. Variance estimation for sparse ultra-high dimensional varying coefficient models. COMMUN STAT-THEOR M 2019. [DOI: 10.1080/03610926.2018.1429627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Zhaoliang Wang
- College of Applied Sciences, Beijing University of Technology, Beijing, China
- School of Mathematics and Information Science, Henan Polytechnic University, Jiaozuo, China
- Liugen Xue
- College of Applied Sciences, Beijing University of Technology, Beijing, China
|
22
|
|
23
|
Guo Z, Wang W, Cai TT, Li H. Optimal Estimation of Genetic Relatedness in High-dimensional Linear Models. J Am Stat Assoc 2018; 114:358-369. [PMID: 38434789 PMCID: PMC10907007 DOI: 10.1080/01621459.2017.1407774] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/01/2016] [Revised: 10/01/2017] [Indexed: 10/18/2022]
Abstract
Estimating the genetic relatedness between two traits based on the genome-wide association data is an important problem in genetics research. In the framework of high-dimensional linear models, we introduce two measures of genetic relatedness and develop optimal estimators for them. One is genetic covariance, which is defined to be the inner product of the two regression vectors, and another is genetic correlation, which is a normalized inner product by their lengths. We propose functional de-biased estimators (FDEs), which consist of an initial estimation step with the plug-in scaled Lasso estimator, and a further bias correction step. We also develop estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait. The estimators are shown to be minimax rate-optimal and can be efficiently implemented. Simulation results show that FDEs provide better estimates of the genetic relatedness than simple plug-in estimates. FDE is also applied to an analysis of a yeast segregant data set with multiple traits to estimate the genetic relatedness among these traits.
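The two relatedness measures defined in this abstract can be written down directly when the regression vectors are known. The sketch below computes only these population-level quantities; it is not the functional de-biased estimator (FDE) proposed in the paper, and the function names, the example vectors, and the zero-norm convention are assumptions made for illustration.

```python
import numpy as np

def genetic_covariance(beta, gamma):
    """Inner product of the two regression vectors."""
    return float(np.dot(beta, gamma))

def genetic_correlation(beta, gamma):
    """Inner product normalized by the lengths of the two vectors."""
    denom = np.linalg.norm(beta) * np.linalg.norm(gamma)
    if denom == 0.0:
        # Convention assumed here: report 0 when either vector is all zeros.
        return 0.0
    return genetic_covariance(beta, gamma) / denom

# Hypothetical effect vectors for two traits over three variants.
beta = np.array([1.0, 0.0, 2.0])
gamma = np.array([2.0, 0.0, 1.0])

print(genetic_covariance(beta, gamma))   # 1*2 + 0*0 + 2*1
print(genetic_correlation(beta, gamma))  # approximately 4 / 5
```

In practice the regression vectors are unknown in high dimensions, which is why the paper replaces them with an initial scaled Lasso estimate followed by a bias correction.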
Affiliation(s)
- Zijian Guo
- Department of Statistics and Biostatistics, Rutgers University
- Wanjie Wang
- Department of Statistics and Applied Probability, National University of Singapore
- T. Tony Cai
- Department of Statistics, The Wharton School, University of Pennsylvania
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania
|
24
|
|
25
|
Zhu Y, Bradic J. Significance testing in non-sparse high-dimensional linear models. Electron J Stat 2018. [DOI: 10.1214/18-ejs1443] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/19/2022]
|
26
|
Jacquin L, Cao TV, Ahmadi N. A Unified and Comprehensible View of Parametric and Kernel Methods for Genomic Prediction with Application to Rice. Front Genet 2016; 7:145. [PMID: 27555865 PMCID: PMC4977290 DOI: 10.3389/fgene.2016.00145] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/23/2016] [Accepted: 07/26/2016] [Indexed: 11/29/2022] Open
Abstract
One objective of this study was to provide readers with a clear and unified understanding of parametric statistical and kernel methods, used for genomic prediction, and to compare some of these in the context of rice breeding for quantitative traits. Furthermore, another objective was to provide a simple and user-friendly R package, named KRMM, which allows users to perform RKHS regression with several kernels. After introducing the concept of regularized empirical risk minimization, the connections between well-known parametric and kernel methods such as Ridge regression [i.e., genomic best linear unbiased predictor (GBLUP)] and reproducing kernel Hilbert space (RKHS) regression were reviewed. Ridge regression was then reformulated so as to show and emphasize the advantage of the kernel “trick” concept, exploited by kernel methods in the context of epistatic genetic architectures, over parametric frameworks used by conventional methods. Some parametric and kernel methods; least absolute shrinkage and selection operator (LASSO), GBLUP, support vector machine regression (SVR) and RKHS regression were thereupon compared for their genomic predictive ability in the context of rice breeding using three real data sets. Among the compared methods, RKHS regression and SVR were often the most accurate methods for prediction followed by GBLUP and LASSO. An R function which allows users to perform RR-BLUP of marker effects, GBLUP and RKHS regression, with a Gaussian, Laplacian, polynomial or ANOVA kernel, in a reasonable computation time has been developed. Moreover, a modified version of this function, which allows users to tune kernels for RKHS regression, has also been developed and parallelized for HPC Linux clusters. The corresponding KRMM package and all scripts have been made publicly available.
Affiliation(s)
- Laval Jacquin
- Centre de Coopération Internationale en Recherche Agronomique pour le Développement, BIOS, UMR AGAP Montpellier, France
- Tuong-Vi Cao
- Centre de Coopération Internationale en Recherche Agronomique pour le Développement, BIOS, UMR AGAP Montpellier, France
- Nourollah Ahmadi
- Centre de Coopération Internationale en Recherche Agronomique pour le Développement, BIOS, UMR AGAP Montpellier, France
|
27
|
Schweiger R, Kaufman S, Laaksonen R, Kleber ME, März W, Eskin E, Rosset S, Halperin E. Fast and Accurate Construction of Confidence Intervals for Heritability. Am J Hum Genet 2016; 98:1181-1192. [PMID: 27259052 DOI: 10.1016/j.ajhg.2016.04.016] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 01/29/2016] [Accepted: 04/27/2016] [Indexed: 11/26/2022] Open
Abstract
Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models (LMMs) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. Existing methods for the construction of confidence intervals and estimators of SEs for REML rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. We also show that often the probability that the genetic component is estimated as 0 is high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, ALBI (accurate LMM-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as GCTA, FaST-LMM, GEMMA, or EMMAX.
|