1. Wang X, Shi S, Ali Khan MY, Zhang Z, Zhang Y. Improving the accuracy of genomic prediction in dairy cattle using the biologically annotated neural networks framework. J Anim Sci Biotechnol 2024; 15:87. [PMID: 38945998 PMCID: PMC11215832 DOI: 10.1186/s40104-024-01044-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 05/05/2024] [Indexed: 07/02/2024]
Abstract
BACKGROUND Biologically annotated neural networks (BANNs) are feedforward Bayesian neural network models that utilize partially connected architectures based on SNP-set annotations. As an interpretable neural network, BANNs model SNP and SNP-set effects in their input and hidden layers, respectively. Furthermore, the weights and connections of the network are regarded as random variables with prior distributions reflecting the manifestation of genetic effects at various genomic scales. However, the application of BANNs in genomic prediction has yet to be explored.
RESULTS This study extended the BANNs framework to the area of genomic selection and explored the optimal SNP-set partitioning strategies by using dairy cattle datasets. The SNP-sets were partitioned based on two strategies, gene annotations and 100 kb windows, denoted as BANN_gene and BANN_100kb, respectively. The BANNs model was compared with GBLUP, random forest (RF), BayesB and BayesCπ through five replicates of five-fold cross-validation using genotypic and phenotypic data on milk production traits, type traits, and one health trait of 6,558, 6,210 and 5,962 Chinese Holsteins, respectively. Results showed that the BANNs framework achieves higher genomic prediction accuracy compared to GBLUP, RF and Bayesian methods. Specifically, BANN_100kb demonstrated the highest accuracy and BANN_gene generally the second-highest accuracy when compared with GBLUP, RF, BayesB and BayesCπ across all traits. The average accuracy improvements of BANN_100kb over GBLUP, RF, BayesB and BayesCπ were 4.86%, 3.95%, 3.84% and 1.92%, and the accuracy of BANN_gene was improved by 3.75%, 2.86%, 2.73% and 0.85% compared to GBLUP, RF, BayesB and BayesCπ, respectively, across all seven traits. Meanwhile, both BANN_100kb and BANN_gene yielded lower overall mean square error values than GBLUP, RF and Bayesian methods.
CONCLUSION Our findings demonstrated that the BANNs framework performed better than traditional genomic prediction methods in our tested scenarios, and might serve as a promising alternative approach for genomic prediction in dairy cattle.
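To illustrate the 100 kb window strategy mentioned in this abstract, the following minimal Python sketch groups SNPs into SNP-sets by chromosome and base-pair position. It assumes fixed, non-overlapping windows that start at position 0 of each chromosome; the abstract does not say how window boundaries were defined, and the function name `partition_by_window` and the toy SNP records are hypothetical.

```python
from collections import defaultdict

WINDOW_BP = 100_000  # 100 kb window width (assumed fixed and non-overlapping)


def partition_by_window(snps, window_bp=WINDOW_BP):
    """Group SNP ids into per-chromosome windows of width window_bp.

    snps: iterable of (snp_id, chromosome, position_bp) tuples.
    Returns a dict mapping (chromosome, window_index) -> list of SNP ids.
    """
    sets = defaultdict(list)
    for snp_id, chrom, pos in snps:
        # Integer division assigns positions 0..99,999 to window 0, and so on.
        sets[(chrom, pos // window_bp)].append(snp_id)
    return dict(sets)


# Hypothetical toy data: (SNP id, chromosome, position in bp)
toy_snps = [
    ("rs1", 1, 15_000),
    ("rs2", 1, 99_999),
    ("rs3", 1, 100_000),
    ("rs4", 2, 250_000),
]
snp_sets = partition_by_window(toy_snps)
```

Each resulting SNP-set would correspond to one hidden-layer unit in a partially connected network of the kind the abstract describes; a gene-annotation strategy would replace the window index with a gene identifier.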
Affiliation(s)
- Xue Wang
- State Key Laboratory of Animal Biotech Breeding, National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Shaolei Shi
- State Key Laboratory of Animal Biotech Breeding, National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Md Yousuf Ali Khan
- State Key Laboratory of Animal Biotech Breeding, National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Bangladesh Livestock Research Institute, Dhaka 1341, Bangladesh
- Zhe Zhang
- Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou 510642, China.
- Yi Zhang
- State Key Laboratory of Animal Biotech Breeding, National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China.
2. Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CWK, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023; 110:2077-2091. [PMID: 38065072 PMCID: PMC10716520 DOI: 10.1016/j.ajhg.2023.10.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/08/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 12/18/2023]
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Affiliation(s)
- Vivian Link
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Joshua G Schraiber
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Caoqi Fan
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Bryan Dinh
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Michael D Edge
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
3. Hai Y, Ma J, Yang K, Wen Y. Bayesian linear mixed model with multiple random effects for prediction analysis on high-dimensional multi-omics data. Bioinformatics 2023; 39:btad647. [PMID: 37882747 PMCID: PMC10627352 DOI: 10.1093/bioinformatics/btad647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 09/24/2023] [Accepted: 10/24/2023] [Indexed: 10/27/2023]
Abstract
MOTIVATION Accurate disease risk prediction is an essential step in the modern quest for precision medicine. While high-dimensional multi-omics data have provided unprecedented data resources for prediction studies, their high dimensionality and complex inter/intra-relationships have posed significant analytical challenges.
RESULTS We proposed a two-step Bayesian linear mixed model framework (TBLMM) for risk prediction analysis on multi-omics data. TBLMM models the predictive effects from multi-omics data using a hybrid of the sparsity regression and linear mixed model with multiple random effects. It can resemble the shape of the true effect size distributions and accounts for non-linear effects, including interaction effects, among multi-omics data via kernel fusion. It infers its parameters via a computationally efficient variational Bayes algorithm. Through extensive simulation studies and the prediction analyses on the positron emission tomography imaging outcomes using data obtained from the Alzheimer's Disease Neuroimaging Initiative, we have demonstrated that TBLMM can consistently outperform the existing method in predicting the risk of complex traits.
AVAILABILITY AND IMPLEMENTATION The corresponding R package is available on GitHub (https://github.com/YaluWen/TBLMM).
Affiliation(s)
- Yang Hai
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi Province 030000, China
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
- Jixiang Ma
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi Province 030000, China
- Kaixin Yang
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi Province 030000, China
- Yalu Wen
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi Province 030000, China
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
4. Hai Y, Zhao W, Meng Q, Liu L, Wen Y. Bayesian linear mixed model with multiple random effects for family-based genetic studies. Front Genet 2023; 14:1267704. [PMID: 37928242 PMCID: PMC10620972 DOI: 10.3389/fgene.2023.1267704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 09/25/2023] [Indexed: 11/07/2023]
Abstract
Motivation: Family-based study design is one of the popular designs used in genetic research, and the whole-genome sequencing data obtained from family-based studies offer many unique features for risk prediction studies. They can not only provide a more comprehensive view of many complex diseases, but also utilize information in the design to further improve the prediction accuracy. While promising, existing analytical methods often ignore the information embedded in the study design and overlook the predictive effects of rare variants, leading to a prediction model with sub-optimal performance. Results: We proposed a Bayesian linear mixed model for the prediction analysis of sequencing data obtained from family-based studies. Our method can not only capture predictive effects from both common and rare variants, but also easily accommodate various disease model assumptions. It uses information embedded in the study design to form surrogates, where the predictive effects from unmeasured/unknown genetic and environmental risk factors can be modelled. Through extensive simulation studies and the analysis of sequencing data obtained from the Michigan State University Twin Registry study, we have demonstrated that the proposed method outperforms commonly adopted techniques. Availability: R package is available at https://github.com/yhai943/FBLMM.
Affiliation(s)
- Yang Hai
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Wenxuan Zhao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Qingyu Meng
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Long Liu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland, New Zealand
5. Alatrany AS, Khan W, Hussain AJ, Mustafina J, Al-Jumeily D. Transfer Learning for Classification of Alzheimer's Disease Based on Genome Wide Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2700-2711. [PMID: 37018274 DOI: 10.1109/tcbb.2022.3233869] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/19/2023]
Abstract
Alzheimer's disease (AD) is a brain disorder that is regarded as degenerative because its symptoms worsen over time. Single nucleotide polymorphisms (SNPs) have been identified as relevant biomarkers for this condition. This study aims to identify SNP biomarkers associated with AD in order to perform a reliable classification of AD. In contrast to existing related works, we utilize deep transfer learning with varying experimental analysis for reliable classification of AD. For this purpose, convolutional neural networks (CNNs) are first trained on the genome-wide association study (GWAS) dataset requested from the AD Neuroimaging Initiative. We then employ deep transfer learning to further train our CNN (as the base model) on a different AD GWAS dataset, to extract the final set of features. The extracted features are then fed into a support vector machine for classification of AD. Detailed experiments are performed using multiple datasets and varying experimental configurations. The statistical outcomes indicate an accuracy of 89%, which is a significant improvement when benchmarked against existing related works.
6. Fu B, Pazokitoroudi A, Sudarshan M, Liu Z, Subramanian L, Sankararaman S. Fast kernel-based association testing of non-linear genetic effects for biobank-scale data. Nat Commun 2023; 14:4936. [PMID: 37582955 PMCID: PMC10427662 DOI: 10.1038/s41467-023-40346-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Accepted: 07/18/2023] [Indexed: 08/17/2023]
Abstract
Our knowledge of non-linear genetic effects on complex traits remains limited, in part due to the modest power to detect such effects. While kernel-based tests offer a versatile approach to test for non-linear relationships between sets of genetic variants and traits, current approaches cannot be applied to Biobank-scale datasets containing hundreds of thousands of individuals. We propose FastKAST, a kernel-based approach that can test for non-linear effects of a set of variants on a quantitative trait. FastKAST provides calibrated hypothesis tests while enabling analysis of Biobank-scale datasets with hundreds of thousands of unrelated individuals from a homogeneous population. We apply FastKAST to 53 quantitative traits measured across ≈ 300 K unrelated white British individuals in the UK Biobank to detect sets of variants with non-linear effects at genome-wide significance.
Affiliation(s)
- Boyang Fu
- Department of Computer Science, UCLA, Los Angeles, CA, USA.
- Mukund Sudarshan
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Zhengtong Liu
- Department of Computer Science, UCLA, Los Angeles, CA, USA
- Lakshminarayanan Subramanian
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA
- Sriram Sankararaman
- Department of Computer Science, UCLA, Los Angeles, CA, USA.
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA.
- Department of Computational Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA.
7. Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CW, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. bioRxiv: The Preprint Server for Biology 2023:2023.04.07.536093. [PMID: 37066144 PMCID: PMC10104234 DOI: 10.1101/2023.04.07.536093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide Association Studies (GWAS) are a powerful way to find genetic loci associated with phenotypes. GWAS are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix given the ARG (local eGRM). Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to identify a large-effect BMI locus, the CREBRF gene, in a sample of Native Hawaiians in which it was not previously detectable by GWAS because of a lack of population-specific imputation resources. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Affiliation(s)
- Vivian Link
- Department of Quantitative and Computational Biology, University of Southern California
- Joshua G. Schraiber
- Department of Quantitative and Computational Biology, University of Southern California
- Caoqi Fan
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Bryan Dinh
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Charleston W.K. Chiang
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Michael D. Edge
- Department of Quantitative and Computational Biology, University of Southern California
8. Mahmood U, Li X, Fan Y, Chang W, Niu Y, Li J, Qu C, Lu K. Multi-omics revolution to promote plant breeding efficiency. Frontiers in Plant Science 2022; 13:1062952. [PMID: 36570904 PMCID: PMC9773847 DOI: 10.3389/fpls.2022.1062952] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 10/06/2022] [Accepted: 11/24/2022] [Indexed: 06/17/2023]
Abstract
Crop production is the primary goal of agricultural activities. However, global agricultural systems are coming under increasing pressure from the rising food demand of the rapidly growing world population and the changing climate. To address these issues, improving traits related to high yield and climate resilience in crop breeding is an effective strategy. In recent years, advances in omics techniques, including genomics, transcriptomics, proteomics, and metabolomics, have paved the way for accelerating plant/crop breeding to cope with the changing climate and enhance food production. Optimized integration of omics and phenotypic plasticity platforms, exploited by evolving machine learning algorithms, will aid in the development of biological interpretations for complex crop traits. The precise and progressive assembly of desired alleles using precise genome editing approaches and enhanced breeding strategies would enable future crops to excel in combating the changing climate. Furthermore, plant breeding and genetic engineering ensure an exclusive approach to developing nutrient-sufficient and climate-resilient crops, the productivity of which can sustainably and adequately meet the world's food, nutrition, and energy needs. This review provides an overview of how the integration of omics approaches could be exploited to select crop varieties with desired traits.
Affiliation(s)
- Umer Mahmood
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Xiaodong Li
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Yonghai Fan
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Wei Chang
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Yue Niu
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Jiana Li
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Academy of Agricultural Sciences, Southwest University, Chongqing, China
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing, China
- Cunmin Qu
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Academy of Agricultural Sciences, Southwest University, Chongqing, China
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing, China
- Kun Lu
- Integrative Science Center of Germplasm Creation in Western China (Chongqing) Science City and Southwest University, College of Agronomy and Biotechnology, Southwest University, Chongqing, China
- Academy of Agricultural Sciences, Southwest University, Chongqing, China
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing, China
9. Wang X, Wen Y. A penalized linear mixed model with generalized method of moments estimators for complex phenotype prediction. Bioinformatics 2022; 38:5222-5228. [PMID: 36205617 DOI: 10.1093/bioinformatics/btac659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Revised: 07/27/2022] [Accepted: 10/05/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Linear mixed models (LMMs) have long been the method of choice for risk prediction analysis on high-dimensional data. However, it remains computationally challenging to simultaneously model a large number of variants that can be noise or have predictive effects of complex forms.
RESULTS In this work, we have developed a penalized LMM with generalized method of moments (pLMMGMM) estimators for prediction analysis. pLMMGMM is built within the LMM framework, where random effects are used to model the joint predictive effects from all variants within a region. Unlike existing methods that focus on linear relationships and use empirical criteria for variable screening, pLMMGMM can efficiently detect regions that harbor genetic variants with both linear and non-linear predictive effects. In addition, unlike existing LMMs that can only handle a very limited number of random effects, pLMMGMM is much less computationally demanding. It can jointly consider a large number of regions and accurately detect those that are predictive. Through theoretical investigations, we have shown that our method has selection consistency and asymptotic normality. Through extensive simulations and the analysis of PET-imaging outcomes, we have demonstrated that pLMMGMM outperformed existing models and that it can accurately detect regions that harbor risk factors with various forms of predictive effects.
AVAILABILITY AND IMPLEMENTATION The R-package is available at https://github.com/XiaQiong/GMMLasso.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaqiong Wang
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
10. Liu L, Meng Q, Weng C, Lu Q, Wang T, Wen Y. Explainable deep transfer learning model for disease risk prediction using high-dimensional genomic data. PLoS Comput Biol 2022; 18:e1010328. [PMID: 35839250 PMCID: PMC9328574 DOI: 10.1371/journal.pcbi.1010328] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/16/2021] [Revised: 07/27/2022] [Accepted: 06/27/2022] [Indexed: 11/19/2022]
Abstract
Building an accurate disease risk prediction model is an essential step in the modern quest for precision medicine. While high-dimensional genomic data provide valuable data resources for the investigation of disease risk, their huge amount of noise and the complex relationships between predictors and outcomes have brought tremendous analytical challenges. Deep learning models are state-of-the-art methods for many prediction tasks, and they offer a promising framework for the analysis of genomic data. However, deep learning models generally suffer from the curse of dimensionality and the lack of biological interpretability, both of which have greatly limited their applications. In this work, we have developed a deep neural network (DNN) based prediction modeling framework. We first proposed a group-wise feature importance score for feature selection, where genes harboring genetic variants with both linear and non-linear effects are efficiently detected. We then designed an explainable transfer-learning based DNN method, which can directly incorporate information from feature selection and accurately capture complex predictive effects. The proposed DNN-framework is biologically interpretable, as it is built based on the selected predictive genes. It is also computationally efficient and can be applied to genome-wide data. Through extensive simulations and real data analyses, we have demonstrated that our proposed method can not only efficiently detect predictive features, but also accurately predict disease risk, as compared to many existing methods.
Accurate disease risk prediction is an essential step towards precision medicine. Deep learning models have achieved state-of-the-art performance for many prediction tasks. However, they generally suffer from the curse of dimensionality and lack of biological interpretability, both of which have greatly limited their applications to the prediction analysis of whole-genome sequencing data. We present here an explainable deep transfer learning model for the analysis of high-dimensional genomic data. Our proposed method can detect predictive genes that harbor genetic variants with both linear and non-linear effects via the proposed group-wise feature importance score. It can also efficiently and accurately model disease risk based on the detected predictive genes using the proposed transfer-learning based network architecture. Our proposed method is built at the gene level, and thus is much more biologically interpretable. It is also computationally efficient and can be applied to whole-exome sequencing data that have millions of potential predictors. Through both simulation studies and the analysis of whole-exome data obtained from the Alzheimer’s Disease Neuroimaging Initiative, we have demonstrated that our method can efficiently detect predictive genes and that it has better prediction performance than many existing methods.
Affiliation(s)
- Long Liu
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi, China
- Qingyu Meng
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi, China
- Cherry Weng
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Qing Lu
- Department of Biostatistics, University of Florida, Gainesville, Florida, United States of America
- Tong Wang
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi, China
- Yalu Wen
- Department of Health Statistics, Shanxi Medical University, Taiyuan, Shanxi, China
- Department of Statistics, University of Auckland, Auckland, New Zealand
11. Wang X, Wen Y. A penalized linear mixed model with generalized method of moments for prediction analysis on high-dimensional multi-omics data. Brief Bioinform 2022; 23:6596990. [PMID: 35649346 PMCID: PMC9310531 DOI: 10.1093/bib/bbac193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 03/18/2022] [Accepted: 04/27/2022] [Indexed: 11/13/2022]
Abstract
With the advances in high-throughput biotechnologies, high-dimensional multi-layer omics data have become increasingly available. They can provide both confirmatory and complementary information about disease risk and thus have offered unprecedented opportunities for risk prediction studies. However, the high dimensionality and complex inter/intra-relationships among multi-omics data have brought tremendous analytical challenges. Here we present a computationally efficient penalized linear mixed model with generalized method of moments estimator (MpLMMGMM) for the prediction analysis on multi-omics data. Our method extends the widely used linear mixed model proposed for genomic risk predictions to model multi-omics data, where kernel functions are used to capture various types of predictive effects from different layers of omics data and penalty terms are introduced to reduce the impact of noise. Compared with existing penalized linear mixed models, the proposed method adopts the generalized method of moments estimator and is much more computationally efficient. Through extensive simulation studies and the analysis of positron emission tomography imaging outcomes, we have demonstrated that MpLMMGMM can simultaneously consider a large number of variables and efficiently select those that are predictive from the corresponding omics layers. It can capture both linear and nonlinear predictive effects and achieves better prediction performance than competing methods.
Affiliation(s)
- Xiaqiong Wang
  - Department of Statistics, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- Yalu Wen
  - Department of Statistics, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
|
12
|
Liu YH, Zhang M, Scheuring CF, Cilkiz M, Sze SH, Smith CW, Murray SC, Xu W, Zhang HB. Accurate prediction of complex traits for individuals and offspring from parents using a simple, rapid, and efficient method for gene-based breeding in cotton and maize. Plant Sci 2022; 316:111153. [PMID: 35151437 DOI: 10.1016/j.plantsci.2021.111153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 12/11/2021] [Indexed: 06/14/2023]
Abstract
Accurate, simple, rapid, and inexpensive prediction of complex traits controlled by numerous genes is paramount to enhanced plant breeding, animal breeding, and human medicine. Here we report a novel method that enables accurate, simple, and rapid prediction of complex traits of individuals or offspring from parents based on the number of favorable alleles (NFAs) of the genes controlling the objective traits. The NFAs of 226 cotton fiber length (GFL) genes and nine maize hybrid grain yield related (ZmF1GY) genes were directly used to predict cotton fiber lengths of individual plants and maize grain yields of F1 hybrids from parents, respectively, using prediction model-based methods as controls. The NFAs of the 226 GFL genes predicted cotton fiber lengths at an accuracy of 0.85, matching the model-based methods and outperforming genomic prediction by 82%-170%. The NFAs of the nine ZmF1GY genes predicted grain yields of maize hybrids from parents at an accuracy of 0.80, outperforming genomic prediction by 67%. Moreover, the prediction accuracies of these traits were consistent across years, environments, and eco-agricultural systems. Importantly, the accurate prediction of these traits directly using the NFAs of the genes allows breeding to be performed in a greenhouse, in a phytotron, or off-season, without the model training and validation steps that are essential and costly for model-based genomic or genic prediction. Therefore, this new method dramatically outperforms the current model-based genomic methods used for phenotype prediction and streamlines the process of breeding, thus promising to substantially enhance current plant and animal breeding.
Affiliation(s)
- Yun-Hua Liu
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Meiping Zhang
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Chantel F Scheuring
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Mustafa Cilkiz
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Sing-Hoi Sze
  - Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77843, USA
- C Wayne Smith
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Seth C Murray
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Wenwei Xu
  - Texas A&M AgriLife Research, Lubbock, TX 79403, USA
- Hong-Bin Zhang
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
|
13
|
Wang T, Qiao J, Zhang S, Wei Y, Zeng P. Simultaneous test and estimation of total genetic effect in eQTL integrative analysis through mixed models. Brief Bioinform 2022; 23:6535679. [PMID: 35212359 DOI: 10.1093/bib/bbac038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/28/2021] [Revised: 01/22/2022] [Accepted: 02/07/2021] [Indexed: 11/14/2022]
Abstract
Integration of expression quantitative trait loci (eQTL) into genome-wide association studies (GWASs) is a promising way to reveal functional roles of associated single-nucleotide polymorphisms (SNPs) in complex phenotypes and has become an active research field in the post-GWAS era. However, how to efficiently incorporate eQTL mapping studies into GWAS for prioritization of causal genes remains elusive. We herein proposed a novel method termed Mixed transcriptome-wide association studies (TWAS) and mediated Variance estimation (MTV) by modeling the effects of cis-SNPs of a gene as a function of eQTL. MTV formulates the integrative method and TWAS within a unified framework via mixed models and therefore includes many prior methods/tests as special cases. We further justified MTV from two other statistical perspectives: mediation analysis and two-stage Mendelian randomization. Relative to existing methods, MTV is superior in several pronounced features, including the processing of direct effects of cis-SNPs on phenotypes, the powerful likelihood ratio test for assessment of joint effects of cis-SNPs and genetically regulated gene expression (GReX), two useful quantities to measure relative genetic contributions of GReX and cis-SNPs to phenotypic variance, and the computationally efficient parameter-expansion expectation-maximization algorithm. With extensive simulations, we identified that MTV correctly controlled the type I error in joint evaluation of the total genetic effect and proved more powerful in discovering true association signals across various scenarios compared to existing methods. We finally applied MTV to 41 complex traits/diseases available from three GWASs and discovered many new associated genes that had otherwise been missed by existing methods. We also revealed that a small but substantial fraction of phenotypic variation was mediated by GReX. Overall, MTV constructs a robust and realistic modeling foundation for integrative omics analysis and has the advantage of offering more attractive biological interpretations of GWAS results.
Affiliation(s)
- Ting Wang
  - Department of Biostatistics at Xuzhou Medical University, China
- Jiahao Qiao
  - Department of Biostatistics at Xuzhou Medical University, China
- Shuo Zhang
  - Department of Biostatistics at Xuzhou Medical University, China
- Yongyue Wei
  - Department of Biostatistics at Nanjing Medical University, China
- Ping Zeng
  - Department of Biostatistics, Center for Medical Statistics and Data Analysis and Key Laboratory of Human Genetics and Environmental Medicine at Xuzhou Medical University, China
|
14
|
Demetci P, Cheng W, Darnell G, Zhou X, Ramachandran S, Crawford L. Multi-scale inference of genetic trait architecture using biologically annotated neural networks. PLoS Genet 2021; 17:e1009754. [PMID: 34411094 PMCID: PMC8407593 DOI: 10.1371/journal.pgen.1009754] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/23/2020] [Revised: 08/31/2021] [Accepted: 07/31/2021] [Indexed: 01/01/2023]
Abstract
In this article, we present Biologically Annotated Neural Networks (BANNs), a nonlinear probabilistic framework for association mapping in genome-wide association (GWA) studies. BANNs are feedforward models with partially connected architectures that are based on biological annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. We treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses variational inference to provide posterior summaries which allow researchers to simultaneously perform (i) mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. Through simulations, we show that our method improves upon state-of-the-art association mapping and enrichment approaches across a wide range of genetic architectures. We then further illustrate the benefits of BANNs by analyzing real GWA data assayed in approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics and approximately 7,000 individuals from the Framingham Heart Study. Lastly, using a random subset of individuals of European ancestry from the UK Biobank, we show that BANNs is able to replicate known associations in high and low-density lipoprotein cholesterol content.
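The partially connected architecture described in this abstract can be illustrated with a minimal Python sketch. It is a deterministic, point-estimate forward pass only: it leaves out the prior distributions and the variational inference that the BANNs software uses. The ReLU activation, the function name `snp_set_forward`, and all toy numbers are assumptions made for illustration and are not taken from the paper.

```python
def snp_set_forward(x, snp_to_set, theta, set_bias, w, bias):
    """Forward pass of a partially connected network for one individual.

    x          : genotype values, one per SNP
    snp_to_set : index of the SNP-set (hidden node) each SNP belongs to
    theta      : SNP-level weights (input layer)
    set_bias   : one bias per SNP-set
    w          : SNP-set-level weights (hidden layer to output)
    bias       : output bias
    """
    # Each hidden node starts from its own bias.
    hidden_in = list(set_bias)
    # A SNP contributes only to the hidden node of its annotated SNP-set;
    # this is what makes the layer partially (not fully) connected.
    for j, g in enumerate(snp_to_set):
        hidden_in[g] += x[j] * theta[j]
    # Nonlinear activation per SNP-set (ReLU is an assumption here).
    hidden = [max(0.0, h) for h in hidden_in]
    # Output aggregates the SNP-set effects.
    return sum(h * wg for h, wg in zip(hidden, w)) + bias


# Hypothetical toy example: four SNPs annotated to two SNP-sets.
prediction = snp_set_forward(
    x=[1, 0, 2, 1],
    snp_to_set=[0, 0, 1, 1],
    theta=[0.5, -1.0, 0.25, -2.0],
    set_bias=[0.0, 0.0],
    w=[2.0, 3.0],
    bias=0.1,
)
```

Because each SNP feeds exactly one hidden node, `theta` can be read as SNP-level effects and `w` as SNP-set-level effects, which is the interpretability property the abstract emphasizes.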
Affiliation(s)
- Pinar Demetci
  - Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Wei Cheng
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Gregory Darnell
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Sohini Ramachandran
  - Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Lorin Crawford
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Microsoft Research New England, Cambridge, Massachusetts, United States of America
  - Department of Biostatistics, Brown University, Providence, Rhode Island, United States of America
|
15
|
Hai Y, Wen Y. A Bayesian linear mixed model for prediction of complex traits. Bioinformatics 2021; 36:5415-5423. [PMID: 33331865 PMCID: PMC8016495 DOI: 10.1093/bioinformatics/btaa1023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/26/2020] [Revised: 11/24/2020] [Accepted: 11/27/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Accurate disease risk prediction is essential for precision medicine. Existing models assume that diseases are caused either by groups of predictors with small-to-moderate effects or by a few isolated predictors with large effects. Their performance can be sensitive to the underlying disease mechanisms, which are usually unknown in advance. RESULTS We developed a Bayesian linear mixed model (BLMM), where genetic effects were modelled using a hybrid of the sparsity regression and linear mixed model with multiple random effects. The parameters in BLMM were inferred through a computationally efficient variational Bayes algorithm. The proposed method can resemble the shape of the true effect size distributions, captures the predictive effects from both common and rare variants, and is robust against various disease models. Through extensive simulations and the application to a whole-genome sequencing dataset obtained from the Alzheimer's Disease Neuroimaging Initiative, we have demonstrated that BLMM has better prediction performance than existing methods and can detect variables and/or genetic regions that are predictive. AVAILABILITY AND IMPLEMENTATION The R-package is available at https://github.com/yhai943/BLMM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Hai
  - Department of Statistics, University of Auckland, Auckland 1010, New Zealand
- Yalu Wen
  - Department of Statistics, University of Auckland, Auckland 1010, New Zealand
|
16
|
Zeng P, Dai J, Jin S, Zhou X. Aggregating multiple expression prediction models improves the power of transcriptome-wide association studies. Hum Mol Genet 2021; 30:939-951. [PMID: 33615361 DOI: 10.1093/hmg/ddab056] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 11/23/2020] [Revised: 02/10/2021] [Accepted: 02/15/2021] [Indexed: 12/11/2022]
Abstract
Transcriptome-wide association study (TWAS) is an important integrative method for identifying genes that are causally associated with phenotypes. A key step of TWAS involves the construction of expression prediction models for every gene in turn using its cis-SNPs as predictors. Different TWAS methods rely on different models for gene expression prediction, and each such model makes a distinct modeling assumption that is often suitable for a particular genetic architecture underlying expression. However, the genetic architectures underlying gene expression vary across genes throughout the transcriptome. Consequently, different TWAS methods may be beneficial in detecting genes with distinct genetic architectures. Here, we develop a new method, HMAT, which aggregates TWAS association evidence obtained across multiple gene expression prediction models by leveraging the harmonic mean P-value combination strategy. Because each expression prediction model is suited to capture a particular genetic architecture, aggregating TWAS associations across prediction models as in HMAT improves accurate expression prediction and enables subsequent powerful TWAS analysis across the transcriptome. A key feature of HMAT is its ability to accommodate the correlations among different TWAS test statistics and produce calibrated P-values after aggregation. Through numerical simulations, we illustrated the advantage of HMAT over commonly used TWAS methods as well as ad hoc P-value combination rules such as Fisher's method. We also applied HMAT to analyze summary statistics of nine common diseases. In the real data applications, HMAT was on average 30.6% more powerful compared to the next best method, detecting many new disease-associated genes that were otherwise not identified by existing TWAS approaches. In conclusion, HMAT represents a flexible and powerful TWAS method that enjoys robust performance across a range of genetic architectures underlying gene expression.
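The harmonic mean P-value combination that HMAT builds on can be sketched as follows. This computes only the raw weighted harmonic mean of a set of P-values. The calibration step that converts it into a valid P-value under correlated tests, and any HMAT-specific details, are not reproduced here. The function name `harmonic_mean_p` and the example P-values are hypothetical.

```python
def harmonic_mean_p(p_values, weights=None):
    """Raw weighted harmonic mean of P-values.

    With weights w_i and P-values p_i, returns sum(w_i) / sum(w_i / p_i).
    Equal weights are used when none are given.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one P-value is required")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    # Small P-values dominate the sum of reciprocals, so one strong
    # signal is not washed out by many null results.
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


# Hypothetical P-values for one gene from three expression prediction models.
combined = harmonic_mean_p([0.01, 0.5, 0.5])
```

The raw value is dominated by the smallest P-values, which is why this style of combination suits a setting where only one of several expression models may match a gene's genetic architecture.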
Affiliation(s)
- Ping Zeng
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
  - Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Jing Dai
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Siyi Jin
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
|
17
|
Coupled mixed model for joint genetic analysis of complex disorders with two independently collected data sets. BMC Bioinformatics 2021; 22:50. [PMID: 33546598 PMCID: PMC7866684 DOI: 10.1186/s12859-021-03959-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2020] [Accepted: 01/06/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND In the last decade, genome-wide association studies (GWASs) have contributed to decoding the human genome by uncovering many genetic variations associated with various diseases. Many follow-up investigations involve joint analysis of multiple independently generated GWAS data sets. While most of the computational approaches developed for joint analysis are based on summary statistics, joint analysis based on individual-level data with consideration of confounding factors remains a challenge. RESULTS In this study, we propose a method, called Coupled Mixed Model (CMM), that enables a joint GWAS analysis on two independently collected sets of GWAS data with different phenotypes. The CMM method does not require the data sets to have the same phenotypes as it aims to infer the unknown phenotypes using a set of multivariate sparse mixed models. Moreover, CMM addresses the confounding variables due to population stratification, family structures, and cryptic relatedness, as well as those arising during data collection such as batch effects that frequently appear in joint genetic studies. We evaluate the performance of CMM using simulation experiments. In real data analysis, we illustrate the utility of CMM by an application to evaluating common genetic associations for Alzheimer's disease and substance use disorder using datasets independently collected for the two complex human disorders. Comparison of the results with those from previous experiments and analyses supports the utility of our method and provides new insights into the diseases. The software is available at https://github.com/HaohanWang/CMM.
|
18
|
Li J, Lu Q, Wen Y. Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data. Bioinformatics 2020; 36:1785-1794. [PMID: 31693075 DOI: 10.1093/bioinformatics/btz822] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/10/2019] [Revised: 10/08/2019] [Accepted: 11/01/2019] [Indexed: 12/11/2022]
Abstract
MOTIVATION The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. RESULTS We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction, for multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions via using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer's Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. AVAILABILITY AND IMPLEMENTATION The R-package is available at https://github.com/YaluWen/OmicPred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Li
  - Department of Thoracic Surgery, Dalian Municipal Central Hospital Affiliated of Dalian Medical University, Dalian 116000, China
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
- Yalu Wen
  - Department of Statistics, University of Auckland, Auckland 1010, New Zealand
|
19
|
Wen Y, Lu Q. Multikernel linear mixed model with adaptive lasso for complex phenotype prediction. Stat Med 2020; 39:1311-1327. [PMID: 31985088 DOI: 10.1002/sim.8477] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/21/2019] [Revised: 11/17/2019] [Accepted: 12/24/2019] [Indexed: 12/15/2022]
Abstract
Linear mixed models (LMMs) and their extensions have been widely used for high-dimensional genomic data analyses. While LMMs hold great promise for risk prediction research, the high dimensionality of the data and different effect sizes of genomic regions bring great analytical and computational challenges. In this work, we present a multikernel linear mixed model with adaptive lasso (KLMM-AL) to predict phenotypes using high-dimensional genomic data. We develop two algorithms for estimating parameters from our model and also establish the asymptotic properties of LMM with adaptive lasso when only one dependent observation is available. The proposed KLMM-AL can account for heterogeneous effect sizes from different genomic regions, capture both additive and nonadditive genetic effects, and adaptively and efficiently select predictive genomic regions and their corresponding effects. Through simulation studies, we demonstrate that KLMM-AL outperforms most existing methods. Moreover, KLMM-AL achieves high sensitivity and specificity in selecting predictive genomic regions. KLMM-AL is further illustrated by an application to the sequencing dataset obtained from the Alzheimer's Disease Neuroimaging Initiative.
Affiliation(s)
- Yalu Wen
  - Department of Statistics, The University of Auckland, Auckland, New Zealand
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
|
20
|
Wang X, Wen Y. A U-statistics for integrative analysis of multilayer omics data. Bioinformatics 2020; 36:2365-2374. [PMID: 31913435 DOI: 10.1093/bioinformatics/btaa004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2019] [Revised: 12/09/2019] [Accepted: 01/02/2020] [Indexed: 11/12/2022]
Abstract
MOTIVATION The emerging multilayer omics data provide unprecedented opportunities for detecting biomarkers that are associated with complex diseases at various molecular levels. However, the high dimensionality of multiomics data and the complex disease etiologies have brought tremendous analytical challenges. RESULTS We developed a U-statistics-based non-parametric framework for the association analysis of multilayer omics data, where consensus and permutation-based weighting schemes are developed to account for various types of disease models. Our proposed method is flexible for analyzing different types of outcomes as it makes no assumptions about their distributions. Moreover, it explicitly accounts for various types of underlying disease models through weighting schemes and thus provides robust performance against them. Through extensive simulations and the application to a dataset obtained from the Alzheimer's Disease Neuroimaging Initiative, we demonstrated that our method outperformed the commonly used kernel regression-based methods. AVAILABILITY AND IMPLEMENTATION The R-package is available at https://github.com/YaluWen/Uomic. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaqiong Wang
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Yalu Wen
  - Department of Statistics, University of Auckland, Auckland, New Zealand
|
21
|
Liu YH, Xu Y, Zhang M, Cui Y, Sze SH, Smith CW, Xu S, Zhang HB. Accurate Prediction of a Quantitative Trait Using the Genes Controlling the Trait for Gene-Based Breeding in Cotton. Front Plant Sci 2020; 11:583277. [PMID: 33281846 PMCID: PMC7690289 DOI: 10.3389/fpls.2020.583277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Accepted: 10/15/2020] [Indexed: 05/03/2023]
Abstract
Accurate phenotype prediction of quantitative traits is paramount to enhanced plant research and breeding. Here, we report the accurate prediction of cotton fiber length, a typical quantitative trait, using 474 cotton (Gossypium ssp.) fiber length (GFL) genes and nine prediction models. When the SNPs/InDels contained in 226 of the GFL genes or the expressions of all 474 GFL genes were used for fiber length prediction, a prediction accuracy of r = 0.83 was obtained, approaching the maximally possible prediction accuracy of a quantitative trait. This improves by 116% on the prediction accuracies of fiber length thus far achieved for genomic selection using genome-wide random DNA markers. Moreover, analysis of the GFL genes identified 125 of the GFL genes that are key to accurate prediction of fiber length, with which a prediction accuracy similar to that of all 474 GFL genes was obtained. The fiber lengths of the plants predicted with expressions of the 125 key GFL genes were significantly correlated with those predicted with the SNPs/InDels of the above 226 SNP/InDel-containing GFL genes (r = 0.892, P = 0.000). The prediction accuracies of fiber length using both genic datasets were highly consistent across environments or generations. Finally, we found that a training population consisting of 100-120 plants was sufficient to train a model for accurate prediction of a quantitative trait using the genes controlling the trait. Therefore, the genes controlling a quantitative trait are capable of accurately predicting its phenotype, thereby dramatically improving the ability, accuracy, and efficiency of phenotype prediction and promoting gene-based breeding in cotton and other species.
Affiliation(s)
- Yun-Hua Liu
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
- Yang Xu
  - Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
- Meiping Zhang
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
- Yanru Cui
  - Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
- Sing-Hoi Sze
  - Department of Computer Science and Engineering and Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, United States
- C. Wayne Smith
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
- Shizhong Xu
  - Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States
  - *Correspondence: Shizhong Xu
- Hong-Bin Zhang
  - Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
  - *Correspondence: Hong-Bin Zhang
|
22
|
Zeng P, Hao X, Zhou X. Pleiotropic mapping and annotation selection in genome-wide association studies with penalized Gaussian mixture models. Bioinformatics 2019; 34:2797-2807. [PMID: 29635306 DOI: 10.1093/bioinformatics/bty204] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 09/26/2017] [Accepted: 04/02/2018] [Indexed: 12/11/2022]
Abstract
MOTIVATION Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci is associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits with genome-wide single nucleotide polymorphisms (SNPs) together. RESULTS We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association patterns using mixture modeling and has the potential to reveal causal relationships between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially non-informative set. To enable scalable inference of iMAP to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP in terms of both high association mapping power and accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. AVAILABILITY AND IMPLEMENTATION iMAP is freely available at http://www.xzlab.org/software.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ping Zeng
  - Department of Epidemiology and Biostatistics, Xuzhou Medical University, Xuzhou, Jiangsu, China
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
  - Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Xingjie Hao
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
  - Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
  - Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
|
23
|
Mueller LD, Phillips MA, Barter TT, Greenspan ZS, Rose MR. Genome-Wide Mapping of Gene-Phenotype Relationships in Experimentally Evolved Populations. Mol Biol Evol 2019; 35:2085-2095. [PMID: 29860403 DOI: 10.1093/molbev/msy113] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022]
Abstract
Model organisms subjected to sustained experimental evolution often show levels of phenotypic differentiation that dramatically exceed the phenotypic differences observed in natural populations. Genome-wide sequencing of pooled populations then offers the opportunity to make inferences about the genes that are the cause of these phenotypic differences. We tested, through computer simulations, the efficacy of a statistical learning technique called the "fused lasso additive model" (FLAM). We focused on the ability of FLAM to distinguish between genes which are differentiated and directly affect a phenotype from differentiated genes which have no effect on the phenotype. FLAM can separate these two classes of genes even with relatively small samples (10 populations, in total). The efficacy of FLAM is improved with increased number of populations, reduced environmental phenotypic variation, and increased within-treatment among-replicate variation. FLAM was applied to SNP variation measured in both twenty-population and thirty-population studies of Drosophila subjected to selection for age-at-reproduction, to illustrate the application of the method.
Affiliation(s)
- Laurence D Mueller
- Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA
- Mark A Phillips
- Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA
- Thomas T Barter
- Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA
- Zachary S Greenspan
- Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA
- Michael R Rose
- Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA
|
24
|
Crawford L, Flaxman SR, Runcie DE, West M. Variable Prioritization in Nonlinear Black Box Methods: A Genetic Association Case Study. Ann Appl Stat 2019; 13:958-989. [PMID: 32542104 PMCID: PMC7295151 DOI: 10.1214/18-aoas1222] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 12/13/2022]
Abstract
The central aim in this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the "RelATive cEntrality" (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data. We illustrate RATE through Bayesian Gaussian process regression, but the methodological innovations apply to other "black box" methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and two real data association mapping studies, we show that applying RATE enables an explanation for this improved performance.
|
25
|
Jackknife Model Averaging Prediction Methods for Complex Phenotypes with Gene Expression Levels by Integrating External Pathway Information. Comput Math Methods Med 2019; 2019:2807470. [PMID: 31089389 PMCID: PMC6476151 DOI: 10.1155/2019/2807470] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/13/2019] [Accepted: 03/20/2019] [Indexed: 01/03/2023]
Abstract
Motivation: In the past few years, many prediction approaches have been proposed and widely applied to high-dimensional genetic data for disease risk evaluation. However, those approaches typically ignore, in model fitting, the important group structures that naturally exist in genetic data. Methods: In the present study, we applied a novel model-averaging approach, called jackknife model averaging prediction (JMAP), for high-dimensional genetic risk prediction while incorporating pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a cross-validation criterion in a jackknife way. Compared with previous approaches, one of the primary features of JMAP is that it allows model weights to vary from 0 to 1 without requiring that the weights sum to one. We evaluated the performance of JMAP using extensive simulation studies and compared it with existing methods. We finally applied JMAP to four real cancer datasets that are publicly available from TCGA. Results: The simulations showed that, compared with other existing approaches (e.g., gsslasso), JMAP performed best or was among the best methods across a range of scenarios. For example, in 14 out of 16 simulation settings with PVE = 0.3, JMAP had an average of 0.075 higher prediction accuracy than gsslasso. We further found that, in the simulations, the model weights for the true candidate models were much less likely to be zero than those for the null candidate models and were substantially greater in magnitude. In the real data application, JMAP also performed comparably to or better than the other methods for continuous phenotypes. For example, for the COAD, CRC, and PAAD datasets, the average gains in predictive accuracy of JMAP were 0.019, 0.064, and 0.052 compared with gsslasso.
Conclusion: The proposed method JMAP is a novel model-averaging approach for high-dimensional genetic risk prediction that incorporates useful external group structures into the model specification.
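The weighting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one ordinary least-squares candidate model per pathway, uses the standard leave-one-out identity for linear smoothers to obtain jackknife predictions, and finds weights in [0, 1] (with no sum-to-one constraint) by projected gradient descent. All function names and the simulated data are hypothetical.

```python
import numpy as np


def loo_predictions(X, y):
    """Leave-one-out (jackknife) predictions from an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    hat = design @ np.linalg.pinv(design)        # hat matrix H, so fitted = H @ y
    resid = y - hat @ y
    # Identity for linear smoothers: LOO residual = residual / (1 - leverage).
    return y - resid / (1.0 - np.diag(hat))


def jackknife_weights(P, y, n_iter=5000):
    """Minimise ||y - P w||^2 subject to 0 <= w_k <= 1 (weights need not sum to 1)."""
    w = np.full(P.shape[1], 1.0 / P.shape[1])
    step = 1.0 / np.linalg.norm(P, 2) ** 2       # 1 / largest eigenvalue of P'P
    for _ in range(n_iter):
        grad = P.T @ (P @ w - y)
        w = np.clip(w - step * grad, 0.0, 1.0)   # project back onto the box [0, 1]
    return w


# Simulated example (assumption): three pathways, only the first affects y.
rng = np.random.default_rng(0)
n = 200
pathways = [rng.normal(size=(n, 5)) for _ in range(3)]
y = pathways[0] @ rng.normal(size=5) + rng.normal(size=n)

# One column of jackknife predictions per candidate (pathway) model.
P = np.column_stack([loo_predictions(X, y) for X in pathways])
weights = jackknife_weights(P, y)
print(weights)
```

In this toy setting the weight on the causal pathway should be large while the null pathways receive small weights, which is the qualitative behaviour the abstract reports for true versus null candidate models.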
|
26
|
Zhang H, Yin L, Wang M, Yuan X, Liu X. Factors Affecting the Accuracy of Genomic Selection for Agricultural Economic Traits in Maize, Cattle, and Pig Populations. Front Genet 2019; 10:189. [PMID: 30923535 PMCID: PMC6426750 DOI: 10.3389/fgene.2019.00189] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 12/10/2018] [Accepted: 02/21/2019] [Indexed: 11/20/2022]
Abstract
Genomic Selection (GS) has proved to be a powerful tool for estimating genetic values in plant and livestock breeding. Newly developed sequencing technologies have dramatically reduced the cost of genotyping and significantly increased the scale of genotype data used for GS. Meanwhile, state-of-the-art statistical methods have been developed to make the best use of high marker density genotype data. In this study, 14 traits from four data sets of three species (maize, cattle, and pig) and five factors that affect prediction accuracy were evaluated, including marker density (from 1 to ~600 k), statistical method (GBLUP-A, GBLUP-AD, and BayesR), minor allele frequency (MAF), heritability, and genetic architecture. Results indicate that, for the GBLUP method, higher marker density leads to higher prediction accuracy. In contrast, the BayesR method needs more Markov chain Monte Carlo (MCMC) iterations to reach convergence and give reliable predictions. BayesR outperforms GBLUP in predicting high- or medium-heritability traits affected by one or several genes with large effects, while GBLUP performs similarly to or slightly better than BayesR in predicting low-heritability traits controlled by a large number of genes with minor effects. Prediction accuracy for traits with complex genetic architecture can be improved by increasing the marker density. Interestingly, for simple traits controlled by one or several genes with large effects, higher marker density can lower prediction accuracy if the QTN is included, but raises it if the QTN is excluded. The number of genetic markers with low MAF does not significantly affect the prediction accuracy of GBLUP, but results in poor prediction accuracy for BayesR. Compared with GBLUP-A, GBLUP-AD did not show any advantage in capturing non-additive variance for traits with high heritability. The factors that affect prediction accuracy are discussed in this study and indicate that combining either GBLUP or BayesR with moderate marker density and favorably polymorphic single nucleotide polymorphisms (SNPs; ~25 k) would always produce good and stable prediction accuracy with acceptable breeding and computational costs.
Affiliation(s)
- Haohao Zhang
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Lilin Yin
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education and Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, College of Animal Science and Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
- Meiyue Wang
- Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
- Xiaohui Yuan
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Xiaolei Liu
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education and Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, College of Animal Science and Veterinary Medicine, Huazhong Agricultural University, Wuhan, China
|
27
|
Weissbrod O, Rothschild D, Barkan E, Segal E. Host genetics and microbiome associations through the lens of genome wide association studies. Curr Opin Microbiol 2018; 44:9-19. [PMID: 29909175 DOI: 10.1016/j.mib.2018.05.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 01/25/2018] [Revised: 03/15/2018] [Accepted: 05/25/2018] [Indexed: 12/22/2022]
Abstract
Recent studies indicate that the gut microbiome is partially heritable, motivating the need to investigate microbiome-host genome associations via microbial genome-wide association studies (mGWAS). Existing mGWAS demonstrate that microbiome-host genotype associations are typically weak and are spread across multiple variants, similar to associations often observed in genome-wide association studies (GWAS) of complex traits. Here we reconsider mGWAS by viewing them through the lens of GWAS, and demonstrate that there are striking similarities between the challenges and pitfalls faced by the two study designs. We further urge the mGWAS community to adopt three key lessons learned over the history of GWAS: firstly, adopting uniform data and reporting formats to facilitate replication and meta-analysis efforts; secondly, enforcing stringent statistical criteria to reduce the number of false positive findings; and thirdly, considering the microbiome and the host genome as distinct entities, rather than studying different taxa and single nucleotide polymorphisms (SNPs) separately. Finally, we anticipate that mGWAS sample sizes will have to increase by orders of magnitude to reproducibly associate the host genome with the gut microbiome.
Affiliation(s)
- Omer Weissbrod
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 7610001, Israel
- Daphna Rothschild
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 7610001, Israel
- Elad Barkan
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 7610001, Israel
- Eran Segal
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 7610001, Israel
|
28
|
Márquez-Luna C, Loh PR, Price AL. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet Epidemiol 2017; 41:811-823. [PMID: 29110330 PMCID: PMC5726434 DOI: 10.1002/gepi.22083] [Citation(s) in RCA: 183] [Impact Index Per Article: 26.1] [Received: 04/12/2017] [Revised: 08/16/2017] [Accepted: 08/30/2017] [Indexed: 01/04/2023]
Abstract
Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involves European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either training data from European samples in large sample size or training data from the target population in small sample size, but not both. Here, we introduce a multiethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics in large sample size (Neff = 40k) and Latino training data in small sample size (Neff = 8k). Here, we attained a >70% relative improvement in prediction accuracy (from R2 = 0.027 to 0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. We predict T2D in a South Asian UK Biobank cohort using European (Neff = 40k) and South Asian (Neff = 16k) training data and attained a >70% relative improvement in prediction accuracy, and application to predict height in an African UK Biobank cohort using European (N = 113k) and African (N = 2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
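One way to picture the combination step is as a weighted sum of two polygenic scores. The sketch below assumes (the abstract does not spell this out) that a European-trained score and a target-population score are mixed linearly, with mixing weights fitted by least squares in target-population data, and it treats the phenotype as continuous for simplicity. The simulated scores and all names are hypothetical.

```python
import numpy as np


def fit_mixing_weights(scores, y):
    """Least-squares intercept and mixing weights for the score columns."""
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def r_squared(scores, y, coef):
    """Proportion of phenotypic variance explained by the fitted combination."""
    pred = coef[0] + scores @ coef[1:]
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


# Simulated data (assumption): both scores are noisy proxies of one liability.
rng = np.random.default_rng(1)
n = 2000
liability = rng.normal(size=n)
prs_eur = liability + rng.normal(scale=2.0, size=n)     # stand-in for the European-trained score
prs_target = liability + rng.normal(scale=3.0, size=n)  # stand-in for the target-trained score
y = liability + rng.normal(size=n)

both = np.column_stack([prs_eur, prs_target])
r2 = {}
for name, cols in [("eur", both[:, [0]]), ("target", both[:, [1]]), ("both", both)]:
    r2[name] = r_squared(cols, y, fit_mixing_weights(cols, y))
print(r2)
```

Because the weights here are fitted and evaluated on the same samples, the combined R² cannot fall below either single-score R²; a fair comparison would evaluate the fitted weights on held-out individuals.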
Affiliation(s)
- Carla Márquez-Luna
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Po-Ru Loh
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Alkes L Price
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
|
29
|
Mangin B, Bonnafous F, Blanchet N, Boniface MC, Bret-Mestries E, Carrère S, Cottret L, Legrand L, Marage G, Pegot-Espagnet P, Munos S, Pouilly N, Vear F, Vincourt P, Langlade NB. Genomic Prediction of Sunflower Hybrids Oil Content. Front Plant Sci 2017; 8:1633. [PMID: 28983306 DOI: 10.3389/fpls.2017.01633d] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/28/2017] [Accepted: 09/06/2017] [Indexed: 05/26/2023]
Abstract
Prediction of hybrid performance using incomplete factorial mating designs is widely used in breeding programs including different heterotic groups. Based on the general combining ability (GCA) of the parents, predictions are accurate only if the genetic variance resulting from the specific combining ability is small and both parents have phenotyped descendants. Genomic selection (GS) can predict performance using a model trained on both phenotyped and genotyped hybrids that do not necessarily include all hybrid parents. Therefore, GS could overcome the issue of unknown parent GCA. Here, we compared the accuracy of classical GCA-based and genomic predictions for oil content of sunflower seeds using several GS models. Our study involved 452 sunflower hybrids from an incomplete factorial design of 36 female and 36 male lines. Re-sequencing of parental lines allowed us to identify 468,194 non-redundant SNPs and to infer the hybrid genotypes. Oil content was observed in a multi-environment trial (MET) over 3 years, leading to nine different environments. We compared a GCA-based model to different GS models including female and male genomic kinships with the addition of the female-by-male interaction genomic kinship, the use of functional knowledge as SNPs in genes of oil metabolic pathways, and with epistasis modeling. When both parents have descendants in the training set, the predictive ability was high even for GCA-based prediction, with an average MET value of 0.782. GS performed slightly better (+0.2%). Neither the inclusion of the female-by-male interaction, nor functional knowledge of oil metabolism, nor epistasis modeling improved the GS accuracy. GS greatly improved predictive ability when one or both parents were untested in the training set, increasing GCA-based predictive ability by 10.4% from 0.575 to 0.635 in the MET. In this scenario, performing GS only considering SNPs in oil metabolic pathways did not improve whole genome GS prediction but increased GCA-based prediction ability by 6.4%. Our results show that GS is a major improvement to breeding efficiency compared to the classical GCA modeling when either one or both parents are not well-characterized. This finding could therefore accelerate breeding through reducing phenotyping efforts and more effectively targeting the most promising crosses.
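The 10.4% figure in this abstract is a relative gain, which follows directly from the two reported predictive abilities. As a quick arithmetic check (the helper name is hypothetical):

```python
def relative_gain(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Predictive abilities reported for the untested-parent scenario.
gain = relative_gain(0.575, 0.635)
print(f"{gain:.1%}")  # prints 10.4%
```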
Affiliation(s)
- Brigitte Mangin
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Fanny Bonnafous
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas Blanchet
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Marie-Claude Boniface
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Sébastien Carrère
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Ludovic Cottret
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Ludovic Legrand
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Gwenola Marage
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Prune Pegot-Espagnet
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Stéphane Munos
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas Pouilly
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Felicity Vear
- GDEC, INRA, Université Clermont II Blaise Pascal, Clermont-Ferrand, France
- Patrick Vincourt
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas B Langlade
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
|
30
|
Zeng P, Zhou X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nat Commun 2017; 8:456. [PMID: 28878256 PMCID: PMC5587666 DOI: 10.1038/s41467-017-00470-2] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 11/26/2016] [Accepted: 06/30/2017] [Indexed: 01/03/2023]
Abstract
Using genotype data to perform accurate genetic prediction of complex traits can facilitate genomic selection in animal and plant breeding programs, and can aid in the development of personalized medicine in humans. Because most complex traits have a polygenic architecture, accurate genetic prediction often requires modeling all genetic variants together via polygenic methods. Here, we develop such a polygenic method, which we refer to as the latent Dirichlet process regression model. Dirichlet process regression is non-parametric in nature, relies on the Dirichlet process to flexibly and adaptively model the effect size distribution, and thus enjoys robust prediction performance across a broad spectrum of genetic architectures. We compare Dirichlet process regression with several commonly used prediction methods with simulations. We further apply Dirichlet process regression to predict gene expressions, to conduct PrediXcan based gene set test, to perform genomic selection of four traits in two species, and to predict eight complex traits in a human cohort.
Genetic prediction of complex traits with polygenic architecture has wide application from animal breeding to disease prevention. Here, Zeng and Zhou develop a non-parametric genetic prediction method based on latent Dirichlet Process regression models.
|
31
|
Zeng P, Zhou X, Huang S. Prediction of gene expression with cis-SNPs using mixed models and regularization methods. BMC Genomics 2017; 18:368. [PMID: 28490319 PMCID: PMC5425981 DOI: 10.1186/s12864-017-3759-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 10/20/2016] [Accepted: 05/03/2017] [Indexed: 12/25/2022]
Abstract
Background: It has been shown that gene expression in human tissues is heritable; thus, predicting gene expression using only SNPs becomes possible. The prediction of gene expression can offer important insights into the genetic architecture of individual functionally associated SNPs and into the molecular basis underlying human diseases. Methods: We compared three types of methods for predicting gene expression using only cis-SNPs: the polygenic model, i.e., the linear mixed model (LMM); two sparse models, i.e., Lasso and elastic net (ENET); and the hybrid of LMM and sparse models, i.e., the Bayesian sparse linear mixed model (BSLMM). The three types of prediction methods make very different assumptions about the underlying genetic architecture. These methods were evaluated using simulations under various scenarios and were applied to the Geuvadis gene expression data. Results: The simulations showed that the four prediction methods (i.e., Lasso, ENET, LMM and BSLMM) behaved best when their respective modeling assumptions were satisfied, but BSLMM had a robust performance across a range of scenarios. According to the R² of these models in the Geuvadis data, the four methods performed quite similarly. We did not observe any clustering or enrichment of predictive genes (defined as genes with R² ≥ 0.05) across the chromosomes, nor any clear relationship between the proportion of predictive genes and the proportion of genes on each chromosome. However, an interesting finding in the Geuvadis data was that highly predictive genes (e.g., R² ≥ 0.30) may have sparse genetic architectures, since Lasso, ENET and BSLMM outperformed LMM for these genes; this observation was validated in another gene expression dataset. We further showed that the predictive genes were enriched in approximately independent LD blocks.
Conclusions: Gene expression can be predicted with only cis-SNPs using well-developed prediction models, and these predictive genes were enriched in some approximately independent LD blocks. The prediction of gene expression can shed some light on the functional interpretation of SNPs identified in GWASs.
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, Xuzhou Medical University, 209 Tongshan Rd, Xuzhou, Jiangsu, 221004, China; Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI, 48104, USA
- Xiang Zhou
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI, 48104, USA
- Shuiping Huang
- Department of Epidemiology and Biostatistics, Xuzhou Medical University, 209 Tongshan Rd, Xuzhou, Jiangsu, 221004, China
|
32
|
Mangin B, Bonnafous F, Blanchet N, Boniface MC, Bret-Mestries E, Carrère S, Cottret L, Legrand L, Marage G, Pegot-Espagnet P, Munos S, Pouilly N, Vear F, Vincourt P, Langlade NB. Genomic Prediction of Sunflower Hybrids Oil Content. Front Plant Sci 2017; 8:1633. [PMID: 28983306 PMCID: PMC5613134 DOI: 10.3389/fpls.2017.01633] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 06/28/2017] [Accepted: 09/06/2017] [Indexed: 05/18/2023]
Affiliation(s)
- Brigitte Mangin
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- *Correspondence: Brigitte Mangin
- Fanny Bonnafous
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas Blanchet
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Marie-Claude Boniface
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Sébastien Carrère
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Ludovic Cottret
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Ludovic Legrand
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Gwenola Marage
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Prune Pegot-Espagnet
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Stéphane Munos
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas Pouilly
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Felicity Vear
- GDEC, INRA, Université Clermont II Blaise Pascal, Clermont-Ferrand, France
- Patrick Vincourt
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
- Nicolas B Langlade
- LIPM, Université de Toulouse, INRA, Centre National de la Recherche Scientifique, Castanet-Tolosan, France
|