1
Zhao T, Wang F, Mott R, Dekkers J, Cheng H. Using encrypted genotypes and phenotypes for collaborative genomic analyses to maintain data confidentiality. Genetics 2024; 226:iyad210. [PMID: 38085098 PMCID: PMC11090459 DOI: 10.1093/genetics/iyad210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 11/13/2023] [Indexed: 03/08/2024]
Abstract
To adhere to and capitalize on the benefits of the FAIR (findable, accessible, interoperable, and reusable) principles in agricultural genome-to-phenome studies, it is crucial to address privacy and intellectual property issues that prevent sharing and reuse of data in research and industry. Direct sharing of genotype and phenotype data is often prohibited due to intellectual property and privacy concerns. Thus, there is a pressing need for encryption methods that obscure confidential aspects of the data, without affecting the outcomes of certain statistical analyses. A homomorphic encryption method for genotypes and phenotypes (HEGP) has been proposed for single-marker regression in genome-wide association studies (GWAS) using linear mixed models with Gaussian errors. This methodology permits frequentist likelihood-based parameter estimation and inference. In this paper, we extend HEGP to broader applications in genome-to-phenome analyses. We show that HEGP is suited to commonly used linear mixed models for genetic analyses of quantitative traits including genomic best linear unbiased prediction (GBLUP) and ridge-regression best linear unbiased prediction (RR-BLUP), as well as Bayesian variable selection methods (e.g. those in Bayesian Alphabet), for genetic parameter estimation, genomic prediction, and GWAS. By advancing the capabilities of HEGP, we offer researchers and industry professionals a secure and efficient approach for collaborative genomic analyses while preserving data confidentiality.
Affiliation(s)
- Tianjing Zhao
  - Department of Animal Science, University of California, Davis, CA 95616, USA
  - Department of Animal Science, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
- Fangyi Wang
  - Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Richard Mott
  - Genetics Institute, University College London, London, WC1E 6BT, UK
- Jack Dekkers
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Hao Cheng
  - Department of Animal Science, University of California, Davis, CA 95616, USA
2
Zhao T, Cheng H. Interpreting single-step genomic evaluation as a neural network of three layers: pedigree, genotypes, and phenotypes. Genet Sel Evol 2023; 55:68. [PMID: 37789273 PMCID: PMC10546757 DOI: 10.1186/s12711-023-00838-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Accepted: 09/08/2023] [Indexed: 10/05/2023]
Abstract
The single-step approach has become the most widely used methodology for genomic evaluations when only a subset of phenotyped individuals in the pedigree are genotyped, where the genotypes for non-genotyped individuals are imputed based on gene contents (i.e., genotypes) of genotyped individuals through their pedigree relationships. We proposed a new method named single-step neural network with mixed models (NNMM) to represent single-step genomic evaluations as a neural network of three sequential layers: pedigree, genotypes, and phenotypes. These three sequential layers of information create a unified network instead of two separate steps, allowing the unobserved gene contents of non-genotyped individuals to be sampled based on pedigree, observed genotypes of genotyped individuals, and phenotypes. In addition to imputing genotypes from all three sources of information (phenotypes, genotypes, and pedigree), single-step NNMM provides a more flexible framework that allows nonlinear relationships between genotypes and phenotypes and allows individuals to be genotyped with different single-nucleotide polymorphism (SNP) panels. The single-step NNMM has been implemented in the software package "JWAS".
Affiliation(s)
- Tianjing Zhao
  - Department of Animal Science, University of California Davis, Davis, CA, 95616, USA
  - Integrative Genetics and Genomics Graduate Group, University of California Davis, Davis, CA, 95616, USA
- Hao Cheng
  - Department of Animal Science, University of California Davis, Davis, CA, 95616, USA
3
Alves AAC, Fernandes AFA, Lopes FB, Breen V, Hawken R, Gianola D, Rosa GJDM. (Quasi) multitask support vector regression with heuristic hyperparameter optimization for whole-genome prediction of complex traits: a case study with carcass traits in broilers. G3 (Bethesda) 2023; 13:jkad109. [PMID: 37216670 PMCID: PMC10411556 DOI: 10.1093/g3journal/jkad109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 03/13/2023] [Accepted: 04/24/2023] [Indexed: 05/24/2023]
Abstract
This study investigates nonlinear kernels for multitrait (MT) genomic prediction using support vector regression (SVR) models. We assessed the predictive ability delivered by single-trait (ST) and MT models for 2 carcass traits (CT1 and CT2) measured in purebred broiler chickens. The MT models also included information on indicator traits measured in vivo [growth and feed efficiency trait (FE)]. We proposed an approach termed (quasi) multitask SVR (QMTSVR), with hyperparameter optimization performed via genetic algorithm. ST and MT Bayesian shrinkage and variable selection models [genomic best linear unbiased predictor (GBLUP), BayesC (BC), and reproducing kernel Hilbert space (RKHS) regression] were employed as benchmarks. MT models were trained using 2 validation designs (CV1 and CV2), which differ in whether information on secondary traits is available in the testing set. Models' predictive ability was assessed with prediction accuracy (ACC; i.e. the correlation between predicted and observed values, divided by the square root of phenotype accuracy), standardized root-mean-squared error (RMSE*), and inflation factor (b). To account for potential bias in CV2-style predictions, we also computed a parametric estimate of accuracy (ACCpar). Predictive ability metrics varied according to trait, model, and validation design (CV1 or CV2), ranging from 0.71 to 0.84 for ACC, from 0.78 to 0.92 for RMSE*, and from 0.82 to 1.34 for b. The highest ACC and smallest RMSE* were achieved with QMTSVR-CV2 in both traits. We observed that for CT1, model/validation design selection was sensitive to the choice of accuracy metric (ACC or ACCpar). Nonetheless, the higher predictive accuracy of QMTSVR over MTGBLUP and MTBC was replicated across accuracy metrics, as was the similar performance of the proposed method and the MTRKHS model. Results showed that the proposed approach is competitive with conventional MT Bayesian regression models using either Gaussian or spike-slab multivariate priors.
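The three predictive-ability metrics named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function name, the `acc_scale` argument (a stand-in for the quantity whose square root rescales the raw correlation), the standardization of RMSE by the standard deviation of the observed values, and the definition of b as the slope from regressing observed on predicted values are all assumptions made for illustration.

```python
import numpy as np

def prediction_metrics(observed, predicted, acc_scale=1.0):
    """Return (ACC, RMSE*, b) for one validation fold.

    acc_scale is a hypothetical stand-in for the quantity whose square
    root rescales the raw correlation; 1.0 leaves it unscaled.
    """
    y = np.asarray(observed, dtype=float)
    y_hat = np.asarray(predicted, dtype=float)

    # Raw correlation between predicted and observed values.
    r = np.corrcoef(y, y_hat)[0, 1]
    acc = r / np.sqrt(acc_scale)

    # Assumption: RMSE is standardized by the SD of the observed values.
    rmse_std = np.sqrt(np.mean((y - y_hat) ** 2)) / np.std(y, ddof=1)

    # Assumption: b is the slope from regressing observed on predicted.
    b = np.cov(y, y_hat, ddof=1)[0, 1] / np.var(y_hat, ddof=1)
    return acc, rmse_std, b
```

Under these assumptions, b above 1 indicates predictions that are less dispersed than the observations, and b below 1 indicates the opposite.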
Affiliation(s)
- Vivian Breen
  - Cobb-Vantress Inc., Siloam Springs, AR 72761, USA
- Daniel Gianola
  - Department of Animal and Dairy Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA
4
Morgante F, Carbonetto P, Wang G, Zou Y, Sarkar A, Stephens M. A flexible empirical Bayes approach to multivariate multiple regression, and its improved accuracy in predicting multi-tissue gene expression from genotypes. PLoS Genet 2023; 19:e1010539. [PMID: 37418505 PMCID: PMC10355440 DOI: 10.1371/journal.pgen.1010539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 06/02/2023] [Indexed: 07/09/2023]
Abstract
Predicting phenotypes from genotypes is a fundamental task in quantitative genetics. With technological advances, it is now possible to measure multiple phenotypes in large samples. Multiple phenotypes can share their genetic component; therefore, modeling these phenotypes jointly may improve prediction accuracy by leveraging effects that are shared across phenotypes. However, effects can be shared across phenotypes in a variety of ways, so computationally efficient statistical methods are needed that can accurately and flexibly capture patterns of effect sharing. Here, we describe new Bayesian multivariate, multiple regression methods that, by using flexible priors, are able to model and adapt to different patterns of effect sharing and specificity across phenotypes. Simulation results show that these new methods are fast and improve prediction accuracy compared with existing methods in a wide range of settings where effects are shared. Further, in settings where effects are not shared, our methods still perform competitively with state-of-the-art methods. In real data analyses of expression data in the Genotype Tissue Expression (GTEx) project, our methods improve prediction performance on average for all tissues, with the greatest gains in tissues where effects are strongly shared, and in the tissues with smaller sample sizes. While we use gene expression prediction to illustrate our methods, the methods are generally applicable to any multi-phenotype applications, including prediction of polygenic scores and breeding values. Thus, our methods have the potential to provide improvements across fields and organisms.
Affiliation(s)
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, South Carolina, United States of America
  - Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, United States of America
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Peter Carbonetto
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Research Computing Center, University of Chicago, Chicago, Illinois, United States of America
- Gao Wang
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Department of Neurology, Columbia University, New York, New York, United States of America
  - Gertrude H. Sergievsky Center, Columbia University, New York, New York, United States of America
- Yuxin Zou
  - Department of Statistics, University of Chicago, Chicago, Illinois, United States of America
  - Regeneron Genetics Center, Regeneron Pharmaceuticals Inc., Tarrytown, New York, United States of America
- Abhishek Sarkar
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Department of Statistics, University of Chicago, Chicago, Illinois, United States of America
5
Mowlaei ME, Shi X. FSF-GA: A Feature Selection Framework for Phenotype Prediction Using Genetic Algorithms. Genes (Basel) 2023; 14:genes14051059. [PMID: 37239419 DOI: 10.3390/genes14051059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 05/04/2023] [Accepted: 05/06/2023] [Indexed: 05/28/2023]
Abstract
(1) Background: Phenotype prediction is a pivotal task in genetics for identifying how genetic factors contribute to phenotypic differences. This field has seen extensive research, with numerous methods proposed for predicting phenotypes. Nevertheless, the intricate relationship between genotypes and complex phenotypes, including common diseases, makes it an ongoing challenge to accurately decipher the genetic contribution. (2) Results: In this study, we propose a novel feature selection framework for phenotype prediction utilizing a genetic algorithm (FSF-GA) that effectively reduces the feature space to identify genotypes contributing to phenotype prediction. We provide a comprehensive vignette of our method and conduct extensive experiments using a widely used yeast dataset. (3) Conclusions: Our experimental results show that our proposed FSF-GA method delivers phenotype prediction performance comparable to that of baseline methods, while providing the features selected for predicting phenotypes. These selected feature sets can be used to interpret the underlying genetic architecture that contributes to phenotypic variation.
Affiliation(s)
- Mohammad Erfan Mowlaei
  - Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
- Xinghua Shi
  - Department of Computer and Information Sciences, Temple University, 925 N. 12th Street, Philadelphia, PA 19122, USA
6
Vu NT, Phuc TH, Nguyen NH, Van Sang N. Effects of common full-sib families on accuracy of genomic prediction for tagging weight in striped catfish Pangasianodon hypophthalmus. Front Genet 2023; 13:1081246. [PMID: 36685869 PMCID: PMC9845282 DOI: 10.3389/fgene.2022.1081246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 12/06/2022] [Indexed: 01/06/2023]
Abstract
Common full-sib families (c²) make up a substantial proportion of total phenotypic variation in traits of commercial importance in aquaculture species, and omitting or including c² can change genetic parameter estimates and re-rank estimated breeding values. However, the impacts of common full-sib families on the accuracy of genomic prediction for traits of economic importance are not well known in many species, including aquatic animals. This research explored the impacts of common full-sib families on accuracy of genomic prediction for tagging weight in a population of striped catfish comprising 11,918 fish traced back to the base population (four generations), in which 560 individuals had genotype records of 14,154 SNPs. Our single-step genomic best linear unbiased prediction (ssGBLUP) analysis showed that the accuracy of genomic prediction for tagging weight was reduced by 96.5%-130.3% when the common full-sib families were included in statistical models. The reduction in prediction accuracy was smaller in multivariate analysis than in univariate models. Imputation of missing genotypes somewhat reduced the upward biases in the prediction accuracy for tagging weight. It is therefore suggested that genomic evaluation models for traits recorded during the early phase of growth development should account for the common full-sib families to minimise possible biases in the accuracy of genomic prediction and, hence, in selection response.
Affiliation(s)
- Nguyen Thanh Vu
  - School of Science, Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD, Australia
  - Center for Bio-Innovation, University of the Sunshine Coast, Maroochydore, QLD, Australia
  - Research Institute for Aquaculture No. 2, Ho Chi Minh City, Vietnam
- Tran Huu Phuc
  - Research Institute for Aquaculture No. 2, Ho Chi Minh City, Vietnam
- Nguyen Hong Nguyen
  - School of Science, Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD, Australia
  - Center for Bio-Innovation, University of the Sunshine Coast, Maroochydore, QLD, Australia
- Nguyen Van Sang
  - Research Institute for Aquaculture No. 2, Ho Chi Minh City, Vietnam

Correspondence: Nguyen Hong Nguyen; Nguyen Van Sang
7
Qu J, Morota G, Cheng H. A Bayesian random regression method using mixture priors for genome-enabled analysis of time-series high-throughput phenotyping data. Plant Genome 2022; 15:e20228. [PMID: 35904052 DOI: 10.1002/tpg2.20228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 04/07/2022] [Indexed: 06/15/2023]
Abstract
The recent advancement in image-based phenotyping platforms enables the acquisition of large-scale nondestructive crop phenotypes measured at frequent intervals. To further understand the genetic basis of a physiological process and improve plant breeding programs, the question of how to efficiently utilize these time-series measurements in genome-enabled analysis, including genomic prediction and genome-wide association studies (GWASs), should be considered. In this paper, a Bayesian random regression model with mixture priors is developed to introduce more meaningful biological assumptions to the analysis of longitudinal traits. The mixture prior for marker effects in Bayes Cπ is implemented in our developed model (RR-BayesC) for demonstration purposes. The estimation of single-nucleotide polymorphism-specific effects that are related to the dynamic performance of crops and the accuracy of genomic prediction by RR-BayesC were studied through both simulated and real rice (Oryza sativa L.) data. For genomic prediction, three predictive scenarios were studied. In the simulated study, RR-BayesC showed a significantly higher prediction accuracy than that obtained by single-trait analysis, especially for days when heritability was low. In real data analysis, RR-BayesC showed relatively high prediction accuracy when forecasting phenotypes at a later period (e.g., from 0.94 to 0.98 for lines with observations at an earlier period and from 0.65 to 0.67 for lines without any observations). For GWASs, inference of single markers and inference of genomic windows were conducted. In the simulated study, RR-BayesC showed a promising ability to distinguish quantitative trait loci (QTL) that are invariant to temporal covariates from QTL that interact with time. An association study of real data was also presented to demonstrate the application of RR-BayesC in real data analysis.
In this paper, we develop a Bayesian random regression model that can incorporate mixture priors for marker effects and show improved performance of genomic prediction and GWASs for longitudinal data analysis based on both simulated and real data. The software tool JWAS offers routines to perform our proposed random regression analysis.
Affiliation(s)
- Jiayi Qu
  - Dep. of Animal Science, Univ. of California Davis, Davis, CA, 95616, USA
- Gota Morota
  - Dep. of Animal and Poultry Sciences, Virginia Polytechnic Institute and State Univ., Blacksburg, VA, 24061, USA
- Hao Cheng
  - Dep. of Animal Science, Univ. of California Davis, Davis, CA, 95616, USA
8
Ribeiro AMF, Sanglard LP, Wijesena HR, Ciobanu DC, Horvath S, Spangler ML. DNA methylation profile in beef cattle is influenced by additive genetics and age. Sci Rep 2022; 12:12016. [PMID: 35835812 PMCID: PMC9283455 DOI: 10.1038/s41598-022-16350-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/27/2022] [Accepted: 07/08/2022] [Indexed: 11/15/2022]
Abstract
DNA methylation (DNAm) has been considered a promising indicator of biological age in mammals and could be useful to increase the accuracy of phenotypic prediction in livestock. The objectives of this study were to estimate the heritability and age effects of site-specific DNAm (DNAm level) and cumulative DNAm across all sites (DNAm load) in beef cattle. Blood samples were collected from cows ranging from 217 to 3,192 days (0.6 to 8.7 years) of age (n = 136). All animals were genotyped, and DNAm was obtained using the Infinium array HorvathMammalMethylChip40. Genetic parameters for DNAm were obtained from an animal model based on the genomic relationship matrix, including the fixed effects of age and breed composition. Heritability estimates of DNAm levels ranged from 0.18 to 0.72, with a similar average across all regions and chromosomes. Heritability estimate of DNAm load was 0.45. The average age effect on DNAm level varied among genomic regions. The DNAm level across the genome increased with age in the promoter and 5′ UTR and decreased in the exonic, intronic, 3′ UTR, and intergenic regions. In addition, DNAm level increased with age in regions enriched in CpG and decreased in regions deficient in CpG. Results suggest DNAm profiles are influenced by both genetics and the environmental effect of age in beef cattle.
Affiliation(s)
- Leticia P Sanglard
  - Department of Animal Science, University of Nebraska, Lincoln, NE, 68583, USA
- Hiruni R Wijesena
  - Department of Animal Science, University of Nebraska, Lincoln, NE, 68583, USA
- Daniel C Ciobanu
  - Department of Animal Science, University of Nebraska, Lincoln, NE, 68583, USA
- Steve Horvath
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
  - Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, 90095, USA
- Matthew L Spangler
  - Department of Animal Science, University of Nebraska, Lincoln, NE, 68583, USA
9
Xavier A, Habier D. A new approach fits multivariate genomic prediction models efficiently. Genet Sel Evol 2022; 54:45. [PMID: 35715755 PMCID: PMC9204867 DOI: 10.1186/s12711-022-00730-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Accepted: 05/13/2022] [Indexed: 12/03/2022]
Abstract
Background: Fast, memory-efficient, and reliable algorithms for estimating genomic estimated breeding values (GEBV) for multiple traits and environments are needed to make timely decisions in breeding. Multivariate genomic prediction exploits genetic correlations between traits and environments to increase accuracy of GEBV compared to univariate methods. These genetic correlations are estimated simultaneously with GEBV, because they are specific to year, environment, and management. However, estimating genetic parameters is computationally demanding with restricted maximum likelihood (REML) and Bayesian samplers, and canonical transformations or orthogonalizations cannot be used for unbalanced experimental designs.
Methods: We propose a multivariate randomized Gauss–Seidel algorithm for simultaneous estimation of model effects and genetic parameters. Two previously proposed methods for estimating genetic parameters were combined with a Gauss–Seidel (GS) solver, and were called Tilde-Hat-GS (THGS) and Pseudo-Expectation-GS (PEGS). Balanced and unbalanced experimental designs were simulated to compare runtime, bias and accuracy of GEBV, and bias and standard errors of estimates of heritabilities and genetic correlations of THGS, PEGS, and REML. Models with 10 to 400 response variables, 1279 to 42,034 genetic markers, and 5990 to 1.85 million observations were fitted.
Results: Runtime of PEGS and THGS was a fraction of REML. Accuracies of GEBV were slightly lower than those from REML, but higher than those from the univariate approach, hence THGS and PEGS exploited genetic correlations. For 500 to 600 observations per response variable, biases of estimates of genetic parameters of THGS and PEGS were small, but standard errors of estimates of genetic correlations were higher than for REML. Bias and standard errors decreased as sample size increased. For balanced designs, GEBV and estimates of genetic correlations from THGS were unbiased when only an intercept and eigenvectors of genotype scores were fitted.
Conclusions: THGS and PEGS are fast and memory-efficient algorithms for multivariate genomic prediction for balanced and unbalanced experimental designs. They are scalable for increasing numbers of environments and genetic markers. Accuracy of GEBV was comparable to REML. Estimates of genetic parameters had little bias, but their standard errors were larger than for REML. More studies are needed to evaluate the proposed methods for datasets that contain selection.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12711-022-00730-w.
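As a rough illustration of the solver family this abstract builds on, here is a minimal single-trait sketch of a Gauss–Seidel iteration with residual updating for ridge-type marker effects. It is not THGS or PEGS: the shrinkage parameter `lam` is held fixed rather than estimated, only one response variable is handled, and "randomized" is assumed here to mean that markers are visited in a fresh random order on every sweep. All names are hypothetical.

```python
import numpy as np

def randomized_gauss_seidel(Z, y, lam, n_sweeps=200, seed=0):
    """Solve (Z'Z + lam*I) b = Z'y one marker at a time, in random order."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    p = Z.shape[1]
    b = np.zeros(p)
    e = np.asarray(y, dtype=float).copy()   # residual y - Z b, kept current
    zz = np.einsum("ij,ij->j", Z, Z)        # diagonal of Z'Z
    for _ in range(n_sweeps):
        for j in rng.permutation(p):        # assumed meaning of "randomized"
            # Update marker j conditional on the current values of all others.
            b_new = (Z[:, j] @ e + zz[j] * b[j]) / (zz[j] + lam)
            e -= Z[:, j] * (b_new - b[j])   # keep the residual in sync
            b[j] = b_new
    return b
```

Because only the residual vector and one marker column are touched per update, the coefficient matrix Z'Z is never formed, which is the usual reason such solvers are memory-efficient for large marker panels.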
Affiliation(s)
- Alencar Xavier
  - Biostatistics, Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
  - Department of Agronomy, Purdue University, 915 W State St, West Lafayette, IN, 47907, USA
- David Habier
  - Biostatistics, Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
10
Chen CJ, Garrick D, Fernando R, Karaman E, Stricker C, Keehan M, Cheng H. XSim version 2: simulation of modern breeding programs. G3 (Bethesda) 2022; 12:6542309. [PMID: 35244161 PMCID: PMC8982375 DOI: 10.1093/g3journal/jkac032] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/28/2021] [Accepted: 01/06/2022] [Indexed: 11/25/2022]
Abstract
Simulation can be an efficient approach to design, evaluate, and optimize breeding programs. In the era of modern agriculture, breeding programs can benefit from a simulator that integrates various sources of big data and accommodates state-of-the-art statistical models. The initial release of XSim, in which stochastic descendants can be efficiently simulated with a drop-down strategy, has mainly been used to validate genomic selection results. In this article, we present XSim Version 2, an open-source tool that has been extensively redesigned with additional features to meet the needs of modern breeding programs. It seamlessly incorporates multiple statistical models for genetic evaluations, such as GBLUP, Bayesian alphabets, and neural networks, and, with the aid of its modular design, it can effortlessly simulate successive generations of descendants based on complex mating schemes. Case studies are presented to demonstrate the flexibility of XSim Version 2 in simulating crossbreeding in animal and plant populations. Modern biotechnology, including double haploids and embryo transfer, can all be simultaneously integrated into the mating plans that drive the simulation. From a computing perspective, XSim Version 2 is implemented in Julia, which is a computer language that retains the readability of scripting languages (e.g. R and Python) without sacrificing much computational speed compared to compiled languages (e.g. C). This makes XSim Version 2 a simulation tool that is relatively easy for both champions and community members to maintain, modify, or extend in order to improve their breeding programs. Functions and operators are overloaded for a better user interface so that users can concatenate, subset, summarize, and organize simulated populations at each breeding step. With the strong and foreseeable demands in the community, XSim Version 2 will serve as a modern simulator bridging the gaps between theories and experiments with its flexibility, extensibility, and friendly interface.
Affiliation(s)
- Chunpeng James Chen
  - Department of Animal Science, University of California, Davis, CA 95616, USA
- Rohan Fernando
  - Department of Animal Science, Iowa State University, Ames, IA 50010, USA
- Emre Karaman
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus 8830, Denmark
- Chris Stricker
  - agn Genetics GmbH, Davos-Dorf, Graubünden 7260, Switzerland
- Hao Cheng
  - Department of Animal Science, University of California, Davis, CA 95616, USA
11
Zhao T, Zeng J, Cheng H. Extend mixed models to multilayer neural networks for genomic prediction including intermediate omics data. Genetics 2022; 221:6536967. [PMID: 35212766 PMCID: PMC9071534 DOI: 10.1093/genetics/iyac034] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/06/2021] [Accepted: 02/17/2022] [Indexed: 11/13/2022]
Abstract
With the growing amount and diversity of intermediate omics data complementary to genomics (e.g. DNA methylation, gene expression, and protein abundance), there is a need to develop methods to incorporate intermediate omics data into conventional genomic evaluation. The omics data help decode the multiple layers of regulation from genotypes to phenotypes and thus naturally form a connected multilayer network. We developed a new method named NN-MM to model the multiple layers of regulation from genotypes to intermediate omics features, then to phenotypes, by extending conventional linear mixed models ("MM") to multilayer artificial neural networks ("NN"). NN-MM incorporates intermediate omics features by adding middle layers between genotypes and phenotypes. Linear mixed models (e.g. pedigree-based BLUP, GBLUP, Bayesian Alphabet, single-step GBLUP, or single-step Bayesian Alphabet) can be used to sample marker effects or genetic values on intermediate omics features, and activation functions in neural networks are used to capture the nonlinear relationships between intermediate omics features and phenotypes. NN-MM had significantly better prediction performance than the recently proposed single-step approach for genomic prediction with intermediate omics data. Compared to the single-step approach, NN-MM can handle various patterns of missing omics measures and allows nonlinear relationships between intermediate omics features and phenotypes. NN-MM has been implemented in an open-source package called "JWAS".
Affiliation(s)
- Tianjing Zhao
  - Department of Animal Science, University of California Davis, Davis, CA 95616, USA
  - Integrative Genetics and Genomics Graduate Group, University of California Davis, Davis, CA 95616, USA
- Jian Zeng
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
- Hao Cheng
  - Department of Animal Science, University of California Davis, Davis, CA 95616, USA
  - Corresponding author: Department of Animal Science, University of California, Davis, CA 95616, USA
12
Jung M, Keller B, Roth M, Aranzana MJ, Auwerkerken A, Guerra W, Al-Rifaï M, Lewandowski M, Sanin N, Rymenants M, Didelot F, Dujak C, Font i Forcada C, Knauf A, Laurens F, Studer B, Muranty H, Patocchi A. Genetic architecture and genomic predictive ability of apple quantitative traits across environments. Hortic Res 2022; 9:uhac028. [PMID: 35184165 PMCID: PMC8976694 DOI: 10.1093/hr/uhac028] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/25/2021] [Revised: 12/09/2021] [Accepted: 01/11/2022] [Indexed: 06/14/2023]
Abstract
Implementation of genomic tools is desirable to increase the efficiency of apple breeding. Recently, the multi-environment apple reference population (apple REFPOP) proved useful for rediscovering loci, estimating genomic predictive ability, and studying genotype by environment interactions (G × E). So far, only two phenological traits were investigated using the apple REFPOP, although the population may be valuable when dissecting genetic architecture and reporting predictive abilities for additional key traits in apple breeding. Here we show contrasting genetic architecture and genomic predictive abilities for 30 quantitative traits across up to six European locations using the apple REFPOP. A total of 59 stable and 277 location-specific associations were found using GWAS, 69.2% of which are novel when compared with 41 reviewed publications. Average genomic predictive abilities of 0.18-0.88 were estimated using main-effect univariate, main-effect multivariate, multi-environment univariate, and multi-environment multivariate models. The G × E accounted for up to 24% of the phenotypic variability. This most comprehensive genomic study in apple in terms of trait-environment combinations provided knowledge of trait biology and prediction models that can be readily applied for marker-assisted or genomic selection, thus facilitating increased breeding efficiency.
Affiliation(s)
- Michaela Jung
  - Agroscope, Breeding Research Group, 8820 Wädenswil, Switzerland
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, 8092 Zurich, Switzerland
- Beat Keller
  - Agroscope, Breeding Research Group, 8820 Wädenswil, Switzerland
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, 8092 Zurich, Switzerland
- Morgane Roth
  - Agroscope, Breeding Research Group, 8820 Wädenswil, Switzerland
  - GAFL, INRAE, 84140 Montfavet, France
- Maria José Aranzana
  - IRTA (Institut de Recerca i Tecnologia Agroalimentàries), 08140 Caldes de Montbui, Barcelona, Spain
  - Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, 08193 Bellaterra, Barcelona, Spain
- Mehdi Al-Rifaï
  - Univ Angers, Institut Agro, INRAE, IRHS, SFR QuaSaV, F-49000 Angers, France
- Mariusz Lewandowski
  - The National Institute of Horticultural Research, Konstytucji 3 Maja 1/3, 96-100 Skierniewice, Poland
- Marijn Rymenants
  - Better3fruit N.V., 3202 Rillaar, Belgium
  - Laboratory for Plant Genetics and Crop Improvement, KU Leuven, B-3001 Leuven, Belgium
- Christian Dujak
  - Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, 08193 Bellaterra, Barcelona, Spain
- Carolina Font i Forcada
  - IRTA (Institut de Recerca i Tecnologia Agroalimentàries), 08140 Caldes de Montbui, Barcelona, Spain
- Andrea Knauf
  - Agroscope, Breeding Research Group, 8820 Wädenswil, Switzerland
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, 8092 Zurich, Switzerland
- François Laurens
  - Univ Angers, Institut Agro, INRAE, IRHS, SFR QuaSaV, F-49000 Angers, France
- Bruno Studer
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, 8092 Zurich, Switzerland
- Hélène Muranty
  - Univ Angers, Institut Agro, INRAE, IRHS, SFR QuaSaV, F-49000 Angers, France
- Andrea Patocchi
  - Agroscope, Breeding Research Group, 8820 Wädenswil, Switzerland
|
13
|
Weiss-Lehman CP, Werner CM, Bowler CH, Hallett LM, Mayfield MM, Godoy O, Aoyama L, Barabás G, Chu C, Ladouceur E, Larios L, Shoemaker LG. Disentangling key species interactions in diverse and heterogeneous communities: A Bayesian sparse modelling approach. Ecol Lett 2022; 25:1263-1276. [PMID: 35106910 PMCID: PMC9543015 DOI: 10.1111/ele.13977] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/20/2021] [Revised: 12/07/2021] [Accepted: 01/02/2022] [Indexed: 11/30/2022]
Abstract
Modelling species interactions in diverse communities traditionally requires a prohibitively large number of species‐interaction coefficients, especially when considering environmental dependence of parameters. We implemented Bayesian variable selection via sparsity‐inducing priors on non‐linear species abundance models to determine which species interactions should be retained and which can be represented as an average heterospecific interaction term, reducing the number of model parameters. We evaluated model performance using simulated communities, computing out‐of‐sample predictive accuracy and parameter recovery across different input sample sizes. We applied our method to a diverse empirical community, allowing us to disentangle the direct role of environmental gradients on species’ intrinsic growth rates from indirect effects via competitive interactions. We also identified a few neighbouring species from the diverse community that had non‐generic interactions with our focal species. This sparse modelling approach facilitates exploration of species interactions in diverse communities while maintaining a manageable number of parameters.
Affiliation(s)
- Chhaya M Werner
  - Botany Department, University of Wyoming, Laramie, Wyoming, USA
- Catherine H Bowler
  - School of Biological Sciences, University of Queensland, Brisbane, Queensland, Australia
- Lauren M Hallett
  - Biology Department, University of Oregon, Eugene, Oregon, USA
  - Environmental Studies Program, University of Oregon, Eugene, Oregon, USA
- Margaret M Mayfield
  - School of Biological Sciences, University of Queensland, Brisbane, Queensland, Australia
- Oscar Godoy
  - Departamento de Biología, Instituto Universitario de Investigación Marina (INMAR), Universidad de Cádiz, Puerto Real, Spain
- Lina Aoyama
  - Biology Department, University of Oregon, Eugene, Oregon, USA
  - Environmental Studies Program, University of Oregon, Eugene, Oregon, USA
- György Barabás
  - Division of Theoretical Biology, Department of IFM, Linköping University, Linköping, Sweden
- Chengjin Chu
  - Department of Ecology, State Key Laboratory of Biocontrol and School of Life Sciences, Sun Yat-sen University, Guangzhou, Guangdong, China
- Emma Ladouceur
  - German Centre for Integrative Biodiversity Research (iDiv) Leipzig-Halle-Jena, Leipzig, Germany
  - Department of Physiological Diversity, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Loralee Larios
  - Department of Botany and Plant Sciences, University of California Riverside, Riverside, California, USA
|
14
|
Wang Z, Cheng H. Single-Trait and Multiple-Trait Genomic Prediction From Multi-Class Bayesian Alphabet Models Using Biological Information. Front Genet 2021; 12:717457. [PMID: 34707638 PMCID: PMC8542848 DOI: 10.3389/fgene.2021.717457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2021] [Accepted: 08/23/2021] [Indexed: 11/13/2022] Open
Abstract
Genomic prediction has been widely used in multiple areas and various genomic prediction methods have been developed. The majority of these methods, however, focus on statistical properties and ignore the abundant useful biological information like genome annotation or previously discovered causal variants. Therefore, to improve prediction performance, several methods have been developed to incorporate biological information into genomic prediction, mostly in single-trait analysis. A commonly used method to incorporate biological information is allocating molecular markers into different classes based on the biological information and assigning separate priors to molecular markers in different classes. It has been shown that such methods can achieve higher prediction accuracy than conventional methods in some circumstances. However, these methods mainly focus on single-trait analysis, and available priors of these methods are limited. Thus, in both single-trait and multiple-trait analysis, we propose the multi-class Bayesian Alphabet methods, in which multiple Bayesian Alphabet priors, including RR-BLUP, BayesA, BayesB, BayesCΠ, and Bayesian LASSO, can be used for markers allocated to different classes. The superior performance of the multi-class Bayesian Alphabet in genomic prediction is demonstrated using both real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.
Affiliation(s)
- Zigui Wang
  - Department of Animal Science, University of California, Davis, Davis, CA, United States
- Hao Cheng
  - Department of Animal Science, University of California, Davis, Davis, CA, United States
|
15
|
Zhao T, Fernando R, Cheng H. Interpretable artificial neural networks incorporating Bayesian alphabet models for genome-wide prediction and association studies. G3 (Bethesda, Md.) 2021; 11:jkab228. [PMID: 34499126 PMCID: PMC8496266 DOI: 10.1093/g3journal/jkab228] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/13/2021] [Accepted: 06/22/2021] [Indexed: 01/05/2023]
Abstract
In conventional linear models for whole-genome prediction and genome-wide association studies (GWAS), it is usually assumed that the relationship between genotypes and phenotypes is linear. Bayesian neural networks have been used to account for non-linearity such as complex genetic architectures. Here, we introduce a method named NN-Bayes, where "NN" stands for neural networks, and "Bayes" stands for Bayesian Alphabet models, including a collection of Bayesian regression models such as BayesA, BayesB, BayesC, and Bayesian LASSO. NN-Bayes incorporates Bayesian Alphabet models into non-linear neural networks via hidden layers between single-nucleotide polymorphisms (SNPs) and observed traits. Thus, NN-Bayes attempts to improve the performance of genome-wide prediction and GWAS by accommodating non-linear relationships between the hidden nodes and the observed trait, while maintaining genomic interpretability through the Bayesian regression models that connect the SNPs to the hidden nodes. For genomic interpretability, the posterior distribution of marker effects in NN-Bayes is inferred by Markov chain Monte Carlo approaches and used for inference of association through posterior inclusion probabilities and window posterior probability of association. In simulation studies with dominance and epistatic effects, performance of NN-Bayes was significantly better than conventional linear models for both GWAS and whole-genome prediction, and the differences on prediction accuracy were substantial in magnitude. In real-data analyses, for the soy dataset, NN-Bayes achieved significantly higher prediction accuracies than conventional linear models, and results from other four different species showed that NN-Bayes had similar prediction performance to linear models, which is potentially due to the small sample size. Our NN-Bayes is optimized for high-dimensional genomic data and implemented in an open-source package called "JWAS." NN-Bayes can lead to greater use of Bayesian neural networks to account for non-linear relationships due to its interpretability and computational performance.
Affiliation(s)
- Tianjing Zhao
  - Department of Animal Science, University of California Davis, Davis, CA 95616, USA
  - Integrative Genetics and Genomics Graduate Group, University of California Davis, Davis, CA 95616, USA
- Rohan Fernando
  - Department of Animal Science, Iowa State University, Ames, IA 50011, USA
- Hao Cheng
  - Department of Animal Science, University of California Davis, Davis, CA 95616, USA
|
16
|
Pégard M, Segura V, Muñoz F, Bastien C, Jorge V, Sanchez L. Favorable Conditions for Genomic Evaluation to Outperform Classical Pedigree Evaluation Highlighted by a Proof-of-Concept Study in Poplar. Frontiers in Plant Science 2020; 11:581954. [PMID: 33193528 PMCID: PMC7655903 DOI: 10.3389/fpls.2020.581954] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/10/2020] [Accepted: 09/22/2020] [Indexed: 06/11/2023]
Abstract
Forest trees like poplar are particular in many ways compared to other domesticated species. They have long juvenile phases, ongoing crop-wild gene flow, extensive outcrossing, and slow growth. All these particularities tend to make the conduction of breeding programs and evaluation stages costly both in time and resources. Perennials like trees are therefore good candidates for the implementation of genomic selection (GS) which is a good way to accelerate the breeding process, by unchaining selection from phenotypic evaluation without affecting precision. In this study, we tried to compare GS to pedigree-based traditional evaluation, and evaluated under which conditions genomic evaluation outperforms classical pedigree evaluation. Several conditions were evaluated as the constitution of the training population by cross-validation, the implementation of multi-trait, single trait, additive and non-additive models with different estimation methods (G-BLUP or weighted G-BLUP). Finally, the impact of the marker densification was tested through four marker density sets. The population under study corresponds to a pedigree of 24 parents and 1,011 offspring, structured into 35 full-sib families. Four evaluation batches were planted in the same location and seven traits were evaluated on 1 and 2 years old trees. The quality of prediction was reported by the accuracy, the Spearman rank correlation and prediction bias and tested with a cross-validation and an independent individual test set. Our results show that genomic evaluation performance could be comparable to the already well-optimized pedigree-based evaluation under certain conditions. Genomic evaluation appeared to be advantageous when using an independent test set and a set of less precise phenotypes. Genome-based methods showed advantages over pedigree counterparts when ranking candidates at the within-family levels, for most of the families. Our study also showed that looking at ranking criteria as Spearman rank correlation can reveal benefits to genomic selection hidden by biased predictions.
Affiliation(s)
- Vincent Segura
  - BioForA, INRA, ONF, Orléans, France
  - AGAP, Univ Montpellier, CIRAD, INRAE, Institut Agro, Montpellier, France
|