51
Costa-Neto G, Crossa J, Fritsche-Neto R. Enviromic Assembly Increases Accuracy and Reduces Costs of the Genomic Prediction for Yield Plasticity in Maize. Frontiers in Plant Science 2021; 12:717552. [PMID: 34691099 PMCID: PMC8529011 DOI: 10.3389/fpls.2021.717552] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/31/2021] [Accepted: 09/03/2021] [Indexed: 05/21/2023]
Abstract
Quantitative genetics states that phenotypic variation is a consequence of the interaction between genetic and environmental factors. Predictive breeding is based on this statement, and because of this, ways of modeling genetic effects are still evolving. At the same time, the same refinement must be used for processing environmental information. Here, we present an "enviromic assembly approach," which includes using ecophysiology knowledge in shaping environmental relatedness into whole-genome predictions (GP) for plant breeding (referred to as enviromic-aided genomic prediction, E-GP). We propose that the quality of an environment is defined by the core of environmental typologies and their frequencies, which describe different zones of plant adaptation. From this, we derived markers of environmental similarity cost-effectively. Combined with the traditional additive and non-additive effects, this approach may better represent the putative phenotypic variation observed across diverse growing conditions (i.e., phenotypic plasticity). Then, we designed optimized multi-environment trials coupling genetic algorithms, enviromic assembly, and genomic kinships capable of providing in-silico realization of the genotype-environment combinations that must be phenotyped in the field. As proof of concept, we highlighted two E-GP applications: (1) managing the lack of phenotypic information in training accurate GP models across diverse environments and (2) guiding an early screening for yield plasticity exerting optimized phenotyping efforts. Our approach was tested using two tropical maize sets, two types of enviromics assembly, six experimental network sizes, and two types of optimized training set across environments. We observed that E-GP outperforms benchmark GP in all scenarios, especially when considering smaller training sets. The representativeness of genotype-environment combinations is more critical than the size of multi-environment trials (METs). 
The conventional genomic best linear unbiased prediction (GBLUP) is inefficient in predicting the quality of a yet-to-be-seen environment, while enviromic assembly enabled it by increasing the accuracy of yield plasticity predictions. Furthermore, we discussed theoretical backgrounds underlying how intrinsic envirotype-phenotype covariances within the phenotypic records can impact the accuracy of GP. The E-GP is an efficient approach to better use environmental databases to deliver climate-smart solutions, reduce field costs, and anticipate future scenarios.
Affiliation(s)
- Germano Costa-Neto
  - Department of Genetics, “Luiz de Queiroz” Agriculture College, University of São Paulo (ESALQ/USP), Piracicaba, Brazil
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY, United States
  - *Correspondence: Germano Costa-Neto
- Jose Crossa
  - Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Mexico City, Mexico
  - Colegio de Postgraduados, Mexico City, Mexico
- Roberto Fritsche-Neto
  - Department of Genetics, “Luiz de Queiroz” Agriculture College, University of São Paulo (ESALQ/USP), Piracicaba, Brazil
  - Breeding Analytics and Data Management Unit, International Rice Research Institute (IRRI), Los Baños, Philippines

52
Crossa J, Fritsche-Neto R, Montesinos-Lopez OA, Costa-Neto G, Dreisigacker S, Montesinos-Lopez A, Bentley AR. The Modern Plant Breeding Triangle: Optimizing the Use of Genomics, Phenomics, and Enviromics Data. Frontiers in Plant Science 2021; 12:651480. [PMID: 33936136 PMCID: PMC8085545 DOI: 10.3389/fpls.2021.651480] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 01/09/2021] [Accepted: 02/11/2021] [Indexed: 05/04/2023]
Affiliation(s)
- Jose Crossa
  - International Maize and Wheat Improvement Center (CIMMYT), Carretera México-Veracruz, de Mexico, Mexico
  - Colegio de Postgraduados, Montecillo, Edo. de Mexico, Mexico
- Roberto Fritsche-Neto
  - Department of Genetics, “Luiz de Queiroz” Agriculture College, University of São Paulo, São Paulo, Brazil
- Germano Costa-Neto
  - Department of Genetics, “Luiz de Queiroz” Agriculture College, University of São Paulo, São Paulo, Brazil
- Susanne Dreisigacker
  - International Maize and Wheat Improvement Center (CIMMYT), Carretera México-Veracruz, de Mexico, Mexico
- Abelardo Montesinos-Lopez
  - Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Mexico
- Alison R. Bentley
  - International Maize and Wheat Improvement Center (CIMMYT), Carretera México-Veracruz, de Mexico, Mexico
  - *Correspondence: Alison R. Bentley

53
Maldonado C, Mora-Poblete F, Contreras-Soto RI, Ahmar S, Chen JT, do Amaral Júnior AT, Scapim CA. Genome-Wide Prediction of Complex Traits in Two Outcrossing Plant Species Through Deep Learning and Bayesian Regularized Neural Network. Frontiers in Plant Science 2020; 11:593897. [PMID: 33329658 PMCID: PMC7728740 DOI: 10.3389/fpls.2020.593897] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/11/2020] [Accepted: 10/27/2020] [Indexed: 05/25/2023]
Abstract
Genomic selection models were investigated to predict several complex traits in breeding populations of Zea mays L. and Eucalyptus globulus Labill. For this, the following methods of Machine Learning (ML) were implemented: (i) Deep Learning (DL) and (ii) Bayesian Regularized Neural Network (BRNN) both in combination with different hyperparameters. These ML methods were also compared with Genomic Best Linear Unbiased Prediction (GBLUP) and different Bayesian regression models [Bayes A, Bayes B, Bayes Cπ, Bayesian Ridge Regression, Bayesian LASSO, and Reproducing Kernel Hilbert Space (RKHS)]. DL models, using Rectified Linear Units (as the activation function), had higher predictive ability values, which varied from 0.27 (pilodyn penetration of 6 years old eucalypt trees) to 0.78 (flowering-related traits of maize). Moreover, the larger mini-batch size (100%) had a significantly higher predictive ability for wood-related traits than the smaller mini-batch size (10%). On the other hand, in the BRNN method, the architectures of one and two layers that used only the pureline function showed better results of prediction, with values ranging from 0.21 (pilodyn penetration) to 0.71 (flowering traits). A significant increase in the prediction ability was observed for DL in comparison with other methods of genomic prediction (Bayesian alphabet models, GBLUP, RKHS, and BRNN). Another important finding was the usefulness of DL models (through an iterative algorithm) as an SNP detection strategy for genome-wide association studies. The results of this study confirm the importance of DL for genome-wide analyses and crop/tree improvement strategies, which holds promise for accelerating breeding progress.
Affiliation(s)
- Carlos Maldonado
  - Instituto de Ciencias Agroalimentarias, Animales y Ambientales, Universidad de O’Higgins, San Fernando, Chile
- Rodrigo Iván Contreras-Soto
  - Instituto de Ciencias Agroalimentarias, Animales y Ambientales, Universidad de O’Higgins, San Fernando, Chile
- Sunny Ahmar
  - Institute of Biological Sciences, University of Talca, Talca, Chile
  - College of Plant Sciences and Technology, Huazhong Agricultural University, Wuhan, China
- Jen-Tsung Chen
  - Department of Life Sciences, National University of Kaohsiung, Kaohsiung, Taiwan
- Antônio Teixeira do Amaral Júnior
  - Laboratório de Melhoramento Genético Vegetal, Universidade Estadual do Norte Fluminense Darcy Ribeiro/CCTA, Campos dos Goytacazes, Brazil

54
Montesinos-López OA, Montesinos-López JC, Singh P, Lozano-Ramirez N, Barrón-López A, Montesinos-López A, Crossa J. A Multivariate Poisson Deep Learning Model for Genomic Prediction of Count Data. G3 (Bethesda, Md.) 2020; 10:4177-4190. [PMID: 32934019 PMCID: PMC7642922 DOI: 10.1534/g3.120.401631] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 05/30/2020] [Accepted: 09/13/2020] [Indexed: 01/24/2023]
Abstract
The paradigm called genomic selection (GS) is a revolutionary way of developing new plants and animals. This is a predictive methodology, since it uses learning methods to perform its task. Unfortunately, there is no universal model that can be used for all types of predictions; for this reason, specific methodologies are required for each type of output (response variables). Since there is a lack of efficient methodologies for multivariate count data outcomes, in this paper, a multivariate Poisson deep neural network (MPDN) model is proposed for the genomic prediction of various count outcomes simultaneously. The MPDN model uses the negative log-likelihood of a Poisson distribution as its loss function; the hidden layers use the rectified linear unit (RELU) activation function to capture nonlinear patterns, and the output layer uses the exponential activation function to produce outputs on the same scale as the counts. The proposed MPDN model was compared to conventional generalized Poisson regression models and univariate Poisson deep learning models in two experimental data sets of count data. We found that the proposed MPDN outperformed univariate Poisson deep neural network models, but did not outperform, in terms of prediction, the univariate generalized Poisson regression models. All deep learning models were implemented with Tensorflow as back-end and Keras as front-end, which allows implementing these models on moderate and large data sets, a significant advantage over previous GS models for multivariate count data.
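The loss and activations named in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (they report using Keras with a Tensorflow back-end); the function names `forward` and `poisson_nll`, the single hidden layer, and the weight shapes are assumptions made only for illustration.

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """One hidden ReLU layer followed by a linear map to log-rates.

    Hypothetical toy network: each output column is one count trait.
    """
    hidden = np.maximum(0.0, x @ w_hidden)  # ReLU activation in the hidden layer
    return hidden @ w_out                   # log of the Poisson rate per trait

def poisson_nll(y_true, log_rate):
    """Mean Poisson negative log-likelihood, omitting the constant log(y!) term."""
    rate = np.exp(log_rate)  # exponential output activation keeps rates positive
    return float(np.mean(rate - y_true * log_rate))
```

For a single observed count, this loss is smallest when the predicted rate equals the count, which is why minimizing it pushes the exponential outputs toward the scale of the observed counts.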
Affiliation(s)
- Pawan Singh
  - Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Km 45 Carretera Mexico-Veracruz, CP 52640, Mexico
- Nerida Lozano-Ramirez
  - Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Km 45 Carretera Mexico-Veracruz, CP 52640, Mexico
- Alberto Barrón-López
  - Department of Animal Production (DPA), Universidad Nacional Agraria La Molina, Av. La Molina s/n La Molina, 15024, Lima, Perú
- Abelardo Montesinos-López
  - Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430, Jalisco, México
- José Crossa
  - Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Km 45 Carretera Mexico-Veracruz, CP 52640, Mexico
  - Colegio de Post-Graduados, Montecillos Texcoco. Edo. de Mexico

55
Thessen AE, Walls RL, Vogt L, Singer J, Warren R, Buttigieg PL, Balhoff JP, Mungall CJ, McGuinness DL, Stucky BJ, Yoder MJ, Haendel MA. Transforming the study of organisms: Phenomic data models and knowledge bases. PLoS Comput Biol 2020; 16:e1008376. [PMID: 33232313 PMCID: PMC7685442 DOI: 10.1371/journal.pcbi.1008376] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
Abstract
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heterogeneous phenotypic data sets that are very difficult or impossible to integrate at scale because of variable formats, lack of digitization, and linguistic problems. One powerful solution is to represent phenotypic data using data models with precise, computable semantics, but adoption of semantic standards for representing phenotypic data has been slow, especially in biodiversity and ecology. Some phenotypic and trait data are available in a semantic language from knowledge bases, but these are often not interoperable. In this review, we will compare and contrast existing ontology and data models, focusing on nonhuman phenotypes and traits. We discuss barriers to integration of phenotypic data and make recommendations for developing an operationally useful, semantically interoperable phenotypic data ecosystem.
Affiliation(s)
- Anne E. Thessen
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
  - Ronin Institute for Independent Scholarship, Montclair, New Jersey, United States of America
- Ramona L. Walls
  - Bio5 Institute, University of Arizona, Tucson, Arizona, United States of America
- Lars Vogt
  - TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
- Pier Luigi Buttigieg
  - Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany
- James P. Balhoff
  - Renaissance Computing Institute, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Christopher J. Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Brian J. Stucky
  - Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
- Matthew J. Yoder
  - Illinois Natural History Survey, Champaign, Illinois, United States of America
- Melissa A. Haendel
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America

56
Kim KD, Kang Y, Kim C. Application of Genomic Big Data in Plant Breeding: Past, Present, and Future. Plants (Basel, Switzerland) 2020; 9:E1454. [PMID: 33126607 PMCID: PMC7694055 DOI: 10.3390/plants9111454] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/28/2020] [Revised: 10/26/2020] [Accepted: 10/26/2020] [Indexed: 01/11/2023]
Abstract
Plant breeding has a long history of developing new varieties that have ensured the food security of the human population. During this long journey together with humanity, plant breeders have successfully integrated the latest innovations in science and technologies to accelerate the increase in crop production and quality. For the past two decades, since the completion of human genome sequencing, genomic tools and sequencing technologies have advanced remarkably, and adopting these innovations has enabled us to reduce the cost of and/or speed up the plant breeding process. Currently, with the growing mass of genomic data and digitalized biological data, interdisciplinary approaches using new technologies could lead to a new paradigm of plant breeding. In this review, we summarize the overall history and advances of plant breeding, which have been aided by plant genomic research. We highlight the key advances in the field of plant genomics that have impacted plant breeding over the past decades and introduce the current status of innovative approaches such as genomic selection, which could overcome limitations of conventional breeding and enhance the rate of genetic gain.
Affiliation(s)
- Kyung Do Kim
  - Department of Bioscience and Bioinformatics, Myongji University, Yongin 17058, Korea
- Yuna Kang
  - Department of Crop Science, Chungnam National University, Daejeon 34134, Korea
- Changsoo Kim
  - Department of Crop Science, Chungnam National University, Daejeon 34134, Korea
  - Department of Smart Agriculture Systems, Chungnam National University, Daejeon 34134, Korea

57
Cortés AJ, López-Hernández F, Osorio-Rodriguez D. Predicting Thermal Adaptation by Looking Into Populations' Genomic Past. Front Genet 2020; 11:564515. [PMID: 33101385 PMCID: PMC7545011 DOI: 10.3389/fgene.2020.564515] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 05/21/2020] [Accepted: 08/24/2020] [Indexed: 12/18/2022]
Abstract
Molecular evolution offers an insightful theory to interpret the genomic consequences of thermal adaptation to previous events of climate change beyond range shifts. However, disentangling often mixed footprints of selective and demographic processes from those due to lineage sorting, recombination rate variation, and genomic constraints is not trivial. Therefore, here we condense current and historical population genomic tools to study thermal adaptation and outline key developments (genomic prediction, machine learning) that might assist their utilization for improving forecasts of populations' responses to thermal variation. We start by summarizing how recent thermal-driven selective and demographic responses can be inferred by coalescent methods and in turn how quantitative genetic theory offers suitable multi-trait predictions over a few generations via the breeder's equation. We later assume that enough generations have passed as to display genomic signatures of divergent selection to thermal variation and describe how these footprints can be reconstructed using genome-wide association and selection scans or, alternatively, may be used for forward prediction over multiple generations under an infinitesimal genomic prediction model. Finally, we move deeper in time to comprehend the genomic consequences of thermal shifts at an evolutionary time scale by relying on phylogeographic approaches that allow for reticulate evolution and ecological parapatric speciation, and end by envisioning the potential of modern machine learning techniques to better inform long-term predictions. We conclude that foreseeing future thermal adaptive responses requires bridging the multiple spatial scales of historical and predictive environmental change research under modern cohesive approaches such as genomic prediction and machine learning frameworks.
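For orientation, the breeder's equation mentioned in this abstract is usually written in the standard textbook forms below. The notation is the conventional one and is an assumption here, not taken from the review itself: the univariate form on the left, and the multi-trait form on the right.

```latex
R = h^{2} S
\qquad\text{and}\qquad
\Delta\bar{\mathbf{z}} = \mathbf{G}\,\mathbf{P}^{-1}\mathbf{S} = \mathbf{G}\,\boldsymbol{\beta}
```

Here $R$ is the response to selection in one generation, $h^{2}$ the narrow-sense heritability, and $S$ the selection differential; in the multi-trait form, $\Delta\bar{\mathbf{z}}$ is the vector of changes in trait means, $\mathbf{G}$ and $\mathbf{P}$ are the additive genetic and phenotypic covariance matrices, and $\boldsymbol{\beta} = \mathbf{P}^{-1}\mathbf{S}$ is the selection gradient.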
Affiliation(s)
- Andrés J Cortés
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
  - Departamento de Ciencias Forestales, Facultad de Ciencias Agrarias, Universidad Nacional de Colombia - Sede Medellín, Medellín, Colombia
- Felipe López-Hernández
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
- Daniela Osorio-Rodriguez
  - Division of Geological and Planetary Sciences, California Institute of Technology (Caltech), Pasadena, CA, United States

58
Moreira FF, Oliveira HR, Volenec JJ, Rainey KM, Brito LF. Integrating High-Throughput Phenotyping and Statistical Genomic Methods to Genetically Improve Longitudinal Traits in Crops. Frontiers in Plant Science 2020; 11:681. [PMID: 32528513 PMCID: PMC7264266 DOI: 10.3389/fpls.2020.00681] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 02/26/2020] [Accepted: 04/30/2020] [Indexed: 05/28/2023]
Abstract
The rapid development of remote sensing in agronomic research allows the dynamic nature of longitudinal traits to be adequately described, which may enhance the genetic improvement of crop efficiency. For traits such as light interception, biomass accumulation, and responses to stressors, the data generated by the various high-throughput phenotyping (HTP) methods requires adequate statistical techniques to evaluate phenotypic records throughout time. As a consequence, information about plant functioning and activation of genes, as well as the interaction of gene networks at different stages of plant development and in response to environmental stimulus can be exploited. In this review, we outline the current analytical approaches in quantitative genetics that are applied to longitudinal traits in crops throughout development, describe the advantages and pitfalls of each approach, and indicate future research directions and opportunities.
Affiliation(s)
- Fabiana F. Moreira
  - Department of Agronomy, Purdue University, West Lafayette, IN, United States
- Hinayah R. Oliveira
  - Department of Animal Sciences, Purdue University, West Lafayette, IN, United States
- Jeffrey J. Volenec
  - Department of Agronomy, Purdue University, West Lafayette, IN, United States
- Katy M. Rainey
  - Department of Agronomy, Purdue University, West Lafayette, IN, United States
- Luiz F. Brito
  - Department of Animal Sciences, Purdue University, West Lafayette, IN, United States

59
Abdollahi-Arpanahi R, Gianola D, Peñagaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol 2020; 52:12. [PMID: 32093611 PMCID: PMC7038529 DOI: 10.1186/s12711-020-00531-z] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Received: 09/24/2019] [Accepted: 02/13/2020] [Indexed: 12/19/2022]
Abstract
Background Transforming large amounts of genomic data into valuable knowledge for predicting complex traits has been an important challenge for animal and plant breeders. Prediction of complex traits has not escaped the current excitement on machine-learning, including interest in deep learning algorithms such as multilayer perceptrons (MLP) and convolutional neural networks (CNN). The aim of this study was to compare the predictive performance of two deep learning methods (MLP and CNN), two ensemble learning methods [random forests (RF) and gradient boosting (GB)], and two parametric methods [genomic best linear unbiased prediction (GBLUP) and Bayes B] using real and simulated datasets. Methods The real dataset consisted of 11,790 Holstein bulls with sire conception rate (SCR) records and genotyped for 58k single nucleotide polymorphisms (SNPs). To support the evaluation of deep learning methods, various simulation studies were conducted using the observed genotype data as template, assuming a heritability of 0.30 with either additive or non-additive gene effects, and two different numbers of quantitative trait nucleotides (100 and 1000). Results In the bull dataset, the best predictive correlation was obtained with GB (0.36), followed by Bayes B (0.34), GBLUP (0.33), RF (0.32), CNN (0.29) and MLP (0.26). The same trend was observed when using mean squared error of prediction. The simulation indicated that when gene action was purely additive, parametric methods outperformed other methods. When the gene action was a combination of additive, dominance and of two-locus epistasis, the best predictive ability was obtained with gradient boosting, and the superiority of deep learning over the parametric methods depended on the number of loci controlling the trait and on sample size. 
In fact, with a large dataset including 80k individuals, the predictive performance of deep learning methods was similar or slightly better than that of parametric methods for traits with non-additive gene action. Conclusions For prediction of traits with non-additive gene action, gradient boosting was a robust method. Deep learning approaches were not better for genomic prediction unless non-additive variance was sizable.
Affiliation(s)
- Daniel Gianola
  - Departments of Animal Sciences and Dairy Science, University of Wisconsin-Madison, Madison, WI, USA
- Francisco Peñagaricano
  - Department of Animal Sciences, University of Florida, Gainesville, FL, USA
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA

60
Zingaretti LM, Gezan SA, Ferrão LFV, Osorio LF, Monfort A, Muñoz PR, Whitaker VM, Pérez-Enciso M. Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Frontiers in Plant Science 2020; 11:25. [PMID: 32117371 PMCID: PMC7015897 DOI: 10.3389/fpls.2020.00025] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 10/22/2019] [Accepted: 01/10/2020] [Indexed: 05/21/2023]
Abstract
Genomic prediction (GP) is the procedure whereby the genetic merits of untested candidates are predicted using genome wide marker information. Although numerous examples of GP exist in plants and animals, applications to polyploid organisms are still scarce, partly due to limited genome resources and the complexity of this system. Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms that have excelled at many prediction tasks. A potential advantage of DL for GP over standard linear model methods is that DL can potentially take into account all genetic interactions, including dominance and epistasis, which are expected to be of special relevance in most polyploids. In this study, we evaluated the predictive accuracy of linear and DL techniques in two important small fruits or berries: strawberry and blueberry. The two datasets contained a total of 1,358 allopolyploid strawberry (2n=8x=112) and 1,802 autopolyploid blueberry (2n=4x=48) individuals, genotyped for 9,908 and 73,045 single nucleotide polymorphism (SNP) markers, respectively, and phenotyped for five agronomic traits each. DL depends on numerous parameters that influence performance and optimizing hyperparameter values can be a critical step. Here we show that interactions between hyperparameter combinations should be expected and that the number of convolutional filters and regularization in the first layers can have an important effect on model performance. In terms of genomic prediction, we did not find an advantage of DL over linear model methods, except when the epistasis component was important. Linear Bayesian models were better than convolutional neural networks for the full additive architecture, whereas the opposite was observed under strong epistasis. However, by using a parameterization capable of taking into account these non-linear effects, Bayesian linear models can match or exceed the predictive accuracy of DL. 
A semiautomatic implementation of the DL pipeline is available at https://github.com/lauzingaretti/deepGP/.
Affiliation(s)
- Laura M. Zingaretti
  - Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Barcelona, Spain
- Salvador Alejandro Gezan
  - School of Forest Resources and Conservation, University of Florida, Gainesville, FL, United States
- Luis Felipe V. Ferrão
  - Blueberry Breeding and Genomics Lab, Horticultural Sciences Department, University of Florida, Gainesville, FL, United States
- Luis F. Osorio
  - IFAS Gulf Coast Research and Education Center, University of Florida, Wimauma, FL, United States
- Amparo Monfort
  - Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Barcelona, Spain
  - Institut de Recerca i Tecnologia Agroalimentàries (IRTA), Barcelona, Spain
- Patricio R. Muñoz
  - Blueberry Breeding and Genomics Lab, Horticultural Sciences Department, University of Florida, Gainesville, FL, United States
- Vance M. Whitaker
  - IFAS Gulf Coast Research and Education Center, University of Florida, Wimauma, FL, United States
- Miguel Pérez-Enciso
  - Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Barcelona, Spain
  - ICREA, Passeig de Lluís Companys 23, Barcelona, Spain

61
Gianola D, Fernando RL. A Multiple-Trait Bayesian Lasso for Genome-Enabled Analysis and Prediction of Complex Traits. Genetics 2020; 214:305-331. [PMID: 31879318 PMCID: PMC7017027 DOI: 10.1534/genetics.119.302934] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/05/2019] [Accepted: 12/20/2019] [Indexed: 12/21/2022]
Abstract
A multiple-trait Bayesian LASSO (MBL) for genome-based analysis and prediction of quantitative traits is presented and applied to two real data sets. The data-generating model is a multivariate linear Bayesian regression on possibly a huge number of molecular markers, and with a Gaussian residual distribution posed. Each (one per marker) of the [Formula: see text] vectors of regression coefficients (T: number of traits) is assigned the same T-variate Laplace prior distribution, with a null mean vector and unknown scale matrix Σ. The multivariate prior reduces to that of the standard univariate Bayesian LASSO when [Formula: see text] The covariance matrix of the residual distribution is assigned a multivariate Jeffreys prior, and Σ is given an inverse-Wishart prior. The unknown quantities in the model are learned using a Markov chain Monte Carlo sampling scheme constructed using a scale-mixture of normal distributions representation. MBL is demonstrated in a bivariate context employing two publicly available data sets using a bivariate genomic best linear unbiased prediction model (GBLUP) for benchmarking results. The first data set is one where wheat grain yields in two different environments are treated as distinct traits. The second data set comes from genotyped Pinus trees, with each individual measured for two traits: rust bin and gall volume. In MBL, the bivariate marker effects are shrunk differentially, i.e., "short" vectors are more strongly shrunk toward the origin than in GBLUP; conversely, "long" vectors are shrunk less. A predictive comparison was carried out as well in wheat, where the comparators of MBL were bivariate GBLUP and bivariate Bayes Cπ, a variable selection procedure. A training-testing layout was used, with 100 random reconstructions of training and testing sets. For the wheat data, all methods produced similar predictions. In Pinus, MBL gave better predictions than either a Bayesian bivariate GBLUP or the single trait Bayesian LASSO. 
MBL has been implemented in the Julia language package JWAS, and is now available for the scientific community to explore with different traits, species, and environments. It is well known that there is no universally best prediction machine, and MBL represents a new resource in the armamentarium for genome-enabled analysis and prediction of complex traits.
Affiliation(s)
- Daniel Gianola
  - Department of Animal Sciences, University of Wisconsin-Madison, Wisconsin 53706
  - Department of Dairy Science, University of Wisconsin-Madison, Wisconsin 53706
  - Department of Animal Science, Iowa State University, Ames, Iowa 50011
  - Department of Plant Sciences, Technical University of Munich (TUM), TUM School of Life Sciences, Freising, 85354 Germany
- Rohan L Fernando
  - Department of Animal Science, Iowa State University, Ames, Iowa 50011

62
Xu Y, Liu X, Fu J, Wang H, Wang J, Huang C, Prasanna BM, Olsen MS, Wang G, Zhang A. Enhancing Genetic Gain through Genomic Selection: From Livestock to Plants. PLANT COMMUNICATIONS 2020; 1:100005. [PMID: 33404534 PMCID: PMC7747995 DOI: 10.1016/j.xplc.2019.100005] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Indexed: 12/28/2022]
Abstract
Although long-term genetic gain has been achieved through increasing use of modern breeding methods and technologies, the rate of genetic gain needs to be accelerated to meet humanity's demand for agricultural products. In this regard, genomic selection (GS) has been considered most promising for genetic improvement of the complex traits controlled by many genes each with minor effects. Livestock scientists pioneered GS application largely due to livestock's significantly higher individual values and the greater reduction in generation interval that can be achieved in GS. Large-scale application of GS in plants can be achieved by refining field management to improve heritability estimation and prediction accuracy and developing optimum GS models with the consideration of genotype-by-environment interaction and non-additive effects, along with significant cost reduction. Moreover, it would be more effective to integrate GS with other breeding tools and platforms for accelerating the breeding process and thereby further enhancing genetic gain. In addition, establishing an open-source breeding network and developing transdisciplinary approaches would be essential in enhancing breeding efficiency for small- and medium-sized enterprises and agricultural research systems in developing countries. New strategies centered on GS for enhancing genetic gain need to be developed.
Affiliation(s)
- Yunbi Xu
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- CIMMYT-China Tropical Maize Research Center, Foshan University, Foshan 528231, China
- CIMMYT-China Specialty Maize Research Center, Shanghai Academy of Agricultural Sciences, Shanghai 201400, China
- Xiaogang Liu
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Junjie Fu
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Hongwu Wang
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Jiankang Wang
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Changling Huang
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Boddupalli M. Prasanna
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Michael S. Olsen
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Guoying Wang
- Institute of Crop Science/CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Aimin Zhang
- Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
|
63
|
Crossa J, Martini JWR, Gianola D, Pérez-Rodríguez P, Jarquin D, Juliana P, Montesinos-López O, Cuevas J. Deep Kernel and Deep Learning for Genome-Based Prediction of Single Traits in Multienvironment Breeding Trials. Front Genet 2019; 10:1168. [PMID: 31921277 PMCID: PMC6913188 DOI: 10.3389/fgene.2019.01168] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 09/16/2019] [Accepted: 10/23/2019] [Indexed: 11/13/2022]
Abstract
Deep learning (DL) is a promising method for genomic-enabled prediction. However, the implementation of DL is difficult because many hyperparameters (number of hidden layers, number of neurons, learning rate, number of epochs, batch size, etc.) need to be tuned. For this reason, deep kernel methods, which only require defining the number of layers, may be an attractive alternative. Deep kernel methods emulate DL models with a large number of neurons, but are defined by relatively easily computed covariance matrices. In this research, we compared the genome-based prediction of DL to a deep kernel (arc-cosine kernel, AK), to the commonly used non-additive Gaussian kernel (GK), as well as to the conventional additive genomic best linear unbiased predictor (GBLUP/GB). We used two real wheat data sets for benchmarking these methods. On average, AK and GK outperformed DL and GB. The gain in prediction performance of AK and GK over DL and GB was not large, but AK and GK have the advantage that only one parameter, the number of layers (AK) or the bandwidth parameter (GK), has to be tuned in each method. Furthermore, although AK and GK had similar performance, deep kernel AK is easier to implement than GK, since the parameter "number of layers" is more easily determined than the bandwidth parameter of GK. For the 2015-2016 data set, the difference in performance between AK and DL was larger, with AK predicting much better than DL. On these data, the optimization of the hyperparameters for DL was difficult and the parameters finally used may have been suboptimal. Our results suggest that AK is a good alternative to DL with the advantage that practically no tuning process is required.
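The statement that a deep kernel is defined by an easily computed covariance matrix with the number of layers as its only parameter can be sketched as follows. The code uses the degree-1 arc-cosine recursion of Cho and Saul, which is an assumption about the exact variant; the marker matrix is simulated, and `arc_cosine_kernel` is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def arc_cosine_kernel(X, n_layers=1):
    """Degree-1 arc-cosine kernel, composed n_layers times.

    X is an (n_lines, n_markers) matrix with no all-zero rows.
    """
    K = X @ X.T  # base layer: plain inner products between lines
    for _ in range(n_layers):
        d = np.sqrt(np.diag(K))
        norm = np.outer(d, d)
        cos_theta = np.clip(K / norm, -1.0, 1.0)  # guard against rounding error
        theta = np.arccos(cos_theta)
        K = (norm / np.pi) * (np.sin(theta) + (np.pi - theta) * cos_theta)
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 50))  # simulated stand-in for a marker matrix
K1 = arc_cosine_kernel(X, n_layers=1)
K3 = arc_cosine_kernel(X, n_layers=3)
print(K1.shape, np.allclose(np.diag(K3), np.sum(X**2, axis=1)))
```

In this recursion the diagonal is unchanged from layer to layer, so adding layers only reshapes the off-diagonal similarities between lines. The resulting matrix can be used wherever a GBLUP relationship matrix would be.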
Affiliation(s)
- José Crossa
- Biometrics and Statistics Unit, Genetic Resources Program, and Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Programa de Postgrado de Socioeconomia, Estadistica e Informatica, Colegio de Postgraduados, Texcoco, Mexico
- Johannes W R Martini
- Biometrics and Statistics Unit, Genetic Resources Program, and Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, United States
- Paulino Pérez-Rodríguez
- Programa de Postgrado de Socioeconomia, Estadistica e Informatica, Colegio de Postgraduados, Texcoco, Mexico
- Diego Jarquin
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, United States
- Philomin Juliana
- Biometrics and Statistics Unit, Genetic Resources Program, and Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico
- Jaime Cuevas
- Departamento de Ciencias, Universidad de Quintana Roo, Chetumal, Mexico
|
64
|
Liu Y, Wang D, He F, Wang J, Joshi T, Xu D. Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean. Front Genet 2019; 10:1091. [PMID: 31824557 PMCID: PMC6883005 DOI: 10.3389/fgene.2019.01091] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 07/22/2019] [Accepted: 10/09/2019] [Indexed: 12/21/2022]
Abstract
Genomic selection uses single-nucleotide polymorphisms (SNPs) to predict quantitative phenotypes for enhancing traits in breeding populations and has been widely used to increase breeding efficiency for plants and animals. Existing statistical methods rely on a prior distribution assumption of imputed genotype effects, which may not fit experimental datasets. Emerging deep learning technology could serve as a powerful machine learning tool to predict quantitative phenotypes without imputation and also to discover potential associated genotype markers efficiently. We propose a deep-learning framework using convolutional neural networks (CNNs) to predict quantitative traits from SNPs and to investigate genotype contributions to the trait using saliency maps. Missing SNP values are treated as a new genotype class in the input to the deep learning model. We tested our framework on both simulated data and experimental soybean datasets. The results show that the deep learning model can bypass the imputation of missing values and achieve more accurate predictions of quantitative phenotypes than other well-known statistical methods currently available. It can also effectively and efficiently identify significant SNP markers and SNP combinations in genome-wide association studies.
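One way to treat a missing SNP call as its own genotype class, rather than imputing it, is a one-hot encoding with an extra channel. The sketch below assumes genotypes coded 0, 1, 2 and a hypothetical missing code of -1; the actual coding and tensor layout used by the authors are not given in the abstract.

```python
import numpy as np

MISSING = -1  # hypothetical code for a missing SNP call

def one_hot_genotypes(G):
    """Encode an (n_samples, n_snps) matrix with values in {0, 1, 2, MISSING}
    as an (n_samples, n_snps, 4) one-hot tensor; channel 3 marks missing calls."""
    G = np.asarray(G)
    idx = np.where(G == MISSING, 3, G)  # map the missing code to class index 3
    return np.eye(4, dtype=np.float32)[idx]

G = np.array([[0, 1, 2, MISSING],
              [2, MISSING, 0, 1]])
X = one_hot_genotypes(G)
print(X.shape)
```

A tensor of this shape can be passed to a one-dimensional convolutional layer with the SNP position as the sequence axis and the four genotype classes as channels.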
Affiliation(s)
- Yang Liu
- Institute of Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Department of Electrical Engineer and Computer Science, University of Missouri, Columbia, MO, United States
- Duolin Wang
- Department of Electrical Engineer and Computer Science, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO, United States
- Fei He
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO, United States
- Department of Computer Science and Information Technology, Northeast Normal University, Changchun, China
- Juexin Wang
- Department of Electrical Engineer and Computer Science, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO, United States
- Trupti Joshi
- Institute of Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO, United States
- Department of Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO, United States
- Dong Xu
- Institute of Data Science and Informatics, University of Missouri, Columbia, MO, United States
- Department of Electrical Engineer and Computer Science, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO, United States
|
65
|
Montesinos-López OA, Montesinos-López A, Tuberosa R, Maccaferri M, Sciara G, Ammar K, Crossa J. Multi-Trait, Multi-Environment Genomic Prediction of Durum Wheat With Genomic Best Linear Unbiased Predictor and Deep Learning Methods. FRONTIERS IN PLANT SCIENCE 2019; 10:1311. [PMID: 31787990 PMCID: PMC6856087 DOI: 10.3389/fpls.2019.01311] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 06/18/2019] [Accepted: 09/20/2019] [Indexed: 05/23/2023]
Abstract
Although durum wheat (Triticum turgidum var. durum Desf.) is a minor cereal crop representing just 5-7% of the world's total wheat crop, it is a staple food in Mediterranean countries, where it is used to produce pasta, couscous, bulgur and bread. In this paper, we cover multi-trait prediction of grain yield (GY), days to heading (DH) and plant height (PH) of 270 durum wheat lines that were evaluated in 43 environments (country-location-year combinations) across a broad range of water regimes in the Mediterranean Basin and other locations. Multi-trait prediction analyses were performed by implementing a multi-trait deep learning model (MTDL) with a feed-forward network topology and a rectified linear unit activation function with a grid search approach for the selection of hyper-parameters. The results of the multi-trait deep learning method were also compared with univariate predictions of the genomic best linear unbiased predictor (GBLUP) method and the univariate counterpart of the multi-trait deep learning method (UDL). All models were implemented with and without the genotype × environment interaction term. We found that the best predictions were observed without the genotype × environment interaction term in the UDL and MTDL methods. However, under the GBLUP method, the best predictions were observed when the genotype × environment interaction term was taken into account. We also found that in general the best predictions were observed under the GBLUP model; however, the predictions of the MTDL were very similar to those of the GBLUP model. This result provides more evidence that the GBLUP model is a powerful approach for genomic prediction, but also that the deep learning method is a practical approach for predicting univariate and multivariate traits in the context of genomic selection.
Affiliation(s)
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Mexico
- Roberto Tuberosa
- Department of Agricultural and Food Sciences, University of Bologna, Bologna, Italy
- Marco Maccaferri
- Department of Agricultural and Food Sciences, University of Bologna, Bologna, Italy
- Giuseppe Sciara
- Department of Agricultural and Food Sciences, University of Bologna, Bologna, Italy
- Karim Ammar
- Global Wheat Breeding Program, International Maize and Wheat Improvement Center (CIMMYT), Mexico City, Mexico
- José Crossa
- Global Wheat Breeding Program, International Maize and Wheat Improvement Center (CIMMYT), Mexico City, Mexico
|
66
|
Harfouche AL, Jacobson DA, Kainer D, Romero JC, Harfouche AH, Scarascia Mugnozza G, Moshelion M, Tuskan GA, Keurentjes JJ, Altman A. Accelerating Climate Resilient Plant Breeding by Applying Next-Generation Artificial Intelligence. Trends Biotechnol 2019; 37:1217-1235. [DOI: 10.1016/j.tibtech.2019.05.007] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 02/11/2019] [Revised: 05/18/2019] [Accepted: 05/23/2019] [Indexed: 12/20/2022]
|
67
|
A Bayesian Genomic Multi-output Regressor Stacking Model for Predicting Multi-trait Multi-environment Plant Breeding Data. G3-GENES GENOMES GENETICS 2019; 9:3381-3393. [PMID: 31427455 PMCID: PMC6778812 DOI: 10.1534/g3.119.400336] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/18/2022]
Abstract
In this paper we propose a Bayesian multi-output regressor stacking (BMORS) model that is a generalization of the multi-trait regressor stacking method. The proposed BMORS model consists of two stages: in the first stage, a univariate genomic best linear unbiased prediction model (GBLUP, including the genotype × environment interaction, GE) is implemented for each of the L traits under study; in the second stage, the predictions of all traits are included as covariates in a Ridge regression model. The main objectives of this research were to study alternative models to the existing Bayesian multi-trait multi-environment (BMTME) model with respect to (1) genomic-enabled prediction accuracy, and (2) potential advantages in terms of computing resources and implementation. We compared the predictions of the BMORS model to those of the univariate GBLUP model using 7 maize and wheat datasets. We found that the proposed BMORS produced similar predictions to the univariate GBLUP model and to the BMTME model in terms of prediction accuracy; however, the best predictions were obtained under the BMTME model. In terms of computing resources, we found that the BMORS is at least 9 times faster than the BMTME method. Based on our empirical findings, the proposed BMORS model is an alternative for predicting multi-trait and multi-environment data, which are very common in genomic-enabled prediction in plant and animal breeding programs.
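The two-stage structure described in the abstract can be sketched with ordinary ridge regression at both stages. This is a deliberately simplified, non-Bayesian stand-in: stage 1 uses marker ridge regression instead of a GBLUP with a GE term, the data are simulated and assumed centered, and the penalties `alpha1` and `alpha2` are arbitrary illustrative values.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge coefficients (X'X + alpha I)^-1 X'Y, one column per trait."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Y)

def stacked_predict(X_train, Y_train, X_test, alpha1=1.0, alpha2=1.0):
    """Two-stage multi-output stacking with ridge regression at both stages."""
    # Stage 1: an independent univariate model for each of the L traits.
    B1 = ridge_fit(X_train, Y_train, alpha1)        # shape (p, L)
    Z_train, Z_test = X_train @ B1, X_test @ B1     # stage-1 predictions
    # Stage 2: each trait is regressed on the stage-1 predictions of all traits.
    B2 = ridge_fit(Z_train, Y_train, alpha2)        # shape (L, L)
    return Z_test @ B2

rng = np.random.default_rng(1)
n_train, n_test, p, L = 40, 10, 100, 3
X_train = rng.standard_normal((n_train, p))
X_test = rng.standard_normal((n_test, p))
B_true = rng.standard_normal((p, L)) / np.sqrt(p)
Y_train = X_train @ B_true + 0.5 * rng.standard_normal((n_train, L))

Y_hat = stacked_predict(X_train, Y_train, X_test)
print(Y_hat.shape)
```

The second stage is where information is shared across traits: the prediction for one trait can borrow from the stage-1 predictions of the others, at the cost of one small L-by-L regression.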
|
68
|
Cuevas J, Montesinos-López O, Juliana P, Guzmán C, Pérez-Rodríguez P, González-Bucio J, Burgueño J, Montesinos-López A, Crossa J. Deep Kernel for Genomic and Near Infrared Predictions in Multi-environment Breeding Trials. G3 (BETHESDA, MD.) 2019; 9:2913-2924. [PMID: 31289023 PMCID: PMC6723142 DOI: 10.1534/g3.119.400493] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 04/28/2019] [Accepted: 07/04/2019] [Indexed: 01/15/2023]
Abstract
Kernel methods are flexible and easy to interpret and have been successfully used in genomic-enabled prediction of various plant species. Kernel methods used in genomic prediction comprise the linear genomic best linear unbiased predictor (GBLUP or GB) kernel, and the Gaussian kernel (GK). In general, these kernels have been used with two statistical models: single-environment and genomic × environment (GE) models. Recently near infrared spectroscopy (NIR) has been used as an inexpensive and non-destructive high-throughput phenotyping method for predicting unobserved line performance in plant breeding trials. In this study, we used a non-linear arc-cosine kernel (AK) that emulates deep learning artificial neural networks. We compared AK prediction accuracy with the prediction accuracy of GB and GK kernel methods in four genomic data sets, one of which also includes pedigree and NIR information. Results show that for all four data sets, AK and GK kernels achieved higher prediction accuracy than the linear GB kernel for the single-environment and GE multi-environment models. In addition, AK achieved similar or slightly higher prediction accuracy than the GK kernel. For all data sets, the GE model achieved higher prediction accuracy than the single-environment model. For the data set that includes pedigree, markers and NIR, results show that the NIR wavelength alone achieved lower prediction accuracy than the genomic information alone; however, the pedigree plus NIR information achieved only slightly lower prediction accuracy than the marker plus the NIR high-throughput data.
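For orientation, the two reference kernels named in the abstract can be computed as below. The definitions are common choices and are assumptions here: the linear GB kernel is taken as G = XX'/p on column-standardized markers, and the Gaussian kernel scales squared Euclidean distances by their median with a bandwidth `h`. The marker matrix is simulated.

```python
import numpy as np

def gb_kernel(X):
    """Linear genomic relationship kernel G = Xs Xs' / p, with Xs the
    column-standardized marker matrix (assumes no monomorphic markers)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return Xs @ Xs.T / X.shape[1]

def gaussian_kernel(X, h=1.0):
    """Gaussian kernel exp(-h * d2 / median(d2)) on squared Euclidean distances."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    scale = np.median(d2[np.triu_indices_from(d2, k=1)])  # off-diagonal median
    return np.exp(-h * d2 / scale)

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(20, 100))  # simulated 0/1/2 marker codes
G_lin = gb_kernel(X)
K_gauss = gaussian_kernel(X, h=1.0)
print(np.trace(G_lin) / X.shape[0], K_gauss.diagonal()[:3])
```

With this standardization the average diagonal of the linear kernel is 1, and the Gaussian kernel has ones on its diagonal, so the two matrices are on comparable scales when swapped in the same mixed model.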
Affiliation(s)
- Jaime Cuevas
- Universidad de Quintana Roo, Chetumal, Quintana Roo, 77019 México
- Philomin Juliana
- International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz Km. 45, El Batán, 56237, Texcoco, Edo. de Mexico, Mexico
- Carlos Guzmán
- International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz Km. 45, El Batán, 56237, Texcoco, Edo. de Mexico, Mexico
- Juan Burgueño
- International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz Km. 45, El Batán, 56237, Texcoco, Edo. de Mexico, Mexico
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, 44430
- José Crossa
- International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz Km. 45, El Batán, 56237, Texcoco, Edo. de Mexico, Mexico
|
69
|
Pérez-Enciso M, Zingaretti LM. A Guide for Using Deep Learning for Complex Trait Genomic Prediction. Genes (Basel) 2019; 10:E553. [PMID: 31330861 PMCID: PMC6678200 DOI: 10.3390/genes10070553] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 06/12/2019] [Revised: 07/06/2019] [Accepted: 07/18/2019] [Indexed: 11/17/2022]
Abstract
Deep learning (DL) has emerged as a powerful tool to make accurate predictions from complex data such as image, text, or video. However, its ability to predict phenotypic values from molecular data is less well studied. Here, we describe the theoretical foundations of DL and provide a generic code that can be easily modified to suit specific needs. DL comprises a wide variety of algorithms which depend on numerous hyperparameters. Careful optimization of hyperparameter values is critical to avoid overfitting. Among the DL architectures currently tested in genomic prediction, convolutional neural networks (CNNs) seem more promising than multilayer perceptrons (MLPs). A limitation of DL is in interpreting the results. This may not be relevant for genomic prediction in plant or animal breeding but can be critical when deciding the genetic risk to a disease. Although DL technologies are not "plug-and-play", they are easily implemented using Keras and TensorFlow public software. To illustrate the principles described here, we implemented a Keras-based code in GitHub.
Affiliation(s)
- Miguel Pérez-Enciso
- Catalan Institution for Research and Advanced Studies (ICREA), Passeig de Lluís Companys 23, 08010 Barcelona, Spain
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193 Bellaterra, Barcelona, Spain
- Laura M Zingaretti
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193 Bellaterra, Barcelona, Spain
|
70
|
Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Juliana P, Singh R. New Deep Learning Genomic-Based Prediction Model for Multiple Traits with Binary, Ordinal, and Continuous Phenotypes. G3 (BETHESDA, MD.) 2019; 9:1545-1556. [PMID: 30858235 PMCID: PMC6505163 DOI: 10.1534/g3.119.300585] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 01/09/2019] [Accepted: 03/08/2019] [Indexed: 12/16/2022]
Abstract
Multiple-trait experiments with mixed phenotypes (binary, ordinal and continuous) are not rare in animal and plant breeding programs. However, there is a lack of statistical models that can exploit the correlation between traits with mixed phenotypes in order to improve prediction accuracy in the context of genomic selection (GS). For this reason, when breeders have mixed phenotypes, they usually analyze them using univariate models, and thus are not able to exploit the correlation between traits, which many times helps improve prediction accuracy. In this paper we propose applying deep learning for analyzing multiple traits with mixed phenotype data in terms of prediction accuracy. The prediction performance of multiple-trait deep learning with mixed phenotypes (MTDLMP) models was compared to the performance of univariate deep learning (UDL) models. Both models were evaluated using predictors with and without the genotype × environment (G×E) interaction term (I and WI, respectively). The metric used for evaluating prediction accuracy was Pearson's correlation for continuous traits and the percentage of cases correctly classified (PCCC) for binary and ordinal traits. We found that a modest gain in prediction accuracy was obtained only in the continuous trait under the MTDLMP model compared to the UDL model, whereas for the other traits (1 binary and 2 ordinal) we did not find any difference between the two models. In both models we observed that the prediction performance was better for WI than for I. The MTDLMP model is a good alternative for performing simultaneous predictions of mixed phenotypes (binary, ordinal and continuous) in the context of GS.
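The two accuracy metrics named in the abstract are simple to compute. The sketch below assumes PCCC is reported on a 0-100 scale and that predictions for binary and ordinal traits are class labels; the function names and the toy vectors are illustrative only.

```python
import numpy as np

def pccc(y_true, y_pred):
    """Percentage of cases correctly classified, for binary or ordinal labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

def pearson(y_true, y_pred):
    """Pearson's correlation between observed and predicted continuous values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy ordinal trait with three classes: four of five cases match.
ordinal_obs = [1, 2, 3, 3, 1]
ordinal_hat = [1, 2, 2, 3, 1]
score_pccc = pccc(ordinal_obs, ordinal_hat)

# Toy continuous trait: predictions are an exact linear function of the observations.
cont_obs = np.array([1.0, 2.0, 3.0, 4.0])
cont_hat = 2.0 * cont_obs + 1.0
score_r = pearson(cont_obs, cont_hat)
print(score_pccc, score_r)
```

Note that Pearson's correlation is invariant to the scale and offset of the predictions, whereas PCCC counts only exact class matches and gives no credit for an ordinal prediction that is one category off.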
Affiliation(s)
- Javier Martín-Vallejo
- Departamento de Estadística, Universidad de Salamanca, c/Espejo 2, Salamanca, 37007, España
- José Crossa
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
- Daniel Gianola
- Departments of Animal Sciences, Dairy Science, and Biostatistics and Medical Informatics, University of Wisconsin-Madison, Wisconsin 53706
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430, Jalisco, México
- Philomin Juliana
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
- Ravi Singh
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
|
71
|
Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Juliana P, Singh R. A Benchmarking Between Deep Learning, Support Vector Machine and Bayesian Threshold Best Linear Unbiased Prediction for Predicting Ordinal Traits in Plant Breeding. G3 (BETHESDA, MD.) 2019; 9:601-618. [PMID: 30593512 PMCID: PMC6385991 DOI: 10.1534/g3.118.200998] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 11/22/2018] [Accepted: 12/27/2018] [Indexed: 11/18/2022]
Abstract
Genomic selection is revolutionizing plant breeding. However, better statistical models for ordinal phenotypes are still lacking to improve the accuracy of the selection of candidate genotypes. For this reason, in this paper we explore the genomic-based prediction performance of two popular machine learning methods, the multilayer perceptron (MLP) and the support vector machine (SVM), vs. the Bayesian threshold genomic best linear unbiased prediction (TGBLUP) model. We used the percentage of cases correctly classified (PCCC) as a metric to measure the prediction performance, and seven real data sets to evaluate the prediction accuracy. We found that the best predictions in terms of PCCC (in four out of the seven data sets) occurred under the TGBLUP model, while the worst occurred under the SVM method. Also, in general we found no statistical differences between using 1, 2 and 3 layers under the MLP models, which means that the conventional neural network model with only one layer is often enough. However, although the TGBLUP model was better, we found that the predictions of MLP and SVM were very competitive, with the advantage that the SVM was the most efficient in terms of the computational time required.
Affiliation(s)
- Javier Martín-Vallejo
- Departamento de Estadística, Universidad de Salamanca, c/Espejo 2, Salamanca, 37007, España
- José Crossa
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
- Daniel Gianola
- Departments of Animal Sciences, Dairy Science, and Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53706
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430, Guadalajara, Jalisco, México
- Philomin Juliana
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
- Ravi Singh
- International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
|
72
|
Montesinos-López OA, Montesinos-López A, Crossa J, Gianola D, Hernández-Suárez CM, Martín-Vallejo J. Multi-trait, Multi-environment Deep Learning Modeling for Genomic-Enabled Prediction of Plant Traits. G3 (BETHESDA, MD.) 2018; 8:3829-3840. [PMID: 30291108 PMCID: PMC6288830 DOI: 10.1534/g3.118.200728] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 07/30/2018] [Accepted: 10/03/2018] [Indexed: 11/27/2022]
Abstract
Multi-trait and multi-environment data are common in animal and plant breeding programs. However, what is lacking are more powerful statistical models that can exploit the correlation between traits to improve prediction accuracy in the context of genomic selection (GS). Multi-trait models are more complex than univariate models and usually require more computational resources, but they are preferred because they can exploit the correlation between traits, which often helps improve prediction accuracy. For this reason, in this paper we explore the power of multi-trait deep learning (MTDL) models in terms of prediction accuracy. The prediction performance of MTDL models was compared to the performance of the Bayesian multi-trait and multi-environment (BMTME) model proposed by Montesinos-López et al. (2016), which is a multi-trait version of the genomic best linear unbiased prediction (GBLUP) univariate model. Both models were evaluated with predictors with and without the genotype×environment interaction term. The prediction performance of both models was evaluated in terms of Pearson's correlation using cross-validation. We found that the best predictions in two of the three data sets were obtained under the BMTME model, but in general the predictions of both models, BMTME and MTDL, were similar. Among models without the genotype×environment interaction, the MTDL model was the best, while among models with genotype×environment interaction, the BMTME model was superior. These results indicate that the MTDL model is very competitive for performing predictions in the context of GS, with the important practical advantage that it requires fewer computational resources than the BMTME model.
Affiliation(s)
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430, Guadalajara, Jalisco, México
- José Crossa
- Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Ciudad de México, México
- Daniel Gianola
- Departments of Animal Sciences, Dairy Science, and Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53706
- Javier Martín-Vallejo
- Departamento de Estadística, Universidad de Salamanca, c/Espejo 2, Salamanca, 37007, España
|