1
|
Weber SE, Chawla HS, Ehrig L, Hickey LT, Frisch M, Snowdon RJ. Accurate prediction of quantitative traits with failed SNP calls in canola and maize. FRONTIERS IN PLANT SCIENCE 2023; 14:1221750. [PMID: 37936929 PMCID: PMC10627008 DOI: 10.3389/fpls.2023.1221750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/12/2023] [Accepted: 10/05/2023] [Indexed: 11/09/2023]
Abstract
In modern plant breeding, genomic selection is becoming the gold standard to select superior genotypes in large breeding populations that are only partially phenotyped. Many breeding programs commonly rely on single-nucleotide polymorphism (SNP) markers to capture genome-wide data for selection candidates. For this purpose, SNP arrays with moderate to high marker density represent a robust and cost-effective tool to generate reproducible, easy-to-handle, high-throughput genotype data from large-scale breeding populations. However, SNP arrays are prone to technical errors that lead to failed allele calls. To overcome this problem, failed calls are often imputed, based on the assumption that failed SNP calls are purely technical. However, this ignores the biological causes for failed calls-for example: deletions-and there is increasing evidence that gene presence-absence and other kinds of genome structural variants can play a role in phenotypic expression. Because deletions are frequently not in linkage disequilibrium with their flanking SNPs, permutation of missing SNP calls can potentially obscure valuable marker-trait associations. In this study, we analyze published datasets for canola and maize using four parametric and two machine learning models and demonstrate that failed allele calls in genomic prediction are highly predictive for important agronomic traits. We present two statistical pipelines, based on population structure and linkage disequilibrium, that enable the filtering of failed SNP calls that are likely caused by biological reasons. For the population and trait examined, prediction accuracy based on these filtered failed allele calls was competitive to standard SNP-based prediction, underlying the potential value of missing data in genomic prediction approaches. The combination of SNPs with all failed allele calls or the filtered allele calls did not outperform predictions with only SNP-based prediction due to redundancy in genomic relationship estimates.
Collapse
Affiliation(s)
- Sven E. Weber
- Department of Plant Breeding, Justus Liebig University, Giessen, Germany
| | | | - Lennard Ehrig
- Department of Plant Breeding, Justus Liebig University, Giessen, Germany
| | - Lee T. Hickey
- Centre for Crop Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St Lucia, QLD, Australia
| | - Matthias Frisch
- Department of Biometry and Population Genetics, Justus Liebig University, Giessen, Germany
| | - Rod J. Snowdon
- Department of Plant Breeding, Justus Liebig University, Giessen, Germany
| |
Collapse
|
2
|
Chafai N, Hayah I, Houaga I, Badaoui B. A review of machine learning models applied to genomic prediction in animal breeding. Front Genet 2023; 14:1150596. [PMID: 37745853 PMCID: PMC10516561 DOI: 10.3389/fgene.2023.1150596] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2023] [Accepted: 08/22/2023] [Indexed: 09/26/2023] Open
Abstract
The advent of modern genotyping technologies has revolutionized genomic selection in animal breeding. Large marker datasets have shown several drawbacks for traditional genomic prediction methods in terms of flexibility, accuracy, and computational power. Recently, the application of machine learning models in animal breeding has gained a lot of interest due to their tremendous flexibility and their ability to capture patterns in large noisy datasets. Here, we present a general overview of a handful of machine learning algorithms and their application in genomic prediction to provide a meta-picture of their performance in genomic estimated breeding values estimation, genotype imputation, and feature selection. Finally, we discuss a potential adoption of machine learning models in genomic prediction in developing countries. The results of the reviewed studies showed that machine learning models have indeed performed well in fitting large noisy data sets and modeling minor nonadditive effects in some of the studies. However, sometimes conventional methods outperformed machine learning models, which confirms that there's no universal method for genomic prediction. In summary, machine learning models have great potential for extracting patterns from single nucleotide polymorphism datasets. Nonetheless, the level of their adoption in animal breeding is still low due to data limitations, complex genetic interactions, a lack of standardization and reproducibility, and the lack of interpretability of machine learning models when trained with biological data. Consequently, there is no remarkable outperformance of machine learning methods compared to traditional methods in genomic prediction. Therefore, more research should be conducted to discover new insights that could enhance livestock breeding programs.
Collapse
Affiliation(s)
- Narjice Chafai
- Laboratory of Biodiversity, Ecology, and Genome, Department of Biology, Faculty of Sciences, Mohammed V University in Rabat, Rabat, Morocco
| | - Ichrak Hayah
- Laboratory of Biodiversity, Ecology, and Genome, Department of Biology, Faculty of Sciences, Mohammed V University in Rabat, Rabat, Morocco
| | - Isidore Houaga
- Centre for Tropical Livestock Genetics and Health, The Roslin Institute, Royal (Dick) School of Veterinary Medicine, The University of Edinburgh, Edinburgh, United Kingdom
- The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh, United Kingdom
| | - Bouabid Badaoui
- Laboratory of Biodiversity, Ecology, and Genome, Department of Biology, Faculty of Sciences, Mohammed V University in Rabat, Rabat, Morocco
- African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laayoune, Morocco
| |
Collapse
|
3
|
Jeon D, Kang Y, Lee S, Choi S, Sung Y, Lee TH, Kim C. Digitalizing breeding in plants: A new trend of next-generation breeding based on genomic prediction. FRONTIERS IN PLANT SCIENCE 2023; 14:1092584. [PMID: 36743488 PMCID: PMC9892199 DOI: 10.3389/fpls.2023.1092584] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 01/05/2023] [Indexed: 06/18/2023]
Abstract
As the world's population grows and food needs diversification, the demand for cereals and horticultural crops with beneficial traits increases. In order to meet a variety of demands, suitable cultivars and innovative breeding methods need to be developed. Breeding methods have changed over time following the advance of genetics. With the advent of new sequencing technology in the early 21st century, predictive breeding, such as genomic selection (GS), emerged when large-scale genomic information became available. GS shows good predictive ability for the selection of individuals with traits of interest even for quantitative traits by using various types of the whole genome-scanning markers, breaking away from the limitations of marker-assisted selection (MAS). In the current review, we briefly describe the history of breeding techniques, each breeding method, various statistical models applied to GS and methods to increase the GS efficiency. Consequently, we intend to propose and define the term digital breeding through this review article. Digital breeding is to develop a predictive breeding methods such as GS at a higher level, aiming to minimize human intervention by automatically proceeding breeding design, propagating breeding populations, and to make selections in consideration of various environments, climates, and topography during the breeding process. We also classified the phases of digital breeding based on the technologies and methods applied to each phase. This review paper will provide an understanding and a direction for the final evolution of plant breeding in the future.
Collapse
Affiliation(s)
- Donghyun Jeon
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
| | - Yuna Kang
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
| | - Solji Lee
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
| | - Sehyun Choi
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
| | - Yeonjun Sung
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
| | - Tae-Ho Lee
- Genomics Division, National Institute of Agricultural Sciences, Jeonju, Republic of Korea
| | - Changsoo Kim
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
| |
Collapse
|
4
|
Abstract
Neuroimaging studies have a growing interest in learning the association between the individual brain connectivity networks and their clinical characteristics. It is also of great interest to identify the sub brain networks as biomarkers to predict the clinical symptoms, such as disease status, potentially providing insight on neuropathology. This motivates the need for developing a new type of regression model where the response variable is scalar, and predictors are networks that are typically represented as adjacent matrices or weighted adjacent matrices, to which we refer as scalar-on-network regression. In this work, we develop a new boosting method for model fitting with sub-network markers selection. Our approach, as opposed to group lasso or other existing regularization methods, is essentially a gradient descent algorithm leveraging known network structure. We demonstrate the utility of our methods via simulation studies and analysis of the resting-state fMRI data in a cognitive developmental cohort study.
Collapse
Affiliation(s)
| | - Kevin He
- University of Michigan, Department of Biostatistics
| | - Jian Kang
- University of Michigan, Department of Biostatistics
| |
Collapse
|
5
|
Perez BC, Bink MCAM, Svenson KL, Churchill GA, Calus MPL. Adding gene transcripts into genomic prediction improves accuracy and reveals sampling time dependence. G3 (BETHESDA, MD.) 2022; 12:jkac258. [PMID: 36161485 PMCID: PMC9635642 DOI: 10.1093/g3journal/jkac258] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/09/2022] [Accepted: 09/07/2022] [Indexed: 06/16/2023]
Abstract
Recent developments allowed generating multiple high-quality 'omics' data that could increase the predictive performance of genomic prediction for phenotypes and genetic merit in animals and plants. Here, we have assessed the performance of parametric and nonparametric models that leverage transcriptomics in genomic prediction for 13 complex traits recorded in 478 animals from an outbred mouse population. Parametric models were implemented using the best linear unbiased prediction, while nonparametric models were implemented using the gradient boosting machine algorithm. We also propose a new model named GTCBLUP that aims to remove between-omics-layer covariance from predictors, whereas its counterpart GTBLUP does not do that. While gradient boosting machine models captured more phenotypic variation, their predictive performance did not exceed the best linear unbiased prediction models for most traits. Models leveraging gene transcripts captured higher proportions of the phenotypic variance for almost all traits when these were measured closer to the moment of measuring gene transcripts in the liver. In most cases, the combination of layers was not able to outperform the best single-omics models to predict phenotypes. Using only gene transcripts, the gradient boosting machine model was able to outperform best linear unbiased prediction for most traits except body weight, but the same pattern was not observed when using both single nucleotide polymorphism genotypes and gene transcripts. Although the GTCBLUP model was not able to produce the most accurate phenotypic predictions, it showed the highest accuracies for breeding values for 9 out of 13 traits. We recommend using the GTBLUP model for prediction of phenotypes and using the GTCBLUP for prediction of breeding values.
Collapse
Affiliation(s)
- Bruno C Perez
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
| | - Marco C A M Bink
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
| | | | | | - Mario P L Calus
- Corresponding author: Animal Breeding and Genomics, Wageningen University & Research, P.O. Box 338, 6700 AH Wageningen, The Netherlands.
| |
Collapse
|
6
|
Genome-Enabled Prediction Methods Based on Machine Learning. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2467:189-218. [PMID: 35451777 DOI: 10.1007/978-1-0716-2205-6_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
Abstract
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers get knowledge from sets of experiences and provide forecasts or classification. In genome-wide based prediction (GWP), many ML studies have been carried out. This chapter provides a description of main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, to evaluate algorithms with the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and data distribution must be considered in order to choose the most appropriate model for a given problem.
Collapse
|
7
|
Perez BC, Bink MCAM, Svenson KL, Churchill GA, Calus MPL. Prediction performance of linear models and gradient boosting machine on complex phenotypes in outbred mice. G3 (BETHESDA, MD.) 2022; 12:6528848. [PMID: 35166767 PMCID: PMC8982369 DOI: 10.1093/g3journal/jkac039] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 01/29/2022] [Indexed: 12/14/2022]
Abstract
We compared the performance of linear (GBLUP, BayesB, and elastic net) methods to a nonparametric tree-based ensemble (gradient boosting machine) method for genomic prediction of complex traits in mice. The dataset used contained genotypes for 50,112 SNP markers and phenotypes for 835 animals from 6 generations. Traits analyzed were bone mineral density, body weight at 10, 15, and 20 weeks, fat percentage, circulating cholesterol, glucose, insulin, triglycerides, and urine creatinine. The youngest generation was used as a validation subset, and predictions were based on all older generations. Model performance was evaluated by comparing predictions for animals in the validation subset against their adjusted phenotypes. Linear models outperformed gradient boosting machine for 7 out of 10 traits. For bone mineral density, cholesterol, and glucose, the gradient boosting machine model showed better prediction accuracy and lower relative root mean squared error than the linear models. Interestingly, for these 3 traits, there is evidence of a relevant portion of phenotypic variance being explained by epistatic effects. Using a subset of top markers selected from a gradient boosting machine model helped for some of the traits to improve the accuracy of prediction when these were fitted into linear and gradient boosting machine models. Our results indicate that gradient boosting machine is more strongly affected by data size and decreased connectedness between reference and validation sets than the linear models. Although the linear models outperformed gradient boosting machine for the polygenic traits, our results suggest that gradient boosting machine is a competitive method to predict complex traits with assumed epistatic effects.
Collapse
Affiliation(s)
- Bruno C Perez
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
| | - Marco C A M Bink
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
| | | | | | - Mario P L Calus
- Wageningen University & Research, Animal Breeding and Genomics, 6700 AH Wageningen, The Netherlands
| |
Collapse
|
8
|
Varona L, Legarra A, Toro MA, Vitezica ZG. Genomic Prediction Methods Accounting for Nonadditive Genetic Effects. Methods Mol Biol 2022; 2467:219-243. [PMID: 35451778 DOI: 10.1007/978-1-0716-2205-6_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
The use of genomic information for prediction of future phenotypes or breeding values for the candidates to selection has become a standard over the last decade. However, most procedures for genomic prediction only consider the additive (or substitution) effects associated with polymorphic markers. Nevertheless, the implementation of models that consider nonadditive genetic variation may be interesting because they (1) may increase the ability of prediction, (2) can be used to define mate allocation procedures in plant and animal breeding schemes, and (3) can be used to benefit from nonadditive genetic variation in crossbreeding or purebred breeding schemes. This study reviews the available methods for incorporating nonadditive effects into genomic prediction procedures and their potential applications in predicting future phenotypic performance, mate allocation, and crossbred and purebred selection. Finally, a brief outline of some future research lines is also proposed.
Collapse
Affiliation(s)
- Luis Varona
- Departamento de Anatomía, Embriología y Genética Animal, Universidad de Zaragoza, Zaragoza, Spain.
- Instituto Agroalimentario de Aragón (IA2), Zaragoza, Spain.
| | | | - Miguel A Toro
- Dpto. Producción Agraria, ETS Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Madrid, Spain
| | | |
Collapse
|
9
|
Cockburn M. Review: Application and Prospective Discussion of Machine Learning for the Management of Dairy Farms. Animals (Basel) 2020; 10:E1690. [PMID: 32962078 PMCID: PMC7552676 DOI: 10.3390/ani10091690] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Revised: 09/09/2020] [Accepted: 09/15/2020] [Indexed: 12/29/2022] Open
Abstract
Dairy farmers use herd management systems, behavioral sensors, feeding lists, breeding schedules, and health records to document herd characteristics. Consequently, large amounts of dairy data are becoming available. However, a lack of data integration makes it difficult for farmers to analyze the data on their dairy farm, which indicates that these data are currently not being used to their full potential. Hence, multiple issues in dairy farming such as low longevity, poor performance, and health issues remain. We aimed to evaluate whether machine learning (ML) methods can solve some of these existing issues in dairy farming. This review summarizes peer-reviewed ML papers published in the dairy sector between 2015 and 2020. Ultimately, 97 papers from the subdomains of management, physiology, reproduction, behavior analysis, and feeding were considered in this review. The results confirm that ML algorithms have become common tools in most areas of dairy research, particularly to predict data. Despite the quantity of research available, most tested algorithms have not performed sufficiently for a reliable implementation in practice. This may be due to poor training data. The availability of data resources from multiple farms covering longer periods would be useful to improve prediction accuracies. In conclusion, ML is a promising tool in dairy research, which could be used to develop and improve decision support for farmers. As the cow is a multifactorial system, ML algorithms could analyze integrated data sources that describe and ultimately allow managing cows according to all relevant influencing factors. However, both the integration of multiple data sources and the obtainability of public data currently remain challenging.
Collapse
Affiliation(s)
- Marianne Cockburn
- Agroscope, Competitiveness and System Evaluation, 8356 Ettenhausen, Switzerland
| |
Collapse
|
10
|
Jia C, Zhao F, Wang X, Han J, Zhao H, Liu G, Wang Z. Genomic Prediction for 25 Agronomic and Quality Traits in Alfalfa ( Medicago sativa). FRONTIERS IN PLANT SCIENCE 2018; 9:1220. [PMID: 30177947 PMCID: PMC6109793 DOI: 10.3389/fpls.2018.01220] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/14/2018] [Accepted: 07/30/2018] [Indexed: 05/31/2023]
Abstract
Agronomic and quality traits in alfalfa are very important to forage industry. Genomic prediction (GP) based on genotyping-by-sequencing (GBS) data could shorten the breeding cycles and accelerate the genetic gains of these complex traits, if they display moderate to high prediction accuracies. The aim of this study was to investigate the predictive potentials of these traits in alfalfa. A total of 322 genotypes from 75 alfalfa accessions were used for GP of the agronomic and quality traits, which were related to yield and nutrition value, respectively, using BayesA, BayesB, and BayesCπ methods. Ten-fold cross validation was used to evaluate the accuracy of GP represented by the correlation between genomic estimated breeding value (GEBV) and estimated breeding value (EBV). The accuracies ranged from 0.0021 to 0.6485 for different traits. For each trait, three GP methods displayed similar prediction accuracies. Among 15 quality traits, mineral element Ca had a moderate and the highest prediction accuracy (0.34). NDF digestibility after 48 h (NDFD 48 h) and 30 h (NDFD 30 h) and mineral element Mg had prediction accuracies varying from 0.20 to 0.25. Other traits, for example, fat and crude protein, showed low prediction accuracies (0.05 to 0.19). Among 10 agronomic traits, however, some displayed relatively high prediction accuracies. Plant height (PH) in fall (FH) had the highest prediction accuracy (0.65), followed by flowering date (FD) and plant regrowth (PR) with accuracies at 0.52 and 0.51, respectively. Leaf to stem ratio (LS), plant branch (PB), and biomass yield (BY) reached to moderate prediction accuracies ranging from 0.25 to 0.32. Our results revealed that a few agronomic traits, such as FH, FD, and PR, had relatively high prediction accuracies, therefore it is feasible to apply genomic selection (GS) for these traits in alfalfa breeding programs. Because of the limitations of population size and density of SNP markers, several traits displayed low accuracies which could be improved by a bigger reference population, higher density of SNP markers, and more powerful statistic tools.
Collapse
Affiliation(s)
- Congjun Jia
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Fuping Zhao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xuemin Wang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Jianlin Han
- CAAS-ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- International Livestock Research Institute (ILRI), Nairobi, Kenya
| | - Haiming Zhao
- Institute of Dryland Farming, Hebei Academy of Agriculture and Forestry Sciences, Hengshui, China
| | - Guibo Liu
- Institute of Dryland Farming, Hebei Academy of Agriculture and Forestry Sciences, Hengshui, China
| | - Zan Wang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| |
Collapse
|
11
|
Weigel K, VanRaden P, Norman H, Grosu H. A 100-Year Review: Methods and impact of genetic selection in dairy cattle—From daughter–dam comparisons to deep learning algorithms. J Dairy Sci 2017; 100:10234-10250. [DOI: 10.3168/jds.2017-12954] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2017] [Accepted: 06/11/2017] [Indexed: 11/19/2022]
|
12
|
González-Recio O, Rosa GJ, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci 2014. [DOI: 10.1016/j.livsci.2014.05.036] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
13
|
de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 2013; 193:327-45. [PMID: 22745228 PMCID: PMC3567727 DOI: 10.1534/genetics.112.143313] [Citation(s) in RCA: 489] [Impact Index Per Article: 44.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2012] [Accepted: 06/11/2012] [Indexed: 11/18/2022] Open
Abstract
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Methods exist that allow implementing these large-p with small-n regressions, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of available methods is long, and the relationships between them have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in applications, and present a general discussion of lessons learned from simulation and empirical data analysis in the last decade.
Collapse
Affiliation(s)
- Gustavo de Los Campos
- Department of Biostatistics, School of Public Health, University of Alabama, Birmingham, AL 35294, USA.
| | | | | | | | | |
Collapse
|
14
|
Jiménez-Montero JA, González-Recio O, Alenda R. Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle. J Dairy Sci 2012; 96:625-34. [PMID: 23102955 DOI: 10.3168/jds.2012-5631] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2012] [Accepted: 09/17/2012] [Indexed: 01/24/2023]
Abstract
The aim of this study was to evaluate methods for genomic evaluation of the Spanish Holstein population as an initial step toward the implementation of routine genomic evaluations. This study provides a description of the population structure of progeny tested bulls in Spain at the genomic level and compares different genomic evaluation methods with regard to accuracy and bias. Two bayesian linear regression models, Bayes-A and Bayesian-LASSO (B-LASSO), as well as a machine learning algorithm, Random-Boosting (R-Boost), and BLUP using a realized genomic relationship matrix (G-BLUP), were compared. Five traits that are currently under selection in the Spanish Holstein population were used: milk yield, fat yield, protein yield, fat percentage, and udder depth. In total, genotypes from 1859 progeny tested bulls were used. The training sets were composed of bulls born before 2005; including 1601 bulls for production and 1574 bulls for type, whereas the testing sets contained 258 and 235 bulls born in 2005 or later for production and type, respectively. Deregressed proofs (DRP) from January 2009 Interbull (Uppsala, Sweden) evaluation were used as the dependent variables for bulls in the training sets, whereas DRP from the December 2011 DRPs Interbull evaluation were used to compare genomic predictions with progeny test results for bulls in the testing set. Genomic predictions were more accurate than traditional pedigree indices for predicting future progeny test results of young bulls. The gain in accuracy, due to inclusion of genomic data varied by trait and ranged from 0.04 to 0.42 Pearson correlation units. Results averaged across traits showed that B-LASSO had the highest accuracy with an advantage of 0.01, 0.03 and 0.03 points in Pearson correlation compared with R-Boost, Bayes-A, and G-BLUP, respectively. The B-LASSO predictions also showed the least bias (0.02, 0.03 and 0.10 SD units less than Bayes-A, R-Boost and G-BLUP, respectively) as measured by mean difference between genomic predictions and progeny test results. The R-Boosting algorithm provided genomic predictions with regression coefficients closer to unity, which is an alternative measure of bias, for 4 out of 5 traits and also resulted in mean squared errors estimates that were 2%, 10%, and 12% smaller than B-LASSO, Bayes-A, and G-BLUP, respectively. The observed prediction accuracy obtained with these methods was within the range of values expected for a population of similar size, suggesting that the prediction method and reference population described herein are appropriate for implementation of routine genome-assisted evaluations in Spanish dairy cattle. R-Boost is a competitive marker regression methodology in terms of predictive ability that can accommodate large data sets.
Collapse
Affiliation(s)
- J A Jiménez-Montero
- Departamento de Producción Animal, Escuela Técnica Superior de Ingenieros (ETSI) Agrónomos-Universidad Politécnica de Madrid, 28040 Madrid, Spain.
| | | | | |
Collapse
|
15
|
González-Recio O, Jiménez-Montero JA, Alenda R. The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets. J Dairy Sci 2012; 96:614-24. [PMID: 23102953 DOI: 10.3168/jds.2012-5630] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2012] [Accepted: 09/11/2012] [Indexed: 11/19/2022]
Abstract
In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. The results showed that the modification of the original boosting algorithm could be run in 1% of the time used with the original algorithm and with negligible differences in accuracy and bias. This modification may be used to speed the calculus of genome-assisted evaluation in large data sets such us those obtained from consortiums.
Collapse
Affiliation(s)
- O González-Recio
- Departamento de Mejora Genética Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), 28040 Madrid, Spain.
| | | | | |
Collapse
|
16
|
Maltecca C, Parker KL, Cassady JP. Application of multiple shrinkage methods to genomic predictions1. J Anim Sci 2012; 90:1777-87. [DOI: 10.2527/jas.2011-4350] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Christian Maltecca
- Department of Animal Science, North Carolina State University, Raleigh, 27695
| | - Kristen L. Parker
- Department of Animal Science, North Carolina State University, Raleigh, 27695
| | - Joseph P. Cassady
- Department of Animal Science, North Carolina State University, Raleigh, 27695
| |
Collapse
|
17
|
González-Recio O, Forni S. Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet Sel Evol 2011; 43:7. [PMID: 21329522 PMCID: PMC3400433 DOI: 10.1186/1297-9686-43-7] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2010] [Accepted: 02/17/2011] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates) small n (number of observations) problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context. METHODS This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) to analyze discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be observed records. Performances were compared based on the models' predictive ability. RESULTS The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals, phi correlation was up to 81% greater than Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. CONCLUSIONS The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals. Among the different alternatives proposed to analyze discrete traits, machine-learning showed some advantages over Bayesian regressions. Boosting with a pseudo Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be case-dependent and a initial evaluation of different methods is recommended to deal with a particular problem.
Collapse
|