1
Ren Y, Wu C, Zhou H, Hu X, Miao Z. Dual-extraction modeling: A multi-modal deep-learning architecture for phenotypic prediction and functional gene mining of complex traits. Plant Communications 2024; 5:101002. [PMID: 38872306 DOI: 10.1016/j.xplc.2024.101002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 05/27/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024]
Abstract
Despite considerable advances in extracting crucial insights from bio-omics data to unravel the intricate mechanisms underlying complex traits, the absence of a universal multi-modal computational tool with robust interpretability for accurate phenotype prediction and identification of trait-associated genes remains a challenge. This study introduces the dual-extraction modeling (DEM) approach, a multi-modal deep-learning architecture designed to extract representative features from heterogeneous omics datasets, enabling the prediction of complex trait phenotypes. Through comprehensive benchmarking experiments, we demonstrate the efficacy of DEM in classification and regression prediction of complex traits. DEM consistently exhibits superior accuracy, robustness, generalizability, and flexibility. Notably, we establish its effectiveness in predicting pleiotropic genes that influence both flowering time and rosette leaf number, underscoring its commendable interpretability. In addition, we have developed user-friendly software to facilitate seamless utilization of DEM's functions. In summary, this study presents a state-of-the-art approach with the ability to effectively predict qualitative and quantitative traits and identify functional genes, confirming its potential as a valuable tool for exploring the genetic basis of complex traits.
Affiliation(s)
- Yanlin Ren
  - State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China
- Chenhua Wu
  - State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China
- He Zhou
  - State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China
- Xiaona Hu
  - College of Chemistry & Pharmacy, Northwest A&F University, Yangling, Shaanxi 712100, China
- Zhenyan Miao
  - State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China; Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi 712100, China
2
Farooq MA, Gao S, Hassan MA, Huang Z, Rasheed A, Hearne S, Prasanna B, Li X, Li H. Artificial intelligence in plant breeding. Trends Genet 2024:S0168-9525(24)00167-7. [PMID: 39117482 DOI: 10.1016/j.tig.2024.07.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 07/06/2024] [Accepted: 07/12/2024] [Indexed: 08/10/2024]
Abstract
Harnessing cutting-edge technologies to enhance crop productivity is a pivotal goal in modern plant breeding. Artificial intelligence (AI) is renowned for its prowess in big data analysis and pattern recognition, and is revolutionizing numerous scientific domains including plant breeding. We explore the wider potential of AI tools in various facets of breeding, including data collection, unlocking genetic diversity within genebanks, and bridging the genotype-phenotype gap to facilitate crop breeding. This will enable the development of crop cultivars tailored to the projected future environments. Moreover, AI tools also hold promise for refining crop traits by improving the precision of gene-editing systems and predicting the potential effects of gene variants on plant phenotypes. Leveraging AI-enabled precision breeding can augment the efficiency of breeding programs and holds promise for optimizing cropping systems at the grassroots level. This entails identifying optimal inter-cropping and crop-rotation models to enhance agricultural sustainability and productivity in the field.
Affiliation(s)
- Muhammad Amjad Farooq
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), International Maize and Wheat Improvement Center (CIMMYT) China office, Beijing 100081, China; Nanfan Research Institute, CAAS, Sanya, Hainan 572024, China
- Shang Gao
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), International Maize and Wheat Improvement Center (CIMMYT) China office, Beijing 100081, China; Nanfan Research Institute, CAAS, Sanya, Hainan 572024, China
- Muhammad Adeel Hassan
  - Adaptive Cropping Systems Laboratory, Beltsville Agricultural Research Center, US Department of Agriculture, Beltsville, MD 20705, USA; Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Zhangping Huang
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), International Maize and Wheat Improvement Center (CIMMYT) China office, Beijing 100081, China; Nanfan Research Institute, CAAS, Sanya, Hainan 572024, China
- Awais Rasheed
  - Department of Plant Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
- Sarah Hearne
  - CIMMYT, KM 45 Carretera Mexico-Veracruz, El Batan, Texcoco 56237, Mexico
- Boddupalli Prasanna
  - CIMMYT, International Centre for Research in Agroforestry (ICRAF) House, Nairobi 00100, Kenya
- Xinhai Li
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), International Maize and Wheat Improvement Center (CIMMYT) China office, Beijing 100081, China
- Huihui Li
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), International Maize and Wheat Improvement Center (CIMMYT) China office, Beijing 100081, China; Nanfan Research Institute, CAAS, Sanya, Hainan 572024, China
3
Nascimento M, Nascimento ACC, Azevedo CF, de Oliveira ACB, Caixeta ET, Jarquin D. Enhancing genomic prediction with Stacking Ensemble Learning in Arabica Coffee. Frontiers in Plant Science 2024; 15:1373318. [PMID: 39086911 PMCID: PMC11288849 DOI: 10.3389/fpls.2024.1373318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Accepted: 06/12/2024] [Indexed: 08/02/2024]
Abstract
Coffee breeding programs have traditionally relied on observing plant characteristics over years, a slow and costly process. Genomic selection (GS) offers a DNA-based alternative for faster selection of superior cultivars. Stacking Ensemble Learning (SEL) combines multiple models for potentially even more accurate selection. This study explores the potential of SEL in coffee breeding, aiming to improve prediction accuracy for important traits [yield (YL), total number of fruits (NF), leaf miner infestation (LM), and cercosporiosis incidence (Cer)] in Coffea arabica. We analyzed data from 195 individuals genotyped for 21,211 single-nucleotide polymorphism (SNP) markers. To comprehensively assess model performance, we employed a cross-validation (CV) scheme. Genomic Best Linear Unbiased Prediction (GBLUP), multivariate adaptive regression splines (MARS), Quantile Random Forest (QRF), and Random Forest (RF) served as base learners. For the meta-learner within the SEL framework, various options were explored, including Ridge Regression, RF, GBLUP, and Single Average. The SEL method was able to predict important traits in Coffea arabica, with a predictive ability (PA) higher than that of every base learner. The gains in PA in relation to GBLUP were 87.44% (the ratio between the PA obtained from the best Stacking model and the GBLUP), 37.83%, 199.82%, and 14.59% for YL, NF, LM and Cer, respectively. Overall, SEL presents a promising approach for GS. By combining predictions from multiple models, SEL can potentially enhance the PA of GS for complex traits.
Affiliation(s)
- Moyses Nascimento
  - Laboratory of Intelligence Computational and Statistical Learning (LICAE), Department of Statistics, Federal University of Viçosa, Viçosa, Brazil
  - Agronomy Department, University of Florida, Gainesville, FL, United States
- Ana Carolina Campana Nascimento
  - Laboratory of Intelligence Computational and Statistical Learning (LICAE), Department of Statistics, Federal University of Viçosa, Viçosa, Brazil
  - Agronomy Department, University of Florida, Gainesville, FL, United States
- Camila Ferreira Azevedo
  - Laboratory of Intelligence Computational and Statistical Learning (LICAE), Department of Statistics, Federal University of Viçosa, Viçosa, Brazil
- Diego Jarquin
  - Agronomy Department, University of Florida, Gainesville, FL, United States
4
Wang X, Zhang Z, Du H, Pfeiffer C, Mészáros G, Ding X. Predictive ability of multi-population genomic prediction methods of phenotypes for reproduction traits in Chinese and Austrian pigs. Genet Sel Evol 2024; 56:49. [PMID: 38926647 PMCID: PMC11201905 DOI: 10.1186/s12711-024-00915-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Accepted: 05/30/2024] [Indexed: 06/28/2024] Open
Abstract
BACKGROUND Multi-population genomic prediction can rapidly expand the size of the reference population and improve genomic prediction ability. Machine learning (ML) algorithms have shown advantages in single-population genomic prediction of phenotypes. However, few studies have explored the effectiveness of ML methods for multi-population genomic prediction. RESULTS In this study, 3720 Yorkshire pigs from Austria and four breeding farms in China were used, and single-trait genomic best linear unbiased prediction (ST-GBLUP), multitrait GBLUP (MT-GBLUP), Bayesian Horseshoe (BayesHE), and three ML methods (support vector regression (SVR), kernel ridge regression (KRR) and AdaBoost.R2) were compared to explore the optimal method for joint genomic prediction of phenotypes of Chinese and Austrian pigs through 10 replicates of fivefold cross-validation. We tested the performance of the different methods in two scenarios: (i) including only one Austrian population and one Chinese pig population that were genetically linked based on principal component analysis (PCA) (designated as the "two-population scenario") and (ii) adding reference populations that are unrelated based on PCA to the above two populations (designated as the "multi-population scenario"). Our results show that the use of MT-GBLUP in the two-population scenario resulted in an improvement of 7.1% in predictive ability compared to ST-GBLUP, while the use of SVR and KRR yielded improvements in predictive ability of 4.5 and 5.3%, respectively, compared to MT-GBLUP. SVR and KRR also yielded lower mean square errors (MSE) in most population and trait combinations. In the multi-population scenario, improvements in predictive ability of 29.7, 24.4 and 11.1% were obtained compared to ST-GBLUP when using, respectively, SVR, KRR, and AdaBoost.R2. However, compared to MT-GBLUP, the potential of ML methods to improve predictive ability was not demonstrated.
CONCLUSIONS Our study demonstrates that ML algorithms can achieve better prediction performance than multitrait GBLUP models in multi-population genomic prediction of phenotypes when the populations have similar genetic backgrounds; however, when reference populations that are unrelated based on PCA are added, the ML methods did not show a benefit. When the number of populations increased, only MT-GBLUP improved predictive ability in both validation populations, while the other methods showed improvement in only one population.
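The "10 replicates of fivefold cross-validation" design mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `repeated_kfold`, the fixed seed, and the round-robin assignment of shuffled animals to folds are all assumptions made for the example.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (replicate, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: each replicate reshuffles the sample indices and
    partitions them into n_folds disjoint test sets.
    """
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a new random partition for every replicate
        # Deal the shuffled indices round-robin into n_folds disjoint folds.
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            # Training set: every index that is not in the current test fold.
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx


# 20 hypothetical animals, 10 replicates x 5 folds = 50 train/test splits.
splits = list(repeated_kfold(n_samples=20))
```

Predictive ability would then be computed on each test fold and averaged over all splits; that step depends on the prediction model and is left out here.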
Affiliation(s)
- Xue Wang
  - State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Zipeng Zhang
  - State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Hehe Du
  - State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Gábor Mészáros
  - University of Natural Resources and Life Sciences, Vienna, Austria
- Xiangdong Ding
  - State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
5
Li X, Chen X, Wang Q, Yang N, Sun C. Integrating Bioinformatics and Machine Learning for Genomic Prediction in Chickens. Genes (Basel) 2024; 15:690. [PMID: 38927626 PMCID: PMC11202573 DOI: 10.3390/genes15060690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 05/12/2024] [Accepted: 05/23/2024] [Indexed: 06/28/2024] Open
Abstract
Genomic prediction plays an increasingly important role in modern animal breeding, with predictive accuracy being a crucial aspect. The classical linear mixed model is increasingly unable to accommodate the growing number of target traits and the increasingly intricate genetic regulatory patterns. Hence, novel approaches are necessary for future genomic prediction. In this study, we used an Illumina 50K SNP chip to genotype 4190 egg-type female Rhode Island Red chickens. Machine learning (ML) and classical bioinformatics methods were integrated to fit genotypes with 10 economic traits in chickens. We evaluated the effectiveness of ML methods using Pearson correlation coefficients and the RMSE between predicted and actual phenotypic values and compared them with rrBLUP and BayesA. Our results indicated that ML algorithms exhibit significantly superior performance to rrBLUP and BayesA in predicting body weight and eggshell strength traits. Conversely, rrBLUP and BayesA demonstrated 2-58% higher predictive accuracy in predicting egg numbers. Additionally, the incorporation of suggestively significant SNPs obtained through the GWAS into the ML models resulted in an increase in the predictive accuracy of 0.1-27% across nearly all traits. These findings suggest the potential of combining classical bioinformatics methods with ML techniques to improve genomic prediction in the future.
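The two evaluation metrics named in this abstract, the Pearson correlation coefficient and the RMSE between predicted and actual phenotypic values, can be written out in a few lines. The sketch below uses the standard textbook definitions; the function names and the toy values are made up for illustration and do not come from the study.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between predicted and observed phenotypes."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root-mean-squared error between predicted and observed phenotypes."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values (hypothetical): higher r and lower RMSE indicate better prediction.
predicted = [1.0, 2.0, 3.0]
observed = [2.0, 4.0, 6.0]
r = pearson_r(predicted, observed)
error = rmse(predicted, observed)
```

The toy data show why both metrics are reported: the predictions are perfectly correlated with the observations yet consistently too small, which only the RMSE reveals.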
Affiliation(s)
- Congjiao Sun
  - State Key Laboratory of Animal Biotech Breeding and Frontiers Science Center for Molecular Design Breeding (MOE), China Agricultural University, Beijing 100193, China; (X.L.); (X.C.); (Q.W.); (N.Y.)
6
Bose S, Banerjee S, Kumar S, Saha A, Nandy D, Hazra S. Review of applications of artificial intelligence (AI) methods in crop research. J Appl Genet 2024; 65:225-240. [PMID: 38216788 DOI: 10.1007/s13353-023-00826-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 12/23/2023] [Accepted: 12/26/2023] [Indexed: 01/14/2024]
Abstract
Sophisticated and modern crop improvement techniques can bridge the gap for feeding the ever-increasing population. Artificial intelligence (AI) refers to the simulation of human intelligence in machines, which involves the application of computational algorithms, machine learning (ML) and deep learning (DL) techniques. It aims to generalise patterns and relationships from historical data, employing various mathematical optimisation techniques to build prediction models that facilitate the selection of superior genotypes. These techniques are less resource intensive and can solve the problem based on the analysis of large-scale phenotypic datasets. ML for genomic selection (GS) uses high-throughput genotyping technologies to gather genetic information on a large number of markers across the genome. The prediction of GS models is based on the mathematical relation between genotypic and phenotypic data from the training population. ML techniques have emerged as powerful tools for genome editing through analysing large-scale genomic data and facilitating the development of accurate prediction models. Precise phenotyping is a prerequisite to advance crop breeding for solving agricultural production-related issues. ML algorithms can solve this problem by generating predictive models based on the analysis of large-scale phenotypic datasets. DL models also show potential for reliable, precise phenotyping. This review provides a comprehensive overview of various ML and DL models, their applications, and their potential to enhance the efficiency, specificity and safety of advanced crop improvement protocols such as genomic selection and genome editing, along with phenotypic prediction, to promote accelerated breeding.
Affiliation(s)
- Suvojit Bose
  - Department of Vegetables and Spice Crops, Uttar Banga Krishi Viswavidyalaya, Pundibari, Cooch Behar, 736165, West Bengal, India
- Soumya Kumar
  - School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Akash Saha
  - School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Debalina Nandy
  - School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Soham Hazra
  - Department of Agriculture, Brainware University, Barasat, 700125, West Bengal, India
7
Mota LFM, Arikawa LM, Santos SWB, Fernandes Júnior GA, Alves AAC, Rosa GJM, Mercadante MEZ, Cyrillo JNSG, Carvalheiro R, Albuquerque LG. Benchmarking machine learning and parametric methods for genomic prediction of feed efficiency-related traits in Nellore cattle. Sci Rep 2024; 14:6404. [PMID: 38493207 PMCID: PMC10944497 DOI: 10.1038/s41598-024-57234-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 03/15/2024] [Indexed: 03/18/2024] Open
Abstract
Genomic selection (GS) offers a promising opportunity for selecting more efficient animals to use consumed energy for maintenance and growth functions, impacting profitability and environmental sustainability. Here, we compared the prediction accuracy of multi-layer neural network (MLNN) and support vector regression (SVR) against single-trait (STGBLUP), multi-trait genomic best linear unbiased prediction (MTGBLUP), and Bayesian regression (BayesA, BayesB, BayesC, BRR, and BLasso) for feed efficiency (FE) traits. FE-related traits were measured in 1156 Nellore cattle from an experimental breeding program genotyped for ~300 K markers after quality control. Prediction accuracy (Acc) was evaluated using a forward validation splitting the dataset based on birth year, considering the phenotypes adjusted for the fixed effects and covariates as pseudo-phenotypes. The MLNN and SVR approaches were trained by randomly splitting the training population into five folds to select the best hyperparameters. The results show that the machine learning methods (MLNN and SVR) and MTGBLUP outperformed STGBLUP and the Bayesian regression approaches, increasing the Acc by approximately 8.9%, 14.6%, and 13.7% using MLNN, SVR, and MTGBLUP, respectively. Acc for SVR and MTGBLUP were slightly different, ranging from 0.62 to 0.69 and 0.62 to 0.68, respectively, and both models were empirically unbiased (0.97 and 1.09). Our results indicated that SVR and MTGBLUP approaches were more accurate in predicting FE-related traits than Bayesian regression and STGBLUP and seemed competitive for GS of complex phenotypes with various degrees of inheritance.
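The forward validation described in this abstract splits the dataset by birth year, so that older animals form the training set and younger animals the validation set. A minimal sketch of that idea is given below; the record layout, the `birth_year` field, the cutoff year and the function name `forward_split` are all assumptions for illustration, since the abstract does not give these details.

```python
def forward_split(records, cutoff_year):
    """Split records into (training, validation) sets by birth year.

    Hypothetical helper: animals born in or before cutoff_year train the
    model, animals born later are held out for validation.
    """
    train = [r for r in records if r["birth_year"] <= cutoff_year]
    valid = [r for r in records if r["birth_year"] > cutoff_year]
    return train, valid


# Made-up animals; the IDs and years are placeholders, not study data.
animals = [
    {"id": "A1", "birth_year": 2015},
    {"id": "A2", "birth_year": 2016},
    {"id": "A3", "birth_year": 2017},
    {"id": "A4", "birth_year": 2018},
]
train, valid = forward_split(animals, cutoff_year=2016)
```

Unlike a random split, this design never lets the model see animals younger than those it is asked to predict, which mirrors how genomic selection is applied in practice.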
Affiliation(s)
- Lucio F M Mota
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- Leonardo M Arikawa
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- Samuel W B Santos
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- Gerardo A Fernandes Júnior
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- Anderson A C Alves
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
- Guilherme J M Rosa
  - Department of Animal and Dairy Sciences, University of Wisconsin, Madison, WI, 53706, USA
- Maria E Z Mercadante
  - Institute of Animal Science, Beef Cattle Research Center, Sertãozinho, SP, 14174-000, Brazil
  - National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil
- Joslaine N S G Cyrillo
  - Institute of Animal Science, Beef Cattle Research Center, Sertãozinho, SP, 14174-000, Brazil
- Roberto Carvalheiro
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
  - National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil
- Lucia G Albuquerque
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (UNESP), Jaboticabal, SP, 14884-900, Brazil
  - National Council for Science and Technological Development, Brasilia, DF, 71605-001, Brazil
8
Hoque A, Anderson JV, Rahman M. Genomic prediction for agronomic traits in a diverse Flax (Linum usitatissimum L.) germplasm collection. Sci Rep 2024; 14:3196. [PMID: 38326469 PMCID: PMC10850546 DOI: 10.1038/s41598-024-53462-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Accepted: 01/31/2024] [Indexed: 02/09/2024] Open
Abstract
Breeding programs require exhaustive phenotyping of germplasms, which is time-demanding and expensive. Genomic prediction helps breeders harness the diversity of any collection to bypass phenotyping. Here, we examined the potential of genomic prediction for seed yield and nine agronomic traits using 26,171 single nucleotide polymorphism (SNP) markers in a set of 337 flax (Linum usitatissimum L.) germplasm, phenotyped in five environments. We evaluated 14 prediction models and several factors affecting predictive ability based on cross-validation schemes. Models yielded significant variation among predictive ability values across traits for the whole marker set. The ridge regression (RR) model covering additive gene action yielded better predictive ability for most of the traits, whereas models capturing epistatic gene action gave higher predictive ability for traits with low heritability. Marker subsets based on linkage disequilibrium decay distance gave significantly higher predictive abilities than the whole marker set, but for randomly selected markers, predictive ability reached a plateau above 3000 markers. Markers having significant association with traits improved predictive abilities compared to the whole marker set when marker selection was made on the whole population instead of the training set, indicating clear overfitting. The correction for population structure did not increase predictive abilities compared to the whole collection. However, stratified sampling by picking representative genotypes from each cluster improved predictive abilities. The indirect predictive ability for a trait was proportionate to its correlation with other traits. These results will help breeders to select the best models, optimum marker set, and suitable genotype set to perform an indirect selection for quantitative traits in this diverse flax germplasm collection.
Affiliation(s)
- Ahasanul Hoque
  - Department of Plant Sciences, North Dakota State University, Fargo, ND, USA
  - Department of Genetics and Plant Breeding, Bangladesh Agricultural University, Mymensingh, 2202, Bangladesh
- James V Anderson
  - USDA-ARS, Edward T. Schafer Agricultural Research Center, Fargo, ND, USA
- Mukhlesur Rahman
  - Department of Plant Sciences, North Dakota State University, Fargo, ND, USA
9
Wu C, Zhang Y, Ying Z, Li L, Wang J, Yu H, Zhang M, Feng X, Wei X, Xu X. A transformer-based genomic prediction method fused with knowledge-guided module. Brief Bioinform 2023; 25:bbad438. [PMID: 38058185 PMCID: PMC10701102 DOI: 10.1093/bib/bbad438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 10/15/2023] [Accepted: 11/03/2023] [Indexed: 12/08/2023] Open
Abstract
Genomic prediction (GP) uses single nucleotide polymorphisms (SNPs) to establish associations between markers and phenotypes. Selection of early individuals by genomic estimated breeding value shortens the generation interval and speeds up the breeding process. Recently, methods based on deep learning (DL) have gained great attention in the field of GP. In this study, we explore the application of Transformer-based structures to GP and develop a novel deep-learning model named GPformer. GPformer obtains a global view by gleaning beneficial information from all relevant SNPs regardless of the physical distance between SNPs. Comprehensive experimental results on five different crop datasets show that GPformer outperforms ridge regression-based linear unbiased prediction (RR-BLUP), support vector regression (SVR), light gradient boosting machine (LightGBM) and deep neural network genomic prediction (DNNGP) in terms of mean absolute error, Pearson's correlation coefficient and the proposed metric consistent index. Furthermore, we introduce a knowledge-guided module (KGM) to extract genome-wide association studies-based information, which is fused into GPformer as prior knowledge. KGM is very flexible and can be plugged into any DL network. Ablation studies of KGM on three datasets illustrate the efficiency of KGM adequately. Moreover, GPformer is robust and stable to hyperparameters and can generalize to each phenotype of every dataset, which is suitable for practical application scenarios.
Affiliation(s)
- Cuiling Wu
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
- Yiyi Zhang
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
- Zhiwen Ying
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
- Ling Li
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
- Jun Wang
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
- Hui Yu
  - Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130012, China
- Mengchen Zhang
  - State Key Laboratory of Rice Biology, China National Rice Research Institute, Hangzhou 310006, China
- Xianzhong Feng
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
  - Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130012, China
- Xinghua Wei
  - Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China
  - State Key Laboratory of Rice Biology, China National Rice Research Institute, Hangzhou 310006, China
- Xiaogang Xu
  - School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
10
Hamadani A, Ganai NA. Artificial intelligence algorithm comparison and ranking for weight prediction in sheep. Sci Rep 2023; 13:13242. [PMID: 37582936 PMCID: PMC10427635 DOI: 10.1038/s41598-023-40528-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/02/2023] [Accepted: 08/11/2023] [Indexed: 08/17/2023] Open
Abstract
In a rapidly transforming world, farm data is growing exponentially. Realizing the importance of this data, researchers are looking for new solutions to analyse this data and make farming predictions. Artificial Intelligence, with its capacity to handle big data, is rapidly becoming popular. In addition, it can also handle non-linear, noisy data and is not limited by the conditions required for conventional data analysis. This study was therefore undertaken to compare the most popular machine learning (ML) algorithms and rank them as per their ability to make predictions on sheep farm data spanning 11 years. Data cleaning and preparation were done before analysis. Winsorization was done for outlier removal. Principal component analysis (PCA) and feature selection (FS) were done and, based on that, three datasets were created for bodyweight prediction, viz. PCA (wherein only PCA was used), PCA + FS (both techniques used for dimensionality reduction), and FS (only feature selection used). Among the 11 ML algorithms that were evaluated, the correlations between true and predicted values for MARS algorithm, Bayesian ridge regression, Ridge regression, Support Vector Machines, Gradient boosting algorithm, Random forests, XgBoost algorithm, Artificial neural networks, Classification and regression trees, Polynomial regression, K nearest neighbours and Genetic Algorithms were 0.993, 0.992, 0.991, 0.991, 0.991, 0.99, 0.99, 0.984, 0.984, 0.957, 0.949, 0.734 respectively for bodyweights. The top five algorithms for the prediction of bodyweights were MARS, Bayesian ridge regression, Ridge regression, Support Vector Machines and Gradient boosting algorithm. A total of 12 machine learning models were developed for the prediction of bodyweights in sheep in the present study. It may be said that machine learning techniques can perform predictions with reasonable accuracies and can thus help in drawing inferences and making futuristic predictions on farms for their economic prosperity, performance improvement and subsequently food security.
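Winsorization, which this abstract names as its outlier treatment, caps extreme values at chosen quantiles instead of deleting them. The sketch below shows the general technique only: the 10%/90% limits, the nearest-rank quantile convention and the sample weights are assumptions, as the abstract does not state which settings were used.

```python
def winsorize(values, lower_q=0.1, upper_q=0.9):
    """Clamp values outside the chosen quantiles to the quantile values.

    Hypothetical helper; the quantile limits are illustrative defaults.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    # Nearest-rank style limits (an assumed convention; others exist).
    low_limit = ordered[round(lower_q * last)]
    high_limit = ordered[round(upper_q * last)]
    # Values beyond the limits are replaced by the limits, not dropped.
    return [min(max(v, low_limit), high_limit) for v in values]


# Made-up bodyweights (kg) with one implausibly low and one high record.
weights = [1, 20, 21, 22, 23, 24, 25, 26, 27, 28, 90]
clipped = winsorize(weights)
```

Because the records are capped rather than removed, the sample size stays the same, which matters when every animal carries other measurements that would otherwise be lost.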
Affiliation(s)
- Nazir Ahmad Ganai
  - Sher-e-Kashmir University of Agricultural Sciences and Technology of Kashmir, Kashmir, India
11
Alves AAC, Fernandes AFA, Lopes FB, Breen V, Hawken R, Gianola D, Rosa GJDM. (Quasi) multitask support vector regression with heuristic hyperparameter optimization for whole-genome prediction of complex traits: a case study with carcass traits in broilers. G3 (Bethesda, Md.) 2023; 13:jkad109. [PMID: 37216670 PMCID: PMC10411556 DOI: 10.1093/g3journal/jkad109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 03/13/2023] [Accepted: 04/24/2023] [Indexed: 05/24/2023]
Abstract
This study investigates nonlinear kernels for multitrait (MT) genomic prediction using support vector regression (SVR) models. We assessed the predictive ability delivered by single-trait (ST) and MT models for 2 carcass traits (CT1 and CT2) measured in purebred broiler chickens. The MT models also included information on indicator traits measured in vivo [Growth and feed efficiency trait (FE)]. We proposed an approach termed (quasi) multitask SVR (QMTSVR), with hyperparameter optimization performed via genetic algorithm. ST and MT Bayesian shrinkage and variable selection models [genomic best linear unbiased predictor (GBLUP), BayesC (BC), and reproducing kernel Hilbert space (RKHS) regression] were employed as benchmarks. MT models were trained using 2 validation designs (CV1 and CV2), which differ if the information on secondary traits is available in the testing set. Models' predictive ability was assessed with prediction accuracy (ACC; i.e. the correlation between predicted and observed values, divided by the square root of phenotype accuracy), standardized root-mean-squared error (RMSE*), and inflation factor (b). To account for potential bias in CV2-style predictions, we also computed a parametric estimate of accuracy (ACCpar). Predictive ability metrics varied according to trait, model, and validation design (CV1 or CV2), ranging from 0.71 to 0.84 for ACC, 0.78 to 0.92 for RMSE*, and between 0.82 and 1.34 for b. The highest ACC and smallest RMSE* were achieved with QMTSVR-CV2 in both traits. We observed that for CT1, model/validation design selection was sensitive to the choice of accuracy metric (ACC or ACCpar). Nonetheless, the higher predictive accuracy of QMTSVR over MTGBLUP and MTBC was replicated across accuracy metrics, besides the similar performance between the proposed method and the MTRKHS model. 
Results showed that the proposed approach is competitive with conventional MT Bayesian regression models using either Gaussian or spike-slab multivariate priors.
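The three evaluation metrics named in the abstract above (ACC, RMSE* and the inflation factor b) can be sketched as follows. This is a hedged illustration only. The abstract defines ACC as the correlation divided by the square root of phenotype accuracy, but it does not say how RMSE is standardized or how b is estimated. Dividing RMSE by the standard deviation of the observed values, and taking b as the slope of the regression of observed on predicted values, are assumptions made here, as are all function and parameter names.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def accuracy(pred, obs, phenotype_accuracy):
    """ACC: correlation(pred, obs) divided by sqrt(phenotype accuracy)."""
    return pearson(pred, obs) / phenotype_accuracy ** 0.5


def standardized_rmse(pred, obs):
    """RMSE divided by the SD of the observed values (assumed standardization)."""
    rmse = mean([(p - o) ** 2 for p, o in zip(pred, obs)]) ** 0.5
    return rmse / pstdev(obs)


def inflation_factor(pred, obs):
    """Slope of the regression of observed on predicted values (assumed definition of b)."""
    mp, mo = mean(pred), mean(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / var_p
```

Under this convention, b close to 1 indicates predictions on the same scale as the observations, while values well above or below 1 indicate deflated or inflated predictions.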
Affiliation(s)
- Vivian Breen
- Cobb-Vantress Inc., Siloam Springs, AR 72761, USA
- Daniel Gianola
- Department of Animal and Dairy Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA
|
12
|
Ruperao P, Rangan P, Shah T, Thakur V, Kalia S, Mayes S, Rathore A. The Progression in Developing Genomic Resources for Crop Improvement. Life (Basel) 2023; 13:1668. [PMID: 37629524 PMCID: PMC10455509 DOI: 10.3390/life13081668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 07/21/2023] [Accepted: 07/25/2023] [Indexed: 08/27/2023] Open
Abstract
Sequencing technologies have rapidly evolved over the past two decades, and new technologies are being continually developed and commercialized. The emerging sequencing technologies target generating more data with fewer inputs and at lower costs. This has also translated to an increase in the number and type of corresponding applications in genomics besides enhanced computational capacities (both hardware and software). Alongside the evolving DNA sequencing landscape, bioinformatics research teams have also evolved to accommodate the increasingly demanding techniques used to combine and interpret data, leading to many researchers moving from the lab to the computer. The rich history of DNA sequencing has paved the way for new insights and the development of new analysis methods. Understanding and learning from past technologies can help with the progress of future applications. This review focuses on the evolution of sequencing technologies, their significant enabling role in generating plant genome assemblies and downstream applications, and the parallel development of bioinformatics tools and skills, filling the gap in data analysis techniques.
Affiliation(s)
- Pradeep Ruperao
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad 502324, India
- Parimalan Rangan
- ICAR-National Bureau of Plant Genetic Resources, PUSA Campus, New Delhi 110012, India
- Trushar Shah
- International Institute of Tropical Agriculture (IITA), Nairobi 30709-00100, Kenya
- Vivek Thakur
- Department of Systems & Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Sanjay Kalia
- Department of Biotechnology, Ministry of Science and Technology, Government of India, New Delhi 110003, India
- Sean Mayes
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad 502324, India
- Abhishek Rathore
- Excellence in Breeding, International Maize and Wheat Improvement Center (CIMMYT), Hyderabad 502324, India
|
13
|
Zhao L, Walkowiak S, Fernando WGD. Artificial Intelligence: A Promising Tool in Exploring the Phytomicrobiome in Managing Disease and Promoting Plant Health. PLANTS (BASEL, SWITZERLAND) 2023; 12:1852. [PMID: 37176910 PMCID: PMC10180744 DOI: 10.3390/plants12091852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 04/25/2023] [Accepted: 04/27/2023] [Indexed: 05/15/2023]
Abstract
There is increasing interest in harnessing the microbiome to improve cropping systems. With the availability of high-throughput and low-cost sequencing technologies, gathering microbiome data is becoming more routine. However, the analysis of microbiome data is challenged by the size and complexity of the data and by the incomplete nature of many microbiome databases. Further, to deliver value, microbiome data often need to be analyzed in conjunction with other complex data that affect crop health and disease management, such as plant genotype and environmental factors. Artificial intelligence (AI), boosted through deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large complex datasets such as those describing the interplay between the microbiome, crop plants, and their environment. In this review, we give readers a brief introduction to AI techniques and describe how AI has been applied to microbiome sequencing taxonomy, the functional annotation of microbiome sequences, associating the microbiome community with host traits, designing synthetic communities, genomic selection, field phenotyping, and disease forecasting. At the end of this review, we propose further efforts that are required to fully exploit the power of AI in studying phytomicrobiomes.
Affiliation(s)
- Liang Zhao
- Department of Plant Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
|
14
|
Jeon D, Kang Y, Lee S, Choi S, Sung Y, Lee TH, Kim C. Digitalizing breeding in plants: A new trend of next-generation breeding based on genomic prediction. FRONTIERS IN PLANT SCIENCE 2023; 14:1092584. [PMID: 36743488 PMCID: PMC9892199 DOI: 10.3389/fpls.2023.1092584] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 11/08/2022] [Accepted: 01/05/2023] [Indexed: 06/18/2023]
Abstract
As the world's population grows and food needs diversify, the demand for cereals and horticultural crops with beneficial traits increases. To meet this variety of demands, suitable cultivars and innovative breeding methods need to be developed. Breeding methods have changed over time following advances in genetics. With the advent of new sequencing technology in the early 21st century, predictive breeding, such as genomic selection (GS), emerged when large-scale genomic information became available. GS shows good predictive ability for selecting individuals with traits of interest, even for quantitative traits, by using various types of whole-genome-scanning markers, breaking away from the limitations of marker-assisted selection (MAS). In the current review, we briefly describe the history of breeding techniques, each breeding method, the various statistical models applied to GS, and methods to increase GS efficiency. We then propose and define the term digital breeding. Digital breeding takes predictive breeding methods such as GS to a higher level, aiming to minimize human intervention by automating breeding design, the propagation of breeding populations, and selection, while taking into account various environments, climates, and topography during the breeding process. We also classify the phases of digital breeding based on the technologies and methods applied in each phase. This review paper provides an understanding of, and a direction for, the future evolution of plant breeding.
Affiliation(s)
- Donghyun Jeon
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
- Yuna Kang
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
- Solji Lee
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
- Sehyun Choi
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
- Yeonjun Sung
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
- Tae-Ho Lee
- Genomics Division, National Institute of Agricultural Sciences, Jeonju, Republic of Korea
- Changsoo Kim
- Plant Computational Genomics Laboratory, Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon, Republic of Korea
- Plant Computational Genomics Laboratory, Department of Crop Science, Chungnam National University, Daejeon, Republic of Korea
|
15
|
Comparison of artificial intelligence algorithms and their ranking for the prediction of genetic merit in sheep. Sci Rep 2022; 12:18726. [PMID: 36333409 PMCID: PMC9636184 DOI: 10.1038/s41598-022-23499-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2022] [Accepted: 11/01/2022] [Indexed: 11/06/2022] Open
Abstract
As the amount of data on farms grows, it is important to evaluate the potential of artificial intelligence for making farming predictions. This study was therefore undertaken to evaluate various machine learning (ML) algorithms using 52 years of data on sheep. Data preparation was done before analysis. Breeding values were estimated using Best Linear Unbiased Prediction. Twelve ML algorithms were evaluated for their ability to predict the breeding values. The variance inflation factor for all features selected through principal component analysis (PCA) was 1. The correlation coefficients between true and predicted values for artificial neural networks, Bayesian ridge regression, classification and regression trees, gradient boosting algorithm, K nearest neighbours, multivariate adaptive regression splines (MARS) algorithm, polynomial regression, principal component regression (PCR), random forests, support vector machines, XGBoost algorithm were 0.852, 0.742, 0.869, 0.915, 0.781, 0.746, 0.742, 0.746, 0.917, 0.777, 0.915 respectively for breeding value prediction. Random forests had the highest correlation coefficient. Among the prediction equations generated using OLS, the highest coefficient of determination was 0.569. A total of 12 machine learning models were developed for the prediction of breeding values in sheep in the present study. Machine learning techniques can therefore perform predictions with reasonable accuracy and can be viable alternatives to conventional strategies for breeding value prediction.
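The reported variance inflation factor of 1 is what one would expect for principal components in general, rather than a property specific to this dataset. A short derivation follows, in notation that is ours and not the abstract's: $z_j$ denotes the scores of the $j$-th principal component, and $R_j^2$ is the coefficient of determination obtained when $z_j$ is regressed on the remaining components.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Cov}(z_j, z_k) = 0 \quad (j \neq k)
\;\Longrightarrow\;
R_j^2 = 0
\;\Longrightarrow\;
\mathrm{VIF}_j = 1 .
\]
```

Because principal component scores are mutually uncorrelated in the sample they were fitted on, no component can be linearly predicted from the others, so every VIF equals 1 and multicollinearity among the PCA-derived features is absent by construction.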
|
16
|
Manthena V, Jarquín D, Varshney RK, Roorkiwal M, Dixit GP, Bharadwaj C, Howard R. Evaluating dimensionality reduction for genomic prediction. Front Genet 2022; 13:958780. [PMID: 36313472 PMCID: PMC9614092 DOI: 10.3389/fgene.2022.958780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 09/05/2022] [Indexed: 11/13/2022] Open
Abstract
The development of genomic selection (GS) methods has allowed plant breeding programs to select favorable lines using genomic data before performing field trials. Improvements in genotyping technology have yielded high-dimensional genomic marker data which can be difficult to incorporate into statistical models. In this paper, we investigated the utility of applying dimensionality reduction (DR) methods as a pre-processing step for GS methods. We compared five DR methods and studied the trend in the prediction accuracies of each method as a function of the number of features retained. The effect of DR methods was studied using three models that involved the main effects of line, environment, marker, and the genotype by environment interactions. The methods were applied on a real data set containing 315 lines phenotyped in nine environments with 26,817 markers each. Regardless of the DR method and prediction model used, only a fraction of features was sufficient to achieve maximum correlation. Our results underline the usefulness of DR methods as a key pre-processing step in GS models to improve computational efficiency in the face of ever-increasing size of genomic data.
Affiliation(s)
- Vamsi Manthena
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, United States
- Diego Jarquín
- Agronomy Department, University of Florida, Gainesville, FL, United States
- Rajeev K. Varshney
- State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Murdoch University, Murdoch, WA, Australia
- Center of Excellence in Genomics & Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Manish Roorkiwal
- Center of Excellence in Genomics & Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Khalifa Center for Genetic Engineering and Biotechnology, United Arab Emirates University, Al-Ain, United Arab Emirates
- Girish Prasad Dixit
- ICAR-All India Coordinated Research Project (AICRP)-Chickpea, ICAR-Indian Institute of Pulses Research (IIPR), Kanpur, Uttar Pradesh, India
- Reka Howard
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, United States
- *Correspondence: Reka Howard,
|
17
|
Zhang F, Weigel K, Cabrera V. Predicting daily milk yield for primiparous cows using data of within-herd relatives to capture genotype-by-environment interactions. J Dairy Sci 2022; 105:6739-6748. [DOI: 10.3168/jds.2021-21559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2021] [Accepted: 03/29/2022] [Indexed: 11/19/2022]
|
18
|
Gabur I, Simioniuc DP, Snowdon RJ, Cristea D. Machine Learning Applied to the Search for Nonlinear Features in Breeding Populations. Front Artif Intell 2022; 5:876578. [PMID: 35669178 PMCID: PMC9164111 DOI: 10.3389/frai.2022.876578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 04/19/2022] [Indexed: 11/13/2022] Open
Abstract
Large plant breeding populations are traditionally a source of novel allelic diversity and are at the core of selection efforts for elite material. Finding rare diversity requires a deep understanding of biological interactions between the genetic makeup of one genotype and its environmental conditions. Most modern breeding programs still rely on linear regression models to solve this problem, generalizing the complex genotype by phenotype interactions through manually constructed linear features. However, the identification of positive alleles vs. background can be addressed using deep learning approaches that have the capacity to learn complex nonlinear functions for the inputs. Machine learning (ML) is an artificial intelligence (AI) approach involving a range of algorithms to learn from input data sets and predict outcomes in other related samples. This paper describes a variety of techniques that include supervised and unsupervised ML algorithms to improve our understanding of nonlinear interactions from plant breeding data sets. Feature selection (FS) methods are combined with linear and nonlinear predictors and compared to traditional prediction methods used in plant breeding. Recent advances in ML allowed the construction of complex models that have the capacity to better differentiate between positive alleles and the genetic background. Using real plant breeding program data, we show that ML methods have the ability to outperform current approaches, increase prediction accuracies, decrease the computing time drastically, and improve the detection of important alleles involved in qualitative or quantitative traits.
Affiliation(s)
- Iulian Gabur
- Department of Plant Breeding, Justus-Liebig-University, Giessen, Germany
- Department of Plant Sciences, Iasi University of Life Sciences, Iasi, Romania
- *Correspondence: Iulian Gabur
- Rod J. Snowdon
- Department of Plant Breeding, Justus-Liebig-University, Giessen, Germany
- Dan Cristea
- Institute of Computer Science, Romanian Academy, Iasi Branch, Iasi, Romania
|
19
|
Wang X, Shi S, Wang G, Luo W, Wei X, Qiu A, Luo F, Ding X. Using machine learning to improve the accuracy of genomic prediction of reproduction traits in pigs. J Anim Sci Biotechnol 2022; 13:60. [PMID: 35578371 PMCID: PMC9112588 DOI: 10.1186/s40104-022-00708-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/15/2021] [Accepted: 03/13/2022] [Indexed: 12/02/2022] Open
Abstract
Background: Recently, machine learning (ML) has become attractive in genomic prediction, but its superiority in genomic prediction over conventional (ss) GBLUP methods and the choice of optimal ML methods need to be investigated.
Results: In this study, 2566 Chinese Yorkshire pigs with reproduction trait records were genotyped with the GenoBaits Porcine SNP 50 K and PorcineSNP50 panels. Four ML methods, including support vector regression (SVR), kernel ridge regression (KRR), random forest (RF) and Adaboost.R2 were implemented. Through 20 replicates of fivefold cross-validation (CV) and one prediction for younger individuals, the utility of ML methods in genomic prediction was explored. In CV, compared with genomic BLUP (GBLUP), single-step GBLUP (ssGBLUP) and the Bayesian method BayesHE, ML methods significantly outperformed these conventional methods. ML methods improved the genomic prediction accuracy of GBLUP, ssGBLUP, and BayesHE by 19.3%, 15.0% and 20.8%, respectively. In addition, ML methods yielded smaller mean squared error (MSE) and mean absolute error (MAE) in all scenarios. ssGBLUP yielded an improvement of 3.8% on average in accuracy compared to that of GBLUP, and the accuracy of BayesHE was close to that of GBLUP. In genomic prediction of younger individuals, RF and Adaboost.R2_KRR performed better than GBLUP and BayesHE, while ssGBLUP performed comparably with RF, and ssGBLUP yielded slightly higher accuracy and lower MSE than Adaboost.R2_KRR in the prediction of total number of piglets born, while for number of piglets born alive, Adaboost.R2_KRR performed significantly better than ssGBLUP. Among ML methods, Adaboost.R2_KRR consistently performed well in our study. Our findings also demonstrated that optimal hyperparameters are useful for ML methods. After tuning hyperparameters in CV and in predicting genomic outcomes of younger individuals, the average improvement was 14.3% and 21.8% over those using default hyperparameters, respectively.
Conclusion: Our findings demonstrated that ML methods had better overall prediction performance than conventional genomic selection methods, and could be new options for genomic prediction. Among ML methods, Adaboost.R2_KRR consistently performed well in our study, and tuning hyperparameters is necessary for ML methods. The optimal hyperparameters depend on the character of traits, datasets etc.
Supplementary Information: The online version contains supplementary material available at 10.1186/s40104-022-00708-0.
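The validation design described above, 20 replicates of fivefold cross-validation, can be sketched as an index generator. This is an illustrative outline in plain Python, not the authors' code. The function names and the seed are hypothetical, and the abstract does not state whether its percentage improvements are relative or absolute, so the relative form shown here is an assumption.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_replicates=20, seed=0):
    """Yield (replicate, fold, train_indices, test_indices) for repeated k-fold CV.

    Each replicate reshuffles the sample indices before splitting them into folds.
    """
    rng = random.Random(seed)
    for rep in range(n_replicates):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield rep, k, train_idx, test_idx


def relative_improvement(new_accuracy, baseline_accuracy):
    """Percentage gain of one method's accuracy over a baseline (assumed relative form)."""
    return 100.0 * (new_accuracy - baseline_accuracy) / baseline_accuracy
```

With `n_samples=2566`, the generator would produce 100 train/test splits. A model would be fitted on each training set, its accuracy recorded on the matching test set, and the accuracies averaged over all splits before comparing methods.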
Affiliation(s)
- Xue Wang
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Shaolei Shi
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Guijiang Wang
- Hebei Province Animal Husbandry and Improved Breeds Work Station, Shijiazhuang, Hebei, China
- Wenxue Luo
- Hebei Province Animal Husbandry and Improved Breeds Work Station, Shijiazhuang, Hebei, China
- Xia Wei
- Zhangjiakou Dahao Heshan New Agricultural Development Co., Ltd, Zhangjiakou, Hebei, China
- Ao Qiu
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Fei Luo
- Hebei Province Animal Husbandry and Improved Breeds Work Station, Shijiazhuang, Hebei, China
- Xiangdong Ding
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
|
20
|
Genome-Enabled Prediction Methods Based on Machine Learning. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2467:189-218. [PMID: 35451777 DOI: 10.1007/978-1-0716-2205-6_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
Abstract
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers get knowledge from sets of experiences and provide forecasts or classification. In genome-wide based prediction (GWP), many ML studies have been carried out. This chapter provides a description of main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, to evaluate algorithms with the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and data distribution must be considered in order to choose the most appropriate model for a given problem.
|
21
|
Hamla S, Sacré PY, Derenne A, Derfoufi KM, Cowper B, Butré CI, Delobel A, Goormaghtigh E, Hubert P, Ziemons E. A new alternative tool to analyse glycosylation in pharmaceutical proteins based on infrared spectroscopy combined with nonlinear support vector regression. Analyst 2022; 147:1086-1098. [PMID: 35174378 DOI: 10.1039/d1an00697e] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
Almost 60% of commercialized pharmaceutical proteins are glycosylated. Glycosylation is considered a critical quality attribute, as it affects the stability, bioactivity and safety of proteins. Hence, the development of analytical methods to characterise the composition and structure of glycoproteins is crucial. Currently, existing methods are time-consuming, expensive, and require significant sample preparation steps, which can alter the robustness of the analyses. In this work, we suggest the use of a fast, direct, and simple Fourier transform infrared spectroscopy (FT-IR) combined with a chemometric strategy to address this challenge. In this context, a database of FT-IR spectra of glycoproteins was built, and the glycoproteins were characterised by reference methods (MALDI-TOF, LC-ESI-QTOF and LC-FLR-MS) to estimate the mass ratio between carbohydrates and proteins and determine the composition in monosaccharides. The FT-IR spectra were processed first by Partial Least Squares Regression (PLSR), one of the most used regression algorithms in spectroscopy and secondly by Support Vector Regression (SVR). SVR has emerged in recent years and is now considered a powerful alternative to PLSR, thanks to its ability to flexibly model nonlinear relationships. The results provide clear evidence of the efficiency of the combination of FT-IR spectroscopy, and SVR modelling to characterise glycosylation in therapeutic proteins. The SVR models showed better predictive performances than the PLSR models in terms of RMSECV, RMSEP, R2CV, R2Pred and RPD. This tool offers several potential applications, such as comparing the glycosylation of a biosimilar and the original molecule, monitoring batch-to-batch homogeneity, and in-process control.
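The figures of merit listed at the end of the abstract above (RMSECV, RMSEP, R2 and RPD) share the same arithmetic and differ mainly in which samples they are computed on. The sketch below is a hedged illustration, not the authors' chemometric workflow. The function names are hypothetical, and RPD is taken as the sample standard deviation of the reference values divided by the RMSE, which is a common convention but is not spelled out in the abstract.

```python
from statistics import mean, stdev


def rmse(pred, ref):
    """Root-mean-squared error between predictions and reference values.

    Called RMSECV on cross-validation predictions and RMSEP on a test set.
    """
    return mean([(p - r) ** 2 for p, r in zip(pred, ref)]) ** 0.5


def r_squared(pred, ref):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ref_mean = mean(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - ref_mean) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot


def rpd(pred, ref):
    """Ratio of performance to deviation: SD(reference) / RMSE (assumed convention)."""
    return stdev(ref) / rmse(pred, ref)
```

Lower RMSE values and higher R2 and RPD values indicate better predictive performance, which is the sense in which the SVR models are reported to outperform the PLSR models.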
Affiliation(s)
- Sabrina Hamla
- University of Liege (ULiege), CIRM, Vibra-Sante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- Pierre-Yves Sacré
- University of Liege (ULiege), CIRM, Vibra-Sante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- Allison Derenne
- Center for Structural Biology and Bioinformatics, Laboratory for the Structure and Function of Biological Membranes, ULB, Campus Plaine CP206/02, 1050 Brussels, Belgium
- Kheiro-Mouna Derfoufi
- Center for Structural Biology and Bioinformatics, Laboratory for the Structure and Function of Biological Membranes, ULB, Campus Plaine CP206/02, 1050 Brussels, Belgium
- Ben Cowper
- National Institute for Biological Standards and Control, Blanche Lane, South Mimms, Potters Bar, Hertfordshire, EN6 3QG, UK
- Claire I Butré
- Quality Assistance, Techno Parc de Thudinie 2, 6536 Thuin, Belgium
- Arnaud Delobel
- Quality Assistance, Techno Parc de Thudinie 2, 6536 Thuin, Belgium
- Erik Goormaghtigh
- Center for Structural Biology and Bioinformatics, Laboratory for the Structure and Function of Biological Membranes, ULB, Campus Plaine CP206/02, 1050 Brussels, Belgium
- Philippe Hubert
- University of Liege (ULiege), CIRM, Vibra-Sante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
- Eric Ziemons
- University of Liege (ULiege), CIRM, Vibra-Sante Hub, Department of Pharmacy, Laboratory of Pharmaceutical Analytical Chemistry, Liege, Belgium
|
22
|
Budhlakoti N, Kushwaha AK, Rai A, Chaturvedi KK, Kumar A, Pradhan AK, Kumar U, Kumar RR, Juliana P, Mishra DC, Kumar S. Genomic Selection: A Tool for Accelerating the Efficiency of Molecular Breeding for Development of Climate-Resilient Crops. Front Genet 2022; 13:832153. [PMID: 35222548 PMCID: PMC8864149 DOI: 10.3389/fgene.2022.832153] [Citation(s) in RCA: 39] [Impact Index Per Article: 19.5] [Received: 12/09/2021] [Accepted: 01/10/2022] [Indexed: 12/17/2022] Open
Abstract
Since the inception of the theory and conceptual framework of genomic selection (GS), extensive research has been done on evaluating its efficiency for utilization in crop improvement. Although marker-assisted selection has proven its potential for improving qualitative traits controlled by one to a few genes with large effects, its role in improving quantitative traits controlled by several genes with small effects is limited. In this regard, GS, which uses genomic-estimated breeding values of individuals obtained from genome-wide markers to choose candidates for the next breeding cycle, is a powerful approach to improve quantitative traits. In the last two decades, GS has been widely adopted in animal breeding programs globally because of its potential to improve selection accuracy, minimize phenotyping, reduce cycle time, and increase genetic gains. In addition, given the promising initial evaluation outcomes of GS for the improvement of yield, biotic and abiotic stress tolerance, and quality in cereal crops like wheat, maize, and rice, prospects of integrating it in breeding crops are also being explored. Improved statistical models that leverage the genomic information to increase the prediction accuracies are critical for the effectiveness of GS-enabled breeding programs. Studies of genetic architecture under drought and heat stress help in developing production markers that can significantly accelerate the development of stress-resilient crop varieties through GS. This review focuses on the transition from traditional selection methods to GS, the underlying statistical methods and tools used for this purpose, the current status of GS studies in crop plants, and perspectives for its successful implementation in the development of climate-resilient crops.
Affiliation(s)
- Neeraj Budhlakoti
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Anil Rai
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- K K Chaturvedi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Anuj Kumar
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Uttam Kumar
- Borlaug Institute for South Asia (BISA), Ludhiana, India
- D C Mishra
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Sundeep Kumar
- ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
|
23
|
Gardiner LJ, Krishna R. Bluster or Lustre: Can AI Improve Crops and Plant Health? PLANTS (BASEL, SWITZERLAND) 2021; 10:2707. [PMID: 34961177 PMCID: PMC8707749 DOI: 10.3390/plants10122707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 11/24/2021] [Accepted: 12/06/2021] [Indexed: 06/14/2023]
Abstract
In a changing climate where future food security is a growing concern, researchers are exploring new methods and technologies in the effort to meet ambitious crop yield targets. The application of Artificial Intelligence (AI), including Machine Learning (ML) methods, has been proposed as a potential mechanism to support this. This review explores current research in the area to convey the state of the art in how AI/ML have been used to advance research, gain insights, and generally enable progress. We address the question: Can AI improve crops and plant health? We further discriminate the bluster from the lustre by identifying the key challenges that AI has been shown to address, balanced with the potential issues with its usage and the key requisites for its success. Overall, we hope to raise awareness and, as a result, promote the use of AI-related approaches where they can have appropriate impact to improve practices in agricultural and plant sciences.
24
Predicting Heritability of Oil Palm Breeding Using Phenotypic Traits and Machine Learning. SUSTAINABILITY 2021. [DOI: 10.3390/su132212613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023]
Abstract
Oil palm is one of the main crops grown to help achieve sustainability in Malaysia. Selecting the best breeds produces quality crops and increases crop yields. This study aimed to examine machine learning (ML) in oil palm breeding (OPB) using factors other than genetic data. A new conceptual framework for adopting ML in OPB is presented at the end of this paper. First, data types, phenotype traits, current ML models, and evaluation techniques are identified through a literature survey. This study found that phenotype and genotype data are widely used in oil palm breeding programs. The average bunch weight, bunch number, and fresh fruit bunch are the most important characteristics that can influence the genetic improvement of progenies. Although machine learning approaches have been applied to increase the productivity of the crop, most studies focus on molecular markers or genotypes for plant breeding rather than on phenotype. Theoretically, the use of phenotypic data related to offspring should predict high breeding values by using ML. Therefore, a new ML conceptual framework to study the phenotype and progeny data of oil palm breeds is discussed in relation to achieving the Sustainable Development Goals (SDGs).
25
Improving Biomass and Grain Yield Prediction of Wheat Genotypes on Sodic Soil Using Integrated High-Resolution Multispectral, Hyperspectral, 3D Point Cloud, and Machine Learning Techniques. REMOTE SENSING 2021. [DOI: 10.3390/rs13173482] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/19/2022]
Abstract
Sodic soils adversely affect crop production over extensive areas of rain-fed cropping worldwide, with particularly large areas in Australia. Crop phenotyping may assist in identifying cultivars tolerant to soil sodicity. However, studies to identify the most appropriate traits and reliable tools to assist crop phenotyping on sodic soil are limited. Hence, this study evaluated the ability of multispectral, hyperspectral, 3D point cloud, and machine learning techniques to improve estimation of biomass and grain yield of wheat genotypes grown on moderately sodic (MS) and highly sodic (HS) soil sites in northeastern Australia. While a number of studies have reported using different remote sensing approaches and crop traits to quantify crop growth, stress, and yield variation, few have combined these techniques with machine learning to improve estimation of genotypic biomass and yield, especially in constrained sodic soil environments. At close to flowering, unmanned aerial vehicle (UAV) and ground-based proximal sensing were used to obtain remote and/or proximal sensing data, while biomass yield and crop heights were also manually measured in the field. Grain yield was machine-harvested at maturity. UAV remote and/or proximal sensing-derived spectral vegetation indices (VIs), such as normalized difference vegetation index, optimized soil adjusted vegetation index, and enhanced vegetation index, and crop height corresponded closely to wheat genotypic biomass and grain yields. UAV multispectral VIs were more closely associated with biomass and grain yields than proximal sensing data. The red-green-blue (RGB) 3D point cloud technique was effective in determining crop height, which was slightly better correlated with genotypic biomass and grain yield than ground-measured crop height data.
These remote sensing-derived crop traits (VIs and crop height) and wheat biomass and grain yields were further simulated using machine learning algorithms (multitarget linear regression, support vector machine regression, Gaussian process regression, and artificial neural network) with different kernels to improve estimation of biomass and grain yield. The artificial neural network predicted biomass yield (R² = 0.89; RMSE = 34.8 g/m² for the MS and R² = 0.82; RMSE = 26.4 g/m² for the HS site) and grain yield (R² = 0.88; RMSE = 11.8 g/m² for the MS and R² = 0.74; RMSE = 16.1 g/m² for the HS site) with slightly less error than the others. Wheat genotypes Mitch, Corack, Mace, Trojan, Lancer, and Bremer were identified as more tolerant to sodic soil constraints than Emu Rock, Janz, Flanker, and Gladius. The study improves our ability to select appropriate traits and techniques for accurate estimation of wheat genotypic biomass and grain yields on sodic soils. This will also assist farmers in identifying cultivars tolerant to sodic soil constraints.
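The abstract above reports model fit as R² and RMSE without stating how they were computed. As a minimal sketch, assuming R² means the coefficient of determination (1 minus the ratio of residual to total sum of squares) rather than a squared correlation, the two metrics could be computed as follows. The function names and the yield values are hypothetical and are not taken from the study.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs (e.g. g/m^2)."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (an assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up yields in g/m^2, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(observed, predicted))
print(r_squared(observed, predicted))
```

Reporting both is useful because RMSE is scale-dependent while R² is not, which may be why the HS site can show a lower RMSE for biomass yet also a lower R².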
26
Liu S, Xue P, Lu J, Lu W. Fitting analysis and research of measured data of SAW yarn tension sensor based on PSO-SVR model. ULTRASONICS 2021; 116:106511. [PMID: 34237494 DOI: 10.1016/j.ultras.2021.106511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2020] [Revised: 06/14/2021] [Accepted: 06/28/2021] [Indexed: 06/13/2023]
Abstract
With the rapid growth of the SAW (Surface Acoustic Wave) yarn tension sensor, the requirements for its measurement accuracy keep rising. However, little research has been conducted in this field. This paper therefore studies the problem and provides a solution. It first investigates the principle and training of the PSO-SVR model. On this basis, it studies the association between the output frequency difference data and the corresponding yarn tension exerted on the SAW yarn tension sensor. Then, with the frequency difference data as input and the corresponding tension as output, the PSO-SVR model is trained and used to predict the output tension of the sensor. Finally, the error relative to the actually applied tension was calculated, and likewise for the least-squares approach and the BP neural network. Comparing the overall and local accuracy of the forecasts on the same sample data set confirms that the output error of the PSO-SVR model is much smaller than that of the least-squares approach and the BP neural network. As a result, a new way of analyzing the data of the SAW yarn tension sensor is provided.
Affiliation(s)
- Shoubing Liu
- School of Electrical and Information Engineering, Henan University of Engineering, Zhengzhou, 451191, China
- Peng Xue
- School of Electrical and Information Engineering, Henan University of Engineering, Zhengzhou, 451191, China
- Jinyan Lu
- School of Electrical and Information Engineering, Henan University of Engineering, Zhengzhou, 451191, China
- Wenke Lu
- School of Information Science and Technology, Donghua University, Shanghai, 201620, China
27
Srivastava S, Lopez BI, Kumar H, Jang M, Chai HH, Park W, Park JE, Lim D. Prediction of Hanwoo Cattle Phenotypes from Genotypes Using Machine Learning Methods. Animals (Basel) 2021; 11:ani11072066. [PMID: 34359194 PMCID: PMC8300336 DOI: 10.3390/ani11072066] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/10/2021] [Revised: 07/06/2021] [Accepted: 07/09/2021] [Indexed: 11/16/2022] Open
Abstract
Hanwoo was originally raised for draft purposes, but the increase in local demand for red meat turned that purpose into full-scale meat-type cattle rearing; it is now considered one of the most economically important species and a vital food source for Koreans. The application of genomic selection in Hanwoo breeding programs in recent years was expected to lead to higher genetic progress. However, better statistical methods that can improve the genomic prediction accuracy are required. Hence, this study aimed to compare the predictive performance of three machine learning methods, namely, random forest (RF), extreme gradient boosting method (XGB), and support vector machine (SVM), when predicting the carcass weight (CWT), marbling score (MS), backfat thickness (BFT) and eye muscle area (EMA). Phenotypic and genotypic data (53,866 SNPs) from 7324 commercial Hanwoo cattle that were slaughtered at the age of around 30 months were used. The boosting method XGB showed the highest predictive correlation for CWT and MS, followed by genomic best linear unbiased prediction (GBLUP), SVM, and RF. Meanwhile, the best predictive correlation for BFT and EMA was delivered by GBLUP, followed by SVM, RF, and XGB. Although XGB presented the highest predictive correlations for some traits, we did not find an advantage of XGB or any other machine learning method over GBLUP according to the mean squared error of prediction. Thus, we still recommend the use of GBLUP in the prediction of genomic breeding values for carcass traits in Hanwoo cattle.
28
Liang M, Chang T, An B, Duan X, Du L, Wang X, Miao J, Xu L, Gao X, Zhang L, Li J, Gao H. A Stacking Ensemble Learning Framework for Genomic Prediction. Front Genet 2021; 12:600040. [PMID: 33747037 PMCID: PMC7969712 DOI: 10.3389/fgene.2021.600040] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 08/28/2020] [Accepted: 01/12/2021] [Indexed: 11/22/2022] Open
Abstract
Machine learning (ML) is perhaps the most useful tool for the interpretation of large genomic datasets. However, the performance of a single machine learning method in genomic selection (GS) is currently unsatisfactory. To improve the genomic predictions, we constructed a stacking ensemble learning framework (SELF), integrating three machine learning methods, to predict genomic estimated breeding values (GEBVs). The present study evaluated the prediction ability of SELF by analyzing three real datasets with different genetic architectures and comparing the prediction accuracy of SELF, its base learners, genomic best linear unbiased prediction (GBLUP) and BayesB. For each trait, SELF performed better than the base learners, which included support vector regression (SVR), kernel ridge regression (KRR) and elastic net (ENET). The prediction accuracy of SELF was, on average, 7.70% higher than that of GBLUP in the three datasets. Except for the milk fat percentage (MFP) trait of the German Holstein dairy cattle dataset, SELF was more robust than BayesB in all remaining traits. Therefore, we believe that SELF has the potential to be promoted to estimate GEBVs in other animals and plants.
Affiliation(s)
- Mang Liang
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Tianpeng Chang
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Bingxing An
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xinghai Duan
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lili Du
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xiaoqiao Wang
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Jian Miao
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Junya Li
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
- Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
29
Piles M, Bergsma R, Gianola D, Gilbert H, Tusell L. Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning. Front Genet 2021; 12:611506. [PMID: 33692825 PMCID: PMC7938892 DOI: 10.3389/fgene.2021.611506] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/29/2020] [Accepted: 01/20/2021] [Indexed: 11/25/2022] Open
Abstract
Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. 
With 50–250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
Affiliation(s)
- Miriam Piles
- Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain
- Rob Bergsma
- Topigs Norsvin Research Center, Beuningen, Netherlands
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, United States; Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, United States
- Hélène Gilbert
- GenPhySE, INRAE, Université de Toulouse, Castanet-Tolosan, France
- Llibertat Tusell
- Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain; GenPhySE, INRAE, Université de Toulouse, Castanet-Tolosan, France
30
Tong H, Nikoloski Z. Machine learning approaches for crop improvement: Leveraging phenotypic and genotypic big data. JOURNAL OF PLANT PHYSIOLOGY 2021; 257:153354. [PMID: 33385619 DOI: 10.1016/j.jplph.2020.153354] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/30/2020] [Revised: 12/14/2020] [Accepted: 12/15/2020] [Indexed: 05/07/2023]
Abstract
Highly efficient and accurate selection of elite genotypes can lead to dramatic shortening of the breeding cycle in major crops relevant for sustaining present demands for food, feed, and fuel. In contrast to classical approaches that emphasize the need for resource-intensive phenotyping at all stages of artificial selection, genomic selection dramatically reduces the need for phenotyping. Genomic selection relies on advances in machine learning and the availability of genotyping data to predict agronomically relevant phenotypic traits. Here we provide a systematic review of machine learning approaches applied for genomic selection of single and multiple traits in major crops in the past decade. We emphasize the need to gather data on intermediate phenotypes, e.g. metabolite, protein, and gene expression levels, along with developments of modeling techniques that can lead to further improvements of genomic selection. In addition, we provide a critical view of factors that affect genomic selection, with attention to transferability of models between different environments. Finally, we highlight the future aspects of integrating high-throughput molecular phenotypic data from omics technologies with biological networks for crop improvement.
Affiliation(s)
- Hao Tong
- Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany; Bioinformatics and Mathematical Modeling Department, Centre for Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria; Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
- Zoran Nikoloski
- Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany; Bioinformatics and Mathematical Modeling Department, Centre for Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria; Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
31
Abstract
Technological developments have revolutionized measurements on plant genotypes and phenotypes, leading to routine production of large, complex data sets. This has led to increased efforts to extract meaning from these measurements and to integrate various data sets. Concurrently, machine learning has rapidly evolved and is now widely applied in science in general and in plant genotyping and phenotyping in particular. Here, we review the application of machine learning in the context of plant science and plant breeding. We focus on analyses at different phenotype levels, from biochemical to yield, and in connecting genotypes to these. In this way, we illustrate how machine learning offers a suite of methods that enable researchers to find meaningful patterns in relevant plant data.
Affiliation(s)
- Aalt Dirk Jan van Dijk
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen 6708 PB, the Netherlands
- Biometris, Department of Plant Sciences, Wageningen University and Research, Wageningen 6708 PB, the Netherlands
- Gert Kootstra
- Farm Technology, Department of Plant Sciences, Wageningen University and Research, Wageningen 6708 PB, the Netherlands
- Willem Kruijer
- Biometris, Department of Plant Sciences, Wageningen University and Research, Wageningen 6708 PB, the Netherlands
- Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen 6708 PB, the Netherlands
32
Tusell L, Bergsma R, Gilbert H, Gianola D, Piles M. Machine Learning Prediction of Crossbred Pig Feed Efficiency and Growth Rate From Single Nucleotide Polymorphisms. Front Genet 2020; 11:567818. [PMID: 33391339 PMCID: PMC7775539 DOI: 10.3389/fgene.2020.567818] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/30/2020] [Accepted: 11/17/2020] [Indexed: 11/24/2022] Open
Abstract
This research assessed the ability of a Support Vector Machine (SVM) regression model to predict pig crossbred (CB) performance from various sources of phenotypic and genotypic information for improving crossbreeding performance at reduced genotyping cost. Data consisted of average daily gain (ADG) and residual feed intake (RFI) records and genotypes of 5,708 purebred (PB) boars and 5,007 CB pigs. Prediction models were fitted using individual PB genotypes and phenotypes (trn.1); genotypes of PB sires and average of CB records per PB sire (trn.2); and individual CB genotypes and phenotypes (trn.3). The average of CB offspring records was the trait to be predicted from the PB sire’s genotype using cross-validation. Single nucleotide polymorphisms (SNPs) were ranked based on the Spearman rank correlation with the trait. Subsets with an increasing number (from 50 to 2,000) of the most informative SNPs were used as predictor variables in SVM. Prediction performance was the median of the Spearman correlation (SC, interquartile range in brackets) between observed and predicted phenotypes in the testing set. The best predictive performances were obtained when sire phenotypic information was included in trn.1 (0.22 [0.03] for RFI with SVM and 250 SNPs, and 0.12 [0.05] for ADG with SVM and 500–1,000 SNPs) or when trn.3 was used (0.29 [0.16] with genomic best linear unbiased prediction (GBLUP) for RFI, and 0.15 [0.09] for ADG with just 50 SNPs). Animals from the last two generations were assigned to the testing set and remaining animals to the training set. The PB individual’s own phenotype and genotype improved the prediction ability of CB offspring of young animals for ADG but not for RFI. The highest SC was 0.34 [0.21] and 0.36 [0.22] for RFI and ADG, respectively, with SVM and 50 SNPs. Using CB data for training led to an SC of 0.34 [0.19] with GBLUP and 0.28 [0.18] with SVM and 250 SNPs for RFI and 0.34 [0.15] with SVM and 500 SNPs for ADG.
Results suggest that PB candidates could be evaluated for CB performance with SVM and low-density SNP chip panels after collecting their own RFI or ADG performances or even earlier, after being genotyped using a reference population of CB animals.
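The SNP pre-selection step described in this abstract (ranking SNPs by their Spearman rank correlation with the trait and keeping the most informative ones) can be illustrated with a small sketch. It assumes that "most informative" means the largest absolute correlation and that genotypes are coded 0/1/2; neither detail is stated in the abstract. All function names and the toy data are hypothetical, and in practice the ranking would presumably be computed on the training set only, to avoid leaking information from the testing set.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def select_snps(genotypes, phenotype, k):
    """Indices of the k SNP columns with the largest |Spearman correlation|.

    genotypes: list of rows (one per animal), each a list of SNP codes.
    Ties are broken by column index so the result is deterministic.
    """
    n_snps = len(genotypes[0])
    scored = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        scored.append((abs(spearman(column, phenotype)), j))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


# Toy data: 6 animals, 3 SNPs (made up for illustration).
genotypes = [
    [0, 1, 2],
    [0, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
    [2, 1, 0],
    [2, 1, 2],
]
phenotype = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

print(select_snps(genotypes, phenotype, 1))
```

The selected column indices would then be used to subset the genotype matrix before fitting the learner (SVM in the study).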
Affiliation(s)
- Llibertat Tusell
- GenPhySE, Université de Toulouse, National Research Institute for Agriculture, Food and the Environment (INRAE), Castanet-Tolosan, France
- Rob Bergsma
- Topigs Norsvin Research Center, Beuningen, Netherlands
- Hélène Gilbert
- GenPhySE, Université de Toulouse, National Research Institute for Agriculture, Food and the Environment (INRAE), Castanet-Tolosan, France
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, United States; Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, United States
- Miriam Piles
- Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain
33
Alves AAC, Espigolan R, Bresolin T, Costa RM, Fernandes Júnior GA, Ventura RV, Carvalheiro R, Albuquerque LG. Genome-enabled prediction of reproductive traits in Nellore cattle using parametric models and machine learning methods. Anim Genet 2020; 52:32-46. [PMID: 33191532 DOI: 10.1111/age.13021] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Accepted: 10/13/2020] [Indexed: 12/31/2022]
Abstract
This study aimed to assess the predictive ability of different machine learning (ML) methods for genomic prediction of reproductive traits in Nellore cattle. The studied traits were age at first calving (AFC), scrotal circumference (SC), early pregnancy (EP) and stayability (STAY). The numbers of genotyped animals and SNP markers available were 2342 and 321 419 (AFC), 4671 and 309 486 (SC), 2681 and 319 619 (STAY) and 3356 and 319 108 (EP). The predictive abilities of support vector regression (SVR), Bayesian regularized artificial neural network (BRANN) and random forest (RF) were compared with results obtained using parametric models (genomic best linear unbiased predictor, GBLUP, and Bayesian least absolute shrinkage and selection operator, BLASSO). A 5-fold cross-validation strategy was performed and the average prediction accuracy (ACC) and mean squared errors (MSE) were computed. The ACC was defined as the linear correlation between predicted and observed breeding values for categorical traits (EP and STAY) and as the correlation between predicted and observed adjusted phenotypes divided by the square root of the estimated heritability for continuous traits (AFC and SC). The average ACC varied from low to moderate depending on the trait and model under consideration, ranging between 0.56 and 0.63 (AFC), 0.27 and 0.36 (SC), 0.57 and 0.67 (EP), and 0.52 and 0.62 (STAY). SVR provided slightly better accuracies than the parametric models for all traits, increasing the prediction accuracy for AFC by around 6.3% and 4.8% compared with GBLUP and BLASSO, respectively. Likewise, there was an increase of 8.3% for SC, 4.5% for EP and 4.8% for STAY, comparing SVR with both GBLUP and BLASSO. In contrast, the RF and BRANN did not present competitive predictive ability compared with the parametric models. The results indicate that SVR is a suitable method for genome-enabled prediction of reproductive traits in Nellore cattle.
Further, the optimal kernel bandwidth parameter in the SVR model was trait-dependent; thus, fine-tuning this hyperparameter in the training phase is crucial.
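The accuracy measure that this abstract defines for continuous traits, the correlation between predicted and observed adjusted phenotypes divided by the square root of the estimated heritability, can be written as a short function. This is only a sketch of that definition: the function names, the toy values, and the heritability of 0.64 are made up for illustration, and the study's own implementation is not described in the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson (linear) correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prediction_accuracy(adjusted_phenotypes, predictions, heritability):
    """ACC for a continuous trait: r(adjusted phenotype, prediction) / sqrt(h^2)."""
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    return pearson(adjusted_phenotypes, predictions) / sqrt(heritability)


# Hypothetical values for one validation fold.
adjusted = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 1.0, 4.0, 3.0]

print(prediction_accuracy(adjusted, predicted, heritability=0.64))
```

Dividing by the square root of the heritability rescales the phenotypic correlation toward the scale of a correlation with true breeding values, which is why low-heritability traits can show a modest raw correlation yet a moderate ACC.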
Affiliation(s)
- A A C Alves
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil
- R Espigolan
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil
- T Bresolin
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil
- R M Costa
- Department of Exact Sciences, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil
- G A Fernandes Júnior
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil
- R V Ventura
- Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo (USP), Pirassununga, 13635-900, Brazil
- R Carvalheiro
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil; National Council of Technological and Scientific Development (CNPq), Brasília, 71605-001, Brazil
- L G Albuquerque
- Department of Animal Science, School of Agricultural and Veterinary Sciences, Sao Paulo State University (UNESP), Jaboticabal, 14884-900, Brazil; National Council of Technological and Scientific Development (CNPq), Brasília, 71605-001, Brazil
34
Liang M, Miao J, Wang X, Chang T, An B, Duan X, Xu L, Gao X, Zhang L, Li J, Gao H. Application of ensemble learning to genomic selection in Chinese Simmental beef cattle. J Anim Breed Genet 2020; 138:291-299. [PMID: 33089920 DOI: 10.1111/jbg.12514] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/29/2020] [Revised: 09/03/2020] [Accepted: 10/01/2020] [Indexed: 11/30/2022]
Abstract
Genomic selection (GS), which uses whole-genome molecular markers to predict genomic estimated breeding values (GEBVs), is revolutionizing livestock and plant breeding. Seeking out novel strategies with higher prediction accuracy for GS has been the ultimate goal of breeders. With the rapid development of artificial intelligence, machine learning algorithms have increasingly been applied to estimate GEBVs. Although some machine learning methods have better performance in phenotype prediction, there is still considerable room for improvement. In this study, we applied an ensemble-learning algorithm, Adaboost.RT, which integrated support vector regression (SVR), kernel ridge regression (KRR) and random forest (RF), to predict genomic breeding values of three economic traits (carcass weight, live weight, and eye muscle area) in Chinese Simmental beef cattle. Predictive accuracy was measured as the Pearson correlation between the corrected phenotypes and predicted GEBVs. Moreover, we compared the reliability of the SVR, KRR, RF, Adaboost.RT and GBLUP methods. The results showed that the machine learning methods outperformed GBLUP, and the average improvement of the four machine learning methods over GBLUP was 12.8%, 14.9%, 5.4% and 14.4%, respectively. Among the four machine learning methods, the reliability of Adaboost.RT was comparable to that of KRR, with higher stability. We therefore believe that the Adaboost.RT algorithm is a reliable and efficient method for GS.
Affiliation(s)
- Mang Liang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Jian Miao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Xiaoqiao Wang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Tianpeng Chang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Bingxing An
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Xinghai Duan
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Junya Li
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
35
Azodi CB, Bolger E, McCarren A, Roantree M, de Los Campos G, Shiu SH. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3 (BETHESDA, MD.) 2019; 9:3691-3702. [PMID: 31533955 PMCID: PMC6829122 DOI: 10.1534/g3.119.400498] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 07/03/2019] [Accepted: 09/09/2019] [Indexed: 12/21/2022]
Abstract
The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
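The abstract describes ensemble predictions only as "a combination of results from multiple algorithms" and does not say how the results were combined. One simple possibility, shown here purely as an assumption, is an unweighted mean of the per-line predictions from each model. The function name, the model labels, and the numbers below are hypothetical.

```python
def ensemble_mean(predictions_by_model):
    """Unweighted average of per-line predictions across models.

    predictions_by_model: dict mapping a model label to a list of predicted
    trait values, one per line, in the same order for every model.
    """
    model_predictions = list(predictions_by_model.values())
    if not model_predictions:
        raise ValueError("at least one model is required")
    n_lines = len(model_predictions[0])
    if any(len(p) != n_lines for p in model_predictions):
        raise ValueError("all models must predict the same number of lines")
    n_models = len(model_predictions)
    return [
        sum(p[i] for p in model_predictions) / n_models
        for i in range(n_lines)
    ]


# Made-up predictions for two lines from three hypothetical models.
predictions = {
    "linear_model": [1.0, 2.0],
    "random_forest": [3.0, 4.0],
    "neural_net": [2.0, 6.0],
}

print(ensemble_mean(predictions))
```

Averaging tends to damp the trait-to-trait variability of individual models, which is consistent with the abstract's observation that ensemble predictions performed consistently well; weighted or stacked combinations are alternatives this sketch does not cover.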
Affiliation(s)
- Emily Bolger
- Department of Mathematics, Moravian College, Bethlehem, PA
- Andrew McCarren
- Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, Ireland
- Mark Roantree
- Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, Ireland
- Gustavo de Los Campos
- Department of Epidemiology & Biostatistics
- Department of Statistics & Probability
- Institute for Quantitative Health Science and Engineering, and
- Shin-Han Shiu
- Department of Plant Biology
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, 48824
36
Olatoye MO, Hu Z, Aikpokpodion PO. Epistasis Detection and Modeling for Genomic Selection in Cowpea (Vigna unguiculata L. Walp.). Front Genet 2019; 10:677. [PMID: 31417604 PMCID: PMC6682672 DOI: 10.3389/fgene.2019.00677] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/22/2019] [Accepted: 06/27/2019] [Indexed: 12/24/2022]
Abstract
Genetic architecture reflects the pattern of effects and interaction of genes underlying phenotypic variation. Most mapping and breeding approaches generally consider the additive part of variation but offer limited knowledge on the benefits of epistasis which explains in part the variation observed in traits. In this study, the cowpea multiparent advanced generation inter-cross (MAGIC) population was used to characterize the epistatic genetic architecture of flowering time, maturity, and seed size. In addition, consideration for epistatic genetic architecture in genomic-enabled breeding (GEB) was investigated using parametric, semi-parametric, and non-parametric genomic selection (GS) models. Our results showed that large and moderate effect-sized two-way epistatic interactions underlie the traits examined. Flowering time QTL colocalized with cowpea putative orthologs of Arabidopsis thaliana and Glycine max genes like PHYTOCLOCK1 (PCL1 [Vigun11g157600]) and PHYTOCHROME A (PHY A [Vigun01g205500]). Flowering time adaptation to long and short photoperiod was found to be controlled by distinct and common main and epistatic loci. Parametric and semi-parametric GS models outperformed non-parametric GS model, while using known quantitative trait nucleotide(s) (QTNs) as fixed effects improved prediction accuracy when traits were controlled by large effect loci. In general, our study demonstrated that prior understanding of the genetic architecture of a trait can help make informed decisions in GEB.
Affiliation(s)
- Marcus O. Olatoye
- Department of Crop Sciences, University of Illinois, Urbana-Champaign, IL, United States
- Zhenbin Hu
- Department of Agronomy, Kansas State University, Manhattan, KS, United States
37
Cherlin S, Plant D, Taylor JC, Colombo M, Spiliopoulou A, Tzanis E, Morgan AW, Barnes MR, McKeigue P, Barrett JH, Pitzalis C, Barton A, Consortium MATURA, Cordell HJ. Prediction of treatment response in rheumatoid arthritis patients using genome-wide SNP data. Genet Epidemiol 2018; 42:754-771. [PMID: 30311271 PMCID: PMC6334178 DOI: 10.1002/gepi.22159] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/27/2018] [Revised: 07/06/2018] [Accepted: 07/28/2018] [Indexed: 01/13/2023]
Abstract
Although a number of treatments are available for rheumatoid arthritis (RA), each of them shows a significant nonresponse rate in patients. Therefore, predicting a priori the likelihood of treatment response would be of great patient benefit. Here, we conducted a comparison of a variety of statistical methods for predicting three measures of treatment response, between baseline and 3 or 6 months, using genome-wide SNP data from RA patients available from the MAximising Therapeutic Utility in Rheumatoid Arthritis (MATURA) consortium. Two different treatments and 11 different statistical methods were evaluated. We used 10-fold cross validation to assess predictive performance, with nested 10-fold cross validation used to tune the model hyperparameters when required. Overall, we found that SNPs added very little prediction information to that obtained using clinical characteristics only, such as baseline trait value. This observation can be explained by the lack of strong genetic effects and the relatively small sample sizes available; in analysis of simulated and real data, with larger effects and/or larger sample sizes, prediction performance was much improved. Overall, methods that were consistent with the genetic architecture of the trait were able to achieve better predictive ability than methods that were not. For treatment response in RA, methods that assumed a complex underlying genetic architecture achieved slightly better prediction performance than methods that assumed a simplified genetic architecture.
Affiliation(s)
- Svetlana Cherlin
- Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
- Darren Plant
- NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- John C. Taylor
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- Marco Colombo
- Centre for Population Health Sciences, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
- Athina Spiliopoulou
- Centre for Population Health Sciences, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
- Evan Tzanis
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London and Barts Health NHS Trust, London, UK
- Ann W. Morgan
- NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds, Leeds, UK
- Michael R. Barnes
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London and Barts Health NHS Trust, London, UK
- Paul McKeigue
- Centre for Population Health Sciences, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
- Jennifer H. Barrett
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- Costantino Pitzalis
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London and Barts Health NHS Trust, London, UK
- Anne Barton
- NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- Arthritis Research UK Centre for Genetics and Genomics, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK
- MATURA Consortium
- Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
- NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- Centre for Population Health Sciences, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London and Barts Health NHS Trust, London, UK
- Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds, Leeds, UK
- Arthritis Research UK Centre for Genetics and Genomics, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK
- Heather J. Cordell
- NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
38
Gianola D, Cecchinato A, Naya H, Schön CC. Prediction of Complex Traits: Robust Alternatives to Best Linear Unbiased Prediction. Front Genet 2018; 9:195. [PMID: 29951082 PMCID: PMC6008589 DOI: 10.3389/fgene.2018.00195] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/03/2018] [Accepted: 05/14/2018] [Indexed: 12/05/2022]
Abstract
A widely used method for prediction of complex traits in animal and plant breeding is “genomic best linear unbiased prediction” (GBLUP). In a quantitative genetics setting, BLUP is a linear regression of phenotypes on a pedigree or on a genomic relationship matrix, depending on the type of input information available. Normality of the distributions of random effects and of model residuals is not required for BLUP but a Gaussian assumption is made implicitly. A potential downside is that Gaussian linear regressions are sensitive to outliers, genetic or environmental in origin. We present robust alternatives to BLUP that are simple to implement (relative to a fully Bayesian analysis), using a linear model with residual t or Laplace distributions instead of a Gaussian one, and evaluate the methods with milk yield records on Italian Brown Swiss cattle, grain yield data in inbred wheat lines, and three traits measured on accessions of Arabidopsis thaliana. The methods do not use Markov chain Monte Carlo sampling, and model hyper-parameters, viewed here as regularization “knobs,” are tuned via cross-validation. Uncertainty of predictions is evaluated by employing bootstrapping or by random reconstruction of training and testing sets. It was found (e.g., test-day milk yield in cows, flowering time and FRIGIDA expression in Arabidopsis) that the best predictions were often those obtained with the robust methods. The results obtained are encouraging and stimulate further investigation and generalization.
Affiliation(s)
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, United States
- Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, United States
- Department of Plant Sciences, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
- Department of Agronomy, Food Natural Resources, Animals and Environment, Università degli Studi di Padova, Padova, Italy
- Institut Pasteur de Montevideo, Montevideo, Uruguay
- Alessio Cecchinato
- Department of Agronomy, Food Natural Resources, Animals and Environment, Università degli Studi di Padova, Padova, Italy
- Hugo Naya
- Institut Pasteur de Montevideo, Montevideo, Uruguay
- Chris-Carolin Schön
- Department of Plant Sciences, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
39
Watanabe T, Otowa T, Abe O, Kuwabara H, Aoki Y, Natsubori T, Takao H, Kakiuchi C, Kondo K, Ikeda M, Iwata N, Kasai K, Sasaki T, Yamasue H. Oxytocin receptor gene variations predict neural and behavioral response to oxytocin in autism. Soc Cogn Affect Neurosci 2017; 12:496-506. [PMID: 27798253 PMCID: PMC5390696 DOI: 10.1093/scan/nsw150] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 04/28/2016] [Accepted: 10/04/2016] [Indexed: 12/27/2022]
Abstract
Oxytocin appears beneficial for autism spectrum disorder (ASD), and more than 20 single-nucleotide polymorphisms (SNPs) in oxytocin receptor (OXTR) are relevant to ASD. However, neither the biological functions of OXTR SNPs in ASD nor the critical OXTR SNPs that determine oxytocin’s effects on ASD are known. Here, using a machine-learning algorithm that was designed to evaluate collective effects of multiple SNPs and automatically identify the most informative SNPs, we examined relationships between 27 representative OXTR SNPs and six types of behavioral/neural response to oxytocin in ASD individuals. The oxytocin effects were extracted from our previous placebo-controlled within-participant clinical trial administering single-dose intranasal oxytocin to 38 high-functioning adult Japanese ASD males. Consequently, we identified six different SNP sets that could accurately predict the six different oxytocin efficacies, and confirmed the robustness of these SNP selections against variations of the datasets and analysis parameters. Moreover, major alleles of several prominent OXTR SNPs—including rs53576 and rs2254298—were found to have dissociable effects on the oxytocin efficacies. These findings suggest biological functions of the OXTR SNP variants on autistic oxytocin responses, and imply that clinical oxytocin efficacy may be genetically predicted before its actual administration, which would contribute to establishment of future precision medicines for ASD.
Affiliation(s)
- Takamitsu Watanabe
- Institute of Cognitive Neuroscience, University College London, 17 Queen Square, London WC1N 3AZ, UK
- Takeshi Otowa
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Osamu Abe
- Department of Radiology, Nihon University School of Medicine, 30-1 Oyaguchikami-cho, Itabashi-ku, Tokyo 173-8610, Japan
- Yuta Aoki
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Tatsunobu Natsubori
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Hidemasa Takao
- Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Chihiro Kakiuchi
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Kenji Kondo
- Department of Psychiatry, Fujita Health University School of Medicine, Aichi 470-1192, Japan
- Masashi Ikeda
- Department of Psychiatry, Fujita Health University School of Medicine, Aichi 470-1192, Japan
- Nakao Iwata
- Department of Psychiatry, Fujita Health University School of Medicine, Aichi 470-1192, Japan
- Kiyoto Kasai
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Tsukasa Sasaki
- Department of Physical and Health Education, Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Hidenori Yamasue
- Department of Neuropsychiatry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Department of Psychiatry, Hamamatsu University School of Medicine, 1-20-1 Handayama, Higashiku, Hamamatsu City 431-3192, Japan
40
Genome-Wide Association Studies with a Genomic Relationship Matrix: A Case Study with Wheat and Arabidopsis. G3-GENES GENOMES GENETICS 2016; 6:3241-3256. [PMID: 27520956 PMCID: PMC5068945 DOI: 10.1534/g3.116.034256] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 12/02/2022]
Abstract
Standard genome-wide association studies (GWAS) scan for relationships between each of p molecular markers and a continuously distributed target trait. Typically, a marker-based matrix of genomic similarities among individuals (G) is constructed, to account more properly for the covariance structure in the linear regression model used. We show that the generalized least-squares estimator of the regression of phenotype on one or on m markers is invariant with respect to whether or not the marker(s) tested is(are) used for building G, provided variance components are unaffected by exclusion of such marker(s) from G. The result is arrived at by using a matrix expression such that one can find many inverses of genomic relationship, or of phenotypic covariance matrices, stemming from removing markers tested as fixed, but carrying out a single inversion. When eigenvectors of the genomic relationship matrix are used as regressors with fixed regression coefficients, e.g., to account for population stratification, their removal from G does matter. Removal of eigenvectors from G can have a noticeable effect on estimates of genomic and residual variances, so caution is needed. Concepts were illustrated using genomic data on 599 wheat inbred lines, with grain yield as target trait, and on close to 200 Arabidopsis thaliana accessions.
41
Gong P, Nan X, Barker ND, Boyd RE, Chen Y, Wilkins DE, Johnson DR, Suedel BC, Perkins EJ. Predicting chemical bioavailability using microarray gene expression data and regression modeling: A tale of three explosive compounds. BMC Genomics 2016; 17:205. [PMID: 26956490 PMCID: PMC4784335 DOI: 10.1186/s12864-016-2541-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/23/2015] [Accepted: 02/25/2016] [Indexed: 11/10/2022]
Abstract
Background: Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, not a single approach is free from limitations. Previously, we developed a new genomics-based approach that integrated microarray technology and regression modeling for predicting bioavailability (tissue residue) of explosive compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation. Results: This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R² = 0.71–0.82 was achievable at a relatively low model complexity with as few as 3–10 predictor genes per model. These results are much more encouraging than our previous ones. Conclusion: This study has demonstrated that our approach is promising for bioavailability measurement, which warrants further studies of mixed contamination scenarios in field settings.
Affiliation(s)
- Ping Gong
- Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Xiaofei Nan
- Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA.
- Present Address: School of Information Engineering, Zhengzhou University, Zhengzhou, Henan, 450001, China.
- Robert E Boyd
- Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Yixin Chen
- Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA.
- Dawn E Wilkins
- Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA.
- Burton C Suedel
- Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Edward J Perkins
- Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
42
Zhao W, Tao T, Zio E. System reliability prediction by support vector regression with analytic selection and genetic algorithm parameters selection. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.02.026] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 10/23/2022]
43
Li L, Long Y, Zhang L, Dalton-Morgan J, Batley J, Yu L, Meng J, Li M. Genome wide analysis of flowering time trait in multiple environments via high-throughput genotyping technique in Brassica napus L. PLoS One 2015; 10:e0119425. [PMID: 25790019 PMCID: PMC4366152 DOI: 10.1371/journal.pone.0119425] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 08/14/2014] [Accepted: 01/13/2015] [Indexed: 11/19/2022]
Abstract
The prediction of the flowering time (FT) trait in Brassica napus based on genome-wide markers and the detection of underlying genetic factors is important not only for oilseed producers around the world but also for the other crop industry in the rotation system in China. In previous studies the low density and mixture of biomarkers used obstructed genomic selection in B. napus and comprehensive mapping of FT related loci. In this study, a high-density genome-wide SNP set was genotyped from a double-haploid population of B. napus. We first performed genomic prediction of FT traits in B. napus using SNPs across the genome under ten environments of three geographic regions via eight existing genomic predictive models. The results showed that all the models achieved comparably high accuracies, verifying the feasibility of genomic prediction in B. napus. Next, we performed a large-scale mapping of FT related loci among three regions, and found 437 associated SNPs, some of which represented known FT genes, such as AP1 and PHYE. The genes tagged by the associated SNPs were enriched in biological processes involved in the formation of flowers. Epistasis analysis showed that significant interactions were found between detected loci, even among some known FT related genes. All the results showed that our large scale and high-density genotype data are of great practical and scientific values for B. napus. To our best knowledge, this is the first evaluation of genomic selection models in B. napus based on a high-density SNP dataset and large-scale mapping of FT loci.
Affiliation(s)
- Lun Li
- College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, China
- Yan Long
- National Key Lab of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, China
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
- Libin Zhang
- College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, China
- Jessica Dalton-Morgan
- School of Agriculture & Food Sciences, The University of Queensland, Brisbane, Australia
- Jacqueline Batley
- School of Agriculture & Food Sciences, The University of Queensland, Brisbane, Australia
- Longjiang Yu
- College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Jinling Meng
- National Key Lab of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, China
- Maoteng Li
- College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
44
Onogi A, Ideta O, Inoshita Y, Ebana K, Yoshioka T, Yamasaki M, Iwata H. Exploring the areas of applicability of whole-genome prediction methods for Asian rice (Oryza sativa L.). TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2015; 128:41-53. [PMID: 25341369 DOI: 10.1007/s00122-014-2411-y] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 07/09/2014] [Accepted: 10/03/2014] [Indexed: 05/25/2023]
Abstract
Our simulation results clarify the areas of applicability of nine prediction methods and suggest the factors that affect their accuracy at predicting empirical traits. Whole-genome prediction is used to predict genetic value from genome-wide markers. The choice of method is important for successful prediction. We compared nine methods using empirical data for eight phenological and morphological traits of Asian rice cultivars (Oryza sativa L.) and data simulated from real marker genotype data. The methods were genomic BLUP (GBLUP), reproducing kernel Hilbert spaces regression (RKHS), Lasso, elastic net, random forest (RForest), Bayesian lasso (Blasso), extended Bayesian lasso (EBlasso), weighted Bayesian shrinkage regression (wBSR), and the average of all methods (Ave). The objectives were to evaluate the predictive ability of these methods in a cultivar population, to characterize them by exploring the area of applicability of each method using simulation, and to investigate the causes of their different accuracies for empirical traits. GBLUP was the most accurate for one trait, RKHS and Ave for two, and RForest for three traits. In the simulation, Blasso, EBlasso, and Ave showed stable performance across the simulated scenarios, whereas the other methods, except wBSR, had specific areas of applicability; wBSR performed poorly in most scenarios. For each method, the accuracy ranking for the empirical traits was largely consistent with that in one of the simulated scenarios, suggesting that the simulation conditions reflected the factors that affected the method accuracy for the empirical results. This study will be useful for genomic prediction not only in Asian rice, but also in populations from other crops with relatively small training sets and strong linkage disequilibrium structures.
Affiliation(s)
- Akio Onogi
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ku, Tokyo, 113-8657, Japan
45
Morota G, Gianola D. Kernel-based whole-genome prediction of complex traits: a review. Front Genet 2014; 5:363. [PMID: 25360145 PMCID: PMC4199321 DOI: 10.3389/fgene.2014.00363] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Received: 08/09/2014] [Accepted: 09/29/2014] [Indexed: 01/18/2023]
Abstract
Prediction of genetic values has been a focus of applied quantitative genetics since the beginning of the 20th century, with renewed interest following the advent of the era of whole genome-enabled prediction. Opportunities offered by the emergence of high-dimensional genomic data fueled by post-Sanger sequencing technologies, especially molecular markers, have driven researchers to extend Ronald Fisher and Sewall Wright's models to confront new challenges. In particular, kernel methods are gaining consideration as a regression method of choice for genome-enabled prediction. Complex traits are presumably influenced by many genomic regions working in concert with others (clearly so when considering pathways), thus generating interactions. Motivated by this view, a growing number of statistical approaches based on kernels attempt to capture non-additive effects, either parametrically or non-parametrically. This review centers on whole-genome regression using kernel methods applied to a wide range of quantitative traits of agricultural importance in animals and plants. We discuss various kernel-based approaches tailored to capturing total genetic variation, with the aim of arriving at an enhanced predictive performance in the light of available genome annotation information. Connections between prediction machines born in animal breeding, statistics, and machine learning are revisited, and their empirical prediction performance is discussed. Overall, while some encouraging results have been obtained with non-parametric kernels, recovering non-additive genetic variation in a validation dataset remains a challenge in quantitative genetics.
Affiliation(s)
- Gota Morota
- Department of Animal Science, University of Nebraska-Lincoln, Lincoln, NE, USA
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, USA
46
González-Recio O, Rosa GJ, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci 2014. [DOI: 10.1016/j.livsci.2014.05.036] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Indexed: 12/01/2022]
47
Howard R, Carriquiry AL, Beavis WD. Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures. G3 (BETHESDA, MD.) 2014; 4:1027-46. [PMID: 24727289 PMCID: PMC4065247 DOI: 10.1534/g3.114.010298] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 01/08/2014] [Accepted: 03/18/2014] [Indexed: 01/12/2023]
Abstract
Parametric and nonparametric methods have been developed for purposes of predicting phenotypes. These methods are based on retrospective analyses of empirical data consisting of genotypic and phenotypic scores. Recent reports have indicated that parametric methods are unable to predict phenotypes of traits with known epistatic genetic architectures. Herein, we review parametric methods including least squares regression, ridge regression, Bayesian ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian LASSO, best linear unbiased prediction (BLUP), Bayes A, Bayes B, Bayes C, and Bayes Cπ. We also review nonparametric methods including Nadaraya-Watson estimator, reproducing kernel Hilbert space, support vector machine regression, and neural networks. We assess the relative merits of these 14 methods in terms of accuracy and mean squared error (MSE) using simulated genetic architectures consisting of completely additive or two-way epistatic interactions in an F2 population derived from crosses of inbred lines. Each simulated genetic architecture explained either 30% or 70% of the phenotypic variability. The greatest impact on estimates of accuracy and MSE was due to genetic architecture. Parametric methods were unable to predict phenotypic values when the underlying genetic architecture was based entirely on epistasis. Parametric methods were slightly better than nonparametric methods for additive genetic architectures. Distinctions among parametric methods for additive genetic architectures were incremental. Heritability, i.e., proportion of phenotypic variability, had the second greatest impact on estimates of accuracy and MSE.
Affiliation(s)
- Réka Howard
- Department of Statistics, Iowa State University, Ames, Iowa 50011; Department of Agronomy, Iowa State University, Ames, Iowa 50011
|
48
|
Morota G, Boddhireddy P, Vukasinovic N, Gianola D, Denise S. Kernel-based variance component estimation and whole-genome prediction of pre-corrected phenotypes and progeny tests for dairy cow health traits. Front Genet 2014; 5:56. [PMID: 24715901 PMCID: PMC3970026 DOI: 10.3389/fgene.2014.00056] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 11/14/2013] [Accepted: 03/04/2014] [Indexed: 11/13/2022]
Abstract
Prediction of complex trait phenotypes in the presence of unknown gene action is an ongoing challenge in animals, plants, and humans. Development of flexible predictive models that perform well irrespective of genetic and environmental architectures is desirable. Methods that can address non-additive variation in a non-explicit manner are gaining attention for this purpose and, in particular, semi-parametric kernel-based methods have been applied to diverse datasets, mostly providing encouraging results. On the other hand, the gains obtained from these methods have been smaller when smoothed values such as estimated breeding value (EBV) have been used as response variables. However, less emphasis has been placed on the choice of phenotypes to be used in kernel-based whole-genome prediction. This study aimed to evaluate differences between semi-parametric and parametric approaches using two types of response variables and molecular markers as inputs. Pre-corrected phenotypes (PCP) and EBV obtained for dairy cow health traits were used for this comparison. We observed that non-additive genetic variances were major contributors to total genetic variances in PCP, whereas additivity was the largest contributor to variability of EBV, as expected. Within the kernels evaluated, non-parametric methods yielded slightly better predictive performance across traits relative to their additive counterparts regardless of the type of response variable used. This reinforces the view that non-parametric kernels aiming to capture non-linear relationships between a panel of SNPs and phenotypes are appealing for complex trait prediction. However, like past studies, the gain in predictive correlation was not large for either PCP or EBV. We conclude that capturing non-additive genetic variation, especially epistatic variation, in a cross-validation framework remains a significant challenge even when it is important, as seems to be the case for health traits in dairy cows.
Affiliation(s)
- Gota Morota
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA
- Daniel Gianola
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA; Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, USA
|
49
|
Morota G, Koyama M, Rosa GJM, Weigel KA, Gianola D. Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data. Genet Sel Evol 2013; 45:17. [PMID: 23763755 PMCID: PMC3706293 DOI: 10.1186/1297-9686-45-17] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 12/06/2012] [Accepted: 05/31/2013] [Indexed: 01/09/2023]
Abstract
Background: Arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical learning models for predicting phenotypic values from all available molecular information that are capable of capturing complex genetic network architectures is of great importance. Bayesian kernel ridge regression is a non-parametric prediction model proposed for this purpose. Its essence is to create a spatial distance-based relationship matrix called a kernel. Although the set of all single nucleotide polymorphism genotype configurations on which a model is built is finite, past research has mainly used a Gaussian kernel. Results: We sought to investigate the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. This kernel can be viewed as a discretization of the Gaussian kernel. The predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, but outperformed the latter in the wheat data. However, the difference in performance between the diffusion and Gaussian kernels was negligible. Conclusions: It is concluded that the ability of a diffusion kernel to capture the total genetic variance is not better than that of a Gaussian kernel, at least for these data. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, our results imply that embedding genetic markers into a non-Euclidean metric space has very small impact on prediction. Our results suggest that use of the black box Gaussian kernel is justified, given its connection to the diffusion kernel and its similar predictive performance.
Affiliation(s)
- Gota Morota
- Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA.
|
50
|
de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 2013; 193:327-45. [PMID: 22745228 PMCID: PMC3567727 DOI: 10.1534/genetics.112.143313] [Citation(s) in RCA: 489] [Impact Index Per Article: 44.5] [Received: 05/18/2012] [Accepted: 06/11/2012] [Indexed: 11/18/2022]
Abstract
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Methods exist that allow implementing these large-p with small-n regressions, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of available methods is long, and the relationships between them have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in applications, and present a general discussion of lessons learned from simulation and empirical data analysis in the last decade.
Affiliation(s)
- Gustavo de Los Campos
- Department of Biostatistics, School of Public Health, University of Alabama, Birmingham, AL 35294, USA.
|