1. MacNish TR, Danilevicz MF, Bayer PE, Bestry MS, Edwards D. Application of machine learning and genomics for orphan crop improvement. Nat Commun 2025; 16:982. [PMID: 39856113 PMCID: PMC11760368 DOI: 10.1038/s41467-025-56330-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Accepted: 01/15/2025] [Indexed: 01/27/2025] Open
Abstract
Orphan crops are important sources of nutrition in developing regions and many are tolerant to biotic and abiotic stressors; however, modern crop improvement technologies have not been widely applied to orphan crops due to the lack of resources available. There are orphan crop representatives across major crop types and the conservation of genes between these related species can be used in crop improvement. Machine learning (ML) has emerged as a promising tool for crop improvement. Transferring knowledge from major crops to orphan crops and using machine learning to improve accuracy and efficiency can be used to improve orphan crops.
Affiliation(s)
- Tessa R MacNish
- School of Biological Sciences, The University of Western Australia, Perth, Australia
- Centre for Applied Bioinformatics, The University of Western Australia, Perth, Australia
- Monica F Danilevicz
- School of Biological Sciences, The University of Western Australia, Perth, Australia
- Centre for Applied Bioinformatics, The University of Western Australia, Perth, Australia
- Australian Herbicide Resistance Initiative, The University of Western Australia, Perth, Australia
- Philipp E Bayer
- Centre for Applied Bioinformatics, The University of Western Australia, Perth, Australia
- The UWA Oceans Institute, The University of Western Australia, Perth, Australia
- Minderoo Foundation, Perth, Australia
- Mitchell S Bestry
- School of Biological Sciences, The University of Western Australia, Perth, Australia
- Centre for Applied Bioinformatics, The University of Western Australia, Perth, Australia
- David Edwards
- School of Biological Sciences, The University of Western Australia, Perth, Australia
- Centre for Applied Bioinformatics, The University of Western Australia, Perth, Australia
2. Xu Y, Zhang Y, Cui Y, Zhou K, Yu G, Yang W, Wang X, Li F, Guan X, Zhang X, Yang Z, Xu S, Xu C. GA-GBLUP: leveraging the genetic algorithm to improve the predictability of genomic selection. Brief Bioinform 2024; 25:bbae385. [PMID: 39101500 PMCID: PMC11299030 DOI: 10.1093/bib/bbae385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 07/03/2024] [Accepted: 07/24/2024] [Indexed: 08/06/2024] Open
Abstract
Genomic selection (GS) has emerged as an effective technology to accelerate crop hybrid breeding by enabling early selection prior to phenotype collection. Genomic best linear unbiased prediction (GBLUP) is a robust method that has been routinely used in GS breeding programs. However, GBLUP assumes that markers contribute equally to the total genetic variance, which may not be the case. In this study, we developed a novel GS method called GA-GBLUP that leverages the genetic algorithm (GA) to select markers related to the target trait. We defined four fitness functions for optimization, including AIC, BIC, R², and HAT, to improve the predictability and bin adjacent markers based on the principle of linkage disequilibrium to reduce model dimension. The results demonstrate that the GA-GBLUP model, equipped with R² and HAT fitness function, produces much higher predictability than GBLUP for most traits in rice and maize datasets, particularly for traits with low heritability. Moreover, we have developed a user-friendly R package, GAGBLUP, for GS, and the package is freely available on CRAN (https://CRAN.R-project.org/package=GAGBLUP).
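The equal-contribution assumption mentioned in this abstract can be illustrated with a minimal sketch of plain GBLUP, in which every marker enters the genomic relationship matrix with the same weight. This is not the authors' GA-GBLUP method and does not use the GAGBLUP package; the function names, the VanRaden-style scaling, and the fixed variance ratio `lam` are assumptions made for illustration only.

```python
import numpy as np

def genomic_relationship(markers):
    """VanRaden-style G matrix from a 0/1/2 marker matrix (lines x markers).

    Every marker column gets the same weight, which is the
    equal-contribution assumption of plain GBLUP.
    """
    M = np.asarray(markers, dtype=float)
    p = M.mean(axis=0) / 2.0              # allele frequency per marker
    Z = M - 2.0 * p                       # centre each marker column
    scale = 2.0 * np.sum(p * (1.0 - p))   # common scaling over all markers
    return Z @ Z.T / scale

def gblup_predict(G, y, train_idx, test_idx, lam=1.0):
    """Predict phenotypes of test lines from training lines.

    lam is the ratio residual variance / genetic variance; it is
    assumed known here (hypothetical value), not estimated.
    """
    y = np.asarray(y, dtype=float)
    mu = y[train_idx].mean()
    G_tt = G[np.ix_(train_idx, train_idx)]   # train x train relationships
    G_st = G[np.ix_(test_idx, train_idx)]    # test x train relationships
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)),
                            y[train_idx] - mu)
    return mu + G_st @ alpha
```

A marker-selection approach would, in this picture, change which columns of the marker matrix are passed to `genomic_relationship`; the sketch itself does no selection.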
Affiliation(s)
- Yang Xu
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Yuxiang Zhang
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Yanru Cui
- College of Agronomy, Hebei Agricultural University, Baoding, Hebei 071001, China
- Kai Zhou
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Guangning Yu
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Wenyan Yang
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Xin Wang
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Furong Li
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Xiusheng Guan
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Xuecai Zhang
- Global Maize Program, International Maize and Wheat Improvement Centre, Texcoco 56237, Mexico
- Zefeng Yang
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
- Shizhong Xu
- Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, United States
- Chenwu Xu
- Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou, Jiangsu 225009, China
3. Sarnaik KS, Linden PA, Gasnick A, Bassiri A, Manyak GA, Jarrett CM, Sinopoli JN, Tapias Vargas L, Towe CW. Computational risk model for predicting 2-year malignancy of pulmonary nodules using demographic and radiographic characteristics. J Thorac Cardiovasc Surg 2024; 167:1910-1924.e2. [PMID: 37717851 DOI: 10.1016/j.jtcvs.2023.09.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 08/14/2023] [Accepted: 09/11/2023] [Indexed: 09/19/2023]
Abstract
OBJECTIVES: To determine whether discriminatory performance of a computational risk model in classifying pulmonary lesion malignancy using demographic, radiographic, and clinical characteristics is superior to the opinion of experienced providers. We hypothesized that computational risk models would outperform providers.
METHODS: Outcome of malignancy was obtained from selected patients enrolled in the NAVIGATE trial (NCT02410837). Five predictive risk models were developed using an 80:20 train-test split: univariable logistic regression model based solely on provider opinion, multivariable logistic regression model, random forest classifier, extreme gradient boosting model, and artificial neural network. Area under the receiver operating characteristic curve achieved during testing of the predictive models was compared to that of prebiopsy provider opinion baseline using the DeLong test with 10,000 bootstrapped iterations.
RESULTS: The cohort included 984 patients, 735 (74.7%) of which were diagnosed with malignancy. Factors associated with malignancy from multivariable logistic regression included age, history of cancer, largest lesion size, lung zone, and positron-emission tomography positivity. Testing area under the receiver operating characteristic curve were 0.830 for provider opinion baseline, 0.770 for provider opinion univariable logistic regression, 0.659 for multivariable logistic regression model, 0.743 for random forest classifier, 0.740 for extreme gradient boosting, and 0.679 for artificial neural network. Provider opinion baseline was determined to be the best predictive classification system.
CONCLUSIONS: Computational models predicting malignancy of pulmonary lesions using clinical, demographic, and radiographic characteristics are inferior to provider opinion. This study questions the ability of these models to provide additional insight into patient care. Expert clinician evaluation of pulmonary lesion malignancy is paramount.
Affiliation(s)
- Kunaal S Sarnaik
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio
- Philip A Linden
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Allison Gasnick
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio
- Aria Bassiri
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Grigory A Manyak
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio
- Craig M Jarrett
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Jillian N Sinopoli
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Leonidas Tapias Vargas
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Christopher W Towe
- Department of Surgery, Case Western Reserve University School of Medicine, Cleveland, Ohio; Division of Thoracic and Esophageal Surgery, Department of Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio
4. Zhou W, Yan Z, Zhang L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci Rep 2024; 14:5905. [PMID: 38467662 PMCID: PMC10928191 DOI: 10.1038/s41598-024-55243-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 02/21/2024] [Indexed: 03/13/2024] Open
Abstract
To explore a robust tool for advancing digital breeding practices through an artificial intelligence-driven phenotype prediction expert system, we undertook a thorough analysis of 11 non-linear regression models. Our investigation specifically emphasized the significance of Support Vector Regression (SVR) and SHapley Additive exPlanations (SHAP) in predicting soybean branching. By using branching data (phenotype) of 1918 soybean accessions and 42 k SNP (Single Nucleotide Polymorphism) polymorphic data (genotype), this study systematically compared 11 non-linear regression AI models, including four deep learning models (DBN (deep belief network) regression, ANN (artificial neural network) regression, Autoencoders regression, and MLP (multilayer perceptron) regression) and seven machine learning models (e.g., SVR (support vector regression), XGBoost (eXtreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, GPs (Gaussian processes) regression, Decision Tree regression, and Polynomial regression). After being evaluated by four evaluation metrics: R² (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error), it was found that the SVR, Polynomial Regression, DBN, and Autoencoder outperformed other models and could obtain a better prediction accuracy when they were used for phenotype prediction. In the assessment of deep learning approaches, we exemplified the SVR model, conducting analyses on feature importance and gene ontology (GO) enrichment to provide comprehensive support. After comprehensively comparing four feature importance algorithms, no notable distinction was observed in the feature importance ranking scores across the four algorithms, namely Variable Ranking, Permutation, SHAP, and Correlation Matrix, but the SHAP value could provide rich information on genes with negative contributions, and SHAP importance was chosen for feature selection.
The results of this study offer valuable insights into AI-mediated plant breeding, addressing challenges faced by traditional breeding programs. The method developed has broad applicability in phenotype prediction, minor QTL (quantitative trait loci) mining, and plant smart-breeding systems, contributing significantly to the advancement of AI-based breeding practices and transitioning from experience-based to data-based breeding.
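The four evaluation metrics named in this abstract (R-squared, MAE, MSE, MAPE) can be computed as in the short sketch below. It is an illustration of the standard definitions, not the authors' code; the function name `regression_metrics` and the choice to report MAPE as a fraction rather than a percentage are assumptions.

```python
def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and MAPE for paired observed/predicted values.

    MAPE is returned as a fraction (0.125 means 12.5%) and is undefined
    when any observed value is zero; R2 is undefined when all observed
    values are identical.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": ss_res / n,
        "MAPE": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }
```

Because a count trait such as branch number can be zero for some accessions, MAPE in particular needs care: the sketch raises `ZeroDivisionError` in that case rather than silently skipping those observations.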
Affiliation(s)
- Wei Zhou
- Florida Agricultural and Mechanical University, Tallahassee, FL, 32307, USA
- Zhengxiao Yan
- Florida State University, Tallahassee, FL, 32306, USA
- Liting Zhang
- Florida State University, Tallahassee, FL, 32306, USA
5. Heilmann PG, Frisch M, Abbadi A, Kox T, Herzog E. Stacked ensembles on basis of parentage information can predict hybrid performance with an accuracy comparable to marker-based GBLUP. Front Plant Sci 2023; 14:1178902. [PMID: 37546247 PMCID: PMC10401275 DOI: 10.3389/fpls.2023.1178902] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/03/2023] [Accepted: 06/26/2023] [Indexed: 08/08/2023]
Abstract
Testcross factorials in newly established hybrid breeding programs are often highly unbalanced, incomplete, and characterized by predominance of special combining ability (SCA) over general combining ability (GCA). This results in a low efficiency of GCA-based selection. Machine learning algorithms might improve prediction of hybrid performance in such testcross factorials, as they have been successfully applied to find complex underlying patterns in sparse data. Our objective was to compare the prediction accuracy of machine learning algorithms to that of GCA-based prediction and genomic best linear unbiased prediction (GBLUP) in six unbalanced incomplete factorials from hybrid breeding programs of rapeseed, wheat, and corn. We investigated a range of machine learning algorithms with three different types of predictor variables: (a) information on parentage of hybrids, (b) in addition hybrid performance of crosses of the parental lines with other crossing partners, and (c) genotypic marker data. In two highly incomplete and unbalanced factorials from rapeseed, in which the SCA variance contributed considerably to the genetic variance, stacked ensembles of gradient boosting machines based on parentage information outperformed GCA prediction. The stacked ensembles increased prediction accuracy from 0.39 to 0.45, and from 0.48 to 0.54 compared to GCA prediction. The prediction accuracy reached by stacked ensembles without marker data reached values comparable to those of GBLUP that requires marker data. We conclude that hybrid prediction with stacked ensembles of gradient boosting machines based on parentage information is a promising approach that is worth further investigations with other data sets in which SCA variance is high.
Affiliation(s)
- Matthias Frisch
- Institute of Agronomy and Plant Breeding II, Justus Liebig University, Gießen, Germany
- Eva Herzog
- Institute of Agronomy and Plant Breeding II, Justus Liebig University, Gießen, Germany
6. Xu Y, Zhang X, Li H, Zheng H, Zhang J, Olsen MS, Varshney RK, Prasanna BM, Qian Q. Smart breeding driven by big data, artificial intelligence, and integrated genomic-enviromic prediction. Mol Plant 2022; 15:1664-1695. [PMID: 36081348 DOI: 10.1016/j.molp.2022.09.001] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 04/04/2022] [Revised: 08/20/2022] [Accepted: 09/02/2022] [Indexed: 05/12/2023]
Abstract
The first paradigm of plant breeding involves direct selection-based phenotypic observation, followed by predictive breeding using statistical models for quantitative traits constructed based on genetic experimental design and, more recently, by incorporation of molecular marker genotypes. However, plant performance or phenotype (P) is determined by the combined effects of genotype (G), envirotype (E), and genotype by environment interaction (GEI). Phenotypes can be predicted more precisely by training a model using data collected from multiple sources, including spatiotemporal omics (genomics, phenomics, and enviromics across time and space). Integration of 3D information profiles (G-P-E), each with multidimensionality, provides predictive breeding with both tremendous opportunities and great challenges. Here, we first review innovative technologies for predictive breeding. We then evaluate multidimensional information profiles that can be integrated with a predictive breeding strategy, particularly envirotypic data, which have largely been neglected in data collection and are nearly untouched in model construction. We propose a smart breeding scheme, integrated genomic-enviromic prediction (iGEP), as an extension of genomic prediction, using integrated multiomics information, big data technology, and artificial intelligence (mainly focused on machine and deep learning). We discuss how to implement iGEP, including spatiotemporal models, environmental indices, factorial and spatiotemporal structure of plant breeding data, and cross-species prediction. A strategy is then proposed for prediction-based crop redesign at both the macro (individual, population, and species) and micro (gene, metabolism, and network) scales. Finally, we provide perspectives on translating smart breeding into genetic gain through integrative breeding platforms and open-source breeding initiatives. 
We call for coordinated efforts in smart breeding through iGEP, institutional partnerships, and innovative technological support.
Affiliation(s)
- Yunbi Xu
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China; CIMMYT-China Tropical Maize Research Center, School of Food Science and Engineering, Foshan University, Foshan, Guangdong 528231, China; Peking University Institute of Advanced Agricultural Sciences, Weifang, Shandong 261325, China
- Xingping Zhang
- Peking University Institute of Advanced Agricultural Sciences, Weifang, Shandong 261325, China
- Huihui Li
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China; National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan 572024, China
- Hongjian Zheng
- CIMMYT-China Specialty Maize Research Center, Shanghai Academy of Agricultural Sciences, Shanghai 201400, China
- Jianan Zhang
- MolBreeding Biotechnology Co., Ltd., Shijiazhuang, Hebei 050035, China
- Michael S Olsen
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Rajeev K Varshney
- State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Australia
- Boddupalli M Prasanna
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Qian Qian
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China