1
Wang H, Chen M, Wei X, Xia R, Pei D, Huang X, Han B. Computational tools for plant genomics and breeding. Science China Life Sciences 2024; 67:1579-1590. [PMID: 38676814 DOI: 10.1007/s11427-024-2578-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 03/25/2024] [Indexed: 04/29/2024]
Abstract
Plant genomics and crop breeding are at the intersection of biotechnology and information technology. Driven by a combination of high-throughput sequencing, molecular biology and data science, great advances have been made in omics technologies at every step along the central dogma, especially in genome assembly, genome annotation, epigenomic profiling, and transcriptome profiling. These advances have further revolutionized three directions of development. The first is genetic dissection of complex traits in crops, along with genomic prediction and selection. The second is comparative genomics and evolution, which open up new opportunities to depict the evolutionary constraints of biological sequences for deleterious variant discovery. The third is the development of deep learning approaches for the rational design of biological sequences, especially proteins, for synthetic biology. All three directions serve as the foundation for a new era of crop breeding in which agronomic traits are enhanced by genome design.
Affiliation(s)
- Hai Wang
- State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, 100193, China.
- Sanya Institute of China Agricultural University, Sanya, 572025, China.
- Hainan Yazhou Bay Seed Laboratory, Sanya, 572025, China.
- Mengjiao Chen
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xin Wei
- Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Rui Xia
- College of Horticulture, South China Agricultural University, Guangzhou, 510640, China
- Dong Pei
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xuehui Huang
- Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Bin Han
- National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200233, China
2
Chu TT, Kristensen PS, Jensen J. Simulation of functional additive and non-additive genetic effects using statistical estimates from quantitative genetic models. Heredity (Edinb) 2024; 133:33-42. [PMID: 38822133 PMCID: PMC11222558 DOI: 10.1038/s41437-024-00690-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 05/08/2024] [Accepted: 05/08/2024] [Indexed: 06/02/2024]
Abstract
Stochastic simulation software is commonly used to help breeders design cost-effective breeding programs and to validate statistical models used in genetic evaluation. An essential feature of the software is the ability to simulate populations with desired genetic and non-genetic parameters. However, this feature often fails when non-additive effects due to dominance or epistasis are modeled, as the desired properties of simulated populations are estimated from classical quantitative genetic statistical models formulated at the population level. The software simulates underlying functional effects for genotypic values at the individual level, which are not necessarily the same as effects from statistical models in which dominance and epistasis are included. This paper provides the theoretical basis and mathematical formulas for the transformation between functional and statistical effects in such simulations. The transformation is demonstrated with two statistical models: one analyzing individual phenotypes in a single population (common in animal breeding) and one analyzing plot phenotypes of three-way hybrids involving two inbred populations (observed in some crop breeding programs). We also describe different methods for simulating functional additive, dominance, and epistatic effects to achieve the desired levels of variance components in classical statistical models used in quantitative genetics.
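As a hedged illustration of the gap between functional and statistical effects described in this abstract, the sketch below uses the textbook single-locus case under Hardy-Weinberg equilibrium. It is not the paper's own set of formulas. The functional values `a` and `d`, the allele frequency `p`, and the function names are assumptions made for this example.

```python
def statistical_effects(a, d, p):
    """Map functional effects (a, d) at one biallelic locus to statistical
    quantities, assuming Hardy-Weinberg proportions and genotypic values
    +a (A1A1), d (A1A2), -a (A2A2). Illustrative helper, not from the paper."""
    q = 1.0 - p
    alpha = a + d * (q - p)            # average effect of allele substitution
    var_additive = 2 * p * q * alpha ** 2
    var_dominance = (2 * p * q * d) ** 2
    return alpha, var_additive, var_dominance


def genotypic_variance(a, d, p):
    """Total genotypic variance computed directly from genotype frequencies."""
    q = 1.0 - p
    freqs = (p * p, 2 * p * q, q * q)
    values = (a, d, -a)
    mean = sum(f * g for f, g in zip(freqs, values))
    return sum(f * (g - mean) ** 2 for f, g in zip(freqs, values))


alpha, va, vd = statistical_effects(a=1.0, d=0.5, p=0.3)
total = genotypic_variance(a=1.0, d=0.5, p=0.3)
print(alpha, va, vd, total)
```

With a = 1, d = 0.5 and p = 0.3, the statistical substitution effect is 1.2 rather than the functional value of 1, and the additive and dominance variances sum to the total genotypic variance. In this simple case the statistical additive variance absorbs part of the functional dominance effect, which is why a simulator that sets functional effects directly may not reproduce the intended statistical variance components.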
Affiliation(s)
- Thinh Tuan Chu
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark.
- Vietnam National University of Agriculture, Faculty of Animal Science, Trâu Quỳ, Gia Lâm, Hanoi, Vietnam.
- Just Jensen
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
3
Bose S, Banerjee S, Kumar S, Saha A, Nandy D, Hazra S. Review of applications of artificial intelligence (AI) methods in crop research. J Appl Genet 2024; 65:225-240. [PMID: 38216788 DOI: 10.1007/s13353-023-00826-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 12/23/2023] [Accepted: 12/26/2023] [Indexed: 01/14/2024]
Abstract
Sophisticated and modern crop improvement techniques can bridge the gap in feeding the ever-increasing population. Artificial intelligence (AI) refers to the simulation of human intelligence in machines through the application of computational algorithms, machine learning (ML) and deep learning (DL) techniques. These techniques aim to generalise patterns and relationships from historical data by employing various mathematical optimisation techniques, thus producing prediction models that facilitate the selection of superior genotypes. They are less resource intensive and can solve the problem based on the analysis of large-scale phenotypic datasets. ML for genomic selection (GS) uses high-throughput genotyping technologies to gather genetic information on a large number of markers across the genome. The prediction of GS models is based on the mathematical relation between genotypic and phenotypic data from the training population. ML techniques have emerged as powerful tools for genome editing through analysing large-scale genomic data and facilitating the development of accurate prediction models. Precise phenotyping is a prerequisite for advancing crop breeding to solve agricultural production-related issues. ML algorithms can address this need by generating predictive models based on the analysis of large-scale phenotypic datasets. DL models also show potential for reliable, precise phenotyping. This review provides a comprehensive overview of various ML and DL models, their applications, and their potential to enhance the efficiency, specificity and safety of advanced crop improvement protocols such as genomic selection and genome editing, along with phenotypic prediction, to promote accelerated breeding.
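The statement that GS predictions rest on a mathematical relation between genotypes and phenotypes in a training population can be sketched with ridge regression on simulated markers. This is a minimal illustration under stated assumptions, not a method taken from the review: the population sizes, effect distributions, penalty value, and all function names are hypothetical.

```python
import numpy as np


def simulate_population(n_individuals=300, n_markers=100, noise_sd=4.0, seed=1):
    """Simulate 0/1/2 marker genotypes and a phenotype with additive effects.
    All sizes and distributions here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    allele_freq = rng.uniform(0.2, 0.8, n_markers)
    genotypes = rng.binomial(2, allele_freq, size=(n_individuals, n_markers)).astype(float)
    marker_effects = rng.normal(0.0, 1.0, n_markers)
    phenotypes = genotypes @ marker_effects + rng.normal(0.0, noise_sd, n_individuals)
    return genotypes, phenotypes


def fit_ridge(genotypes, phenotypes, penalty=1.0):
    """Closed-form ridge regression on centered markers."""
    marker_means = genotypes.mean(axis=0)
    centered = genotypes - marker_means
    intercept = phenotypes.mean()
    gram = centered.T @ centered + penalty * np.eye(centered.shape[1])
    effects = np.linalg.solve(gram, centered.T @ (phenotypes - intercept))
    return marker_means, intercept, effects


def predict(genotypes, marker_means, intercept, effects):
    """Predict phenotypes for new individuals from their marker genotypes."""
    return intercept + (genotypes - marker_means) @ effects


genotypes, phenotypes = simulate_population()
train, valid = slice(0, 200), slice(200, 300)
marker_means, intercept, effects = fit_ridge(genotypes[train], phenotypes[train])
predicted = predict(genotypes[valid], marker_means, intercept, effects)
accuracy = np.corrcoef(predicted, phenotypes[valid])[0, 1]
print(f"predictive correlation in the validation set: {accuracy:.2f}")
```

The model is trained on individuals with both genotypes and phenotypes, then applied to individuals with genotypes only; the correlation between predicted and observed phenotypes in the held-out set is one common way to report prediction accuracy.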
Affiliation(s)
- Suvojit Bose
- Department of Vegetables and Spice Crops, Uttar Banga Krishi Viswavidyalaya, Pundibari, Cooch Behar, 736165, West Bengal, India
- Soumya Kumar
- School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Akash Saha
- School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Debalina Nandy
- School of Agricultural Sciences, JIS University, Kolkata, 700109, West Bengal, India
- Soham Hazra
- Department of Agriculture, Brainware University, Barasat, 700125, West Bengal, India.
4
Ubbens J, Feldmann MJ, Stavness I, Sharpe AG. Quantitative evaluation of nonlinear methods for population structure visualization and inference. G3 Genes|Genomes|Genetics 2022; 12:6651067. [PMID: 35900169 PMCID: PMC9434256 DOI: 10.1093/g3journal/jkac191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 07/20/2022] [Indexed: 11/20/2022]
Abstract
Population structure (also called genetic structure and population stratification) is the presence of a systematic difference in allele frequencies between subpopulations in a population as a result of nonrandom mating between individuals. It can be informative of genetic ancestry, and in the context of medical genetics, it is an important confounding variable in genome-wide association studies. Recently, many nonlinear dimensionality reduction techniques have been proposed for the population structure visualization task. However, an objective comparison of these techniques has so far been missing from the literature. In this article, we discuss the previously proposed nonlinear techniques and some of their potential weaknesses. We then propose a novel quantitative evaluation methodology for comparing these nonlinear techniques, based on populations for which pedigree is known a priori either through artificial selection or simulation. Based on this evaluation metric, we find graph-based algorithms such as t-SNE and UMAP to be superior to principal component analysis, while neural network-based methods fall behind.
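For orientation, the principal component analysis baseline mentioned in this abstract can be sketched on simulated genotypes from two subpopulations with different allele frequencies. This is only an assumed toy setup: it does not reproduce the paper's evaluation methodology or the nonlinear methods it compares, and the simulation parameters and function names are hypothetical.

```python
import numpy as np


def simulate_two_subpopulations(n_per_pop=50, n_markers=500, seed=0):
    """Simulate 0/1/2 genotypes for two subpopulations whose allele
    frequencies differ (illustrative assumption, not real data)."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_markers)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_markers), 0.05, 0.95)
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_markers))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_markers))
    genotypes = np.vstack([geno_a, geno_b]).astype(float)
    labels = np.repeat([0, 1], n_per_pop)
    return genotypes, labels


def pca_scores(genotypes, n_components=2):
    """Project individuals onto the leading principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)
    left, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
    return left[:, :n_components] * singular_values[:n_components]


genotypes, labels = simulate_two_subpopulations()
scores = pca_scores(genotypes)
# The sign of a principal component is arbitrary, so check both orientations.
assigned = (scores[:, 0] > 0).astype(int)
agreement = max(np.mean(assigned == labels), np.mean(assigned != labels))
print(f"agreement between PC1 sign and subpopulation: {agreement:.2f}")
```

Because the true subpopulation of every simulated individual is known, the embedding can be scored against it, which mirrors in spirit the idea of evaluating a visualization on populations whose structure is known in advance.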
Affiliation(s)
- Jordan Ubbens
- Global Institute for Food Security (GIFS), University of Saskatchewan, Saskatoon, SK S7N 0W9, Canada
- Mitchell J Feldmann
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Ian Stavness
- Global Institute for Food Security (GIFS), University of Saskatchewan, Saskatoon, SK S7N 0W9, Canada
- Department of Computer Science, University of Saskatchewan, Saskatoon, SK S7N 0W9, Canada
- Andrew G Sharpe
- Global Institute for Food Security (GIFS), University of Saskatchewan, Saskatoon, SK S7N 0W9, Canada
5
Pérez-Enciso M, Zingaretti LM, Ramayo-Caldas Y, de Los Campos G. Opportunities and limits of combining microbiome and genome data for complex trait prediction. Genet Sel Evol 2021; 53:65. [PMID: 34362312 PMCID: PMC8344190 DOI: 10.1186/s12711-021-00658-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/25/2021] [Accepted: 07/20/2021] [Indexed: 12/12/2022]
Abstract
Background: Analysis and prediction of complex traits using microbiome data combined with host genomic information is a topic of utmost interest. However, numerous questions remain to be answered: how useful can the microbiome be for complex trait prediction? Are estimates of microbiability reliable? Can the underlying biological links between the host’s genome, microbiome, and phenome be recovered?
Methods: Here, we address these issues by (i) developing a novel simulation strategy that uses real microbiome and genotype data as inputs, and (ii) using variance-component approaches (Bayesian Reproducing Kernel Hilbert Space (RKHS) and Bayesian variable selection methods (Bayes C)) to quantify the proportion of phenotypic variance explained by the genome and the microbiome. The proposed simulation approach can mimic genetic links between the microbiome and genotype data by a permutation procedure that retains the distributional properties of the data.
Results: Using real genotype and rumen microbiota abundances from dairy cattle, simulation results suggest that microbiome data can significantly improve the accuracy of phenotype predictions, regardless of whether some microbiota abundances are under direct genetic control by the host or not. This improvement depends logically on the microbiome being stable over time. Overall, random-effects linear methods appear robust for variance components estimation, in spite of the typically highly leptokurtic distribution of microbiota abundances. The predictive performance of Bayes C was higher but more sensitive to the number of causative effects than RKHS. Accuracy with Bayes C depended, in part, on the number of microorganisms’ taxa that influence the phenotype.
Conclusions: While we conclude that, overall, genome-microbiome-links can be characterized using variance component estimates, we are less optimistic about the possibility of identifying the causative host genetic effects that affect microbiota abundances, which would require much larger sample sizes than are typically available for genome-microbiome-phenome studies. The R code to replicate the analyses is in https://github.com/miguelperezenciso/simubiome.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12711-021-00658-7.
Affiliation(s)
- Miguel Pérez-Enciso
- ICREA, Passeig de Lluís Companys 23, 08010, Barcelona, Spain; Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193, Bellaterra, Barcelona, Spain; Dept. of Epidemiology & Biostatistics, and Dept. of Statistics & Probability, Michigan State University, East Lansing, MI, 48824, USA.
- Laura M Zingaretti
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193, Bellaterra, Barcelona, Spain; Dept. of Epidemiology & Biostatistics, and Dept. of Statistics & Probability, Michigan State University, East Lansing, MI, 48824, USA
- Yuliaxis Ramayo-Caldas
- Animal Breeding and Genetics Program, Institute for Research and Technology in Food and Agriculture (IRTA), Torre Marimon, 08140, Caldes de Montbui, Barcelona, Spain
- Gustavo de Los Campos
- Dept. of Epidemiology & Biostatistics, and Dept. of Statistics & Probability, Michigan State University, East Lansing, MI, 48824, USA
6
Garreta L, Cerón‐Souza I, Palacio MR, Reyes‐Herrera PH. MultiGWAS: An integrative tool for Genome Wide Association Studies in tetraploid organisms. Ecol Evol 2021; 11:7411-7426. [PMID: 34188823 PMCID: PMC8216910 DOI: 10.1002/ece3.7572] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/20/2020] [Revised: 03/22/2021] [Accepted: 03/23/2021] [Indexed: 12/27/2022]
Abstract
Genome-wide association studies (GWAS) are essential to determine the genetic bases of ecological or economic phenotypic variation across individuals within populations of model and nonmodel organisms. For this research question, it is common to replicate GWAS with different parameters and models to validate the reproducibility of the results. However, straightforward methodologies that manage both replication and tetraploid data are still missing. To solve this problem, we designed MultiGWAS, a tool that performs GWAS for diploid and tetraploid organisms by executing four software packages in parallel, two designed for polyploid data (GWASpoly and SHEsis) and two designed for diploid data (GAPIT and TASSEL). MultiGWAS has several advantages. It runs either from the command line or in a graphical interface, and it manages different genotype formats, including VCF. Moreover, it allows control for population structure and relatedness, and offers several quality control checks on genotype data. In addition, MultiGWAS can test for additive and dominant gene action models and, through a proprietary scoring function, select the best model to report its associations. Finally, it generates several reports that facilitate identifying false associations from both the significant and the best-ranked associated Single Nucleotide Polymorphisms (SNPs) among the four software packages. We tested MultiGWAS with public tetraploid potato data for tuber shape and with several simulated datasets under both additive and dominant models. These tests demonstrated that MultiGWAS is better at detecting reliable associations than using each of the four software packages individually. Moreover, the parallel analysis of polyploid and diploid software that only MultiGWAS offers demonstrates its utility in understanding the best genetic model behind the SNP association in tetraploid organisms. Therefore, MultiGWAS proved to be an excellent alternative for wrapping GWAS replication in diploid and tetraploid organisms in a single analysis environment.
Affiliation(s)
- Luis Garreta
- Corporación Colombiana de Investigación Agropecuaria (AGROSAVIA), CI Tibaitatá, Bogotá, Colombia
- Ivania Cerón‐Souza
- Corporación Colombiana de Investigación Agropecuaria (AGROSAVIA), CI Tibaitatá, Bogotá, Colombia
- Paula H. Reyes‐Herrera
- Corporación Colombiana de Investigación Agropecuaria (AGROSAVIA), CI Tibaitatá, Bogotá, Colombia
7
Pook T, Büttgen L, Ganesan A, Ha NT, Simianer H. MoBPSweb: A web-based framework to simulate and compare breeding programs. G3 (Bethesda, Md.) 2021; 11:jkab023. [PMID: 33712818 PMCID: PMC8022963 DOI: 10.1093/g3journal/jkab023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/11/2020] [Accepted: 01/11/2021] [Indexed: 11/13/2022]
Abstract
In this study, we introduce a new web-based simulation framework ("MoBPSweb") that combines a unified language to describe breeding programs with the simulation software MoBPS, standing for "Modular Breeding Program Simulator." Thereby, MoBPSweb provides a flexible environment to log, simulate, evaluate, and compare breeding programs. Inputs can be provided via modules ranging from a Vis.js-based environment for "drawing" the breeding program to a variety of modules to provide phenotype information, economic parameters, and other relevant information. Similarly, results of the simulation study can be extracted and compared to other scenarios via output modules (e.g., observed phenotypes, the accuracy of breeding value estimation, inbreeding rates), while all simulations and downstream analysis are executed in the highly efficient R-package MoBPS.
Affiliation(s)
- Torsten Pook
- Center for Integrated Breeding Research, Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Goettingen D 37075, Germany
- Lisa Büttgen
- Center for Integrated Breeding Research, Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Goettingen D 37075, Germany
- Amudha Ganesan
- Center for Integrated Breeding Research, Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Goettingen D 37075, Germany
- Ngoc-Thuy Ha
- Center for Integrated Breeding Research, Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Goettingen D 37075, Germany
- Henner Simianer
- Center for Integrated Breeding Research, Department of Animal Sciences, Animal Breeding and Genetics Group, University of Goettingen, Goettingen D 37075, Germany
8
Tong H, Nikoloski Z. Machine learning approaches for crop improvement: Leveraging phenotypic and genotypic big data. Journal of Plant Physiology 2021; 257:153354. [PMID: 33385619 DOI: 10.1016/j.jplph.2020.153354] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/30/2020] [Revised: 12/14/2020] [Accepted: 12/15/2020] [Indexed: 05/07/2023]
Abstract
Highly efficient and accurate selection of elite genotypes can lead to dramatic shortening of the breeding cycle in major crops relevant for sustaining present demands for food, feed, and fuel. In contrast to classical approaches that emphasize the need for resource-intensive phenotyping at all stages of artificial selection, genomic selection dramatically reduces the need for phenotyping. Genomic selection relies on advances in machine learning and the availability of genotyping data to predict agronomically relevant phenotypic traits. Here we provide a systematic review of machine learning approaches applied for genomic selection of single and multiple traits in major crops in the past decade. We emphasize the need to gather data on intermediate phenotypes, e.g. metabolite, protein, and gene expression levels, along with developments of modeling techniques that can lead to further improvements of genomic selection. In addition, we provide a critical view of factors that affect genomic selection, with attention to transferability of models between different environments. Finally, we highlight the future aspects of integrating high-throughput molecular phenotypic data from omics technologies with biological networks for crop improvement.
Affiliation(s)
- Hao Tong
- Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany; Bioinformatics and Mathematical Modeling Department, Centre for Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria; Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
- Zoran Nikoloski
- Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany; Bioinformatics and Mathematical Modeling Department, Centre for Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria; Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany.