1
Botkin J, Medina C, Park S, Poudel K, Cha M, Lee Y, Prom LK, Curtin SJ, Xu Z, Ahn E. Analyzing Medicago spp. seed morphology using GWAS and machine learning. Sci Rep 2024; 14:17588. [PMID: 39080407 PMCID: PMC11289399 DOI: 10.1038/s41598-024-67790-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Accepted: 07/16/2024] [Indexed: 08/02/2024]
Abstract
Alfalfa is widely recognized as an important forage crop. To understand the morphological characteristics and genetic basis of seed morphology in alfalfa, we screened 318 Medicago spp., including 244 Medicago sativa subsp. sativa (alfalfa) and 23 other Medicago spp., for seed area size, length, width, length-to-width ratio, perimeter, circularity, the distance between the intersection of length & width (IS) and center of gravity (CG), and seed darkness & red-green-blue (RGB) intensities. The results revealed phenotypic diversity and correlations among the tested accessions. Based on the phenotypic data of M. sativa subsp. sativa, a genome-wide association study (GWAS) was conducted using single nucleotide polymorphisms (SNPs) called against the Medicago truncatula genome. Genes in proximity to associated markers were detected, including CPR1, MON1, a PPR protein, and Wun1 (threshold of 1E-04). Machine learning models were utilized to validate the GWAS results and to identify additional marker-trait associations for potentially complex traits. Marker S7_33375673, upstream of Wun1, was the most important predictor variable for red color intensity and highly important for brightness. Fifty-two markers were identified in coding regions. Along with strong correlations observed between seed morphology traits, these genes will facilitate the process of understanding the genetic basis of seed morphology in Medicago spp.
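The abstract lists length-to-width ratio and circularity among the measured traits but does not state how they were computed. As a minimal sketch, assuming the common image-analysis definition of circularity, 4πA/P², the two descriptors could be derived from basic outline measurements as follows. The function name `shape_descriptors` and its arguments are hypothetical and are not taken from the paper.

```python
import math


def shape_descriptors(area, perimeter, length, width):
    """Return length-to-width ratio and circularity for one seed outline.

    Assumption: circularity is 4*pi*area / perimeter**2, which is a common
    definition in image analysis; the paper may use a different one.
    """
    if min(area, perimeter, length, width) <= 0:
        raise ValueError("all measurements must be positive")
    return {
        "length_to_width": length / width,
        # Equals 1.0 for a perfect circle and falls toward 0 for
        # elongated or ragged outlines.
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
    }


# Sanity check on a circle of radius 1: area = pi, perimeter = 2*pi.
circle = shape_descriptors(math.pi, 2.0 * math.pi, 2.0, 2.0)
```

A perfect circle gives a circularity of 1 and a length-to-width ratio of 1, which makes it a convenient check before applying the function to real outlines.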
Affiliation(s)
- Jacob Botkin
- Department of Plant Pathology, University of Minnesota, St. Paul, MN, 55108, USA
- Cesar Medina
- Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, 55108, USA
- Sunchung Park
- Sustainable Perennial Crops Laboratory, United States Department of Agriculture-Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, MD, 20705, USA
- Kabita Poudel
- Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, 55108, USA
- Minhyeok Cha
- Department of Biotechnology, Korea University, Seoul, 02841, Republic of Korea
- Yoonjung Lee
- Department of Plant Pathology, University of Minnesota, St. Paul, MN, 55108, USA
- Louis K Prom
- United States Department of Agriculture-Agricultural Research Service, Southern Plains Agricultural Research Center, 2765 F & B Road, College Station, TX, 77845, USA
- Shaun J Curtin
- Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, 55108, USA
- Plant Science Research Unit, United States Department of Agriculture-Agricultural Research Service, St. Paul, MN, 55108, USA
- Center for Plant Precision Genomics, University of Minnesota, St. Paul, MN, 55108, USA
- Center for Genome Engineering, University of Minnesota, St. Paul, MN, 55108, USA
- Zhanyou Xu
- Plant Science Research Unit, United States Department of Agriculture-Agricultural Research Service, St. Paul, MN, 55108, USA
- Ezekiel Ahn
- Sustainable Perennial Crops Laboratory, United States Department of Agriculture-Agricultural Research Service, Beltsville Agricultural Research Center, Beltsville, MD, 20705, USA
2
Odriozola I, Rasmussen JA, Gilbert MTP, Limborg MT, Alberdi A. A practical introduction to holo-omics. Cell Reports Methods 2024; 4:100820. [PMID: 38986611 PMCID: PMC11294832 DOI: 10.1016/j.crmeth.2024.100820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/17/2024] [Accepted: 06/20/2024] [Indexed: 07/12/2024]
Abstract
Holo-omics refers to the joint study of non-targeted molecular data layers from host-microbiota systems or holobionts, which is increasingly employed to disentangle the complex interactions between the elements that compose them. We navigate through the generation, analysis, and integration of omics data, focusing on the commonalities and main differences to generate and analyze the various types of omics, with a special focus on optimizing data generation and integration. We advocate for careful generation and distillation of data, followed by independent exploration and analyses of the single omic layers to obtain a better understanding of the study system, before the integration of multiple omic layers in a final model is attempted. We highlight critical decision points to achieve this aim and flag the main challenges to address complex biological questions regarding the integrative study of host-microbiota relationships.
Affiliation(s)
- Iñaki Odriozola
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Jacob A Rasmussen
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- M Thomas P Gilbert
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark; University Museum, NTNU, Trondheim, Norway
- Morten T Limborg
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Antton Alberdi
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
3
Li X, Chen X, Wang Q, Yang N, Sun C. Integrating Bioinformatics and Machine Learning for Genomic Prediction in Chickens. Genes (Basel) 2024; 15:690. [PMID: 38927626 PMCID: PMC11202573 DOI: 10.3390/genes15060690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 05/12/2024] [Accepted: 05/23/2024] [Indexed: 06/28/2024]
Abstract
Genomic prediction plays an increasingly important role in modern animal breeding, with predictive accuracy being a crucial aspect. The classical linear mixed model is increasingly unable to accommodate the growing number of target traits and the increasingly intricate genetic regulatory patterns. Hence, novel approaches are necessary for future genomic prediction. In this study, we used an Illumina 50K SNP chip to genotype 4190 egg-type female Rhode Island Red chickens. Machine learning (ML) and classical bioinformatics methods were integrated to fit genotypes with 10 economic traits in chickens. We evaluated the effectiveness of ML methods using Pearson correlation coefficients and the RMSE between predicted and actual phenotypic values and compared them with rrBLUP and BayesA. Our results indicated that ML algorithms exhibit significantly superior performance to rrBLUP and BayesA in predicting body weight and eggshell strength traits. Conversely, rrBLUP and BayesA demonstrated 2-58% higher predictive accuracy in predicting egg numbers. Additionally, the incorporation of suggestively significant SNPs obtained through the GWAS into the ML models resulted in an increase in the predictive accuracy of 0.1-27% across nearly all traits. These findings suggest the potential of combining classical bioinformatics methods with ML techniques to improve genomic prediction in the future.
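The two evaluation measures named in this abstract, the Pearson correlation coefficient and the RMSE between predicted and actual phenotypic values, can be sketched in a few lines. This is an illustrative implementation under standard textbook definitions, not the authors' code; the function names and the toy values are assumptions.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between predicted and observed phenotypes."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed phenotypes."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty, equal-length sequences")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Toy values: every prediction is exactly 0.5 above the observed value.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
r_value = pearson_r(predicted, observed)
error = rmse(predicted, observed)
```

In the toy data the correlation is perfect while the RMSE is 0.5, which illustrates why the two measures are usually reported together: correlation captures ranking accuracy, whereas RMSE also penalizes a systematic offset.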
Affiliation(s)
- Congjiao Sun
- State Key Laboratory of Animal Biotech Breeding and Frontiers Science Center for Molecular Design Breeding (MOE), China Agricultural University, Beijing 100193, China; (X.L.); (X.C.); (Q.W.); (N.Y.)
4
Cortés AJ. Abiotic Stress Tolerance Boosted by Genetic Diversity in Plants. Int J Mol Sci 2024; 25:5367. [PMID: 38791404 PMCID: PMC11121514 DOI: 10.3390/ijms25105367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 03/14/2024] [Indexed: 05/26/2024]
Abstract
Plant breeding [...].
Affiliation(s)
- Andrés J. Cortés
- Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Km 7 vía Rionegro—Las Palmas, Rionegro 054048, Colombia;
- Facultad de Ciencias Agrarias—de Ciencias Forestales, Universidad Nacional de Colombia—Sede Medellín, Medellín 050034, Colombia
- Department of Plant Breeding, Swedish University of Agricultural Sciences, Lomma 23436, Sweden
5
Sandell FL, Holzweber T, Street NR, Dohm JC, Himmelbauer H. Genomic basis of seed colour in quinoa inferred from variant patterns using extreme gradient boosting. Plant Biotechnology Journal 2024; 22:1312-1324. [PMID: 38213076 PMCID: PMC11022794 DOI: 10.1111/pbi.14267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Revised: 11/03/2023] [Accepted: 11/28/2023] [Indexed: 01/13/2024]
Abstract
Quinoa is an agriculturally important crop species originally domesticated in the Andes of central South America. One of its most important phenotypic traits is seed colour. Seed colour variation is determined by contrasting abundance of betalains, a class of strong antioxidant and free radicals scavenging colour pigments only found in plants of the order Caryophyllales. However, the genetic basis for these pigments in seeds remains to be identified. Here we demonstrate the application of machine learning (extreme gradient boosting) to identify genetic variants predictive of seed colour. We show that extreme gradient boosting outperforms the classical genome-wide association approach. We provide re-sequencing and phenotypic data for 156 South American quinoa accessions and identify candidate genes potentially controlling betalain content in quinoa seeds. Genes identified include novel cytochrome P450 genes and known members of the betalain synthesis pathway, as well as genes annotated as being involved in seed development. Our work showcases the power of modern machine learning methods to extract biologically meaningful information from large sequencing data sets.
Affiliation(s)
- Felix L. Sandell
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
- Thomas Holzweber
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
- Nathaniel R. Street
- Department of Plant Physiology, Umeå Plant Science Centre, Umeå University, Umeå, Sweden
- SciLifeLab, Umeå University, Umeå, Sweden
- Juliane C. Dohm
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
- Heinz Himmelbauer
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
6
Chang-Brahim I, Koppensteiner LJ, Beltrame L, Bodner G, Saranti A, Salzinger J, Fanta-Jende P, Sulzbachner C, Bruckmüller F, Trognitz F, Samad-Zamini M, Zechner E, Holzinger A, Molin EM. Reviewing the essential roles of remote phenotyping, GWAS and explainable AI in practical marker-assisted selection for drought-tolerant winter wheat breeding. Frontiers in Plant Science 2024; 15:1319938. [PMID: 38699541 PMCID: PMC11064034 DOI: 10.3389/fpls.2024.1319938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 03/13/2024] [Indexed: 05/05/2024]
Abstract
Marker-assisted selection (MAS) plays a crucial role in crop breeding improving the speed and precision of conventional breeding programmes by quickly and reliably identifying and selecting plants with desired traits. However, the efficacy of MAS depends on several prerequisites, with precise phenotyping being a key aspect of any plant breeding programme. Recent advancements in high-throughput remote phenotyping, facilitated by unmanned aerial vehicles coupled to machine learning, offer a non-destructive and efficient alternative to traditional, time-consuming, and labour-intensive methods. Furthermore, MAS relies on knowledge of marker-trait associations, commonly obtained through genome-wide association studies (GWAS), to understand complex traits such as drought tolerance, including yield components and phenology. However, GWAS has limitations that artificial intelligence (AI) has been shown to partially overcome. Additionally, AI and its explainable variants, which ensure transparency and interpretability, are increasingly being used as recognised problem-solving tools throughout the breeding process. Given these rapid technological advancements, this review provides an overview of state-of-the-art methods and processes underlying each MAS, from phenotyping, genotyping and association analyses to the integration of explainable AI along the entire workflow. In this context, we specifically address the challenges and importance of breeding winter wheat for greater drought tolerance with stable yields, as regional droughts during critical developmental stages pose a threat to winter wheat production. Finally, we explore the transition from scientific progress to practical implementation and discuss ways to bridge the gap between cutting-edge developments and breeders, expediting MAS-based winter wheat breeding for drought tolerance.
Affiliation(s)
- Ignacio Chang-Brahim
- Unit Bioresources, Center for Health & Bioresources, AIT Austrian Institute of Technology, Tulln, Austria
- Lorenzo Beltrame
- Unit Assistive and Autonomous Systems, Center for Vision, Automation & Control, AIT Austrian Institute of Technology, Vienna, Austria
- Gernot Bodner
- Department of Crop Sciences, Institute of Agronomy, University of Natural Resources and Life Sciences Vienna, Tulln, Austria
- Anna Saranti
- Human-Centered AI Lab, Department of Forest- and Soil Sciences, Institute of Forest Engineering, University of Natural Resources and Life Sciences Vienna, Vienna, Austria
- Jules Salzinger
- Unit Assistive and Autonomous Systems, Center for Vision, Automation & Control, AIT Austrian Institute of Technology, Vienna, Austria
- Phillipp Fanta-Jende
- Unit Assistive and Autonomous Systems, Center for Vision, Automation & Control, AIT Austrian Institute of Technology, Vienna, Austria
- Christoph Sulzbachner
- Unit Assistive and Autonomous Systems, Center for Vision, Automation & Control, AIT Austrian Institute of Technology, Vienna, Austria
- Felix Bruckmüller
- Unit Assistive and Autonomous Systems, Center for Vision, Automation & Control, AIT Austrian Institute of Technology, Vienna, Austria
- Friederike Trognitz
- Unit Bioresources, Center for Health & Bioresources, AIT Austrian Institute of Technology, Tulln, Austria
- Elisabeth Zechner
- Verein zur Förderung einer nachhaltigen und regionalen Pflanzenzüchtung, Zwettl, Austria
- Andreas Holzinger
- Human-Centered AI Lab, Department of Forest- and Soil Sciences, Institute of Forest Engineering, University of Natural Resources and Life Sciences Vienna, Vienna, Austria
- Eva M. Molin
- Unit Bioresources, Center for Health & Bioresources, AIT Austrian Institute of Technology, Tulln, Austria
- Human-Centered AI Lab, Department of Forest- and Soil Sciences, Institute of Forest Engineering, University of Natural Resources and Life Sciences Vienna, Vienna, Austria
7
Egebjerg JM, Szomek M, Thaysen K, Juhl AD, Kozakijevic S, Werner S, Pratsch C, Schneider G, Kapishnikov S, Ekman A, Röttger R, Wüstner D. Automated quantification of vacuole fusion and lipophagy in Saccharomyces cerevisiae from fluorescence and cryo-soft X-ray microscopy data using deep learning. Autophagy 2024; 20:902-922. [PMID: 37908116 PMCID: PMC11062380 DOI: 10.1080/15548627.2023.2270378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract
During starvation in the yeast Saccharomyces cerevisiae vacuolar vesicles fuse and lipid droplets (LDs) can become internalized into the vacuole in an autophagic process named lipophagy. There is a lack of tools to quantitatively assess starvation-induced vacuole fusion and lipophagy in intact cells with high resolution and throughput. Here, we combine soft X-ray tomography (SXT) with fluorescence microscopy and use a deep-learning computational approach to visualize and quantify these processes in yeast. We focus on yeast homologs of mammalian NPC1 (NPC intracellular cholesterol transporter 1; Ncr1 in yeast) and NPC2 proteins, whose dysfunction leads to Niemann Pick type C (NPC) disease in humans. We developed a convolutional neural network (CNN) model which classifies fully fused versus partially fused vacuoles based on fluorescence images of stained cells. This CNN, named Deep Yeast Fusion Network (DYFNet), revealed that cells lacking Ncr1 (ncr1∆ cells) or Npc2 (npc2∆ cells) have a reduced capacity for vacuole fusion. Using a second CNN model, we implemented a pipeline named LipoSeg to perform automated instance segmentation of LDs and vacuoles from high-resolution reconstructions of X-ray tomograms. From that, we obtained 3D renderings of LDs inside and outside of the vacuole in a fully automated manner and additionally measured droplet volume, number, and distribution. We find that ncr1∆ and npc2∆ cells could ingest LDs into vacuoles normally but showed compromised degradation of LDs and accumulation of lipid vesicles inside vacuoles. 
Our new method is versatile and allows for analysis of vacuole fusion, droplet size and lipophagy in intact cells. Abbreviations: BODIPY493/503: 4,4-difluoro-1,3,5,7,8-pentamethyl-4-bora-3a,4a-diaza-s-Indacene; BPS: bathophenanthrolinedisulfonic acid disodium salt hydrate; CNN: convolutional neural network; DHE, dehydroergosterol; npc2∆, yeast deficient in Npc2; DSC, Dice similarity coefficient; EM, electron microscopy; EVs, extracellular vesicles; FIB-SEM, focused ion beam milling-scanning electron microscopy; FM 4-64, N-(3-triethylammoniumpropyl)-4-(6-[4-{diethylamino} phenyl] hexatrienyl)-pyridinium dibromide; LDs, lipid droplets; Ncr1, yeast homolog of human NPC1 protein; ncr1∆, yeast deficient in Ncr1; NPC, Niemann Pick type C; NPC2, Niemann Pick type C homolog; OD600, optical density at 600 nm; ReLU, rectified linear unit; PPV, positive predictive value; NPV, negative predictive value; MCC, Matthews correlation coefficient; SXT, soft X-ray tomography; UV, ultraviolet; YPD, yeast extract peptone dextrose.
Affiliation(s)
- Jacob Marcus Egebjerg
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense M, Denmark
- Maria Szomek
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
- Katja Thaysen
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
- Alice Dupont Juhl
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
- Suzana Kozakijevic
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
- Stephan Werner
- Department of X‑Ray Microscopy, Helmholtz-Zentrum Berlin, Germany and Humboldt-Universität zu Berlin, Institut für Physik, Berlin, Germany
- Christoph Pratsch
- Department of X‑Ray Microscopy, Helmholtz-Zentrum Berlin, Germany and Humboldt-Universität zu Berlin, Institut für Physik, Berlin, Germany
- Gerd Schneider
- Department of X‑Ray Microscopy, Helmholtz-Zentrum Berlin, Germany and Humboldt-Universität zu Berlin, Institut für Physik, Berlin, Germany
- Sergey Kapishnikov
- SiriusXT, 9A Holly Ave. Stillorgan Industrial Park, Blackrock, Co, Dublin, Ireland
- Axel Ekman
- Department of Biological and Environmental Science and Nanoscience Centre, University of Jyväskylä, Jyväskylä, Finland
- Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense M, Denmark
- Daniel Wüstner
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark
8
Zhou W, Yan Z, Zhang L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci Rep 2024; 14:5905. [PMID: 38467662 PMCID: PMC10928191 DOI: 10.1038/s41598-024-55243-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 02/21/2024] [Indexed: 03/13/2024]
Abstract
To explore a robust tool for advancing digital breeding practices through an artificial intelligence-driven phenotype prediction expert system, we undertook a thorough analysis of 11 non-linear regression models. Our investigation specifically emphasized the significance of Support Vector Regression (SVR) and SHapley Additive exPlanations (SHAP) in predicting soybean branching. By using branching data (phenotype) of 1918 soybean accessions and 42 k SNP (Single Nucleotide Polymorphism) polymorphic data (genotype), this study systematically compared 11 non-linear regression AI models, including four deep learning models (DBN (deep belief network) regression, ANN (artificial neural network) regression, Autoencoders regression, and MLP (multilayer perceptron) regression) and seven machine learning models (SVR (support vector regression), XGBoost (eXtreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, GPs (Gaussian processes) regression, Decision Tree regression, and Polynomial regression). After evaluation with four metrics, R2 (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error), it was found that the SVR, Polynomial Regression, DBN, and Autoencoder outperformed other models and could obtain a better prediction accuracy when they were used for phenotype prediction. In the assessment of deep learning approaches, we exemplified the SVR model, conducting analyses on feature importance and gene ontology (GO) enrichment to provide comprehensive support. After comprehensively comparing four feature importance algorithms, no notable distinction was observed in the feature importance ranking scores across the four algorithms, namely Variable Ranking, Permutation, SHAP, and Correlation Matrix, but the SHAP value could provide rich information on genes with negative contributions, and SHAP importance was chosen for feature selection. 
The results of this study offer valuable insights into AI-mediated plant breeding, addressing challenges faced by traditional breeding programs. The method developed has broad applicability in phenotype prediction, minor QTL (quantitative trait loci) mining, and plant smart-breeding systems, contributing significantly to the advancement of AI-based breeding practices and transitioning from experience-based to data-based breeding.
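The four evaluation metrics named in this abstract (R2, MAE, MSE, and MAPE) have standard definitions, sketched below for reference. This is an illustration under those textbook definitions rather than the authors' implementation; the function name `regression_metrics` and the toy values are assumptions.

```python
def regression_metrics(y_true, y_pred):
    """Compute R-squared, MAE, MSE and MAPE (in percent) for one model."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("need two non-empty, equal-length sequences")
    if any(t == 0 for t in y_true):
        # MAPE divides by the true value, so it is undefined at zero.
        raise ValueError("MAPE is undefined when a true value is zero")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for constant true values")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "mse": ss_res / n,
        "mape": 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy phenotype values and predictions.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
```

Because MAPE divides by the observed value, it cannot be computed for observations equal to zero, so this sketch raises an error in that case instead of returning a misleading number.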
Affiliation(s)
- Wei Zhou
- Florida Agricultural and Mechanical University, Tallahassee, FL, 32307, USA
- Zhengxiao Yan
- Florida State University, Tallahassee, FL, 32306, USA
- Liting Zhang
- Florida State University, Tallahassee, FL, 32306, USA
9
Kerruish DWM, Cormican P, Kenny EM, Kearns J, Colgan E, Boulton CA, Stelma SNE. The origins of the Guinness stout yeast. Commun Biol 2024; 7:68. [PMID: 38216745 PMCID: PMC10786833 DOI: 10.1038/s42003-023-05587-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Beer is made via the fermentation of an aqueous extract predominantly composed of malted barley flavoured with hops. The transforming microorganism is typically a single strain of Saccharomyces cerevisiae, and for the majority of major beer brands the yeast strain is a unique component. The present yeast used to make Guinness stout brewed in Dublin, Ireland, can be traced back to 1903, but its origins are unknown. To that end, we used Illumina and Nanopore sequencing to generate whole-genome sequencing data for a total of 22 S. cerevisiae yeast strains: 16 from the Guinness collection and 6 other historical Irish brewing yeasts. The origins of the Guinness yeast were determined with a SNP-based analysis, demonstrating that the Guinness strains occupy a distinct group separate from other historical Irish brewing yeasts. Assessment of chromosome number, copy number variation and phenotypic evaluation of key brewing attributes established Guinness yeast-specific SNPs but no specific chromosomal amplifications. Our analysis also demonstrated the effects of yeast storage on phylogeny. Altogether, our results suggest that the Guinness yeast used today is related to the first deposited Guinness yeast, the 1903 Watling Laboratory Guinness yeast.
Affiliation(s)
- Jessica Kearns
- Diageo Ireland, St James's Gate, The Liberties, Dublin, Ireland
- Eibhlin Colgan
- Diageo Ireland, St James's Gate, The Liberties, Dublin, Ireland
10
Bonet D, Levin M, Montserrat DM, Ioannidis AG. Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 2024; 29:404-418. [PMID: 38160295 PMCID: PMC10799683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
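The abstract mentions population-conditional re-sampling without describing the procedure. One plausible reading, offered only as an assumption, is to oversample each underrepresented population group with replacement until every group matches the size of the largest one. The sketch below illustrates that idea; the function `resample_by_population`, the `population_of` accessor, and the toy labels are hypothetical and should not be taken as the authors' method.

```python
import random
from collections import defaultdict


def resample_by_population(samples, population_of, seed=0):
    """Oversample smaller population groups up to the largest group size.

    Assumption: balancing by sampling with replacement within each group.
    The published method may condition on population in a different way.
    """
    if not samples:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups = defaultdict(list)
    for sample in samples:
        groups[population_of(sample)].append(sample)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)  # keep every original sample
        # Draw the shortfall with replacement from the same group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy cohort: three samples from one group, one from another.
cohort = [("id1", "EUR"), ("id2", "EUR"), ("id3", "EUR"), ("id4", "AFR")]
balanced = resample_by_population(cohort, lambda sample: sample[1])
group_sizes = {
    label: sum(1 for _, pop in balanced if pop == label)
    for label in ("EUR", "AFR")
}
```

Re-sampling of this kind should be applied to the training split only, since duplicating samples before a train/test split would leak copies of the same individual into the evaluation set.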
Affiliation(s)
- David Bonet
- Stanford University, Stanford, CA, US
- Universitat Politècnica de Catalunya, Barcelona, Spain
11
Heinrich F, Lange TM, Kircher M, Ramzan F, Schmitt AO, Gültas M. Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet Sel Evol 2023; 55:78. [PMID: 37946104 PMCID: PMC10634161 DOI: 10.1186/s12711-023-00853-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Accepted: 11/02/2023] [Indexed: 11/12/2023]
Abstract
BACKGROUND: The ever-increasing availability of high-density genomic markers in the form of single nucleotide polymorphisms (SNPs) enables genomic prediction, i.e. the inference of phenotypes based solely on genomic data, in the field of animal and plant breeding, where it has become an important tool. However, given the limited number of individuals, the abundance of variables (SNPs) can reduce the accuracy of prediction models due to overfitting or irrelevant SNPs. Feature selection can help to reduce the number of irrelevant SNPs and increase the model performance. In this study, we investigated an incremental feature selection approach based on ranking the SNPs according to the results of a genome-wide association study that we combined with random forest as a prediction model, and we applied it on several animal and plant datasets. RESULTS: Applying our approach to different datasets yielded a wide range of outcomes, i.e. from a substantial increase in prediction accuracy in a few cases to minor improvements when only a fraction of the available SNPs were used. Compared with models using all available SNPs, our approach was able to achieve comparable performances with a considerably reduced number of SNPs in several cases. Our approach showcased state-of-the-art efficiency and performance while having a faster computation time. CONCLUSIONS: The results of our study suggest that our incremental feature selection approach has the potential to improve prediction accuracy substantially. However, this gain seems to depend on the genomic data used. Even for datasets where the number of markers is smaller than the number of individuals, feature selection may still increase the performance of the genomic prediction. Our approach is implemented in R and is available at https://github.com/FelixHeinrich/GP_with_IFS/.
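The incremental feature selection idea described in this abstract (rank SNPs by GWAS result, then grow the SNP set step by step and keep the best-performing size) can be sketched generically. The published implementation is in R and uses random forest; the Python sketch below is not a port of it. The callback `score_subset` is a hypothetical stand-in for whatever model evaluation is used, for example cross-validated random forest accuracy, and the toy scoring function is invented purely for illustration.

```python
def incremental_feature_selection(p_values, score_subset, step=1):
    """Grow a SNP subset in GWAS-rank order and keep the best-scoring size.

    p_values: dict mapping SNP name to GWAS p-value (smaller = stronger).
    score_subset: callable taking a list of SNP names and returning a
        score where higher is better (assumed, e.g. cross-validated accuracy).
    step: number of SNPs added per increment.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    ranked = sorted(p_values, key=p_values.get)  # most significant first
    best_subset, best_score = [], float("-inf")
    # The final iteration always covers the full ranked list.
    for size in range(step, len(ranked) + step, step):
        subset = ranked[:size]
        score = score_subset(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy example: two informative SNPs, with a small penalty per SNP used.
gwas_p = {"snp_a": 1e-8, "snp_b": 0.2, "snp_c": 1e-5, "snp_d": 0.7}
informative = {"snp_a", "snp_c"}


def toy_score(subset):
    return sum(1 for snp in subset if snp in informative) - 0.1 * len(subset)


selected, selected_score = incremental_feature_selection(gwas_p, toy_score)
```

Passing the evaluation step in as a callback keeps the ranking and search logic independent of the prediction model, so the same loop could wrap a random forest, a linear model, or any other scorer.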
Affiliation(s)
- Felix Heinrich
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Thomas Martin Lange
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Magdalena Kircher
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Bünteweg 17p, 30559, Hannover, Germany
- Faisal Ramzan
- Institute of Animal and Dairy Sciences, University of Agriculture Faisalabad, Jail Road, 38000, Faisalabad, Pakistan
- Armin Otto Schmitt
- Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075, Göttingen, Germany
- Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
- Mehmet Gültas
- Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075, Göttingen, Germany
- Faculty of Agriculture, South Westphalia University of Applied Sciences, 59494, Soest, Germany
12
Bonet D, Levin M, Montserrat DM, Ioannidis AG. Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.12.561949. [PMID: 37904983 PMCID: PMC10614800 DOI: 10.1101/2023.10.12.561949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
Affiliation(s)
- David Bonet
  - Stanford University, Stanford, CA, US
  - Universitat Politècnica de Catalunya, Barcelona, Spain
- May Levin
  - Stanford University, Stanford, CA, US
- Alexander G Ioannidis
  - Stanford University, Stanford, CA, US
  - University of California Santa Cruz, Santa Cruz, CA, US
13
Verplaetse N, Passemiers A, Arany A, Moreau Y, Raimondi D. Large sample size and nonlinear sparse models outline epistatic effects in inflammatory bowel disease. Genome Biol 2023; 24:224. [PMID: 37798735 PMCID: PMC10552306 DOI: 10.1186/s13059-023-03064-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 09/20/2023] [Indexed: 10/07/2023]
Abstract
BACKGROUND Despite clear evidence of nonlinear interactions in the molecular architecture of polygenic diseases, linear models have so far appeared optimal in genotype-to-phenotype modeling. A key bottleneck for such modeling is that genetic data intrinsically suffers from underdetermination ([Formula: see text]). Millions of variants are present in each individual while the collection of large, homogeneous cohorts is hindered by phenotype incidence, sequencing cost, and batch effects. RESULTS We demonstrate that when we provide enough training data and control the complexity of nonlinear models, a neural network outperforms additive approaches in whole exome sequencing-based inflammatory bowel disease case-control prediction. To do so, we propose a biologically meaningful sparsified neural network architecture, providing empirical evidence for positive and negative epistatic effects present in the inflammatory bowel disease pathogenesis. CONCLUSIONS In this paper, we show that underdetermination is likely a major driver for the apparent optimality of additive modeling in clinical genetics today.
Affiliation(s)
- Nora Verplaetse
  - Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
- Antoine Passemiers
  - Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
- Adam Arany
  - Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
- Yves Moreau
  - Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
- Daniele Raimondi
  - Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
14
Sadeqi MB, Ballvora A, Dadshani S, Léon J. Genetic Parameter and Hyper-Parameter Estimation Underlie Nitrogen Use Efficiency in Bread Wheat. Int J Mol Sci 2023; 24:14275. [PMID: 37762585 PMCID: PMC10531695 DOI: 10.3390/ijms241814275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/07/2023] [Accepted: 09/14/2023] [Indexed: 09/29/2023]
Abstract
Estimation and prediction play a key role in breeding programs. Currently, phenotyping of complex traits such as nitrogen use efficiency (NUE) in wheat is still expensive, requires high-throughput technologies and is very time consuming compared to genotyping. Therefore, researchers are trying to predict phenotypes based on marker information. Genetic parameters such as population structure, genomic relationship matrix, marker density and sample size are major factors that increase the performance and accuracy of a model. However, they play an important role in adjusting the statistically significant false discovery rate (FDR) threshold in estimation. In parallel, there are many genetic hyper-parameters that are hidden and not represented in the given genomic selection (GS) model but have significant effects on the results, such as panel size, number of markers, minor allele frequency, number of call rates for each marker, number of cross validations and batch size in the training set of the genomic file. The main challenge is to ensure the reliability and accuracy of predicted breeding values (BVs) as results. Our study has confirmed the results of bias-variance tradeoff and adaptive prediction error for the ensemble-learning-based model STACK, which has the highest performance when estimating genetic parameters and hyper-parameters in a given GS model compared to other models.
Affiliation(s)
- Mohammad Bahman Sadeqi
  - INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany
- Agim Ballvora
  - INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany
- Said Dadshani
  - INRES-Plant Nutrition, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany
- Jens Léon
  - INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany
15
Duc NT, Ramlal A, Rajendran A, Raju D, Lal SK, Kumar S, Sahoo RN, Chinnusamy V. Image-based phenotyping of seed architectural traits and prediction of seed weight using machine learning models in soybean. FRONTIERS IN PLANT SCIENCE 2023; 14:1206357. [PMID: 37771485 PMCID: PMC10523016 DOI: 10.3389/fpls.2023.1206357] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/15/2023] [Accepted: 07/26/2023] [Indexed: 09/30/2023]
Abstract
Among seed attributes, weight is one of the main factors determining the soybean harvest index. Recently, the focus of soybean breeding has shifted to improving seed size and weight for crop optimization in terms of seed and oil yield. With recent technological advancements, there is an increasing application of imaging sensors that provide simple, real-time, non-destructive, and inexpensive image data for rapid image-based prediction of seed traits in plant breeding programs. The present work is related to digital image analysis of seed traits for the prediction of hundred-seed weight (HSW) in soybean. The image-based seed architectural traits (i-traits) measured were area size (AS), perimeter length (PL), length (L), width (W), length-to-width ratio (LWR), intersection of length and width (IS), seed circularity (CS), and distance between IS and CG (DS). The phenotypic investigation revealed significant genetic variability among 164 soybean genotypes for both i-traits and manually measured seed weight. Seven popular machine learning (ML) algorithms, namely Simple Linear Regression (SLR), Multiple Linear Regression (MLR), Random Forest (RF), Support Vector Regression (SVR), LASSO Regression (LR), Ridge Regression (RR), and Elastic Net Regression (EN), were used to create models that can predict the weight of soybean seeds based on the image-based novel features derived from the Red-Green-Blue (RGB)/visual image. Among the models, random forest and multiple linear regression models that use multiple explanatory variables related to seed size traits (AS, L, W, and DS) were identified as the best models for predicting seed weight with the highest prediction accuracy (coefficient of determination, R2=0.98 and 0.94, respectively) and the lowest prediction error, i.e., root mean square error (RMSE) and mean absolute error (MAE). 
Finally, principal components analysis (PCA) and a hierarchical clustering approach were used to identify IC538070 as a superior genotype with a larger seed size and weight. The identified donors/traits can potentially be used in soybean improvement programs.
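The three evaluation metrics named in this abstract (coefficient of determination, RMSE, and MAE) can be computed as in the sketch below. It assumes R2 is defined as one minus the ratio of residual to total sum of squares; the abstract does not state which definition the authors used, and the function name and example values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE and MAE for paired observed/predicted values.

    Assumption: R2 = 1 - SS_res / SS_tot (not the squared correlation).
    """
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up hundred-seed weights (g) and model predictions, for illustration only.
metrics = regression_metrics([10, 12, 14, 16], [11, 12, 13, 16])
print(metrics)
```

RMSE penalises large errors more heavily than MAE, so reporting both gives a rough sense of whether a model's error is dominated by a few poorly predicted genotypes.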
Affiliation(s)
- Nguyen Trung Duc
  - Division of Plant Physiology, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
  - Vietnam National University of Agriculture, Hanoi, Vietnam
- Ayyagari Ramlal
  - Division of Genetics, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
  - School of Biological Sciences, Universiti Sains Malaysia (USM), Georgetown, Penang, Malaysia
- Ambika Rajendran
  - Division of Genetics, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
- Dhandapani Raju
  - Division of Plant Physiology, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
- S. K. Lal
  - Division of Genetics, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
- Sudhir Kumar
  - Division of Plant Physiology, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
- Rabi Narayan Sahoo
  - Division of Agricultural Physics, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
- Viswanathan Chinnusamy
  - Division of Plant Physiology, Indian Council of Agricultural Research-Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India
16
Kovuri P, Yadav A, Sinha H. Role of genetic architecture in phenotypic plasticity. Trends Genet 2023; 39:703-714. [PMID: 37173192 DOI: 10.1016/j.tig.2023.04.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/22/2022] [Revised: 04/06/2023] [Accepted: 04/11/2023] [Indexed: 05/15/2023]
Abstract
Phenotypic plasticity, the ability of an organism to display different phenotypes across environments, is widespread in nature. Plasticity aids survival in novel environments. Herein, we review studies from yeast that allow us to start uncovering the genetic architecture of phenotypic plasticity. Genetic variants and their interactions impact the phenotype in different environments, and distinct environments modulate the impact of genetic variants and their interactions on the phenotype. Because of this, certain hidden genetic variation is expressed in specific genetic and environmental backgrounds. A better understanding of the genetic mechanisms of phenotypic plasticity will help to determine short- and long-term responses to selection and how wide variation in disease manifestation occurs in human populations.
Affiliation(s)
- Purnima Kovuri
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai, India; Centre for Integrative Biology and Systems mEdicine (IBSE), IIT Madras, Chennai, India; Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
- Anupama Yadav
  - Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, USA; Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
- Himanshu Sinha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai, India; Centre for Integrative Biology and Systems mEdicine (IBSE), IIT Madras, Chennai, India; Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
17
Xu B, Meng R, Chen G, Liang L, Lv Z, Zhou L, Sun R, Zhao F, Yang W. Improved weed mapping in corn fields by combining UAV-based spectral, textural, structural, and thermal measurements. PEST MANAGEMENT SCIENCE 2023; 79:2591-2602. [PMID: 36883563 DOI: 10.1002/ps.7443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Revised: 01/20/2023] [Accepted: 03/08/2023] [Indexed: 06/02/2023]
Abstract
BACKGROUND Spatially explicit weed information is critical for controlling weed infestation and reducing corn yield losses. The development of unmanned aerial vehicle (UAV)-based remote sensing presents an unprecedented opportunity for efficient, timely weed mapping. Spectral, textural, and structural measurements have been used for weed mapping, whereas thermal measurements, for example canopy temperature (CT), were seldom considered and used. In this study, we quantified the optimal combination of spectral, textural, structural, and CT measurements based on different machine-learning algorithms for weed mapping. RESULTS CT improved weed-mapping accuracies as complementary information for spectral, textural, and structural features (up to 5% and 0.051 improvements in overall accuracy [OA] and Macro-F1, respectively). The fusion of textural, structural, and thermal features achieved the best performance in weed mapping (OA = 96.4%, Macro-F1 = 0.964), followed by the fusion of structural and thermal features (OA = 93.6%, Macro-F1 = 0.936). The Support Vector Machine-based model achieved the best performance in weed mapping, with 3.5% and 7.1% improvements in OA and 0.036 and 0.071 in Macro-F1 respectively, compared with the best models of Random Forest and Naïve Bayes Classifier. CONCLUSION Thermal measurement can complement other types of remote-sensing measurements and improve the weed-mapping accuracy within the data-fusion framework. Importantly, integrating textural, structural, and thermal features achieved the best performance for weed mapping. Our study provides a novel method for weed mapping using UAV-based multisource remote sensing measurements, which is critical for ensuring crop production in precision agriculture. © 2023 The Authors. Pest Management Science published by John Wiley & Sons Ltd on behalf of Society of Chemical Industry.
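The two scores reported in this abstract, overall accuracy (OA) and the macro-averaged F1, can be computed from paired reference and predicted labels as in the sketch below. The function names and the tiny label set are hypothetical and are not taken from the study.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the reference class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Every class counts equally, regardless of how many samples it has.
    """
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard kept for safety.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for four pixels, for illustration only.
reference = ["weed", "weed", "corn", "corn"]
predicted = ["weed", "corn", "corn", "corn"]
print(overall_accuracy(reference, predicted), macro_f1(reference, predicted))
```

Because the macro average weights classes equally, it can diverge from OA when one class (for example, crop pixels) dominates the map.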
Affiliation(s)
- Binyuan Xu
  - College of Resources and Environment, Huazhong Agricultural University, Wuhan, China
- Ran Meng
  - College of Resources and Environment, Huazhong Agricultural University, Wuhan, China
  - HIT Institute for Artificial Intelligence Co. Ltd, Harbin, China
- Gengshen Chen
  - National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, China
- Linlin Liang
  - Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
- Zhengang Lv
  - College of Resources and Environment, Huazhong Agricultural University, Wuhan, China
- Longfei Zhou
  - College of Resources and Environment, Huazhong Agricultural University, Wuhan, China
- Rui Sun
  - College of Resources and Environment, Huazhong Agricultural University, Wuhan, China
- Feng Zhao
  - Key Laboratory of Geographical Process Analysis & Simulation of Hubei Province, College of Urban and Environmental Sciences, Central China Normal University, Wuhan, China
- Wanneng Yang
  - National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, China
18
Zhao L, Walkowiak S, Fernando WGD. Artificial Intelligence: A Promising Tool in Exploring the Phytomicrobiome in Managing Disease and Promoting Plant Health. PLANTS (BASEL, SWITZERLAND) 2023; 12:plants12091852. [PMID: 37176910 PMCID: PMC10180744 DOI: 10.3390/plants12091852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 04/25/2023] [Accepted: 04/27/2023] [Indexed: 05/15/2023]
Abstract
There is increasing interest in harnessing the microbiome to improve cropping systems. With the availability of high-throughput and low-cost sequencing technologies, gathering microbiome data is becoming more routine. However, the analysis of microbiome data is challenged by the size and complexity of the data, and the incomplete nature of many microbiome databases. Further, to bring microbiome data value, it often needs to be analyzed in conjunction with other complex data that impact on crop health and disease management, such as plant genotype and environmental factors. Artificial intelligence (AI), boosted through deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large complex datasets such as the interplay between the microbiome, crop plants, and their environment. In this review, we aim to provide readers with a brief introduction to AI techniques, and we introduce how AI has been applied to areas of microbiome sequencing taxonomy, the functional annotation for microbiome sequences, associating the microbiome community with host traits, designing synthetic communities, genomic selection, field phenotyping, and disease forecasting. At the end of this review, we proposed further efforts that are required to fully exploit the power of AI in studying phytomicrobiomes.
Affiliation(s)
- Liang Zhao
  - Department of Plant Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
19
Liang M, Cao S, Deng T, Du L, Li K, An B, Du Y, Xu L, Zhang L, Gao X, Li J, Guo P, Gao H. MAK: a machine learning framework improved genomic prediction via multi-target ensemble regressor chains and automatic selection of assistant traits. Brief Bioinform 2023; 24:7031157. [PMID: 36752363 DOI: 10.1093/bib/bbad043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 01/13/2023] [Accepted: 01/20/2023] [Indexed: 02/09/2023]
Abstract
Incorporating the genotypic and phenotypic information of correlated traits into a multi-trait model can significantly improve the prediction accuracy of the target trait in animal and plant breeding, as well as in human genetics. However, in most cases, the phenotypic information of both the correlated traits and the target trait is missing for the individual to be evaluated, particularly for newborns. Therefore, we propose a machine learning framework, MAK, that improves the prediction accuracy of the target trait by constructing multi-target ensemble regression chains and selecting the assistant trait automatically, and that predicts the genomic estimated breeding values of the target trait using genotypic information only. The prediction ability of MAK was significantly more robust than that of genomic best linear unbiased prediction, BayesB, BayesRR and the multi-trait Bayesian method in the four real animal and plant datasets, and MAK was roughly 100 times faster computationally than BayesB and BayesRR.
Affiliation(s)
- Mang Liang
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Sheng Cao
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Tianyu Deng
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Lili Du
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Keanning Li
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Bingxing An
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Yueying Du
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Lingyang Xu
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Lupei Zhang
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Xue Gao
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Junya Li
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
- Huijiang Gao
  - Chinese Academy of Agricultural Sciences Institute of Animal Science
20
Wang W, Guo W, Le L, Yu J, Wu Y, Li D, Wang Y, Wang H, Lu X, Qiao H, Gu X, Tian J, Zhang C, Pu L. Integration of high-throughput phenotyping, GWAS, and predictive models reveals the genetic architecture of plant height in maize. MOLECULAR PLANT 2023; 16:354-373. [PMID: 36447436 DOI: 10.1016/j.molp.2022.11.016] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 05/12/2022] [Revised: 09/05/2022] [Accepted: 11/27/2022] [Indexed: 06/16/2023]
Abstract
Plant height (PH) is an essential trait in maize (Zea mays) that is tightly associated with planting density, biomass, lodging resistance, and grain yield in the field. Dissecting the dynamics of maize plant architecture will be beneficial for ideotype-based maize breeding and prediction, as the genetic basis controlling PH in maize remains largely unknown. In this study, we developed an automated high-throughput phenotyping platform (HTP) to systematically and noninvasively quantify 77 image-based traits (i-traits) and 20 field traits (f-traits) for 228 maize inbred lines across all developmental stages. Time-resolved i-traits with novel digital phenotypes and complex correlations with agronomic traits were characterized to reveal the dynamics of maize growth. An i-trait-based genome-wide association study identified 4945 trait-associated SNPs, 2603 genetic loci, and 1974 corresponding candidate genes. We found that rapid growth of maize plants occurs mainly at two developmental stages, stage 2 (S2) to S3 and S5 to S6, accounting for the final PH indicators. By integrating the PH-association network with the transcriptome profiles of specific internodes, we revealed 13 hub genes that may play vital roles during rapid growth. The candidate genes and novel i-traits identified at multiple growth stages may be used as potential indicators for final PH in maize. One candidate gene, ZmVATE, was functionally validated and shown to regulate PH-related traits in maize using genetic mutation. Furthermore, machine learning was used to build predictive models for final PH based on i-traits, and their performance was assessed across developmental stages. Moderate, strong, and very strong correlations between predictions and experimental datasets were achieved from the early S4 (tenth-leaf) stage. 
Collectively, our study provides a valuable tool for dissecting the spatiotemporal formation of specific internodes and the genetic architecture of PH, as well as resources and predictive models that are useful for molecular design breeding and predicting maize varieties with ideal plant architectures.
Affiliation(s)
- Weixuan Wang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya 572024, China
- Weijun Guo
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Liang Le
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Jia Yu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Yue Wu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Dongwei Li
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Yifan Wang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Huan Wang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Xiaoduo Lu
  - Institute of Molecular Breeding for Maize, Qilu Normal University, Jinan 250200, China
- Hong Qiao
  - Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA; Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Xiaofeng Gu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Jian Tian
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Chunyi Zhang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; Sanya Institute, Hainan Academy of Agricultural Sciences, Sanya 572000, China
- Li Pu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya 572024, China
21
Guo T, Li X. Machine learning for predicting phenotype from genotype and environment. Curr Opin Biotechnol 2023; 79:102853. [PMID: 36463837 DOI: 10.1016/j.copbio.2022.102853] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 10/03/2022] [Revised: 11/01/2022] [Accepted: 11/07/2022] [Indexed: 12/03/2022]
Abstract
Predicting phenotype with genomic and environmental information is critically needed and challenging. Machine learning methods have emerged as powerful tools to make accurate predictions from large and complex biological data. Here, we review the progress of phenotype prediction models enabled or improved by machine learning methods. We categorized the applications into three scenarios: prediction with genotypic information, with environmental information, and with both. In each scenario, we illustrate the practicality of prediction models, the advantages of machine learning, and the challenges of modeling complex relationships. We discuss the promising potential of leveraging machine learning and genetics theories to develop models that can predict phenotype and also interpret the biological consequences of changes in genotype and environment.
Affiliation(s)
- Tingting Guo
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China
- Xianran Li
  - USDA, Agricultural Research Service, Wheat Health, Genetics, and Quality Research Unit, Pullman, WA 99164, USA; Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99164, USA
22
Farooq M, van Dijk AD, Nijveen H, Mansoor S, de Ridder D. Genomic prediction in plants: opportunities for ensemble machine learning based approaches. F1000Res 2023; 11:802. [PMID: 37035464 PMCID: PMC10080209 DOI: 10.12688/f1000research.122437.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/04/2023] [Indexed: 01/12/2023]
Abstract
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
Affiliation(s)
- Muhammad Farooq
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
  - Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Aalt D.J. van Dijk
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Harm Nijveen
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Shahid Mansoor
  - Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Dick de Ridder
  - Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
23
Raimondi D, Orlando G, Verplaetse N, Fariselli P, Moreau Y. Editorial: Towards genome interpretation: Computational methods to model the genotype-phenotype relationship. FRONTIERS IN BIOINFORMATICS 2022; 2:1098941. [PMID: 36530385 PMCID: PMC9749061 DOI: 10.3389/fbinf.2022.1098941] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/15/2022] [Accepted: 11/17/2022] [Indexed: 11/12/2023]
Affiliation(s)
- Piero Fariselli
- Department of Medical Sciences, University of Torino, Torino, Italy
24
Wang K, Yang B, Li Q, Liu S. Systematic Evaluation of Genomic Prediction Algorithms for Genomic Prediction and Breeding of Aquatic Animals. Genes (Basel) 2022; 13:2247. [PMID: 36553514 PMCID: PMC9778314 DOI: 10.3390/genes13122247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Revised: 11/18/2022] [Accepted: 11/25/2022] [Indexed: 12/04/2022] Open
Abstract
The extensive use of genomic selection (GS) in livestock and crops has led to a series of genomic-prediction (GP) algorithms despite the lack of a single algorithm that can suit all the species and traits. A systematic evaluation of available GP algorithms is thus necessary to identify the optimal GP algorithm for selective breeding in aquaculture species. In this study, a systematic comparison of ten GP algorithms, including both traditional and machine-learning algorithms, was conducted using publicly available genotype and phenotype data of eight traits, including weight and disease resistance traits, from five aquaculture species. The study aimed to provide insights into the optimal algorithm for GP in aquatic animals. Notably, no algorithm showed the best performance in all traits. However, reproducing kernel Hilbert space (RKHS) and support-vector machine (SVM) algorithms achieved relatively high prediction accuracies in most of the tested traits. Bayes A and random forest (RF) better prevented noise interference in the phenotypic data compared to the other algorithms. The prediction performances of GP algorithms in the Crassostrea gigas dataset were improved by using a genome-wide association study (GWAS) to select subsets of significant SNPs. An R package, "ASGS," which integrates the commonly used traditional and machine-learning algorithms for efficiently finding the optimal algorithm, was developed to assist the application of genomic selection breeding of aquaculture species. This work provides valuable information and a tool for optimizing algorithms for GP, aiding genetic breeding in aquaculture species.
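The GWAS-based SNP pre-selection mentioned in this abstract can be illustrated with a minimal sketch. It is not the ASGS package or the authors' pipeline: the function names, the toy data and the scoring rule (squared Pearson correlation of each marker with the trait) are assumptions chosen for brevity, whereas a real GWAS would normally use a mixed model that corrects for population structure. The ranking should be computed on training individuals only, so that the selected subset does not leak information from the validation set.

```python
import math


def marker_score(genotypes, phenotypes):
    """Squared Pearson correlation between one SNP (coded 0/1/2) and the trait.

    Hypothetical stand-in for a GWAS test statistic.
    """
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_pp = sum((p - mean_p) ** 2 for p in phenotypes)
    if s_gg == 0 or s_pp == 0:
        return 0.0  # monomorphic marker or constant trait carries no signal
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    return s_gp * s_gp / (s_gg * s_pp)


def select_top_snps(genotype_matrix, phenotypes, k):
    """Return the column indices of the k highest-scoring SNPs.

    genotype_matrix is a list of rows (individuals) by columns (SNPs).
    Ties are broken by column index so the result is deterministic.
    """
    n_snps = len(genotype_matrix[0])
    scored = []
    for j in range(n_snps):
        column = [row[j] for row in genotype_matrix]
        scored.append((marker_score(column, phenotypes), j))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


# Toy training set: 5 individuals, 4 SNPs (the last one is monomorphic).
train_genotypes = [
    [0, 1, 2, 1],
    [1, 1, 0, 1],
    [2, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 1, 0, 1],
]
train_phenotypes = [1.0, 2.1, 3.0, 0.9, 3.2]

selected = select_top_snps(train_genotypes, train_phenotypes, k=2)
print(selected)
```

The selected column indices would then be used to subset both the training and the validation genotypes before fitting whichever prediction algorithm is being evaluated.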
Affiliation(s)
- Kuiqin Wang
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Ben Yang
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Qi Li
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Shikai Liu
- Key Laboratory of Mariculture, Ministry of Education, College of Fisheries, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Correspondence: ; Tel.: +86-0532-82032595
25
Durge AR, Shrimankar DD, Sawarkar AD. Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective. Curr Genomics 2022; 23:299-317. [PMID: 36778194 PMCID: PMC9878859 DOI: 10.2174/1389202923666220927105311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2022] [Revised: 08/29/2022] [Accepted: 09/01/2022] [Indexed: 11/22/2022] Open
Abstract
Genome sequences indicate a wide variety of characteristics, which include species and sub-species type, genotype, diseases, growth indicators, yield quality, etc. To analyze and study the characteristics of the genome sequences across different species, researchers have proposed various deep learning models, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Multilayer Perceptrons (MLPs), etc., which vary in terms of evaluation performance, area of application and species that are processed. Because the algorithmic implementations differ so widely, it becomes difficult for research programmers to select the best possible genome processing model for their application. To facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision and recall. The deep learning and machine learning models presented in the review reach different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) reaches 99.7% accuracy, and for cancer genomic data, the CNN Bayesian method reaches 99.27% accuracy, whereas for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy of 99.95%. Precision and recall of the different models are reviewed in a similar way. Finally, this paper concludes with some interesting observations related to the genomic processing models and recommends applications for their efficient use.
Affiliation(s)
- Aditi R. Durge
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
- Deepti D. Shrimankar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India. Address correspondence to this author at the Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India; Tel: 9860606477; E-mail:
- Ankush D. Sawarkar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
26
Xu Y, Zhang X, Li H, Zheng H, Zhang J, Olsen MS, Varshney RK, Prasanna BM, Qian Q. Smart breeding driven by big data, artificial intelligence, and integrated genomic-enviromic prediction. MOLECULAR PLANT 2022; 15:1664-1695. [PMID: 36081348 DOI: 10.1016/j.molp.2022.09.001] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Received: 04/04/2022] [Revised: 08/20/2022] [Accepted: 09/02/2022] [Indexed: 05/12/2023]
Abstract
The first paradigm of plant breeding involves direct selection-based phenotypic observation, followed by predictive breeding using statistical models for quantitative traits constructed based on genetic experimental design and, more recently, by incorporation of molecular marker genotypes. However, plant performance or phenotype (P) is determined by the combined effects of genotype (G), envirotype (E), and genotype by environment interaction (GEI). Phenotypes can be predicted more precisely by training a model using data collected from multiple sources, including spatiotemporal omics (genomics, phenomics, and enviromics across time and space). Integration of 3D information profiles (G-P-E), each with multidimensionality, provides predictive breeding with both tremendous opportunities and great challenges. Here, we first review innovative technologies for predictive breeding. We then evaluate multidimensional information profiles that can be integrated with a predictive breeding strategy, particularly envirotypic data, which have largely been neglected in data collection and are nearly untouched in model construction. We propose a smart breeding scheme, integrated genomic-enviromic prediction (iGEP), as an extension of genomic prediction, using integrated multiomics information, big data technology, and artificial intelligence (mainly focused on machine and deep learning). We discuss how to implement iGEP, including spatiotemporal models, environmental indices, factorial and spatiotemporal structure of plant breeding data, and cross-species prediction. A strategy is then proposed for prediction-based crop redesign at both the macro (individual, population, and species) and micro (gene, metabolism, and network) scales. Finally, we provide perspectives on translating smart breeding into genetic gain through integrative breeding platforms and open-source breeding initiatives. 
We call for coordinated efforts in smart breeding through iGEP, institutional partnerships, and innovative technological support.
Affiliation(s)
- Yunbi Xu
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China; CIMMYT-China Tropical Maize Research Center, School of Food Science and Engineering, Foshan University, Foshan, Guangdong 528231, China; Peking University Institute of Advanced Agricultural Sciences, Weifang, Shandong 261325, China.
- Xingping Zhang
- Peking University Institute of Advanced Agricultural Sciences, Weifang, Shandong 261325, China
- Huihui Li
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China; National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan 572024, China
- Hongjian Zheng
- CIMMYT-China Specialty Maize Research Center, Shanghai Academy of Agricultural Sciences, Shanghai 201400, China
- Jianan Zhang
- MolBreeding Biotechnology Co., Ltd., Shijiazhuang, Hebei 050035, China
- Michael S Olsen
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Rajeev K Varshney
- State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Australia
- Boddupalli M Prasanna
- CIMMYT (International Maize and Wheat Improvement Center), ICRAF Campus, United Nations Avenue, Nairobi, Kenya
- Qian Qian
- Institute of Crop Sciences, CIMMYT-China, Chinese Academy of Agricultural Sciences, Beijing 100081, China
27
Pedrini S, Doecke JD, Hone E, Wang P, Thota R, Bush AI, Rowe CC, Dore V, Villemagne VL, Ames D, Rainey‐Smith S, Verdile G, Sohrabi HR, Raida MR, Taddei K, Gandy S, Masters CL, Chatterjee P, Martins R. Plasma high-density lipoprotein cargo is altered in Alzheimer's disease and is associated with regional brain volume. J Neurochem 2022; 163:53-67. [PMID: 36000528 PMCID: PMC9804612 DOI: 10.1111/jnc.15681] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/09/2022] [Revised: 07/12/2022] [Accepted: 07/22/2022] [Indexed: 01/05/2023]
Abstract
Cholesterol levels have been repeatedly linked to Alzheimer's Disease (AD), suggesting that high levels could be detrimental, but this effect is likely attributed to Low-Density Lipoprotein (LDL) cholesterol. On the other hand, High-Density Lipoproteins (HDL) cholesterol levels have been associated with reduced brain amyloidosis and improved cognitive function. However, recent findings have suggested that HDL-functionality, which depends upon the HDL-cargo proteins associated with HDL, rather than HDL levels, appears to be the key factor, suggesting a quality over quantity status. In this report, we have assessed the HDL-cargo (Cholesterol, ApoA-I, ApoA-II, ApoC-I, ApoC-III, ApoD, ApoE, ApoH, ApoJ, CRP, and SAA) in stable healthy control (HC), healthy controls who will convert to MCI/AD (HC-Conv) and AD patients (AD). Compared to HC we observed an increased cholesterol/ApoA-I ratio in AD and HC-Conv, as well as an increased ApoD/ApoA-I ratio and a decreased ApoA-II/ApoA-I ratio in AD. Higher cholesterol/ApoA-I ratio was also associated with lower cortical grey matter volume and higher ventricular volume, while higher ApoA-II/ApoA-I and ApoJ/ApoA-I ratios were associated with greater cortical grey matter volume (and for ApoA-II also with greater hippocampal volume) and smaller ventricular volume. Additionally, in a clinical status-independent manner, the ApoE/ApoA-I ratio was significantly lower in APOE ε4 carriers and lowest in APOE ε4 homozygous. Together, these data indicate that in AD patients the composition of HDL is altered, which may affect HDL functionality, and such changes are associated with altered regional brain volumetric data.
Affiliation(s)
- Steve Pedrini
- School of Medical Sciences, Edith Cowan University, Joondalup, Western Australia, Australia; CRC for Mental Health, Melbourne, Victoria, Australia
- James D. Doecke
- Australian E‐Health Research Centre, CSIRO, Brisbane, Queensland, Australia
- Eugene Hone
- School of Medical Sciences, Edith Cowan University, Joondalup, Western Australia, Australia; CRC for Mental Health, Melbourne, Victoria, Australia
- Penghao Wang
- College of Science, Health, Engineering and Education, Murdoch University, Murdoch, Western Australia, Australia
- Rohith Thota
- Faculty of Medicine, Health and Human Sciences, Department of Biomedical Sciences, Macquarie University, Sydney, New South Wales, Australia
- Ashley I. Bush
- CRC for Mental Health, Melbourne, Victoria, Australia; The Florey Institute, The University of Melbourne, Parkville, Victoria, Australia
- Christopher C. Rowe
- Department of Nuclear Medicine and Centre for PET, Austin Health, Heidelberg, Victoria, Australia
- Vincent Dore
- Department of Nuclear Medicine and Centre for PET, Austin Health, Heidelberg, Victoria, Australia
- David Ames
- National Ageing Research Institute, Parkville, Victoria, Australia; University of Melbourne Academic unit for Psychiatry of Old Age, St George's Hospital, Kew, Victoria, Australia
- Stephanie Rainey‐Smith
- School of Medical Sciences, Edith Cowan University, Joondalup, Western Australia, Australia; Centre for Healthy Ageing, Health Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
- Giuseppe Verdile
- Curtin Medical School, Curtin University, Bentley, Western Australia, Australia; Curtin Health Innovation Research Institute, Curtin University, Bentley, Western Australia, Australia
- Hamid R. Sohrabi
- Centre for Healthy Ageing, Health Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
- Manfred R. Raida
- Life Science Institute, Singapore Lipidomics Incubator, National University of Singapore, Singapore City, Singapore
- Kevin Taddei
- School of Medical Sciences, Edith Cowan University, Joondalup, Western Australia, Australia; CRC for Mental Health, Melbourne, Victoria, Australia
- Sam Gandy
- Department of Neurology, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Colin L. Masters
- The Florey Institute, The University of Melbourne, Parkville, Victoria, Australia
- Pratishtha Chatterjee
- Faculty of Medicine, Health and Human Sciences, Department of Biomedical Sciences, Macquarie University, Sydney, New South Wales, Australia
- Ralph N. Martins
- School of Medical Sciences, Edith Cowan University, Joondalup, Western Australia, Australia; CRC for Mental Health, Melbourne, Victoria, Australia; Faculty of Medicine, Health and Human Sciences, Department of Biomedical Sciences, Macquarie University, Sydney, New South Wales, Australia; School of Psychiatry and Clinical Neurosciences, University of Western Australia, Crawley, Western Australia, Australia
28
Ayat M, Domaratzki M. Sparse bayesian learning for genomic selection in yeast. FRONTIERS IN BIOINFORMATICS 2022; 2:960889. [PMID: 36304259 PMCID: PMC9580947 DOI: 10.3389/fbinf.2022.960889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Accepted: 08/02/2022] [Indexed: 11/13/2022] Open
Abstract
Genomic selection, which predicts phenotypes such as yield and drought resistance in crops from high-density markers positioned throughout the genome of the varieties, is moving towards machine learning techniques to make predictions on complex traits that are controlled by several genes. In this paper, we consider sparse Bayesian learning and ensemble learning as a technique for genomic selection and ranking markers based on their relevance to a trait. We define and explore two different forms of the sparse Bayesian learning for predicting phenotypes and identifying the most influential markers of a trait, respectively. We apply our methods on a Saccharomyces cerevisiae dataset, and analyse our results with respect to existing related works, trait heritability, as well as the accuracies obtained from linear and Gaussian kernel functions. We find that sparse Bayesian methods are not only competitive with other machine learning methods in predicting yeast growth in different environments, but are also capable of identifying the most important markers, including both positive and negative effects on the growth, from which biologists can get insight. This attribute can make our proposed ensemble of sparse Bayesian learners favourable in ranking markers based on their relevance to a trait.
Affiliation(s)
- Maryam Ayat
- Lactanet, Sainte-Anne-de-Bellevue, QC, Canada
- Mike Domaratzki
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- *Correspondence: Mike Domaratzki,
29
Farooq M, van Dijk AD, Nijveen H, Mansoor S, de Ridder D. Genomic prediction in plants: opportunities for ensemble machine learning based approaches. F1000Res 2022; 11:802. [PMID: 37035464 PMCID: PMC10080209 DOI: 10.12688/f1000research.122437.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Accepted: 07/08/2022] [Indexed: 12/15/2022] Open
Abstract
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
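The idea of simulating traits whose complexity is set by the number of QTNs and the heritability can be sketched as follows. This is only an illustration under stated assumptions, not the simulation design used in the study: the function name `simulate_trait`, the unlinked 0/1/2 genotypes, the normally distributed QTN effects and the purely additive (linear) model are all choices made here for brevity. A nonlinear phenotype would additionally need interaction terms between loci.

```python
import math
import random
import statistics


def simulate_trait(n_individuals, n_snps, n_qtn, h2, seed=0):
    """Simulate an additive trait with n_qtn causal SNPs and heritability h2.

    Hypothetical helper: genotypes are independent 0/1/2 codes, so population
    structure and linkage disequilibrium are deliberately ignored.
    """
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    genotypes = [
        [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        for _ in range(n_individuals)
    ]
    # Pick the causal markers and draw their additive effects.
    qtn_index = rng.sample(range(n_snps), n_qtn)
    effects = [rng.gauss(0.0, 1.0) for _ in qtn_index]
    genetic = [
        sum(b * row[j] for b, j in zip(effects, qtn_index)) for row in genotypes
    ]
    # Rescale the noise so that var_g / (var_g + var_e) equals h2 in this sample.
    var_g = statistics.pvariance(genetic)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]
    target_var_e = var_g * (1.0 - h2) / h2
    scale = math.sqrt(target_var_e / statistics.pvariance(noise))
    noise = [e * scale for e in noise]
    phenotypes = [g + e for g, e in zip(genetic, noise)]
    return genotypes, qtn_index, genetic, noise, phenotypes


genotypes, qtn_index, genetic, noise, phenotypes = simulate_trait(
    n_individuals=200, n_snps=50, n_qtn=5, h2=0.5, seed=1
)
var_g = statistics.pvariance(genetic)
var_e = statistics.pvariance(noise)
print(round(var_g / (var_g + var_e), 3))
```

Varying `n_qtn` and `h2` in such a simulation gives a family of datasets on which parametric and ensemble methods could be compared. The variance of the simulated phenotype itself can deviate slightly from `var_g + var_e` because the genetic and noise components are only approximately uncorrelated in a finite sample.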
Affiliation(s)
- Muhammad Farooq
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Aalt D.J. van Dijk
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Harm Nijveen
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
- Shahid Mansoor
- Molecular Virology and Gene Silencing Lab, Agricultural Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Punjab, 38000, Pakistan
- Dick de Ridder
- Bioinformatics group, Department of Plant Science, Wageningen University and Research, Wageningen, Gelderland, 6708PB, The Netherlands
30
Imbalanced regression using regressor-classifier ensembles. Mach Learn 2022. [DOI: 10.1007/s10994-022-06199-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
We present an extension to the federated ensemble regression using classification algorithm, an ensemble learning algorithm for regression problems which leverages the distribution of the samples in a learning set to achieve improved performance. We evaluated the extension using four classifiers and four regressors, two discretizers, and 119 responses from a wide variety of datasets in different domains. Additionally, we compared our algorithm to two resampling methods aimed at addressing imbalanced datasets. Our results show that the proposed extension is highly unlikely to perform worse than the base case, and on average outperforms the two resampling methods with significant differences in performance.
31
Zhang Q, Zhang Q, Jensen J. Association Studies and Genomic Prediction for Genetic Improvements in Agriculture. FRONTIERS IN PLANT SCIENCE 2022; 13:904230. [PMID: 35720549 PMCID: PMC9201771 DOI: 10.3389/fpls.2022.904230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2022] [Accepted: 05/16/2022] [Indexed: 06/15/2023]
Abstract
To feed the fast-growing global population with sufficient food using limited global resources, it is urgent to develop and utilize cutting-edge technologies and improve the efficiency of agricultural production. In this review, we specifically introduce the concepts, theories, methods, applications and future implications of association studies and of predicting unknown genetic value or future phenotypic events using genomics in agricultural breeding. Genome-wide association studies can identify the quantitative genetic loci associated with phenotypes of importance in agriculture, while genomic prediction utilizes individual genetic value to rank selection candidates to improve the next generation of plants or animals. These technologies and methods have improved the efficiency of genetic improvement programs for agricultural production via elite animal breeds and plant varieties. With the development of new data acquisition technologies, more and more data will be collected from high-throughput technologies to assist agricultural breeding. It will be crucial to extract useful information from these large amounts of data, and to meet this challenge, more efficient algorithms need to be developed and utilized for analyzing these data. Such development will require knowledge from multiple disciplines of research.
Affiliation(s)
- Qianqian Zhang
- Institute of Biotechnology, Beijing Academy of Agricultural and Forestry Sciences, Beijing, China
- Qin Zhang
- College of Animal Science and Technology, Shandong Agricultural University, Taian, China
- College of Animal Science and Technology, China Agricultural University, Beijing, China
- Just Jensen
- Centre for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
32
Wang W, Cheng Y, Ren Y, Zhang Z, Geng H. Prediction of Chlorophyll Content in Multi-Temporal Winter Wheat Based on Multispectral and Machine Learning. FRONTIERS IN PLANT SCIENCE 2022; 13:896408. [PMID: 35712585 PMCID: PMC9197342 DOI: 10.3389/fpls.2022.896408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Accepted: 04/19/2022] [Indexed: 06/15/2023]
Abstract
To obtain the canopy chlorophyll content of winter wheat in a rapid, non-destructive, high-throughput manner, the study was conducted on winter wheat at the Xinjiang Manas Experimental Base in 2021, and multispectral images of two water treatments, normal irrigation (NI) and drought stress (DS), at three key fertility stages (heading, flowering, and filling) of winter wheat were obtained by a DJI P4M unmanned aerial vehicle (UAV). The flag leaf chlorophyll content (CC) data of different genotypes in the field were obtained with a SPAD-502 Plus chlorophyll meter. First, the CC distribution of different genotypes was studied; then 13 vegetation indices, combined with the Random Forest algorithm and correlation evaluation of CC, and 14 vegetation indices were used for vegetation index preference. Finally, the preferential vegetation indices and nine machine learning algorithms, Ridge regression with cross-validation (RidgeCV), Ridge, Adaboost Regression, Bagging_Regressor, K_Neighbor, Gradient_Boosting_Regressor, Random Forest, Support Vector Machine (SVM), and Least absolute shrinkage and selection operator (Lasso), were used to construct the CC estimation models under two water treatments at three different fertility stages, which were evaluated by correlation coefficient (r), root mean square error (RMSE) and normalized root mean square error (NRMSE) to select the optimal estimation model. The results showed that the CC values under normal irrigation were higher than those under water limitation treatment at different fertility stages; several vegetation indices and CC values showed a highly significant correlation, with the highest correlation reaching 0.51; in the prediction model construction of CC values, different models under normal irrigation and water limitation treatment had high estimation accuracy, among which the model with the highest prediction accuracy under normal irrigation was at the heading stage. The highest precision of the model prediction under normal irrigation was in the RidgeCV model (r = 0.63, RMSE = 3.28, NRMSE = 16.2%) and the highest precision of the model prediction under water limitation treatment was in the SVM model (r = 0.63, RMSE = 3.47, NRMSE = 19.2%).
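The three evaluation metrics named in this abstract (r, RMSE, NRMSE) can be computed as in the sketch below. The abstract does not state how the RMSE was normalized, so the `by` parameter (mean or range of the observed values) is an assumption made here, and the function names and the SPAD-like toy values are purely illustrative rather than data from the study.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p)


def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(observed))


def nrmse(observed, predicted, by="mean"):
    """RMSE divided by the mean or the range of the observed values.

    The choice of denominator is an assumption; both conventions are in use.
    """
    if by == "mean":
        denominator = sum(observed) / len(observed)
    elif by == "range":
        denominator = max(observed) - min(observed)
    else:
        raise ValueError("by must be 'mean' or 'range'")
    return rmse(observed, predicted) / denominator


# Illustrative chlorophyll-meter-like readings, not values from the study.
observed = [40.0, 45.0, 50.0, 55.0, 60.0]
predicted = [42.0, 44.0, 51.0, 53.0, 61.0]

print(pearson_r(observed, predicted))
print(rmse(observed, predicted))
print(nrmse(observed, predicted, by="mean"))
```

Because the two normalizations can give quite different percentages for the same RMSE, NRMSE values are only comparable across studies when the denominator is reported.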
Affiliation(s)
- Wei Wang
- High-Quality Special Wheat Crop Engineering Technology Research Center, College of Agronomy, Xinjiang Agricultural University, Ürümqi, China
- Department of Computer Science and Information Engineering, Anyang Institute of Technology, Anyang, China
- Yukun Cheng
- High-Quality Special Wheat Crop Engineering Technology Research Center, College of Agronomy, Xinjiang Agricultural University, Ürümqi, China
- Yi Ren
- High-Quality Special Wheat Crop Engineering Technology Research Center, College of Agronomy, Xinjiang Agricultural University, Ürümqi, China
- Zhihui Zhang
- High-Quality Special Wheat Crop Engineering Technology Research Center, College of Agronomy, Xinjiang Agricultural University, Ürümqi, China
- Hongwei Geng
- High-Quality Special Wheat Crop Engineering Technology Research Center, College of Agronomy, Xinjiang Agricultural University, Ürümqi, China
33
Danilevicz MF, Gill M, Anderson R, Batley J, Bennamoun M, Bayer PE, Edwards D. Plant Genotype to Phenotype Prediction Using Machine Learning. Front Genet 2022; 13:822173. [PMID: 35664329 PMCID: PMC9159391 DOI: 10.3389/fgene.2022.822173] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/25/2021] [Accepted: 03/07/2022] [Indexed: 12/13/2022] Open
Abstract
Genomic prediction tools support crop breeding based on statistical methods, such as the genomic best linear unbiased prediction (GBLUP). However, these tools are not designed to capture non-linear relationships within multi-dimensional datasets, or deal with high dimension datasets such as imagery collected by unmanned aerial vehicles. Machine learning (ML) algorithms have the potential to surpass the prediction accuracy of current tools used for genotype to phenotype prediction, due to their capacity to autonomously extract data features and represent their relationships at multiple levels of abstraction. This review addresses the challenges of applying statistical and machine learning methods for predicting phenotypic traits based on genetic markers, environment data, and imagery for crop breeding. We present the advantages and disadvantages of explainable model structures, discuss the potential of machine learning models for genotype to phenotype prediction in crop breeding, and the challenges, including the scarcity of high-quality datasets, inconsistent metadata annotation and the requirements of ML models.
Affiliation(s)
- Monica F. Danilevicz
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Mitchell Gill
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Robyn Anderson
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Jacqueline Batley
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Mohammed Bennamoun
- School of Physics, Mathematics and Computing, University of Western Australia, Perth, WA, Australia
- Philipp E. Bayer
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- David Edwards
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- *Correspondence: David Edwards,
34
Genome-Enabled Prediction Methods Based on Machine Learning. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2467:189-218. [PMID: 35451777 DOI: 10.1007/978-1-0716-2205-6_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
Abstract
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers get knowledge from sets of experiences and provide forecasts or classification. In genome-wide based prediction (GWP), many ML studies have been carried out. This chapter provides a description of main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, to evaluate algorithms with the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and data distribution must be considered in order to choose the most appropriate model for a given problem.
35
Obesity-Associated Differentially Methylated Regions in Colon Cancer. J Pers Med 2022; 12:660. [PMID: 35629083 PMCID: PMC9142939 DOI: 10.3390/jpm12050660] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/04/2022] [Revised: 04/11/2022] [Accepted: 04/18/2022] [Indexed: 02/01/2023] Open
Abstract
Obesity with adiposity is a common disorder in modern days, influenced by environmental factors such as eating and lifestyle habits and affecting the epigenetics of adipose-based gene regulations and metabolic pathways in colorectal cancer (CRC). We compared epigenetic changes of differentially methylated regions (DMRs) of genes in colon tissues of 225 colon cancer cases (154 non-obese and 71 obese) and 15 healthy non-obese controls by accessing The Cancer Genome Atlas (TCGA) data. We applied machine-learning-based analytics including generalized regression (GR) as a confirmatory validation model to identify the factors that could contribute to DMRs impacting colon cancer to enhance prediction accuracy. We found that age was a significant predictor in obese cancer patients, both alone (p = 0.003) and interacting with hypomethylated DMRs of ZBTB46, a tumor suppressor gene (p = 0.008). DMRs of three additional genes: HIST1H3I (p = 0.001), an oncogene with a hypomethylated DMR in the promoter region; SRGAP2C (p = 0.006), a tumor suppressor gene with a hypermethylated DMR in the promoter region; and NFATC4 (p = 0.006), an adipocyte differentiating oncogene with a hypermethylated DMR in an intron region, are also significant predictors of cancer in obese patients, independent of age. The genes affected by these DMRs could be potential novel biomarkers of colon cancer in obese patients for cancer prevention and progression.
36
Parvandeh S, Donehower LA, Katsonis P, Hsu TK, Asmussen J, Lee K, Lichtarge O. EPIMUTESTR: a nearest neighbor machine learning approach to predict cancer driver genes from the evolutionary action of coding variants. Nucleic Acids Res 2022; 50:e70. [PMID: 35412634 PMCID: PMC9262594 DOI: 10.1093/nar/gkac215] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/28/2021] [Revised: 03/17/2022] [Accepted: 03/21/2022] [Indexed: 02/01/2023] Open
Abstract
Discovering rare cancer driver genes is difficult because their mutational frequency is too low for statistical detection by computational methods. EPIMUTESTR is an integrative nearest-neighbor machine learning algorithm that identifies such marginal genes by modeling the fitness of their mutations with the phylogenetic Evolutionary Action (EA) score. Over cohorts of sequenced patients from The Cancer Genome Atlas representing 33 tumor types, EPIMUTESTR detected 214 previously inferred cancer driver genes and 137 new candidates never before identified computationally, of which seven genes are supported in the COSMIC Cancer Gene Census. EPIMUTESTR achieved better robustness and specificity than existing methods in a number of benchmark methods and datasets.
Affiliation(s)
- Saeid Parvandeh
- To whom correspondence should be addressed. Tel: +1 713 798 7677;
- Lawrence A Donehower
- Department of Molecular Virology and Microbiology, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Teng-Kuei Hsu
- Department of Biochemistry & Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
- Jennifer K Asmussen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Kwanghyuk Lee
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Olivier Lichtarge
- Correspondence may also be addressed to Olivier Lichtarge. Tel: +1 713 798 5646;
37
Perez BC, Bink MCAM, Svenson KL, Churchill GA, Calus MPL. Prediction performance of linear models and gradient boosting machine on complex phenotypes in outbred mice. G3 (Bethesda) 2022; 12:6528848. [PMID: 35166767 PMCID: PMC8982369 DOI: 10.1093/g3journal/jkac039] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/15/2021] [Accepted: 01/29/2022] [Indexed: 12/14/2022]
Abstract
We compared the performance of linear (GBLUP, BayesB, and elastic net) methods to a nonparametric tree-based ensemble (gradient boosting machine) method for genomic prediction of complex traits in mice. The dataset used contained genotypes for 50,112 SNP markers and phenotypes for 835 animals from 6 generations. Traits analyzed were bone mineral density, body weight at 10, 15, and 20 weeks, fat percentage, circulating cholesterol, glucose, insulin, triglycerides, and urine creatinine. The youngest generation was used as a validation subset, and predictions were based on all older generations. Model performance was evaluated by comparing predictions for animals in the validation subset against their adjusted phenotypes. Linear models outperformed gradient boosting machine for 7 out of 10 traits. For bone mineral density, cholesterol, and glucose, the gradient boosting machine model showed better prediction accuracy and lower relative root mean squared error than the linear models. Interestingly, for these 3 traits, there is evidence of a relevant portion of phenotypic variance being explained by epistatic effects. Using a subset of top markers selected from a gradient boosting machine model helped for some of the traits to improve the accuracy of prediction when these were fitted into linear and gradient boosting machine models. Our results indicate that gradient boosting machine is more strongly affected by data size and decreased connectedness between reference and validation sets than the linear models. Although the linear models outperformed gradient boosting machine for the polygenic traits, our results suggest that gradient boosting machine is a competitive method to predict complex traits with assumed epistatic effects.
Affiliation(s)
- Bruno C Perez
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
- Marco C A M Bink
- Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands
- Mario P L Calus
- Wageningen University & Research, Animal Breeding and Genomics, 6700 AH Wageningen, The Netherlands
38
Bartholomé J, Prakash PT, Cobb JN. Genomic Prediction: Progress and Perspectives for Rice Improvement. Methods Mol Biol 2022; 2467:569-617. [PMID: 35451791 DOI: 10.1007/978-1-0716-2205-6_21] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/14/2023]
Abstract
Genomic prediction can be a powerful tool to achieve greater rates of genetic gain for quantitative traits if thoroughly integrated into a breeding strategy. In rice as in other crops, the interest in genomic prediction is very strong with a number of studies addressing multiple aspects of its use, ranging from the more conceptual to the more practical. In this chapter, we review the literature on rice (Oryza sativa) and summarize important considerations for the integration of genomic prediction in breeding programs. The irrigated breeding program at the International Rice Research Institute is used as a concrete example on which we provide data and R scripts to reproduce the analysis but also to highlight practical challenges regarding the use of predictions. The adage "To someone with a hammer, everything looks like a nail" describes a common psychological pitfall that sometimes plagues the integration and application of new technologies to a discipline. We have designed this chapter to help rice breeders avoid that pitfall and appreciate the benefits and limitations of applying genomic prediction, as it is not always the best approach nor the first step to increasing the rate of genetic gain in every context.
Affiliation(s)
- Jérôme Bartholomé
- CIRAD, UMR AGAP Institut, Montpellier, France.
- AGAP Institut, Univ Montpellier, CIRAD, INRAE, Montpellier SupAgro, Montpellier, France.
- Rice Breeding Platform, International Rice Research Institute, Manila, Philippines.
39
Dalla Lana F, Madden LV, Paul PA. Logistic Models Derived via LASSO Methods for Quantifying the Risk of Natural Contamination of Maize Grain with Deoxynivalenol. Phytopathology 2021; 111:2250-2267. [PMID: 34009008 DOI: 10.1094/phyto-03-21-0104-r] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Models were developed to quantify the risk of deoxynivalenol (DON) contamination of maize grain based on weather, cultural practices, hybrid resistance, and Gibberella ear rot (GER) intensity. Data on natural DON contamination of 15 to 16 hybrids and weather were collected from 10 Ohio locations over 4 years. Logistic regression with 10-fold cross-validation was used to develop models to predict the risk of DON ≥1 ppm. The presence and severity of GER predicted DON risk with an accuracy of 0.81 and 0.87, respectively. Temperature, relative humidity, surface wetness, and rainfall were used to generate 37 weather-based predictor variables summarized over each of six 15-day windows relative to maize silking (R1). With these variables, least absolute shrinkage and selection operator (LASSO) followed by all-subsets variable selection and logistic regression with 10-fold cross-validation were used to build single-window weather-based models, from which 11 with one or two predictors were selected based on performance metrics and simplicity. LASSO logistic regression was also used to build more complex multiwindow models with up to 22 predictors. The performance of the best single-window models was comparable to that of the best multiwindow models, with accuracy ranging from 0.81 to 0.83 for the former and 0.83 to 0.87 for the latter group of models. These results indicated that the risk of DON ≥1 ppm can be accurately predicted with simple models built using temperature- and moisture-based predictors from a single window. These models will be the foundation for developing tools to predict the risk of DON contamination of maize grain.
Affiliation(s)
- Felipe Dalla Lana
- Department of Plant Pathology, The Ohio State University, Ohio Agricultural Research and Development Center, Wooster, OH 44691
- Laurence V Madden
- Department of Plant Pathology, The Ohio State University, Ohio Agricultural Research and Development Center, Wooster, OH 44691
- Pierce A Paul
- Department of Plant Pathology, The Ohio State University, Ohio Agricultural Research and Development Center, Wooster, OH 44691
40
Raimondi D, Corso M, Fariselli P, Moreau Y. From genotype to phenotype in Arabidopsis thaliana: in-silico genome interpretation predicts 288 phenotypes from sequencing data. Nucleic Acids Res 2021; 50:e16. [PMID: 34792168 PMCID: PMC8860592 DOI: 10.1093/nar/gkab1099] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/31/2021] [Revised: 10/06/2021] [Accepted: 10/22/2021] [Indexed: 01/09/2023] Open
Abstract
In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from a data availability issue to a data interpretation issue. In human genetics, this has delayed the promised breakthroughs in genetics and precision medicine; in plant sciences, it has delayed phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors. In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to the Saliency Maps gradient-based approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.
Affiliation(s)
- Massimiliano Corso
- Institut Jean-Pierre Bourgin, Université Paris-Saclay, INRAE, AgroParisTech, 78000 Versailles, France
- Piero Fariselli
- Department of Medical Sciences, University of Torino, 10123 Torino, Italy
- Yves Moreau
- ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
41
Predicting Heritability of Oil Palm Breeding Using Phenotypic Traits and Machine Learning. Sustainability 2021. [DOI: 10.3390/su132212613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023]
Abstract
Oil palm is one of the main crops grown to help achieve sustainability in Malaysia. The selection of the best breeds will produce quality crops and increase crop yields. This study aimed to examine machine learning (ML) in oil palm breeding (OPB) using factors other than genetic data. A new conceptual framework to adopt the ML in OPB will be presented at the end of this paper. At first, data types, phenotype traits, current ML models, and evaluation technique will be identified through a literature survey. This study found that the phenotype and genotype data are widely used in oil palm breeding programs. The average bunch weight, bunch number, and fresh fruit bunch are the most important characteristics that can influence the genetic improvement of progenies. Although machine learning approaches have been applied to increase the productivity of the crop, most studies focus on molecular markers or genotypes for plant breeding, rather than on phenotype. Theoretically, the use of phenotypic data related to offspring should predict high breeding values by using ML. Therefore, a new ML conceptual framework to study the phenotype and progeny data of oil palm breeds will be discussed in relation to achieving the Sustainable Development Goals (SDGs).
42
Zhao Y, Lyu X, Xiao W, Tian S, Zhang J, Hu Z, Fu Y. Evaluation of the soil profile quality of subsided land in a coal mining area backfilled with river sediment based on monitoring wheat growth biomass with UAV systems. Environ Monit Assess 2021; 193:576. [PMID: 34392439 DOI: 10.1007/s10661-021-09250-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/16/2021] [Accepted: 06/28/2021] [Indexed: 06/13/2023]
Abstract
Underground coal mining leads to land subsidence, and the situation is particularly serious in the Coal-Grain Complex in eastern China, where it reduces crop production or takes land out of production. Backfilling with Yellow River sediment is one effective way to remediate land subsidence in this area, but a key issue is how to select the optimal soil reconstruction profile so that the crop yield after backfilling and reclamation is unaffected. The main purpose of this study is to verify the feasibility of selecting the optimal soil reconstruction profile by rapid monitoring of crop growth and judging soil quality with the aid of unmanned aerial vehicle systems (UAVs). A control treatment and 13 experimental treatments were established for the study area. The control treatment consisted of 30 cm topsoil and 90 cm subsoil, with the topsoil serving as a proxy for native (undisturbed) soil from the study sites. All other treatments consisted of varying combinations of subsoil and sediment overlain by 30 cm of topsoil. The vegetation indices from the UAV multispectral images, and the plant height and vegetation coverage from the UAV RGB images, were used to estimate winter wheat biomass in a random forest regression. The results showed that the random forest regression model yielded accurate estimation of the aboveground biomass. Furthermore, knowledge of plant height and vegetation coverage improved the accuracy of prediction such that crop growth was well characterized. The optimal soil profile consisted of 0.3 m topsoil + 0.2 m subsoil + 0.2 m sediment + 0.2 m subsoil + 0.3 m sediment. A fast and effective airborne monitoring method for soil quality was established, thus providing greatly improved monitoring efficiency.
Affiliation(s)
- Yanling Zhao
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology, Beijing, 100083, People's Republic of China
- Xuejiao Lyu
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology, Beijing, 100083, People's Republic of China
- Wu Xiao
- Department of Land Management, Zhejiang University, Hangzhou, 310058, People's Republic of China
- Shuaishuai Tian
- Yellow River Engineering Consulting Co. Ltd, Zhengzhou, 450003, China
- Jianyong Zhang
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology, Beijing, 100083, People's Republic of China
- Zhenqi Hu
- Institute of Land Reclamation and Ecological Restoration, China University of Mining and Technology, Beijing, 100083, People's Republic of China.
- School of Environment Science and Spatial Informatics, China University of Mining and Technology, Xuzhou, 221116, China.
- Yanhua Fu
- School of Economics and Management, Tianjin Chengjian University, Tianjin, 3000384, China
43
Awlia M, Alshareef N, Saber N, Korte A, Oakey H, Panzarová K, Trtílek M, Negrão S, Tester M, Julkowska MM. Genetic mapping of the early responses to salt stress in Arabidopsis thaliana. Plant J 2021; 107:544-563. [PMID: 33964046 DOI: 10.1111/tpj.15310] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/13/2020] [Revised: 03/05/2021] [Accepted: 04/19/2021] [Indexed: 06/12/2023]
Abstract
Salt stress decreases plant growth prior to significant ion accumulation in the shoot. However, the processes underlying this rapid reduction in growth are still unknown. To understand the changes in salt stress responses through time and at multiple physiological levels, examining different plant processes within a single set-up is required. Recent advances in phenotyping have allowed the image-based estimation of plant growth, morphology, colour and photosynthetic activity. In this study, we examined the salt stress-induced responses of 191 Arabidopsis accessions from 1 h to 7 days after treatment using high-throughput phenotyping. Multivariate analyses and machine learning algorithms identified that quantum yield measured in the light-adapted state (Fv'/Fm') greatly affected growth maintenance in the early phase of salt stress, whereas the maximum quantum yield (QYmax) was crucial at a later stage. In addition, our genome-wide association study (GWAS) identified 770 loci that were specific to salt stress, in which two loci associated with QYmax and Fv'/Fm' were selected for validation using T-DNA insertion lines. We characterized an unknown protein kinase found in the QYmax locus that reduced photosynthetic efficiency and growth maintenance under salt stress. Understanding the molecular context of the candidate genes identified will provide valuable insights into the early plant responses to salt stress. Furthermore, our work incorporates high-throughput phenotyping, multivariate analyses and GWAS, uncovering details of temporal stress responses and identifying associations across different traits and time points, which are likely to constitute the genetic components of salinity tolerance.
Affiliation(s)
- Mariam Awlia
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Nouf Alshareef
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Department of Biochemistry, Faculty of Science, King Abdulaziz University (KAU), Jeddah, Saudi Arabia
- Noha Saber
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Arthur Korte
- Center for Computational and Theoretical Biology, University of Würzburg, Würzburg, Germany
- Helena Oakey
- Faculty of Sciences, School of Agriculture, Food and Wine, The University of Adelaide, Adelaide, SA, 5005, Australia
- Martin Trtílek
- Photon Systems Instruments (PSI), Drásov, Czech Republic
- Sónia Negrão
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Mark Tester
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Magdalena M Julkowska
- Division of Biological and Environmental Sciences and Engineering (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
44
Mores A, Borrelli GM, Laidò G, Petruzzino G, Pecchioni N, Amoroso LGM, Desiderio F, Mazzucotelli E, Mastrangelo AM, Marone D. Genomic Approaches to Identify Molecular Bases of Crop Resistance to Diseases and to Develop Future Breeding Strategies. Int J Mol Sci 2021; 22:5423. [PMID: 34063853 PMCID: PMC8196592 DOI: 10.3390/ijms22115423] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/24/2021] [Revised: 04/30/2021] [Accepted: 05/15/2021] [Indexed: 12/16/2022] Open
Abstract
Plant diseases are responsible for substantial crop losses each year and affect food security and agricultural sustainability. The improvement of crop resistance to pathogens through breeding represents an environmentally sound method for managing disease and minimizing these losses. The challenge is to breed varieties with a stable and broad-spectrum resistance. Different approaches, from markers to recent genomic and 'post-genomic era' technologies, will be reviewed in order to contribute to a better understanding of the complexity of host-pathogen interactions and genes, including those with small phenotypic effects and mechanisms that underlie resistance. An efficient combination of these approaches is herein proposed as the basis to develop a successful breeding strategy to obtain resistant crop varieties that yield higher in increasing disease scenarios.
Affiliation(s)
- Antonia Mores
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Grazia Maria Borrelli
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Giovanni Laidò
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Giuseppe Petruzzino
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Nicola Pecchioni
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Francesca Desiderio
- Council for Agricultural Research and Economics, Genomics and Bioinformatics Research Center, Via San Protaso 302, 29017 Fiorenzuola d’Arda, Italy
- Elisabetta Mazzucotelli
- Council for Agricultural Research and Economics, Genomics and Bioinformatics Research Center, Via San Protaso 302, 29017 Fiorenzuola d’Arda, Italy
- Anna Maria Mastrangelo
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
- Daniela Marone
- Council for Agricultural Research and Economics, Research Centre for Cereal and Industrial Crops, S.S. 673, Km 25,200, 71122 Foggia, Italy
45
Cortés AJ, López-Hernández F. Harnessing Crop Wild Diversity for Climate Change Adaptation. Genes (Basel) 2021; 12:783. [PMID: 34065368 PMCID: PMC8161384 DOI: 10.3390/genes12050783] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Received: 03/29/2021] [Revised: 04/28/2021] [Accepted: 05/19/2021] [Indexed: 12/20/2022] Open
Abstract
Warming and drought are reducing global crop production with a potential to substantially worsen global malnutrition. As with the green revolution in the last century, plant genetics may offer concrete opportunities to increase yield and crop adaptability. However, the rate at which the threat is happening requires powering new strategies in order to meet the global food demand. In this review, we highlight major recent 'big data' developments from both empirical and theoretical genomics that may speed up the identification, conservation, and breeding of exotic and elite crop varieties with the potential to feed humans. We first emphasize the major bottlenecks to capture and utilize novel sources of variation in abiotic stress (i.e., heat and drought) tolerance. We argue that adaptation of crop wild relatives to dry environments could be informative on how plant phenotypes may react to a drier climate because natural selection has already tested more options than humans ever will. Because isolated pockets of cryptic diversity may still persist in remote semi-arid regions, we encourage new habitat-based population-guided collections for genebanks. We continue discussing how to systematically study abiotic stress tolerance in these crop collections of wild and landraces using geo-referencing and extensive environmental data. By uncovering the genes that underlie the tolerance adaptive trait, natural variation has the potential to be introgressed into elite cultivars. However, unlocking adaptive genetic variation hidden in related wild species and early landraces remains a major challenge for complex traits that, as abiotic stress tolerance, are polygenic (i.e., regulated by many low-effect genes). Therefore, we finish prospecting modern analytical approaches that will serve to overcome this issue. 
Concretely, genomic prediction, machine learning, and multi-trait gene editing all offer innovative alternatives to speed up more accurate pre-breeding and breeding efforts toward the increase in crop adaptability and yield, while matching future global food demands in the face of increased heat and drought. In order for these 'big data' approaches to succeed, we advocate for a trans-disciplinary approach with open-source data and long-term funding. The recent developments and perspectives discussed throughout this review ultimately aim to contribute to increased crop adaptability and yield in the face of heat waves and drought events.
Affiliation(s)
- Andrés J. Cortés
- Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Km 7 Vía Rionegro, Las Palmas, Rionegro 054048, Colombia
- Departamento de Ciencias Forestales, Facultad de Ciencias Agrarias, Universidad Nacional de Colombia, Sede Medellín, Medellín 050034, Colombia
- Felipe López-Hernández
- Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Km 7 Vía Rionegro, Las Palmas, Rionegro 054048, Colombia
46
Rohde PD, Kristensen TN, Sarup P, Muñoz J, Malmendal A. Prediction of complex phenotypes using the Drosophila melanogaster metabolome. Heredity (Edinb) 2021; 126:717-732. [PMID: 33510469 PMCID: PMC8102504 DOI: 10.1038/s41437-021-00404-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/18/2020] [Revised: 01/04/2021] [Accepted: 01/04/2021] [Indexed: 01/30/2023] Open
Abstract
Understanding the genotype-phenotype map and how variation at different levels of biological organization is associated are central topics in modern biology. Fast developments in sequencing technologies and other molecular omic tools enable researchers to obtain detailed information on variation at DNA level and on intermediate endophenotypes, such as RNA, proteins and metabolites. This can facilitate our understanding of the link between genotypes and molecular and functional organismal phenotypes. Here, we use the Drosophila melanogaster Genetic Reference Panel and nuclear magnetic resonance (NMR) metabolomics to investigate the ability of the metabolome to predict organismal phenotypes. We performed NMR metabolomics on four replicate pools of male flies from each of 170 different isogenic lines. Our results show that metabolite profiles are variable among the investigated lines and that this variation is highly heritable. Second, we identify genes associated with metabolome variation. Third, using the metabolome gave better prediction accuracies than genomic information for four of five quantitative traits analyzed. Our comprehensive characterization of population-scale diversity of metabolomes and its genetic basis illustrates that metabolites have large potential as predictors of organismal phenotypes. This finding is of great importance, e.g., in human medicine, evolutionary biology and animal and plant breeding.
Affiliation(s)
- Palle Duun Rohde
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark.
- Torsten Nygaard Kristensen
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark
- Department of Animal Science, Aarhus University, Tjele, Denmark
- Pernille Sarup
- Department of Molecular Biology and Genetics, Aarhus University, Tjele, Denmark
- Nordic Seed A/S, Odder, Denmark
- Joaquin Muñoz
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark
- Anders Malmendal
- Department of Science and Environment, Roskilde University, Roskilde, Denmark.
| |
|
47
|
Grinberg NF, Wallace C. Multi-tissue transcriptome-wide association studies. Genet Epidemiol 2021; 45:324-337. [PMID: 33369784 PMCID: PMC8048510 DOI: 10.1002/gepi.22374] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/13/2020] [Revised: 11/04/2020] [Accepted: 11/18/2020] [Indexed: 12/20/2022]
Abstract
A transcriptome-wide association study (TWAS) attempts to identify disease associated genes by imputing gene expression into a genome-wide association study (GWAS) using an expression quantitative trait loci (eQTL) data set and then testing for associations with a trait of interest. Regulatory processes may be shared across related tissues and one natural extension of TWAS is harnessing cross-tissue correlation in gene expression to improve prediction accuracy. Here, we studied multi-tissue extensions of lasso regression and random forests (RF), joint lasso and RF-MTL (multi-task learning RF), respectively. We found that, on our chosen eQTL data set, multi-tissue methods were generally more accurate than their single-tissue counterparts, with RF-MTL performing the best. Simulations showed that these benefits generally translated into more associated genes identified, although highlighted that joint lasso had a tendency to erroneously identify genes in one tissue if there existed an eQTL signal for that gene in another. Applying the four methods to a type 1 diabetes GWAS, we found that multi-tissue methods found more unique associated genes for most of the tissues considered. We conclude that multi-tissue methods are competitive and, for some cell types, superior to single-tissue approaches and hold much promise for TWAS studies.
Affiliation(s)
- Nastasiya F. Grinberg
- Department of Medicine, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK
- Chris Wallace
- Department of Medicine, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
|
48
|
Maldonado C, Mora-Poblete F, Contreras-Soto RI, Ahmar S, Chen JT, do Amaral Júnior AT, Scapim CA. Genome-Wide Prediction of Complex Traits in Two Outcrossing Plant Species Through Deep Learning and Bayesian Regularized Neural Network. Front Plant Sci 2020; 11:593897. [PMID: 33329658 PMCID: PMC7728740 DOI: 10.3389/fpls.2020.593897] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/11/2020] [Accepted: 10/27/2020] [Indexed: 05/25/2023]
Abstract
Genomic selection models were investigated to predict several complex traits in breeding populations of Zea mays L. and Eucalyptus globulus Labill. For this, the following methods of Machine Learning (ML) were implemented: (i) Deep Learning (DL) and (ii) Bayesian Regularized Neural Network (BRNN), both in combination with different hyperparameters. These ML methods were also compared with Genomic Best Linear Unbiased Prediction (GBLUP) and different Bayesian regression models [Bayes A, Bayes B, Bayes Cπ, Bayesian Ridge Regression, Bayesian LASSO, and Reproducing Kernel Hilbert Space (RKHS)]. DL models, using Rectified Linear Units (as the activation function), had higher predictive ability values, which varied from 0.27 (pilodyn penetration of 6-year-old eucalypt trees) to 0.78 (flowering-related traits of maize). Moreover, the larger mini-batch size (100%) had a significantly higher predictive ability for wood-related traits than the smaller mini-batch size (10%). On the other hand, in the BRNN method, the architectures of one and two layers that used only the pureline function showed better results of prediction, with values ranging from 0.21 (pilodyn penetration) to 0.71 (flowering traits). A significant increase in the prediction ability was observed for DL in comparison with other methods of genomic prediction (Bayesian alphabet models, GBLUP, RKHS, and BRNN). Another important finding was the usefulness of DL models (through an iterative algorithm) as an SNP detection strategy for genome-wide association studies. The results of this study confirm the importance of DL for genome-wide analyses and crop/tree improvement strategies, which holds promise for accelerating breeding progress.
Affiliation(s)
- Carlos Maldonado
- Instituto de Ciencias Agroalimentarias, Animales y Ambientales, Universidad de O’Higgins, San Fernando, Chile
- Rodrigo Iván Contreras-Soto
- Instituto de Ciencias Agroalimentarias, Animales y Ambientales, Universidad de O’Higgins, San Fernando, Chile
- Sunny Ahmar
- Institute of Biological Sciences, University of Talca, Talca, Chile
- College of Plant Sciences and Technology, Huazhong Agricultural University, Wuhan, China
- Jen-Tsung Chen
- Department of Life Sciences, National University of Kaohsiung, Kaohsiung, Taiwan
- Antônio Teixeira do Amaral Júnior
- Laboratório de Melhoramento Genético Vegetal, Universidade Estadual do Norte Fluminense Darcy Ribeiro/CCTA, Campos dos Goytacazes, Brazil
|
49
|
Orhobor OI, Alexandrov NN, King RD. Predicting rice phenotypes with meta and multi-target learning. Mach Learn 2020. [DOI: 10.1007/s10994-020-05881-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case.
|
50
|
Han Y, Adolphs R. Estimating the heritability of psychological measures in the Human Connectome Project dataset. PLoS One 2020; 15:e0235860. [PMID: 32645058 PMCID: PMC7347217 DOI: 10.1371/journal.pone.0235860] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/12/2020] [Accepted: 06/24/2020] [Indexed: 12/03/2022] Open
Abstract
The Human Connectome Project (HCP) is a large structural and functional MRI dataset with a rich array of behavioral and genotypic measures, as well as a biologically verified family structure. This makes it a valuable resource for investigating questions about individual differences, including questions about heritability. While its MRI data have been analyzed extensively in this regard, to our knowledge a comprehensive estimation of the heritability of the behavioral dataset has never been conducted. Using a set of behavioral measures of personality, emotion and cognition, we show that it is possible to re-identify the same individual across two testing times (fingerprinting), and to identify identical twins significantly above chance. Standard heritability estimates of 37 behavioral measures were derived from twin correlations, and machine-learning models (univariate linear model, Ridge classifier and Random Forest model) were trained to classify monozygotic twins and dizygotic twins. Correlations between the standard heritability metric and each set of model weights ranged from 0.36 to 0.7, and questionnaire-based and task-based measures did not differ significantly in their heritability. We further explored the heritability of a smaller number of latent factors extracted from the 37 measures and repeated the heritability estimation; in this case, the correlations between the standard heritability and each set of model weights were lower, ranging from 0.05 to 0.43. One specific discrepancy arose for the general intelligence factor, which all models assigned high importance, but the standard heritability calculation did not. We present a thorough investigation of the heritabilities of the behavioral measures in the HCP as a resource for other investigators, and illustrate the utility of machine-learning methods for qualitative characterization of the differential heritability across diverse measures.
Affiliation(s)
- Yanting Han
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States of America
- Ralph Adolphs
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States of America
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America
- Chen Neuroscience Institute, California Institute of Technology, Pasadena, CA, United States of America
|