1
Togninalli M, Wang X, Kucera T, Shrestha S, Juliana P, Mondal S, Pinto F, Govindan V, Crespo-Herrera L, Huerta-Espino J, Singh RP, Borgwardt K, Poland J. Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics. Bioinformatics 2023:7176366. [PMID: 37220903 DOI: 10.1093/bioinformatics/btad336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2022] [Revised: 02/15/2023] [Accepted: 05/22/2023] [Indexed: 05/25/2023]
Abstract
MOTIVATION Developing new crop varieties with superior performance is highly important to ensure robust and sustainable global food security. The speed of variety development is limited by long field cycles and advanced generation selections in plant breeding programs. While methods to predict yield from genotype or phenotype data have been proposed, improved performance and integrated models are needed. RESULTS We propose a machine learning model that leverages both genotype and phenotype measurements by fusing genetic variants with multiple data sources collected by unmanned aerial systems. We use a deep multiple instance learning framework with an attention mechanism that sheds light on the importance given to each input during prediction, enhancing interpretability. Our model reaches 0.754 ± 0.024 Pearson correlation coefficient when predicting yield in similar environmental conditions; a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). We further predict yield on new lines in an unseen environment using only genotypes, obtaining a prediction accuracy of 0.386 ± 0.010, a 13.5% improvement over the linear baseline. Our multi-modal deep learning architecture efficiently accounts for plant health and environment, distilling the genetic contribution and providing excellent predictions. Yield prediction algorithms leveraging phenotypic observations during training therefore promise to improve breeding programs, ultimately speeding up delivery of improved varieties. AVAILABILITY AND IMPLEMENTATION Available at https://github.com/BorgwardtLab/PheGeMIL (code) and https://doi.org/doi:10.5061/dryad.kprr4xh5p (data).
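The attention mechanism mentioned in the abstract can be illustrated with a generic attention-pooling step for multiple instance learning. This is a minimal sketch of a common formulation, not the authors' implementation (their code is in the linked PheGeMIL repository); the function name, the array shapes, and the random stand-in parameters `V` and `w` are all assumptions made for illustration.

```python
import numpy as np

def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings into one vector using attention.

    instances: (K, D) array, one row per input in the bag (hypothetical
               embeddings, e.g. of a genotype vector or one aerial image).
    V: (H, D) projection matrix and w: (H,) scoring vector. In a real
       model both would be learned; here they are random stand-ins.
    """
    scores = np.tanh(instances @ V.T) @ w            # (K,) one raw score per instance
    scores = scores - scores.max()                   # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # (K,) non-negative, sums to 1
    bag = weights @ instances                        # (D,) attention-weighted average
    return bag, weights

rng = np.random.default_rng(0)
K, D, H = 5, 8, 4                                    # toy sizes, chosen arbitrarily
instances = rng.normal(size=(K, D))
V = rng.normal(size=(H, D))
w = rng.normal(size=H)
bag, weights = attention_mil_pool(instances, V, w)
```

The `weights` vector is what makes such a model inspectable: each entry is the share of the pooled representation contributed by one input.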
Affiliation(s)
- Matteo Togninalli
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Visium, Lausanne, Switzerland
- Xu Wang
- Department of Plant Pathology, Kansas State University, Manhattan, Kansas USA
- Department of Agricultural and Biological Engineering, University of Florida, IFAS Gulf Coast Research and Education Center, Wimauma, Florida USA
- Tim Kucera
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Martinsried, Germany
- Sandesh Shrestha
- Department of Plant Pathology, Kansas State University, Manhattan, Kansas USA
- Philomin Juliana
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Suchismita Mondal
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Francisco Pinto
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Velu Govindan
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Julio Huerta-Espino
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Campo Experimental Valle de Mexico-INIFAP, Texcoco, Estado de Mexico Mexico
- Ravi P Singh
- Global Wheat Program, International Maize and Wheat Improvement Center, Mexico City, Mexico
- Karsten Borgwardt
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Martinsried, Germany
- Jesse Poland
- Department of Plant Pathology, Kansas State University, Manhattan, Kansas USA
- Center for Desert Agriculture, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
2
Togninalli M, Ho ATV, Madl CM, Holbrook CA, Wang YX, Magnusson KEG, Kirillova A, Chang A, Blau HM. Machine learning-based classification of dual fluorescence signals reveals muscle stem cell fate transitions in response to regenerative niche factors. NPJ Regen Med 2023; 8:4. [PMID: 36639373 PMCID: PMC9839750 DOI: 10.1038/s41536-023-00277-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 01/03/2023] [Indexed: 01/15/2023]
Abstract
The proper regulation of muscle stem cell (MuSC) fate by cues from the niche is essential for regeneration of skeletal muscle. How pro-regenerative niche factors control the dynamics of MuSC fate decisions remains unknown due to limitations of population-level endpoint assays. To address this knowledge gap, we developed a dual fluorescence imaging time lapse (Dual-FLIT) microscopy approach that leverages machine learning classification strategies to track single cell fate decisions with high temporal resolution. Using two fluorescent reporters that read out maintenance of stemness and myogenic commitment, we constructed detailed lineage trees for individual MuSCs and their progeny, classifying each division event as symmetric self-renewing, asymmetric, or symmetric committed. Our analysis reveals that treatment with the lipid metabolite, prostaglandin E2 (PGE2), accelerates the rate of MuSC proliferation over time, while biasing division events toward symmetric self-renewal. In contrast, the IL6 family member, Oncostatin M (OSM), decreases the proliferation rate after the first generation, while blocking myogenic commitment. These insights into the dynamics of MuSC regulation by niche cues were uniquely enabled by our Dual-FLIT approach. We anticipate that similar binary live cell readouts derived from Dual-FLIT will markedly expand our understanding of how niche factors control tissue regeneration in real time.
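The three division categories named in the abstract can be written down as a simple rule once each daughter cell has been reduced to a binary call. The sketch below assumes that reduction has already happened (the paper's machine learning classifier that produces such calls from the two reporters is not reproduced here); the function name and the toy input list are hypothetical.

```python
from collections import Counter

def classify_division(daughter_a_stem: bool, daughter_b_stem: bool) -> str:
    """Label one division from the fate calls of its two daughter cells.

    True means the daughter is called as maintaining stemness,
    False means it is called as myogenically committed.
    """
    if daughter_a_stem and daughter_b_stem:
        return "symmetric self-renewing"
    if not daughter_a_stem and not daughter_b_stem:
        return "symmetric committed"
    return "asymmetric"

# Toy fate calls for four divisions (illustrative only, not data from the study).
divisions = [(True, True), (True, False), (False, False), (False, True)]
counts = Counter(classify_division(a, b) for a, b in divisions)
```

Tallying these labels per generation and per treatment is one way a bias toward a division type could be summarised from lineage trees.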
Affiliation(s)
- Matteo Togninalli
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Andrew T V Ho
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Department of Functional and Adaptive Biology - UMR 8251 CNRS, Université Paris Cité, 75013, Paris, France
- Christopher M Madl
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Colin A Holbrook
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Yu Xin Wang
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Center for Genetic Disorders and Aging, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
- Klas E G Magnusson
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Department of Signal Processing, ACCESS Linnaeus Centre, KTH Royal Institute of Technology, 100 44, Stockholm, Sweden
- Anna Kirillova
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Andrew Chang
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
- Helen M Blau
- Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford School of Medicine, Stanford, CA, 94305-5175, USA
3
Chalet FX, Bujaroska T, Germeni E, Ghandri N, Maddalena ET, Modi K, Olopoenia A, Thompson J, Togninalli M, Briggs AH. Mapping the Insomnia Severity Index Instrument to EQ-5D Health State Utilities: A United Kingdom Perspective. Pharmacoecon Open 2023; 7:149-161. [PMID: 36703022 PMCID: PMC9928998 DOI: 10.1007/s41669-023-00388-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/12/2023] [Indexed: 06/18/2023]
Abstract
OBJECTIVE This study aimed to map the Insomnia Severity Index (ISI) to the EQ-5D-3L utility values from a UK perspective. METHODS Source data were derived from the 2020 National Health and Wellness Survey (NHWS) for France, Germany, Italy, Spain, the UK and the US. Ordinary least squares regression, generalised linear model (GLM), censored least absolute deviation, and adjusted limited dependent variable mixture model (ALDVMM) were employed to explore the relationship between ISI total summary score and EQ-5D utility while accounting for adjustment covariates derived from the NHWS. Fitting performance was assessed using standard metrics, including mean-squared error (MSE) and coefficient of determination (R2). RESULTS A total of 17,955 respondent observations were included, with a mean ISI score of 12.12 ± 5.32 and a mean EQ-5D-3L utility (UK tariff) of 0.71 ± 0.23. GLM gamma-log and ALDVMM were the two best performing models. The ALDVMM had better fitting performance (R2 = 0.320, MSE 0.0347) than the GLM gamma-log (R2 = 0.303, MSE 0.0353); in train-test split-sample validation, ALDVMM also slightly outperformed the GLM gamma-log model, with an MSE of 0.0351 versus 0.0355. Based on fitting performance, ALDVMM and GLM gamma-log were the preferred models. CONCLUSIONS In the absence of preference-based measures, this study provides an updated mapping algorithm for estimating EQ-5D-3L utilities from the ISI summary total score. This new mapping not only draws its strengths from the use of a large international dataset but also the incorporation of adjustment variables (including sociodemographic and general health characteristics) to reduce the effects of confounders.
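The two fit metrics reported in the abstract, mean-squared error and the coefficient of determination, can be computed as follows. This is a minimal sketch using the standard definitions; the function names are hypothetical and the three utility values are made-up toy numbers, not data or results from the study.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean-squared error between observed and predicted utilities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy observed and predicted utilities (illustrative only).
y_true = [0.5, 0.7, 0.9]
y_pred = [0.6, 0.7, 0.8]
toy_mse = mse(y_true, y_pred)       # 0.02 / 3
toy_r2 = r_squared(y_true, y_pred)  # 1 - 0.02 / 0.08 = 0.75
```

A lower MSE and a higher R2 both indicate a closer fit, which is the sense in which the abstract ranks ALDVMM ahead of the GLM gamma-log model.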
Affiliation(s)
- Teodora Bujaroska
- Visium, EPFL Innovation Park, Rte Cantonale, 1015, Lausanne, Switzerland
- Evi Germeni
- Health Economics and Health Technology Assessment (HEHTA), School of Health and Wellbeing, University of Glasgow, 1 Lilybank Gardens, Glasgow, G12 8RZ, UK
- Nizar Ghandri
- Visium, EPFL Innovation Park, Rte Cantonale, 1015, Lausanne, Switzerland
- Emilio T Maddalena
- Visium, EPFL Innovation Park, Rte Cantonale, 1015, Lausanne, Switzerland
- Kushal Modi
- Cerner Enviza, 2800 Rockcreek Parkway, North Kansas City, MO, 64117, USA
- Abisola Olopoenia
- Cerner Enviza, 2800 Rockcreek Parkway, North Kansas City, MO, 64117, USA
- Jeffrey Thompson
- Cerner Enviza, 2800 Rockcreek Parkway, North Kansas City, MO, 64117, USA
- Matteo Togninalli
- Visium, EPFL Innovation Park, Rte Cantonale, 1015, Lausanne, Switzerland
- Andrew H Briggs
- Department of Health Services Research & Policy, London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, London, WC1H 9SH, UK
- Avalon Health Economics LLC, 119 Washington St, Morristown, NJ, 07960, USA
4
Kucera T, Togninalli M, Meng-Papaxanthos L. Conditional generative modeling for de novo protein design with hierarchical functions. Bioinformatics 2022; 38:3454-3461. [PMID: 35639661 PMCID: PMC9237736 DOI: 10.1093/bioinformatics/btac353] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/04/2021] [Revised: 04/20/2022] [Accepted: 05/20/2022] [Indexed: 11/18/2022]
Abstract
Motivation Protein design has become increasingly important for medical and biotechnological applications. Because of the complex mechanisms underlying protein formation, the creation of a novel protein requires tedious and time-consuming computational or experimental protocols. At the same time, machine learning has enabled the solving of complex problems by leveraging large amounts of available data, more recently with great improvements on the domain of generative modeling. Yet, generative models have mainly been applied to specific sub-problems of protein design. Results Here, we approach the problem of general-purpose protein design conditioned on functional labels of the hierarchical Gene Ontology. Since a canonical way to evaluate generative models in this domain is missing, we devise an evaluation scheme of several biologically and statistically inspired metrics. We then develop the conditional generative adversarial network ProteoGAN and show that it outperforms several classic and more recent deep-learning baselines for protein sequence generation. We further give insights into the model by analyzing hyperparameters and ablation baselines. Lastly, we hypothesize that a functionally conditional model could generate proteins with novel functions by combining labels and provide first steps into this direction of research. Availability and implementation The code and data underlying this article are available on GitHub at https://github.com/timkucera/proteogan, and can be accessed with doi:10.5281/zenodo.6591379. Supplementary information Supplemental data are available at Bioinformatics online.
Affiliation(s)
- Tim Kucera
- Department of Biosystems Science and Engineering, ETH Zürich, Basel 4058, Switzerland
5
Avican K, Aldahdooh J, Togninalli M, Mahmud AKMF, Tang J, Borgwardt KM, Rhen M, Fällman M. RNA atlas of human bacterial pathogens uncovers stress dynamics linked to infection. Nat Commun 2021; 12:3282. [PMID: 34078900 PMCID: PMC8172932 DOI: 10.1038/s41467-021-23588-w] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/06/2020] [Accepted: 05/05/2021] [Indexed: 11/25/2022]
Abstract
Bacterial processes necessary for adaption to stressful host environments are potential targets for new antimicrobials. Here, we report large-scale transcriptomic analyses of 32 human bacterial pathogens grown under 11 stress conditions mimicking human host environments. The potential relevance of the in vitro stress conditions and responses is supported by comparisons with available in vivo transcriptomes of clinically important pathogens. Calculation of a probability score enables comparative cross-microbial analyses of the stress responses, revealing common and unique regulatory responses to different stresses, as well as overlapping processes participating in different stress responses. We identify conserved and species-specific 'universal stress responders', that is, genes showing altered expression in multiple stress conditions. Non-coding RNAs are involved in a substantial proportion of the responses. The data are collected in a freely available, interactive online resource (PATHOgenex).
Affiliation(s)
- Kemal Avican
- Department of Molecular Biology, Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, Sweden
- Jehad Aldahdooh
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Matteo Togninalli
- Department for Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
- Swiss Institute for Bioinformatics, Lausanne, Switzerland
- A K M Firoj Mahmud
- Department of Molecular Biology, Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, Sweden
- Jing Tang
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Karsten M Borgwardt
- Department for Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
- Swiss Institute for Bioinformatics, Lausanne, Switzerland
- Mikael Rhen
- Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institute, Stockholm, Sweden
- Maria Fällman
- Department of Molecular Biology, Laboratory for Molecular Infection Medicine Sweden (MIMS), Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, Sweden
6
Togninalli M, Seren Ü, Freudenthal JA, Monroe JG, Meng D, Nordborg M, Weigel D, Borgwardt K, Korte A, Grimm DG. AraPheno and the AraGWAS Catalog 2020: a major database update including RNA-Seq and knockout mutation data for Arabidopsis thaliana. Nucleic Acids Res 2020; 48:D1063-D1068. [PMID: 31642487 PMCID: PMC7145550 DOI: 10.1093/nar/gkz925] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 09/06/2019] [Revised: 09/26/2019] [Accepted: 10/08/2019] [Indexed: 12/23/2022]
Abstract
Genome-wide association studies (GWAS) are integral for studying genotype-phenotype relationships and gaining a deeper understanding of the genetic architecture underlying trait variation. A plethora of genetic associations between distinct loci and various traits have been successfully discovered and published for the model plant Arabidopsis thaliana. This success and the free availability of full genomes and phenotypic data for more than 1,000 different natural inbred lines led to the development of several data repositories. AraPheno (https://arapheno.1001genomes.org) serves as a central repository of population-scale phenotypes in A. thaliana, while the AraGWAS Catalog (https://aragwas.1001genomes.org) provides a publicly available, manually curated and standardized collection of marker-trait associations for all available phenotypes from AraPheno. In this major update, we introduce the next generation of both platforms, including new data, features and tools. We included novel results on associations between knockout-mutations and all AraPheno traits. Furthermore, AraPheno has been extended to display RNA-Seq data for hundreds of accessions, providing expression information for over 28 000 genes for these accessions. All data, including the imputed genotype matrix used for GWAS, are easily downloadable via the respective databases.
Affiliation(s)
- Matteo Togninalli
- Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Ümit Seren
- Gregor Mendel Institute of Molecular Plant Biology, Vienna, Austria
- Jan A Freudenthal
- Center for Computational and Theoretical Biology, University Würzburg, Würzburg, Germany
- J Grey Monroe
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Dazhe Meng
- Gregor Mendel Institute of Molecular Plant Biology, Vienna, Austria
- Google, Mountain View, USA
- Magnus Nordborg
- Gregor Mendel Institute of Molecular Plant Biology, Vienna, Austria
- Detlef Weigel
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Karsten Borgwardt
- Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Arthur Korte
- Center for Computational and Theoretical Biology, University Würzburg, Würzburg, Germany
- Dominik G Grimm
- Technical University of Munich, TUM Campus Straubing for Biotechnology and Sustainability, Bioinformatics, Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, Straubing, Germany
7
Togninalli M, Yoneoka D, Kolios AGA, Borgwardt K, Nilsson J. Pretransplant Kinetics of Anti-HLA Antibodies in Patients on the Waiting List for Kidney Transplantation. J Am Soc Nephrol 2019; 30:2262-2274. [PMID: 31653784 DOI: 10.1681/asn.2019060594] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/12/2019] [Accepted: 08/19/2019] [Indexed: 12/30/2022]
Abstract
BACKGROUND Patients on organ transplant waiting lists are evaluated for preexisting alloimmunity to minimize episodes of acute and chronic rejection by regularly monitoring for changes in alloimmune status. There are few studies on how alloimmunity changes over time in patients on kidney allograft waiting lists, and an apparent lack of research-based evidence supporting currently used monitoring intervals. METHODS To investigate the dynamics of alloimmune responses directed at HLA antigens, we retrospectively evaluated data on anti-HLA antibodies measured by the single-antigen bead assay from 627 waitlisted patients who subsequently received a kidney transplant at University Hospital Zurich, Switzerland, between 2008 and 2017. Our analysis focused on a filtered dataset comprising 467 patients who had at least two assay measurements. RESULTS Within the filtered dataset, we analyzed potential changes in mean fluorescence intensity values (reflecting bound anti-HLA antibodies) between consecutive measurements for individual patients in relation to the time interval between measurements. Using multiple approaches, we found no correlation between these two factors. However, when we stratified the dataset on the basis of documented previous immunizing events (transplant, pregnancy, or transfusion), we found significant differences in the magnitude of change in alloimmune status, especially among patients with a previous transplant versus patients without such a history. Further efforts to cluster patients according to statistical properties related to alloimmune status kinetics were unsuccessful, indicating considerable complexity in individual variability. CONCLUSIONS Alloimmune kinetics in patients on a kidney transplant waiting list do not appear to be related to the interval between measurements, but are instead associated with alloimmunization history. This suggests that an individualized strategy for alloimmune status monitoring may be preferable to currently used intervals.
Affiliation(s)
- Matteo Togninalli
- Machine Learning and Computational Biology Laboratory, Department of Biosystems Science and Engineering, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Daisuke Yoneoka
- Machine Learning and Computational Biology Laboratory, Department of Biosystems Science and Engineering, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Karsten Borgwardt
- Machine Learning and Computational Biology Laboratory, Department of Biosystems Science and Engineering, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Jakob Nilsson
- Department of Immunology, University Hospital Zurich, Zurich, Switzerland
8
Togninalli M, Seren Ü, Meng D, Fitz J, Nordborg M, Weigel D, Borgwardt K, Korte A, Grimm DG. The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Res 2018; 46:D1150-D1156. [PMID: 29059333 PMCID: PMC5753280 DOI: 10.1093/nar/gkx954] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 08/14/2017] [Accepted: 10/06/2017] [Indexed: 12/21/2022]
Abstract
The abundance of high-quality genotype and phenotype data for the model organism Arabidopsis thaliana enables scientists to study the genetic architecture of many complex traits at an unprecedented level of detail using genome-wide association studies (GWAS). GWAS have been a great success in A. thaliana and many SNP-trait associations have been published. With the AraGWAS Catalog (https://aragwas.1001genomes.org) we provide a publicly available, manually curated and standardized GWAS catalog for all publicly available phenotypes from the central A. thaliana phenotype repository, AraPheno. All GWAS have been recomputed on the latest imputed genotype release of the 1001 Genomes Consortium using a standardized GWAS pipeline to ensure comparability between results. The catalog includes currently 167 phenotypes and more than 222 000 SNP-trait associations with P < 10−4, of which 3887 are significantly associated using permutation-based thresholds. The AraGWAS Catalog can be accessed via a modern web-interface and provides various features to easily access, download and visualize the results and summary statistics across GWAS.
Affiliation(s)
- Matteo Togninalli
- Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, 4056 Basel, Switzerland
- Ümit Seren
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), 1030 Vienna, Austria
- Dazhe Meng
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), 1030 Vienna, Austria
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90007, USA
- Joffrey Fitz
- Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Magnus Nordborg
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), 1030 Vienna, Austria
- Detlef Weigel
- Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Karsten Borgwardt
- Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, 4056 Basel, Switzerland
- Arthur Korte
- Center for Computational and Theoretical Biology, University Würzburg, 97074 Würzburg, Germany
- Dominik G Grimm
- Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, 4056 Basel, Switzerland
9
Abstract
Motivation Methods based on summary statistics obtained from genome-wide association studies have gained considerable interest in genetics due to the computational cost and privacy advantages they present. Imputing missing summary statistics has therefore become a key procedure in many bioinformatics pipelines, but available solutions may rely on additional knowledge about the populations used in the original study and, as a result, may not always ensure feasibility or high accuracy of the imputation procedure. Results We present ARDISS, a method to impute missing summary statistics in mixed-ethnicity cohorts through Gaussian Process Regression and automatic relevance determination. ARDISS is trained on an external reference panel and does not require information about allele frequencies of genotypes from the original study. Our method approximates the original GWAS population by a combination of samples from a reference panel relying exclusively on the summary statistics and without any external information. ARDISS successfully reconstructs the original composition of mixed-ethnicity cohorts and outperforms alternative solutions in terms of speed and imputation accuracy both for heterogeneous and homogeneous datasets. Availability and implementation The proposed method is available at https://github.com/BorgwardtLab/ARDISS. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matteo Togninalli
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Damian Roqueiro
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Karsten M Borgwardt
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland