1
Zhang Y, Zhang M, Ye J, Xu Q, Feng Y, Xu S, Hu D, Wei X, Hu P, Yang Y. Integrating genome-wide association study into genomic selection for the prediction of agronomic traits in rice (Oryza sativa L.). Molecular Breeding: New Strategies in Plant Improvement 2023; 43:81. [PMID: 37965378 PMCID: PMC10641074 DOI: 10.1007/s11032-023-01423-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 10/09/2023] [Indexed: 11/16/2023]
Abstract
Accurately identifying varieties with targeted agronomic traits is thought to contribute to genetic selection and accelerate rice breeding progress. Genomic selection (GS) is a promising technique that uses markers covering the whole genome to predict genomic-estimated breeding values (GEBV), making it possible to select before phenotypes are measured. To choose appropriate GS models for breeding work, we analyzed the predictability of nine agronomic traits measured in a population of 459 diverse rice varieties. Comparing eight representative GS models, we found that prediction accuracies ranged from 0.407 to 0.896, with reproducing kernel Hilbert space (RKHS) having the highest predictive ability for most traits. Further results demonstrated that the predictive ability of GS is altered by several factors. Moreover, we assessed the integration of genome-wide association study (GWAS) results into various GS models. The predictive abilities of GS combined with peak-associated markers generated by six different GWAS models differed significantly; Mixed Linear Model (MLM)-RKHS is recommended for GWAS-GS-integrated prediction. Finally, based on these results, we experimented with applying the P-values obtained from the optimal GWAS models in ridge regression best linear unbiased prediction (rrBLUP), which benefited the traits with low predictability in rice. Supplementary Information: The online version contains supplementary material available at 10.1007/s11032-023-01423-y.
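As a rough illustration of how GWAS P-values might be folded into a ridge-regression predictor, the sketch below scales each centered marker column by -log10(P) before fitting. The weighting scheme, the toy genotypes and phenotypes, and all function names are assumptions made for illustration; the abstract does not state how the authors applied the P-values in rrBLUP.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weighted_ridge(genotypes, phenotypes, p_values, lam=1.0):
    """Ridge regression on centered markers scaled by -log10(P) (assumed weighting)."""
    n, k = len(genotypes), len(p_values)
    weights = [-math.log10(p) for p in p_values]
    col_means = [sum(row[j] for row in genotypes) / n for j in range(k)]
    y_mean = sum(phenotypes) / n
    z = [[(row[j] - col_means[j]) * weights[j] for j in range(k)]
         for row in genotypes]
    # Normal equations: (Z'Z + lam * I) beta = Z'(y - mean(y))
    lhs = [[sum(r[i] * r[j] for r in z) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * (y - y_mean) for r, y in zip(z, phenotypes))
           for i in range(k)]
    beta = solve_linear(lhs, rhs)
    # Convert effects back to the unscaled genotype coding.
    effects = [b * w for b, w in zip(beta, weights)]
    return y_mean, col_means, effects


def predict(genotypes, y_mean, col_means, effects):
    """Predicted phenotype = mean + sum of centered genotype * marker effect."""
    return [y_mean + sum((g - mu) * e for g, mu, e in zip(row, col_means, effects))
            for row in genotypes]


def pearson(a, b):
    """Pearson correlation, a common measure of prediction accuracy in GS."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy data: 6 lines, 3 markers coded 0/1/2; marker 0 drives the trait.
genotypes = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1], [1, 2, 0], [2, 0, 1]]
phenotypes = [1.0, 3.1, 5.0, 0.9, 3.0, 5.1]
p_values = [1e-6, 0.5, 0.9]  # hypothetical GWAS P-values

y_mean, col_means, effects = fit_weighted_ridge(genotypes, phenotypes, p_values)
fitted = predict(genotypes, y_mean, col_means, effects)
print(effects, pearson(fitted, phenotypes))
```

The correlation printed here is computed on the training lines only; a realistic accuracy estimate would be obtained on held-out lines, for example by cross-validation.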
Affiliation(s)
- Yuanyuan Zhang
  - Zhejiang Lab, Hangzhou, 311121 China
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Mengchen Zhang
  - Zhejiang Lab, Hangzhou, 311121 China
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
  - National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
- Junhua Ye
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Qun Xu
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Yue Feng
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
  - National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
- Siliang Xu
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Dongxiu Hu
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Xinghua Wei
  - Zhejiang Lab, Hangzhou, 311121 China
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
  - National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
- Peisong Hu
  - Zhejiang Lab, Hangzhou, 311121 China
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
- Yaolong Yang
  - Zhejiang Lab, Hangzhou, 311121 China
  - CNRRI-Zhejiang Lab Computational Breeding Joint Laboratory, China National Rice Research Institute, Hangzhou, China
  - National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, 572024 China
2
Ma Y, Fa B, Yuan X, Zhang Y, Yu Z. STS-BN: An efficient Bayesian network method for detecting causal SNPs. Front Genet 2022; 13:942464. [PMID: 36186431 PMCID: PMC9520706 DOI: 10.3389/fgene.2022.942464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 08/16/2022] [Indexed: 11/16/2022]
Abstract
Background: The identification of the causal SNPs of complex diseases in large-scale genome-wide association analysis is beneficial to the studies of pathogenesis, prevention, diagnosis and treatment of these diseases. However, existing applicable methods for large-scale data suffer from low accuracy. Developing powerful and accurate methods for detecting SNPs associated with complex diseases is highly desired. Results: We propose a score-based two-stage Bayesian network method to identify causal SNPs of complex diseases for case-control designs. This method combines the ideas of constraint-based methods and score-and-search methods to learn the structure of the disease-centered local Bayesian network. Simulation experiments are conducted to compare this new algorithm with several common methods that can achieve the same function. The results show that our method improves the accuracy and stability compared to several common methods. Our method based on Bayesian network theory results in lower false-positive rates when all correct loci are detected. Besides, real-world data application suggests that our algorithm has good performance when handling genome-wide association data. Conclusion: The proposed method is designed to identify the SNPs related to complex diseases, and is more accurate than other methods which can also be adapted to large-scale genome-wide analysis studies data.
Affiliation(s)
- Yanran Ma
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Botao Fa
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Xi’an Jiaotong University, Xi’an, China
- Xin Yuan
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Yue Zhang
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - *Correspondence: Yue Zhang; Zhangsheng Yu
- Zhangsheng Yu
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - *Correspondence: Yue Zhang; Zhangsheng Yu
3
Yilmaz S, Fakhouri M, Koyutürk M, Çiçek AE, Tastan O. Uncovering complementary sets of variants for predicting quantitative phenotypes. Bioinformatics 2022; 38:908-917. [PMID: 34864867 DOI: 10.1093/bioinformatics/btab803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2021] [Revised: 09/21/2021] [Accepted: 11/24/2021] [Indexed: 02/03/2023]
Abstract
Motivation: Genome-wide association studies show that variants in individual genomic loci alone are not sufficient to explain the heritability of complex, quantitative phenotypes. Many computational methods have been developed to address this issue by considering subsets of loci that can collectively predict the phenotype. This problem can be considered a challenging instance of feature selection in which the number of dimensions (loci that are screened) is much larger than the number of samples. While currently available methods can achieve decent phenotype prediction performance, they either do not scale to large datasets or have parameters that require extensive tuning. Results: We propose a fast and simple algorithm, Macarons, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium. Our method features two interpretable parameters that control the time/performance trade-off without requiring parameter tuning. In our computational experiments, we show that Macarons consistently achieves similar or better prediction performance than state-of-the-art selection methods while having a simpler premise and being at least two orders of magnitude faster. Overall, Macarons can seamlessly scale to the human genome with ∼10⁷ variants in a matter of minutes while taking the dependencies between the variants into account. Availability and implementation: Macarons is available in Matlab and Python at https://github.com/serhan-yilmaz/macarons. Supplementary information: Supplementary data are available at Bioinformatics online.
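The idea of picking a small set of variants while skipping likely-redundant pairs can be sketched as a greedy filter. This is not the Macarons implementation: it assumes linkage disequilibrium can be approximated by physical distance on the same chromosome, and the function name, the parameters `k` and `d`, and the SNP identifiers are hypothetical.

```python
def select_complementary(snps, k, d):
    """Greedily pick up to k SNPs by descending score, skipping any SNP that
    lies within d base pairs of an already selected SNP on the same chromosome.

    snps: list of (snp_id, chromosome, position, score) tuples.
    """
    selected = []
    for snp in sorted(snps, key=lambda s: s[3], reverse=True):
        _, chrom, pos, _ = snp
        # Distance on the same chromosome is used as a crude proxy for LD.
        redundant = any(c == chrom and abs(p - pos) < d
                        for _, c, p, _ in selected)
        if not redundant:
            selected.append(snp)
        if len(selected) == k:
            break
    return [s[0] for s in selected]


# Hypothetical input: identifiers, positions and scores are made up.
snps = [
    ("snp_a", 1, 100, 0.9),
    ("snp_b", 1, 150, 0.8),   # close to snp_a, so treated as redundant
    ("snp_c", 1, 5000, 0.7),
    ("snp_d", 2, 120, 0.6),
]
picked = select_complementary(snps, k=3, d=1000)
print(picked)
```

In this toy run the second-best SNP is skipped because it sits 50 bp from the best one, so the selection spreads across distinct regions.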
Affiliation(s)
- Serhan Yilmaz
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
- Mohamad Fakhouri
  - Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Mehmet Koyutürk
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
  - Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA
- A Ercüment Çiçek
  - Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Oznur Tastan
  - Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
4
Gumbsch T, Bock C, Moor M, Rieck B, Borgwardt K. Enhancing statistical power in temporal biomarker discovery through representative shapelet mining. Bioinformatics 2021; 36:i840-i848. [PMID: 33381811 PMCID: PMC7773478 DOI: 10.1093/bioinformatics/btaa815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
Abstract
Motivation: Temporal biomarker discovery in longitudinal data is based on detecting reoccurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered. Results: We find that the pre- or post-processing of shapelets does not sufficiently increase the power and practical utility. Consequently, we present a novel method for temporal biomarker discovery, Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality. Availability and implementation: S5M is an option in the Python package of S3M: github.com/BorgwardtLab/S3M.
Affiliation(s)
- Thomas Gumbsch
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Christian Bock
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Michael Moor
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Bastian Rieck
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Karsten Borgwardt
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
5
Caylak G, Tastan O, Cicek AE. A Tool for Detecting Complementary Single Nucleotide Polymorphism Pairs in Genome-Wide Association Studies for Epistasis Testing. J Comput Biol 2020; 28:378-380. [PMID: 33325775 DOI: 10.1089/cmb.2020.0430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Detecting interacting locus pairs has been instrumental in understanding disease etiology when single-locus associations do not fully account for the underlying heritability. However, the number of locus pairs to test is prohibitively large. Epistasis test prioritization algorithms rank likely epistatic single nucleotide polymorphism (SNP) pairs to limit the number of statistical tests. Potpourri detects epistatic SNP pairs by diversifying the selected SNPs' genomic regions and investigating their co-occurrence patterns over the case cohort. It can also take SNPs in regulatory or coding regions as input and prioritize them further. The program identifies and returns a list of prioritized SNP pairs for epistasis testing. This article describes how to use the program and the details of the input and output data.
Affiliation(s)
- Gizem Caylak
  - Computer Engineering Department, Bilkent University, Ankara, Turkey
- Oznur Tastan
  - Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
- A Ercument Cicek
  - Computer Engineering Department, Bilkent University, Ankara, Turkey
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
6
Caylak G, Tastan O, Cicek AE. Potpourri: An Epistasis Test Prioritization Algorithm via Diverse SNP Selection. J Comput Biol 2020; 28:365-377. [PMID: 33275856 DOI: 10.1089/cmb.2020.0429] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/24/2022]
Abstract
Genome-wide association studies (GWAS) explain only a fraction of the underlying heritability of genetic diseases. Investigating epistatic interactions between two or more loci helps to close this gap. Unfortunately, the sheer number of locus combinations and hypotheses to process makes this prohibitive both computationally and statistically. Epistasis test prioritization algorithms rank likely epistatic single nucleotide polymorphism (SNP) pairs to limit the number of tests. However, they still suffer from very low precision. It was shown in the literature that selecting SNPs that are individually correlated with the phenotype and also diverse with respect to genomic location leads to better phenotype prediction due to genetic complementation. Here, we propose that an algorithm that pairs SNPs from such diverse regions and ranks them can improve prediction power. We propose an epistasis test prioritization algorithm that optimizes a submodular set function to select a diverse and complementary set of genomic regions that span the underlying genome. The SNP pairs from these regions are then further ranked with respect to their co-coverage of the case cohort. We compare our algorithm with the state of the art on three GWAS and show that we (1) substantially improve precision (from 0.003 to 0.652) while maintaining the significance of selected pairs, (2) decrease the number of tests by 25-fold, and (3) decrease the runtime by 4-fold. We also show that promoting SNPs from regulatory/coding regions improves the performance (up to 0.8). Potpourri is available at http://ciceklab.cs.bilkent.edu.tr/potpourri.
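Optimizing a submodular set function is commonly approached with a greedy algorithm that repeatedly adds the element with the largest marginal gain. The sketch below uses set coverage, a classic monotone submodular function, as a stand-in objective; it is not the Potpourri objective, and the region names and the sets of covered case individuals are hypothetical.

```python
def greedy_max_coverage(candidates, budget):
    """Greedy maximization of a coverage function.

    candidates: dict mapping a region name to the set of case individuals it
    covers. Returns the chosen regions and the union of what they cover.
    """
    chosen, covered = [], set()
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Marginal gain = newly covered individuals; ties go to the
        # alphabetically first name because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical regions and the case individuals (by index) each one covers.
regions = {
    "region_a": {1, 2, 3},
    "region_b": {3, 4},
    "region_c": {1, 2},
    "region_d": {5},
}
chosen, covered = greedy_max_coverage(regions, budget=2)
print(chosen, covered)
```

Note that `region_c` is never picked even though it covers two individuals, because both are already covered by `region_a`; this is the sense in which a submodular objective favors complementary rather than redundant choices.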
Affiliation(s)
- Gizem Caylak
  - Computer Engineering Department, Bilkent University, Ankara, Turkey
- Oznur Tastan
  - Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
- A Ercument Cicek
  - Computer Engineering Department, Bilkent University, Ankara, Turkey
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
7
Jeong S, Kim JY, Kim N. GMStool: GWAS-based marker selection tool for genomic prediction from genomic data. Sci Rep 2020; 10:19653. [PMID: 33184432 PMCID: PMC7665227 DOI: 10.1038/s41598-020-76759-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/04/2020] [Accepted: 11/02/2020] [Indexed: 12/20/2022]
Abstract
The increased accessibility of genomic data in recent years has laid the foundation for studies that predict various phenotypes of organisms based on the genome. Genomic prediction collectively refers to these studies, and it estimates an individual's phenotypes mainly using single nucleotide polymorphism markers. Typically, the accuracy of these genomic prediction studies is highly dependent on the markers used; however, in practice, choosing markers that give high accuracy for the target phenotype is a challenging task. Therefore, we present a new tool called GMStool for selecting optimal marker sets and predicting quantitative phenotypes. GMStool is based on a genome-wide association study (GWAS) and heuristically searches for optimal markers using statistical and machine-learning methods. GMStool performs the genomic prediction using statistical and machine/deep-learning models and presents the best prediction model with the optimal marker set. For the evaluation, GMStool was tested on real datasets with four phenotypes. The prediction results showed higher performance than using all markers or the GWAS-top markers, which have been used frequently in prediction studies. Although GMStool has several limitations, it is expected to contribute to various studies for predicting quantitative phenotypes. GMStool, written in R, is available at www.github.com/JaeYoonKim72/GMStool.
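One simple way to search heuristically for a marker set, starting from a GWAS ranking, is forward selection: walk through the markers from most to least significant and keep a marker only if it improves a validation score. The sketch below is a generic illustration, not the GMStool procedure (which is written in R); the function name, the `min_gain` parameter, and the toy scoring function are hypothetical.

```python
def forward_select(ranked_markers, score_fn, min_gain=0.0):
    """Keep a marker only if adding it raises the validation score.

    ranked_markers: marker ids ordered from most to least significant.
    score_fn: callable taking a list of marker ids and returning a score,
              e.g. a validation-set correlation from a fitted model.
    """
    selected, best = [], float("-inf")
    for marker in ranked_markers:
        score = score_fn(selected + [marker])
        if score > best + min_gain:
            selected.append(marker)
            best = score
    return selected, best


# Toy stand-in for a model's validation score: each marker adds or removes a
# fixed amount. A real score would come from fitting and validating a model.
toy_contribution = {"m1": 0.5, "m2": -0.1, "m3": 0.2}


def toy_score(markers):
    return sum(toy_contribution[m] for m in markers)


selected, best = forward_select(["m1", "m2", "m3"], toy_score)
print(selected, best)
```

Here `m2` is dropped because it lowers the toy score, which mirrors the observation in the abstract that a searched marker set can outperform simply taking the top-ranked GWAS markers.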
Affiliation(s)
- Seongmun Jeong
  - Genome Editing Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Jae-Yoon Kim
  - Genome Editing Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
  - Department of Bioinformatics, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34141, Republic of Korea
- Namshin Kim
  - Genome Editing Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
  - Department of Bioinformatics, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34141, Republic of Korea