1
Kontou PI, Bagos PG. The goldmine of GWAS summary statistics: a systematic review of methods and tools. BioData Min 2024; 17:31. [PMID: 39238044 PMCID: PMC11375927 DOI: 10.1186/s13040-024-00385-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 08/27/2024] [Indexed: 09/07/2024] [Open Access]
Abstract
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Affiliation(s)
- Pantelis G Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131, Lamia, Greece.
2
Kojima K, Tadaka S, Okamura Y, Kinoshita K. Two-stage strategy using denoising autoencoders for robust reference-free genotype imputation with missing input genotypes. J Hum Genet 2024:10.1038/s10038-024-01261-6. [PMID: 38918526 DOI: 10.1038/s10038-024-01261-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 04/16/2024] [Accepted: 05/13/2024] [Indexed: 06/27/2024]
Abstract
Widely used genotype imputation methods are based on the Li and Stephens model, which assumes that new haplotypes can be represented by modifying existing haplotypes in a reference panel through mutations and recombinations. These methods use genotypes from SNP arrays as inputs to estimate haplotypes that align with the input genotypes by analyzing recombination patterns within a reference panel, and then infer unobserved variants. While these methods require reference panels in an identifiable form, their public use is limited due to privacy and consent concerns. One strategy to overcome these limitations is to use de-identified haplotype information, such as summary statistics or model parameters. Advances in deep learning (DL) offer the potential to develop imputation methods that use haplotype information in a reference-free manner by handling it as model parameters, while maintaining imputation accuracy comparable to that of methods based on the Li and Stephens model. Here, we provide a brief introduction to DL-based reference-free genotype imputation methods, including RNN-IMP, developed by our research group. We then evaluate the performance of RNN-IMP against widely used Li and Stephens model-based imputation methods in terms of accuracy (R²), using the 1000 Genomes Project Phase 3 dataset and corresponding simulated Omni2.5 SNP genotype data. Because RNN-IMP is sensitive to missing values in input genotypes, we propose a two-stage imputation strategy: missing genotypes are first imputed using denoising autoencoders; RNN-IMP then processes these imputed genotypes. This approach restores the imputation accuracy that is degraded by missing values, enhancing the practical use of RNN-IMP.
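The accuracy measure named in the abstract can be illustrated with a short sketch. It assumes the usual convention that imputation R² for one variant is the squared Pearson correlation between true genotypes (0, 1, or 2 alternate alleles) and imputed dosages across samples; the abstract does not spell out its exact definition, and the function name `imputation_r2` is hypothetical rather than part of RNN-IMP.

```python
import math


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumed convention (not taken from the paper): one value per sample for a
    single variant, genotypes coded 0/1/2, dosages as real numbers in [0, 2].
    Returns NaN when either vector has zero variance, since the correlation
    is undefined for a monomorphic variant.
    """
    n = len(true_genotypes)
    if n != len(imputed_dosages):
        raise ValueError("inputs must have the same length")
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n

    # Accumulate the (unnormalized) covariance and variances in one pass.
    cov = var_t = var_d = 0.0
    for t, d in zip(true_genotypes, imputed_dosages):
        dt, dd = t - mean_t, d - mean_d
        cov += dt * dd
        var_t += dt * dt
        var_d += dd * dd

    if var_t == 0.0 or var_d == 0.0:
        return float("nan")
    return (cov * cov) / (var_t * var_d)


if __name__ == "__main__":
    truth = [0, 1, 2, 1]
    dosage = [0.1, 0.9, 1.8, 1.2]
    print(f"R2 = {imputation_r2(truth, dosage):.4f}")
    print(math.isnan(imputation_r2([1, 1, 1], [0.9, 1.0, 1.1])))
```

In practice such per-variant values are often averaged within allele-frequency bins, which is a reporting choice and not something this sketch implements.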
Affiliation(s)
- Kaname Kojima
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.
- Shu Tadaka
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
- Yasunobu Okamura
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
- Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-0873, Japan
- Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.
- Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-0873, Japan.
- Graduate School of Information Sciences, Tohoku University, 6-3-09 Aza-Aoba, Aramaki, Aoba-ku, Sendai, Miyagi, 980-8579, Japan.
- Institute of Development, Aging and Cancer, Tohoku University, 4-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
3
Majumdar S, Basu S, McGue M, Chatterjee S. Simultaneous selection of multiple important single nucleotide polymorphisms in familial genome wide association studies data. Sci Rep 2023; 13:8476. [PMID: 37231056 PMCID: PMC10213008 DOI: 10.1038/s41598-023-35379-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 05/17/2023] [Indexed: 05/27/2023] [Open Access]
Abstract
We propose a resampling-based fast variable selection technique for detecting relevant single nucleotide polymorphisms (SNPs) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed 'single SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially the ones with weak effects. In this paper, we propose a computationally efficient model selection approach, based on the e-values framework, for single SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains a single model and utilizes a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform a gene-level analysis of the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method and detect several SNPs that have been implicated in alcohol consumption.
Affiliation(s)
- Subhabrata Majumdar
- University of Minnesota Twin Cities, Minneapolis, USA.
- AI Risk and Vulnerability Alliance, Seattle, USA.
- Saonli Basu
- University of Minnesota Twin Cities, Minneapolis, USA
- Matt McGue
- University of Minnesota Twin Cities, Minneapolis, USA
4
Qi GA, Zheng YT, Lin F, Huang X, Duan LW, You Y, Liu H, Wang Y, Xu HM, Chen GB. EigenGWAS: An online visualizing and interactive application for detecting genomic signatures of natural selection. Mol Ecol Resour 2021; 21:1732-1744. [PMID: 33665976 DOI: 10.1111/1755-0998.13370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2020] [Revised: 01/17/2021] [Accepted: 02/25/2021] [Indexed: 11/30/2022]
Abstract
Detecting genetic regions under selection in structured populations is of great importance in ecology, evolutionary biology and breeding programmes. We recently proposed EigenGWAS, an unsupervised genomic scanning approach that is similar to FST but does not require grouping information of the population, for detection of genomic regions under selection. The original EigenGWAS was designed for random-mating populations, and here we extend its use to inbred populations. We also show in theory and simulation that eigenvalues, the previous corrector for genetic drift in EigenGWAS, overcorrect for genetic drift, and that the genomic inflation factor is a better option for this adjustment. Applying the updated algorithm, we introduce the new EigenGWAS online platform with a highly efficient core implementation. Our online computational tool accepts PLINK data in the standard binary format, which can be easily converted from the original sequencing data, and provides users with graphical results via a user-friendly R-Shiny interface. We applied the proposed method and tool to various data sets, and we report biologically interpretable results as well as caveats that may lead to an unsatisfactory outcome. The EigenGWAS online platform is available at www.eigengwas.com, and can be localized and scaled up via R (recommended) or docker.
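The genomic inflation factor mentioned in the abstract can be sketched as follows. This is the generic genomic-control calculation (median of the observed 1-df chi-square statistics divided by the theoretical median, about 0.455), not the EigenGWAS implementation; all function names are hypothetical, and the sketch assumes per-SNP test statistics or two-sided p-values are already available.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
# If Z is standard normal, P(Z^2 <= m) = 0.5 means Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.4549


def chi2_from_pvalues(pvalues):
    """Convert two-sided p-values to 1-df chi-square statistics.

    Assumes every p-value lies in (0, 1]; p == 0 is rejected by inv_cdf.
    """
    nd = NormalDist()
    return [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in pvalues]


def genomic_inflation_factor(chi2_stats):
    """Observed median chi-square divided by the expected median under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def adjust_for_inflation(chi2_stats):
    """Rescale every statistic by the inflation factor (plain genomic control)."""
    lam = genomic_inflation_factor(chi2_stats)
    return [x / lam for x in chi2_stats]


if __name__ == "__main__":
    stats = chi2_from_pvalues([0.9, 0.5, 0.2, 0.01, 1e-6])
    lam = genomic_inflation_factor(stats)
    print(f"lambda_GC = {lam:.3f}")
    print([round(x, 3) for x in adjust_for_inflation(stats)])
```

Dividing by a single genome-wide factor treats the bulk of SNPs as carrying only the background signal, which is the assumption that makes the median a reasonable reference point.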
Affiliation(s)
- Guo-An Qi
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Yuan-Ting Zheng
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Feng Lin
- Clinical Research Institute, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, China
- Xin Huang
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Li-Wen Duan
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Yue You
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Hailan Liu
- Maize Research Institute, Sichuan Agricultural University, Chengdu, China
- Ying Wang
- Phase I Clinical Research Center, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, China
- Hai-Ming Xu
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Guo-Bo Chen
- Clinical Research Institute, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, China
- Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, People's Hospital of Hangzhou Medical College, Hangzhou, China
5
Kojima K, Tadaka S, Katsuoka F, Tamiya G, Yamamoto M, Kinoshita K. A genotype imputation method for de-identified haplotype reference information by using recurrent neural network. PLoS Comput Biol 2020; 16:e1008207. [PMID: 33001993 PMCID: PMC7529210 DOI: 10.1371/journal.pcbi.1008207] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/22/2020] [Accepted: 07/30/2020] [Indexed: 02/02/2023] [Open Access]
Abstract
Genotype imputation estimates the genotypes of unobserved variants using the genotype data of other observed variants based on a collection of haplotypes for thousands of individuals, which is known as a haplotype reference panel. In general, more accurate imputation results were obtained using a larger size of haplotype reference panel. Most of the existing genotype imputation methods explicitly require the haplotype reference panel in precise form, but the accessibility of haplotype data is often limited, due to the requirement of agreements from the donors. Since de-identified information such as summary statistics or model parameters can be used publicly, imputation methods using de-identified haplotype reference information might be useful to enhance the quality of imputation results under the condition where the access of the haplotype data is limited. In this study, we proposed a novel imputation method that handles the reference panel as its model parameters by using bidirectional recurrent neural network (RNN). The model parameters are presented in the form of de-identified information from which the restoration of the genotype data at the individual-level is almost impossible. We demonstrated that the proposed method provides comparable imputation accuracy when compared with the existing imputation methods using haplotype datasets from the 1000 Genomes Project (1KGP) and the Haplotype Reference Consortium. We also considered a scenario where a subset of haplotypes is made available only in de-identified form for the haplotype reference panel. In the evaluation using the 1KGP dataset under the scenario, the imputation accuracy of the proposed method is much higher than that of the existing imputation methods. We therefore conclude that our RNN-based method is quite promising to further promote the data-sharing of sensitive genome data under the recent movement for the protection of individuals’ privacy. 
Genotype imputation estimates the genotypes of unobserved variants using the genotype data of other observed variants based on a collection of genome data of a large number of individuals called a reference panel. In general, more accurate imputation results are obtained using a larger size of the reference panel. Although most of the existing imputation methods use the reference panel in an explicit form, the accessibility of genome data is often limited due to the requirement of agreements from the donors. We thus proposed a new imputation method that handles the reference panel as its model parameters by using bidirectional recurrent neural network. Since it is almost impossible to restore genome data at the individual-level from the model parameters, they can be shared publicly as the de-identified information even when the accessibility of the original reference panel is limited. We demonstrate that the proposed method provides comparable imputation accuracy with the existing methods. We also considered a scenario where a part of the genome data is made available only in de-identified form for the reference panel and have shown that the imputation accuracy of the proposed method is much higher than that of the existing methods under the scenario.
Affiliation(s)
- Kaname Kojima
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, Japan
- Shu Tadaka
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- Fumiki Katsuoka
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- Gen Tamiya
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, Japan
- Masayuki Yamamoto
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- School of Medicine, Tohoku University, Sendai, Miyagi, Japan
- Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, Sendai, Miyagi, Japan
- Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi, Japan
- Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, Sendai, Miyagi, Japan
- Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi, Japan
- Institute of Development, Aging and Cancer, Tohoku University, Sendai, Miyagi, Japan