1. Ott J, Park T. Overview of frequent pattern mining. Genomics Inform 2022; 20:e39. [PMID: 36617647 PMCID: PMC9847378 DOI: 10.5808/gi.22074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 12/22/2022] [Indexed: 12/31/2022] Open
Abstract
Various methods of frequent pattern mining have been applied to genetic problems, specifically, to the combined association of two genotypes (a genotype pattern, or diplotype) at different DNA variants with disease. These methods have the ability to come up with a selection of genotype patterns that are more common in affected than unaffected individuals, and the assessment of statistical significance for these selected patterns poses some unique problems, which are briefly outlined here.
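As a rough illustration of what selecting genotype patterns that are more common in affected than unaffected individuals can look like, the following Python sketch counts two-variant genotype patterns in cases and controls and keeps those that are more frequent in cases. The function names, the 0/1/2 genotype coding, and the `min_case_support` cutoff are assumptions made for this example and are not taken from the article; the sketch deliberately does not assess statistical significance, which the abstract identifies as the difficult part.

```python
from collections import Counter
from itertools import combinations


def pattern_counts(genotypes):
    """Count two-variant genotype patterns over a list of individuals.

    Each individual is a tuple of genotype codes (assumed 0/1/2), one per variant.
    A pattern is keyed as ((variant_i, genotype_i), (variant_j, genotype_j)).
    """
    counts = Counter()
    for person in genotypes:
        for i, j in combinations(range(len(person)), 2):
            counts[((i, person[i]), (j, person[j]))] += 1
    return counts


def enriched_patterns(cases, controls, min_case_support=2):
    """Return patterns that are more frequent in cases than in controls.

    Output rows are (pattern, count_in_cases, count_in_controls), sorted by
    the difference in relative frequency, largest first.
    """
    case_counts = pattern_counts(cases)
    control_counts = pattern_counts(controls)
    rows = []
    for pattern, n_case in case_counts.items():
        n_ctrl = control_counts.get(pattern, 0)
        if n_case >= min_case_support and n_case / len(cases) > n_ctrl / len(controls):
            rows.append((pattern, n_case, n_ctrl))
    rows.sort(key=lambda r: r[1] / len(cases) - r[2] / len(controls), reverse=True)
    return rows


if __name__ == "__main__":
    # Toy data: 4 cases and 4 controls genotyped at 3 variants.
    cases = [(2, 1, 0), (2, 1, 1), (2, 1, 0), (0, 0, 0)]
    controls = [(0, 0, 0), (0, 1, 0), (2, 0, 0), (0, 0, 1)]
    for row in enriched_patterns(cases, controls):
        print(row)
```

Because the patterns are chosen after looking at the data, a naive p-value computed on the selected patterns would be too optimistic; any real analysis would need a procedure that accounts for the selection step.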
Affiliation(s)
- Jurg Ott (corresponding author): Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10065, USA
- Taesung Park: Department of Statistics, Seoul National University, Seoul 08826, Korea

2. Park M, Jeong HB, Lee JH, Park T. Spatial rank-based multifactor dimensionality reduction to detect gene-gene interactions for multivariate phenotypes. BMC Bioinformatics 2021; 22:480. [PMID: 34607566 PMCID: PMC8489107 DOI: 10.1186/s12859-021-04395-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/19/2020] [Accepted: 09/17/2021] [Indexed: 01/11/2023] Open
Abstract
Background: Identifying interaction effects between genes is one of the main tasks of genome-wide association studies aiming to shed light on the biological mechanisms underlying complex diseases. Multifactor dimensionality reduction (MDR) is a popular approach for detecting gene–gene interactions that has been extended in various forms to handle binary and continuous phenotypes. However, only a few multivariate MDR methods are available for multiple related phenotypes. Current approaches use Hotelling’s T² statistic to evaluate interaction models, but it is well known that Hotelling’s T² statistic is highly sensitive to heavily skewed distributions and outliers.
Results: We propose a robust approach based on nonparametric statistics such as spatial signs and ranks. The new multivariate rank-based MDR (MR-MDR) is mainly suitable for analyzing multiple continuous phenotypes and is less sensitive to skewed distributions and outliers. MR-MDR utilizes fuzzy k-means clustering and classifies multi-locus genotypes into two groups. Then, MR-MDR calculates a spatial rank-sum statistic as an evaluation measure and selects the best interaction model with the largest statistic. Our novel idea lies in adopting nonparametric statistics as an evaluation measure for robust inference. We adopt tenfold cross-validation to avoid overfitting. Intensive simulation studies were conducted to compare the performance of MR-MDR with current methods. Application of MR-MDR to a real dataset from a Korean genome-wide association study demonstrated that it successfully identified genetic interactions associated with four phenotypes related to kidney function. The R code for conducting MR-MDR is available at https://github.com/statpark/MR-MDR.
Conclusions: Intensive simulation studies comparing MR-MDR with several current methods showed that the performance of MR-MDR was outstanding for skewed distributions. Additionally, for symmetric distributions, MR-MDR showed comparable power. Therefore, we conclude that MR-MDR is a useful multivariate non-parametric approach that can be used regardless of the phenotype distribution, the correlations between phenotypes, and sample size.
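To make the notion of spatial signs and ranks more concrete, here is a small Python sketch of the standard definitions: the spatial sign of a vector is the vector scaled to unit length, and the spatial rank of an observation is the average spatial sign of its differences from all observations in the sample. This is only an illustration of those two building blocks; it is not the MR-MDR implementation (the authors' R code is at the URL given in the abstract), and the function names are assumptions of this example.

```python
import math


def spatial_sign(v):
    """Return v scaled to unit Euclidean length (the zero vector maps to zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return tuple(0.0 for _ in v)
    return tuple(c / norm for c in v)


def spatial_ranks(points):
    """Spatial rank of each point: the mean spatial sign of (x_i - x_j) over all j."""
    n = len(points)
    ranks = []
    for xi in points:
        total = [0.0] * len(xi)
        for xj in points:
            sign = spatial_sign(tuple(a - b for a, b in zip(xi, xj)))
            total = [t + s for t, s in zip(total, sign)]
        ranks.append(tuple(t / n for t in total))
    return ranks


if __name__ == "__main__":
    # Three ordinary bivariate phenotype vectors and one extreme outlier.
    phenotypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
    for point, rank in zip(phenotypes, spatial_ranks(phenotypes)):
        print(point, rank, math.hypot(*rank))
```

Each spatial rank is an average of unit vectors, so its length stays below 1 however extreme the observation is. This boundedness is one way to see why rank-based statistics are less affected by outliers than statistics built from raw values.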
Affiliation(s)
- Mira Park: Department of Preventive Medicine, Eulji University, Daejeon, 34824, Republic of Korea
- Hoe-Bin Jeong: Department of Statistics, Korea University, Seoul, 02841, Republic of Korea
- Jong-Hyun Lee: Department of Statistics, Korea University, Seoul, 02841, Republic of Korea
- Taesung Park: Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea

3. Yilmaz S, Tastan O, Cicek AE. SPADIS: An Algorithm for Selecting Predictive and Diverse SNPs in GWAS. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1208-1216. [PMID: 31443041 DOI: 10.1109/tcbb.2019.2935437] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Phenotypic heritability of complex traits and diseases is seldom explained by individual genetic variants identified in genome-wide association studies (GWAS). Many methods have been developed to select a subset of variant loci that are associated with or predictive of the phenotype. Selecting connected SNPs on SNP-SNP networks has proven successful in finding biologically interpretable and predictive SNPs. However, we argue that the connectedness constraint favors selecting redundant features that affect similar biological processes and therefore does not necessarily yield better predictive performance. In this paper, we propose a novel method called SPADIS that favors the selection of remotely located SNPs in order to account for their complementary effects in explaining a phenotype. SPADIS selects a diverse set of loci on a SNP-SNP network. This is achieved by maximizing a submodular set function with a greedy algorithm that ensures a constant factor approximation to the optimal solution. We compare SPADIS to the state-of-the-art method SConES on a dataset of Arabidopsis thaliana with continuous flowering time phenotypes. SPADIS has better average phenotype prediction performance in 15 out of 17 phenotypes when the same number of SNPs is selected and provides consistent improvements across multiple networks and settings on average. Moreover, it identifies more candidate genes and runs faster.
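The greedy strategy mentioned in the abstract can be sketched generically as follows. The sketch picks, at each step, the candidate with the largest marginal gain in a set function; for monotone submodular functions under a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimum. The objective used here, a weighted coverage of hypothetical network regions, is a stand-in chosen because it is a textbook submodular function; it is not the SPADIS objective, and all names and weights are invented for the example.

```python
def greedy_select(candidates, objective, k):
    """Greedily pick up to k candidates, each time maximizing the marginal gain."""
    selected = []
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(selected)
        # sorted() makes tie-breaking deterministic (first name alphabetically wins).
        best = max(sorted(remaining), key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy setup: each SNP "covers" some regions of a network,
# and each region carries a weight. Covering a region twice adds nothing,
# which is what makes redundant (nearby) SNPs unattractive.
COVERAGE = {
    "snp1": {"A", "B"},
    "snp2": {"A", "B"},
    "snp3": {"C"},
    "snp4": {"B"},
}
REGION_WEIGHT = {"A": 2.0, "B": 1.0, "C": 1.5}


def weighted_coverage(snps):
    """Total weight of the regions covered by at least one selected SNP."""
    covered = set()
    for snp in snps:
        covered |= COVERAGE[snp]
    return sum(REGION_WEIGHT[r] for r in covered)


if __name__ == "__main__":
    print(greedy_select(COVERAGE.keys(), weighted_coverage, k=2))
```

In this toy setup the second pick goes to a SNP covering a new region instead of one that duplicates the first pick, which mirrors the abstract's argument for diversity over redundancy.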

4. Fang H, Zhang H, Yang Y. Poisson Approximation-Based Score Test for Detecting Association of Rare Variants. Ann Hum Genet 2016; 80:221-34. [PMID: 27346734 DOI: 10.1111/ahg.12154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/07/2015] [Accepted: 02/26/2016] [Indexed: 11/30/2022]
Abstract
Genome-wide association studies (GWAS) have achieved great success in identifying genetic variants, but the nature of GWAS has determined their inherent limitations. Under the common disease rare variants (CDRV) hypothesis, the traditional association analysis methods commonly used in GWAS for common variants do not have enough power for detecting rare variants with a limited sample size. As a solution to this problem, pooling rare variants by their functions provides an efficient way for identifying susceptible genes. Rare variants typically have low frequencies of minor alleles, and the distribution of the total number of minor alleles of the rare variants can be approximated by a Poisson distribution. Based on this fact, we propose a new test method, the Poisson Approximation-based Score Test (PAST), for association analysis of rare variants. Two testing methods, namely, ePAST and mPAST, are proposed based on different strategies of pooling rare variants. Simulation results and application to the CRESCENDO cohort data show that our methods are more powerful than the existing methods.
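The Poisson approximation the abstract relies on can be checked numerically. The Python sketch below treats each allele as an independent Bernoulli trial with a small probability of being the minor allele, computes the exact distribution of the total count, and compares it with a Poisson distribution whose mean is the sum of those probabilities. Le Cam's inequality bounds the summed absolute difference between the two distributions by twice the sum of the squared probabilities, so the approximation tightens as the alleles get rarer. The allele frequencies below are invented for the example, and the sketch illustrates only the approximation, not the PAST, ePAST, or mPAST statistics themselves.

```python
import math


def poisson_binomial_pmf(probs):
    """Exact distribution of a sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this allele is not the minor allele
            nxt[k + 1] += mass * p       # this allele is the minor allele
        pmf = nxt
    return pmf


def poisson_pmf(lam, k):
    """Poisson probability of observing exactly k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def l1_distance_to_poisson(probs):
    """Sum over k of |exact pmf - Poisson pmf|, including the Poisson tail."""
    lam = sum(probs)
    exact = poisson_binomial_pmf(probs)
    inside = sum(abs(e - poisson_pmf(lam, k)) for k, e in enumerate(exact))
    tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(len(exact))))
    return inside + tail


if __name__ == "__main__":
    # Hypothetical minor allele probabilities for 60 allele draws.
    probs = [0.005] * 40 + [0.01] * 20
    print("L1 distance:", l1_distance_to_poisson(probs))
    print("Le Cam bound:", 2 * sum(p * p for p in probs))
```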
Affiliation(s)
- Hongyan Fang: Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, 230026, China
- Hong Zhang: Institute of Biostatistics, Fudan School of Life Sciences, Fudan, Shanghai, 200433, China
- Yaning Yang: Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, 230026, China

5. Gola D, Mahachie John JM, van Steen K, König IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform 2015; 17:293-308. [PMID: 26108231 PMCID: PMC4793893 DOI: 10.1093/bib/bbv038] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/12/2015] [Indexed: 02/02/2023] Open
Abstract
Complex diseases are defined to be determined by multiple genetic and environmental factors alone as well as in interactions. To analyze interactions in genetic data, many statistical methods have been suggested, with most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. Most recent approaches offer to deal with large-scale data sets and rare variants, which is why we expect these methods to even gain in popularity.

6. Fang H, Hou B, Wang Q, Yang Y. Rare variants analysis by risk-based variable-threshold method. Comput Biol Chem 2013; 46:32-8. [PMID: 23764529 DOI: 10.1016/j.compbiolchem.2013.04.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/20/2012] [Revised: 04/03/2013] [Accepted: 04/10/2013] [Indexed: 11/17/2022]
Abstract
Genome-wide association studies, as a powerful approach for detecting common variants associated with diseases, have revealed many disease-associated loci. However, the traditional association analysis methods do not have enough power for detecting the effects of rare variants with a limited sample size. As a solution to this problem, pooling rare variants by their functions into a composite variant provides an alternative way for identifying susceptible genes. In this paper, we propose a new pooling method to test the variant-disease association and to identify the functional rare variants associated with the disease. Variants with smaller and larger risk measures, defined as the ratio of allele frequencies between cases and controls, are pooled and a chi-square test of the resultant pooled table is calculated. We vary the threshold of pooling over all possible values and use the maximal chi-square as the test statistic. The maximal chi-square is in fact the global maximum over all possible poolings. Our approach is similar to the existing variable-threshold method, but we threshold on the risk measure instead of allele frequencies of controls. Simulation results show that our method performs better in both association testing and variant selection.
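The idea of thresholding on a risk measure and taking the maximal chi-square can be sketched as follows. This Python example makes several simplifying assumptions that are not from the paper: genotypes are reduced to 0/1 carrier indicators, the risk measure is a ratio of carrier frequencies with a 0.5 pseudo-count to avoid division by zero, and only the high-risk side is pooled. It orders variants by decreasing risk, pools them one threshold at a time, computes a 2x2 chi-square for carriers of any pooled variant in cases versus controls, and keeps the largest value.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def max_chi_square(cases, controls):
    """Scan risk thresholds and return (largest chi-square, pooled variant indices).

    cases / controls: lists of tuples of 0/1 carrier indicators, one per variant.
    """
    n_variants = len(cases[0])

    def carrier_freq(group, j):
        # 0.5 pseudo-count: an assumption of this sketch, not of the paper.
        return (sum(ind[j] for ind in group) + 0.5) / (len(group) + 1)

    risk = [carrier_freq(cases, j) / carrier_freq(controls, j) for j in range(n_variants)]
    order = sorted(range(n_variants), key=lambda j: risk[j], reverse=True)

    best_stat, best_set, pooled = 0.0, [], []
    for j in order:  # each step lowers the risk threshold by one variant
        pooled.append(j)
        a = sum(1 for ind in cases if any(ind[v] for v in pooled))
        c = sum(1 for ind in controls if any(ind[v] for v in pooled))
        stat = chi_square_2x2(a, len(cases) - a, c, len(controls) - c)
        if stat > best_stat:
            best_stat, best_set = stat, sorted(pooled)
    return best_stat, best_set


if __name__ == "__main__":
    cases = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1)]
    controls = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0)]
    print(max_chi_square(cases, controls))
```

Because the statistic is a maximum over many candidate poolings, it does not follow the usual one-degree-of-freedom chi-square distribution under the null hypothesis; its significance is commonly assessed by permuting case and control labels and repeating the scan.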
Affiliation(s)
- Hongyan Fang: Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, China

7. High-order SNP combinations associated with complex diseases: efficient discovery, statistical power and functional interactions. PLoS One 2012; 7:e33531. [PMID: 22536319 PMCID: PMC3334940 DOI: 10.1371/journal.pone.0033531] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Received: 12/07/2011] [Accepted: 02/10/2012] [Indexed: 11/19/2022] Open
Abstract
There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search, which only handles a small number of SNPs (up to a few hundred), or use a heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees of freedom and the huge number of hypotheses tested during combinatorial search. Due to these challenges, functional interactions in high-order combinations have not been systematically explored. We leverage discriminative-pattern-mining algorithms from the data-mining community to search for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousands of SNPs allows the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset in detail to provide novel insights on the complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of the population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases for which the current focus is on individually rare variations.

8. Cherdyntseva NV, Denisov EV, Litviakov NV, Maksimov VN, Malinovskaya EA, Babyshkina NN, Slonimskaya EM, Voevoda MI, Choinzonov EL. Crosstalk Between the FGFR2 and TP53 Genes in Breast Cancer: Data from an Association Study and Epistatic Interaction Analysis. DNA Cell Biol 2012; 31:306-16. [DOI: 10.1089/dna.2011.1351] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 12/16/2022] Open
Affiliation(s)
- Nadezhda V. Cherdyntseva: Department of Experimental Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation
- Evgeny V. Denisov: Department of Experimental Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation
- Nicolay V. Litviakov: Department of Experimental Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation
- Vladimir N. Maksimov: Laboratory of Molecular Genetic Study of Internal Diseases, Institute of Internal Medicine, Siberian Branch of Russian Academy of Medical Sciences, Novosibirsk, Russian Federation
- Elena A. Malinovskaya: Department of Experimental Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation
- Natalia N. Babyshkina: Department of Experimental Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation
- Elena M. Slonimskaya: Department of General Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation; Department of Oncology, Siberian State Medical University, Tomsk, Russian Federation
- Mikhail I. Voevoda: Laboratory of Molecular Genetic Study of Internal Diseases, Institute of Internal Medicine, Siberian Branch of Russian Academy of Medical Sciences, Novosibirsk, Russian Federation
- Evgeny L. Choinzonov: Department of Oncology, Siberian State Medical University, Tomsk, Russian Federation; Department of Head and Neck Oncology, Cancer Research Institute, Siberian Branch of Russian Academy of Medical Sciences, Tomsk, Russian Federation

9. Gilbert-Diamond D, Moore JH. Analysis of gene-gene interactions. Current Protocols in Human Genetics 2011; Chapter 1:Unit1.14. [PMID: 21735376 PMCID: PMC4086055 DOI: 10.1002/0471142905.hg0114s70] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 11/06/2022]
Abstract
The goal of this unit is to introduce gene-gene interactions (epistasis) as a significant complicating factor in the search for disease susceptibility genes. This unit begins with an overview of gene-gene interactions and why they are likely to be common. Then, it reviews several statistical and computational methods for detecting and characterizing genes with effects that are dependent on other genes. The focus of this unit is genetic association studies of discrete and quantitative traits because most of the methods for detecting gene-gene interactions have been developed specifically for these study designs.
Affiliation(s)
- Diane Gilbert-Diamond: Computational Genetics Laboratory, Departments of Genetics and Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire, USA