1
Abegaz F, Van Lishout F, Mahachie John JM, Chiachoompu K, Bhardwaj A, Duroux D, Gusareva ES, Wei Z, Hakonarson H, Van Steen K. Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure. BioData Min 2021; 14:16. [PMID: 33608043 PMCID: PMC7893746 DOI: 10.1186/s13040-021-00247-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/29/2020] [Accepted: 02/07/2021] [Indexed: 12/15/2022]
Abstract
Background: In genome-wide association studies, the extent and impact of confounding due to population structure have been well recognized. Inadequate handling of such confounding is likely to lead to spurious associations, hampering replication and the identification of causal variants. Several strategies have been developed for protecting associations against confounding, the most popular of which is based on Principal Component Analysis. In contrast, the extent and impact of confounding due to population structure in gene-gene interaction (epistasis) association studies are much less investigated and understood. In particular, the role of nonlinear genetic population substructure in epistasis detection is largely under-investigated, especially outside a regression framework.
Methods: To identify causal variants in synergy and to improve the interpretability and replicability of epistasis results, we introduce three strategies based on a model-based multifactor dimensionality reduction approach for structured populations, namely MBMDR-PC, MBMDR-PG, and MBMDR-GC.
Results: Simulation results comparing the performance of various approaches show that, in the presence of population structure, MBMDR-PC and MBMDR-PG consistently control the type I error rate at the nominal level better than MBMDR-GC. Moreover, our three proposed methods of population structure correction outperform MDR-SP in terms of statistical power.
Conclusion: We demonstrate through extensive simulation studies the effect of various degrees of genetic population structure and relatedness on epistasis detection and propose appropriate remedial measures based on linear and nonlinear sample genetic similarity.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13040-021-00247-w.
Affiliation(s)
- Fentaw Abegaz
  - GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Archana Bhardwaj
  - GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Diane Duroux
  - GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Elena S Gusareva
  - GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Hakon Hakonarson
  - Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pediatrics, Division of Human Genetics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kristel Van Steen
  - GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
  - WELBIO (Walloon Excellence in Lifesciences and Biotechnology), University of Liège, Liège, Belgium
2
Abegaz F, Chaichoompu K, Génin E, Fardo DW, König IR, Mahachie John JM, Van Steen K. Principals about principal components in statistical genetics. Brief Bioinform 2020; 20:2200-2216. [PMID: 30219892 DOI: 10.1093/bib/bby081] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/18/2018] [Revised: 07/21/2018] [Accepted: 08/12/2018] [Indexed: 12/13/2022]
Abstract
Principal components (PCs) are widely used in statistics and refer to a relatively small number of uncorrelated variables derived from an initial pool of variables, while explaining as much of the total variance as possible. Principal component analysis (PCA) is also a popular technique in statistical genetics. To achieve optimal results, a thorough understanding of the different implementations of PCA, and of their impact on study results compared to alternative approaches, is required. In this review, we focus on the possibilities, limitations and role of PCs in ancestry prediction, genome-wide association studies, rare variants analyses, imputation strategies, meta-analysis and epistasis detection. We also describe several variations of classic PCA that deserve increased attention in statistical genetics applications.
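As an illustration of the definition above, the following minimal sketch derives principal components from a small genotype-like matrix via singular value decomposition. The toy data, the function name `principal_components`, and the choice to center without scaling are assumptions made for illustration and are not taken from the review.

```python
import numpy as np

def principal_components(X, k):
    """Return the first k PC scores and the fraction of variance each explains."""
    Xc = X - X.mean(axis=0)                      # center each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # sample coordinates on the first k PCs
    explained = s**2 / np.sum(s**2)              # share of total variance per component
    return scores, explained[:k]

rng = np.random.default_rng(0)
# Hypothetical genotype matrix: 50 samples x 10 SNPs coded as 0/1/2 allele counts
X = rng.integers(0, 3, size=(50, 10)).astype(float)
scores, explained = principal_components(X, k=2)
```

The columns of `scores` are mutually uncorrelated and ordered by decreasing explained variance, which is the property that makes the leading PCs usable as covariates summarizing sample structure.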
3
Abegaz F, Van Lishout F, Mahachie John JM, Chiachoompu K, Bhardwaj A, Gusareva ES, Wei Z, Hakonarson H, Van Steen K. Epistasis Detection in Genome-Wide Screening for Complex Human Diseases in Structured Populations. Systems Medicine 2019. [DOI: 10.1089/sysm.2019.0003] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/08/2023]
Affiliation(s)
- Fentaw Abegaz
  - GIGA-R, Medical Genomics—BIO3, University of Liege, Liege, Belgium
- Archana Bhardwaj
  - GIGA-R, Medical Genomics—BIO3, University of Liege, Liege, Belgium
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey
- Hakon Hakonarson
  - Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
  - Division of Human Genetics, Department of Pediatrics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Kristel Van Steen
  - GIGA-R, Medical Genomics—BIO3, University of Liege, Liege, Belgium
  - WELBIO (Walloon Excellence in Lifesciences and Biotechnology), University of Liege, Liege, Belgium
4
Yang CH, Chuang LY, Lin YD. Multiobjective multifactor dimensionality reduction to detect SNP-SNP interactions. Bioinformatics 2019; 34:2228-2236. [PMID: 29471406 DOI: 10.1093/bioinformatics/bty076] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/04/2017] [Accepted: 02/16/2018] [Indexed: 11/12/2022]
Abstract
Motivation: Single-nucleotide polymorphism (SNP)-SNP interactions (SSIs) are popular markers for understanding disease susceptibility. Multifactor dimensionality reduction (MDR) can successfully detect considerable SSIs. Currently, MDR-based methods mainly adopt a single-objective function (a single measure based on contingency tables) to detect SSIs. However, generally, a single-measure function might not yield favorable results due to potential model preferences and disease complexities.
Approach: This study proposes a multiobjective MDR (MOMDR) method that is based on a contingency table of MDR as an objective function. MOMDR considers the incorporated measures, including correct classification and likelihood rates, to detect SSIs and adopts set theory to predict the most favorable SSIs with cross-validation consistency. MOMDR enables simultaneously using multiple measures to determine potential SSIs.
Results: Three simulation studies were conducted to compare the detection success rates of MOMDR and single-objective MDR (SOMDR), revealing that MOMDR had higher detection success rates than SOMDR. Furthermore, the Wellcome Trust Case Control Consortium dataset was analyzed by MOMDR to detect SSIs associated with coronary artery disease.
Availability and implementation: MOMDR is freely available at https://goo.gl/M8dpDg.
Supplementary information: Supplementary data are available at Bioinformatics online.
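For orientation, the contingency-table step that single-objective MDR relies on can be sketched as follows: each two-locus genotype cell is labeled high risk when its case-to-control ratio exceeds the overall ratio, and the labeling is scored by its correct classification rate. This is a minimal sketch of classic MDR without cross-validation, not the MOMDR implementation; the function name `mdr_two_locus` and the toy data are hypothetical.

```python
import numpy as np

def mdr_two_locus(snp_a, snp_b, case):
    """Label 3x3 genotype cells as high/low risk and return (labels, accuracy)."""
    case = np.asarray(case, dtype=bool)
    cases = np.zeros((3, 3))
    controls = np.zeros((3, 3))
    # Fill the case and control contingency tables, indexed by the two genotypes
    np.add.at(cases, (snp_a[case], snp_b[case]), 1)
    np.add.at(controls, (snp_a[~case], snp_b[~case]), 1)
    threshold = case.sum() / (~case).sum()       # overall case:control ratio
    high_risk = cases > threshold * controls     # empty cells default to low risk
    predicted = high_risk[snp_a, snp_b]          # collapse two loci into one binary factor
    accuracy = np.mean(predicted == case)        # correct classification rate
    return high_risk, accuracy

# Hypothetical XOR-like interaction: risk only when exactly one locus carries genotype 1
snp_a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
snp_b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
case = np.array([0, 1, 1, 0, 0, 1, 1, 0])
high_risk, accuracy = mdr_two_locus(snp_a, snp_b, case)
```

A multiobjective variant would compute more than one measure from the same pair of tables rather than ranking models by accuracy alone.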
Affiliation(s)
- Cheng-Hong Yang
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
  - Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Li-Yeh Chuang
  - Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
- Yu-Da Lin
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
5
Uppu S, Krishna A. A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise. Int J Med Inform 2018; 119:134-151. [DOI: 10.1016/j.ijmedinf.2018.09.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/26/2017] [Revised: 04/13/2018] [Accepted: 09/03/2018] [Indexed: 01/17/2023]
6
Gola D, Mahachie John JM, van Steen K, König IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform 2015; 17:293-308. [PMID: 26108231 PMCID: PMC4793893 DOI: 10.1093/bib/bbv038] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/12/2015] [Indexed: 02/02/2023]
Abstract
Complex diseases are, by definition, determined by multiple genetic and environmental factors, acting alone as well as in interaction. To analyze interactions in genetic data, many statistical methods have been suggested, most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods has emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. The most recent approaches can handle large-scale data sets and rare variants, which is why we expect these methods to gain further in popularity.
7
MDR method for nonbinary response variable. J Multivariate Anal 2015. [DOI: 10.1016/j.jmva.2014.11.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
8
UGMDR: a unified conceptual framework for detection of multifactor interactions underlying complex traits. Heredity (Edinb) 2014; 114:255-61. [PMID: 25335557 DOI: 10.1038/hdy.2014.94] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/07/2014] [Revised: 08/30/2014] [Accepted: 09/04/2014] [Indexed: 01/17/2023]
Abstract
Biological outcomes are governed by multiple genetic and environmental factors that act in concert. Determining multifactor interactions is the primary topic of interest in recent genetics studies but presents enormous statistical and mathematical challenges. The computationally efficient multifactor dimensionality reduction (MDR) approach has emerged as a promising tool for meeting these challenges. On the other hand, complex traits are expressed in various forms and have different data generation mechanisms that cannot be appropriately modeled by a dichotomous model; the subjects in a study may be recruited according to the study's own analytical goals, research strategies and available resources, and do not necessarily consist only of homogeneous unrelated individuals. Although several modifications and extensions of MDR have in part addressed these practical problems, they are still limited in statistical analyses of diverse phenotypes, multivariate phenotypes and correlated observations, in correcting for potential population stratification, and in unifying both unrelated and family samples into a more powerful analysis. I propose a comprehensive statistical framework, referred to as unified generalized MDR (UGMDR), for systematic extension of MDR. The proposed approach is quite versatile: it allows for covariate adjustment; it is suitable for analyzing almost any trait type, for example, binary, count, continuous, polytomous, ordinal, time-to-onset, multivariate and others, as well as combinations of those; and it is applicable to various study designs, including homogeneous and admixed unrelated-subject designs, family designs, and mixtures of them. The proposed UGMDR offers an important addition to the arsenal of analytical tools for identifying nonlinear multifactor interactions and unraveling the genetic architecture of complex traits.
9
Lou XY. Gene-Gene and Gene-Environment Interactions Underlying Complex Traits and their Detection. Biometrics & Biostatistics International Journal 2014; 1:00007. [PMID: 25584363 PMCID: PMC4288817 DOI: 10.15406/bbij.2014.01.00007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Affiliation(s)
- Xiang-Yang Lou
  - Corresponding author: Xiang-Yang Lou, Department of Biostatistics, University of Alabama at Birmingham, 1665 University Boulevard, RPHB 327, Birmingham, Alabama 35294-0022, USA. Tel: 205-975-9145; Fax: 205-975-2541
10
Jin L, Zhu W, Yu Y, Kou C, Meng X, Tao Y, Guo J. Nonparametric tests of associations with disease based on U-statistics. Ann Hum Genet 2013; 78:141-53. [PMID: 24328673 DOI: 10.1111/ahg.12049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/07/2013] [Accepted: 09/01/2013] [Indexed: 11/25/2022]
Abstract
In case-control studies, association analysis was designed to test whether genetic variants were associated with human diseases. To evaluate the association, analysing one genetic marker at a time suffered from weak power, because of the correction for multiple testing and possibly small genetic effects. An alternative strategy was to test simultaneous effects of multiple markers, which was believed to be more powerful. However, when the number of markers under investigation was large, they would be subjected to weak power as well, because of the greater degrees of freedom. To conquer these limitations in case-control studies, we proposed a novel method that could test joint association of several loci (i.e. haplotype), with only a single degree of freedom. In this research, we developed a nonparametric approach, which was based on U-statistics. We also introduced a new kernel for U-statistic, which could combine the haplotype structure information, and was expected to enhance the power. Simulations indicated that our proposed approach offered merits in identifying the associations between diseases and haplotypes. Application of our method to a study of candidate genes for internalising disorder illustrated its virtue in utility and interpretation, and provided an excellent result in detecting the associations.
Affiliation(s)
- Lina Jin
  - Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, Jilin, 130024, China
  - School of Public Health, Jilin University, Changchun, Jilin, 130021, China