1. Kunert-Graf JM, Sakhanenko NA, Galas DJ. Optimized permutation testing for information theoretic measures of multi-gene interactions. BMC Bioinformatics 2021; 22:180. [PMID: 33827420 PMCID: PMC8028212 DOI: 10.1186/s12859-021-04107-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/01/2020] [Accepted: 03/29/2021] [Indexed: 11/17/2022] Open
Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
Affiliation(s)
- James M Kunert-Graf
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
- David J Galas
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
2. Magaña J, Contreras MG, Keys KL, Risse-Adams O, Goddard PC, Zeiger AM, Mak ACY, Elhawary JR, Samedy-Bates LA, Lee E, Thakur N, Hu D, Eng C, Salazar S, Huntsman S, Hu T, Burchard EG, White MJ. An epistatic interaction between pre-natal smoke exposure and socioeconomic status has a significant impact on bronchodilator drug response in African American youth with asthma. BioData Min 2020; 13:7. [PMID: 32636926 PMCID: PMC7333373 DOI: 10.1186/s13040-020-00218-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/05/2020] [Accepted: 06/23/2020] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Asthma is one of the leading chronic illnesses among children in the United States. Asthma prevalence is higher among African Americans (11.2%) compared to European Americans (7.7%). Bronchodilator medications are part of the first-line therapy, and the rescue medication, for acute asthma symptoms. Bronchodilator drug response (BDR) varies substantially among different racial/ethnic groups. Asthma prevalence in African Americans is only 3.5 percentage points higher than that of European Americans; however, asthma mortality among African Americans is four times that of European Americans, and variation in BDR may play an important role in explaining this health disparity. To improve our understanding of disparate health outcomes in complex phenotypes such as BDR, it is important to consider interactions between environmental and biological variables. RESULTS We evaluated the impact of pairwise and three-variable interactions between environmental, social, and biological variables on BDR in 233 African American youth with asthma using Visualization of Statistical Epistasis Networks (ViSEN). ViSEN is a non-parametric entropy-based approach able to quantify interaction effects using an information-theory metric known as Information Gain (IG). We performed analyses in the full dataset and in sex-stratified subsets. Our analyses identified several interaction models significantly, and suggestively, associated with BDR. The strongest interaction significantly associated with BDR was a pairwise interaction between pre-natal smoke exposure and socioeconomic status (full dataset IG: 2.78%, p = 0.001; female IG: 7.27%, p = 0.004). Sex-stratified analyses yielded divergent results for females and males, indicating the presence of sex-specific effects. CONCLUSIONS Our study identified novel interaction effects significantly, and suggestively, associated with BDR in African American children with asthma. Notably, we found that all of the interactions identified by ViSEN were "pure" interaction effects, in that they were not the result of strong main effects on BDR, highlighting the complexity of the network of biological and environmental factors impacting this phenotype. Several associations uncovered by ViSEN would not have been detected using regression-based methods, thus emphasizing the importance of employing statistical methods optimized to detect both additive and non-additive interaction effects when studying complex phenotypes such as BDR. The information gained in this study increases our understanding and appreciation of the complex nature of the interactions between environmental and health-related factors that influence BDR and will be invaluable to biomedical researchers designing future studies.
Affiliation(s)
- J. Magaña
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- M. G. Contreras
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Department of Biology, San Francisco State University, San Francisco, CA USA
- K. L. Keys
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Berkeley Institute for Data Science, University of California, Berkeley, CA USA
- O. Risse-Adams
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Lowell Science Research Program, Lowell High School, San Francisco, CA USA
  - Department of Biology, University of California, Santa Cruz, CA USA
- P. C. Goddard
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Department of Genetics, Stanford University, Stanford, CA USA
- A. M. Zeiger
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA USA
- A. C. Y. Mak
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- J. R. Elhawary
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- L. A. Samedy-Bates
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA USA
- E. Lee
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- N. Thakur
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- D. Hu
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- C. Eng
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- S. Salazar
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- S. Huntsman
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
- T. Hu
  - School of Computing, Queen’s University, Kingston, ON Canada
- E. G. Burchard
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA USA
- M. J. White
  - Department of Medicine, University of California, 1550 4th Street, UCSF Rock Hall, Box 2911, San Francisco, CA 94158 USA
3. Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
  - Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
  - Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
  - Corteva Agriscience™, Johnston, IA 50131, USA
4. Mi K, Jiang Y, Chen J, Lv D, Qian Z, Sun H, Shang D. Construction and Analysis of Human Diseases and Metabolites Network. Front Bioeng Biotechnol 2020; 8:398. [PMID: 32426349 PMCID: PMC7203444 DOI: 10.3389/fbioe.2020.00398] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/22/2019] [Accepted: 04/08/2020] [Indexed: 11/13/2022] Open
Abstract
The relationship between aberrant metabolism and the initiation and progression of diseases has gained considerable attention in recent years. To gain insights into the global relationship between diseases and metabolites, here we constructed a human diseases-metabolites network (HDMN). Analyses based on network biology showed that the metabolites associated with the same disorder tend to participate in the same metabolic pathway or cascade. In addition, the shortest distance between disease-related metabolites was shorter than that of all metabolites in the Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic network. Both disease and metabolite nodes in the HDMN displayed a slight clustering phenomenon, resulting in functional modules. Furthermore, a significant positive correlation was observed between the degree of metabolites and the proportion of disease-related metabolites in the KEGG metabolic network. We also found that the average degree of disease metabolites is larger than that of all metabolites. A comprehensive characterization of the HDMN could provide great insights into the global relationship between disease and metabolites.
Affiliation(s)
- Kai Mi
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Yanan Jiang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
  - Department of Pharmacology (State-Province Key Laboratories of Biomedicine - Pharmaceutics of China, Key Laboratory of Cardiovascular Research, Ministry of Education), College of Pharmacy, Harbin Medical University, Harbin, China
  - Translational Medicine Research and Cooperation Center of Northern China, Heilongjiang Academy of Medical Sciences, Harbin, China
- Jiaxin Chen
  - School of Medical Informatics, Harbin Medical University, Daqing, China
- Dongxu Lv
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Zhipeng Qian
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Hui Sun
  - Pharmaceutical Experiment Teaching Center, College of Pharmacy, Harbin Medical University, Harbin, China
- Desi Shang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
5. Wang S, Jeong HH, Sohn KA. ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction. BMC Med Genomics 2019; 12:95. [PMID: 31296201 PMCID: PMC6624178 DOI: 10.1186/s12920-019-0512-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.
Affiliation(s)
- Sehee Wang
  - Department of Computer Engineering, Ajou University, Suwon, 16499, South Korea
- Hyun-Hwan Jeong
  - Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, TX, 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Kyung-Ah Sohn
  - Department of Computer Engineering, Ajou University, Suwon, 16499, South Korea
6.
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Affiliation(s)
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
  - Corresponding author. Inke R. König, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: +49 451 500 50610; Fax: +49 451 500 50604
7. Yang CH, Weng ZJ, Chuang LY, Yang CS. Identification of SNP-SNP interaction for chronic dialysis patients. Comput Biol Med 2017; 83:94-101. [DOI: 10.1016/j.compbiomed.2017.02.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/10/2016] [Revised: 02/14/2017] [Accepted: 02/15/2017] [Indexed: 01/10/2023]
8. Lee W, Sjölander A, Pawitan Y. A Critical Look at Entropy-Based Gene-Gene Interaction Measures. Genet Epidemiol 2016; 40:416-24. [PMID: 27229752 DOI: 10.1002/gepi.21974] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/03/2015] [Revised: 02/28/2015] [Accepted: 03/17/2016] [Indexed: 11/12/2022]
Abstract
Several entropy-based measures for detecting gene-gene interaction have been proposed recently. It has been argued that entropy-based measures are preferred because entropy can better capture the nonlinear relationships between genotypes and traits, so they can be useful for detecting gene-gene interactions in complex diseases. These suggested measures look reasonable at an intuitive level, but so far there has been no detailed characterization of the interactions captured by them. Here we study analytically, and in detail, the properties of some entropy-based measures for detecting gene-gene interactions. The relationship between interactions captured by the entropy-based measures and those of logistic regression models is clarified. In general we find that the entropy-based measures can suffer from a lack of specificity in terms of target parameters, i.e., they can detect uninteresting signals as interactions. Numerical studies are carried out to confirm theoretical findings.
Affiliation(s)
- Woojoo Lee
  - Department of Statistics, Inha University, Nam-gu, Incheon, South Korea
- Arvid Sjölander
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Yudi Pawitan
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
9. CINOEDV: a co-information based method for detecting and visualizing n-order epistatic interactions. BMC Bioinformatics 2016; 17:214. [PMID: 27184783 PMCID: PMC4869388 DOI: 10.1186/s12859-016-1076-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 11/26/2015] [Accepted: 05/07/2016] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs), or epistatic interactions, are important topics in bioinformatics since they play an important role in unraveling the mystery of "missing heritability". However, related studies are almost entirely limited to pairwise epistatic interactions because of methodological and computational challenges. RESULTS We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, a detecting stage and a visualizing stage. In the detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In the visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By analyzing the constructed hypergraph in depth, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. CONCLUSIONS Experiments with CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: http://cran.r-project.org and https://sourceforge.net/projects/cinoedv/files/ .
10. An Improved Opposition-Based Learning Particle Swarm Optimization for the Detection of SNP-SNP Interactions. BIOMED RESEARCH INTERNATIONAL 2015; 2015:524821. [PMID: 26236727 PMCID: PMC4509494 DOI: 10.1155/2015/524821] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/25/2014] [Revised: 12/30/2014] [Accepted: 01/02/2015] [Indexed: 12/22/2022]
Abstract
SNP-SNP interactions have been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases. Though many works have been done for the detection of SNP-SNP interactions, the algorithmic development is still ongoing. In this study, an improved opposition-based learning particle swarm optimization (IOBLPSO) is proposed for the detection of SNP-SNP interactions. Highlights of IOBLPSO are the introduction of three strategies, namely, opposition-based learning, dynamic inertia weight, and a postprocedure. Opposition-based learning not only enhances the global explorative ability, but also avoids premature convergence. Dynamic inertia weight allows particles to cover a wider search space when the considered SNP is likely to be a random one and converges on promising regions of the search space while capturing a highly suspected SNP. The postprocedure is used to carry out a deep search in highly suspected SNP sets. Experiments of IOBLPSO are performed on both simulation data sets and a real data set of age-related macular degeneration, results of which demonstrate that IOBLPSO is promising in detecting SNP-SNP interactions. IOBLPSO might be an alternative to existing methods for detecting SNP-SNP interactions.
11. Gusareva ES, Van Steen K. Practical aspects of genome-wide association interaction analysis. Hum Genet 2014; 133:1343-58. [DOI: 10.1007/s00439-014-1480-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 05/21/2014] [Accepted: 08/18/2014] [Indexed: 12/31/2022]
12. Zuo X, Rao S, Fan A, Lin M, Li H, Zhao X, Qin J. To control false positives in gene-gene interaction analysis: two novel conditional entropy-based approaches. PLoS One 2013; 8:e81984. [PMID: 24339984 PMCID: PMC3858311 DOI: 10.1371/journal.pone.0081984] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/08/2013] [Accepted: 10/19/2013] [Indexed: 11/24/2022] Open
Abstract
Genome-wide analysis of gene-gene interactions has been recognized as a powerful avenue to identify the missing genetic components that can not be detected by using current single-point association analysis. Recently, several model-free methods (e.g. the commonly used information based metrics and several logistic regression-based metrics) were developed for detecting non-linear dependence between genetic loci, but they are potentially at the risk of inflated false positive error, in particular when the main effects at one or both loci are salient. In this study, we proposed two conditional entropy-based metrics to challenge this limitation. Extensive simulations demonstrated that the two proposed metrics, provided the disease is rare, could maintain consistently correct false positive rate. In the scenarios for a common disease, our proposed metrics achieved better or comparable control of false positive error, compared to four previously proposed model-free metrics. In terms of power, our methods outperformed several competing metrics in a range of common disease models. Furthermore, in real data analyses, both metrics succeeded in detecting interactions and were competitive with the originally reported results or the logistic regression approaches. In conclusion, the proposed conditional entropy-based metrics are promising as alternatives to current model-based approaches for detecting genuine epistatic effects.
Affiliation(s)
- Xiaoyu Zuo
  - Department of Medical Statistics and Epidemiology, Sun Yat-Sen University, Guangzhou, China
- Shaoqi Rao
  - Department of Medical Statistics and Epidemiology, Sun Yat-Sen University, Guangzhou, China
  - Institute of Medical Systems Biology and Department of Medical Statistics and Epidemiology, Guangdong Medical College, Dongguan, China
- An Fan
  - Department of Medical Statistics and Epidemiology, Sun Yat-Sen University, Guangzhou, China
- Meihua Lin
  - Institute of Medical Systems Biology and Department of Medical Statistics and Epidemiology, Guangdong Medical College, Dongguan, China
- Haoli Li
  - Institute of Medical Systems Biology and Department of Medical Statistics and Epidemiology, Guangdong Medical College, Dongguan, China
- Xiaolei Zhao
  - Institute of Medical Systems Biology and Department of Medical Statistics and Epidemiology, Guangdong Medical College, Dongguan, China
- Jiheng Qin
  - Institute of Medical Systems Biology and Department of Medical Statistics and Epidemiology, Guangdong Medical College, Dongguan, China
13. Hu T, Chen Y, Kiralis JW, Moore JH. ViSEN: methodology and software for visualization of statistical epistasis networks. Genet Epidemiol 2013; 37:283-5. [PMID: 23468157 DOI: 10.1002/gepi.21718] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 09/13/2012] [Revised: 12/20/2012] [Accepted: 02/05/2013] [Indexed: 11/06/2022]
Abstract
The nonlinear interaction effect among multiple genetic factors, i.e. epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions with an order of two. We developed ViSEN to analyze and visualize both two-way and three-way epistatic interactions. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful for inferring the underlying genetic architecture of complex diseases and for generating plausible hypotheses for further biological validation. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.
Affiliation(s)
- Ting Hu
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, New Hampshire, USA
14. Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An information-gain approach to detecting three-way epistatic interactions in genetic association studies. J Am Med Inform Assoc 2013; 20:630-6. [PMID: 23396514 PMCID: PMC3721169 DOI: 10.1136/amiajnl-2012-001525] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Indexed: 01/01/2023] Open
Abstract
Background Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. Objectives In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis. Methods Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis. Results Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. Conclusion Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
Affiliation(s)
- Ting Hu
  - Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, USA
15. Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 110] [Impact Index Per Article: 9.2] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
The interest in performing gene-environment interaction studies has seen a significant increase with the increase of advanced molecular genetics techniques. Practically, it became possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors to disease. Several rationales exist to set up gene-environment interaction studies, and the technical challenges related to these studies, when the number of environmental or genetic risk factors is relatively small, have been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, usually similar problems need to be dealt with as for genome-wide association gene-gene interaction studies. However, additional complexities need to be considered which are typical for large-scale epidemiological studies, but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Affiliation(s)
- Hugues Aschard
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
|
16
|
Van Steen K. Perspectives on genome-wide multi-stage family-based association studies. Stat Med 2011; 30:2201-21. [DOI: 10.1002/sim.4259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/16/2010] [Accepted: 03/07/2011] [Indexed: 01/03/2023]
|
17
|
Abstract
Over the last few years, main effect genetic association analysis has proven to be a successful tool to unravel genetic risk components to a variety of complex diseases. In the quest for disease susceptibility factors and the search for the 'missing heritability', supplementary and complementary efforts have been undertaken. These include the inclusion of several genetic inheritance assumptions in model development, the consideration of different sources of information, and the acknowledgement of disease underlying pathways of networks. The search for epistasis or gene-gene interaction effects on traits of interest is marked by an exponential growth, not only in terms of methodological development, but also in terms of practical applications, translation of statistical epistasis to biological epistasis and integration of omics information sources. The current popularity of the field, as well as its attraction to interdisciplinary teams, each making valuable contributions with sometimes rather unique viewpoints, renders it impossible to give an exhaustive review of to-date available approaches for epistasis screening. The purpose of this work is to give a perspective view on a selection of currently active analysis strategies and concerns in the context of epistasis detection, and to provide an eye to the future of gene-gene interaction analysis.
Affiliation(s)
- Kristel Van Steen
- Department of Electrical Engineering and Computer Science (Montefiore Institute), Grande Traverse, Bioinformatique 4000 Liège 1, Belgium.
|
18
|
Cattaert T, Urrea V, Naj AC, De Lobel L, De Wit V, Fu M, Mahachie John JM, Shen H, Calle ML, Ritchie MD, Edwards TL, Van Steen K. FAM-MDR: a flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS One 2010; 5:e10304. [PMID: 20421984 PMCID: PMC2858665 DOI: 10.1371/journal.pone.0010304] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 01/12/2010] [Accepted: 03/01/2010] [Indexed: 12/05/2022] Open
Abstract
We propose a novel multifactor dimensionality reduction method for epistasis detection in small or extended pedigrees, FAM-MDR. It combines features of the Genome-wide Rapid Association using Mixed Model And Regression approach (GRAMMAR) with Model-Based MDR (MB-MDR). We focus on continuous traits, although the method is general and can be used for outcomes of any type, including binary and censored traits. When comparing FAM-MDR with Pedigree-based Generalized MDR (PGMDR), which is a generalization of Multifactor Dimensionality Reduction (MDR) to continuous traits and related individuals, FAM-MDR was found to outperform PGMDR in terms of power, in most of the considered simulated scenarios. Additional simulations revealed that PGMDR does not appropriately deal with multiple testing and consequently gives rise to overly optimistic results. FAM-MDR adequately deals with multiple testing in epistasis screens and is in contrast rather conservative, by construction. Furthermore, simulations show that correcting for lower order (main) effects is of utmost importance when claiming epistasis. As Type 2 Diabetes Mellitus (T2DM) is a complex phenotype likely influenced by gene-gene interactions, we applied FAM-MDR to examine data on glucose area-under-the-curve (GAUC), an endophenotype of T2DM for which multiple independent genetic associations have been observed, in the Amish Family Diabetes Study (AFDS). This application reveals that FAM-MDR makes more efficient use of the available data than PGMDR and can deal with multi-generational pedigrees more easily. In conclusion, we have validated FAM-MDR and compared it to PGMDR, the current state-of-the-art MDR method for family data, using both simulations and a practical dataset. FAM-MDR is found to outperform PGMDR in that it handles the multiple testing issue more correctly, has increased power, and efficiently uses all available information.
Affiliation(s)
- Tom Cattaert
- Montefiore Institute, University of Liège, Liège, Belgium.
|