1
Yaldız B, Erdoğan O, Rafatov S, Iyigün C, Aydın Son Y. Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies. BioData Min 2024; 17:3. [PMID: 38291454 PMCID: PMC10826120 DOI: 10.1186/s13040-024-00355-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 01/16/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND Non-linear relationships at the genotype level are essential in understanding the genetic interactions of complex disease traits. Genome-wide association studies (GWAS) have revealed statistical association of the SNPs in many complex diseases. As GWAS results could not thoroughly reveal the genetic background of these disorders, Genome-Wide Interaction Studies have started to gain importance. In recent years, various statistical approaches, such as entropy-based methods, have been suggested for revealing these non-additive interactions between variants. This study presents a novel prioritization workflow integrating two-step Random Forest (RF) modeling and entropy analysis after PLINK filtering. The PLINK-RF-RF workflow is followed by an entropy-based 3-way interaction information (3WII) method to capture the hidden patterns resulting from non-linear relationships between genotypes in Late-Onset Alzheimer Disease (LOAD) to discover early and differential diagnosis markers. RESULTS Three models from different datasets are developed by integrating PLINK-RF-RF analysis and the entropy-based three-way interaction information (3WII) calculation method, which enables the detection of third-order interactions, which are not primarily considered in epistatic interaction studies. A reduced SNP set is selected for all three datasets by 3WII analysis after PLINK filtering and prioritization of SNPs with RF-RF modeling, which is promising as a model minimization approach. Among SNPs revealed by 3WII, 4 SNPs out of 19 from GenADA, 1 SNP out of 27 from ADNI, and 4 SNPs out of 106 from NCRAD are mapped to genes directly associated with Alzheimer Disease. Additionally, several SNPs are associated with other neurological disorders. Also, the genes the variants mapped to in all datasets are significantly enriched in calcium ion binding, extracellular matrix, external encapsulating structure, and RUNX1 regulates estrogen receptor-mediated transcription pathways. Therefore, these functional pathways are proposed for further examination for a possible LOAD association. Besides, all 3WII variants are proposed as candidate biomarkers for the genotyping-based LOAD diagnosis. CONCLUSION The entropy approach performed in this study reveals the complex genetic interactions that significantly contribute to LOAD risk. We benefited from the entropy-based 3WII as a model minimization step and determined the significant 3-way interactions between the prioritized SNPs by PLINK-RF-RF. This framework is a promising approach for disease association studies, which can also be modified by integrating other machine learning and entropy-based interaction methods.
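Note: the abstract does not spell out how 3WII is computed. As a rough illustration only, the sketch below uses one common textbook definition of interaction information for three discrete variables, I(A;B|C) - I(A;B), expressed through joint entropies. The function names and the toy XOR data are hypothetical, and other sign conventions exist, so this should not be read as the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Shannon entropy (in bits) of the joint distribution of the given columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def interaction_information(a, b, c):
    """Interaction information under the convention I(A;B|C) - I(A;B).

    Assumption: this sign convention is one of several in the literature;
    positive values indicate synergy under this choice.
    """
    cond_mi = entropy(a, c) + entropy(b, c) - entropy(a, b, c) - entropy(c)
    mi = entropy(a) + entropy(b) - entropy(a, b)
    return cond_mi - mi


# Hypothetical toy data: C is the XOR of A and B, a purely synergistic pattern.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]
print(interaction_information(a, b, c))  # 1.0 bit under this convention
```

In the XOR case neither A nor B alone says anything about C, yet together they determine it, which is the kind of pattern a pairwise measure would miss.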
Affiliation(s)
- Burcu Yaldız
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Onur Erdoğan
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Sevda Rafatov
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Cem Iyigün
  - Department of Industrial Engineering, METU, Ankara, Turkey
- Yeşim Aydın Son
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
  - Graduate School of Informatics, ODTU-NOROM, METU, Ankara, Turkey
2
Yee J, Park T, Park M. Identification of the associations between genes and quantitative traits using entropy-based kernel density estimation. Genomics Inform 2022; 20:e17. [PMID: 35794697 PMCID: PMC9299569 DOI: 10.5808/gi.22033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Accepted: 06/15/2022] [Indexed: 11/20/2022] Open
Abstract
Genetic associations have been quantified using a number of statistical measures. Entropy-based mutual information may be one of the more direct ways of estimating the association, in the sense that it does not depend on the parametrization. For this purpose, both the entropy and conditional entropy of the phenotype distribution should be obtained. Quantitative traits, however, do not usually allow an exact evaluation of entropy. The estimation of entropy needs a probability density function, which can be approximated by kernel density estimation. We have investigated the proper sequence of procedures for combining the kernel density estimation and entropy estimation with a probability density function in order to calculate mutual information. Genotypes and their interactions were constructed to set the conditions for conditional entropy. Extensive simulation data created using three types of generating functions were analyzed using two different kernels as well as two types of multifactor dimensionality reduction and another probability density approximation method called m-spacing. The statistical power in terms of correct detection rates was compared. Using kernels was found to be most useful when the trait distributions were more complex than simple normal or gamma distributions. A full-scale genomic dataset was explored to identify associations using the 2-h oral glucose tolerance test results and γ-glutamyl transpeptidase levels as phenotypes. Clearly distinguishable single-nucleotide polymorphisms (SNPs) and interacting SNP pairs associated with these phenotypes were found and listed with empirical p-values.
Affiliation(s)
- Jaeyong Yee
  - Department of Physiology and Biophysics, Eulji University, Daejeon 34824, Korea
- Taesung Park
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
- Mira Park
  - Department of Preventive Medicine, Eulji University, Daejeon 34824, Korea
3
Kunert-Graf JM, Sakhanenko NA, Galas DJ. Optimized permutation testing for information theoretic measures of multi-gene interactions. BMC Bioinformatics 2021; 22:180. [PMID: 33827420 PMCID: PMC8028212 DOI: 10.1186/s12859-021-04107-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/01/2020] [Accepted: 03/29/2021] [Indexed: 11/17/2022] Open
Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
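Note: for orientation, the sketch below shows the naive form of permutation testing that the abstract describes as expensive: the phenotype labels are shuffled and the statistic (here, plain mutual information between one genotype and the phenotype) is recomputed from scratch each time. It does not reproduce the authors' count-table transformation; the function names, the choice of statistic, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (cx[a] * cy[b])) for (a, b), c in cxy.items())


def permutation_p_value(genotype, phenotype, n_perm=1000, seed=0):
    """Empirical p-value from the naive scheme: shuffle labels, recompute."""
    rng = random.Random(seed)
    observed = mutual_information(genotype, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise on ties.
        if mutual_information(genotype, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: genotype 0 always co-occurs with phenotype 0.
genotype = [0] * 20 + [1] * 20 + [2] * 20
phenotype = [0] * 20 + [1] * 40
print(permutation_p_value(genotype, phenotype))
```

Each iteration here rebuilds the count tables inside `mutual_information`, which is the step the abstract identifies as the bottleneck.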
Affiliation(s)
- James M Kunert-Graf
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
- David J Galas
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
4
Zhou X, Chan KCC, Huang Z, Wang J. Determining dependency and redundancy for identifying gene-gene interaction associated with complex disease. J Bioinform Comput Biol 2020; 18:2050035. [PMID: 33064052 DOI: 10.1142/s0219720020500353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
As interactions among genetic variants in different genes can be an important factor for predicting complex diseases, many computational methods have been proposed to detect if a particular set of genes has interaction with a particular complex disease. However, even though many such methods have been shown to be useful, they can be made more effective if the properties of gene-gene interactions can be better understood. Towards this goal, we have attempted to uncover patterns in gene-gene interactions and the patterns reveal an interesting property that can be reflected in an inequality that describes the relationship between two genotype variables and a disease-status variable. We show, in this paper, that this inequality can be generalized to [Formula: see text] genotype variables. Based on this inequality, we establish a conditional independence and redundancy (CIR)-based definition of gene-gene interaction and the concept of an interaction group. From these new definitions, a novel measure of gene-gene interaction is then derived. We discuss the properties of these concepts and explain how they can be used in a novel algorithm to detect high-order gene-gene interactions. Experimental results using both simulated and real datasets show that the proposed method can be very promising.
Affiliation(s)
- Xiangdong Zhou
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
- Keith C C Chan
  - Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, P. R. China
- Zhihua Huang
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
- Jingbin Wang
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
5
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
  - Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
  - Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
  - Corteva Agriscience™, Johnston, IA 50131, USA
6
Wang S, Jeong HH, Sohn KA. ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction. BMC Med Genomics 2019; 12:95. [PMID: 31296201 PMCID: PMC6624178 DOI: 10.1186/s12920-019-0512-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.
Affiliation(s)
- Sehee Wang
  - Department of Computer Engineering, Ajou University, Suwon, 16499 South Korea
- Hyun-Hwan Jeong
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX 77030 USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- Kyung-Ah Sohn
  - Department of Computer Engineering, Ajou University, Suwon, 16499 South Korea
7
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Affiliation(s)
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
  - Corresponding author. Inke R. König, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: +49 451 500 50610; Fax: +49 451 500 50604
8
Entropy, or Information, Unifies Ecology and Evolution and Beyond. Entropy 2018; 20:e20100727. [PMID: 33265816 PMCID: PMC7512290 DOI: 10.3390/e20100727] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/06/2018] [Revised: 08/18/2018] [Accepted: 09/11/2018] [Indexed: 02/07/2023]
Abstract
This article discusses how entropy/information methods are well-suited to analyzing and forecasting the four processes of innovation, transmission, movement, and adaptation, which are the common basis to ecology and evolution. Macroecologists study assemblages of differing species, whereas micro-evolutionary biologists study variants of heritable information within species, such as DNA and epigenetic modifications. These two different modes of variation are both driven by the same four basic processes, but approaches to these processes sometimes differ considerably. For example, macroecology often documents patterns without modeling underlying processes, with some notable exceptions. On the other hand, evolutionary biologists have a long history of deriving and testing mathematical genetic forecasts, previously focusing on entropies such as heterozygosity. Macroecology calls this Gini-Simpson, and has borrowed the genetic predictions, but sometimes this measure has shortcomings. Therefore it is important to note that predictive equations have now been derived for molecular diversity based on Shannon entropy and mutual information. As a result, we can now forecast all major types of entropy/information, creating a general predictive approach for the four basic processes in ecology and evolution. Additionally, the use of these methods will allow seamless integration with other studies such as the physical environment, and may even extend to assisting with evolutionary algorithms.
9
Zhou X, Chan KCC. Detecting gene-gene interactions for complex quantitative traits using generalized fuzzy classification. BMC Bioinformatics 2018; 19:329. [PMID: 30227829 PMCID: PMC6145205 DOI: 10.1186/s12859-018-2361-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/27/2017] [Accepted: 09/09/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Quantitative traits or continuous outcomes related to complex diseases can provide more information and therefore more accurate analysis for identifying gene-gene and gene-environment interactions associated with complex diseases. Multifactor Dimensionality Reduction (MDR) was originally proposed to identify gene-gene and gene-environment interactions associated with binary status of complex diseases. Some efforts have been made to extend it to quantitative traits (QTs) and ordinal traits. However, these and other methods are still not computationally efficient or effective. RESULTS Generalized Fuzzy Quantitative trait MDR (GFQMDR) is proposed in this paper to strengthen identification of gene-gene interactions associated with a quantitative trait by first transforming it to an ordinal trait and then selecting the best sets of genetic markers, mainly single nucleotide polymorphisms (SNPs) or simple sequence length polymorphic markers (SSLPs), as having strong association with the trait through generalized fuzzy classification using extended member functions. Experimental results on simulated datasets and real datasets show that our algorithm has a better success rate, classification accuracy and consistency in identifying gene-gene interactions associated with QTs. CONCLUSION The proposed algorithm provides a more effective way to identify gene-gene interactions associated with quantitative traits.
Affiliation(s)
- Xiangdong Zhou
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian, China
- Keith C. C. Chan
  - Department of Computing, the Hong Kong Polytechnic University, Kowloon, Hong Kong, China
10
Sherwin WB, Chao A, Jost L, Smouse PE. Information Theory Broadens the Spectrum of Molecular Ecology and Evolution. Trends Ecol Evol 2017; 32:948-963. [PMID: 29126564 DOI: 10.1016/j.tree.2017.09.012] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 03/06/2017] [Revised: 09/22/2017] [Accepted: 09/26/2017] [Indexed: 01/18/2023]
Abstract
Information or entropy analysis of diversity is used extensively in community ecology, and has recently been exploited for prediction and analysis in molecular ecology and evolution. Information measures belong to a spectrum (or q profile) of measures whose contrasting properties provide a rich summary of diversity, including allelic richness (q=0), Shannon information (q=1), and heterozygosity (q=2). We present the merits of information measures for describing and forecasting molecular variation within and among groups, comparing forecasts with data, and evaluating underlying processes such as dispersal. Importantly, information measures directly link causal processes and divergence outcomes, have straightforward relationship to allele frequency differences (including monotonicity that q=2 lacks), and show additivity across hierarchical layers such as ecology, behaviour, cellular processes, and nongenetic inheritance.
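Note: the q profile mentioned in the abstract can be illustrated with Hill numbers ("effective numbers" of alleles), a standard way to place richness, Shannon information, and heterozygosity on one scale. The sketch below assumes that representation; the function name and the example frequencies are hypothetical. Under this convention, q=0 gives allelic richness, the logarithm of the q=1 value is Shannon entropy, and heterozygosity equals one minus the reciprocal of the q=2 value.

```python
from math import exp, log


def hill_number(freqs, q):
    """Effective number of alleles of order q for a list of allele frequencies."""
    positive = [f for f in freqs if f > 0]
    total = sum(positive)
    p = [f / total for f in positive]  # normalise so frequencies sum to 1
    if q == 1:
        # The general formula is undefined at q = 1; its limit is exp(Shannon entropy).
        return exp(-sum(x * log(x) for x in p))
    return sum(x ** q for x in p) ** (1 / (1 - q))


# Hypothetical allele frequencies at one locus.
freqs = [0.7, 0.2, 0.1]
for q in (0, 1, 2):
    print(q, hill_number(freqs, q))

# Heterozygosity (Gini-Simpson) recovered from the q = 2 value.
print(1 - 1 / hill_number(freqs, 2))
```

For perfectly even frequencies all orders agree and equal the number of alleles; the profile falls with increasing q as the distribution becomes more uneven.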
Affiliation(s)
- W B Sherwin
  - Evolution and Ecology Research Centre, School of Biological Earth and Environmental Science, University of New South Wales, Sydney, NSW 2052, Australia
  - Murdoch University Cetacean Research Unit, Murdoch University, South Road, Murdoch, WA 6150, Australia
- A Chao
  - Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043, Taiwan
- L Jost
  - EcoMinga Foundation, Via a Runtun, Baños, Tungurahua, Ecuador
- P E Smouse
  - Department of Ecology, Evolution and Natural Resources, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, NJ 08901-8551, USA
11
Yang CH, Lin YD, Chuang LY, Chen JB, Chang HW. Joint Analysis of SNP-SNP-Environment Interactions for Chronic Dialysis by an Improved Branch and Bound Algorithm. J Comput Biol 2017; 24:1212-1225. [PMID: 28876085 DOI: 10.1089/cmb.2017.0090] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022] Open
Abstract
In previous studies, both single-nucleotide polymorphism (SNP)-SNP or gene-gene (G × G) interactions and SNP-environmental factor (G × E) interactions were reported to partially account for "missing" heritability. However, (G × G) × E interactions were less commonly addressed. The purpose of this study was to develop a novel strategy to evaluate possible (G × G) × E interactions in D-loop-based chronic dialysis association. Using values from our previously published data set (704 controls and 193 cases) of 77 D-loop SNPs and 7 environmental factors (coronary heart disease, hypertension, diabetes mellitus, triglyceride, cholesterol, blood thiol, and TBARS levels), we compared the performances of G, G × G, G × E, and (G × G) × E. We found that the interactions of four individual SNPs previously associated with a significantly high risk of chronic dialysis [odds ratio (OR) = 1.56-4.93] with environmental factors (G × E) increased the risk of chronic dialysis (maximum OR = 35.43). We then used an improved branch and bound algorithm to identify combinations of two to four SNPs that were most highly associated with chronic dialysis (OR = 9.27-34.39). When the interactions of the two- and three-SNP combinations with environmental factors were evaluated, we found that the (G × G) × E effects increased the risk of chronic dialysis (maximum OR = 8.32-57.54 and OR = 12.52-57.81, respectively; adjusted OR = 8.67-81.81 and OR = 12.29-81.95, respectively). Taken together, the (G × G) × E interactions identified chronic dialysis-associated SNPs that would not have been found using G × G or G × E interactions, suggesting that (G × G) × E interactions may be helpful to solve the problems of missing heritability in association studies.
Affiliation(s)
- Cheng-Hong Yang
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
  - Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Yu-Da Lin
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
- Li-Yeh Chuang
  - Department of Chemical Engineering & Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
- Jin-Bor Chen
  - Division of Nephrology, Department of Internal Medicine, Mitochondrial Research Unit, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan
- Hsueh-Wei Chang
  - Institute of Medical Science and Technology, National Sun Yat-Sen University, Kaohsiung, Taiwan
  - Department of Medical Research, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan
  - Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan
12
CINOEDV: a co-information based method for detecting and visualizing n-order epistatic interactions. BMC Bioinformatics 2016; 17:214. [PMID: 27184783 PMCID: PMC4869388 DOI: 10.1186/s12859-016-1076-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 11/26/2015] [Accepted: 05/07/2016] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs) or epistatic interactions are important topics in bioinformatics since they play an important role in unraveling the mystery of "missing heritability". However, related studies are almost limited to pairwise epistatic interactions due to their methodological and computational challenges. RESULTS We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. CONCLUSIONS Experiments of CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: http://cran.r-project.org and https://sourceforge.net/projects/cinoedv/files/ .
13
The application of information theory for the research of aging and aging-related diseases. Prog Neurobiol 2016; 157:158-173. [PMID: 27004830 DOI: 10.1016/j.pneurobio.2016.03.005] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 08/18/2015] [Revised: 03/13/2016] [Accepted: 03/19/2016] [Indexed: 11/23/2022]
Abstract
This article reviews the application of information-theoretical analysis, employing measures of entropy and mutual information, for the study of aging and aging-related diseases. The research of aging and aging-related diseases is particularly suitable for the application of information theory methods, as aging processes and related diseases are multi-parametric, with continuous parameters coexisting alongside discrete parameters, and with the relations between the parameters being as a rule non-linear. Information theory provides unique analytical capabilities for the solution of such problems, with unique advantages over common linear biostatistics. Among the age-related diseases, information theory has been used in the study of neurodegenerative diseases (particularly using EEG time series for diagnosis and prediction), cancer (particularly for establishing individual and combined cancer biomarkers), diabetes (mainly utilizing mutual information to characterize the diseased and aging states), and heart disease (mainly for the analysis of heart rate variability). Few works have employed information theory for the analysis of general aging processes and frailty, as underlying determinants and possible early preclinical diagnostic measures for aging-related diseases. Generally, the use of information-theoretical analysis permits not only establishing the (non-linear) correlations between diagnostic or therapeutic parameters of interest, but may also provide a theoretical insight into the nature of aging and related diseases by establishing the measures of variability, adaptation, regulation or homeostasis, within a system of interest. It may be hoped that the increased use of such measures in research may considerably increase diagnostic and therapeutic capabilities and the fundamental theoretical mathematical understanding of aging and disease.
14
Yee J, Kwon MS, Jin S, Park T, Park M. Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure. Biomed Res Int 2015; 2015:523641. [PMID: 26339620 PMCID: PMC4538333 DOI: 10.1155/2015/523641] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/14/2014] [Revised: 02/04/2015] [Accepted: 03/08/2015] [Indexed: 11/17/2022]
Abstract
A number of statistical methods for detecting gene-gene interactions have been developed in genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative, and categorizing continuous traits may not always be straightforward or meaningful. The association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly, without categorization. Information gain based on an entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric method for evaluating the conditional entropy of a quantitative phenotype associated with a given genotype. Hence, the information gain can be obtained for any phenotype distribution. Because no functional form, such as Gaussian, is assumed for the distribution of a trait, either overall or for a given genotype, this method is expected to be robust enough to be applied to any phenotypic association data. Here, we show its use to successfully identify the main effect, as well as the genetic interactions, associated with a quantitative trait.
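The idea in the abstract above can be illustrated with a minimal sketch. It assumes the classical Vasicek m-spacing estimator as a stand-in for the nonparametric entropy evaluation; the paper's exact estimator, its choice of m, and the function names `m_spacing_entropy` and `information_gain` are assumptions made here for illustration only.

```python
import numpy as np

def m_spacing_entropy(y, m=None):
    """Vasicek-style m-spacing estimate of differential entropy (nats).

    Assumption: m defaults to round(sqrt(n)), a common heuristic,
    not necessarily the choice made in the cited paper.
    """
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    if m is None:
        m = max(1, int(round(np.sqrt(n))))
    idx = np.arange(n)
    # Order statistics m positions above and below, clipped at the sample ends.
    upper = y[np.minimum(idx + m, n - 1)]
    lower = y[np.maximum(idx - m, 0)]
    spacing = np.maximum(upper - lower, 1e-12)  # guard against log(0) on ties
    return float(np.mean(np.log(n / (2.0 * m) * spacing)))

def information_gain(y, genotype, m=None):
    """H(Y) minus the genotype-weighted conditional entropy H(Y | G)."""
    y = np.asarray(y, dtype=float)
    genotype = np.asarray(genotype)
    h_conditional = 0.0
    for g in np.unique(genotype):
        mask = genotype == g
        h_conditional += mask.mean() * m_spacing_entropy(y[mask], m)
    return m_spacing_entropy(y, m) - h_conditional
```

For an interaction, `genotype` could hold a combined label for two or more SNPs (for example `3 * snp1 + snp2`), so the same function covers main effects and joint effects. Small genotype groups make the estimate unstable, so in practice a permutation baseline would likely be needed before interpreting a value.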
Affiliation(s)
- Jaeyong Yee
- Department of Physiology and Biophysics, Eulji University, Daejeon, Republic of Korea
- Min-Seok Kwon
- Department of Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Seohoon Jin
- Department of Informational Statistics, Korea University, Jochiwon, Republic of Korea
- Taesung Park
- Department of Statistics, Seoul National University, Seoul, Republic of Korea
- Mira Park
- Department of Preventive Medicine, Eulji University, Daejeon, Republic of Korea
|
15
|
An Improved Opposition-Based Learning Particle Swarm Optimization for the Detection of SNP-SNP Interactions. BioMed Research International 2015; 2015:524821. [PMID: 26236727 PMCID: PMC4509494 DOI: 10.1155/2015/524821] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/25/2014] [Revised: 12/30/2014] [Accepted: 01/02/2015] [Indexed: 12/22/2022]
Abstract
SNP-SNP interactions have been receiving increasing attention in understanding the mechanisms underlying susceptibility to complex diseases. Although much work has been done on the detection of SNP-SNP interactions, algorithmic development is still ongoing. In this study, an improved opposition-based learning particle swarm optimization (IOBLPSO) is proposed for the detection of SNP-SNP interactions. IOBLPSO introduces three strategies: opposition-based learning, dynamic inertia weight, and a postprocedure. Opposition-based learning not only enhances the global explorative ability but also avoids premature convergence. Dynamic inertia weight allows particles to cover a wider search space when the considered SNP is likely to be a random one and to converge on promising regions of the search space when capturing a highly suspected SNP. The postprocedure carries out a deep search in highly suspected SNP sets. Experiments with IOBLPSO are performed on both simulated data sets and a real data set of age-related macular degeneration, and the results demonstrate that IOBLPSO is promising in detecting SNP-SNP interactions. IOBLPSO might be an alternative to existing methods for detecting SNP-SNP interactions.
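Of the three strategies named above, opposition-based learning is the easiest to illustrate in isolation. The sketch below shows the generic technique only, applied to swarm initialisation in a continuous box; it is not IOBLPSO itself, and the function name `opposition_init`, the fitness function, and the bounds are hypothetical.

```python
import numpy as np

def opposition_init(fitness, lo, hi, n_particles, dim, seed=0):
    """Generic opposition-based initialisation (minimisation).

    For each random particle x in [lo, hi], also evaluate its opposite
    point lo + hi - x, then keep the n_particles best of the combined set.
    """
    rng = np.random.default_rng(seed)
    swarm = rng.uniform(lo, hi, size=(n_particles, dim))
    opposite = lo + hi - swarm  # mirror image within the search box
    candidates = np.vstack([swarm, opposite])
    scores = np.apply_along_axis(fitness, 1, candidates)
    keep = np.argsort(scores)[:n_particles]  # lowest fitness first
    return candidates[keep], scores[keep]
```

Because the opposite points are evaluated in addition to the random ones, the best retained particle is never worse than the best purely random particle, at the cost of twice as many fitness evaluations. For SNP selection the positions would index SNPs rather than real coordinates, which this continuous sketch does not attempt to model.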
|
16
|
Expected Shannon Entropy and Shannon Differentiation between Subpopulations for Neutral Genes under the Finite Island Model. PLoS One 2015; 10:e0125471. [PMID: 26067448 PMCID: PMC4465833 DOI: 10.1371/journal.pone.0125471] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 07/30/2014] [Accepted: 03/24/2015] [Indexed: 01/21/2023] Open
Abstract
Shannon entropy H and related measures are increasingly used in molecular ecology and population genetics because (1) unlike measures based on heterozygosity or allele number, these measures weigh alleles in proportion to their population fraction, thus capturing a previously ignored aspect of allele frequency distributions that may be important in many applications; (2) these measures connect directly to the rich predictive mathematics of information theory; (3) Shannon entropy is completely additive and has an explicitly hierarchical nature; and (4) Shannon entropy-based differentiation measures obey strong monotonicity properties that heterozygosity-based measures lack. We derive simple new expressions for the expected values of the Shannon entropy of the equilibrium allele distribution at a neutral locus in a single isolated population under two models of mutation: the infinite allele model and the stepwise mutation model. Surprisingly, this complex stochastic system for each model has an entropy expressible as a simple combination of well-known mathematical functions. Moreover, entropy- and heterozygosity-based measures for each model are linked by simple relationships that are shown by simulations to be approximately valid even far from equilibrium. We also identify a bridge between the two models of mutation. We apply our approach to subdivided populations that follow the finite island model, obtaining the Shannon entropy of the equilibrium allele distributions of the subpopulations and of the total population. We also derive the expected mutual information and normalized mutual information ("Shannon differentiation") between subpopulations at equilibrium, and identify the model parameters that determine them. We apply our measures to data from the common starling (Sturnus vulgaris) in Australia. Our measures provide a test for neutrality that is robust to violations of equilibrium assumptions, as verified on real-world data from starlings.
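The empirical counterparts of these quantities are straightforward to compute from allele counts. The sketch below is a plug-in calculation, not the paper's expected-value derivations; it assumes that the mutual information is normalised by the entropy of the subpopulation weights, which is one common convention and may differ from the paper's. The function names are illustrative.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (nats) of a probability vector; zero entries are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def shannon_differentiation(counts):
    """Mutual information between subpopulation and allele, raw and normalised.

    counts: array of shape (n_subpopulations, n_alleles) of allele counts.
    Assumption: normalisation by the entropy of the subpopulation weights,
    which requires at least two non-empty subpopulations.
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()
    weights = joint.sum(axis=1)   # relative size of each subpopulation
    pooled = joint.sum(axis=0)    # allele frequencies in the total population
    h_within = sum(w * shannon_entropy(row / w)
                   for w, row in zip(weights, joint) if w > 0)
    mi = shannon_entropy(pooled) - h_within
    return mi, mi / shannon_entropy(weights)
```

Under this normalisation, two subpopulations that share no alleles give a value of 1 and two with identical allele frequencies give 0.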
|
17
|
Genetic factors associated with gemcitabine pharmacokinetics, disposition, and toxicity. Pharmacogenet Genomics 2014; 24:15-25. [PMID: 24225399 DOI: 10.1097/fpc.0000000000000016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/26/2022]
Abstract
AIM The goal of this work was to investigate the associations of genetic and environmental factors with gemcitabine disposition and toxicity from genomewide data using a novel information theoretic approach. METHODS We utilized the information theoretic K-way interaction information (KWII) metric to detect gene-gene and gene-environment interactions associated with gemcitabine disposition and gemcitabine-induced neutropenia in genomic and clinical data from Japanese cancer patients. RESULTS The information theoretic KWII analyses identified age and four genes - DMD, HEXDC, CNTN4, and ALOX5AP - to be associated with gemcitabine pharmacokinetics (PK). The rs4769060 single-nucleotide polymorphism in the ALOX5AP gene was associated with all PK parameters studied. For gemcitabine-induced neutropenia, multiple associations with long intergenic noncoding RNA regions were detected. Pathway analysis identified leukotriene and eoxin synthesis, platelet homeostasis, and L1CAM interactions as potential pathways associated with gemcitabine disposition. CONCLUSION The KWII analyses detected novel associations with gemcitabine PK and toxicity. These results could be used to inform future investigations involving gemcitabine efficacy in clinical settings.
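For discrete variables, the K-way interaction information is usually written as an alternating sum of joint entropies over all non-empty subsets of the variables, with the phenotype typically included as one of the variables. The sketch below follows that standard definition with plug-in entropies; the function names are illustrative, and the preprocessing, search strategy, and significance testing used in the study above are not reproduced.

```python
from itertools import combinations

import numpy as np

def joint_entropy(columns):
    """Plug-in Shannon entropy (nats) of the joint distribution of discrete columns."""
    stacked = np.column_stack(columns)
    _, counts = np.unique(stacked, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def kwii(columns):
    """K-way interaction information of a list of discrete 1-D arrays.

    KWII(V) = -sum over non-empty subsets T of V of (-1)^(|V|-|T|) * H(T).
    For two variables this reduces to their mutual information.
    """
    k = len(columns)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(columns, size):
            total += sign * joint_entropy(subset)
    return -total
```

As a small check, take two independent binary variables and a third equal to their exclusive-or: every pair is uninformative, yet the three-way KWII equals ln 2, which is the kind of purely joint signal the metric is meant to surface. The number of subsets grows as 2^K, so exhaustive evaluation is only practical for small K.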
|
18
|
Yee J, Kwon MS, Park T, Park M. A modified entropy-based approach for identifying gene-gene interactions in case-control study. PLoS One 2013; 8:e69321. [PMID: 23874943 PMCID: PMC3715501 DOI: 10.1371/journal.pone.0069321] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 02/14/2013] [Accepted: 06/12/2013] [Indexed: 11/24/2022] Open
Abstract
Gene-gene interactions may play an important role in the genetics of a complex disease. Detection and characterization of gene-gene interactions is a challenging issue that has stimulated the development of various statistical methods to address it. In this study, we introduce a method to measure gene interactions using entropy-based statistics from a contingency table of trait and genotype combinations. We also developed an exploration procedure by using graphs. We propose a standardized relative information gain (RIG) measure to evaluate the interactions between single nucleotide polymorphism (SNP) combinations. To identify the kth order interactions, contingency tables of trait and genotype combinations of k SNPs are constructed, with which RIGs are calculated. The RIGs are standardized using the mean and standard deviation from the permuted datasets. SNP combinations yielding high standardized RIG are chosen for gene-gene interactions. Detection of high-order interactions and comparison of interaction strengths between different orders are made possible by using standardized RIG. We have applied the proposed standardized entropy-based method to two types of data sets from a simulation study and a real genetic association study. We have compared our method and the multifactor dimensionality reduction (MDR) method through power analysis of eight different genetic models with varying penetrance rates, number of SNPs, and sample sizes. Our method shows successful identification of genetic associations and gene-gene interactions both in simulation and real genetic data. Simulation results suggest that the proposed entropy-based method is better able to detect high-order interactions and is superior to the MDR method in most cases. The proposed method is well suited for detecting interactions without main effects as well as for models including main effects.
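The standardisation step described above can be sketched as follows. The sketch assumes that RIG is the information gain about the trait divided by the trait entropy, and that the null distribution comes from permuting trait labels; the paper's exact definition, number of permutations, and the function names used here are assumptions.

```python
import numpy as np

def entropy(labels):
    """Plug-in Shannon entropy (nats) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def relative_information_gain(trait, snps):
    """(H(trait) - H(trait | genotype combination)) / H(trait).

    snps: array of shape (n_samples,) or (n_samples, k) for a k-SNP combination.
    The trait must take at least two values, otherwise H(trait) is zero.
    """
    trait = np.asarray(trait)
    snps = np.asarray(snps).reshape(len(trait), -1)
    # Each distinct row is one cell of the genotype-combination contingency table.
    _, combo = np.unique(snps, axis=0, return_inverse=True)
    combo = combo.ravel()
    h_trait = entropy(trait)
    h_cond = sum((combo == c).mean() * entropy(trait[combo == c])
                 for c in np.unique(combo))
    return (h_trait - h_cond) / h_trait

def standardized_rig(trait, snps, n_perm=200, seed=0):
    """Z-score of the observed RIG against trait-permuted datasets."""
    rng = np.random.default_rng(seed)
    observed = relative_information_gain(trait, snps)
    null = np.array([relative_information_gain(rng.permutation(trait), snps)
                     for _ in range(n_perm)])
    return (observed - null.mean()) / null.std(ddof=1)
```

Standardising matters because the raw RIG tends to grow with the number of contingency-table cells even under the null, so a k-SNP combination cannot be compared fairly with a (k+1)-SNP combination on the raw scale; the permutation z-score puts different orders on a common footing.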
Affiliation(s)
- Jaeyong Yee
- Department of Physiology and Biophysics, Eulji University, Daejeon, Korea
- Min-Seok Kwon
- Department of Bioinformatics, Seoul National University, Seoul, Korea
- Taesung Park
- Department of Statistics, Seoul National University, Seoul, Korea
- Mira Park
- Department of Preventive Medicine, Eulji University, Daejeon, Korea
|
19
|
SYMPHONY, an information-theoretic method for gene-gene and gene-environment interaction analysis of disease syndromes. Heredity (Edinb) 2013; 110:548-59. [PMID: 23423149 DOI: 10.1038/hdy.2012.123] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022] Open
Abstract
We develop an information-theoretic method for the analysis of gene-gene interactions (GGI) and gene-environment interactions (GEI) in syndromes, defined as a phenotype vector comprising multiple quantitative traits (QTs). The K-way interaction information (KWII), an information-theoretic metric, was derived for multivariate normally distributed phenotype vectors. The utility of the method was challenged with three simulated data sets, the Genetic Association Workshop-15 (GAW15) rheumatoid arthritis data set, a high-density lipoprotein (HDL) and atherosclerosis data set from a mouse QT locus study, and the 1000 Genomes data. The dependence of the KWII on effect size, minor allele frequency, linkage disequilibrium, and population stratification/admixture, as well as the power and computational time requirements of the novel method, were systematically assessed in simulation studies. In these studies, phenotype vectors containing two and three constituent multivariate normally distributed QTs were used, and the KWII was found to be effective at detecting GEI associated with the phenotype. High KWII values were observed for variables and variable combinations associated with the syndrome phenotype compared with uninformative variables not associated with the phenotype. The KWII values for the phenotype-associated combinations increased monotonically with increasing effect size. The KWII also exhibited utility in simulations with non-linear dependence between the constituent QTs. Analysis of the HDL and atherosclerosis data set indicated that the simultaneous analysis of both phenotypes identified interactions not detected in the analysis of the individual traits. The information-theoretic approach may be useful for non-parametric analysis of GGI and GEI of complex syndromes.
|
20
|
Vertical Integration of Pharmacogenetics in Population PK/PD Modeling: A Novel Information Theoretic Method. CPT: Pharmacometrics & Systems Pharmacology 2013. [PMCID: PMC3600754 DOI: 10.1038/psp.2012.25] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
This work critically evaluates an information-theoretic method for identifying gene–environmental interactions (GEI) associated with pharmacokinetic (PK), pharmacodynamic (PD), and clinical outcomes from genome-wide pharmacogenetic data. Our approach, which is built on the K-way interaction information (KWII) metric, was challenged with simulated data and clinical PK/PD data sets from the International Warfarin Pharmacogenetics Consortium (IWPC) and a gemcitabine clinical trial. The KWII efficiently identified both novel and known interactions for warfarin and gemcitabine. Interactions between herbal supplementation and VKORC1 genotype were associated with warfarin response. For gemcitabine-associated neutropenia, combination treatment with carboplatin and cytidine deaminase (CDA) 208G→A genotypes were identified as risk factors. Gemcitabine disposition was associated with drug metabolism–transporter interactions between deoxycytidine kinase (DCK) and the equilibrative nucleoside transporter (ENT). This novel approach is effective for detecting GEI involved in drug exposure and response and could enable integration of genome-wide pharmacogenetic data into the population PK/PD analysis paradigm.
|
21
|
Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 110] [Impact Index Per Article: 9.2] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
Interest in performing gene-environment interaction studies has increased significantly with the rise of advanced molecular genetics techniques. In practice, it has become possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors for disease. Several rationales exist for setting up gene-environment interaction studies, and the technical challenges related to these studies, when the number of environmental or genetic risk factors is relatively small, have been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings a whole range of new challenges and opportunities. Despite a continuing effort to develop efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how best to present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEI studies are performed at the intersection of statistical genetics, bioinformatics and epidemiology, they usually face problems similar to those of genome-wide association gene-gene interaction studies. However, additional complexities need to be considered, which are typical of large-scale epidemiological studies but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Affiliation(s)
- Hugues Aschard
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
|
22
|
Van Steen K. Perspectives on genome-wide multi-stage family-based association studies. Stat Med 2011; 30:2201-21. [DOI: 10.1002/sim.4259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/16/2010] [Accepted: 03/07/2011] [Indexed: 01/03/2023]
|
23
|
|
24
|
|
25
|
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. Entropy 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
|