1
Chu X, Jiang M, Liu ZJ. Biomarker interaction selection and disease detection based on multivariate gain ratio. BMC Bioinformatics 2022; 23:176. [PMID: 35550010 PMCID: PMC9103137 DOI: 10.1186/s12859-022-04699-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 04/14/2022] [Indexed: 11/30/2022]
Abstract
Background Disease detection is an important aspect of biotherapy. With the development of biotechnology and computer technology, there are many methods to detect disease based on a single biomarker. However, in some cases a biomarker does not influence disease alone; it is the interaction between biomarkers that determines disease status. The existing influence measure I-score is used to evaluate the importance of an interaction in determining disease status, but applying I-score introduces a deviation related to the number of variables in the interaction. To solve the problem, we propose a new influence measure, Multivariate Gain Ratio (MGR), based on the single-variate Gain Ratio (GR), which provides us with a multivariate combination called an interaction. Results We propose a preprocessing verification algorithm based on partial predictor variables to select an appropriate preprocessing method. In this paper, an algorithm for selecting key interactions of biomarkers and applying key interactions to construct a disease detection model is provided. MGR is more credible than I-score in the case of interactions containing a small number of variables. Our method performs better, with an average accuracy of 93.13%, than I-score, with 91.73%, on the Breast Cancer Wisconsin (Diagnostic) Dataset. Compared to the classification result of 89.80% based on all predictor variables, MGR identifies the true main biomarkers and realizes dimension reduction. On the Leukemia Dataset, the experimental results show the effectiveness of MGR, with an accuracy of 97.32% compared to I-score with an accuracy of 89.11%. The results can be explained by the nature of MGR and I-score mentioned above, because every key interaction contains a small number of variables in the Leukemia Dataset. Conclusions MGR is effective for selecting important biomarkers and biomarker interactions even in a high-dimensional feature space in which an interaction could contain more than two biomarkers. The prediction ability of interactions selected by MGR is better than that of I-score in the case of interactions containing a small number of variables. MGR is generally applicable to various types of biomarker datasets, including cell nuclei, gene, SNP and protein datasets.
Affiliation(s)
- Xiao Chu
  - Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mao Jiang
  - Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zhuo-Jun Liu
  - Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
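The abstract of Chu et al. above builds on the single-variate gain ratio but does not state the MGR formula. The following is a minimal sketch, assuming the standard gain ratio (information gain divided by split information) and one plausible multivariate reading in which the joint configuration of several predictors is treated as a single categorical variable. The function names `gain_ratio` and `joint_gain_ratio` are hypothetical, and the published MGR definition may differ.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(x, y):
    """Standard gain ratio of predictor x for class y: IG(y; x) / H(x)."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    # Conditional entropy H(y | x), weighted by the size of each group.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(x)
    if split_info == 0:
        return 0.0  # x is constant and carries no information
    return (entropy(y) - conditional) / split_info


def joint_gain_ratio(columns, y):
    """Hypothetical multivariate variant (an assumption, not the paper's MGR):
    treat each row's tuple of predictor values as one categorical value."""
    return gain_ratio(list(zip(*columns)), y)


# XOR toy data: neither predictor is informative alone, the pair is.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]
print(gain_ratio(a, y))             # 0.0
print(joint_gain_ratio([a, b], y))  # 0.5
```

The XOR data illustrates why interaction-level measures matter: each variable scores zero on its own, while the pair fully determines the class.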
2
Gómez-Guerrero S, Ortiz I, Sosa-Cabrera G, García-Torres M, Schaerer CE. Measuring Interactions in Categorical Datasets Using Multivariate Symmetrical Uncertainty. Entropy (Basel, Switzerland) 2021; 24:64. [PMID: 35052090 PMCID: PMC8774864 DOI: 10.3390/e24010064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Revised: 11/29/2021] [Accepted: 12/01/2021] [Indexed: 11/17/2022]
Abstract
Interaction between variables is often found in statistical models, and it is usually expressed in the model as an additional term when the variables are numeric. However, when the variables are categorical (also known as nominal or qualitative) or mixed numerical-categorical, defining, detecting, and measuring interactions is not a simple task. In this work, based on an entropy-based correlation measure for n nominal variables, named Multivariate Symmetrical Uncertainty (MSU), we propose a formal and broader definition for the interaction of the variables. Two series of experiments are presented. In the first series, we observe that datasets in which some record types or combinations of categories are absent, forming patterns of records, often display interactions among their attributes. In the second series, the interaction/non-interaction behavior of a regression model (entirely built on continuous variables) is successfully replicated under a discretized version of the dataset. It is shown that there is an interaction-wise correspondence between the continuous and the discretized versions of the dataset. Hence, we demonstrate that the proposed definition of interaction enabled by the MSU is a valuable tool for detecting and measuring interactions within linear and non-linear models.
Affiliation(s)
- Santiago Gómez-Guerrero
  - Polytechnic School, National University of Asuncion, San Lorenzo 2111, Paraguay
- Inocencio Ortiz
  - Polytechnic School, National University of Asuncion, San Lorenzo 2111, Paraguay
- Gustavo Sosa-Cabrera
  - Polytechnic School, National University of Asuncion, San Lorenzo 2111, Paraguay
- Miguel García-Torres
  - Data Science and Big Data Lab, Universidad Pablo de Olavide, ES-41013 Seville, Spain
- Christian E. Schaerer
  - Polytechnic School, National University of Asuncion, San Lorenzo 2111, Paraguay
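The abstract of Gómez-Guerrero et al. above names the MSU measure without giving its formula. The sketch below assumes the form MSU(X_1, ..., X_n) = n/(n-1) * (1 - H(X_1, ..., X_n) / (H(X_1) + ... + H(X_n))), which for n = 2 reduces to the usual symmetrical uncertainty. Treat both the formula and the function name `msu` as assumptions to be checked against the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def msu(columns):
    """Assumed MSU form: n/(n-1) * (1 - H(joint) / sum of marginal entropies)."""
    n = len(columns)
    if n < 2:
        raise ValueError("MSU needs at least two variables")
    joint = entropy(list(zip(*columns)))  # entropy of the row tuples
    total = sum(entropy(col) for col in columns)
    if total == 0:
        return 0.0  # every variable is constant
    return n / (n - 1) * (1 - joint / total)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
xor = [0, 1, 1, 0]

print(msu([a, a]))       # identical variables: 1.0
print(msu([a, b]))       # independent variables: 0.0
print(msu([a, b, xor]))  # three-way XOR pattern: approximately 0.5
```

In the three-variable case every pair is independent, so the nonzero value comes entirely from the joint pattern, which is the kind of absent-combination structure the abstract describes.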
3
Zhou X, Chan KCC, Huang Z, Wang J. Determining dependency and redundancy for identifying gene-gene interaction associated with complex disease. J Bioinform Comput Biol 2020; 18:2050035. [PMID: 33064052 DOI: 10.1142/s0219720020500353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
As interactions among genetic variants in different genes can be an important factor for predicting complex diseases, many computational methods have been proposed to detect if a particular set of genes has interaction with a particular complex disease. However, even though many such methods have been shown to be useful, they can be made more effective if the properties of gene-gene interactions can be better understood. Towards this goal, we have attempted to uncover patterns in gene-gene interactions and the patterns reveal an interesting property that can be reflected in an inequality that describes the relationship between two genotype variables and a disease-status variable. We show, in this paper, that this inequality can be generalized to [Formula: see text] genotype variables. Based on this inequality, we establish a conditional independence and redundancy (CIR)-based definition of gene-gene interaction and the concept of an interaction group. From these new definitions, a novel measure of gene-gene interaction is then derived. We discuss the properties of these concepts and explain how they can be used in a novel algorithm to detect high-order gene-gene interactions. Experimental results using both simulated and real datasets show that the proposed method can be very promising.
Affiliation(s)
- Xiangdong Zhou
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
- Keith C C Chan
  - Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, P. R. China
- Zhihua Huang
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
- Jingbin Wang
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, P. R. China
4
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel, Switzerland) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
  - Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
  - Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
  - Corteva Agriscience™, Johnston, IA 50131, USA
5
Granada AE, Jiménez A, Stewart-Ornstein J, Blüthgen N, Reber S, Jambhekar A, Lahav G. The effects of proliferation status and cell cycle phase on the responses of single cells to chemotherapy. Mol Biol Cell 2020; 31:845-857. [PMID: 32049575 PMCID: PMC7185964 DOI: 10.1091/mbc.e19-09-0515] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/16/2023]
Abstract
DNA-damaging chemotherapeutics are widely used in cancer treatments, but for solid tumors they often leave a residual tumor-cell population. Here we investigated how cellular states might affect the response of individual cells in a clonal population to cisplatin, a DNA-damaging chemotherapeutic agent. Using a live-cell reporter of cell cycle phase and long-term imaging, we monitored single-cell proliferation before, at the time of, and after treatment. We found that in response to cisplatin, cells either arrested or died, and the ratio of these outcomes depended on the dose. While we found that the cell cycle phase at the time of cisplatin addition was not predictive of outcome, the proliferative history of the cell was: highly proliferative cells were more likely to arrest than to die, whereas slowly proliferating cells showed a higher probability of death. Information theory analysis revealed that the dose of cisplatin had the greatest influence on the cells’ decisions to arrest or die, and that the proliferation status interacted with the cisplatin dose to further guide this decision. These results show an unexpected effect of proliferation status in regulating responses to cisplatin and suggest that slowly proliferating cells within tumors may be acutely vulnerable to chemotherapy.
Affiliation(s)
- Adrián E Granada
  - IRI Life Sciences, Humboldt University Berlin, 10115 Berlin, Germany
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Alba Jiménez
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Jacob Stewart-Ornstein
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115
  - Department of Computational and Systems Biology, University of Pittsburgh Medical School, Pittsburgh, PA 15260
- Nils Blüthgen
  - IRI Life Sciences, Humboldt University Berlin, 10115 Berlin, Germany
  - Institute of Pathology, Charité Universitätsmedizin Berlin, 10117 Berlin, Germany
  - German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
  - Berlin Institute of Health (BIH), 10178 Berlin, Germany
- Simone Reber
  - IRI Life Sciences, Humboldt University Berlin, 10115 Berlin, Germany
  - University of Applied Sciences Berlin, 13353 Berlin, Germany
- Ashwini Jambhekar
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Galit Lahav
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115
6
Yan J, Risacher SL, Shen L, Saykin AJ. Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Brief Bioinform 2019; 19:1370-1381. [PMID: 28679163 DOI: 10.1093/bib/bbx066] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Received: 02/23/2017] [Indexed: 11/14/2022]
Abstract
In the past decade, significant progress has been made in complex disease research across multiple omics layers from genome, transcriptome and proteome to metabolome. There is an increasing awareness of the importance of biological interconnections, and much success has been achieved using systems biology approaches. However, because of the typical focus on one single omics layer at a time, existing systems biology findings explain only a modest portion of complex disease. Recent advances in multi-omics data collection and sharing present us new opportunities for studying complex diseases in a more comprehensive fashion, and yet simultaneously create new challenges considering the unprecedented data dimensionality and diversity. Here, our goal is to review extant and emerging network approaches that can be applied across multiple biological layers to facilitate a more comprehensive and integrative multilayered omics analysis of complex diseases.
Affiliation(s)
- Jingwen Yan
  - Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University Indianapolis, USA
- Shannon L Risacher
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA
- Li Shen
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA
- Andrew J Saykin
  - Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA
7
Selecting variants of unknown significance through network-based gene-association significantly improves risk prediction for disease-control cohorts. Sci Rep 2019; 9:3266. [PMID: 30824863 PMCID: PMC6397233 DOI: 10.1038/s41598-019-39796-w] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/17/2018] [Accepted: 01/31/2019] [Indexed: 12/12/2022]
Abstract
Variants of unknown/uncertain significance (VUS) pose a huge dilemma in current genetic variation screening methods and genetic counselling. Driven by methods of next generation sequencing (NGS) such as whole exome sequencing (WES), a plethora of VUS are being detected in research laboratories as well as in the health sector. Motivated by this overabundance of VUS, we propose a novel computational methodology, termed VariantClassifier (VarClass), which utilizes gene-association networks and polygenic risk prediction models to shed light into this grey area of genetic variation in association with disease. VarClass has been evaluated using numerous validation steps and proves to be very successful in assigning significance to VUS in association with specific diseases of interest. Notably, using VUS that are deemed significant by VarClass, we improved risk prediction accuracy in four large case-studies involving disease-control cohorts from GWAS as well as WES, when compared to traditional odds ratio analysis. Biological interpretation of selected high scoring VUS revealed interesting biological themes relevant to the diseases under investigation. VarClass is available as a standalone tool for large-scale data analyses, as well as a web-server with additional functionalities through a user-friendly graphical interface.
8
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Affiliation(s)
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
  - Corresponding author. Inke R. König, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: ++49 451 500 50610; Fax: ++49 451 500 50604; E-Mail:
9
Entropy, or Information, Unifies Ecology and Evolution and Beyond. Entropy 2018; 20:e20100727. [PMID: 33265816 PMCID: PMC7512290 DOI: 10.3390/e20100727] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/06/2018] [Revised: 08/18/2018] [Accepted: 09/11/2018] [Indexed: 02/07/2023]
Abstract
This article discusses how entropy/information methods are well-suited to analyzing and forecasting the four processes of innovation, transmission, movement, and adaptation, which are the common basis to ecology and evolution. Macroecologists study assemblages of differing species, whereas micro-evolutionary biologists study variants of heritable information within species, such as DNA and epigenetic modifications. These two different modes of variation are both driven by the same four basic processes, but approaches to these processes sometimes differ considerably. For example, macroecology often documents patterns without modeling underlying processes, with some notable exceptions. On the other hand, evolutionary biologists have a long history of deriving and testing mathematical genetic forecasts, previously focusing on entropies such as heterozygosity. Macroecology calls this Gini-Simpson, and has borrowed the genetic predictions, but sometimes this measure has shortcomings. Therefore it is important to note that predictive equations have now been derived for molecular diversity based on Shannon entropy and mutual information. As a result, we can now forecast all major types of entropy/information, creating a general predictive approach for the four basic processes in ecology and evolution. Additionally, the use of these methods will allow seamless integration with other studies such as the physical environment, and may even extend to assisting with evolutionary algorithms.
10
Yang CH, Lin YD, Chuang LY, Chen JB, Chang HW. Joint Analysis of SNP-SNP-Environment Interactions for Chronic Dialysis by an Improved Branch and Bound Algorithm. J Comput Biol 2017; 24:1212-1225. [PMID: 28876085 DOI: 10.1089/cmb.2017.0090] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]
Abstract
In previous studies, both single-nucleotide polymorphism (SNP)-SNP or gene-gene (G × G) interactions and SNP-environmental factor (G × E) interactions were reported to partially account for "missing" heritability. However, (G × G) × E interactions were less commonly addressed. The purpose of this study was to develop a novel strategy to evaluate possible (G × G) × E interactions in D-loop-based chronic dialysis association. Using values from our previously published data set (704 controls and 193 cases) of 77 D-loop SNPs and 7 environmental factors (coronary heart disease, hypertension, diabetes mellitus, triglyceride, cholesterol, blood thiol, and TBARS levels), we compared the performances of G, G × G, G × E, and (G × G) × E. We found that the interactions of four individual SNPs previously associated with a significantly high risk of chronic dialysis [odds ratio (OR) = 1.56-4.93] with environmental factors (G × E) increased the risk of chronic dialysis (maximum OR = 35.43). We then used an improved branch and bound algorithm to identify combinations of two to four SNPs that were most highly associated with chronic dialysis (OR = 9.27-34.39). When the interactions of the two- and three-SNP combinations with environmental factors were evaluated, we found that the (G × G) × E effects increased the risk of chronic dialysis (maximum OR = 8.32-57.54 and OR = 12.52-57.81, respectively; adjusted OR = 8.67-81.81 and OR = 12.29-81.95, respectively). Taken together, the (G × G) × E interactions identified chronic dialysis-associated SNPs that would not have been found using G × G or G × E interactions, suggesting that (G × G) × E interactions may be helpful to solve the problems of missing heritability in association studies.
Affiliation(s)
- Cheng-Hong Yang
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
  - Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Yu-Da Lin
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
- Li-Yeh Chuang
  - Department of Chemical Engineering & Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
- Jin-Bor Chen
  - Division of Nephrology, Department of Internal Medicine, Mitochondrial Research Unit, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan
- Hsueh-Wei Chang
  - Institute of Medical Science and Technology, National Sun Yat-Sen University, Kaohsiung, Taiwan
  - Department of Medical Research, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan
  - Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan
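The abstract of Yang et al. above reports its findings as odds ratios (OR). As a reminder of the arithmetic, the sketch below computes an unadjusted OR from a 2x2 case-control table. The counts are hypothetical illustration values, not data from the study, and `odds_ratio` is a name chosen here; the adjusted ORs in the abstract would need a regression model that this sketch does not cover.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Unadjusted odds ratio: (a * d) / (b * c) for a 2x2 case-control table."""
    if unexposed_cases == 0 or exposed_controls == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the risk combination.
print(round(odds_ratio(30, 70, 10, 90), 2))  # 3.86
```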
11
CINOEDV: a co-information based method for detecting and visualizing n-order epistatic interactions. BMC Bioinformatics 2016; 17:214. [PMID: 27184783 PMCID: PMC4869388 DOI: 10.1186/s12859-016-1076-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 11/26/2015] [Accepted: 05/07/2016] [Indexed: 12/13/2022]
Abstract
BACKGROUND Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs) or epistatic interactions are important topics in bioinformatics since they play an important role in unraveling the mystery of "missing heritability". However, related studies are almost limited to pairwise epistatic interactions due to their methodological and computational challenges. RESULTS We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. CONCLUSIONS Experiments of CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: http://cran.r-project.org and https://sourceforge.net/projects/cinoedv/files/ .
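The CINOEDV abstract above refers to co-information based measures for n-order SNP combinations without defining them. A minimal sketch for the two-SNP case, assuming the interaction is scored as I(A,B;Y) - I(A;Y) - I(B;Y), is shown below. Sign conventions for co-information differ between authors, so this is one common choice and not necessarily the one CINOEDV uses, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(snp_a, snp_b, phenotype):
    """Assumed pairwise interaction score: information the SNP pair carries
    about the phenotype beyond the sum of the two main effects."""
    pair = list(zip(snp_a, snp_b))
    return (mutual_information(pair, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


# Toy epistasis: the phenotype is the XOR of two binary markers.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
print(mutual_information(snp_a, phenotype))       # main effect of A: 0.0
print(interaction_gain(snp_a, snp_b, phenotype))  # pure interaction: 1.0
```

A positive score indicates synergy, where the pair explains more than the two markers do separately; a negative score under this convention indicates redundancy.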
12
An Improved Opposition-Based Learning Particle Swarm Optimization for the Detection of SNP-SNP Interactions. Biomed Research International 2015; 2015:524821. [PMID: 26236727 PMCID: PMC4509494 DOI: 10.1155/2015/524821] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/25/2014] [Revised: 12/30/2014] [Accepted: 01/02/2015] [Indexed: 12/22/2022]
Abstract
SNP-SNP interactions have been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases. Though many works have been done for the detection of SNP-SNP interactions, the algorithmic development is still ongoing. In this study, an improved opposition-based learning particle swarm optimization (IOBLPSO) is proposed for the detection of SNP-SNP interactions. Highlights of IOBLPSO are the introduction of three strategies, namely, opposition-based learning, dynamic inertia weight, and a postprocedure. Opposition-based learning not only enhances the global explorative ability, but also avoids premature convergence. Dynamic inertia weight allows particles to cover a wider search space when the considered SNP is likely to be a random one and converges on promising regions of the search space while capturing a highly suspected SNP. The postprocedure is used to carry out a deep search in highly suspected SNP sets. Experiments of IOBLPSO are performed on both simulation data sets and a real data set of age-related macular degeneration, results of which demonstrate that IOBLPSO is promising in detecting SNP-SNP interactions. IOBLPSO might be an alternative to existing methods for detecting SNP-SNP interactions.
13
Gusareva ES, Van Steen K. Practical aspects of genome-wide association interaction analysis. Hum Genet 2014; 133:1343-58. [DOI: 10.1007/s00439-014-1480-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 05/21/2014] [Accepted: 08/18/2014] [Indexed: 12/31/2022]
14
Leydesdorff L, Perevodchikov E, Uvarov A. Measuring triple-helix synergy in the Russian innovation systems at regional, provincial, and national levels. J Assoc Inf Sci Technol 2014. [DOI: 10.1002/asi.23258] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 01/03/2023]
Affiliation(s)
- Loet Leydesdorff
  - Amsterdam School of Communication Research (ASCoR), University of Amsterdam, Kloveniersburgwal 48, 1012 CX Amsterdam, The Netherlands
- Evgeniy Perevodchikov
  - Institute for Innovations, Tomsk State University of Control Systems and Radioelectronics (TUSUR), 40 Lenina Prospect, Tomsk 634050, Russia
- Alexander Uvarov
  - Institute for Innovations, Tomsk State University of Control Systems and Radioelectronics (TUSUR), 40 Lenina Prospect, Tomsk 634050, Russia
15
Kwon MS, Park M, Park T. IGENT: efficient entropy based algorithm for genome-wide gene-gene interaction analysis. BMC Med Genomics 2014; 7 Suppl 1:S6. [PMID: 25077411 PMCID: PMC4101351 DOI: 10.1186/1755-8794-7-s1-s6] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 02/04/2023]
Abstract
Background With the development of high-throughput genotyping and sequencing technology, there is growing evidence of association between genetic variants and complex traits. In spite of the thousands of genetic variants discovered, such genetic markers have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. Gene-gene interaction (GGI) analysis is expected to unveil a large portion of the unexplained heritability of complex traits. Methods In this work, we propose IGENT, an Information theory-based GEnome-wide gene-gene iNTeraction method. IGENT is an efficient algorithm for identifying genome-wide gene-gene interactions (GGI) and gene-environment interactions (GEI). For detecting significant GGIs at genome-wide scale, it is important to reduce the computational burden significantly. Our method uses information gain (IG) and evaluates its significance without resampling. Results Through our simulation studies, the power of IGENT is shown to be better than or equivalent to that of BOOST. The proposed method successfully detected GGI for bipolar disorder in the Wellcome Trust Case Control Consortium (WTCCC) and age-related macular degeneration (AMD). Conclusions The proposed method is implemented in C++ and is available on Windows, Linux and MacOSX.
|
16
|
Guo CY, Chen YJ, Chen YH. The logistic regression model for gene-environment interactions using both case-parent trios and unrelated case-controls. Ann Hum Genet 2014; 78:299-305. [PMID: 24766627 DOI: 10.1111/ahg.12063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/15/2013] [Accepted: 03/12/2014] [Indexed: 12/01/2022]
Abstract
One of the greatest challenges in genetic studies is the determination of gene-environment interactions due to underlying complications and inadequate statistical power. With the increased sample size gained by using case-parent trios and unrelated cases and controls, the performance may be much improved. Focusing on a dichotomous trait, a two-stage approach was previously proposed to deal with gene-environment interaction when utilizing mixed study samples. Theoretically, the two-stage association analysis uses likelihood functions such that the computational algorithms may not converge in the maximum likelihood estimation with small study samples. In an effort to avoid such convergence issues, we propose a logistic regression framework model, based on the combined haplotype relative risk (CHRR) method, which intuitively pools the case-parent trios and unrelated subjects in a two by two table. A positive feature of the logistic regression model is the effortless adjustment for either discrete or continuous covariates. According to computer simulations, under the circumstances in which the two-stage test converges in larger sample sizes, we discovered that the performances of the two tests were quite similar; the two-stage test is more powerful under the dominant and additive disease models, but the extended CHRR is more powerful under the recessive disease model.
Affiliation(s)
- Chao-Yu Guo
- Division of Biostatistics, Institute of Public Health, National Yang Ming University, Taipei, Taiwan; Aging and Health Research Center, National Yang Ming University, Taipei, Taiwan; Biostatistical Consulting Center, National Yang Ming University, Taipei, Taiwan
|
17
|
Timme N, Alford W, Flecker B, Beggs JM. Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective. J Comput Neurosci 2013; 36:119-40. [DOI: 10.1007/s10827-013-0458-4] [Citation(s) in RCA: 130] [Impact Index Per Article: 11.8] [Received: 08/09/2012] [Revised: 04/26/2013] [Accepted: 04/29/2013] [Indexed: 11/29/2022]
|
18
|
Abstract
SUMMARY MASS is a command-line program to perform meta-analysis of sequencing studies by combining the score statistics from multiple studies. It implements three types of multivariate tests that encompass all commonly used association tests for rare variants. The input files can be generated from the accompanying software SCORE-Seq. This bundle of programs allows analysis of large sequencing studies in a time- and memory-efficient manner. AVAILABILITY AND IMPLEMENTATION MASS and SCORE-Seq, including documentation and executables, are available at http://dlin.web.unc.edu/software/. CONTACT lin@bios.unc.edu.
Affiliation(s)
- Zheng-Zheng Tang
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599-7420, USA
|
19
|
SYMPHONY, an information-theoretic method for gene-gene and gene-environment interaction analysis of disease syndromes. Heredity (Edinb) 2013; 110:548-59. [PMID: 23423149 DOI: 10.1038/hdy.2012.123] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022] Open
Abstract
We develop an information-theoretic method for gene-gene (GGI) and gene-environmental interactions (GEI) analysis of syndromes, defined as a phenotype vector comprising multiple quantitative traits (QTs). The K-way interaction information (KWII), an information-theoretic metric, was derived for multivariate normal distributed phenotype vectors. The utility of the method was challenged with three simulated data sets, the Genetic Association Workshop-15 (GAW15) rheumatoid arthritis data set, a high-density lipoprotein (HDL) and atherosclerosis data set from a mouse QT locus study, and the 1000 Genomes data. The dependence of the KWII on effect size, minor allele frequency, linkage disequilibrium, population stratification/admixture, as well as the power and computational time requirements of the novel method was systematically assessed in simulation studies. In these studies, phenotype vectors containing two and three constituent multivariate normally distributed QTs were used and the KWII was found to be effective at detecting GEI associated with the phenotype. High KWII values were observed for variables and variable combinations associated with the syndrome phenotype compared with uninformative variables not associated with the phenotype. The KWII values for the phenotype-associated combinations increased monotonically with increasing effect size values. The KWII also exhibited utility in simulations with non-linear dependence between the constituent QTs. Analysis of the HDL and atherosclerosis data set indicated that the simultaneous analysis of both phenotypes identified interactions not detected in the analysis of the individual traits. The information-theoretic approach may be useful for non-parametric analysis of GGI and GEI of complex syndromes.
|
20
|
Vertical Integration of Pharmacogenetics in Population PK/PD Modeling: A Novel Information Theoretic Method. CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY 2013. [PMCID: PMC3600754 DOI: 10.1038/psp.2012.25] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
To critically evaluate an information-theoretic method for identifying gene–environmental interactions (GEI) associated with pharmacokinetic (PK), pharmacodynamic (PD), and clinical outcomes from genome-wide pharmacogenetic data. Our approach, which is built on the K-way interaction information (KWII) metric, was challenged with simulated data and clinical PK/PD data sets from the International Warfarin Pharmacogenetics Consortium (IWPC) and a gemcitabine clinical trial. The KWII efficiently identified both novel and known interactions for warfarin and gemcitabine. Interactions between herbal supplementation and VKORC1 genotype were associated with warfarin response. For gemcitabine-associated neutropenia, combination treatment with carboplatin and cytidine deaminase (CDA) 208G→A genotypes were identified as risk factors. Gemcitabine disposition was associated with drug metabolism–transporter interactions between deoxycytidine kinase (DCK) and the equilibrative nucleoside transporter (ENT). This novel approach is effective for detecting GEI involved in drug exposure and response and could enable integration of genome-wide pharmacogenetic data into the population PK/PD analysis paradigm.
|
21
|
Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 110] [Impact Index Per Article: 9.2] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
The interest in performing gene-environment interaction studies has seen a significant increase with the increase of advanced molecular genetics techniques. Practically, it became possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors to disease. Several rationales exist to set up gene-environment interaction studies, and the technical challenges related to these studies (when the number of environmental or genetic risk factors is relatively small) have been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, usually similar problems need to be dealt with as for genome-wide association gene-gene interaction studies. However, additional complexities need to be considered which are typical for large-scale epidemiological studies, but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Affiliation(s)
- Hugues Aschard
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
|
22
|
Mahachie John JM, Cattaert T, Van Lishout F, Gusareva ES, Van Steen K. Lower-order effects adjustment in quantitative traits model-based multifactor dimensionality reduction. PLoS One 2012; 7:e29594. [PMID: 22242176 PMCID: PMC3252336 DOI: 10.1371/journal.pone.0029594] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 07/15/2011] [Accepted: 12/01/2011] [Indexed: 11/18/2022] Open
Abstract
Identifying gene-gene interactions or gene-environment interactions in studies of human complex diseases remains a big challenge in genetic epidemiology. An additional challenge, often forgotten, is to account for important lower-order genetic effects. These may hamper the identification of genuine epistasis. If lower-order genetic effects contribute to the genetic variance of a trait, identified statistical interactions may simply be due to a signal boost of these effects. In this study, we restrict attention to quantitative traits and bi-allelic SNPs as genetic markers. Moreover, our interaction study focuses on 2-way SNP-SNP interactions. Via simulations, we assess the performance of different corrective measures for lower-order genetic effects in Model-Based Multifactor Dimensionality Reduction epistasis detection, using additive and co-dominant coding schemes. Performance is evaluated in terms of power and familywise error rate. Our simulations indicate that empirical power estimates are reduced with correction of lower-order effects, likewise familywise error rates. Easy-to-use automatic SNP selection procedures, SNP selection based on “top” findings, or SNP selection based on p-value criterion for interesting main effects result in reduced power but also almost zero false positive rates. Always accounting for main effects in the SNP-SNP pair under investigation during Model-Based Multifactor Dimensionality Reduction analysis adequately controls false positive epistasis findings. This is particularly true when adopting a co-dominant corrective coding scheme. In conclusion, automatic search procedures to identify lower-order effects to correct for during epistasis screening should be avoided. The same is true for procedures that adjust for lower-order effects prior to Model-Based Multifactor Dimensionality Reduction and involve using residuals as the new trait. 
We advocate using “on-the-fly” lower-order effects adjusting when screening for SNP-SNP interactions using Model-Based Multifactor Dimensionality Reduction analysis.
Affiliation(s)
- Jestinah M. Mahachie John
- Systems and Modeling Unit, Montefiore Institute, University of Liege, Liege, Belgium
- Bioinformatics and Modeling, GIGA-R, University of Liege, Liege, Belgium
- Tom Cattaert
- Systems and Modeling Unit, Montefiore Institute, University of Liege, Liege, Belgium
- Bioinformatics and Modeling, GIGA-R, University of Liege, Liege, Belgium
- François Van Lishout
- Systems and Modeling Unit, Montefiore Institute, University of Liege, Liege, Belgium
- Bioinformatics and Modeling, GIGA-R, University of Liege, Liege, Belgium
- Elena S. Gusareva
- Systems and Modeling Unit, Montefiore Institute, University of Liege, Liege, Belgium
- Bioinformatics and Modeling, GIGA-R, University of Liege, Liege, Belgium
- Kristel Van Steen
- Systems and Modeling Unit, Montefiore Institute, University of Liege, Liege, Belgium
- Bioinformatics and Modeling, GIGA-R, University of Liege, Liege, Belgium
|
23
|
Fan R, Zhong M, Wang S, Zhang Y, Andrew A, Karagas M, Chen H, Amos CI, Xiong M, Moore JH. Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases. Genet Epidemiol 2011; 35:706-21. [PMID: 22009792 PMCID: PMC3384547 DOI: 10.1002/gepi.20621] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 01/17/2023]
Abstract
For complex diseases, the relationship between genotypes, environment factors, and phenotype is usually complex and nonlinear. Our understanding of the genetic architecture of diseases has considerably increased over the last years. However, both conceptually and methodologically, detecting gene-gene and gene-environment interactions remains a challenge, despite the existence of a number of efficient methods. One method that offers great promise but has not yet been widely applied to genomic data is the entropy-based approach of information theory. In this article, we first develop entropy-based test statistics to identify two-way and higher order gene-gene and gene-environment interactions. We then apply these methods to a bladder cancer data set and thereby test their power and identify strengths and weaknesses. For two-way interactions, we propose an information gain (IG) approach based on mutual information. For three-way and higher order interactions, an interaction IG approach is used. In both cases, we develop one-dimensional test statistics to analyze sparse data. Compared to the naive chi-square test, the test statistics we develop have similar or higher power and are robust. Applying them to the bladder cancer data set allowed us to investigate the complex interactions between DNA repair gene single nucleotide polymorphisms, smoking status, and bladder cancer susceptibility. Although not yet widely applied, entropy-based approaches appear to be a useful tool for detecting gene-gene and gene-environment interactions. The test statistics we develop add to a growing body of methodologies that will gradually shed light on the complex architecture of common diseases.
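As a rough illustration of the quantity behind a two-way information gain approach, the sketch below computes a plug-in estimate of I(A,B;C) − I(A;C) − I(B;C) for two markers and a binary phenotype. It is not the test statistic developed in the paper; the function names and the toy XOR data are assumptions made only for this example.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of the empirical distribution of `rows`."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def information_gain(a, b, phenotype):
    """Interaction gain I(A,B;C) - I(A;C) - I(B;C) for markers a, b."""
    joint = list(zip(a, b))
    return (mutual_information(joint, phenotype)
            - mutual_information(a, phenotype)
            - mutual_information(b, phenotype))


# Hypothetical toy data: the phenotype is the XOR of two binary markers,
# so neither marker is informative alone but the pair is fully informative.
snp_a = [0, 0, 1, 1] * 25
snp_b = [0, 1, 0, 1] * 25
status = [x ^ y for x, y in zip(snp_a, snp_b)]

ig = information_gain(snp_a, snp_b, status)
print(round(ig, 6))  # 1.0 bit: pure interaction, no main effects
```

On real genotype data the plug-in estimate is biased upward for sparse cells, which is one reason a significance test is needed on top of the raw value.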
Affiliation(s)
- R Fan
- Department of Statistics, Texas A&M University, College Station, Texas 77843, USA.
|
24
|
LI FG, WANG ZP, HU G, LI H. Current status of SNPs interaction in genome-wide association study. YI CHUAN = HEREDITAS 2011; 33:901-10. [DOI: 10.3724/sp.j.1005.2011.00901] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
|
25
|
James RG, Ellison CJ, Crutchfield JP. Anatomy of a bit: Information in a time series observation. CHAOS (WOODBURY, N.Y.) 2011; 21:037109. [PMID: 21974672 DOI: 10.1063/1.3637494] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.5] [Indexed: 05/31/2023]
Abstract
Appealing to several multivariate information measures (some familiar, some new here), we analyze the information embedded in discrete-valued stochastic time series. We dissect the uncertainty of a single observation to demonstrate how the measures' asymptotic behavior sheds structural and semantic light on the generating process's internal information dynamics. The measures scale with the length of time window, which captures both intensive (rates of growth) and subextensive components. We provide interpretations for the components, developing explicit relationships between them. We also identify the informational component shared between the past and the future that is not contained in a single observation. The existence of this component directly motivates the notion of a process's effective (internal) states and indicates why one must build models.
Affiliation(s)
- Ryan G James
- Complexity Sciences Center, University of California at Davis, One Shields Avenue, Davis, California 95616, USA.
|
26
|
Van Steen K. Perspectives on genome-wide multi-stage family-based association studies. Stat Med 2011; 30:2201-21. [DOI: 10.1002/sim.4259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/16/2010] [Accepted: 03/07/2011] [Indexed: 01/03/2023]
|
27
|
Abstract
SNPsyn (http://snpsyn.biolab.si) is an interactive software tool for the discovery of synergistic pairs of single nucleotide polymorphisms (SNPs) from large genome-wide case-control association studies (GWAS) data on complex diseases. Synergy among SNPs is estimated using an information-theoretic approach called interaction analysis. SNPsyn is both a stand-alone C++/Flash application and a web server. The computationally intensive part is implemented in C++ and can run in parallel on a dedicated cluster or grid. The graphical user interface is written in Adobe Flash Builder 4 and can run in most web browsers or as a stand-alone application. The SNPsyn web server hosts the Flash application, receives GWAS data submissions, invokes the interaction analysis and serves result files. The user can explore details on identified synergistic pairs of SNPs, perform gene set enrichment analysis and interact with the constructed SNP synergy network.
Affiliation(s)
- Tomaz Curk
- Faculty of Computer and Information Science, University of Ljubljana, Trzaska cesta 25, SI-1000 Ljubljana, Slovenia.
|
28
|
Abstract
Over the last few years, main effect genetic association analysis has proven to be a successful tool to unravel genetic risk components to a variety of complex diseases. In the quest for disease susceptibility factors and the search for the 'missing heritability', supplementary and complementary efforts have been undertaken. These include the inclusion of several genetic inheritance assumptions in model development, the consideration of different sources of information, and the acknowledgement of disease underlying pathways of networks. The search for epistasis or gene-gene interaction effects on traits of interest is marked by an exponential growth, not only in terms of methodological development, but also in terms of practical applications, translation of statistical epistasis to biological epistasis and integration of omics information sources. The current popularity of the field, as well as its attraction to interdisciplinary teams, each making valuable contributions with sometimes rather unique viewpoints, renders it impossible to give an exhaustive review of to-date available approaches for epistasis screening. The purpose of this work is to give a perspective view on a selection of currently active analysis strategies and concerns in the context of epistasis detection, and to provide an eye to the future of gene-gene interaction analysis.
Affiliation(s)
- Kristel Van Steen
- Department of Electrical Engineering and Computer Science (Montefiore Institute), Grande Traverse, Bioinformatique 4000 Liège 1, Belgium.
|
29
|
Modeling of environmental and genetic interactions with AMBROSIA, an information-theoretic model synthesis method. Heredity (Edinb) 2011; 107:320-7. [PMID: 21427755 DOI: 10.1038/hdy.2011.18] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022] Open
Abstract
To develop a model synthesis method for parsimoniously modeling gene-environmental interactions (GEI) associated with clinical outcomes and phenotypes. The AMBROSIA model synthesis approach utilizes the k-way interaction information (KWII), an information-theoretic metric capable of identifying variable combinations associated with GEI. For model synthesis, AMBROSIA considers relevance of combinations to the phenotype, it precludes entry of combinations with redundant information, and penalizes for unjustifiable complexity; each step is KWII based. The performance and power of AMBROSIA were evaluated with simulations and Genetic Association Workshop 15 (GAW15) data sets of rheumatoid arthritis (RA). AMBROSIA identified parsimonious models in data sets containing multiple interactions with linkage disequilibrium present. For the GAW15 data set containing 9187 single-nucleotide polymorphisms, the parsimonious AMBROSIA model identified nine RA-associated combinations with power >90%. AMBROSIA was compared with multifactor dimensionality reduction across several diverse models and had satisfactory power. Software source code is available from http://www.cse.buffalo.edu/DBGROUP/bioinformatics/resources.html. AMBROSIA is a promising method for GEI model synthesis.
|
30
|
Culverhouse RC, Saccone NL, Stitzel JA, Wang JC, Steinbach JH, Goate AM, Schwantes-An TH, Grucza RA, Stevens VL, Bierut LJ. Uncovering hidden variance: pair-wise SNP analysis accounts for additional variance in nicotine dependence. Hum Genet 2011; 129:177-88. [PMID: 21079997 PMCID: PMC3030551 DOI: 10.1007/s00439-010-0911-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/18/2010] [Accepted: 11/01/2010] [Indexed: 02/01/2023]
Abstract
Results from genome-wide association studies of complex traits account for only a modest proportion of the trait variance predicted to be due to genetics. We hypothesize that joint analysis of polymorphisms may account for more variance. We evaluated this hypothesis on a case-control smoking phenotype by examining pairs of nicotinic receptor single-nucleotide polymorphisms (SNPs) using the Restricted Partition Method (RPM) on data from the Collaborative Genetic Study of Nicotine Dependence (COGEND). We found evidence of joint effects that increase explained variance. Four signals identified in COGEND were testable in independent American Cancer Society (ACS) data, and three of the four signals replicated. Our results highlight two important lessons: joint effects that increase the explained variance are not limited to loci displaying substantial main effects, and joint effects need not display a significant interaction term in a logistic regression model. These results suggest that the joint analyses of variants may indeed account for part of the genetic variance left unexplained by single SNP analyses. Methodologies that limit analyses of joint effects to variants that demonstrate association in single SNP analyses, or require a significant interaction term, will likely miss important joint effects.
Affiliation(s)
- Robert C Culverhouse
- Division of General Medical Sciences, Department of Medicine, Washington University, Saint Louis, MO 63110, USA.
|
31
|
Abstract
The literature on epistasis describes various methods to detect epistatic interactions and to classify different types of epistasis. Reconstructability analysis (RA) has recently been used to detect epistasis in genomic data. This paper shows that RA offers a classification of types of epistasis at three levels of resolution (variable-based models without loops, variable-based models with loops, state-based models). These types can be defined by the simplest RA structures that model the data without information loss; a more detailed classification can be defined by the information content of multiple candidate structures. The RA classification can be augmented with structures from related graphical modeling approaches. RA can analyze epistatic interactions involving an arbitrary number of genes or SNPs and constitutes a flexible and effective methodology for genomic analysis.
|
32
|
Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics 2010; 11:487. [PMID: 20815886 PMCID: PMC2996983 DOI: 10.1186/1471-2164-11-487] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 02/09/2010] [Accepted: 09/03/2010] [Indexed: 11/10/2022] Open
Abstract
Background Multifactorial diseases such as cancer and cardiovascular diseases are caused by the complex interplay between genes and environment. The detection of these interactions remains challenging due to computational limitations. Information theoretic approaches use computationally efficient directed search strategies and thus provide a feasible solution to this problem. However, the power of information theoretic methods for interaction analysis has not been systematically evaluated. In this work, we compare power and Type I error of an information-theoretic approach to existing interaction analysis methods. Methods The k-way interaction information (KWII) metric for identifying variable combinations involved in gene-gene interactions (GGI) was assessed using several simulated data sets under models of genetic heterogeneity driven by susceptibility increasing loci with varying allele frequency, penetrance values and heritability. The power and proportion of false positives of the KWII were compared to multifactor dimensionality reduction (MDR), restricted partitioning method (RPM) and logistic regression. Results The power of the KWII was considerably greater than MDR on all six simulation models examined. For a given disease prevalence at high values of heritability, the power of both RPM and KWII was greater than 95%. For models with low heritability and/or genetic heterogeneity, the power of the KWII was consistently greater than RPM; the improvements in power for the KWII over RPM ranged from 4.7% to 14.2% at α = 0.001 in the three models at the lowest heritability values examined. KWII performed similarly to logistic regression. Conclusions Information theoretic models are flexible and have excellent power to detect GGI under a variety of conditions that characterize complex diseases.
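For readers unfamiliar with the KWII, a minimal sketch follows. It assumes the usual alternating-sum definition over subset entropies, KWII(S) = −Σ_{T⊆S} (−1)^{|S|−|T|} H(T), estimated by plugging in empirical frequencies; it does not reproduce the directed search strategy or the simulation design of the study, and all names and toy data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(columns):
    """Joint Shannon entropy (bits) of one or more equal-length columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def kwii(columns):
    """K-way interaction information via the alternating sum of subset entropies.

    Positive values suggest synergy among the variables, negative values
    suggest redundancy, under the sign convention assumed here.
    """
    k = len(columns)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(columns, size):
            total += sign * joint_entropy(subset)
    return -total


# Hypothetical toy data: two binary markers and an XOR phenotype (synergy),
# then three identical copies of one marker (redundancy).
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
xor_status = [x ^ y for x, y in zip(a, b)]

synergy = kwii([a, b, xor_status])
redundancy = kwii([a, a, a])
pairwise = kwii([a, xor_status])  # for two variables this is mutual information
```

With two variables the expression reduces to the mutual information H(A) + H(B) − H(A,B), which is why the single-marker KWII against the XOR phenotype is zero while the three-way value is one bit.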
|
33
|
Cattaert T, Urrea V, Naj AC, De Lobel L, De Wit V, Fu M, Mahachie John JM, Shen H, Calle ML, Ritchie MD, Edwards TL, Van Steen K. FAM-MDR: a flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS One 2010; 5:e10304. [PMID: 20421984 PMCID: PMC2858665 DOI: 10.1371/journal.pone.0010304] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 01/12/2010] [Accepted: 03/01/2010] [Indexed: 12/05/2022] Open
Abstract
We propose a novel multifactor dimensionality reduction method for epistasis detection in small or extended pedigrees, FAM-MDR. It combines features of the Genome-wide Rapid Association using Mixed Model And Regression approach (GRAMMAR) with Model-Based MDR (MB-MDR). We focus on continuous traits, although the method is general and can be used for outcomes of any type, including binary and censored traits. When comparing FAM-MDR with Pedigree-based Generalized MDR (PGMDR), which is a generalization of Multifactor Dimensionality Reduction (MDR) to continuous traits and related individuals, FAM-MDR was found to outperform PGMDR in terms of power, in most of the considered simulated scenarios. Additional simulations revealed that PGMDR does not appropriately deal with multiple testing and consequently gives rise to overly optimistic results. FAM-MDR adequately deals with multiple testing in epistasis screens and is in contrast rather conservative, by construction. Furthermore, simulations show that correcting for lower order (main) effects is of utmost importance when claiming epistasis. As Type 2 Diabetes Mellitus (T2DM) is a complex phenotype likely influenced by gene-gene interactions, we applied FAM-MDR to examine data on glucose area-under-the-curve (GAUC), an endophenotype of T2DM for which multiple independent genetic associations have been observed, in the Amish Family Diabetes Study (AFDS). This application reveals that FAM-MDR makes more efficient use of the available data than PGMDR and can deal with multi-generational pedigrees more easily. In conclusion, we have validated FAM-MDR and compared it to PGMDR, the current state-of-the-art MDR method for family data, using both simulations and a practical dataset. FAM-MDR is found to outperform PGMDR in that it handles the multiple testing issue more correctly, has increased power, and efficiently uses all available information.
Affiliation(s)
- Tom Cattaert
- Montefiore Institute, University of Liège, Liège, Belgium.
|
34
|
Reconstructability analysis as a tool for identifying gene-gene interactions in studies of human diseases. Stat Appl Genet Mol Biol 2010; 9:Article18. [PMID: 20361857 DOI: 10.2202/1544-6115.1516] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/17/2023]
Abstract
There are a number of common human diseases for which the genetic component may include an epistatic interaction of multiple genes. Detecting these interactions with standard statistical tools is difficult because there may be an interaction effect, but minimal or no main effect. Reconstructability analysis (RA) uses Shannon's information theory to detect relationships between variables in categorical datasets. We applied RA to simulated data for five different models of gene-gene interaction, and find that even with heritability levels as low as 0.008, and with the inclusion of 50 non-associated genes in the dataset, we can identify the interacting gene pairs with an accuracy of ≥80%. We applied RA to a real dataset of type 2 non-insulin-dependent diabetes (NIDDM) cases and controls, and closely approximated the results of more conventional single SNP disease association studies. In addition, we replicated prior evidence for epistatic interactions between SNPs on chromosomes 2 and 15.
|
35
|
Chanda P, Zhang A, Sucheston L, Ramanathan M. A two-stage search strategy for detecting multiple loci associated with rheumatoid arthritis. BMC Proc 2009; 3 Suppl 7:S72. [PMID: 20018067 PMCID: PMC2795974 DOI: 10.1186/1753-6561-3-s7-s72] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Gene x gene interactions play important roles in the etiology of complex multi-factorial diseases like rheumatoid arthritis (RA). In this paper, we describe our use of a two-stage search strategy consisting of information theoretic methods and logistic regression to detect gene x gene interactions associated with RA using the data in Problem 1 of Genetic Analysis Workshop 16. Our method detected interactions of several SNPs (single-SNP and SNP x SNP) that are located on chromosomal regions linked to RA and related diseases in previous studies.
Affiliation(s)
- Pritam Chanda
- Departments of Computer Science and Engineering, State University of New York, Buffalo, New York 14260, USA.
|
36
|
|
37
|
Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics 2009; 10:509. [PMID: 19889230 PMCID: PMC2779196 DOI: 10.1186/1471-2164-10-509] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 10/08/2008] [Accepted: 11/04/2009] [Indexed: 12/30/2022] Open
Abstract
Background: The purpose of this research was to develop a novel information theoretic method and an efficient algorithm for analyzing the gene-gene (GGI) and gene-environmental interactions (GEI) associated with quantitative traits (QT). The method is built on two information-theoretic metrics, the k-way interaction information (KWII) and phenotype-associated information (PAI). The PAI is a novel information theoretic metric that is obtained from the total information correlation (TCI) information theoretic metric by removing the contributions for inter-variable dependencies (resulting from factors such as linkage disequilibrium and common sources of environmental pollutants). Results: The KWII and the PAI were critically evaluated and incorporated within an algorithm called CHORUS for analyzing QT. The combinations with the highest values of KWII and PAI identified each known GEI associated with the QT in the simulated data sets. The CHORUS algorithm was tested using the simulated GAW15 data set and two real GGI data sets from QTL mapping studies of high-density lipoprotein levels/atherosclerotic lesion size and ultra-violet light-induced immunosuppression. The KWII and PAI were found to have excellent sensitivity for identifying the key GEI simulated to affect the two quantitative trait variables in the GAW15 data set. In addition, both metrics showed strong concordance with the results of the two different QTL mapping data sets. Conclusion: The KWII and PAI are promising metrics for analyzing the GEI of QT.
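The abstract names the k-way interaction information but does not give its formula. As a rough illustration only, the sketch below assumes a common inclusion-exclusion definition over subset entropies, KWII(S) = -Σ_{T⊆S} (-1)^{|S|-|T|} H(T), under which positive values indicate synergy and negative values indicate redundancy. The function names `entropy` and `kwii` and the toy variables are hypothetical, and the paper's exact formulation (and its handling of quantitative traits) may differ.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(columns):
    """Joint Shannon entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def kwii(columns):
    """Interaction information via inclusion-exclusion over all non-empty subsets.

    Assumed definition: KWII(S) = -sum over T subset of S of (-1)^(|S|-|T|) * H(T).
    For two variables this reduces to the mutual information I(A;B).
    """
    k = len(columns)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(columns, size):
            total -= sign * entropy(subset)
    return total


# Toy discrete data (illustrative assumption, not from the paper).
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]   # c depends on a and b only jointly

synergy = kwii([a, b, c])       # jointly informative triple
redundancy = kwii([a, a, a])    # three copies of the same variable
independent = kwii([a, b])      # two independent variables

print(synergy, redundancy, independent)
```

Because the number of subsets grows exponentially with k, a direct evaluation like this is only practical for small combinations, which is presumably one reason the paper pairs the metric with a dedicated search algorithm.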
Affiliation(s)
- Pritam Chanda
- Department of Pharmaceutical Sciences, State University of New York, Buffalo, NY, USA.
|
38
|
Abstract
Following the identification of several disease-associated polymorphisms by genome-wide association (GWA) analysis, interest is now focusing on the detection of effects that, owing to their interaction with other genetic or environmental factors, might not be identified by using standard single-locus tests. In addition to increasing the power to detect associations, it is hoped that detecting interactions between loci will allow us to elucidate the biological and biochemical pathways that underpin disease. Here I provide a critical survey of the methods and related software packages currently used to detect the interactions between genetic loci that contribute to human genetic disease. I also discuss the difficulties in determining the biological relevance of statistical interactions.
Affiliation(s)
- Heather J Cordell
- Institute of Human Genetics, Newcastle University, International Centre for Life, Central Parkway, Newcastle upon Tyne NE1 3BZ, UK.
|
39
|
Calle ML, Urrea V, Vellalta G, Malats N, Steen KV. Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Stat Med 2009; 27:6532-46. [PMID: 18837071 DOI: 10.1002/sim.3431] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Indexed: 01/01/2023]
Abstract
The analysis of gene interactions and epistatic patterns of susceptibility is especially important for investigating complex diseases such as cancer characterized by the joint action of several genes. This work is motivated by a case-control study of bladder cancer, aimed at evaluating the role of both genetic and environmental factors in bladder carcinogenesis. In particular, the analysis of the inflammation pathway is of interest, for which information on a total of 282 SNPs in 108 genes involved in the inflammatory response is available. Detecting and interpreting interactions with such a large number of polymorphisms is a great challenge from both the statistical and the computational perspectives. In this paper we propose a two-stage strategy for identifying relevant interactions: (1) the use of a synergy measure among interacting genes and (2) the use of the model-based multifactor dimensionality reduction method (MB-MDR), a model-based version of the MDR method, which allows adjustment for confounders.
Affiliation(s)
- M L Calle
- Department of Systems Biology, Universitat de Vic, Carrer de la Sagrada Família, 7-08500 Vic, Spain.
|
40
|
Chanda P, Sucheston L, Zhang A, Ramanathan M. The interaction index, a novel information-theoretic metric for prioritizing interacting genetic variations and environmental factors. Eur J Hum Genet 2009; 17:1274-86. [PMID: 19293841 DOI: 10.1038/ejhg.2009.38] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 11/09/2022] Open
Abstract
We developed an information-theoretic metric called the Interaction Index for prioritizing genetic variations and environmental variables for follow-up in detailed sequencing studies. The Interaction Index was found to be effective for prioritizing the genetic and environmental variables involved in GEI for a diverse range of simulated data sets. The metric was also evaluated for a 103-SNP Crohn's disease dataset and a simulated data set containing 9187 SNPs and multiple covariates that was modeled on a rheumatoid arthritis data set. Our results demonstrate that the Interaction Index algorithm is effective and efficient for prioritizing interacting variables for a diverse range of epidemiologic data sets containing complex combinations of direct effects, multiple GGI and GEI.
Affiliation(s)
- Pritam Chanda
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
|
41
|
Orloff MS, Eng C. Genetic and phenotypic heterogeneity in the PTEN hamartoma tumour syndrome. Oncogene 2008; 27:5387-97. [PMID: 18794875 DOI: 10.1038/onc.2008.237] [Citation(s) in RCA: 113] [Impact Index Per Article: 7.1] [Indexed: 12/19/2022]
Abstract
Germline PTEN (Phosphatase and TENsin homologue deleted on chromosome TEN) mutations predispose to phenotypically diverse disorders that share several overlapping clinical features: Cowden syndrome, Bannayan-Riley-Ruvalcaba syndrome, Proteus syndrome and Proteus-like syndrome, collectively classified as PTEN hamartoma tumour syndrome (PHTS). The meticulous acquisition and documentation of PHTS phenotypic data at different levels and the profiling of the plethora of genetic changes in PTEN and other genes within the same or related pathways are important in resolving the challenge of discriminating heritable cancers from sporadic PHTS-mimicking clinical features. The characterization of PTEN and PTEN-related pathways from a multidisciplinary perspective underscores the importance of incorporating data from different -omics, which is crucial for the advancement of personalized medicine.
Affiliation(s)
- M S Orloff
- Genomic Medicine Institute, Cleveland Clinic Foundation, Cleveland, OH 44195, USA
|
42
|
AMBIENCE: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics 2008; 180:1191-210. [PMID: 18780753 DOI: 10.1534/genetics.108.088542] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022] Open
Abstract
We developed a computationally efficient algorithm, AMBIENCE, for identifying the informative variables involved in gene-gene (GGI) and gene-environment interactions (GEI) that are associated with disease phenotypes. The AMBIENCE algorithm uses a novel information theoretic metric called phenotype-associated information (PAI) to search for combinations of genetic variants and environmental variables associated with the disease phenotype. The PAI-based AMBIENCE algorithm effectively and efficiently detected GEI in simulated data sets of varying size and complexity, including the 10K simulated rheumatoid arthritis data set from Genetic Analysis Workshop 15. The method was also successfully used to detect GGI in a Crohn's disease data set. The performance of the AMBIENCE algorithm was compared to the multifactor dimensionality reduction (MDR), generalized MDR (GMDR), and pedigree disequilibrium test (PDT) methods. Furthermore, we assessed the computational speed of AMBIENCE for detecting GGI and GEI for data sets varying in size from 100 to 10^5 variables. Our results demonstrate that the AMBIENCE information theoretic algorithm is useful for analyzing a diverse range of epidemiologic data sets containing evidence for GGI and GEI.
|
43
|
Abstract
PURPOSE OF REVIEW: We examine the reasons for investigating gene-environment interactions and address recent reports evaluating interactions between genes and environmental modulators in relation to cardiovascular disease and its common risk factors. RECENT FINDINGS: Studies focusing on smoking, physical activity, and alcohol and coffee consumption are observational and include relatively large sample sizes. They tend to examine single genes, however, and fail to address interactions with other genes and other correlated environmental factors. Studies examining gene-diet interactions include both observational and interventional designs. These studies are smaller, especially those including dietary interventions. Among the reported gene-diet interactions, it is important to highlight the strengthened position of APOA5 as a major gene that is involved in triglyceride metabolism and modulated by dietary factors, and the identification of APOA2 as a modulator of food intake and obesity risk. SUMMARY: The study of gene-environment interactions is an active and much needed area of research. Although technical barriers of genetic studies are rapidly being overcome, inclusion of comprehensive and reliable environmental information represents a significant shortcoming of genetics studies. Progress in this area requires inclusion of larger populations but also more comprehensive, standardized, and precise approaches to capturing environmental information.
Affiliation(s)
- Jose M Ordovas
- Nutrition and Genomics Laboratory, JM-USDA Human Nutrition Research Center on Aging at Tufts University, Boston, Massachusetts 02111, USA.
|