1
Ozminkowski S, Solís‐Lemus C. Identifying microbial drivers in biological phenotypes with a Bayesian network regression model. Ecol Evol 2024; 14:e11039. [PMID: 38774136 PMCID: PMC11106058 DOI: 10.1002/ece3.11039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 01/29/2024] [Accepted: 02/03/2024] [Indexed: 05/24/2024] Open
Abstract
In Bayesian Network Regression (BNR) models, networks are considered the predictors of continuous responses. These models have been successfully used in brain research to identify regions in the brain that are associated with specific human traits, yet their potential to elucidate microbial drivers in biological phenotypes for microbiome research remains unknown. In particular, microbial networks are challenging due to their high dimension and high sparsity compared to brain networks. Furthermore, unlike in brain connectome research, in microbiome research, it is usually expected that the presence of microbes has an effect on the response (main effects), not just the interactions. Here, we develop the first thorough investigation of whether Bayesian Network Regression models are suitable for microbial datasets on a variety of synthetic and real data under diverse biological scenarios. We test whether the Bayesian Network Regression model that accounts only for interaction effects (edges in the network) is able to identify key drivers (microbes) in phenotypic variability. We show that this model is indeed able to identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings, but we also identify scenarios where this method performs poorly, which allows us to provide practical advice for domain scientists aiming to apply these tools to their datasets. BNR models provide a framework for microbiome researchers to identify connections between microbes and measured phenotypes. We facilitate the use of this statistical model by providing an easy-to-use implementation as a publicly available Julia package at https://github.com/solislemuslab/BayesianNetworkRegression.jl.
Affiliation(s)
- Samuel Ozminkowski
  - Department of Statistics and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Claudia Solís‐Lemus
  - Department of Plant Pathology and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
2
Yaldız B, Erdoğan O, Rafatov S, Iyigün C, Aydın Son Y. Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies. BioData Min 2024; 17:3. [PMID: 38291454 PMCID: PMC10826120 DOI: 10.1186/s13040-024-00355-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 01/16/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND: Non-linear relationships at the genotype level are essential in understanding the genetic interactions of complex disease traits. Genome-wide association studies (GWAS) have revealed statistical associations of SNPs with many complex diseases. As GWAS results could not thoroughly reveal the genetic background of these disorders, Genome-Wide Interaction Studies have started to gain importance. In recent years, various statistical approaches, such as entropy-based methods, have been suggested for revealing these non-additive interactions between variants. This study presents a novel prioritization workflow integrating two-step Random Forest (RF) modeling and entropy analysis after PLINK filtering. The PLINK-RF-RF workflow is followed by an entropy-based 3-way interaction information (3WII) method to capture the hidden patterns resulting from non-linear relationships between genotypes in Late-Onset Alzheimer Disease (LOAD) to discover early and differential diagnosis markers.
RESULTS: Three models from different datasets are developed by integrating PLINK-RF-RF analysis and the entropy-based three-way interaction information (3WII) calculation method, which enables the detection of third-order interactions that are not primarily considered in epistatic interaction studies. A reduced SNP set is selected for all three datasets by 3WII analysis after PLINK filtering and SNP prioritization with RF-RF modeling, which is promising as a model minimization approach. Among SNPs revealed by 3WII, 4 SNPs out of 19 from GenADA, 1 SNP out of 27 from ADNI, and 4 SNPs out of 106 from NCRAD are mapped to genes directly associated with Alzheimer Disease. Additionally, several SNPs are associated with other neurological disorders. Also, the genes the variants mapped to in all datasets are significantly enriched in calcium ion binding, extracellular matrix, external encapsulating structure, and RUNX1 regulates estrogen receptor-mediated transcription pathways. Therefore, these functional pathways are proposed for further examination for a possible LOAD association. Besides, all 3WII variants are proposed as candidate biomarkers for the genotyping-based LOAD diagnosis.
CONCLUSION: The entropy approach performed in this study reveals the complex genetic interactions that significantly contribute to LOAD risk. We benefited from the entropy-based 3WII as a model minimization step and determined the significant 3-way interactions between the SNPs prioritized by PLINK-RF-RF. This framework is a promising approach for disease association studies, which can also be modified by integrating other machine learning and entropy-based interaction methods.
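As an illustration of the kind of quantity the abstract above refers to, the following is a minimal sketch of three-way interaction information computed from empirical joint entropies. It assumes the common definition I(X;Y|Z) - I(X;Y), under which positive values indicate synergy; the sign convention and estimator used in the cited study may differ, and the function names and toy data here are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(*columns):
    """Shannon entropy (bits) of the joint empirical distribution of the columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def interaction_information(x, y, z):
    """Three-way interaction information I(X;Y|Z) - I(X;Y), in bits.

    Assumed convention: positive values mean X and Y are jointly more
    informative about Z than they are separately (synergy).
    """
    return (
        entropy(x, y) + entropy(x, z) + entropy(y, z)
        - entropy(x) - entropy(y) - entropy(z)
        - entropy(x, y, z)
    )


# Toy data (hypothetical): two independent binary "genotypes" and a
# "phenotype" that is their XOR, i.e. a purely non-additive effect.
snp_a, snp_b = zip(*product([0, 1], repeat=2))
xor_phenotype = tuple(a ^ b for a, b in zip(snp_a, snp_b))
xor_score = interaction_information(snp_a, snp_b, xor_phenotype)

# Contrast: a phenotype driven by snp_a alone carries no three-way signal.
main_effect_score = interaction_information(snp_a, snp_b, snp_a)
```

In the XOR case neither variable alone is informative about the phenotype, so the whole signal shows up in the three-way term, which is the situation pairwise screens tend to miss.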
Affiliation(s)
- Burcu Yaldız
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Onur Erdoğan
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Sevda Rafatov
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Cem Iyigün
  - Department of Industrial Engineering, METU, Ankara, Turkey
- Yeşim Aydın Son
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
  - Graduate School of Informatics, ODTU-NOROM, METU, Ankara, Turkey
3
Ventresca C, Mohamed W, Russel WA, Ay A, Ingram KK. Machine learning analyses reveal circadian clock features predictive of anxiety among UK biobank participants. Sci Rep 2023; 13:22304. [PMID: 38102312 PMCID: PMC10724169 DOI: 10.1038/s41598-023-49644-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2023] [Accepted: 12/11/2023] [Indexed: 12/17/2023] Open
Abstract
Mood disorders, including depression and anxiety, affect almost one-fifth of the world's adult population and are becoming increasingly prevalent. Mutations in circadian clock genes have previously been associated with mood disorders both directly and indirectly through alterations in circadian phase, suggesting that the circadian clock influences multiple molecular pathways involved in mood. By targeting previously identified single nucleotide polymorphisms (SNPs) that have been implicated in anxiety and depressive disorders, we use a combination of statistical and machine learning techniques to investigate associations with the generalized anxiety disorder assessment (GAD-7) scores in a UK Biobank sample of 90,882 individuals. As in previous studies, we observed that females exhibited higher GAD-7 scores than males regardless of genotype. Interestingly, we found no significant effects on anxiety from individual circadian gene variants; only circadian genotypes with multiple SNP variants showed significant associations with anxiety. For both sexes, severe anxiety is associated with a 120-fold increase in odds for individuals with CRY2_AG(rs1083852)/ZBTB20_TT(rs1394593) genotypes and is associated with a near 40-fold reduction in odds for individuals with PER3-A_CG(rs228697)/ZBTB20_TT(rs1394593) genotypes. We also report several sex-specific associations with anxiety. In females, the CRY2/ZBTB20 genotype combination showed a >200-fold increase in odds of anxiety, and PER3/ZBTB20 and CRY1/PER3-A genotype combinations also appeared as female risk factors. In males, CRY1/PER3-A and PER3-B/ZBTB20 genotype combinations were associated with anxiety risk. Mediation analysis revealed direct associations of CRY2/ZBTB20 variant genotypes with moderate anxiety in females and CRY1/PER3-A variant genotypes with severe anxiety in males. The association of CRY1/PER3-A variant genotypes with severe anxiety in females was partially mediated by extreme evening chronotype. Our results reinforce existing findings that females exhibit stronger anxiety outcomes than males, and provide evidence for circadian gene associations with anxiety, particularly in females. Our analyses only identified significant associations using two-gene combinations, underscoring the importance of combined gene effects on anxiety risk. We describe novel, robust associations between gene combinations involving the ZBTB20 SNP (rs1394593) and risk of anxiety symptoms in a large population sample. Our findings also support previous findings that the ZBTB20 SNP is an important factor in mood disorders, including seasonal affective disorder. Our results suggest that reduced expression of this gene significantly modulates the risk of anxiety symptoms through direct influences on mood-related pathways. Together, these observations provide novel links between the circadian clockwork and anxiety symptoms and identify potential molecular pathways through which clock genes may influence anxiety risk.
Affiliation(s)
- Cole Ventresca
  - Department of Mathematics, Colgate University, Hamilton, NY, USA
  - Department of Computer Science, Colgate University, Hamilton, NY, USA
- Wael Mohamed
  - Department of Computer Science, Colgate University, Hamilton, NY, USA
  - Department of Psychological and Brain Sciences, Colgate University, Hamilton, NY, USA
- Ahmet Ay
  - Department of Mathematics, Colgate University, Hamilton, NY, USA
  - Department of Biology, Colgate University, Hamilton, NY, USA
- Krista K Ingram
  - Department of Biology, Colgate University, Hamilton, NY, USA
4
Wang Z, Zhu Y, Liu Z, Li H, Tang X, Jiang Y. Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest. Front Genet 2023; 14:1190887. [PMID: 37229198 PMCID: PMC10203421 DOI: 10.3389/fgene.2023.1190887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 04/17/2023] [Indexed: 05/27/2023] Open
Abstract
Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.
Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy based on 1,548 maize multi-tissue RNA-seq data obtained from a public database to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.
Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.
Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategy for machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.
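The V-measure used for clustering validation in the abstract above can be sketched as follows. This is a minimal illustration of the standard definition (harmonic mean of homogeneity and completeness, with equal weighting assumed), not the authors' pipeline; the function names and the toy tissue labels are hypothetical.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy (nats) of an empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) estimated from empirical label counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting assumed)."""
    h_classes, h_clusters = _entropy(classes), _entropy(clusters)
    # Homogeneity: each cluster should contain members of a single class.
    homogeneity = 1.0 if h_classes == 0 else (
        1.0 - _conditional_entropy(classes, clusters) / h_classes
    )
    # Completeness: each class should be assigned to a single cluster.
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - _conditional_entropy(clusters, classes) / h_clusters
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical example: true tissue labels versus k-means cluster ids.
tissues = ["leaf", "leaf", "root", "root"]
perfect = v_measure(tissues, [1, 1, 0, 0])        # clusters match tissues
uninformative = v_measure(tissues, [0, 1, 0, 1])  # clusters ignore tissues
```

The score does not depend on how cluster ids are numbered, which is why it suits comparisons between k-means output and known tissue labels.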
Affiliation(s)
- Zijie Wang
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Yuzhi Zhu
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Zhule Liu
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Hongfu Li
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Xinqiang Tang
  - School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China
- Yi Jiang
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
5
Xiong W, Chen Y, Ma S. Unified model-free interaction screening via CV-entropy filter. Comput Stat Data Anal 2023; 180:107684. [PMID: 36910335 PMCID: PMC9997997 DOI: 10.1016/j.csda.2022.107684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
Abstract
For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Affiliation(s)
- Wei Xiong
  - School of Statistics, University of International Business and Economics, Beijing 100872, PR China
- Yaxian Chen
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, USA
6
Walakira A, Ocira J, Duroux D, Fouladi R, Moškon M, Rozman D, Van Steen K. Detecting gene-gene interactions from GWAS using diffusion kernel principal components. BMC Bioinformatics 2022; 23:57. [PMID: 35105309 PMCID: PMC8805268 DOI: 10.1186/s12859-022-04580-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/30/2021] [Accepted: 01/18/2022] [Indexed: 11/10/2022] Open
Abstract
Genes and gene products do not function in isolation but as components of complex networks of macromolecules through physical or biochemical interactions. Dependencies of gene mutations on genetic background (i.e., epistasis) are believed to play a role in understanding molecular underpinnings of complex diseases such as inflammatory bowel disease (IBD). However, the process of identifying such interactions is complex due to, for instance, the curse of high dimensionality, dependencies in the data, and non-linearity. Here, we propose a novel approach for robust and computationally efficient epistasis detection. We do so by first reducing dimensionality per gene via diffusion kernel principal components (kpc). Subsequently, kpc gene summaries are used for downstream analysis, including the construction of a gene-based epistasis network. We show that our approach is not only able to recover known IBD-associated genes but also additional genes of interest linked to this difficult gastrointestinal disease.
Affiliation(s)
- Andrew Walakira
  - Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Junior Ocira
  - BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Diane Duroux
  - BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Ramouna Fouladi
  - BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Miha Moškon
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Damjana Rozman
  - Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Kristel Van Steen
  - BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
  - BIO3 - Laboratory for Systems Medicine, Department of Human Genetics, KU Leuven, Leuven, Belgium
7
Kunert-Graf JM, Sakhanenko NA, Galas DJ. Optimized permutation testing for information theoretic measures of multi-gene interactions. BMC Bioinformatics 2021; 22:180. [PMID: 33827420 PMCID: PMC8028212 DOI: 10.1186/s12859-021-04107-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/01/2020] [Accepted: 03/29/2021] [Indexed: 11/17/2022] Open
Abstract
Background: Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large.
Results: In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples.
Conclusions: The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
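For orientation, the naive baseline that the abstract above contrasts against can be sketched as follows: the phenotype labels are shuffled and the count table is rebuilt from scratch at every iteration. This is only an illustration of that naive form, not the optimized count-table transformation the authors describe; the function names, the choice of mutual information as the statistic, and the toy data are all assumptions.

```python
import random
from collections import Counter
from math import log2


def mutual_information(genotypes, phenotypes):
    """Mutual information (bits) computed from a genotype-by-phenotype count table."""
    n = len(phenotypes)
    joint = Counter(zip(genotypes, phenotypes))  # the frequency count table
    g_counts = Counter(genotypes)
    p_counts = Counter(phenotypes)
    return sum(
        (c / n) * log2(c * n / (g_counts[g] * p_counts[p]))
        for (g, p), c in joint.items()
    )


def naive_permutation_pvalue(genotypes, phenotypes, n_perm=200, seed=0):
    """Permutation p-value that rebuilds the count table for every shuffle."""
    rng = random.Random(seed)
    observed = mutual_information(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Naive step: the full count table is reconstructed each iteration.
        if mutual_information(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: joint genotypes of two SNPs (coded 0/1/2) and a
# binary phenotype that depends only on their combination.
data_rng = random.Random(1)
snp_pairs = [(data_rng.randrange(3), data_rng.randrange(3)) for _ in range(200)]
phenotype = [(a + b) % 2 for a, b in snp_pairs]
p_signal = naive_permutation_pvalue(snp_pairs, phenotype)
```

Each call to `mutual_information` here scans all samples, so the cost grows with both the number of permutations and the sample size, which is the bottleneck the abstract identifies.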
Affiliation(s)
- James M Kunert-Graf
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
- David J Galas
  - Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
8
Manavalan R, Priya S. Genetic interactions effects for cancer disease identification using computational models: a review. Med Biol Eng Comput 2021; 59:733-758. [PMID: 33839998 DOI: 10.1007/s11517-021-02343-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/20/2020] [Accepted: 03/10/2021] [Indexed: 11/29/2022]
Abstract
Genome-wide association studies (GWAS) provide clear insight into understanding genetic variations and environmental influences responsible for various human diseases. Cancer identification through genetic interactions (epistasis) is one of the significant ongoing research areas in GWAS. The growth of cancer cells emerges from multi-locus as well as complex genetic interactions. It is impractical for the physician to detect cancer via manual examination of SNP interactions. Due to its importance, several computational approaches have been modeled to infer epistasis effects. This article includes a comprehensive and multifaceted review of all relevant genetic studies published between 2001 and 2020. In this contemporary review, various computational methods (multifactor dimensionality reduction-based approaches, statistical strategies, machine learning, and optimization-based techniques) are carefully reviewed and presented with their evaluation results. Moreover, the strengths and limitations of these computational approaches are described. The issues behind the computational methods for identifying cancer through genetic interactions and the various evaluation parameters used by researchers have been analyzed. This review is highly beneficial for researchers and medical professionals who wish to learn techniques adapted to discover epistasis, and it aids the design of novel automatic epistasis detection systems with strong robustness and maximum efficiency to address different research problems and find practical solutions effectively.
Affiliation(s)
- R Manavalan
  - Department of Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, 605602, India
- S Priya
  - Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, India
9
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel, Switzerland) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
  - Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
  - Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
  - Corteva Agriscience™, Johnston, IA 50131, USA
10
Malten J, König IR. Modified entropy-based procedure detects gene-gene-interactions in unconventional genetic models. BMC Med Genomics 2020; 13:65. [PMID: 32326960 PMCID: PMC7181579 DOI: 10.1186/s12920-020-0703-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/19/2019] [Accepted: 03/13/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Since it is assumed that genetic interactions play an important role in understanding the mechanisms of complex diseases, different statistical approaches have been suggested in recent years for this task. One interesting approach is the entropy-based IGENT method by Kwon et al. that promises an efficient detection of main effects and interaction effects simultaneously. However, a modification is required if the aim is to only detect interaction effects.
METHODS: Based on the IGENT method, we present a modification that leads to a conditional mutual information based approach under the condition of linkage equilibrium. The modified estimator is investigated in a comprehensive simulation based on five genetic interaction models and applied to real data from the genome-wide association study by the North American Rheumatoid Arthritis Consortium (NARAC).
RESULTS: The presented modification of IGENT controls the type I error in all simulated constellations. Furthermore, it provides high power for detecting pure interactions specifically on unconventional genetic models both in simulation and real data.
CONCLUSIONS: The proposed method uses the IGENT software, which is freely available, simple and fast, and detects pure interactions on unconventional genetic models. Our results demonstrate that this modification is an attractive complement to established analysis methods.
Affiliation(s)
- Jörg Malten
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
11
Kim S. A miRNA- and mRNA-seq-Based Feature Selection Approach for Kidney Cancer Biomakers. Cancer Inform 2020; 19:1176935120908301. [PMID: 32165847 PMCID: PMC7050029 DOI: 10.1177/1176935120908301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2020] [Accepted: 02/01/2020] [Indexed: 11/15/2022] Open
Abstract
Microarray data sets have been used for predicting cancer biomarkers. Yet replication of these predictions has not been fully satisfactory. Recently, new data sets called deep sequencing data sets have been generated, with the advantage of less noise in computational analysis. In this study, we analyzed the kidney miRNA and mRNA sequence data sets for predicting cancer markers using 5 different statistical feature selection methods. We obtained 3 mRNA- and 27 miRNA-based cancer biomarkers in comparison with the normal samples. In addition, we clustered the kidney cancer subtypes using a nonnegative matrix factorization method and obtained significant results of survival analysis from the 2 separate groups including miRNA-342 and its target eukaryotic translation initiation factor 5A (EIF5A).
Affiliation(s)
- Shinuk Kim
  - Department of Civil Engineering, Sangmyung University, Cheonan, Republic of Korea
12
Furxhi I, Murphy F, Poland CA, Sheehan B, Mullins M, Mantecca P. Application of Bayesian networks in determining nanoparticle-induced cellular outcomes using transcriptomics. Nanotoxicology 2019; 13:827-848. [PMID: 31140895 DOI: 10.1080/17435390.2019.1595206] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 12/12/2022]
Abstract
Inroads have been made in our understanding of the risks posed to human health and the environment by nanoparticles (NPs), but this area requires continuous research and monitoring. Machine learning techniques have been applied to nanotoxicology with very encouraging results. This study deals with bridging physicochemical properties of NPs, experimental exposure conditions and in vitro characteristics with biological effects of NPs on a molecular cellular level from transcriptomics studies. The bridging is done by developing and implementing Bayesian Networks (BNs) with or without data preprocessing. The BN structures are derived either automatically or methodologically and compared. Early-stage nanotoxicity measurements represent a challenge, not least when attempting to predict adverse outcomes, and modeling is critical to understanding the biological effects of exposure to NPs. The preprocessed data-driven BN showed improved performance over the automatically structured BN and the BN with unprocessed datasets. The prestructured BN captures interrelationships between NP properties, exposure conditions and in vitro characteristics and links those with cellular effects based on statistical correlation findings. Information gain analysis showed that exposure dose, NP and cell line variables were the most influential attributes in predicting the biological effects. The BN methodology proposed in this study successfully predicts a number of toxicologically relevant cellular disrupted biological processes such as cell cycle and proliferation pathways, cell adhesion and extracellular matrix responses, DNA damage and repair mechanisms etc., with a success rate >80%. The model validation from independent data shows a robust and promising methodology for incorporating transcriptomics outcomes in a hazard and, by extension, risk assessment modeling framework by predicting affected cellular functions from experimental conditions.
Collapse
Affiliation(s)
- Irini Furxhi
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, Limerick, Ireland
| | - Finbarr Murphy
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, Limerick, Ireland
| | - Craig A Poland
- ELEGI/Colt Laboratory, Queen's Medical Research Institute, University of Edinburgh, Edinburgh, Scotland
| | - Barry Sheehan
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, Limerick, Ireland
| | - Martin Mullins
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, Limerick, Ireland
| | - Paride Mantecca
- Department of Earth and Environmental Sciences, Particulate Matter and Health Risk (POLARIS) Research Centre, University of Milano Bicocca, Milano, Italy
| |
|
13
|
Kafaie S, Chen Y, Hu T. A network approach to prioritizing susceptibility genes for genome-wide association studies. Genet Epidemiol 2019; 43:477-491. [PMID: 30859622 DOI: 10.1002/gepi.22198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/29/2018] [Revised: 01/31/2019] [Accepted: 02/25/2019] [Indexed: 12/22/2022]
Abstract
The heritability of complex diseases including cancer is often attributed to multiple interacting genetic alterations. Such a non-linear, non-additive gene-gene interaction effect, that is, epistasis, renders univariable analysis methods ineffective for genome-wide association studies. In recent years, network science has seen increasing applications in modeling epistasis to characterize the complex relationships between a large number of genetic variations and the phenotypic outcome. In this study, by constructing a statistical epistasis network of colorectal cancer (CRC), we proposed to use multiple network measures to prioritize genes that influence the disease risk of CRC through synergistic interaction effects. We computed and analyzed several global and local properties of the large CRC epistasis network. We utilized topological properties of network vertices such as the edge strength, vertex centrality, and occurrence at different graphlets to identify genes that may be of potential biological relevance to CRC. We found 512 top-ranked single-nucleotide polymorphisms, among which COL22A1, RGS7, WWOX, and CELF2 were the four susceptibility genes prioritized by all described metrics as the most influential on CRC.
Affiliation(s)
- Somayeh Kafaie
- Department of Computer Science, Memorial University, St. John's, NL, Canada
| | - Yuanzhu Chen
- Department of Computer Science, Memorial University, St. John's, NL, Canada
| | - Ting Hu
- Department of Computer Science, Memorial University, St. John's, NL, Canada
| |
|
14
|
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Affiliation(s)
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
- Corresponding author: Inke R. König, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: +49 451 500 50610; Fax: +49 451 500 50604
| |
|
15
|
ClusterMI: Detecting High-Order SNP Interactions Based on Clustering and Mutual Information. Int J Mol Sci 2018; 19:ijms19082267. [PMID: 30072632 PMCID: PMC6121365 DOI: 10.3390/ijms19082267] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 06/28/2018] [Revised: 07/23/2018] [Accepted: 07/30/2018] [Indexed: 01/14/2023] Open
Abstract
Identifying single nucleotide polymorphism (SNP) interactions is considered as a popular and crucial way for explaining the missing heritability of complex diseases in genome-wide association studies (GWAS). Many approaches have been proposed to detect SNP interactions. However, existing approaches generally suffer from the high computational complexity resulting from the explosion of candidate high-order interactions. In this paper, we propose a two-stage approach (called ClusterMI) to detect high-order genome-wide SNP interactions based on significant pairwise SNP combinations. In the screening stage, to alleviate the huge computational burden, ClusterMI firstly applies a clustering algorithm combined with mutual information to divide SNPs into different clusters. Then, ClusterMI utilizes conditional mutual information to screen significant pairwise SNP combinations in each cluster. In this way, there is a higher probability of identifying significant two-locus combinations in each group, and the computational load for the follow-up search can be greatly reduced. In the search stage, two different search strategies (exhaustive search and improved ant colony optimization search) are provided to detect high-order SNP interactions based on the cardinality of significant two-locus combinations. Extensive simulation experiments show that ClusterMI has better performance than other related and competitive approaches. Experiments on two real case-control datasets from Wellcome Trust Case Control Consortium (WTCCC) also demonstrate that ClusterMI is more capable of identifying high-order SNP interactions from genome-wide data.
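The two quantities named in this abstract, mutual information and conditional mutual information, can be estimated from genotype counts. The following is a minimal Python sketch under stated assumptions, not the ClusterMI implementation: the function names and the toy data are hypothetical, and the abstract does not say which variables ClusterMI conditions on, so the choice of I(SNP a; phenotype | SNP b) here is illustrative only.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equal-length discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from empirical frequencies."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Hypothetical toy data: the phenotype is the XOR of two binary-coded SNPs,
# a purely epistatic pattern with no marginal effect of either SNP.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

marginal = mutual_information(snp_a, phenotype)
conditional = conditional_mutual_information(snp_a, phenotype, snp_b)
print(f"I(a; y) = {marginal:.3f} bits, I(a; y | b) = {conditional:.3f} bits")
```

In this toy case the marginal mutual information is zero while the conditional value is one bit, which illustrates why a pairwise screen can find combinations that single-SNP tests miss.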
|
16
|
Mielniczuk J, Teisseyre P. A deeper look at two concepts of measuring gene-gene interactions: logistic regression and interaction information revisited. Genet Epidemiol 2017; 42:187-200. [PMID: 29265411 DOI: 10.1002/gepi.22108] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/11/2017] [Revised: 10/23/2017] [Accepted: 11/15/2017] [Indexed: 11/09/2022]
Abstract
Detection of gene-gene interactions is one of the most important challenges in genome-wide case-control studies. Besides traditional logistic regression analysis, entropy-based methods have recently attracted significant attention. Among entropy-based methods, interaction information is one of the most promising measures, having many desirable properties. Although both logistic regression and interaction information have been used in several genome-wide association studies, the relationship between them has not been thoroughly investigated theoretically. The present paper attempts to fill this gap. We show that although certain connections between the two methods exist, in general they refer to two different concepts of dependence, and looking for interactions in those two senses leads to different approaches to interaction detection. We introduce an ordering between interaction measures and specify conditions for independent and dependent genes under which interaction information is a more discriminative measure than logistic regression. Moreover, we show that for so-called perfect distributions those measures are equivalent. The numerical experiments illustrate the theoretical findings, indicating that interaction information and its modified version are more universal tools for detecting various types of interaction than logistic regression and linkage disequilibrium measures.
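For readers unfamiliar with interaction information, a common definition is II(X1; X2; Y) = I((X1, X2); Y) - I(X1; Y) - I(X2; Y), the information the pair carries about the outcome beyond the two marginal contributions. The Python sketch below assumes that definition (sign conventions differ between authors, and the abstract does not state which one the paper uses); the helper names and toy genotypes are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of equal-length discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def interaction_information(x1, x2, y):
    """II(X1;X2;Y) = I((X1,X2);Y) - I(X1;Y) - I(X2;Y) under the assumed convention."""
    pair = list(zip(x1, x2))  # treat the two genes as one joint variable
    return (mutual_information(pair, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))


# Hypothetical toy genotypes (binary coded) and two outcome patterns.
gene_1 = [0, 0, 1, 1]
gene_2 = [0, 1, 0, 1]
epistatic_outcome = [a ^ b for a, b in zip(gene_1, gene_2)]  # depends only on the pair
main_effect_outcome = list(gene_1)                           # depends only on gene_1

ii_epistatic = interaction_information(gene_1, gene_2, epistatic_outcome)
ii_main_effect = interaction_information(gene_1, gene_2, main_effect_outcome)
print(ii_epistatic, ii_main_effect)
```

Under this convention a positive value signals synergy between the two genes, while an outcome driven by a single gene yields zero.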
Affiliation(s)
- Jan Mielniczuk
- Institute of Computer Science, Polish Academy of Sciences, Poland; Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland
| | - Paweł Teisseyre
- Institute of Computer Science, Polish Academy of Sciences, Poland
| |
|
17
|
Bernardo M, Bioque M, Cabrera B, Lobo A, González-Pinto A, Pina L, Corripio I, Sanjuán J, Mané A, Castro-Fornieles J, Vieta E, Arango C, Mezquida G, Gassó P, Parellada M, Saiz-Ruiz J, Cuesta MJ, Mas S. Modelling gene-environment interaction in first episodes of psychosis. Schizophr Res 2017; 189:181-189. [PMID: 28179063 DOI: 10.1016/j.schres.2017.01.058] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 09/23/2016] [Revised: 01/24/2017] [Accepted: 01/30/2017] [Indexed: 12/20/2022]
Abstract
INTRODUCTION Recent research demonstrates the heterogeneous etiology of psychotic disorders, where gene-environment (GxE) interaction plays a key role. Large genetic studies have linked many genetic variants with schizophrenia, but each variant is only associated with a small effect and the GxE interaction contribution has not been evaluated. METHODS The PEPs Project was designed to carefully collect a large amount of genetic and environmental exposure data of 335 FEP patients and 253 matched healthy controls. 780 single-nucleotide polymorphisms (from 159 candidate genes) and 16 environmental variables previously reported as the main psychosis non-genetic risk factors were analyzed together using entropy-based measures of information gain. RESULTS Our analyses identified an interaction between nine SNPs and the exposure to the environmental risk factors of psychosis, showing a clear enrichment of genes linked to serotonin neurotransmission and neurodevelopmental processes. CONCLUSIONS This study has allowed the identification of several GxE interactions involved in the risk of presenting a FEP. Our results highlight the importance of serotonin neurotransmission interacting with certain environmental stimuli. The serotoninergic system may be playing a key role in the regulatory network of stress and other systems implicated in the emergence and development of psychotic disorders.
Affiliation(s)
- Miguel Bernardo
- Barcelona Clínic Schizophrenia Unit, Hospital Clínic de Barcelona, CIBERSAM, Spain; Universitat de Barcelona, IDIBAPS, Barcelona, Spain.
| | - Miquel Bioque
- Barcelona Clínic Schizophrenia Unit, Hospital Clínic de Barcelona, CIBERSAM, Spain
| | - Bibiana Cabrera
- Barcelona Clínic Schizophrenia Unit, Hospital Clínic de Barcelona, CIBERSAM, Spain
| | - Antonio Lobo
- Instituto de Investigación Sanitaria Aragón (IIS Aragón), University of Zaragoza, Spain
| | - Ana González-Pinto
- Department of Psychiatry, Hospital Universitario de Alava, CIBERSAM, University of the Basque Country, Spain
| | - Laura Pina
- Child and Adolescent Psychiatry Department, Hospital General Universitario Gregorio Marañón, IiSGM, CIBERSAM, School of Medicine, Universidad Complutense, Madrid, Spain
| | - Iluminada Corripio
- Department of Psychiatry, Hospital de Sant Pau, CIBERSAM, Barcelona, Spain
| | - Julio Sanjuán
- Clinic Hospital Valencia, INCLIVA, CIBERSAM, Valencia University, Spain
| | - Anna Mané
- Department of Psychiatry, Hospital del Mar, Barcelona, IMIM, Barcelona, Spain
| | - Josefina Castro-Fornieles
- Department of Child and Adolescent Psychiatry and Psychology, SGR-489, Neurosciences Institute, Hospital Clínic of Barcelona, IDIBAPS, CIBERSAM, University of Barcelona, Spain
| | - Eduard Vieta
- Hospital Clínic de Barcelona, Universitat de Barcelona, IDIBAPS, CIBERSAM, Spain
| | - Celso Arango
- Child and Adolescent Psychiatry Department, Hospital General Universitario Gregorio Marañón, IiSGM, CIBERSAM, School of Medicine, Universidad Complutense, Madrid, Spain
| | - Gisela Mezquida
- Barcelona Clínic Schizophrenia Unit, Hospital Clínic de Barcelona, CIBERSAM, Spain
| | - Patricia Gassó
- Department of Pathological Anatomy, Pharmacology and Microbiology, University of Barcelona, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), CIBERSAM, Barcelona, Spain
| | - Mara Parellada
- Child and Adolescent Psychiatry Department, Hospital General Universitario Gregorio Marañón, IiSGM, CIBERSAM, School of Medicine, Universidad Complutense, Madrid, Spain
| | - Jerónimo Saiz-Ruiz
- Hospital Ramón y Cajal, Universidad de Alcalá, IRYCIS, CIBERSAM, Madrid, Spain
| | - Manuel J Cuesta
- Psychiatric Department, Complejo Hospitalario de Navarra, Pamplona (Spain), Instituto de Investigación Sanitaria de Navarra (IdiSNA), Spain
| | - Sergi Mas
- Department of Pathological Anatomy, Pharmacology and Microbiology, University of Barcelona, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), CIBERSAM, Barcelona, Spain
| | | |
|
18
|
Sauce B, Matzel LD. The paradox of intelligence: Heritability and malleability coexist in hidden gene-environment interplay. Psychol Bull 2017; 144:26-47. [PMID: 29083200 DOI: 10.1037/bul0000131] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Indexed: 01/30/2023]
Abstract
Intelligence can have an extremely high heritability, but also be malleable; a paradox that has been the source of continuous controversy. Here we attempt to clarify the issue, and advance a frequently overlooked solution to the paradox: Intelligence is a trait with unusual properties that create a large reservoir of hidden gene-environment (GE) networks, allowing for the contribution of high genetic and environmental influences on individual differences in IQ. GE interplay is difficult to specify with current methods, and is underestimated in standard metrics of heritability (thus inflating estimates of "genetic" effects). We describe empirical evidence for GE interplay in intelligence, with malleability existing on top of heritability. The evidence covers cognitive gains consequent to adoption/immigration, changes in IQ's heritability across life span and socioeconomic status, gains in IQ over time consequent to societal development (the Flynn effect), the slowdown of age-related cognitive decline, and the gains in intelligence from early education. The GE solution has novel implications for enduring problems, including our inability to identify intelligence-related genes (also known as IQ's "missing heritability"), and the loss of initial benefits from early intervention programs (such as "Head Start"). The GE solution can be a powerful guide to future research, and may also aid policies to overcome barriers to the development of intelligence, particularly in impoverished and underprivileged populations.
Affiliation(s)
- Bruno Sauce
- Department of Psychology, Program in Behavioral and Systems Neuroscience, Rutgers University
| | - Louis D Matzel
- Department of Psychology, Program in Behavioral and Systems Neuroscience, Rutgers University
| |
|
19
|
Yan W, Li J, Liu M, Bai X, Shao H. Data-based multiple criteria decision-making model and visualized monitoring of urban drinking water quality. Soft Comput 2017. [DOI: 10.1007/s00500-017-2809-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/01/2022]
|
20
|
Balestre M, de Souza CL. Bayesian reversible-jump for epistasis analysis in genomic studies. BMC Genomics 2016; 17:1012. [PMID: 27938339 PMCID: PMC5148921 DOI: 10.1186/s12864-016-3342-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/12/2016] [Accepted: 11/25/2016] [Indexed: 12/03/2022] Open
Abstract
Background The large amount of data used in genomic analysis has allowed geneticists to achieve some understanding of the genetic architecture of complex traits. Although the information gathered by molecular markers has permitted gains in predictive accuracy and gene discovery, epistatic effects have been ignored based on exhaustive searches requesting estimates of its effects on the whole genome. In this work, we propose the reversible-jump technique to estimate epistasis in the genome without drastically altering the model dimension. To this end, we used a real maize dataset based on 256 F2:3 progenies plus a simulation data set based on 300 F2 individuals. In the simulation scenario, six QTL presenting main effects (additive and dominance) were combined with seven other epistatic effects totaling 13 QTL controlling the trait. Results Our model explored 18,624 candidate epistases, but even in this vast space, only one spurious interaction was found. The three epistases selected by our model, named here as 18x26, 56x68 and 59x93, were very close to simulated ones (19x25, 54x72, 59x91 and 59x94). In the real dataset, we estimate 33,024 epistatic effects, and several minor epistatic combinations were found to explain a significant proportion of the genetic variance. The broad participation of epistasis in the real dataset may indicate the presence of pervasive epistasis acting on maize grain yield. Conclusions The power of selecting true epistasis in thousands of possible combinations suggests the attractiveness of our model to handle genomic data.
Affiliation(s)
- Marcio Balestre
- Department of Statistics, Federal University of Lavras, Lavras, MG, CP 3037, Brazil.
| | - Claudio Lopes de Souza
- Departamento de Genética, Escola de Agricultura Luiz de Queiroz, Universidade de São Paulo (ESALQ-USP), Piracicaba, São Paulo, 13400-970 CP 83, Brazil
| |
|
21
|
Woo HJ, Yu C, Kumar K, Gold B, Reifman J. Genotype distribution-based inference of collective effects in genome-wide association studies: insights to age-related macular degeneration disease mechanism. BMC Genomics 2016; 17:695. [PMID: 27576376 PMCID: PMC5006276 DOI: 10.1186/s12864-016-2871-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/18/2015] [Accepted: 07/01/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Genome-wide association studies provide important insights to the genetic component of disease risks. However, an existing challenge is how to incorporate collective effects of interactions beyond the level of independent single nucleotide polymorphism (SNP) tests. While methods considering each SNP pair separately have provided insights, a large portion of expected heritability may reside in higher-order interaction effects. RESULTS We describe an inference approach (discrete discriminant analysis; DDA) designed to probe collective interactions while treating both genotypes and phenotypes as random variables. The genotype distributions in case and control groups are modeled separately based on empirical allele frequency and covariance data, whose differences yield disease risk parameters. We compared pairwise tests and collective inference methods, the latter based both on DDA and logistic regression. Analyses using simulated data demonstrated that significantly higher sensitivity and specificity can be achieved with collective inference in comparison to pairwise tests, and with DDA in comparison to logistic regression. Using age-related macular degeneration (AMD) data, we demonstrated two possible applications of DDA. In the first application, a genome-wide SNP set is reduced into a small number (∼100) of variants via filtering and SNP pairs with significant interactions are identified. We found that interactions between SNPs with highest AMD association were epigenetically active in the liver, adipocytes, and mesenchymal stem cells. In the other application, multiple groups of SNPs were formed from the genome-wide data and their relative strengths of association were compared using cross-validation. This analysis allowed us to discover novel collections of loci for which interactions between SNPs play significant roles in their disease association. In particular, we considered pathway-based groups of SNPs containing up to ∼10,000 variants in each group. In addition to pathways related to complement activation, our collective inference pointed to pathway groups involved in phospholipid synthesis, oxidative stress, and apoptosis, consistent with the AMD pathogenesis mechanism where the dysfunction of retinal pigment epithelium cells plays central roles. CONCLUSIONS The simultaneous inference of collective interaction effects within a set of SNPs has the potential to reveal novel aspects of disease association.
Affiliation(s)
- Hyung Jun Woo
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
| | - Chenggang Yu
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
| | - Kamal Kumar
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
| | - Bert Gold
- Laboratory of Genomic Diversity, National Cancer Institute, Frederick, Maryland, USA
| | - Jaques Reifman
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA.
| |
|
22
|
Sun L, Wang C, Hu YQ. Utilizing mutual information for detecting rare and common variants associated with a categorical trait. PeerJ 2016; 4:e2139. [PMID: 27350900 PMCID: PMC4918222 DOI: 10.7717/peerj.2139] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/26/2016] [Accepted: 05/25/2016] [Indexed: 11/20/2022] Open
Abstract
Background. Genome-wide association studies have succeeded in detecting novel common variants which associate with complex diseases. As a result of the fast changes in next generation sequencing technology, a large number of sequencing data are generated, which offers great opportunities to identify rare variants that could explain a larger proportion of missing heritability. Many effective and powerful methods are proposed, although they are usually limited to continuous, dichotomous or ordinal traits. Notice that traits having nominal categorical features are commonly observed in complex diseases, especially in mental disorders, which motivates the incorporation of the characteristics of the categorical trait into association studies with rare and common variants. Methods. We construct two simple and intuitive nonparametric tests, MIT and aMIT, based on mutual information for detecting association between genetic variants in a gene or region and a categorical trait. MIT and aMIT can gauge the difference among the distributions of rare and common variants across a region given every categorical trait value. If there is little association between variants and a categorical trait, MIT or aMIT approximately equals zero. The larger the difference in distributions, the greater values MIT and aMIT have. Therefore, MIT and aMIT have the potential for detecting functional variants. Results. We checked the validity of proposed statistics and compared them to the existing ones through extensive simulation studies with varied combinations of the numbers of variants of rare causal, rare non-causal, common causal, and common non-causal, deleterious and protective, various minor allele frequencies and different levels of linkage disequilibrium. The results show our methods have higher statistical power than conventional ones, including the likelihood based score test, in most cases: (1) there are multiple genetic variants in a gene or region; (2) both protective and deleterious variants are present; (3) there exist rare and common variants; and (4) more than half of the variants are neutral. The proposed tests are applied to the data from Collaborative Studies on Genetics of Alcoholism, and a competent performance is exhibited therein. Discussion. As a complement to the existing methods mainly focusing on quantitative traits, this study provides the nonparametric tests MIT and aMIT for detecting variants associated with a categorical trait. Furthermore, we plan to investigate the association between rare variants and multiple categorical traits.
Affiliation(s)
- Leiming Sun
- State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
| | - Chan Wang
- State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
| | - Yue-Qing Hu
- State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
| |
|
23
|
Lee W, Sjölander A, Pawitan Y. A Critical Look at Entropy-Based Gene-Gene Interaction Measures. Genet Epidemiol 2016; 40:416-24. [PMID: 27229752 DOI: 10.1002/gepi.21974] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/03/2015] [Revised: 02/28/2015] [Accepted: 03/17/2016] [Indexed: 11/12/2022]
Abstract
Several entropy-based measures for detecting gene-gene interaction have been proposed recently. It has been argued that the entropy-based measures are preferred because entropy can better capture the nonlinear relationships between genotypes and traits, so they can be useful to detect gene-gene interactions for complex diseases. These suggested measures look reasonable at an intuitive level, but so far there has been no detailed characterization of the interactions captured by them. Here we study analytically the properties of some entropy-based measures for detecting gene-gene interactions in detail. The relationship between interactions captured by the entropy-based measures and those of logistic regression models is clarified. In general we find that the entropy-based measures can suffer from a lack of specificity in terms of target parameters, i.e., they can detect uninteresting signals as interactions. Numerical studies are carried out to confirm theoretical findings.
Affiliation(s)
- Woojoo Lee
- Department of Statistics, Inha University, Nam-gu, Incheon, South Korea
| | - Arvid Sjölander
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Yudi Pawitan
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| |
|
24
|
Mas S, Gassó P, Morer A, Calvo A, Bargalló N, Lafuente A, Lázaro L. Integrating Genetic, Neuropsychological and Neuroimaging Data to Model Early-Onset Obsessive Compulsive Disorder Severity. PLoS One 2016; 11:e0153846. [PMID: 27093171 PMCID: PMC4836736 DOI: 10.1371/journal.pone.0153846] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/30/2015] [Accepted: 04/05/2016] [Indexed: 01/03/2023] Open
Abstract
We propose an integrative approach that combines structural magnetic resonance imaging data (MRI), diffusion tensor imaging data (DTI), neuropsychological data, and genetic data to predict early-onset obsessive compulsive disorder (OCD) severity. From a cohort of 87 patients, 56 with complete information were used in the present analysis. First, we performed a multivariate genetic association analysis of OCD severity with 266 genetic polymorphisms. This association analysis was used to select and prioritize the SNPs that would be included in the model. Second, we split the sample into a training set (N = 38) and a validation set (N = 18). Third, entropy-based measures of information gain were used for feature selection with the training subset. Fourth, the selected features were fed into two supervised methods of class prediction based on machine learning, using the leave-one-out procedure with the training set. Finally, the resulting model was validated with the validation set. Nine variables were used for the creation of the OCD severity predictor, including six genetic polymorphisms and three variables from the neuropsychological data. The developed model classified child and adolescent patients with OCD by disease severity with an accuracy of 0.90 in the testing set and 0.70 in the validation sample. Beyond its clinical applicability, the combination of particular neuropsychological, neuroimaging, and genetic characteristics could enhance our understanding of the neurobiological basis of the disorder.
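The third and fourth steps of this pipeline (information-gain feature selection on the training subset, then leave-one-out evaluation of a classifier) can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the abstract does not name the two supervised methods, so a 1-nearest-neighbour rule stands in for them, and every function name, feature value, and label below is hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature column."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_top_k(rows, labels, k):
    """Indices of the k feature columns with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


def predict_1nn(train_rows, train_labels, query, features):
    """Stand-in classifier: label of the closest training row (Hamming distance)."""
    def distance(row):
        return sum(row[j] != query[j] for j in features)
    best = min(range(len(train_rows)), key=lambda i: distance(train_rows[i]))
    return train_labels[best]


def leave_one_out_accuracy(rows, labels, features):
    """Hold out each sample in turn, predict it from the rest, report the hit rate."""
    hits = 0
    for i in range(len(rows)):
        rest_rows = rows[:i] + rows[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += predict_1nn(rest_rows, rest_labels, rows[i], features) == labels[i]
    return hits / len(rows)


# Hypothetical training subset: three discrete features per patient.
rows = [(0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0)]
labels = ["mild", "mild", "severe", "severe", "mild", "severe"]

selected = select_top_k(rows, labels, k=1)
accuracy = leave_one_out_accuracy(rows, labels, selected)
print(selected, accuracy)
```

Selecting features once on the whole training subset and then running leave-one-out on the same subset, as done here, can give optimistic accuracy estimates, which is one reason a separate validation set such as the one described in the abstract is useful.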
Affiliation(s)
- Sergi Mas
- Dept. Anatomic Pathology, Pharmacology and Microbiology, University of Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Patricia Gassó
- Dept. Anatomic Pathology, Pharmacology and Microbiology, University of Barcelona, Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Astrid Morer
- Department of Child and Adolescent Psychiatry and Psychology, Institute of Neurosciences, Hospital Clinic de Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Anna Calvo
- Magnetic Resonance Image Core Facility, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Nuria Bargalló
- Department of Radiology, Centre de Diagnostic per la Imatge, Hospital Clínic, Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Amalia Lafuente
- Dept. Anatomic Pathology, Pharmacology and Microbiology, University of Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| | - Luisa Lázaro
- Department of Child and Adolescent Psychiatry and Psychology, Institute of Neurosciences, Hospital Clinic de Barcelona, Barcelona, Spain
- Dept. Psychiatry and Clinical Psychobiology, University of Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Barcelona, Spain
- Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain
| |
Collapse
|
25
|
The application of information theory for the research of aging and aging-related diseases. Prog Neurobiol 2016; 157:158-173. [PMID: 27004830 DOI: 10.1016/j.pneurobio.2016.03.005] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 08/18/2015] [Revised: 03/13/2016] [Accepted: 03/19/2016] [Indexed: 11/23/2022]
Abstract
This article reviews the application of information-theoretical analysis, employing measures of entropy and mutual information, for the study of aging and aging-related diseases. The research of aging and aging-related diseases is particularly suitable for the application of information theory methods, as aging processes and related diseases are multi-parametric, with continuous parameters coexisting alongside discrete parameters, and with the relations between the parameters being as a rule non-linear. Information theory provides unique analytical capabilities for the solution of such problems, with unique advantages over common linear biostatistics. Among the age-related diseases, information theory has been used in the study of neurodegenerative diseases (particularly using EEG time series for diagnosis and prediction), cancer (particularly for establishing individual and combined cancer biomarkers), diabetes (mainly utilizing mutual information to characterize the diseased and aging states), and heart disease (mainly for the analysis of heart rate variability). Few works have employed information theory for the analysis of general aging processes and frailty, as underlying determinants and possible early preclinical diagnostic measures for aging-related diseases. Generally, the use of information-theoretical analysis permits not only establishing the (non-linear) correlations between diagnostic or therapeutic parameters of interest, but may also provide a theoretical insight into the nature of aging and related diseases by establishing the measures of variability, adaptation, regulation or homeostasis, within a system of interest. It may be hoped that the increased use of such measures in research may considerably increase diagnostic and therapeutic capabilities and the fundamental theoretical mathematical understanding of aging and disease.
|
26
|
Yu Z, Demetriou M, Gillen DL. Genome-Wide Analysis of Gene-Gene and Gene-Environment Interactions Using Closed-Form Wald Tests. Genet Epidemiol 2015; 39:446-55. [PMID: 26095143 DOI: 10.1002/gepi.21907] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/20/2014] [Revised: 02/25/2015] [Accepted: 05/06/2015] [Indexed: 01/31/2023]
Abstract
Despite the successful discovery of hundreds of variants for complex human traits using genome-wide association studies, the degree to which genes and environmental risk factors jointly affect disease risk is largely unknown. One obstacle toward this goal is that the computational effort required for testing gene-gene and gene-environment interactions is enormous. As a result, numerous computationally efficient tests were recently proposed. However, the validity of these methods often relies on unrealistic assumptions such as additive main effects, main effects at only one variable, no linkage disequilibrium between the two single-nucleotide polymorphisms (SNPs) in a pair or gene-environment independence. Here, we derive closed-form and consistent estimates for interaction parameters and propose to use Wald tests for testing interactions. The Wald tests are asymptotically equivalent to the likelihood ratio tests (LRTs), largely considered to be the gold standard tests but generally too computationally demanding for genome-wide interaction analysis. Simulation studies show that the proposed Wald tests have very similar performances with the LRTs but are much more computationally efficient. Applying the proposed tests to a genome-wide study of multiple sclerosis, we identify interactions within the major histocompatibility complex region. In this application, we find that (1) focusing on pairs where both SNPs are marginally significant leads to more significant interactions when compared to focusing on pairs where at least one SNP is marginally significant; and (2) parsimonious parameterization of interaction effects might decrease, rather than increase, statistical power.
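To make the idea of a closed-form Wald test for interaction concrete, here is a minimal sketch in Python. It is not the estimator derived by the authors: it assumes the simplest possible setting, two binary factors in a case-control study with a saturated logistic model, where the interaction coefficient and its variance can be written directly from the eight cell counts. The function name `wald_interaction_test` and the count layout are hypothetical choices made for illustration.

```python
import math


def wald_interaction_test(cases, controls):
    """Closed-form Wald test for a 2x2 interaction on the log-odds scale.

    Illustrative assumption: two binary factors (g1, g2), saturated logistic
    model. `cases` and `controls` map (g1, g2) in {0, 1}^2 to positive counts.
    Returns (interaction estimate, z statistic, two-sided p-value).
    """
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    if any(cases[c] <= 0 or controls[c] <= 0 for c in cells):
        raise ValueError("all eight cell counts must be positive")

    def log_odds(cell):
        # Log odds of being a case within one genotype combination.
        return math.log(cases[cell] / controls[cell])

    # Interaction = departure from additivity of the two main effects.
    beta = log_odds((1, 1)) - log_odds((1, 0)) - log_odds((0, 1)) + log_odds((0, 0))
    # Delta-method variance: sum of reciprocal counts over all eight cells.
    var = sum(1.0 / cases[c] + 1.0 / controls[c] for c in cells)
    z = beta / math.sqrt(var)
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, z, p
```

Because the estimate and its variance are simple functions of counts, no iterative model fitting is needed, which is the property that makes this kind of test attractive when the number of pairs to scan is very large.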
Affiliation(s)
- Zhaoxia Yu
  - Department of Statistics, University of California, Irvine, California, United States of America
- Michael Demetriou
  - Department of Neurology, University of California, Irvine, California, United States of America
  - Department of Microbiology & Molecular Genetics, University of California, Irvine, California, United States of America
- Daniel L Gillen
  - Department of Statistics, University of California, Irvine, California, United States of America
|
27
|
Su L, Liu G, Wang H, Tian Y, Zhou Z, Han L, Yan L. Research on single nucleotide polymorphisms interaction detection from network perspective. PLoS One 2015; 10:e0119146. [PMID: 25763929 PMCID: PMC4357495 DOI: 10.1371/journal.pone.0119146] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 06/18/2014] [Accepted: 01/09/2015] [Indexed: 12/02/2022] Open
Abstract
Single nucleotide polymorphisms (SNPs) found in genome-wide association studies (GWAS) mainly influence susceptibility to complex diseases, but they still cannot comprehensively explain the relationships between mutations and diseases. Interactions between SNPs are considered so important for a deeper understanding of those relationships that several strategies have been proposed to explore them. However, some of those methods perform poorly when the marginal effects of disease loci are weak or absent, others fail to consider high-order SNP interactions, and few meet the requirements in both performance and accuracy. For these reasons, detection methods should take into account not only low-order but also high-order SNP interactions, as well as main-effect SNPs, at an acceptable computational complexity. In this paper, a new pairwise (or low-order) interaction detection method, IG (Interaction Gain), is introduced, which does not require disease models and uses parallel computing. Furthermore, high-order SNP interactions are detected by finding closely connected function modules of the network constructed from IG detection results. Tested on a wide range of simulated datasets and four WTCCC real datasets, the proposed methods accurately detected both low-order and high-order SNP interactions as well as disease-associated main-effect SNPs, and surpassed all competitors in performance. The research will advance complex disease research by providing more reliable SNP interactions.
Affiliation(s)
- Lingtao Su
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Guixia Liu
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Han Wang
  - College of Computer Science and Information Technology, Northeast Normal University, Changchun, People’s Republic of China
- Yuan Tian
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Zhihui Zhou
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Liang Han
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Lun Yan
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
|
28
|
Abstract
Here we introduce artificial intelligence (AI) methodology for detecting and characterizing epistasis in genetic association studies. The ultimate goal of our AI strategy is to analyze genome-wide genetics data as a human would using sources of expert knowledge as a guide. The methodology presented here is based on computational evolution, which is a type of genetic programming. The ability to generate interesting solutions while at the same time learning how to solve the problem at hand distinguishes computational evolution from other genetic programming approaches. We provide a general overview of this approach and then present a few examples of its application to real data.
Affiliation(s)
- Jason H Moore
  - Department of Genetics, Geisel School of Medicine, DHMC, One Medical Center Dr., HB 7937, Lebanon, NH, 03756, USA
|
29
|
White MJ, Tacconelli A, Chen JS, Wejse C, Hill PC, Gomes VF, Velez-Edwards DR, Østergaard LJ, Hu T, Moore JH, Novelli G, Scott WK, Williams SM, Sirugo G. Epiregulin (EREG) and human V-ATPase (TCIRG1): genetic variation, ethnicity and pulmonary tuberculosis susceptibility in Guinea-Bissau and The Gambia. Genes Immun 2014; 15:370-7. [PMID: 24898387 DOI: 10.1038/gene.2014.28] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/14/2014] [Revised: 04/23/2014] [Accepted: 04/24/2014] [Indexed: 02/07/2023]
Abstract
We analyzed two West African samples (Guinea-Bissau: n=289 cases and 322 controls; The Gambia: n=240 cases and 248 controls) to evaluate single-nucleotide polymorphisms (SNPs) in Epiregulin (EREG) and V-ATPase (T-cell immune regulator 1 (TCIRG1)) using single and multilocus analyses to determine whether previously described associations with pulmonary tuberculosis (PTB) in Vietnamese and Italians would replicate in African populations. We did not detect any significant single locus or haplotype associations in either sample. We also performed exploratory pairwise interaction analyses using Visualization of Statistical Epistasis Networks (ViSEN), a novel method to detect only interactions among multiple variables, to elucidate possible interaction effects between SNPs and demographic factors. Although we found no strong evidence of marginal effects, there were several significant pairwise interactions that were identified in either the Guinea-Bissau or the Gambian samples, two of which replicated across populations. Our results indicate that the effects of EREG and TCIRG1 variants on PTB susceptibility, to the extent that they exist, are dependent on gene-gene interactions in West African populations as detected with ViSEN. In addition, epistatic effects are likely to be influenced by inter- and intra-population differences in genetic or environmental context and/or the mycobacterial lineages causing disease.
Affiliation(s)
- M J White
  - Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
  - Department of Genetics and Institute of Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA
- A Tacconelli
  - Centro di Ricerca, Ospedale San Pietro Fatebenefratelli, Rome, Italy
- J S Chen
  - Department of Genetics and Institute of Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA
- C Wejse
  - Bandim Health Project, Danish Epidemiology Science Centre and Statens Serum Institute, Bissau, Guinea-Bissau
  - Department of Infectious Diseases, Aarhus University Hospital, Skejby, Denmark
  - Center for Global Health, School of Public Health, Aarhus University, Skejby, Denmark
- P C Hill
  - Centre for International Health, University of Otago School of Medicine, Dunedin, New Zealand
  - MRC Laboratories, Fajara, The Gambia
- V F Gomes
  - Bandim Health Project, Danish Epidemiology Science Centre and Statens Serum Institute, Bissau, Guinea-Bissau
- D R Velez-Edwards
  - Vanderbilt Epidemiology Center, Vanderbilt University, Nashville, TN, USA
  - Institute for Medicine and Public Health, Vanderbilt University, Nashville, TN, USA
  - Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
  - Department of Obstetrics and Gynecology, Vanderbilt University, Nashville, TN, USA
- L J Østergaard
  - Department of Infectious Diseases, Aarhus University Hospital, Skejby, Denmark
- T Hu
  - Department of Genetics and Institute of Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA
- J H Moore
  - Department of Genetics and Institute of Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA
- G Novelli
  - Centro di Ricerca, Ospedale San Pietro Fatebenefratelli, Rome, Italy
  - Dipartimento di Biomedicina e Prevenzione, Sezione di Genetica, Università di Roma 'Tor Vergata', Rome, Italy
- W K Scott
  - Dr John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- S M Williams
  - Department of Genetics and Institute of Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA
- G Sirugo
  - Centro di Ricerca, Ospedale San Pietro Fatebenefratelli, Rome, Italy
|
30
|
El-Serag HB, Kanwal F, Davila JA, Kramer J, Richardson P. A new laboratory-based algorithm to predict development of hepatocellular carcinoma in patients with hepatitis C and cirrhosis. Gastroenterology 2014; 146:1249-55.e1. [PMID: 24462733 PMCID: PMC3992177 DOI: 10.1053/j.gastro.2014.01.045] [Citation(s) in RCA: 127] [Impact Index Per Article: 12.7] [Received: 10/02/2013] [Revised: 01/13/2014] [Accepted: 01/20/2014] [Indexed: 02/08/2023]
Abstract
BACKGROUND & AIMS Serum levels of α-fetoprotein (AFP) are influenced not only by the presence of hepatocellular carcinoma (HCC), but also by the underlying severity and activity of liver disease, which is reflected by liver function tests. We constructed an AFP-based algorithm that included these factors to identify patients at risk for HCC, and tested its predictive ability in a large set of patients with cirrhosis. METHODS We used the national Department of Veterans Affairs Hepatitis C Virus Clinical Case Registry to identify patients with cirrhosis, results from at least 1 AFP test, and 6 months of follow-up. Our algorithm included data on age; levels of aspartate aminotransferase, alanine aminotransferase (ALT), alkaline phosphatase, total bilirubin, albumin, creatinine, and hemoglobin; prothrombin time; and numbers of platelets and white cells. We examined the operating characteristics (calibration, discrimination, predictive values) of several different algorithms for identification of patients who would develop HCC within 6 months of the AFP test. We assessed our final model in the development and validation subsets. RESULTS We identified 11,721 patients with hepatitis C virus-related cirrhosis in whom 35,494 AFP tests were performed, and 987 patients developed HCC. A predictive model that included data on levels of AFP, ALT, and platelets, along with age at time of AFP test (and interaction terms between AFP and ALT, and AFP and platelets), best discriminated between patients who did and did not develop HCC. Using this AFP-adjusted model, the predictive accuracy increased at different AFP cutoffs compared with AFP alone. At any given AFP value, low numbers of platelets and ALT and older age were associated with increased risk of HCC, and high levels of ALT and normal/high numbers of platelets were associated with low risk for HCC. For example, the probabilities of HCC, based only on 20 ng/mL and 120 ng/mL AFP, were 3.5% and 11.4%, respectively. 
However, patients with the same AFP values (20 ng/mL and 120 ng/mL) who were 70 years old, with ALT levels of 40 IU/mL and platelet counts of 100,000, had probabilities of developing HCC of 8.1% and 29.0%, respectively. CONCLUSIONS We developed and validated an algorithm based on levels of AFP, platelets, and ALT, along with age, which increased the predictive value for identifying patients with hepatitis C virus-associated cirrhosis likely to develop HCC within 6 months. If validated in other patient groups, this model would have immediate clinical applicability.
|
31
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
  - IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097
|
32
|
Hu T, Pan Q, Andrew AS, Langer JM, Cole MD, Tomlinson CR, Karagas MR, Moore JH. Functional genomics annotation of a statistical epistasis network associated with bladder cancer susceptibility. BioData Min 2014; 7:5. [PMID: 24725556 PMCID: PMC3989783 DOI: 10.1186/1756-0381-7-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/03/2013] [Accepted: 04/05/2014] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Several different genetic and environmental factors have been identified as independent risk factors for bladder cancer in population-based studies. Recent studies have turned to understanding the role of gene-gene and gene-environment interactions in determining risk. We previously developed the bioinformatics framework of statistical epistasis networks (SEN) to characterize the global structure of interacting genetic factors associated with a particular disease or clinical outcome. By applying SEN to a population-based study of bladder cancer among Caucasians in New Hampshire, we were able to identify a set of connected genetic factors with strong and significant interaction effects on bladder cancer susceptibility. FINDINGS To support our statistical findings using networks, in the present study, we performed pathway enrichment analyses on the set of genes identified using SEN, and found that they are associated with the carcinogen benzo[a]pyrene, a component of tobacco smoke. We further carried out an mRNA expression microarray experiment to validate statistical genetic interactions, and to determine if the set of genes identified in the SEN were differentially expressed in a normal bladder cell line and a bladder cancer cell line in the presence or absence of benzo[a]pyrene. Significant nonrandom sets of genes from the SEN were found to be differentially expressed in response to benzo[a]pyrene in both the normal bladder cells and the bladder cancer cells. In addition, the patterns of gene expression were significantly different between these two cell types. CONCLUSIONS The enrichment analyses and the gene expression microarray results support the idea that SEN analysis of bladder cancer in population-based studies is able to identify biologically meaningful statistical patterns. These results bring us a step closer to a systems genetic approach to understanding cancer susceptibility that integrates population and laboratory-based studies.
Affiliation(s)
- Ting Hu
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
- Qinxin Pan
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Angeline S Andrew
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
  - Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Jillian M Langer
  - Department of Pharmacology & Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Michael D Cole
  - Department of Pharmacology & Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Craig R Tomlinson
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
  - Department of Pharmacology & Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Margaret R Karagas
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
  - Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Jason H Moore
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
  - Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
|
33
|
Pan Q, Hu T, Malley JD, Andrew AS, Karagas MR, Moore JH. A system-level pathway-phenotype association analysis using synthetic feature random forest. Genet Epidemiol 2014; 38:209-19. [PMID: 24535726 DOI: 10.1002/gepi.21794] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/30/2013] [Revised: 11/21/2013] [Accepted: 01/02/2014] [Indexed: 11/07/2022]
Abstract
As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. As a result, due to its system-level interpretability, pathway analysis has become a popular tool for gaining insights on the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, embracing the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways needs to be addressed. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN. 
The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validations.
Affiliation(s)
- Qinxin Pan
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, United States of America
|
34
|
Hutter CM, Mechanic LE, Chatterjee N, Kraft P, Gillanders EM. Gene-environment interactions in cancer epidemiology: a National Cancer Institute Think Tank report. Genet Epidemiol 2013; 37:643-57. [PMID: 24123198 PMCID: PMC4143122 DOI: 10.1002/gepi.21756] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.1] [Received: 06/03/2013] [Revised: 08/06/2013] [Accepted: 08/14/2013] [Indexed: 01/04/2023]
Abstract
Cancer risk is determined by a complex interplay of genetic and environmental factors. Genome-wide association studies (GWAS) have identified hundreds of common (minor allele frequency [MAF] > 0.05) and less common (0.01 < MAF < 0.05) genetic variants associated with cancer. The marginal effects of most of these variants have been small (odds ratios: 1.1-1.4). There remain unanswered questions on how best to incorporate the joint effects of genes and environment, including gene-environment (G × E) interactions, into epidemiologic studies of cancer. To help address these questions, and to better inform research priorities and allocation of resources, the National Cancer Institute sponsored a "Gene-Environment Think Tank" on January 10-11, 2012. The objective of the Think Tank was to facilitate discussions on (1) the state of the science, (2) the goals of G × E interaction studies in cancer epidemiology, and (3) opportunities for developing novel study designs and analysis tools. This report summarizes the Think Tank discussion, with a focus on contemporary approaches to the analysis of G × E interactions. Selecting the appropriate methods requires first identifying the relevant scientific question and rationale, with an important distinction made between analyses aiming to characterize the joint effects of putative or established genetic and environmental factors and analyses aiming to discover novel risk factors or novel interaction effects. Other discussion items include measurement error, statistical power, significance, and replication. Additional designs, exposure assessments, and analytical approaches need to be considered as we move from the current small number of success stories to a fuller understanding of the interplay of genetic and environmental factors.
Affiliation(s)
- Carolyn M Hutter
  - Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
|
35
|
Gong M, Yi Q, Wang W. Association between NQO1 C609T polymorphism and bladder cancer susceptibility: a systemic review and meta-analysis. Tumour Biol 2013; 34:2551-6. [PMID: 23749485 DOI: 10.1007/s13277-013-0799-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/23/2013] [Accepted: 04/08/2013] [Indexed: 12/24/2022] Open
Abstract
There is growing evidence for the important roles of genetic factors in the host's susceptibility to bladder cancer. NAD(P)H:quinone oxidoreductase 1 (NQO1) is a cytosolic enzyme that catalyzes the two-electron reduction of quinoid compounds into hydroquinones. Since the NQO1 C609T polymorphism is linked to the enzymatic activity of NQO1, it has also been hypothesized that the NQO1 C609T polymorphism may affect the host's susceptibility to bladder cancer by modifying the exposure to carcinogens. Many studies have assessed the association between the NQO1 C609T polymorphism and bladder cancer risk, but they reported contradictory results. We conducted a meta-analysis to examine the hypothesis that the NQO1 C609T polymorphism modifies the risk of bladder cancer. Eleven case-control studies with 2,937 bladder cancer cases and 3,008 controls were included in the meta-analysis. Overall, there was no obvious association between the NQO1 C609T polymorphism and bladder cancer susceptibility (for T versus C: odds ratio (OR) = 1.12, 95% confidence interval (95% CI) 0.99-1.26, P_OR = 0.069; for TT versus CC: OR = 1.31, 95% CI 0.95-1.81, P_OR = 0.100; for TT/CT versus CC: OR = 1.06, 95% CI 0.95-1.18, P_OR = 0.304; for TT versus CT/CC: OR = 1.29, 95% CI 0.94-1.77, P_OR = 0.112). After adjusting for heterogeneity, meta-analysis of the remaining 10 studies showed an obvious association between the NQO1 C609T polymorphism and bladder cancer susceptibility (for T versus C: OR = 1.18, 95% CI 1.06-1.31, P_OR = 0.003; for TT versus CC: OR = 1.47, 95% CI 1.14-1.90, P_OR = 0.003; for TT/CT versus CC: OR = 1.16, 95% CI 1.01-1.34, P_OR = 0.036; for TT versus CT/CC: OR = 1.39, 95% CI 1.10-1.75, P_OR = 0.006). There was low risk of publication bias. Therefore, our meta-analysis suggests that the NQO1 C609T polymorphism is associated with bladder cancer susceptibility.
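The P values quoted in the abstract above can be cross-checked against the reported odds ratios and confidence intervals. The sketch below assumes that each interval is a symmetric Wald interval on the log-odds scale, which is an assumption about how the intervals were computed rather than something the abstract states; the helper name `p_from_or_ci` is hypothetical.

```python
import math


def p_from_or_ci(or_hat, ci_low, ci_high, z_crit=1.959964):
    """Recover a two-sided p-value from an odds ratio and its 95% CI.

    Assumes the CI is exp(log(OR) +/- z_crit * SE), i.e. a Wald interval
    on the log scale. Returns (z statistic, p-value).
    """
    # Standard error of log(OR), back-calculated from the interval width.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(or_hat) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# T versus C, all eleven studies (values taken from the abstract).
z_all, p_all = p_from_or_ci(1.12, 0.99, 1.26)
# T versus C, after excluding one study (values taken from the abstract).
z_sub, p_sub = p_from_or_ci(1.18, 1.06, 1.31)
print(f"all studies: z={z_all:.2f}, p={p_all:.3f}")
print(f"10 studies:  z={z_sub:.2f}, p={p_sub:.4f}")
```

Under this assumption the recovered p-values should land close to the reported 0.069 and 0.003; small differences are expected because the published odds ratios and interval bounds are rounded to two decimals.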
Affiliation(s)
- Min Gong
  - Department of Urology, Shanghai Pudong Hospital, Fudan University Pudong Medical Center, 2800 Gongwei Road, Huinan Town, Shanghai, 201399, China
|
36
|
Hu T, Chen Y, Kiralis JW, Moore JH. ViSEN: methodology and software for visualization of statistical epistasis networks. Genet Epidemiol 2013; 37:283-5. [PMID: 23468157 DOI: 10.1002/gepi.21718] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 09/13/2012] [Revised: 12/20/2012] [Accepted: 02/05/2013] [Indexed: 11/06/2022]
Abstract
The nonlinear interaction effect among multiple genetic factors, i.e., epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions of order two. We developed ViSEN to analyze and visualize both two-way and three-way epistatic interactions. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful for inferring the underlying genetic architecture of complex diseases and for generating plausible hypotheses for further biological validation. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.
Affiliation(s)
- Ting Hu
- Institute for Quantitative Biomedical Sciences, Dartmouth College, New Hampshire, USA
|
37
|
Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An information-gain approach to detecting three-way epistatic interactions in genetic association studies. J Am Med Inform Assoc 2013; 20:630-6. [PMID: 23396514 PMCID: PMC3721169 DOI: 10.1136/amiajnl-2012-001525] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Background: Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. Objectives: In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis. Methods: Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis. Results: Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. Conclusion: Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
Affiliation(s)
- Ting Hu
- Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, USA
|
38
|
Wu C, Li S, Cui Y. Genetic association studies: an information content perspective. Curr Genomics 2012; 13:566-73. [PMID: 23633916 PMCID: PMC3468889 DOI: 10.2174/138920212803251382] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2012] [Revised: 06/04/2012] [Accepted: 06/18/2012] [Indexed: 01/02/2023] Open
Abstract
The availability of high-density single nucleotide polymorphism (SNP) data has made it possible for human genetic association studies to identify common and rare variants underlying complex diseases on a genome-wide scale. A handful of novel genetic variants have been identified, which gives much hope for the future of genetic association studies. In this process, statistical and computational methods play key roles, among which information-based association tests have gained wide popularity. This paper is intended to give a comprehensive review of the current literature on genetic association analysis cast in the framework of information theory. We focus our review on the following topics: (1) information theoretic approaches in genetic linkage and association studies; (2) entropy-based strategies for optimal SNP subset selection; and (3) the usage of theoretic information criteria in gene clustering and gene regulatory network construction.
Affiliation(s)
- Cen Wu
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
- Shaoyu Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
- Center for Computational Biology, Beijing Forestry University, Beijing, China 100083
|
39
|
Fan R, Albert PS, Schisterman EF. A discussion of gene-gene and gene-environment interactions and longitudinal genetic analysis of complex traits. Stat Med 2012; 31:2565-8. [PMID: 22969024 PMCID: PMC3458189 DOI: 10.1002/sim.5495] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Epidemiology, Statistics, and Prevention, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, 6100 Executive Blvd, Room 7B05, MSC 7510, Rockville, MD 20852, USA.
|
40
|
Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 110] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
Interest in performing gene-environment interaction studies has increased significantly with the rise of advanced molecular genetics techniques. Practically, it became possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors for disease. Several rationales exist to set up gene-environment interaction studies, and the technical challenges related to these studies (when the number of environmental or genetic risk factors is relatively small) have been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, they usually face problems similar to those of genome-wide association gene-gene interaction studies. However, additional complexities need to be considered which are typical for large-scale epidemiological studies, but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Affiliation(s)
- Hugues Aschard
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
|
41
|
McKinney BA, Pajewski NM. Six Degrees of Epistasis: Statistical Network Models for GWAS. Front Genet 2012; 2:109. [PMID: 22303403 PMCID: PMC3261632 DOI: 10.3389/fgene.2011.00109] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2011] [Accepted: 12/22/2011] [Indexed: 11/18/2022] Open
Abstract
There is growing evidence that much more of the genome than previously thought is required to explain the heritability of complex phenotypes. Recent studies have demonstrated that numerous common variants from across the genome explain portions of genetic variability, spawning various avenues of research directed at explaining the remaining heritability. This polygenic structure is also the motivation for the growing application of pathway and gene set enrichment techniques, which have yielded promising results. These findings suggest that the coordination of genes in pathways that are known to occur at the gene regulatory level also can be detected at the population level. Although genes in these networks interact in complex ways, most population studies have focused on the additive contribution of common variants and the potential of rare variants to explain additional variation. In this brief review, we discuss the potential to explain additional genetic variation through the agglomeration of multiple gene-gene interactions as well as main effects of common variants in terms of a network paradigm. Just as is the case for single-locus contributions, we expect each gene-gene interaction edge in the network to have a small effect, but these effects may be reinforced through hubs and other connectivity structures in the network. We discuss some of the opportunities and challenges of network methods for analyzing genome-wide association studies (GWAS) such as the study of hubs and motifs, and integrating other types of variation and environmental interactions. Such network approaches may unveil hidden variation in GWAS, improve understanding of mechanisms of disease, and possibly fit into a network paradigm of evolutionary genetics.
Affiliation(s)
- B. A. McKinney
- Department of Mathematics, Tandy School of Computer Science, University of Tulsa, Tulsa, OK, USA
- Nicholas M. Pajewski
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
|