1
Moon J, Posada-Quintero HF, Chon KH. Genetic data visualization using literature text-based neural networks: Examples associated with myocardial infarction. Neural Netw 2023; 165:562-595. [PMID: 37364469 DOI: 10.1016/j.neunet.2023.05.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2022] [Revised: 04/11/2023] [Accepted: 05/09/2023] [Indexed: 06/28/2023]
Abstract
Data visualization is critical to unraveling hidden information from complex and high-dimensional data. Interpretable visualization methods are critical, especially in the biological and medical fields; however, there are few effective visualization methods for large genetic data. Current visualization methods are limited to lower-dimensional data, and their performance suffers when data are missing. In this study, we propose a literature-based visualization method to reduce high-dimensional data without compromising the dynamics of the single nucleotide polymorphisms (SNP) and textual interpretability. Our method is innovative because it is shown to (1) preserve both global and local structures of SNP while reducing the dimension of the data using literature text representations, and (2) enable interpretable visualizations using textual information. For performance evaluations, we examined the proposed approach on the classification of several categories, including race, myocardial infarction event age groups, and sex, using several machine learning models on the literature-derived SNP data. We used visualization approaches to examine clustering of data as well as quantitative performance metrics for the classification of the risk factors examined above. Our method outperformed all popular dimensionality reduction and visualization methods for both classification and visualization, and it is robust against missing and higher-dimensional data. Moreover, we found it feasible to incorporate both genetic and other risk information obtained from literature with our method.
Affiliation(s)
- Jihye Moon
- Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06269, USA.
- Ki H Chon
- Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06269, USA.
2
Sun R, Weng H, Wang MH. W-Test for Genetic Epistasis Testing. Methods Mol Biol 2021; 2212:45-53. [PMID: 33733349 DOI: 10.1007/978-1-0716-0947-7_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
The genetic epistasis effect has been widely acknowledged as an essential contributor to genetic variation in complex diseases. In this chapter, we introduce a powerful and efficient statistical method, called the W-test, for genetic epistasis testing. The wtest R package was developed to implement the W-test method; it provides various functions to measure the main effect, pairwise interaction, higher-order interaction, and cis-regulation of SNP-CpG pairs in genetic and epigenetic data. It allows flexible stagewise and exhaustive association testing as well as diagnostic checking on the probability distributions in a user-friendly interface. The wtest package is available on CRAN at https://CRAN.R-project.org/package=wtest.
Affiliation(s)
- Rui Sun
- The Chinese University of Hong Kong, Hong Kong, China
- Haoyi Weng
- The Chinese University of Hong Kong, Hong Kong, China
3
Zhang S, Xue Y, Zhang Q, Ma C, Wu M, Ma S. Identification of gene-environment interactions with marginal penalization. Genet Epidemiol 2019; 44:159-196. [PMID: 31724772 DOI: 10.1002/gepi.22270] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/18/2019] [Revised: 10/05/2019] [Accepted: 10/25/2019] [Indexed: 12/29/2022]
Abstract
Gene-environment (G-E) interaction analysis has been extensively conducted for complex diseases. In marginal analysis, the common practice is to conduct likelihood-based (and other "standard") estimation with each marginal model, and then select significant G-E interactions and main effects based on p values and multiple comparisons adjustment. One limitation of this approach is that the identification results often do not respect the "main effects, interactions" hierarchy, which has been stressed in recent G-E interaction analyses. Some recent efforts tackle this problem, but with very complex formulations. Another limitation of the common practice is that it may not perform well when regularization is needed, for example, because of "non-normal" distributions. In this article, we propose a marginal penalization approach which adopts a novel penalty to directly tackle the aforementioned problems. The proposed approach has a framework more coherent with that of the recently developed joint analysis methods and an intuitive formulation, and can be effectively realized. In simulation, it outperforms the popular significance-based analysis and simple penalization-based alternatives. Promising findings are made in the analysis of a single-nucleotide polymorphism dataset and a gene expression dataset.
Affiliation(s)
- Sanguo Zhang
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yuan Xue
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China; Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Chenjin Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut; School of Statistics, Renmin University, Beijing, China
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
4
Yang CH, Chuang LY, Lin YD. Multiobjective multifactor dimensionality reduction to detect SNP-SNP interactions. Bioinformatics 2019; 34:2228-2236. [PMID: 29471406 DOI: 10.1093/bioinformatics/bty076] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/04/2017] [Accepted: 02/16/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation: Single-nucleotide polymorphism (SNP)-SNP interactions (SSIs) are popular markers for understanding disease susceptibility. Multifactor dimensionality reduction (MDR) can successfully detect considerable SSIs. Currently, MDR-based methods mainly adopt a single-objective function (a single measure based on contingency tables) to detect SSIs. However, generally, a single-measure function might not yield favorable results due to potential model preferences and disease complexities. Approach: This study proposes a multiobjective MDR (MOMDR) method that is based on a contingency table of MDR as an objective function. MOMDR considers the incorporated measures, including correct classification and likelihood rates, to detect SSIs and adopts set theory to predict the most favorable SSIs with cross-validation consistency. MOMDR enables simultaneously using multiple measures to determine potential SSIs. Results: Three simulation studies were conducted to compare the detection success rates of MOMDR and single-objective MDR (SOMDR), revealing that MOMDR had higher detection success rates than SOMDR. Furthermore, the Wellcome Trust Case Control Consortium dataset was analyzed by MOMDR to detect SSIs associated with coronary artery disease. Availability and implementation: MOMDR is freely available at https://goo.gl/M8dpDg. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan; Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Li-Yeh Chuang
- Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
- Yu-Da Lin
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
5
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Affiliation(s)
- Mengyun Wu
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
- Shuangge Ma
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
6
Yang CH, Lin YD, Chuang LY. Multiple-Criteria Decision Analysis-Based Multifactor Dimensionality Reduction for Detecting Gene-Gene Interactions. IEEE J Biomed Health Inform 2018; 23:416-426. [PMID: 29993963 DOI: 10.1109/jbhi.2018.2790951] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/06/2022]
Abstract
Gene-gene interactions (GGIs) are important markers for determining susceptibility to a disease. Multifactor dimensionality reduction (MDR) is a popular algorithm for detecting GGIs and primarily adopts the correct classification rate (CCR) to assess the quality of a GGI. However, CCR measurement alone may not successfully detect certain GGIs because of potential model preferences and disease complexities. In this study, a multiple-criteria decision analysis (MCDA) method based on MDR, named MCDA-MDR, is proposed for detecting GGIs. MCDA allows MDR to simultaneously adopt multiple measures within the two-way contingency table of MDR to assess GGIs; the CCR and the rule utility measure were employed. Cross-validation consistency was adopted to determine the most favorable GGIs among the Pareto sets. Simulation studies were conducted to compare the detection success rates of the MDR-only-based measure and MCDA-MDR, revealing that MCDA-MDR had superior detection success rates. The Wellcome Trust Case Control Consortium dataset was analyzed using MCDA-MDR to detect GGIs associated with coronary artery disease, and MCDA-MDR successfully detected numerous significant GGIs (p < 0.001). MCDA-MDR performance assessment revealed that the applied MCDA successfully enhanced the GGI detection success rate of the MDR-based method compared with MDR alone.
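The idea of a Pareto set over two measures can be illustrated with a short sketch. This is not the MCDA-MDR implementation: the function name `pareto_set`, the candidate labels, and the scores are all hypothetical, and the code only shows generic non-dominated filtering, assuming that higher is better for both measures.

```python
from typing import Dict, List, Tuple


def pareto_set(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated when another one is at least as good on
    both measures and strictly better on at least one of them.
    """
    front = []
    for name, (a, b) in scores.items():
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for other, (a2, b2) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (measure 1, measure 2) scores for four candidate SNP pairs.
candidates = {
    "snp1-snp2": (0.62, 0.30),
    "snp1-snp3": (0.58, 0.41),
    "snp2-snp3": (0.55, 0.25),
    "snp2-snp4": (0.62, 0.28),
}

print(pareto_set(candidates))  # ['snp1-snp2', 'snp1-snp3']
```

Two candidates survive because neither beats the other on both measures at once; a further criterion, such as the cross-validation consistency mentioned in the abstract, would be needed to choose between them.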
7
Ren WL, Wen YJ, Dunwell JM, Zhang YM. pKWmEB: integration of Kruskal-Wallis test with empirical Bayes under polygenic background control for multi-locus genome-wide association study. Heredity (Edinb) 2018; 120:208-218. [PMID: 29234158 PMCID: PMC5836593 DOI: 10.1038/s41437-017-0007-4] [Citation(s) in RCA: 90] [Impact Index Per Article: 15.0] [Received: 06/27/2017] [Revised: 08/30/2017] [Accepted: 09/06/2017] [Indexed: 11/08/2022] Open
Abstract
Although nonparametric methods in genome-wide association studies (GWAS) are robust in quantitative trait nucleotide (QTN) detection, the absence of polygenic background control in single-marker association in genome-wide scans results in a high false positive rate. To overcome this issue, we proposed an integrated nonparametric method for multi-locus GWAS. First, a new model transformation was used to whiten the covariance matrix of polygenic matrix K and environmental noise. Using the transformed model, the Kruskal-Wallis test along with least angle regression was then used to select all the markers that were potentially associated with the trait. Finally, all the selected markers were placed into a multi-locus model, their effects were estimated by empirical Bayes, and all the nonzero effects were further identified by a likelihood ratio test for true QTN detection. This method, named pKWmEB, was validated by a series of Monte Carlo simulation studies. As a result, pKWmEB effectively controlled the false positive rate, although a less stringent significance criterion was adopted. More importantly, pKWmEB retained the high power of the Kruskal-Wallis test and provided QTN effect estimates. To further validate pKWmEB, we re-analyzed four flowering time related traits in Arabidopsis thaliana, and detected some previously reported genes that were not identified by the other methods.
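As a rough illustration of the single-marker building block named in the abstract, the sketch below computes the Kruskal-Wallis H statistic for one marker by grouping phenotype ranks by genotype. It is not the pKWmEB procedure: there is no model transformation, no least angle regression, and no empirical Bayes step. The function names and the toy data are hypothetical, and the statistic is computed without the tie correction.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one rank (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(genotypes: Sequence[Hashable],
                     phenotypes: Sequence[float]) -> float:
    """Kruskal-Wallis H for one marker (no tie correction applied)."""
    ranks = average_ranks(phenotypes)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for g, r in zip(genotypes, ranks):
        rank_sums[g] += r
        counts[g] += 1
    n = len(phenotypes)
    total = sum(rank_sums[g] ** 2 / counts[g] for g in rank_sums)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical toy data: genotype codes 0/1/2 and a quantitative trait.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
phenotypes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(round(kruskal_wallis_h(genotypes, phenotypes), 3))  # 7.2
```

Under the null hypothesis, H is approximately chi-square distributed with (number of genotype groups minus 1) degrees of freedom, so a large H for a marker suggests that the trait distribution differs across genotypes.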
Affiliation(s)
- Wen-Long Ren
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, 210095, China
- Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
- Yang-Jun Wen
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, 210095, China
- Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
- Jim M Dunwell
- School of Agriculture, Policy and Development, University of Reading, Reading, RG6 6AR, UK
- Yuan-Ming Zhang
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, 210095, China
- Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
8
Yu W, Lee S, Park T. A unified model based multifactor dimensionality reduction framework for detecting gene-gene interactions. Bioinformatics 2017; 32:i605-i610. [PMID: 27587680 DOI: 10.1093/bioinformatics/btw424] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Gene-gene interaction (GGI) is one of the most popular approaches for finding and explaining the missing heritability of common complex traits in genome-wide association studies. The multifactor dimensionality reduction (MDR) method has been widely studied for detecting GGI effects. However, there are several disadvantages of the existing MDR-based approaches, such as the lack of an efficient way of evaluating the significance of multi-locus models and the high computational burden due to intensive permutation. Furthermore, the MDR method does not distinguish marginal effects from pure interaction effects. METHODS: We propose a two-step unified model based MDR approach (UM-MDR), in which the significance of a multi-locus model, even a high-order model, can be easily obtained through a regression framework with a semi-parametric correction procedure for controlling Type I error rates. In comparison to the conventional permutation approach, the proposed semi-parametric correction procedure avoids heavy computation when assessing the significance of a multi-locus model. The proposed UM-MDR approach is flexible in the sense that it is able to incorporate different types of traits and evaluate the significance of the existing MDR extensions. RESULTS: Simulation studies and the analysis of a real example are provided to demonstrate the utility of the proposed method. UM-MDR can achieve at least the same power as MDR for most scenarios, and it outperforms MDR especially when there are some single nucleotide polymorphisms that only have marginal effects, which mask the detection of causal epistasis for the existing MDR approaches. CONCLUSIONS: UM-MDR provides a very good supplement to the existing MDR method due to its efficiency in assessing significance for every multi-locus model, its power, and its flexibility in handling different types of traits.
AVAILABILITY AND IMPLEMENTATION: An R package "umMDR" and other source code are freely available at http://statgen.snu.ac.kr/software/umMDR/ CONTACT: tspark@stats.snu.ac.kr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenbao Yu
- Department of Statistics, Seoul National University, Shilim-Dong, Kwanak-Gu, Seoul 151-742, Korea
- Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Seoul 143-747, Korea
- Taesung Park
- Department of Statistics, Seoul National University, Shilim-Dong, Kwanak-Gu, Seoul 151-742, Korea
9
Tamba CL, Ni YL, Zhang YM. Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies. PLoS Comput Biol 2017; 13:e1005357. [PMID: 28141824 PMCID: PMC5308866 DOI: 10.1371/journal.pcbi.1005357] [Citation(s) in RCA: 136] [Impact Index Per Article: 19.4] [Received: 10/10/2016] [Revised: 02/14/2017] [Accepted: 01/09/2017] [Indexed: 12/18/2022] Open
Abstract
Genome-wide association study (GWAS) entails examining a large number of single nucleotide polymorphisms (SNPs) in a limited sample with hundreds of individuals, implying a variable selection problem in the high dimensional dataset. Although many single-locus GWAS approaches under polygenic background and population structure controls have been widely used, some significant loci fail to be detected. In this study, we used an iterative modified-sure independence screening (ISIS) approach in reducing the number of SNPs to a moderate size. Expectation-Maximization (EM)-Bayesian least absolute shrinkage and selection operator (BLASSO) was used to estimate all the selected SNP effects for true quantitative trait nucleotide (QTN) detection. This method is referred to as the ISIS EM-BLASSO algorithm. Monte Carlo simulation studies validated the new method, which has the highest empirical power in QTN detection and the highest accuracy in QTN effect estimation, and it is the fastest, as compared with efficient mixed-model association (EMMA), smoothly clipped absolute deviation (SCAD), fixed and random model circulating probability unification (FarmCPU), and multi-locus random-SNP-effect mixed linear model (mrMLM). To further demonstrate the new method, six flowering time traits in Arabidopsis thaliana were re-analyzed by four methods (the new method, EMMA, FarmCPU, and mrMLM). As a result, the new method identified most previously reported genes. Therefore, the new method is a good alternative for multi-locus GWAS.
Genome-wide association study is concerned with the associations between markers and traits of interest so as to identify all the significantly associated markers. In genome-wide association studies, hundreds of thousands of markers are genotyped for several hundreds of individuals. Usually, only a very minor subset of these markers is associated with the trait. Most penalization methods fail when the number of markers is much larger than the sample size. Based on this fact, we have developed an algorithm that proceeds in two stages. In the first stage (screening), we reduced the number of markers via correlation learning to a moderate size. We then used a moderate-scale variable selection method to select variables in the reduced model. Conditional on the selected variables, we repeated the screening procedure and chose another set of variables. In the second stage (estimation), all the above-selected variables are accurately estimated in a multi-locus model. Our approach is simple, accurate in estimation, fast and shows high statistical power of detecting relevant markers on simulated data. We have also used this method to identify relevant genes in real data analysis. We recommend our approach for conducting a multi-locus genome-wide association study.
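The screening stage described above (reducing the markers "via correlation learning") can be sketched as a single pass of plain sure independence screening: rank markers by the absolute marginal correlation with the trait and keep the top few. This is a simplified illustration under stated assumptions, not the authors' iterative modified procedure and not the EM-BLASSO estimation step; the function names, the `keep` parameter, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_markers(markers: List[Sequence[float]],
                   trait: Sequence[float],
                   keep: int) -> List[int]:
    """Return indices of the `keep` markers most correlated with the trait."""
    scores = [abs(pearson(m, trait)) for m in markers]
    ranked = sorted(range(len(markers)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical toy data: four markers (genotype codes) for six individuals.
trait = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
markers = [
    [0, 0, 1, 1, 2, 2],  # strongly positively correlated with the trait
    [1, 0, 1, 0, 1, 0],  # weakly correlated
    [2, 2, 1, 1, 0, 0],  # strongly negatively correlated
    [0, 1, 0, 0, 1, 0],  # uncorrelated
]
print(screen_markers(markers, trait, keep=2))  # [0, 2]
```

In the iterative scheme the abstract describes, a variable selection method would then be applied to the retained markers, and the screening would be repeated conditional on the selected ones before the final multi-locus estimation.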
Affiliation(s)
- Cox Lwaka Tamba
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Department of Mathematics, Egerton University, Egerton, Kenya
- Yuan-Li Ni
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Yuan-Ming Zhang
- State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
10
A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction. Comput Biol Chem 2016; 65:193-202. [PMID: 27765491 DOI: 10.1016/j.compbiolchem.2016.09.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 08/28/2016] [Accepted: 09/07/2016] [Indexed: 11/20/2022]
Abstract
BACKGROUND: Gene-gene interaction (GGI) is one of the most popular approaches for finding the missing heritability of common complex traits in genetic association studies. The multifactor dimensionality reduction (MDR) method has been widely studied for detecting GGIs. In order to identify the best interaction model associated with disease susceptibility, MDR compares all possible genotype combinations in terms of their predictability of disease status from a simple binary high (H) and low (L) risk classification. However, this simple binary classification does not reflect the uncertainty of H/L classification. METHODS: We regard classifying H/L as equivalent to defining the degree of membership of two risk groups H/L. By adopting fuzzy set theory, we propose Fuzzy MDR, which takes into account the uncertainty of H/L classification. Fuzzy MDR allows the possibility of partial membership of H/L through a membership function which transforms the degree of uncertainty into a [0,1] scale. The genotype combinations that maximize a new fuzzy set based accuracy measure can then be selected. RESULTS: Two simulation studies are conducted to compare the power of the proposed Fuzzy MDR with that of MDR. Our results show that Fuzzy MDR has higher power than MDR. We illustrate the proposed Fuzzy MDR by analysing the bipolar disorder (BD) trait of the WTCCC dataset to detect GGI associated with BD. CONCLUSIONS: We propose a novel Fuzzy MDR method to detect gene-gene interaction by taking into account the uncertainty of H/L classification and show that it has higher power than MDR. Fuzzy MDR can be easily extended to handle continuous phenotypes as well. The program written in R for the proposed Fuzzy MDR is available at https://statgen.snu.ac.kr/software/FuzzyMDR.
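The binary H/L step that the abstract attributes to MDR can be sketched for a single pair of SNPs as follows. This is an illustrative assumption-laden sketch, not the Fuzzy MDR program: the function names and toy data are hypothetical, the threshold is assumed to be the overall case:control ratio, and no fuzzy membership function or cross-validation is included.

```python
from collections import Counter
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]  # genotype codes at two SNPs


def mdr_risk_labels(genos: List[Genotype],
                    status: List[int]) -> Dict[Genotype, str]:
    """Label each genotype combination 'H' (high risk) or 'L' (low risk).

    A cell is 'H' when its case:control ratio exceeds the overall ratio
    (an assumed threshold). Cross-multiplication avoids dividing by zero
    for cells that contain no controls.
    """
    cases = Counter(g for g, s in zip(genos, status) if s == 1)
    controls = Counter(g for g, s in zip(genos, status) if s == 0)
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    return {
        g: "H" if cases[g] * n_controls > controls[g] * n_cases else "L"
        for g in set(genos)
    }


def classification_accuracy(genos: List[Genotype],
                            status: List[int],
                            labels: Dict[Genotype, str]) -> float:
    """Fraction of individuals whose H/L label matches case/control status."""
    correct = sum((labels[g] == "H") == (s == 1)
                  for g, s in zip(genos, status))
    return correct / len(status)


# Hypothetical toy data: eight individuals, status 1 = case, 0 = control.
genos = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
status = [1, 1, 0, 0, 0, 1, 0, 0]

labels = mdr_risk_labels(genos, status)
print(sorted(labels.items()))
print(classification_accuracy(genos, status, labels))  # 0.75
```

A fuzzy variant in the spirit of the abstract would replace the hard "H"/"L" label with a degree of membership in [0,1]; the specific membership function is not given in the abstract, so it is not sketched here.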