1
Liou JW, Liou M, Cheng PE. Modeling Categorical Variables by Mutual Information Decomposition. Entropy (Basel, Switzerland) 2023; 25:e25050750. [PMID: 37238505 DOI: 10.3390/e25050750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 04/24/2023] [Accepted: 04/26/2023] [Indexed: 05/28/2023]
Abstract
This paper proposed the use of mutual information (MI) decomposition as a novel approach to identifying indispensable variables and their interactions for contingency table analysis. The MI analysis identified subsets of associative variables based on multinomial distributions and validated parsimonious log-linear and logistic models. The proposed approach was assessed using two real-world datasets dealing with ischemic stroke (with 6 risk factors) and banking credit (with 21 discrete attributes in a sparse table). This paper also provided an empirical comparison of MI analysis versus two state-of-the-art methods in terms of variable and model selections. The proposed MI analysis scheme can be used in the construction of parsimonious log-linear and logistic models with a concise interpretation of discrete multivariate data.
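The mutual information that this analysis decomposes can be illustrated for the simplest case of a two-way contingency table. The sketch below computes only that basic quantity, not the authors' decomposition or model-selection scheme; the function name `mutual_information` and the example tables are hypothetical.

```python
import math


def mutual_information(table):
    """Mutual information (in nats) between the row and column
    variables of a two-way table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += (count / total) * math.log(ratio)
    return mi


# Hypothetical tables: one with independent margins, one fully dependent.
independent = [[10, 20], [20, 40]]
dependent = [[30, 0], [0, 30]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For a two-way table with total count N, 2N times this value (in nats) equals the likelihood-ratio statistic for independence, which is one way MI relates to log-linear modeling.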
Affiliation(s)
- Jiun-Wei Liou
  - Department of Electrical Engineering, Ming Chi University of Technology, New Taipei City 243, Taiwan
- Michelle Liou
  - Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan
- Philip E Cheng
  - Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan
2
Di X, Yin Y, Fu Y, Mo Z, Lo SH, DiGuiseppi C, Eby DW, Hill L, Mielenz TJ, Strogatz D, Kim M, Li G. Detecting mild cognitive impairment and dementia in older adults using naturalistic driving data and interaction-based classification from influence score. Artif Intell Med 2023; 138:102510. [PMID: 36990588 DOI: 10.1016/j.artmed.2023.102510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 02/04/2023] [Accepted: 02/09/2023] [Indexed: 02/22/2023]
Abstract
Several recent studies indicate that atypical changes in driving behaviors appear to be early signs of mild cognitive impairment (MCI) and dementia. These studies, however, are limited by small sample sizes and short follow-up duration. This study aims to develop an interaction-based classification method building on a statistic named Influence Score (i.e., I-score) for prediction of MCI and dementia using naturalistic driving data collected from the Longitudinal Research on Aging Drivers (LongROAD) project. Naturalistic driving trajectories were collected through in-vehicle recording devices for up to 44 months from 2977 participants who were cognitively intact at the time of enrollment. These data were further processed and aggregated to generate 31 time-series driving variables. Because of high dimensional time-series features for driving variables, we used I-score for variable selection. I-score is a measure to evaluate variables' ability to predict and is proven to be effective in differentiating between noisy and predictive variables in big data. It is introduced here to select influential variable modules or groups that account for compound interactions among explanatory variables. It is explainable regarding to what extent variables and their interactions contribute to the predictiveness of a classifier. In addition, I-score boosts the performance of classifiers over imbalanced datasets due to its association with the F1 score. Using predictive variables selected by I-score, interaction-based residual blocks are constructed over top I-score modules to generate predictors and ensemble learning aggregates these predictors to boost the prediction of the overall classifier. Experiments using naturalistic driving data show that our proposed classification method achieves the best accuracy (96%) for predicting MCI and dementia, followed by random forest (93%) and logistic regression (88%). 
In terms of F1 score and AUC, our proposed classifier achieves 98% and 87%, respectively, followed by random forest (with an F1 score of 96% and an AUC of 79%) and logistic regression (with an F1 score of 92% and an AUC of 77%). The results indicate that incorporating I-score into machine learning algorithms could considerably improve the model performance for predicting MCI and dementia in older drivers. We also performed the feature importance analysis and found that the right to left turn ratio and the number of hard braking events are the most important driving variables to predict MCI and dementia.
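Accuracy and F1 score, both reported above, can diverge sharply on imbalanced data. A minimal sketch with hypothetical confusion-matrix counts (not taken from the study) shows the arithmetic; the function name `accuracy_and_f1` is an assumption for illustration.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced sample: 95 negatives, 5 positives, and a
# classifier that labels every case negative.
acc, f1 = accuracy_and_f1(tp=0, fp=0, fn=5, tn=95)
```

Here the accuracy is high while the F1 score is zero, because no positive case is ever detected.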
3
Chu X, Jiang M, Liu ZJ. Biomarker interaction selection and disease detection based on multivariate gain ratio. BMC Bioinformatics 2022; 23:176. [PMID: 35550010 PMCID: PMC9103137 DOI: 10.1186/s12859-022-04699-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 04/14/2022] [Indexed: 11/30/2022] Open
Abstract
Background: Disease detection is an important aspect of biotherapy. With the development of biotechnology and computer technology, there are many methods to detect disease based on a single biomarker. However, in some cases a biomarker does not influence disease alone; it is the interaction between biomarkers that determines disease status. The existing influence measure I-score is used to evaluate the importance of an interaction in determining disease status, but applying the I-score introduces a deviation related to the number of variables in the interaction. To solve this problem, we propose a new influence measure, Multivariate Gain Ratio (MGR), based on the single-variate Gain Ratio (GR), which handles multivariate combinations of biomarkers, called interactions. Results: We propose a preprocessing verification algorithm based on partial predictor variables to select an appropriate preprocessing method. We also provide an algorithm for selecting key interactions of biomarkers and applying these key interactions to construct a disease detection model. MGR is more credible than I-score when an interaction contains a small number of variables. On the Breast Cancer Wisconsin (Diagnostic) Dataset, our method achieves an average accuracy of 93.13%, compared with 91.73% for I-score. Compared with the 89.80% classification accuracy based on all predictor variables, MGR identifies the true main biomarkers and achieves dimension reduction. On the Leukemia Dataset, the experimental results show the effectiveness of MGR, with an accuracy of 97.32% compared with 89.11% for I-score. These results can be explained by the nature of MGR and I-score mentioned above, because every key interaction in the Leukemia Dataset contains a small number of variables. Conclusions: MGR is effective for selecting important biomarkers and biomarker interactions even in a high-dimensional feature space in which an interaction could contain more than two biomarkers. The predictive ability of interactions selected by MGR is better than that of I-score when an interaction contains a small number of variables. MGR is generally applicable to various types of biomarker datasets, including cell nuclei, gene, SNP and protein datasets.
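The single-variate Gain Ratio that MGR builds on can be sketched as information gain divided by split information. The abstract does not give the MGR formula, so `joint_gain_ratio` below is only an assumed illustration that treats each combination of variable values as one composite variable; all function names and the XOR toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Single-variate gain ratio: information gain / split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def joint_gain_ratio(features, labels):
    """ASSUMPTION, not the paper's MGR definition: score several variables
    together by treating each tuple of their values as one composite value."""
    return gain_ratio(list(zip(*features)), labels)


# Toy XOR data: neither variable is informative alone, the pair is.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]

single_score = gain_ratio(x1, y)
pair_score = joint_gain_ratio([x1, x2], y)
```

On this toy data the single variable scores zero while the pair scores above zero, which is the kind of pure interaction effect that a single-biomarker measure would miss.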
Affiliation(s)
- Xiao Chu
  - Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mao Jiang
  - Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zhuo-Jun Liu
  - Academy of Mathematics and Systems Science Chinese Academy of Sciences, Beijing, China
4
Epistasis Detection via the Joint Cumulant. Statistics in Biosciences 2022. [DOI: 10.1007/s12561-022-09336-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
5
Language Semantics Interpretation with an Interaction-Based Recurrent Neural Network. Machine Learning and Knowledge Extraction 2021. [DOI: 10.3390/make3040046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Text classification is a fundamental language task in Natural Language Processing. A variety of sequential models are capable of making good predictions, yet there is a lack of connection between language semantics and prediction results. This paper proposes a novel influence score (I-score), a greedy search algorithm called the Backward Dropping Algorithm (BDA), and a novel feature engineering technique called the “dagger technique”. First, the paper proposes to use the I-score to detect and search for the important language semantics in text documents that are useful for making good predictions in text classification tasks. Next, the Backward Dropping Algorithm is proposed to handle long-term dependencies in the dataset. Moreover, the “dagger technique” fully preserves the relationship between the explanatory variable and the response variable. The proposed techniques can be further generalized to feed-forward Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), and any other neural network. In a real-world application on the Internet Movie Database (IMDB), the proposed methods improve prediction performance, with an 81% error reduction compared to popular peer models that do not implement the I-score and the “dagger technique”.
6
An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images. Algorithms 2021. [DOI: 10.3390/a14110337] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/28/2022]
Abstract
The field of explainable artificial intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional neural networks (CNNs) have been successful in making predictions, especially in image classification. These popular and well-documented successes use extremely deep CNNs such as VGG16, DenseNet121, and Xception. However, these well-known deep learning models use tens of millions of parameters based on a large number of pretrained filters that have been repurposed from previous data sets. Among these identified filters, a large portion contain no information yet remain as input features. Thus far, there is no effective method to omit these noisy features from a data set, and their existence negatively impacts prediction performance. In this paper, a novel interaction-based convolutional neural network (ICNN) is introduced that does not make assumptions about the relevance of local information. Instead, a model-free influence score (I-score) is proposed to directly extract the influential information from images to form important variable modules. This innovative technique replaces all pretrained filters found by trial-and-error with explainable, influential, and predictive variable sets (modules) determined by the I-score. In other words, future researchers need not rely on pretrained filters; the suggested algorithm identifies only the variables or pixels with high I-score values that are extremely predictive and important. The proposed method and algorithm were tested on a real-world data set, and a state-of-the-art prediction performance of 99.8% was achieved without sacrificing the explanatory power of the model. This proposed design can efficiently screen patients infected by COVID-19 before human diagnosis and can be a benchmark for addressing future XAI problems in large-scale data sets.
7
Hung H, Huang SY. Sufficient dimension reduction via random-partitions for the large-p-small-n problem. Biometrics 2018; 75:245-255. [PMID: 30052272 DOI: 10.1111/biom.12926] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/01/2017] [Revised: 05/01/2018] [Accepted: 05/01/2018] [Indexed: 11/30/2022]
Abstract
Sufficient dimension reduction (SDR) continues to be an active field of research. When estimating the central subspace (CS), inverse regression based SDR methods involve solving a generalized eigenvalue problem, which can be problematic under the large-p-small-n situation. In recent years, new techniques have emerged in numerical linear algebra, called randomized algorithms or random sketching, for high-dimensional and large scale problems. To overcome the large-p-small-n SDR problem, we combine the idea of statistical inference with random sketching to propose a new SDR method, called integrated random-partition SDR (iRP-SDR). Our method consists of the following three steps: (i) Randomly partition the covariates into subsets to construct an envelope subspace with low dimension. (ii) Obtain a sketch of the CS by applying a conventional SDR method within the constructed envelope subspace. (iii) Repeat the above two steps many times and integrate these multiple sketches to form the final estimate of the CS. After describing the details of these steps, the asymptotic properties of iRP-SDR are established. Unlike existing methods, iRP-SDR does not involve the determination of the structural dimension until the last stage, which makes it more adaptive to a high-dimensional setting. The advantageous performance of iRP-SDR is demonstrated via simulation studies and a practical example analyzing EEG data.
Affiliation(s)
- Hung Hung
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taiwan
- Su-Yun Huang
  - Institute of Statistical Science, Academia Sinica, Taiwan
8
Crawford L, Zeng P, Mukherjee S, Zhou X. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet 2017; 13:e1006869. [PMID: 28746338 PMCID: PMC5550000 DOI: 10.1371/journal.pgen.1006869] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 08/01/2016] [Revised: 08/09/2017] [Accepted: 06/15/2017] [Indexed: 12/13/2022] Open
Abstract
Epistasis, commonly defined as the interaction between multiple genes, is an important genetic component underlying phenotypic variation. Many statistical methods have been developed to model and identify epistatic interactions between genetic variants. However, because of the large combinatorial search space of interactions, most epistasis mapping methods face enormous computational challenges and often suffer from low statistical power due to multiple test correction. Here, we present a novel, alternative strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, we focus on mapping variants that have non-zero marginal epistatic effects, that is, the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, we can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. Our method is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation. We refer to our method as the "MArginal ePIstasis Test", or MAPIT. With simulations, we show how MAPIT can be used to estimate and test marginal epistatic effects, produce calibrated test statistics under the null, and facilitate the detection of pairwise epistatic interactions. We further illustrate the benefits of MAPIT in a QTL mapping study by analyzing the gene expression data of over 400 individuals from the GEUVADIS consortium.
Affiliation(s)
- Lorin Crawford
  - Department of Biostatistics, Brown University, Providence, Rhode Island, United States of America
  - Center for Statistical Sciences, Brown University, Providence, Rhode Island, United States of America
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Ping Zeng
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Sayan Mukherjee
  - Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
  - Department of Mathematics, Duke University, Durham, North Carolina, United States of America
  - Department of Bioinformatics & Biostatistics, Duke University, Durham, North Carolina, United States of America
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
9
Framework for making better predictions by directly estimating variables' predictivity. Proc Natl Acad Sci U S A 2016; 113:14277-14282. [PMID: 27911830 DOI: 10.1073/pnas.1616647113] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022] Open
Abstract
We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for, preferably high, predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample size and its sensitivity to noisy useless variables. We demonstrate that the I-score of the partition retention (PR) method of variable selection (VS) yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the I-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the I-score on real data to demonstrate the statistic's predictive performance on sample data. We conjecture that using the partition retention and I-score can aid in finding variable sets with promising prediction rates; however, further research in the avenue of sample-based measures of predictivity is much desired.
10
Wang MH, Sun R, Guo J, Weng H, Lee J, Hu I, Sham PC, Zee BCY. A fast and powerful W-test for pairwise epistasis testing. Nucleic Acids Res 2016; 44:e115. [PMID: 27112568 PMCID: PMC4937324 DOI: 10.1093/nar/gkw347] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/08/2015] [Revised: 04/14/2016] [Accepted: 04/15/2016] [Indexed: 01/08/2023] Open
Abstract
Epistasis plays an essential role in the development of complex diseases. Interaction methods face the common challenge of seeking a balance between persistent power, model complexity, computation efficiency, and validity of identified bio-markers. We introduce a novel W-test to identify pairwise epistasis effects, which measures the distributional difference between cases and controls through a combined log odds ratio. The test is model-free, fast, and inherits a Chi-squared distribution with data-adaptive degrees of freedom. No permutation is needed to obtain the P-values. Simulation studies demonstrated that the W-test is more powerful in low frequency variant environments than the alternative methods considered, namely the Chi-squared test, logistic regression and multifactor-dimensionality reduction (MDR). In two independent real bipolar disorder genome-wide association study (GWAS) datasets, the W-test identified significant interaction pairs that can be replicated, including SLIT3-CENPN, SLIT3-TMEM132D, CNTNAP2-NDST4 and CNTCAP2-RTN4R. The genes in the pairs play central roles in neurotransmission and synapse formation. A majority of the identified loci are undiscoverable by main effect and are low frequency variants. The proposed method offers a powerful alternative tool for mapping the genetic puzzle underlying complex disorders.
Affiliation(s)
- Maggie Haitian Wang
  - Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
  - CUHK Shenzhen Research Institute, Shenzhen, China
- Rui Sun
  - Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
  - CUHK Shenzhen Research Institute, Shenzhen, China
- Junfeng Guo
  - The Australian National University, Canberra, Australia
- Haoyi Weng
  - Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
  - CUHK Shenzhen Research Institute, Shenzhen, China
- Jack Lee
  - Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
- Inchi Hu
  - ISOM Department and Biomedical Engineering Division, the Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR, China
- Pak Chung Sham
  - Department of Psychiatry; Centre for Genomic Sciences, the University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
- Benny Chung-Ying Zee
  - Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
  - CUHK Shenzhen Research Institute, Shenzhen, China
11
Abstract
Thus far, genome-wide association studies (GWAS) have been disappointing in that investigators have been unable to use the identified, statistically significant variants in complex diseases to make predictions useful for personalized medicine. Why are significant variables not leading to good prediction of outcomes? We point out that this problem is prevalent in simple as well as complex data, in the sciences as well as the social sciences. We offer a brief explanation and some statistical insights on why higher significance cannot automatically imply stronger predictivity and illustrate through simulations and a real breast cancer example. We also demonstrate that highly predictive variables do not necessarily appear as highly significant, thus evading the researcher using significance-based methods. We point out that what makes variables good for prediction versus significance depends on different properties of the underlying distributions. If prediction is the goal, we must lay aside significance as the only selection standard. We suggest that progress in prediction requires efforts toward a new research agenda of searching for a novel criterion to retrieve highly predictive variables rather than highly significant variables. We offer an alternative approach that was not designed for significance, the partition retention method, which was very effective in predicting on a long-studied breast cancer data set, reducing the classification error rate from 30% to 8%.
12
Satten GA, Biswas S, Papachristou C, Turkmen A, König IR. Population-based association and gene by environment interactions in Genetic Analysis Workshop 18. Genet Epidemiol 2014; 38 Suppl 1:S49-56. [PMID: 25112188 DOI: 10.1002/gepi.21825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
In the past decade, genome-wide association studies have been successful in identifying genetic loci that play a role in many complex diseases. Despite this, it has become clear that for many traits, investigation of single common variants does not give a complete picture of the genetic contribution to the phenotype. Therefore a number of new approaches are currently being investigated to further the search for susceptibility loci or regions. We summarize the contributions to Genetic Analysis Workshop 18 (GAW18) that concern this search using methods for population-based association analysis. Many of the members of our GAW18 working group made use of data types that have only recently become available through the use of next-generation sequencing technologies, with many focusing on the investigation of rare variants instead of or in combination with common variants. Some contributors used a haplotype-based approach, which to date has been used relatively infrequently but may become more important for analyzing rare variant association data. Others analyzed gene-gene or gene-environment interactions, where novel statistical approaches were needed to make the best use of the available information without requiring an excessive computational burden. GAW18 provided participants with the chance to make use of state-of-the-art data, statistical techniques, and technology. We report here some of the experiences and conclusions that were reached by workshop participants who analyzed the GAW18 data as a population-based association study.
Affiliation(s)
- Glen A Satten
  - Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
13
Liu Y, Huang C, Hu I, Lo SH, Zheng T. A dual-clustering framework for association screening with whole genome sequencing data and longitudinal traits. BMC Proc 2014; 8:S47. [PMID: 25519328 PMCID: PMC4143709 DOI: 10.1186/1753-6561-8-s1-s47] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022] Open
Abstract
Current sequencing technology enables generation of whole genome sequencing data sets that contain a high density of rare variants, each of which is carried by, at most, 5% of the sampled subjects. Such variants are involved in the etiology of most common diseases in humans. These diseases can be studied by relevant longitudinal phenotype traits. Tests for association between such genotype information and longitudinal traits allow the study of the function of rare variants in complex human disorders. In this paper, we propose an association-screening framework that highlights the genotypic differences observed on rare variants and the longitudinal nature of phenotypes. In particular, both variants within a gene and longitudinal phenotypes are used to create partitions of subjects. Association between the 2 sets of constructed partitions is then evaluated. We apply the proposed strategy to the simulated data from the Genetic Analysis Workshop 18 and compare the obtained results with those from sequence kernel association test using the receiver operating characteristic curves.
Affiliation(s)
- Ying Liu
  - Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, USA
- ChienHsun Huang
  - Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, USA
- Inchi Hu
  - ISOM, Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Shaw-Hwa Lo
  - Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, USA
- Tian Zheng
  - Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, USA
14
Agne M, Huang CH, Hu I, Wang H, Zheng T, Lo SH. Considering interactive effects in the identification of influential regions with extremely rare variants via fixed bin approach. BMC Proc 2014; 8:S7. [PMID: 25519400 PMCID: PMC4143804 DOI: 10.1186/1753-6561-8-s1-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
In this study, we analyze the Genetic Analysis Workshop 18 (GAW18) data to identify regions of single-nucleotide polymorphisms (SNPs), which significantly influence hypertension status among individuals. We have studied the marginal impact of these regions on disease status in the past, but we extend the method to deal with environmental factors present in data collected over several exam periods. We consider the respective interactions between such traits as smoking status and age with the genetic information and hope to augment those genetic regions deemed influential marginally with those that contribute via an interactive effect. In particular, we focus only on rare variants and apply a procedure to combine signal among rare variants in a number of "fixed bins" along the chromosome. We extend the procedure in Agne et al [1] to incorporate environmental factors by dichotomizing subjects via traits such as smoking status and age, running the marginal procedure among each respective category (i.e., smokers or nonsmokers), and then combining their scores into a score for interaction. To avoid overlap of subjects, we examine each exam period individually. Out of a possible 629 fixed-bin regions in chromosome 3, we observe that 11 show up in multiple exam periods for gene-smoking score. Fifteen regions exhibit significance for multiple exam periods for gene-age score, with 4 regions deemed significant for all 3 exam periods. The procedure pinpoints SNPs in 8 "answer" genes, with 5 of these showing up as significant in multiple testing schemes (Gene-Smoking, Gene-Age for Exams 1, 2, and 3).
Affiliation(s)
- Michael Agne
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Chien-Hsun Huang
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Inchi Hu
- Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology Business School, Hong Kong
- Haitian Wang
- Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Hong Kong
- Tian Zheng
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
|
15
|
Wang MH, Huang CH, Zheng T, Lo SH, Hu I. Discovering pure gene-environment interactions in blood pressure genome-wide association studies data: a two-step approach incorporating new statistics. BMC Proc 2014; 8:S62. [PMID: 25519396 PMCID: PMC4143689 DOI: 10.1186/1753-6561-8-s1-s62] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022] Open
Abstract
Environment has long been known to play an important part in disease etiology. However, not many genome-wide association studies take environmental factors into consideration. There is also a need for new methods to identify the gene-environment interactions. In this study, we propose a 2-step approach incorporating an influence measure that captures pure gene-environment effect. We found that pure gene-age interaction has a stronger association than considering the genetic effect alone for systolic blood pressure, measured by counting the number of single-nucleotide polymorphisms (SNPs) reaching a certain significance level. We analyzed the subjects by dividing them into two age groups and found no overlap in the top identified SNPs between them. This suggested that age might have a nonlinear effect on genetic association. Furthermore, the scores of the top SNPs for the two age subgroups were about 3 times those obtained when using all subjects for systolic blood pressure. In addition, the scores of the older age subgroup were much higher than those for the younger group. The results suggest that genetic effects are stronger in older age and that genetic association studies should take environmental effects into consideration, especially age.
Affiliation(s)
- Maggie Haitian Wang
- Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR
- Chien-Hsun Huang
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Tian Zheng
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Inchi Hu
- Department of ISOM, the Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong SAR
|
16
|
Fan R, Huang CH, Hu I, Wang H, Zheng T, Lo SH. A partition-based approach to identify gene-environment interactions in genome wide association studies. BMC Proc 2014; 8:S60. [PMID: 25519395 PMCID: PMC4143762 DOI: 10.1186/1753-6561-8-s1-s60] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene and environmental interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real data set of hypertension provided by Genetic Analysis Workshop 18. Compared with the linear regression model, the proposed approach is able to identify many additional variants with significant gene-environmental interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environmental interaction effect is, indeed, nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.
Affiliation(s)
- Ruixue Fan
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, NY 10027, USA
- Chien-Hsun Huang
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, NY 10027, USA
- Inchi Hu
- ISOM, Hong Kong University of Science and Technology, Hong Kong
- Haitian Wang
- Division of Biostatistics, School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong
- Tian Zheng
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, NY 10027, USA
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, NY 10027, USA
|
17
|
Hwang JS, Hu TH. A stepwise regression algorithm for high-dimensional variable selection. J STAT COMPUT SIM 2014. [DOI: 10.1080/00949655.2014.902460] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 10/25/2022]
|
18
|
Fan R, Lo SH. A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions. PLoS One 2013; 8:e83057. [PMID: 24358248 PMCID: PMC3866272 DOI: 10.1371/journal.pone.0083057] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/24/2013] [Accepted: 10/30/2013] [Indexed: 11/19/2022] Open
Abstract
Recently, more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Second, it does not sacrifice the marginal detection power; when rare variants only have marginal effects, it is comparable with the most competitive method in the current literature. Third, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.
Affiliation(s)
- Ruixue Fan
- Department of Statistics, Columbia University, New York, New York, United States of America
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, New York, New York, United States of America
- * E-mail: (SHL)
|
19
|
Wang H, Lo SH, Zheng T, Hu I. Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 2012; 28:2834-42. [PMID: 22945786 PMCID: PMC3577111 DOI: 10.1093/bioinformatics/bts531] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 05/01/2012] [Revised: 08/20/2012] [Accepted: 08/22/2012] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Epistasis, or gene-gene interaction, has gained increasing attention in studies of complex diseases. Its presence as a ubiquitous component of the genetic architecture of common human diseases has been contemplated. However, the detection of gene-gene interaction is difficult due to combinatorial explosion. RESULTS: We present a novel feature selection method incorporating variable interaction. Three gene expression datasets are analyzed to illustrate our method, although it can also be applied to other types of high-dimensional data. The quality of variables selected is evaluated in two ways: first by classification error rates, then by functional relevance assessed using biological knowledge. We show that the classification error rates can be significantly reduced by considering interactions. Secondly, a sizable portion of genes identified by our method for breast cancer metastasis overlaps with those reported in the gene-to-system breast cancer (G2SBC) database as disease associated, and some of them have interesting biological implications. In summary, interaction-based methods may lead to substantial gain in biological insights as well as more accurate prediction.
Affiliation(s)
- Haitian Wang
- Department of ISOM, HKUST, Clear Water Bay, Kowloon, Hong Kong
|
20
|
Papathomas M, Molitor J, Hoggart C, Hastie D, Richardson S. Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene × gene patterns. Genet Epidemiol 2012; 36:663-74. [PMID: 22851500 DOI: 10.1002/gepi.21661] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 01/12/2012] [Revised: 05/16/2012] [Accepted: 06/08/2012] [Indexed: 11/09/2022]
Abstract
We construct data exploration tools for recognizing important covariate patterns associated with a phenotype, with particular focus on searching for association with gene-gene patterns. To this end, we propose a new variable selection procedure that employs latent selection weights and compare it to an alternative formulation. The selection procedures are implemented in tandem with a Dirichlet process mixture model for the flexible clustering of genetic and epidemiological profiles. We illustrate our approach with the aid of simulated data and the analysis of a real data set from a genome-wide association study.
Affiliation(s)
- Michail Papathomas
- School of Mathematics and Statistics, University of St Andrews, Scotland, United Kingdom.
|
21
|
|
22
|
Liu Y, Huang CH, Hu I, Lo SH, Zheng T. Association screening for genes with multiple potentially rare variants: an inverse-probability weighted clustering approach. BMC Proc 2011; 5 Suppl 9:S106. [PMID: 22373536 PMCID: PMC3287829 DOI: 10.1186/1753-6561-5-s9-s106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which affects at most 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotype trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.
Affiliation(s)
- Ying Liu
- Department of Statistics, Columbia University, New York, NY 10027, USA.
|
23
|
Stepwise Paring down Variation for Identifying Influential Multi-factor Interactions Related to a Continuous Response Variable. STATISTICS IN BIOSCIENCES 2011. [DOI: 10.1007/s12561-011-9045-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/15/2022]
|
24
|
Bailey-Wilson JE, Brennan JS, Bull SB, Culverhouse R, Kim Y, Jiang Y, Jung J, Li Q, Lamina C, Liu Y, Mägi R, Niu YS, Simpson CL, Wang L, Yilmaz YE, Zhang H, Zhang Z. Regression and data mining methods for analyses of multiple rare variants in the Genetic Analysis Workshop 17 mini-exome data. Genet Epidemiol 2011; 35 Suppl 1:S92-100. [PMID: 22128066 PMCID: PMC3360949 DOI: 10.1002/gepi.20657] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Group 14 of Genetic Analysis Workshop 17 examined several issues related to analysis of complex traits using DNA sequence data. These issues included novel methods for analyzing rare genetic variants in an aggregated manner (often termed collapsing rare variants), evaluation of various study designs to increase power to detect effects of rare variants, and the use of machine learning approaches to model highly complex heterogeneous traits. Various published and novel methods for analyzing traits with extreme locus and allelic heterogeneity were applied to the simulated quantitative and disease phenotypes. Overall, we conclude that power is (as expected) dependent on locus-specific heritability or contribution to disease risk, large samples will be required to detect rare causal variants with small effect sizes, extreme phenotype sampling designs may increase power for smaller laboratory costs, methods that allow joint analysis of multiple variants per gene or pathway are more powerful in general than analyses of individual rare variants, population-specific analyses can be optimal when different subpopulations harbor private causal mutations, and machine learning methods may be useful for selecting subsets of predictors for follow-up in the presence of extreme locus heterogeneity and large numbers of potential predictors.
Affiliation(s)
- Joan E Bailey-Wilson
- Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD 21224, USA.
|
25
|
Zhang Y, Jiang B, Zhu J, Liu JS. Bayesian models for detecting epistatic interactions from genetic data. Ann Hum Genet 2010; 75:183-93. [PMID: 21091453 DOI: 10.1111/j.1469-1809.2010.00621.x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 01/02/2023]
Abstract
Current disease association studies are routinely conducted on a genome-wide scale, testing hundreds of thousands or millions of genetic markers. Besides detecting marginal associations of individual markers with the disease, it is also of interest to identify gene-gene and gene-environment interactions, which confer susceptibility to the disease risk. The astronomical number of possible combinations of markers and environmental factors, however, makes interaction mapping a daunting task both computationally and statistically. In this paper, we review and discuss a set of Bayesian partition methods developed recently for mapping single-nucleotide polymorphisms in case-control studies, their extension to quantitative traits, and further generalization to multiple traits. We use simulation and real data sets to demonstrate the performance of these methods, and we compare them with some existing interaction mapping algorithms. With the recent advance in high-throughput sequencing technologies, genome-wide measurements of epigenetic factor enrichment, structural variations, and transcription activities become available at the individual level. The tsunami of data creates more challenges for gene-gene interaction mapping, but at the same time provides new opportunities that, if utilized properly through sophisticated statistical means, can improve the power of mapping interactions at the genome scale.
Affiliation(s)
- Yu Zhang
- Department of Statistics, Penn State University, University Park, PA, USA
|
26
|
Chernoff H, Lo SH, Zheng T. Discovering influential variables: A method of partitions. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas265] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 11/19/2022]
|