1
Kwan B, Fuhrer T, Montemayor D, Fink JC, He J, Hsu CY, Messer K, Nelson RG, Pu M, Ricardo AC, Rincon-Choles H, Shah VO, Ye H, Zhang J, Sharma K, Natarajan L. A generalized covariate-adjusted top-scoring pair algorithm with applications to diabetic kidney disease stage classification in the Chronic Renal Insufficiency Cohort (CRIC) Study. BMC Bioinformatics 2023; 24:57. [PMID: 36803209 PMCID: PMC9942303 DOI: 10.1186/s12859-023-05171-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Accepted: 02/02/2023] [Indexed: 02/22/2023] Open
Abstract
BACKGROUND The growing amount of high dimensional biomolecular data has spawned new statistical and computational models for risk prediction and disease classification. Yet, many of these methods do not yield biologically interpretable models, despite offering high classification accuracy. An exception, the top-scoring pair (TSP) algorithm derives parameter-free, biologically interpretable single pair decision rules that are accurate and robust in disease classification. However, standard TSP methods do not accommodate covariates that could heavily influence feature selection for the top-scoring pair. Herein, we propose a covariate-adjusted TSP method, which uses residuals from a regression of features on the covariates for identifying top scoring pairs. We conduct simulations and a data application to investigate our method, and compare it to existing classifiers, LASSO and random forests. RESULTS Our simulations found that features that were highly correlated with clinical variables had high likelihood of being selected as top scoring pairs in the standard TSP setting. However, through residualization, our covariate-adjusted TSP was able to identify new top scoring pairs, that were largely uncorrelated with clinical variables. In the data application, using patients with diabetes (n = 977) selected for metabolomic profiling in the Chronic Renal Insufficiency Cohort (CRIC) study, the standard TSP algorithm identified (valine-betaine, dimethyl-arg) as the top-scoring metabolite pair for classifying diabetic kidney disease (DKD) severity, whereas the covariate-adjusted TSP method identified the pair (pipazethate, octaethylene glycol) as top-scoring. Valine-betaine and dimethyl-arg had, respectively, ≥ 0.4 absolute correlation with urine albumin and serum creatinine, known prognosticators of DKD. 
Thus without covariate-adjustment the top-scoring pair largely reflected known markers of disease severity, whereas covariate-adjusted TSP uncovered features liberated from confounding, and identified independent prognostic markers of DKD severity. Furthermore, TSP-based methods achieved competitive classification accuracy in DKD to LASSO and random forests, while providing more parsimonious models. CONCLUSIONS We extended TSP-based methods to account for covariates, via a simple, easy to implement residualizing process. Our covariate-adjusted TSP method identified metabolite features, uncorrelated from clinical covariates, that discriminate DKD severity stage based on the relative ordering between two features, and thus provide insights into future studies on the order reversals in early vs advanced disease states.
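The residualize-then-rank idea described in this abstract can be illustrated with a minimal sketch. It assumes a single covariate, a simple least-squares fit, and the classic TSP score |P(a < b | class 0) - P(a < b | class 1)|; the function names and toy data are hypothetical and are not taken from the paper, which may use a different regression model and several covariates.

```python
from itertools import combinations


def residualize(feature, covariate):
    """Residuals from a simple least-squares regression of feature on one covariate."""
    n = len(feature)
    mean_x = sum(covariate) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, feature))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, feature)]


def tsp_score(a, b, labels):
    """Classic TSP score: |P(a < b | class 0) - P(a < b | class 1)|."""
    rates = []
    for cls in (0, 1):
        idx = [k for k, lab in enumerate(labels) if lab == cls]
        rates.append(sum(a[k] < b[k] for k in idx) / len(idx))
    return abs(rates[0] - rates[1])


def top_scoring_pair(features, labels):
    """Return (score, name_i, name_j) for the best-scoring pair of named columns."""
    return max(
        (tsp_score(features[i], features[j], labels), i, j)
        for i, j in combinations(sorted(features), 2)
    )


# Toy data (hypothetical): m1 simply tracks the clinical covariate.
labels = [0, 0, 0, 1, 1, 1]
covariate = [1, 2, 3, 4, 5, 6]
features = {
    "m1": [1, 2, 3, 4, 5, 6],
    "m2": [3.5] * 6,
    "m3": [0.3, 0.1, 0.4, 0.2, 0.5, 0.1],
}

# Standard TSP on raw features, then TSP on covariate-adjusted residuals.
adjusted = {name: residualize(col, covariate) for name, col in features.items()}
print("raw top pair:", top_scoring_pair(features, labels))
print("adjusted score for (m1, m2):", tsp_score(adjusted["m1"], adjusted["m2"], labels))
print("adjusted top pair:", top_scoring_pair(adjusted, labels))
```

In this toy setting the raw pair (m1, m2) separates the classes only because m1 mirrors the covariate; once both columns are residualized, the ordering between them no longer differs by class.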
Grants
- R01 DK110541 NIDDK NIH HHS
- U24 DK060990 NIDDK NIH HHS
- R01DK118736, 1R01DK110541-01A1, U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK060980, U01DK060963, U01DK060902, U24DK060990 NIDDK NIH HHS
- National Science Foundation Graduate Research Fellowship Program
- Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases
- National Institute of Diabetes and Digestive and Kidney Diseases
Affiliation(s)
- Brian Kwan: Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA; Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Tobias Fuhrer: Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Daniel Montemayor: Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA; Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jeffery C Fink: Department of Medicine, University of Maryland, Baltimore School of Medicine, Baltimore, MD, USA
- Jiang He: Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine and Tulane University Translational Science Institute, New Orleans, LA, USA
- Chi-Yuan Hsu: Division of Nephrology, University of California, San Francisco School of Medicine, San Francisco, CA, USA
- Karen Messer: Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA; Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Robert G Nelson: Chronic Kidney Disease Section, National Institute of Diabetes and Digestive and Kidney Diseases, Phoenix, AZ, USA
- Minya Pu: Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Ana C Ricardo: Department of Medicine, University of Illinois, Chicago, IL, USA
- Hernan Rincon-Choles: Department of Nephrology, Glickman Urological and Kidney Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
- Vallabh O Shah: University of New Mexico Health Sciences Center, Albuquerque, NM, USA
- Hongping Ye: Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA; Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jing Zhang: Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Kumar Sharma: Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA; Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Loki Natarajan: Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA; Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
2
Di X, Yin Y, Fu Y, Mo Z, Lo SH, DiGuiseppi C, Eby DW, Hill L, Mielenz TJ, Strogatz D, Kim M, Li G. Detecting mild cognitive impairment and dementia in older adults using naturalistic driving data and interaction-based classification from influence score. Artif Intell Med 2023; 138:102510. [PMID: 36990588 DOI: 10.1016/j.artmed.2023.102510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 02/04/2023] [Accepted: 02/09/2023] [Indexed: 02/22/2023]
Abstract
Several recent studies indicate that atypical changes in driving behaviors appear to be early signs of mild cognitive impairment (MCI) and dementia. These studies, however, are limited by small sample sizes and short follow-up duration. This study aims to develop an interaction-based classification method building on a statistic named Influence Score (i.e., I-score) for prediction of MCI and dementia using naturalistic driving data collected from the Longitudinal Research on Aging Drivers (LongROAD) project. Naturalistic driving trajectories were collected through in-vehicle recording devices for up to 44 months from 2977 participants who were cognitively intact at the time of enrollment. These data were further processed and aggregated to generate 31 time-series driving variables. Because of high dimensional time-series features for driving variables, we used I-score for variable selection. I-score is a measure to evaluate variables' ability to predict and is proven to be effective in differentiating between noisy and predictive variables in big data. It is introduced here to select influential variable modules or groups that account for compound interactions among explanatory variables. It is explainable regarding to what extent variables and their interactions contribute to the predictiveness of a classifier. In addition, I-score boosts the performance of classifiers over imbalanced datasets due to its association with the F1 score. Using predictive variables selected by I-score, interaction-based residual blocks are constructed over top I-score modules to generate predictors and ensemble learning aggregates these predictors to boost the prediction of the overall classifier. Experiments using naturalistic driving data show that our proposed classification method achieves the best accuracy (96%) for predicting MCI and dementia, followed by random forest (93%) and logistic regression (88%). 
In terms of F1 score and AUC, our proposed classifier achieves 98% and 87%, respectively, followed by random forest (with an F1 score of 96% and an AUC of 79%) and logistic regression (with an F1 score of 92% and an AUC of 77%). The results indicate that incorporating I-score into machine learning algorithms could considerably improve the model performance for predicting MCI and dementia in older drivers. We also performed the feature importance analysis and found that the right to left turn ratio and the number of hard braking events are the most important driving variables to predict MCI and dementia.
3
Signol F, Arnal L, Navarro-Cerdán JR, Llobet R, Arlandis J, Perez-Cortes JC. SEQENS: An ensemble method for relevant gene identification in microarray data. Comput Biol Med 2023; 152:106413. [PMID: 36521355 DOI: 10.1016/j.compbiomed.2022.106413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Revised: 11/25/2022] [Accepted: 12/03/2022] [Indexed: 12/12/2022]
Abstract
This paper describes an ensemble feature identification algorithm called SEQENS, and measures its capability to identify the relevant variables in a case-control study using a genetic expression microarray dataset. SEQENS uses Sequential Feature Search on multiple sample splitting to select variables showing stronger relation with the target, and a variable relevance ranking is finally produced. Although designed for feature identification, SEQENS could also serve as a basis for feature selection (classifier optimisation). Cliff, a ranking evaluation metric is also presented and used to assess the feature identification algorithms when a groundtruth of relevant variables is available. To test performance, three types of synthetic groundtruths emulating fictitious diseases are generated from ten randomly chosen variables following different target pattern distributions using the E-MTAB-3732 dataset. Several sample-to-dimensionality ratios ranging from 300 to 3,000 observations and 854 to 54,675 variables are explored. SEQENS is compared with other feature selection or identification state-of-the-art methods. On average, the proposed algorithm identifies better the relevant genes and exhibits a stronger stability. The algorithm is available to the community.
Affiliation(s)
- François Signol: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
- Laura Arnal: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
- J Ramón Navarro-Cerdán: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
- Rafael Llobet: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
- Joaquim Arlandis: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
- Juan-Carlos Perez-Cortes: Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
4
Chu X, Jiang M, Liu ZJ. Biomarker interaction selection and disease detection based on multivariate gain ratio. BMC Bioinformatics 2022; 23:176. [PMID: 35550010 PMCID: PMC9103137 DOI: 10.1186/s12859-022-04699-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 04/14/2022] [Indexed: 11/30/2022] Open
Abstract
Background Disease detection is an important aspect of biotherapy. With the development of biotechnology and computer technology, there are many methods to detect disease based on single biomarker. However, biomarker does not influence disease alone in some cases. It’s the interaction between biomarkers that determines disease status. The existing influence measure I-score is used to evaluate the importance of interaction in determining disease status, but there is a deviation about the number of variables in interaction when applying I-score. To solve the problem, we propose a new influence measure Multivariate Gain Ratio (MGR) based on Gain Ratio (GR) of single-variate, which provides us with multivariate combination called interaction. Results We propose a preprocessing verification algorithm based on partial predictor variables to select an appropriate preprocessing method. In this paper, an algorithm for selecting key interactions of biomarkers and applying key interactions to construct a disease detection model is provided. MGR is more credible than I-score in the case of interaction containing small number of variables. Our method behaves better with average accuracy 93.13% than I-score of 91.73% in Breast Cancer Wisconsin (Diagnostic) Dataset. Compared to the classification results 89.80% based on all predictor variables, MGR identifies the true main biomarkers and realizes the dimension reduction. In Leukemia Dataset, the experiment results show the effectiveness of MGR with the accuracy of 97.32% compared to I-score with accuracy 89.11%. The results can be explained by the nature of MGR and I-score mentioned above because every key interaction contains a small number of variables in Leukemia Dataset. Conclusions MGR is effective for selecting important biomarkers and biomarker interactions even in high-dimension feature space in which the interaction could contain more than two biomarkers. The prediction ability of interactions selected by MGR is better than I-score in the case of interaction containing small number of variables. MGR is generally applicable to various types of biomarker datasets including cell nuclei, gene, SNPs and protein datasets.
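For orientation, the single-variable Gain Ratio that this abstract builds on can be sketched as information gain divided by split information. The sketch below uses only that textbook definition; treating a pair of discrete variables as one joint variable is an assumption made here to illustrate why an interaction can score higher than its parts, and it is not the paper's MGR definition. All names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, target):
    """Information gain of target given feature, divided by the feature's split information."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(target) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# XOR-style toy data: neither marker is informative alone, the pair is.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]

print("GR(a):", gain_ratio(a, y))
print("GR(b):", gain_ratio(b, y))
# Assumption: a joint variable is formed by pairing the two markers' values.
print("GR(a, b jointly):", gain_ratio(list(zip(a, b)), y))
```

The XOR pattern is the usual minimal case in which each marker has zero gain on its own while the combination determines the outcome.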
Affiliation(s)
- Xiao Chu: Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mao Jiang: Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zhuo-Jun Liu: Academy of Mathematics and Systems Science Chinese Academy of Sciences, Beijing, China
5
Inglis A, Parnell A, Hurley CB. Visualizing Variable Importance and Variable Interaction Effects in Machine Learning Models. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2021.2007935] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Affiliation(s)
- Alan Inglis: Hamilton Institute, Maynooth University, Maynooth, Ireland
- Andrew Parnell: Hamilton Institute, Insight Centre for Data Analytics, Maynooth University, Maynooth, Ireland
- Catherine B. Hurley: Department of Mathematics and Statistics, Maynooth University, Maynooth, Ireland
6
Affiliation(s)
- Lilun Du: Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Inchi Hu: Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology, Kowloon, Hong Kong
7
Xiong W, Pan H. Interaction screening for high-dimensional heterogeneous data via robust hybrid metrics. Stat Med 2021; 40:6651-6673. [PMID: 34542189 DOI: 10.1002/sim.9204] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/30/2021] [Revised: 07/22/2021] [Accepted: 09/02/2021] [Indexed: 11/07/2022]
Abstract
A novel model-free interaction screening approach called the hybrid metrics is introduced for high-dimensional heterogeneous data analysis. The metrics established based on the variation of conditional joint distribution function are measurements of interaction that include both size and direction. They are robust and can work with many types of response variables, including continuous, discrete, and categorical variables. We can apply the hybrid metrics to effective interaction selection for classification, response index models, and Poisson regression, among others. When dealing with classification, the hybrid metrics are capable of capturing both nonlinear category-general and category-specific interaction effects, providing us with a comprehensive overview and precise discovery of category information. When faced with a continuous response, the hybrid metrics perform fairly well even if the signal strength is weak, behaving as if the true interactions were known. To facilitate implementation, a fast two-stage procedure which naturally and efficiently enforces both strong and weak heredity is advocated. We further demonstrate their superior performances over popular competitors by exhaustive simulations and a SRBCT real data example. Supplementary materials for this article are available online.
Affiliation(s)
- Wei Xiong: School of Statistics, University of International Business and Economics, Beijing, China
- Han Pan: School of Statistics, University of International Business and Economics, Beijing, China
8
Abedi M, Marateb HR, Mohebian MR, Aghaee-Bakhtiari SH, Nassiri SM, Gheisari Y. Systems biology and machine learning approaches identify drug targets in diabetic nephropathy. Sci Rep 2021; 11:23452. [PMID: 34873190 PMCID: PMC8648918 DOI: 10.1038/s41598-021-02282-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2021] [Accepted: 11/12/2021] [Indexed: 11/15/2022] Open
Abstract
Diabetic nephropathy (DN), the leading cause of end-stage renal disease, has become a massive global health burden. Despite considerable efforts, the underlying mechanisms have not yet been comprehensively understood. In this study, a systematic approach was utilized to identify the microRNA signature in DN and to introduce novel drug targets (DTs) in DN. Using microarray profiling followed by qPCR confirmation, 13 and 6 differentially expressed (DE) microRNAs were identified in the kidney cortex and medulla, respectively. The microRNA-target interaction networks for each anatomical compartment were constructed and central nodes were identified. Moreover, enrichment analysis was performed to identify key signaling pathways. To develop a strategy for DT prediction, the human proteome was annotated with 65 biochemical characteristics and 23 network topology parameters. Furthermore, all proteins targeted by at least one FDA-approved drug were identified. Next, mGMDH-AFS, a high-performance machine learning algorithm capable of tolerating massive imbalanced size of the classes, was developed to classify DT and non-DT proteins. The sensitivity, specificity, accuracy, and precision of the proposed method were 90%, 86%, 88%, and 89%, respectively. Moreover, it significantly outperformed the state-of-the-art (P-value ≤ 0.05) and showed very good diagnostic accuracy and high agreement between predicted and observed class labels. The cortex and medulla networks were then analyzed with this validated machine to identify potential DTs. Among the high-rank DT candidates are Egfr, Prkce, clic5, Kit, and Agtr1a which is a current well-known target in DN. In conclusion, a combination of experimental and computational approaches was exploited to provide a holistic insight into the disorder for introducing novel therapeutic targets.
Affiliation(s)
- Maryam Abedi: Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
- Hamid Reza Marateb: Biomedical Engineering Department, Engineering Faculty, University of Isfahan, Isfahan, Iran; Department of Automatic Control, Biomedical Engineering Research Center, Universitat Politècnica de Catalunya, BarcelonaTech (UPC), Barcelona, Spain
- Mohammad Reza Mohebian: Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada
- Seyed Hamid Aghaee-Bakhtiari: Bioinformatics Research Group, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Medical Biotechnology and Nanotechnology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Seyed Mahdi Nassiri: Department of Clinical Pathology, Faculty of Veterinary Medicine, University of Tehran, Tehran, Iran
- Yousof Gheisari: Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Isfahan, Iran; Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran
9
An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images. ALGORITHMS 2021. [DOI: 10.3390/a14110337] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/28/2022]
Abstract
The field of explainable artificial intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional neural networks (CNNs) have been successful in making predictions, especially in image classification. These popular and well-documented successes use extremely deep CNNs such as VGG16, DenseNet121, and Xception. However, these well-known deep learning models use tens of millions of parameters based on a large number of pretrained filters that have been repurposed from previous data sets. Among these identified filters, a large portion contain no information yet remain as input features. Thus far, there is no effective method to omit these noisy features from a data set, and their existence negatively impacts prediction performance. In this paper, a novel interaction-based convolutional neural network (ICNN) is introduced that does not make assumptions about the relevance of local information. Instead, a model-free influence score (I-score) is proposed to directly extract the influential information from images to form important variable modules. This innovative technique replaces all pretrained filters found by trial-and-error with explainable, influential, and predictive variable sets (modules) determined by the I-score. In other words, future researchers need not rely on pretrained filters; the suggested algorithm identifies only the variables or pixels with high I-score values that are extremely predictive and important. The proposed method and algorithm were tested on real-world data set and a state-of-the-art prediction performance of 99.8% was achieved without sacrificing the explanatory power of the model. This proposed design can efficiently screen patients infected by COVID-19 before human diagnosis and can be a benchmark for addressing future XAI problems in large-scale data sets.
10
Feature selection with multi-objective genetic algorithm based on a hybrid filter and the symmetrical complementary coefficient. APPL INTELL 2020. [DOI: 10.1007/s10489-020-02028-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
11
Wang JH, Chen YH. Interaction screening by Kendall's partial correlation for ultrahigh-dimensional data with survival trait. Bioinformatics 2020; 36:2763-2769. [PMID: 31926011 DOI: 10.1093/bioinformatics/btaa017] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/26/2019] [Revised: 12/06/2019] [Accepted: 01/07/2020] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION In gene expression and genome-wide association studies, the identification of interaction effects is an important and challenging issue owing to its ultrahigh-dimensional nature. In particular, contaminated data and right-censored survival outcome make the associated feature screening even challenging. RESULTS In this article, we propose an inverse probability-of-censoring weighted Kendall's tau statistic to measure association of a survival trait with biomarkers, as well as a Kendall's partial correlation statistic to measure the relationship of a survival trait with an interaction variable conditional on the main effects. The Kendall's partial correlation is then used to conduct interaction screening. Simulation studies under various scenarios are performed to compare the performance of our proposal with some commonly available methods. In the real data application, we utilize our proposed method to identify epistasis associated with the clinical survival outcomes of non-small-cell lung cancer, diffuse large B-cell lymphoma and lung adenocarcinoma patients. Both simulation and real data studies demonstrate that our method performs well and outperforms existing methods in identifying main and interaction biomarkers. AVAILABILITY AND IMPLEMENTATION R-package 'IPCWK' is available to implement this method, together with a reference manual describing how to perform the 'IPCWK' package. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jie-Huei Wang: Department of Statistics, Feng Chia University, Taichung 40724, Taiwan
- Yi-Hau Chen: Institute of Statistical Science, Academia Sinica, Nankang, Taipei 11529, Taiwan
12
Feature selection with Symmetrical Complementary Coefficient for quantifying feature interactions. APPL INTELL 2020. [DOI: 10.1007/s10489-019-01518-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
13
Anoke SC, Normand SL, Zigler CM. Approaches to treatment effect heterogeneity in the presence of confounding. Stat Med 2019; 38:2797-2815. [PMID: 30931547 PMCID: PMC6613382 DOI: 10.1002/sim.8143] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/23/2017] [Revised: 02/15/2019] [Accepted: 02/20/2019] [Indexed: 12/26/2022]
Abstract
The literature on causal effect estimation tends to focus on the population mean estimand, which is less informative as medical treatments are becoming more personalized and there is increasing awareness that subpopulations of individuals may experience a group-specific effect that differs from the population average. In fact, it is possible that there is underlying systematic effect heterogeneity that is obscured by focusing on the population mean estimand. In this context, understanding which covariates contribute to this treatment effect heterogeneity (TEH) and how these covariates determine the differential treatment effect (TE) is an important consideration. Towards such an understanding, this paper briefly reviews three approaches used in making causal inferences and conducts a simulation study to compare these approaches according to their performance in an exploratory evaluation of TEH when the heterogeneous subgroups are not known a priori. Performance metrics include the detection of any heterogeneity, the identification and characterization of heterogeneous subgroups, and unconfounded estimation of the TE within subgroups. The methods are then deployed in a comparative effectiveness evaluation of drug-eluting versus bare-metal stents among 54 099 Medicare beneficiaries in the continental United States admitted to a hospital with acute myocardial infarction in 2008.
Affiliation(s)
- Sarah C. Anoke: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Massachusetts, United States
- Sharon-Lise Normand: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Massachusetts, United States; Department of Health Care Policy, Harvard Medical School, Massachusetts, United States
- Corwin M. Zigler: Department of Statistics & Data Sciences and Department of Womens Health, University of Texas at Austin and Dell Medical School, Texas, United States
14
Fang YH, Wang JH, Hsiung CA. TSGSIS: a high-dimensional grouped variable selection approach for detection of whole-genome SNP-SNP interactions. Bioinformatics 2018. [PMID: 28651334 DOI: 10.1093/bioinformatics/btx409] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
Motivation Identification of single nucleotide polymorphism (SNP) interactions is an important and challenging topic in genome-wide association studies (GWAS). Many approaches have been applied to detecting whole-genome interactions. However, these approaches to interaction analysis tend to miss causal interaction effects when the individual marginal effects are uncorrelated to trait, while their interaction effects are highly associated with the trait. Results A grouped variable selection technique, called two-stage grouped sure independence screening (TS-GSIS), is developed to study interactions that may not have marginal effects. The proposed TS-GSIS is shown to be very helpful in identifying not only causal SNP effects that are uncorrelated to trait but also their corresponding SNP-SNP interaction effects. The benefit of TS-GSIS are gaining detection of interaction effects by taking the joint information among the SNPs and determining the size of candidate sets in the model. Simulation studies under various scenarios are performed to compare performance of TS-GSIS and current approaches. We also apply our approach to a real rheumatoid arthritis (RA) dataset. Both the simulation and real data studies show that the TS-GSIS performs very well in detecting SNP-SNP interactions. Availability and implementation R-package is delivered through CRAN and is available at: https://cran.r-project.org/web/packages/TSGSIS/index.html. Contact hsiung@nhri.org.tw. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yao-Hwei Fang
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Jie-Huei Wang
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Chao A Hsiung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
|
15
|
Gene Selection for Microarray Cancer Data Classification by a Novel Rule-Based Algorithm. INFORMATION 2018. [DOI: 10.3390/info9010006] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/19/2022]
|
16
|
Wang MH, Chang B, Sun R, Hu I, Xia X, Wu WKK, Chong KC, Zee BCY. Stratified polygenic risk prediction model with application to CAGI bipolar disorder sequencing data. Hum Mutat 2017; 38:1235-1239. [PMID: 28419606 DOI: 10.1002/humu.23229] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 08/25/2016] [Revised: 03/13/2017] [Accepted: 04/04/2017] [Indexed: 01/31/2023]
Abstract
Genetic data consists of a wide range of marker types, including common, low-frequency, and rare variants. Multiple genetic markers and their interactions play central roles in the heritability of complex disease. In this study, we propose an algorithm that uses a stratified variable selection design by genetic architectures and interaction effects, achieved by a dataset-adaptive W-test. The polygenic sets in all strata were integrated to form a classification rule. The algorithm was applied to the Critical Assessment of Genome Interpretation 4 bipolar challenge sequencing data. The prediction accuracy was 60% using genetic markers on an independent test set. We found that epistasis among common genetic variants contributed most substantially to prediction precision. However, the sample size was not large enough to draw conclusions for the lack of predictability of low-frequency variants and their epistasis.
Affiliation(s)
- Maggie Haitian Wang
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Billy Chang
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China
- Rui Sun
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Inchi Hu
- ISOM Department and Biomedical Engineering Division, The Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR, China
- Xiaoxuan Xia
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China
- William Ka Kei Wu
- Department of Anaesthesia and Intensive Care, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ka Chun Chong
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Benny Chung-Ying Zee
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
|
17
|
Tian X, Xin M, Luo J, Liu M, Jiang Z. Identification of Genes Involved in Breast Cancer Metastasis by Integrating Protein–Protein Interaction Information with Expression Data. J Comput Biol 2017; 24:172-182. [DOI: 10.1089/cmb.2015.0206] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/25/2022]
Affiliation(s)
- Xin Tian
- Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Mingyuan Xin
- Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Jian Luo
- Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Mingyao Liu
- Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Zhenran Jiang
- Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology, East China Normal University, Shanghai, China
|
18
|
Framework for making better predictions by directly estimating variables' predictivity. Proc Natl Acad Sci U S A 2016; 113:14277-14282. [PMID: 27911830 DOI: 10.1073/pnas.1616647113] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for, preferably high, predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample size and its sensitivity to noisy useless variables. We demonstrate that the I-score of the PR method of VS yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the I-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the I-score on real data to demonstrate the statistic's predictive performance on sample data. We conjecture that using the partition retention and I-score can aid in finding variable sets with promising prediction rates; however, further research in the avenue of sample-based measures of predictivity is much desired.
|
19
|
Wang MH, Sun R, Guo J, Weng H, Lee J, Hu I, Sham PC, Zee BCY. A fast and powerful W-test for pairwise epistasis testing. Nucleic Acids Res 2016; 44:e115. [PMID: 27112568 PMCID: PMC4937324 DOI: 10.1093/nar/gkw347] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/08/2015] [Revised: 04/14/2016] [Accepted: 04/15/2016] [Indexed: 01/08/2023]
Abstract
Epistasis plays an essential role in the development of complex diseases. Interaction methods face the common challenge of balancing persistent power, model complexity, computational efficiency, and validity of identified biomarkers. We introduce a novel W-test to identify pairwise epistasis effects, which measures the distributional difference between cases and controls through a combined log odds ratio. The test is model-free, fast, and inherits a Chi-squared distribution with data-adaptive degrees of freedom. No permutation is needed to obtain the P-values. Simulation studies demonstrated that the W-test is more powerful in low-frequency variant environments than alternative methods, namely the Chi-squared test, logistic regression and multifactor-dimensionality reduction (MDR). In two independent real bipolar disorder genome-wide association study (GWAS) datasets, the W-test identified significant interaction pairs that can be replicated, including SLIT3-CENPN, SLIT3-TMEM132D, CNTNAP2-NDST4 and CNTCAP2-RTN4R. The genes in the pairs play central roles in neurotransmission and synapse formation. A majority of the identified loci are undiscoverable by main effect and are low-frequency variants. The proposed method offers a powerful alternative tool for mapping the genetic puzzle underlying complex disorders.
Affiliation(s)
- Maggie Haitian Wang
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Rui Sun
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Junfeng Guo
- The Australian National University, Canberra, Australia
- Haoyi Weng
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
- Jack Lee
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
- Inchi Hu
- ISOM Department and Biomedical Engineering Division, the Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR, China
- Pak Chung Sham
- Department of Psychiatry; Centre for Genomic Sciences, the University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
- Benny Chung-Ying Zee
- Division of Biostatistics and Centre for Clinical Research and Biostatistics, JC School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China; CUHK Shenzhen Research Institute, Shenzhen, China
|
20
|
Chen Y, Wang L, Li L, Zhang H, Yuan Z. Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinformatics 2016; 17:44. [PMID: 26792270 PMCID: PMC4721022 DOI: 10.1186/s12859-016-0893-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 09/10/2015] [Accepted: 01/19/2016] [Indexed: 12/15/2022]
Abstract
BACKGROUND Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs because they use only one of the two strategies, vertical comparison or horizontal comparison, while individual-gene-ranking methods ignore redundancy and synergy among genes. RESULTS Here we proposed a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and finally built an RS-based direct classifier (RS-based DC) from a set of informative genes capable of binary discrimination with a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40%), the best generalization performance and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM. CONCLUSIONS Diverse patterns of gene pairs can be highlighted more fully by integrating the vertical and horizontal comparison strategies. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and better classification accuracy.
Affiliation(s)
- Yuan Chen
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
- Lifeng Wang
- Biotechnology Research Center, Hunan Academy of Agricultural Sciences, Changsha, China.
- Lanzhi Li
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
- Hongyan Zhang
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
- Zheming Yuan
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
|
21
|
Abstract
Thus far, genome-wide association studies (GWAS) have been disappointing in the inability of investigators to use the results of identified, statistically significant variants in complex diseases to make predictions useful for personalized medicine. Why are significant variables not leading to good prediction of outcomes? We point out that this problem is prevalent in simple as well as complex data, in the sciences as well as the social sciences. We offer a brief explanation and some statistical insights on why higher significance cannot automatically imply stronger predictivity and illustrate through simulations and a real breast cancer example. We also demonstrate that highly predictive variables do not necessarily appear as highly significant, thus evading the researcher using significance-based methods. We point out that what makes variables good for prediction versus significance depends on different properties of the underlying distributions. If prediction is the goal, we must lay aside significance as the only selection standard. We suggest that progress in prediction requires efforts toward a new research agenda of searching for a novel criterion to retrieve highly predictive variables rather than highly significant variables. We offer an alternative approach that was not designed for significance, the partition retention method, which was very effective predicting on a long-studied breast cancer data set, by reducing the classification error rate from 30% to 8%.
|
22
|
DWFS: a wrapper feature selection tool based on a parallel genetic algorithm. PLoS One 2015; 10:e0117988. [PMID: 25719748 PMCID: PMC4342225 DOI: 10.1371/journal.pone.0117988] [Citation(s) in RCA: 74] [Impact Index Per Article: 8.2] [Received: 06/16/2014] [Accepted: 01/04/2015] [Indexed: 11/19/2022]
Abstract
Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for the class prediction. The efficient implementation of classification models requires identification of suitable combinations of features. A smaller number of features reduces the problem’s dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient selection of features for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation simultaneously examines and evaluates a large number of candidate collections of features. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, weights and parameters in the fitness function of the GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.
|
23
|
Informative gene selection and direct classification of tumor based on Chi-square test of pairwise gene interactions. BIOMED RESEARCH INTERNATIONAL 2014; 2014:589290. [PMID: 25140319 PMCID: PMC4130026 DOI: 10.1155/2014/589290] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 05/28/2014] [Accepted: 07/10/2014] [Indexed: 01/04/2023]
Abstract
In efforts to discover disease mechanisms and improve clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ2-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ2-DC) within the training set to obtain informative genes. Finally, we determined the accuracy of independent test data by utilizing the genes obtained above with χ2-DC. Furthermore, we analyzed the robustness of χ2-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test of ten multiclass tumor gene-expression datasets showed that χ2-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ2-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also had good performance in χ2-DC.
|
24
|
Agne M, Huang CH, Hu I, Wang H, Zheng T, Lo SH. Considering interactive effects in the identification of influential regions with extremely rare variants via fixed bin approach. BMC Proc 2014; 8:S7. [PMID: 25519400 PMCID: PMC4143804 DOI: 10.1186/1753-6561-8-s1-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
In this study, we analyze the Genetic Analysis Workshop 18 (GAW18) data to identify regions of single-nucleotide polymorphisms (SNPs), which significantly influence hypertension status among individuals. We have studied the marginal impact of these regions on disease status in the past, but we extend the method to deal with environmental factors present in data collected over several exam periods. We consider the respective interactions between such traits as smoking status and age with the genetic information and hope to augment those genetic regions deemed influential marginally with those that contribute via an interactive effect. In particular, we focus only on rare variants and apply a procedure to combine signal among rare variants in a number of "fixed bins" along the chromosome. We extend the procedure in Agne et al [1] to incorporate environmental factors by dichotomizing subjects via traits such as smoking status and age, running the marginal procedure among each respective category (i.e., smokers or nonsmokers), and then combining their scores into a score for interaction. To avoid overlap of subjects, we examine each exam period individually. Out of a possible 629 fixed-bin regions in chromosome 3, we observe that 11 show up in multiple exam periods for gene-smoking score. Fifteen regions exhibit significance for multiple exam periods for gene-age score, with 4 regions deemed significant for all 3 exam periods. The procedure pinpoints SNPs in 8 "answer" genes, with 5 of these showing up as significant in multiple testing schemes (Gene-Smoking, Gene-Age for Exams 1, 2, and 3).
Affiliation(s)
- Michael Agne
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Chien-Hsun Huang
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Inchi Hu
- Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology Business School, Hong Kong
- Haitian Wang
- Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Hong Kong
- Tian Zheng
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, Room 1005, MC 4690, New York, New York 10027, USA
|
25
|
Wang MH, Huang CH, Zheng T, Lo SH, Hu I. Discovering pure gene-environment interactions in blood pressure genome-wide association studies data: a two-step approach incorporating new statistics. BMC Proc 2014; 8:S62. [PMID: 25519396 PMCID: PMC4143689 DOI: 10.1186/1753-6561-8-s1-s62] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022]
Abstract
Environment has long been known to play an important part in disease etiology. However, not many genome-wide association studies take environmental factors into consideration. There is also a need for new methods to identify the gene-environment interactions. In this study, we propose a 2-step approach incorporating an influence measure that captures pure gene-environment effect. We found that pure gene-age interaction has a stronger association than considering the genetic effect alone for systolic blood pressure, measured by counting the number of single-nucleotide polymorphisms (SNPs) reaching a certain significance level. We analyzed the subjects by dividing them into two age groups and found no overlap in the top identified SNPs between them. This suggested that age might have a nonlinear effect on genetic association. Furthermore, the scores of the top SNPs for the two age subgroups were about 3 times those obtained when using all subjects for systolic blood pressure. In addition, the scores of the older age subgroup were much higher than those for the younger group. The results suggest that genetic effects are stronger in older age and that genetic association studies should take environmental effects into consideration, especially age.
Affiliation(s)
- Maggie Haitian Wang
- Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR
- Chien-Hsun Huang
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Tian Zheng
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027-5927, USA
- Inchi Hu
- Department of ISOM, the Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong SAR
|
26
|
Fan R, Lo SH. A robust model-free approach for rare variants association studies incorporating gene-gene and gene-environmental interactions. PLoS One 2013; 8:e83057. [PMID: 24358248 PMCID: PMC3866272 DOI: 10.1371/journal.pone.0083057] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/24/2013] [Accepted: 10/30/2013] [Indexed: 11/19/2022]
Abstract
Recently, more and more evidence suggests that rare variants with much lower minor allele frequencies play significant roles in disease etiology. Advances in next-generation sequencing technologies will lead to many more rare variants association studies. Several statistical methods have been proposed to assess the effect of rare variants by aggregating information from multiple loci across a genetic region and testing the association between the phenotype and aggregated genotype. One limitation of existing methods is that they only look into the marginal effects of rare variants but do not systematically take into account effects due to interactions among rare variants and between rare variants and environmental factors. In this article, we propose the summation of partition approach (SPA), a robust model-free method that is designed specifically for detecting both marginal effects and effects due to gene-gene (G×G) and gene-environmental (G×E) interactions for rare variants association studies. SPA has three advantages. First, it accounts for the interaction information and gains considerable power in the presence of unknown and complicated G×G or G×E interactions. Second, it does not sacrifice the marginal detection power; when rare variants only have marginal effects, it is comparable with the most competitive method in the current literature. Third, it is easy to extend and can incorporate more complex interactions; other practitioners and scientists can readily tailor the procedure to fit their own studies. Our simulation studies show that SPA is considerably more powerful than many existing methods in the presence of G×G and G×E interactions.
Affiliation(s)
- Ruixue Fan
- Department of Statistics, Columbia University, New York, New York, United States of America
- Shaw-Hwa Lo
- Department of Statistics, Columbia University, New York, New York, United States of America
|
27
|
Summarizing techniques that combine three non-parametric scores to detect disease-associated 2-way SNP-SNP interactions. Gene 2013; 533:304-12. [PMID: 24076437 DOI: 10.1016/j.gene.2013.09.041] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/24/2013] [Revised: 08/30/2013] [Accepted: 09/09/2013] [Indexed: 10/26/2022]
Abstract
Identifying susceptibility genes that influence complex diseases is extremely difficult because loci often influence the disease state through genetic interactions. Numerous approaches to detect disease-associated SNP-SNP interactions have been developed, but none consistently generates high-quality results under different disease scenarios. Using summarizing techniques to combine a number of existing methods may provide a solution to this problem. Here we used three popular non-parametric methods (Gini, absolute probability difference (APD), and entropy) to develop two novel summary scores, namely principal component score (PCS) and Z-sum score (ZSS), with which to predict disease-associated genetic interactions. We used a simulation study to compare the performance of the non-parametric scores, the summary scores, the scaled-sum score (SSS; used in polymorphism interaction analysis (PIA)), and multifactor dimensionality reduction (MDR). The non-parametric methods achieved high power, but no non-parametric method outperformed all others under a variety of epistatic scenarios. PCS and ZSS, however, outperformed MDR. PCS, ZSS and SSS displayed controlled type-I errors (<0.05) compared to GS, APDS, ES (>0.05). A real data study using the Genetic Analysis Workshop 16 (GAW 16) rheumatoid arthritis dataset identified a number of interesting SNP-SNP interactions.
|
28
|
Multiclass Prediction for Cancer Microarray Data Using Various Variables Range Selection Based on Random Forest. LECTURE NOTES IN COMPUTER SCIENCE 2013. [DOI: 10.1007/978-3-642-40319-4_22] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/27/2022]
|