1
Zhang B, Wiedermann W. Covariate selection in causal learning under non-Gaussianity. Behav Res Methods 2023:10.3758/s13428-023-02217-y. [PMID: 37704788 DOI: 10.3758/s13428-023-02217-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/04/2023] [Indexed: 09/15/2023]
Abstract
Understanding causal mechanisms is a central goal in the behavioral, developmental, and social sciences. When estimating and probing causal effects using observational data, covariate adjustment is a crucial element to remove dependencies between focal predictors and the error term. Covariate selection, however, constitutes a challenging task because availability alone is not an adequate criterion to decide whether a covariate should be included in the statistical model. The present study introduces a non-Gaussian method for covariate selection and provides a forward selection algorithm for linear models (i.e., non-Gaussian forward selection; nGFS) to select appropriate covariates from a set of potential control variables to avoid inconsistent and biased estimators of the causal effect of interest. Further, we demonstrate that the forward selection algorithm has properties compatible with principles of direction of dependence, i.e., probing whether the causal target model is correctly specified with respect to the causal direction of effects. Results of a Monte Carlo simulation study suggest that the selection algorithm performs well, in particular when sample sizes are large (i.e., n ≥ 250) and data strongly deviate from Gaussianity (e.g., distributions with skewness beyond 1.5). An empirical example is given for illustrative purposes.
Affiliation(s)
- Bixi Zhang
- Department of Educational Psychology, CUNY Graduate Center, New York, NY, USA.
- Wolfgang Wiedermann
- Department of Educational, School, and Counseling Psychology, University of Missouri, Columbia, MO, USA
2
Kartikadarma E, Cakranegara PA, Syafar F, Iskandar A, Paramansyah A, Rahim R. Application of forward selection strategy using C4.5 algorithm to improve the accuracy of classification's data set. J Popul Ther Clin Pharmacol 2023; 30:e14-e23. [PMID: 36631414 DOI: 10.47750/jptcp.2023.1002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2022] [Accepted: 10/28/2022] [Indexed: 01/13/2023]
Abstract
The purpose of this study is to improve the classification accuracy of the C4.5 algorithm using the forward selection technique. The dataset used is Breast Cancer from the UCI Machine Learning Repository. It contains 286 records with nine attributes and one class (label). The suggested model was evaluated against two existing classification models (C4.5 and Naïve Bayes) using the RapidMiner program. The procedure consists of multiple stages, the first of which selects the dominant attributes using a feature selection technique (weight by information gain). The second step is forward selection based on the outcome of feature selection. Before processing, the dataset is separated into training and testing sets at ratios of 70:30, 80:20, and 90:10. The final step is examining the output. The experimental results demonstrate that the forward selection methodology employing the C4.5 (C4.5 + FS) method outperforms the C4.5 and Naïve Bayes classification techniques. C4.5 + FS (Split Data 70:30) has an accuracy value of 76.74%, C4.5 + FS (Split Data 80:20) has an accuracy value of 78.95%, C4.5 + FS (Split Data 90:10) has an accuracy value of 78.57%, C4.5 (Split Data 70:30) has an accuracy value of 65.12%, and Naïve Bayes (Split Data 70:30) has an accuracy value of 85.55%. In comparison to typical classification algorithms (C4.5 and Naïve Bayes), the average accuracy values increased by 12.97% and 8.32%, respectively. In terms of precision, recall, and F-measure, the forward selection strategy utilizing the C4.5 method beat all other classification techniques, achieving 79.84%, 92.50%, and 85.55%, respectively. In addition, the results demonstrated an increase in the average Area Under Curve (AUC) from 0.628 to 0.732. Therefore, it can be inferred that the forward selection strategy can be applied to the Breast Cancer Data Set in order to increase the accuracy value of classification method C4.5.
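The "weight by information gain" step mentioned in this abstract can be illustrated with a minimal sketch. It assumes categorical attributes and the standard entropy-based definition of information gain; the function names and the toy values are hypothetical and are not taken from the study or from RapidMiner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (not the Breast Cancer dataset): rank attributes by gain.
toy_labels = ["yes", "yes", "no", "no"]
toy_features = {
    "informative": ["a", "a", "b", "b"],
    "uninformative": ["a", "b", "a", "b"],
}
ranking = sorted(
    toy_features,
    key=lambda name: information_gain(toy_features[name], toy_labels),
    reverse=True,
)
```

A ranking of this kind would then supply the order in which a forward selection wrapper tries candidate attributes.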
Affiliation(s)
- Etika Kartikadarma
- Department of Informatics, Universitas Dian Nuswantoro, Semarang, Indonesia
- Faisal Syafar
- Department of Electronics, Faculty of Engineering, Universitas Negeri Makassar, Makassar, Indonesia
- Akbar Iskandar
- Department of Informatics, Universitas Teknologi AKBA Makassar, Makassar, Indonesia
- Arman Paramansyah
- Department of Education, Institut Agama Islam Nasional Laa Roiba, Bogor, Indonesia
- Robbi Rahim
- Department of Management Technology, Sekolah Tinggi Ilmu Manajemen Sukma, Medan, Indonesia
3
Su PW, Lo SL. Using Landsat 8 imagery for remote monitoring of total phosphorus as a water quality parameter of irrigation ponds in Taiwan. Environ Sci Pollut Res Int 2021; 28:66687-66694. [PMID: 34235681 DOI: 10.1007/s11356-021-15159-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2021] [Accepted: 06/23/2021] [Indexed: 06/13/2023]
Abstract
Monitoring water body quality parameters with high spatial and temporal resolutions is crucial because mitigation of pollution is usually costlier than early prevention/intervention. The existing monitoring methods for irrigation ponds in Taoyuan, Taiwan, are based on field measurements that have low spatial and temporal resolutions. In this study, using Landsat 8 satellite imagery, a multiple regression-derived relationship between the satellite band reflectance and the concentration of total phosphorus (TP) was established. The satellite imagery was atmospherically corrected with ACOLITE based on shortwave infrared (SWIR) bands. Predictor variables in the multiple regression-derived equation were chosen by forward selection using a p value and variance inflation factor (VIF) threshold. The derived equation yielded a coefficient of determination (R²) of 0.67. The near-infrared band (band 5) was found to be most significant. The Landsat 8 imagery retrieved for two of the three pond studies included only a few pixels from the ponds because parts of the pond surfaces are covered by floating photovoltaic power plants. The TP concentrations resulting from the derived equation indicate the feasibility of using satellite remote sensing methods to monitor the water quality. The derived relationships are potentially applicable to extend the availability of temporal and spatial water quality data for these irrigation ponds.
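For readers unfamiliar with the VIF screening mentioned in this abstract, the standard textbook definition is given below. The symbols j and k are notation introduced here for illustration; the abstract does not report the threshold the authors used.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}, \qquad j = 1, \dots, k
```

Here R_j² is the coefficient of determination obtained by regressing predictor j on the remaining k − 1 predictors. A value of 1 indicates no collinearity, and larger values indicate that the predictor is increasingly well explained by the others.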
Affiliation(s)
- Po-Wen Su
- Graduate Institute of Environmental Engineering, National Taiwan University, 71, Chou-Shan Rd, Taipei, 10673, Taiwan
- Shang-Lien Lo
- Graduate Institute of Environmental Engineering, National Taiwan University, 71, Chou-Shan Rd, Taipei, 10673, Taiwan.
- Water Innovation, Low Carbon and Environmental Sustainability Research Center, National Taiwan University, Taipei, 10617, Taiwan.
4
Lau PY, Fung WK. Evaluation of marker selection methods and statistical models for chronological age prediction based on DNA methylation. Leg Med (Tokyo) 2020; 47:101744. [PMID: 32659707 DOI: 10.1016/j.legalmed.2020.101744] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/23/2020] [Revised: 06/02/2020] [Accepted: 06/26/2020] [Indexed: 12/16/2022]
Abstract
In forensic investigation, retrieving biological information from DNA evidence is a promising field of interest. One application is estimating the age of the donor based on DNA methylation. A large number of studies have focused on age prediction using the 450 K Human Methylation Beadchip. Various marker selection methods and prediction models have been considered. However, there is a lack of research evaluating different high-dimensional variable selection methods of CpG sites with various models for age prediction. The aim of this study is to evaluate four variable selection methods (forward selection, LASSO, elastic net and SCAD) combined with a classical statistical model and sophisticated machine learning models based on the mean absolute deviation (MAD) and the root-mean-square error (RMSE). We used a publicly available 450 K data set containing 991 whole blood samples (age 19-101 years). We found that the multiple linear regression model with 16 markers selected from the forward selection method performed very well in age prediction (MAD = 3.76 years and RMSE = 5.01 years). On the other hand, the highly advanced ultrahigh dimensional variable selection methods and sophisticated machine learning algorithms appeared unnecessary for age prediction based on DNA methylation.
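The two evaluation metrics named in this abstract can be sketched as follows. This assumes MAD means the average absolute difference between predicted and chronological age, which is the usual reading in age-prediction work; the function names and the ages in the example are hypothetical.

```python
import math


def mean_absolute_deviation(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ages in years, for illustration only.
actual_ages = [30, 40, 50, 60]
predicted_ages = [33, 38, 50, 65]
mad = mean_absolute_deviation(actual_ages, predicted_ages)   # 2.5
rmse = root_mean_square_error(actual_ages, predicted_ages)   # sqrt(9.5)
```

RMSE is never smaller than MAD on the same errors, and the gap widens when a few predictions are far off, which is why the two are often reported together.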
Affiliation(s)
- Pui Yin Lau
- Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Wing Kam Fung
- Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China.
5
Abstract
Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, forward regression is seldom used in high-dimensional settings because of the cumbersome computation and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, the results are based on the sum of residual squares from linear models and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment of partial likelihood, with an EBIC stopping rule. To our knowledge, this is the first study that investigates the partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps and their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use these results to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients' survival.
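The general shape of forward selection with an EBIC stopping rule can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' procedure: it uses the commonly cited EBIC form (−2·loglik + k·log n + 2γ·log C(p, k)), stops as soon as EBIC fails to decrease, and takes the log-likelihood from a user-supplied function. For a Cox model that function would return the maximized partial log-likelihood; here a toy additive function with hypothetical gains stands in for it.

```python
import math


def ebic(loglik, k, n, p, gamma=1.0):
    """Extended BIC for a model with k of p candidate variables and sample size n."""
    return -2.0 * loglik + k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))


def forward_select(candidates, loglik_fn, n, gamma=1.0):
    """Greedily add the variable with the largest log-likelihood until EBIC stops improving."""
    p = len(candidates)
    selected = []
    remaining = list(candidates)
    best = ebic(loglik_fn(selected), 0, n, p, gamma)
    while remaining:
        # Score every remaining variable by the fit obtained when it is added.
        scored = [(loglik_fn(selected + [c]), c) for c in remaining]
        loglik, winner = max(scored, key=lambda pair: pair[0])
        criterion = ebic(loglik, len(selected) + 1, n, p, gamma)
        if criterion >= best:
            break  # no improvement: stop
        selected.append(winner)
        remaining.remove(winner)
        best = criterion
    return selected


# Hypothetical stand-in for a fitted model: each variable adds a fixed gain.
toy_gains = {"x1": 40.0, "x2": 15.0, "x3": 0.5, "x4": 0.1}


def toy_loglik(subset):
    return -100.0 + sum(toy_gains[v] for v in subset)


chosen = forward_select(list(toy_gains), toy_loglik, n=200)
```

In this toy run the two variables with large gains enter in order of their contribution, and the third candidate is rejected because its small gain does not cover the added penalty.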
Affiliation(s)
- Hyokyoung G. Hong
- Department of Statistics and Probability, Michigan State University, 19 Red Cedar Road, East Lansing, MI 48823, USA
- Qi Zheng
- Department of Bioinformatics and Biostatistics, University of Louisville, 485 East Gray Street, Louisville, KY 40202, USA
- Yi Li
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109-2029, USA
6
Argyropoulos A, Townley S, Upton PM, Dickinson S, Pollard AS. Identifying on admission patients likely to develop acute kidney injury in hospital. BMC Nephrol 2019; 20:56. [PMID: 30764796 PMCID: PMC6376785 DOI: 10.1186/s12882-019-1237-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 06/09/2018] [Accepted: 01/29/2019] [Indexed: 12/23/2022]
Abstract
Background: The incidence of Acute Kidney Injury (AKI) continues to increase in the UK, with associated mortality rates remaining significant. Approximately one fifth of hospital admissions are associated with AKI and approximately a third of patients with AKI in hospital develop AKI during their time in hospital. A fifth of these cases are considered avoidable. Early risk detection remains key to decreasing AKI in hospitals, where sub-optimal care was noted for half of patients who developed AKI.
Methods: Electronic anonymised data for adults admitted into the Royal Cornwall Hospitals Trust (RCHT) between 18th March and 31st December 2015 was trimmed to that collected within the first 24 h of hospitalisation. These datasets were split according to three separate time periods: data used for training the Takagi-Sugeno Fuzzy Logic Systems (FLS) and the multivariable logistic regression (MLR) models; data used for testing; and data from a later patient spell used for validation. Three fuzzy logic models and three MLR models were developed to link characteristics of patients diagnosed with a maximum stage AKI within 7 days of admission: the first models to identify any AKI Stage (FLS I, MLR I), the second for patterns of AKI Stage 2 or 3 (FLS II, MLR II), and the third to identify AKI Stage 3 (FLS III, MLR III). Model accuracy is expressed by area under the curve (AUC).
Results: Accuracy for each model during internal validation was: FLS I and MLR I (AUC 0.70, 95% CI: 0.64–0.77); FLS II (AUC 0.77, 95% CI: 0.69–0.85) and MLR II (AUC 0.74, 95% CI: 0.65–0.83); FLS III and MLR III (AUC 0.95, 95% CI: 0.92–0.98).
Conclusions: FLS II and FLS III (and the respective MLR models) can identify with a high level of accuracy patients at high risk of developing AKI in hospital. These two models cannot be properly assessed against prior studies as this is the first attempt at quantifying the risk of developing specific Stages of AKI for a broad cohort of both medical and surgical inpatients. FLS I and MLR I performance is comparable to other existing models.
Electronic supplementary material: The online version of this article (10.1186/s12882-019-1237-x) contains supplementary material, which is available to authorized users.
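Since model accuracy in this abstract is reported as AUC, a minimal sketch of how AUC can be computed from risk scores may help. It uses the rank-based (Mann-Whitney) formulation: the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The function name and the toy scores are hypothetical and unrelated to the study's data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative cases."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks and outcomes (1 = developed the condition).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 pairs ranked correctly
```

The pairwise loop is quadratic in the number of cases, which is acceptable for illustration; a sort-based version is preferable for large cohorts.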
Affiliation(s)
- Anastasios Argyropoulos
- Centre for Implementation Science, Faculty of Health Sciences, University of Southampton, Southampton, SO17 1BJ, UK.
- Stuart Townley
- College of Engineering, Mathematics, and Physical Sciences, University of Exeter, Penryn, Cornwall, TR10 9FE, UK
- Paul M Upton
- Research, Development, and Innovation, Royal Cornwall Hospitals NHS Trust, Truro, TR1 3HD, UK
- Stephen Dickinson
- Research, Development, and Innovation, Royal Cornwall Hospitals NHS Trust, Truro, TR1 3HD, UK
- Adam S Pollard
- Research, Development, and Innovation, Royal Cornwall Hospitals NHS Trust, Truro, TR1 3HD, UK
7
Tsamardinos I, Borboudakis G, Katsogridakis P, Pratikakis P, Christophides V. A greedy feature selection algorithm for Big Data of high dimensionality. Mach Learn 2018; 108:149-202. [PMID: 30906113 PMCID: PMC6399683 DOI: 10.1007/s10994-018-5748-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 08/14/2017] [Accepted: 07/18/2018] [Indexed: 11/26/2022]
Abstract
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
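To illustrate how p-values computed locally on separate data partitions can be merged with a meta-analysis technique, the sketch below uses Fisher's combined probability test. Treating Fisher's method as the combination rule is an assumption made here for illustration; the abstract does not name the technique. The function name is hypothetical, and the p-values are assumed independent and strictly positive.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has a closed form, so no external library is needed.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # statistic divided by 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i      # (half ** i) / i!
        total += term
    return math.exp(-half) * total


# Two partitions that are each borderline give stronger combined evidence.
combined = fisher_combine([0.05, 0.05])
```

With a single partition the function returns the input p-value unchanged, which is a convenient sanity check on the closed form.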
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete, Heraklion, Greece
- Gnosis Data Analysis PC, Heraklion, Greece
- Pavlos Katsogridakis
- Computer Science Department, University of Crete, Heraklion, Greece
- Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Greece
- Polyvios Pratikakis
- Computer Science Department, University of Crete, Heraklion, Greece
- Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Greece
8
Abstract
Genetic association mapping has been widely applied to determine genetic markers favorably associated with a trait of interest and provide information for marker-assisted selection. Many association mapping studies focus on main effects because of intolerable computing intensity. This study aims to select several sets of DNA markers with potential epistasis to maximize genetic variations of some key agronomic traits in barley. To do so, we integrated an MDR (multifactor dimensionality reduction) method with a forward variable selection approach. This integrated approach was used to determine single nucleotide polymorphism (SNP) pairs with epistasis effects associated with three agronomic traits: heading date, plant height, and grain yield in barley from the barley Coordinated Agricultural Project. Our results showed that four, seven, and five SNP pairs accounted for 51.06, 45.66, and 40.42% of the variation in heading date, plant height, and grain yield, respectively, with epistasis being considered, while the corresponding contributions to these three traits were 45.32, 31.39, and 31.31%, respectively, without epistasis being included. The results suggest that the epistasis model was more effective than the non-epistasis model in this study and may be preferable for other applications.
Affiliation(s)
- Yi Xu
- Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Box 2140C, Brookings, SD, 57007, USA
- Yajun Wu
- Department of Biology and Microbiology, South Dakota State University, Brookings, SD, 57007, USA
- Jixiang Wu
- Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Box 2140C, Brookings, SD, 57007, USA.
9
Abstract
The estimation of treatment effects based on observational data usually involves multiple confounders, and dimension reduction is often desirable and sometimes inevitable. We first clarify the definition of a central subspace that is relevant for the efficient estimation of average treatment effects. A criterion is then proposed to simultaneously estimate the structural dimension, the basis matrix of the joint central subspace, and the optimal bandwidth for estimating the conditional treatment effects. The method can easily be implemented by forward selection. Semiparametric efficient estimation of average treatment effects can be achieved by averaging the conditional treatment effects with a different data-adaptive bandwidth to ensure optimal undersmoothing. Asymptotic properties of the estimated joint central subspace and the corresponding estimator of average treatment effects are studied. The proposed methods are applied to a nutritional study, where the covariate dimension is reduced from 11 to an effective dimension of one.
Affiliation(s)
- Ming-Yueh Huang
- Department of Biostatistics, University of Washington, Seattle, Washington 98105, @u.washington.edu
- Kwun Chuen Gary Chan
- Department of Biostatistics, University of Washington, Seattle, Washington 98105, @u.washington.edu
10
Abstract
In ultra-high dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a data set with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say more than tens of hundreds, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, the interaction selection consistency is hard to achieve in high dimensional settings. Interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues by forward-selection based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are designed to be simple and fast to implement. No complex optimization tools are needed, since only OLS-type calculations are involved; the iFOR algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirement is minimal; the computational complexity is linear in p for sparse models, hence feasible for p ≫ n. Theoretically, we prove that they possess sure screening property for ultra-high dimensional settings. Numerical examples are used to demonstrate their finite sample performance.
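The column count (p² + 3p)/2 quoted in this abstract follows from adding p linear terms to p(p + 1)/2 order-2 terms (p squares plus p(p − 1)/2 pairwise products). The short sketch below checks that arithmetic by enumeration; the function names are hypothetical and the code is only a counting aid, not part of the iFOR procedures.

```python
from itertools import combinations_with_replacement


def augmented_columns(p):
    """Closed-form number of linear plus order-2 terms for p predictors."""
    return (p * p + 3 * p) // 2


def enumerate_terms(p):
    """List every linear term (j,) and every order-2 term (j, k) with j <= k."""
    linear = [(j,) for j in range(p)]
    order2 = list(combinations_with_replacement(range(p), 2))
    return linear + order2


# The enumeration and the closed form agree for small p.
small_checks = [len(enumerate_terms(p)) == augmented_columns(p) for p in range(1, 8)]

# For p = 1000 the augmented matrix already has over half a million columns.
columns_for_1000 = augmented_columns(1000)
```

This growth is why the abstract stresses that the procedures avoid storing the whole augmented matrix.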
Affiliation(s)
- Ning Hao
- Assistant Professor, Department of Mathematics, University of Arizona, Tucson, AZ 85721
- Hao Helen Zhang
- Associate Professor, Department of Mathematics, University of Arizona, Tucson, AZ 85721