1
Lyu R, Qu Y, Divaris K, Wu D. Methodological Considerations in Longitudinal Analyses of Microbiome Data: A Comprehensive Review. Genes (Basel) 2023; 15:51. [PMID: 38254941 PMCID: PMC11154524 DOI: 10.3390/genes15010051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 12/22/2023] [Accepted: 12/26/2023] [Indexed: 01/24/2024] Open
Abstract
Biological processes underlying health and disease are inherently dynamic and are best understood when characterized in a time-informed manner. In this comprehensive review, we discuss challenges inherent in time-series microbiome data analyses and compare available approaches and methods to overcome them. Appropriate handling of longitudinal microbiome data can shed light on important roles, functions, patterns, and potential interactions between large numbers of microbial taxa or genes in the context of health, disease, or interventions. We present a comprehensive review and comparison of existing microbiome time-series analysis methods, for both preprocessing and downstream analyses, including differential analysis, clustering, network inference, and trait classification. We posit that the careful selection and appropriate utilization of computational tools for longitudinal microbiome analyses can help advance our understanding of the dynamic host-microbiome relationships that underlie health-maintaining homeostases, progressions to disease-promoting dysbioses, as well as phases of physiologic development like those encountered in childhood.
Affiliation(s)
- Ruiqi Lyu
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yixiang Qu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Kimon Divaris
  - Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Di Wu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Division of Oral and Craniofacial Health Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
2
Lee ER, Park S, Lee SK, Hong HG. Quantile forward regression for high-dimensional survival data. LIFETIME DATA ANALYSIS 2023; 29:769-806. [PMID: 37393569 DOI: 10.1007/s10985-023-09603-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Accepted: 05/17/2023] [Indexed: 07/04/2023]
Abstract
Despite the urgent need for an effective prediction model tailored to individual interests, existing models have mainly been developed for the mean outcome, targeting average people. Additionally, the direction and magnitude of covariates' effects on the mean outcome may not hold across different quantiles of the outcome distribution. To accommodate the heterogeneous characteristics of covariates and provide a flexible risk model, we propose a quantile forward regression model for high-dimensional survival data. Our method selects variables by maximizing the likelihood of the asymmetric Laplace distribution (ALD) and derives the final model based on the extended Bayesian Information Criterion (EBIC). We demonstrate that the proposed method enjoys a sure screening property and selection consistency. We apply it to the national health survey dataset to show the advantages of a quantile-specific prediction model. Finally, we discuss potential extensions of our approach, including the nonlinear model and the globally concerned quantile regression coefficients model.
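Maximizing an asymmetric Laplace likelihood in the location parameter is equivalent to minimizing the quantile "check" loss, which is what makes the ALD a convenient working likelihood for quantile regression. The following is a minimal sketch of that loss only, not the authors' forward-selection procedure; the function names and the toy data are assumptions made for illustration.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_check_loss(y, fitted, tau):
    """Sum of check losses over residuals y_i - fitted_i."""
    return sum(check_loss(a - b, tau) for a, b in zip(y, fitted))


# Toy data (hypothetical): for tau = 0.5 the best constant fit is the sample median.
y = [1.0, 2.0, 3.0, 4.0, 10.0]
best = min(y, key=lambda c: total_check_loss(y, [c] * len(y), 0.5))
print(best)
```

Changing `tau` moves the target from the median to another quantile, which is the sense in which a quantile-specific model can differ from a mean-based one.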
Affiliation(s)
- Eun Ryung Lee
  - Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
- Seyoung Park
  - Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
- Sang Kyu Lee
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48823, USA
  - Biostatistics Branch, National Cancer Institute, Bethesda, MD, 20892, USA
- Hyokyoung G Hong
  - Biostatistics Branch, National Cancer Institute, Bethesda, MD, 20892, USA
3
Liu Y, Li G. Sure Joint Screening for High Dimensional Cox's Proportional Hazards Model Under the Case-Cohort Design. J Comput Biol 2023; 30:663-677. [PMID: 37140454 PMCID: PMC10282795 DOI: 10.1089/cmb.2022.0416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023] Open
Abstract
This study develops a sure joint feature screening method for the case-cohort design with ultrahigh-dimensional covariates. Our method is based on a sparsity-restricted Cox proportional hazards model. An iterative reweighted hard thresholding algorithm is proposed to approximate the sparsity-restricted, pseudo-partial likelihood estimator for joint screening. We rigorously show that our method possesses the sure screening property, with the probability of retaining all relevant covariates tending to 1 as the sample size goes to infinity. Our simulation results demonstrate that the proposed procedure has substantially improved screening performance over some existing feature screening methods for the case-cohort design, especially when some covariates are jointly correlated, but marginally uncorrelated, with the event time outcome. A real data illustration is provided using breast cancer data with high-dimensional genomic covariates. We have implemented the proposed method using MATLAB and made it available to readers through GitHub.
Affiliation(s)
- Yi Liu
  - Department of Mathematics, School of Mathematical Sciences, Ocean University of China, Qingdao, China
- Gang Li
  - Department of Biostatistics, University of California at Los Angeles, Los Angeles, California, USA
4
Zhang L, Song X. Ultrahigh dimensional single index model estimation via refitted cross-validation. COMMUN STAT-THEOR M 2023. [DOI: 10.1080/03610926.2023.2179881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2023]
Affiliation(s)
- Lixia Zhang
  - School of Statistics, Beijing Normal University, Beijing, PR China
- Xuguang Song
  - School of Statistics, Beijing Normal University, Beijing, PR China
5
Li T, Yu J, Meng C. Scalable model-free feature screening via sliced-Wasserstein dependency. J Comput Graph Stat 2023. [DOI: 10.1080/10618600.2023.2183213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
Affiliation(s)
- Tao Li
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China
- Jun Yu
  - School of Mathematics and Statistics, Beijing Institute of Technology
- Cheng Meng
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China
6
Chen J, Bie R, Qin Y, Li Y, Ma S. Lq-based robust analytics on ultrahigh and high dimensional data. Stat Med 2022; 41:5220-5241. [PMID: 36098057 DOI: 10.1002/sim.9563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2021] [Revised: 06/02/2022] [Accepted: 08/02/2022] [Indexed: 11/10/2022]
Abstract
Ultrahigh and high dimensional data are common in regression analysis for various fields, such as omics data, finance, and biological engineering. In addition to the problem of dimension, the data might also be contaminated. There are two main types of contamination: outliers and model misspecification. We develop a unique method that takes into account the ultrahigh or high dimensional issues and both types of contamination. In this article, we propose a framework for feature screening and selection based on the minimum Lq-likelihood estimation (MLqE), which accounts for the model misspecification contamination issue and has also been shown to be robust to outliers. In numerical analysis, we explore the robustness of this framework under different outliers and model misspecification scenarios. To examine the performance of this framework, we conduct real data analysis using the skin cutaneous melanoma data. Compared with traditional screening and feature selection methods, the proposed method shows superiority in both variable identification effectiveness and parameter estimation accuracy.
Affiliation(s)
- Jiachen Chen
  - Department of Biostatistics, Boston University, Boston, MA, USA
- Ruofan Bie
  - Department of Biostatistics, Brown University, Providence, RI, USA
- Yichen Qin
  - Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinnati, OH, USA
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
- Shuangge Ma
  - Department of Biostatistics, Boston University, Boston, MA, USA
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
7
Forward variable selection for ultra-high dimensional quantile regression models. ANN I STAT MATH 2022. [DOI: 10.1007/s10463-022-00849-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
8
Ke H, Ren Z, Qi J, Chen S, Tseng GC, Ye Z, Ma T. High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression. Bioinformatics 2022; 38:4078-4087. [PMID: 35856716 PMCID: PMC9438953 DOI: 10.1093/bioinformatics/btac518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2022] [Revised: 06/29/2022] [Accepted: 07/19/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: The advancement of high-throughput technology characterizes a wide variety of epigenetic modifications and noncoding RNAs across the genome involved in disease pathogenesis via regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data makes it challenging to identify the important regulators of genes. Conducting a univariate test for each possible regulator-gene pair is subject to a serious multiple comparison burden, and direct application of regularization methods to select regulator-gene pairs is computationally infeasible. Applying fast screening to reduce dimension first before regularization is more efficient and stable than applying regularization methods alone. RESULTS: We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative in that it reduces the dimension of both predictor and response, and screens at both node (regulators or genes) and edge (regulator-gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in Kidney cancer and DNA methylation regulation in Glioblastoma Multiforme illustrate the validity and advantage of our method. AVAILABILITY AND IMPLEMENTATION: The R package, related source codes and real datasets used in this article are provided at https://github.com/kehongjie/rPCor. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongjie Ke
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA
- Zhao Ren
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Jianfei Qi
  - Department of Biochemistry and Molecular Biology, University of Maryland, Baltimore, MD 21201, USA
- Shuo Chen
  - Department of Epidemiology & Public Health, University of Maryland, Baltimore, MD 21201, USA
- George C Tseng
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Zhenyao Ye
  - Department of Epidemiology & Public Health, University of Maryland, Baltimore, MD 21201, USA
9
Hyde R, O'Grady L, Green M. Stability selection for mixed effect models with large numbers of predictor variables: A simulation study. Prev Vet Med 2022; 206:105714. [PMID: 35843027 DOI: 10.1016/j.prevetmed.2022.105714] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/28/2022] [Revised: 07/08/2022] [Accepted: 07/10/2022] [Indexed: 10/17/2022]
Abstract
Covariate selection when the number of available variables is large relative to the number of observations is problematic in epidemiology and remains the focus of continued research. Whilst a variety of statistical methods have been developed to attempt to overcome this issue, at present very few methods are available for wide data that include a clustered outcome. The purpose of this research was to make an empirical evaluation of a new method for covariate selection in wide data settings when the dependent variable is clustered. We used 3300 simulated datasets with a variety of defined structures and known sets of true predictor variables to conduct an empirical evaluation of a mixed model stability selection procedure. Comparison was made with an alternative method based on regularisation using the least absolute shrinkage and selection operator (Lasso) penalty. Model performance was assessed using several metrics including the true positive rate (proportion of true covariates selected in a final model) and false discovery rate (proportion of variables selected in a final model that were non-true (false) variables). For stability selection, the false discovery rate was consistently low, generally remaining ≤ 0.02 indicating that on average fewer than 1 in 50 of the variables selected in a final model were false variables. This was in contrast to the Lasso-based method in which the false discovery rate was between 0.59 and 0.72, indicating that generally more than 60% of variables selected in a final model were false variables. In contrast however, the Lasso method attained higher true positive rates than stability selection, although both methods achieved good results. For the Lasso method, true positive rates remained ≥ 0.93 whereas for stability selection the true positive rate was 0.73-0.97. Our results suggest both methods may be of value for covariate selection with high dimensional data with a clustered outcome. 
When high specificity is needed for identification of true covariates, stability selection appeared to offer the better solution, although with a slight loss of sensitivity. Conversely when high sensitivity is needed, the Lasso approach may be useful, even if accompanied by a substantial loss of specificity. Overall, the results indicated the loss of sensitivity when employing stability selection is relatively small compared to the loss of specificity when using the Lasso and therefore stability selection may provide the better option for the analyst when evaluating data of this type.
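The two headline metrics in this abstract follow directly from their definitions: the true positive rate is the share of true covariates that were selected, and the false discovery rate is the share of selected variables that are not true covariates. A minimal sketch is given below; the function name and the variable labels are hypothetical and are not taken from the study.

```python
def selection_rates(selected, true_vars):
    """Return (true positive rate, false discovery rate) for one fitted model.

    TPR = selected true covariates / all true covariates
    FDR = selected false variables / all selected variables
    """
    selected = set(selected)
    true_vars = set(true_vars)
    true_pos = len(selected & true_vars)
    tpr = true_pos / len(true_vars) if true_vars else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return tpr, fdr


# Hypothetical example: 4 true covariates, 3 variables selected, 1 of them false.
tpr, fdr = selection_rates(["x1", "x2", "x7"], ["x1", "x2", "x3", "x4"])
print(tpr, fdr)
```

Under these definitions, an FDR of at most 0.02 corresponds to fewer than 1 in 50 selected variables being false, which matches the interpretation given in the abstract.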
Affiliation(s)
- Robert Hyde
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
- Luke O'Grady
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
- Martin Green
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
10
Zhao S, Fu G. Distribution-free and model-free multivariate feature screening via multivariate rank distance correlation. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2022.105081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
11
Quantum secure privacy preserving technique to obtain the intersection of two datasets for contact tracing. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS 2022. [DOI: 10.1016/j.jisa.2022.103127] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
12
Statistical Methods with Applications in Data Mining: A Review of the Most Recent Works. MATHEMATICS 2022. [DOI: 10.3390/math10060993] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/04/2022]
Abstract
The importance of statistical methods in finding patterns and trends in otherwise unstructured and complex large sets of data has grown over the past decade, as the amount of data produced keeps growing exponentially and knowledge obtained from understanding data allows quick and informed decisions that save time and provide a competitive advantage. For this reason, we have seen considerable advances over the past few years in statistical methods in data mining. This paper is a comprehensive and systematic review of these recent developments in the area of data mining.
13
Epistasis Detection via the Joint Cumulant. STATISTICS IN BIOSCIENCES 2022. [DOI: 10.1007/s12561-022-09336-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
14
Manthena V, Jarquín D, Howard R. Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification. Front Genet 2022; 13:1032691. [PMID: 37065625 PMCID: PMC10090538 DOI: 10.3389/fgene.2022.1032691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 12/22/2022] [Indexed: 04/18/2023] Open
Abstract
Modern plant breeding programs collect several data types such as weather, images, and secondary or associated traits besides the main trait (e.g., grain yield). Genomic data is high-dimensional and often over-crowds smaller data types when naively combined to explain the response variable. There is a need to develop methods able to effectively combine different data types of differing sizes to improve predictions. Additionally, in the face of changing climate conditions, there is a need to develop methods able to effectively combine weather information with genotype data to predict the performance of lines better. In this work, we develop a novel three-stage classifier to predict multi-class traits by combining three data types: genomic, weather, and secondary trait. The method addressed various challenges in this problem, such as confounding, differing sizes of data types, and threshold optimization. The method was examined in different settings, including binary and multi-class responses, various penalization schemes, and class balances. Then, our method was compared to standard machine learning methods such as random forests and support vector machines using various classification accuracy metrics and using model size to evaluate the sparsity of the model. The results showed that our method performed similarly to or better than machine learning methods across various settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of relationships between the response and the selected predictors.
Affiliation(s)
- Vamsi Manthena
  - Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, United States
- Diego Jarquín
  - Agronomy Department, University of Florida, Gainesville, FL, United States
- Reka Howard
  - Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, United States
  - *Correspondence: Reka Howard,
15
Nandy D, Chiaromonte F, Li R. Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. J Am Stat Assoc 2022; 117:1516-1529. [PMID: 36172297 PMCID: PMC9512254 DOI: 10.1080/01621459.2020.1864380] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023]
Abstract
Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p >> n), only a small share of which is truly associated with the response. In these settings, major concerns on computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis. Along the lines of Sure Independence Screening (Fan and Lv, 2008) and other model- and correlation-based feature screening methods, we propose a model-free procedure called Covariate Information Number - Sure Independence Screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher Information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods.
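For orientation, the Sure Independence Screening baseline cited in this abstract ranks covariates by the absolute value of their marginal correlation with the response and keeps the top few. The sketch below illustrates that correlation-based baseline only; it is not the CIS procedure, and the function names, the retained size `d`, and the simulated data are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Keep the indices of the d covariates with the largest |marginal correlation|."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Simulated sparse problem (hypothetical): n = 100 observations, p = 500 covariates,
# with only covariates 0 and 5 truly associated with the response.
random.seed(0)
n, p = 100, 500
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * columns[0][i] - 3 * columns[5][i] + 0.1 * random.gauss(0, 1) for i in range(n)]
kept = sis_screen(columns, y, d=10)
print(kept)
```

Screening of this kind reduces 500 candidate covariates to 10 before any model is fitted, which is the dimension-reduction step the abstract motivates.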
Affiliation(s)
- Debmalya Nandy
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA (corresponding author)
- Francesca Chiaromonte
  - Department of Statistics, Penn State University, University Park, PA 16802, USA
  - Institute of Economics and EMbeDS, Sant’Anna School of Advanced Studies, Piazza Martiri della Libertà 33, Pisa 56127, Italy
- Runze Li
  - Department of Statistics, Penn State University, University Park, PA 16802, USA
16
Yang YF, Chen B, Xing LL, Chen JB, Xue HB, Guo KX. Controllable four-wave mixing in an atom–optical cavity coupling system with a second-order nonlinear crystal. JOURNAL OF THE OPTICAL SOCIETY OF AMERICA B 2022; 39:46. [DOI: 10.1364/josab.444507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Accepted: 11/16/2021] [Indexed: 09/01/2023]
Abstract
The four-wave mixing (FWM) effect has been systematically studied in an atom–optical cavity coupling system with a second-order nonlinear crystal (SOC), which is formed by coupling an optical cavity with a two-level atom and a SOC. In this research, it is found that the FWM effect largely depends on the SOC, because the SOC can promote a two-photon absorption process. Therefore, a tunable FWM signal can be obtained in this coupling system by controlling the SOC. Moreover, the results also show that the cavity decay rate plays an important role in controlling the FWM signal. By optimizing the cavity decay rate and the SOC, a strong FWM signal can be generated. In addition, by adjusting the cavity–pump detuning, conversion between a single-peak FWM signal and two-peak FWM signal can be easily realized.
Affiliation(s)
- Bin Chen
  - Taiyuan University of Technology
  - Ministry of Education and Shanxi Province
17
RHDSI: A novel dimensionality reduction based algorithm on high dimensional feature selection with interactions. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.06.096] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/23/2023]
18
Chen X, Li C, Zhang T, Gao Z. On correlation rank screening for ultra-high dimensional competing risks data. J Appl Stat 2021; 49:1848-1864. [DOI: 10.1080/02664763.2021.1884209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Xiaolin Chen
  - School of Statistics, Qufu Normal University, Qufu, People's Republic of China
- Chenguang Li
  - School of Statistics, Qufu Normal University, Qufu, People's Republic of China
- Tao Zhang
  - School of Science, Guangxi University of Science and Technology, Liuzhou, People's Republic of China
- Zhenlong Gao
  - School of Statistics, Qufu Normal University, Qufu, People's Republic of China
19
Lima E, Hyde R, Green M. Model selection for inferential models with high dimensional data: synthesis and graphical representation of multiple techniques. Sci Rep 2021; 11:412. [PMID: 33431921 PMCID: PMC7801732 DOI: 10.1038/s41598-020-79317-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/11/2020] [Accepted: 12/07/2020] [Indexed: 12/18/2022] Open
Abstract
Inferential research commonly involves identification of causal factors from within high dimensional data but selection of the 'correct' variables can be problematic. One specific problem is that results vary depending on the statistical method employed and it has been argued that triangulation of multiple methods is advantageous to safely identify the correct, important variables. To date, no formal method of triangulation has been reported that incorporates both model stability and coefficient estimates; in this paper we develop an adaptable, straightforward method to achieve this. Six methods of variable selection were evaluated using simulated datasets of different dimensions with known underlying relationships. We used a bootstrap methodology to combine stability matrices across methods and estimate aggregated coefficient distributions. Novel graphical approaches provided a transparent route to visualise and compare results between methods. The proposed aggregated method provides a flexible route to formally triangulate results across any chosen number of variable selection methods and provides a combined result that incorporates uncertainty arising from between-method variability. In these simulated datasets, the combined method generally performed as well as or better than the individual methods, with low error rates and clearer demarcation of the true causal variables than for the individual methods.
Affiliation(s)
- Eliana Lima
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, LE12 5RD, UK
- Robert Hyde
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, LE12 5RD, UK
- Martin Green
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, LE12 5RD, UK
20
Forward variable selection for sparse ultra-high-dimensional generalized varying coefficient models. JAPANESE JOURNAL OF STATISTICS AND DATA SCIENCE 2020. [DOI: 10.1007/s42081-020-00090-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
21
Liu W, Ke Y, Liu J, Li R. Model-Free Feature Screening and FDR Control With Knockoff Features. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1783274] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Affiliation(s)
- Wanjun Liu
  - Department of Statistics, The Pennsylvania State University, University Park, PA
- Yuan Ke
  - Department of Statistics, University of Georgia, Athens, GA
- Jingyuan Liu
  - MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Runze Li
  - Department of Statistics, The Pennsylvania State University, University Park, PA
22
Hong HG, Chen X, Kang J, Li Y. The Lq-norm learning for ultrahigh-dimensional survival data: an integrative framework. Stat Sin 2020; 30:1213-1233. [PMID: 32742137 PMCID: PMC7394456 DOI: 10.5705/ss.202017.0537] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
In the era of precision medicine, survival outcome data with high-throughput predictors are routinely collected. Models with an exceedingly large number of covariates are either infeasible to fit or likely to incur low predictability because of overfitting. Variable screening is key in identifying and removing irrelevant attributes. Recent years have seen a surge in screening methods, but most of them rely on some particular modeling assumptions. Motivated by a study on detecting gene signatures for multiple myeloma patients' survival, we propose a model-free Lq-norm learning procedure, which includes the well-known Cramér-von Mises and Kolmogorov criteria as two special cases. The work provides an integrative framework for detecting predictors with various levels of impact, such as short- or long-term impact, on censored outcome data. The framework naturally leads to a scheme which combines results from different q to reduce false negatives, an aspect often overlooked by the current literature. We show that our method possesses sure screening properties. The utility of the proposal is confirmed with simulation studies and an analysis of the multiple myeloma study.
Affiliation(s)
- H. G. Hong
  - Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48823, USA
- X. Chen
  - Center of Statistical Research, Southwestern University of Finance and Economics, China
- J. Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Y. Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, USA
23
|
Liu Y, Chen X. A new robust model-free feature screening method for ultra-high dimensional right censored data. COMMUN STAT-THEOR M 2020. [DOI: 10.1080/03610926.2020.1769672] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Affiliation(s)
- Yi Liu
- School of Mathematical Sciences, Ocean University of China, Qingdao, China
| | - Xiaolin Chen
- School of Statistics, Qufu Normal University, Qufu, China
| |
|
24
|
Dai X, Fu G, Reese R. Detecting PCOS susceptibility loci from genome-wide association studies via iterative trend correlation based feature screening. BMC Bioinformatics 2020; 21:177. [PMID: 32366216 PMCID: PMC7199379 DOI: 10.1186/s12859-020-3492-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/03/2019] [Accepted: 04/13/2020] [Indexed: 01/18/2023] Open
Abstract
Background: Feature screening plays a critical role in ultrahigh dimensional data analyses, where the number of features exponentially exceeds the number of observations. It is increasingly common in biomedical research to have a case-control (binary) response and extremely large-scale categorical features. However, approaches that consider such data types are limited in the extant literature. In this article, we propose a new feature screening approach based on the iterative trend correlation (ITC-SIS, for short) to detect important susceptibility loci associated with polycystic ovary syndrome (PCOS) affection status by screening 731,442 SNP features collected from genome-wide association studies. Results: We prove that the trend correlation based screening approach satisfies the theoretical strong screening consistency property under a set of reasonable conditions, which provides appealing theoretical support for its superior performance. We demonstrate through various simulation designs that the finite sample performance of ITC-SIS is accurate and fast. Conclusion: ITC-SIS serves as a good alternative method to detect disease susceptibility loci in clinical genomic data.
Affiliation(s)
- Xiaotian Dai
- Department of Mathematical Sciences, SUNY Binghamton University, New York, USA
| | - Guifang Fu
- Department of Mathematical Sciences, SUNY Binghamton University, New York, USA.
| | | |
|
25
|
Wang L, Ma X, Zhang J. Feature screening for ultrahigh-dimensional additive logistic models. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2019.08.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
|
26
|
Chu W, Li R, Liu J, Reimherr M. Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS. Ann Appl Stat 2020; 14:276-298. [PMID: 32802245 PMCID: PMC7426018 DOI: 10.1214/19-aoas1310] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/13/2023]
Abstract
Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be "fat genes." The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs.
Affiliation(s)
| | - Runze Li
- Department of Statistics and the Methodology Center, Pennsylvania State University
| | - Jingyuan Liu
- MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University
| | | |
|
27
|
Yang G, Yang S, Li R. Feature Screening in Ultrahigh Dimensional Generalized Varying-coefficient Models. Stat Sin 2020; 30:1049-1067. [PMID: 32982122 PMCID: PMC7516887 DOI: 10.5705/ss.202017.0362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
Generalized varying coefficient models are particularly useful for examining dynamic effects of covariates on a continuous, binary or count response. This paper is concerned with feature screening for generalized varying coefficient models with ultrahigh dimensional covariates. The proposed screening procedure is based on joint quasi-likelihood of all predictors, and therefore is distinguished from marginal screening procedures proposed in the literature. In particular, the proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response. In order to carry out the proposed procedure, we propose an effective algorithm and establish the ascent property of the proposed algorithm. We further prove that the proposed procedure possesses the sure screening property. That is, with probability tending to one, the selected variable set includes the actual active predictors. We examine the finite sample performance of the proposed procedure and compare it with existing ones via Monte Carlo simulations, and illustrate the proposed procedure by a real data example.
|
28
|
|
29
|
He Y, Zhang L, Ji J, Zhang X. Robust feature screening for elliptical copula regression model. J Multivariate Anal 2019. [DOI: 10.1016/j.jmva.2019.05.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 02/08/2023]
|
30
|
Yang G, Yao W, Xiang S. Sure independence screening in ultrahigh dimensional generalized additive models. J Stat Plan Inference 2019. [DOI: 10.1016/j.jspi.2018.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
|
31
|
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
| | - Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
| |
|
32
|
Zhang S, Pan J, Zhou Y. Robust conditional nonparametric independence screening for ultrahigh-dimensional data. Stat Probab Lett 2018. [DOI: 10.1016/j.spl.2018.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
33
|
Fang Y, Xu P, Yang J, Qin Y. A quantile regression forest based method to predict drug response and assess prediction reliability. PLoS One 2018; 13:e0205155. [PMID: 30289891 PMCID: PMC6173405 DOI: 10.1371/journal.pone.0205155] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/20/2017] [Accepted: 09/20/2018] [Indexed: 12/24/2022] Open
Abstract
Drug response prediction is a critical step for personalized treatment of cancer patients and ultimately leads to precision medicine. A lot of machine-learning based methods have been proposed to predict drug response from different types of genomic data. However, currently available methods could only give a "point" prediction of drug response value but fail to provide the reliability and distribution of the prediction, which are of equal interest in clinical practice. In this paper, we proposed a method based on quantile regression forest and applied it to the CCLE dataset. Through the out-of-bag validation, our method achieved much higher prediction accuracy of drug response than other available tools. The assessment of prediction reliability by prediction intervals and its significance in personalized medicine were illustrated by several examples. Functional analysis of selected drug response associated genes showed that the proposed method achieves more biologically plausible results.
Affiliation(s)
- Yun Fang
- Department of Mathematics, Shanghai Normal University, Shanghai, China
| | - Peirong Xu
- Department of Mathematics, Shanghai Normal University, Shanghai, China
| | - Jialiang Yang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
| | - Yufang Qin
- College of Information Technology, Shanghai Ocean University, Shanghai, China
| |
|
34
|
Al-Fakih AM, Algamal ZY, Lee MH, Aziz M. A penalized quantitative structure-property relationship study on melting point of energetic carbocyclic nitroaromatic compounds using adaptive bridge penalty. SAR and QSAR in Environmental Research 2018; 29:339-353. [PMID: 29493376 DOI: 10.1080/1062936x.2018.1439531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/02/2017] [Accepted: 02/07/2018] [Indexed: 06/08/2023]
Abstract
A penalized quantitative structure-property relationship (QSPR) model with adaptive bridge penalty for predicting the melting points of 92 energetic carbocyclic nitroaromatic compounds is proposed. To ensure the consistency of the descriptor selection of the proposed penalized adaptive bridge (PBridge), we proposed a ridge estimator ([Formula: see text]) as an initial weight in the adaptive bridge penalty. The Bayesian information criterion was applied to ensure the accurate selection of the tuning parameter ([Formula: see text]). The PBridge based model was internally and externally validated based on [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], the Y-randomization test, [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text] and the applicability domain. The validation results indicate that the model is robust and not due to chance correlation. The descriptor selection and prediction performance of PBridge for the training dataset outperforms the other methods used. PBridge shows the highest [Formula: see text] of 0.959, [Formula: see text] of 0.953, [Formula: see text] of 0.949 and [Formula: see text] of 0.959, and the lowest [Formula: see text] and [Formula: see text]. For the test dataset, PBridge shows a higher [Formula: see text] of 0.945 and [Formula: see text] of 0.948, and a lower [Formula: see text] and [Formula: see text], indicating its better prediction performance. The results clearly reveal that the proposed PBridge is useful for constructing reliable and robust QSPRs for predicting melting points prior to synthesizing new organic compounds.
Affiliation(s)
- A M Al-Fakih
- Faculty of Science, Department of Chemistry, Universiti Teknologi Malaysia, Johor, Malaysia
- Faculty of Science, Department of Chemistry, Sana'a University, Sana'a, Yemen
| | - Z Y Algamal
- Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
| | - M H Lee
- Faculty of Science, Department of Mathematical Sciences, Universiti Teknologi Malaysia, Johor, Malaysia
| | - M Aziz
- Faculty of Science, Department of Chemistry, Universiti Teknologi Malaysia, Johor, Malaysia
- Advanced Membrane Technology Centre, Universiti Teknologi Malaysia, Johor, Malaysia
| |
|
35
|
Chen X. Model-free conditional feature screening for ultra-high dimensional right censored data. J Stat Comput Sim 2018. [DOI: 10.1080/00949655.2018.1466142] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023]
Affiliation(s)
- Xiaolin Chen
- School of Statistics, Qufu Normal University, Qufu, People's Republic of China
| |
|
36
|
Liu Y, Chen X. Quantile screening for ultra-high-dimensional heterogeneous data conditional on some variables. J Stat Comput Sim 2017. [DOI: 10.1080/00949655.2017.1389944] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 10/18/2022]
Affiliation(s)
- Yi Liu
- College of Science, China University of Petroleum (East China), Qingdao, People's Republic of China
| | - Xiaolin Chen
- School of Statistics, Qufu Normal University, Qufu, People's Republic of China
| |
|
37
|
Al-Fakih AM, Algamal ZY, Lee MH, Aziz M. A sparse QSRR model for predicting retention indices of essential oils based on robust screening approach. SAR and QSAR in Environmental Research 2017; 28:691-703. [PMID: 28976224 DOI: 10.1080/1062936x.2017.1375010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/30/2017] [Accepted: 08/30/2017] [Indexed: 06/07/2023]
Abstract
A robust screening approach and a sparse quantitative structure-retention relationship (QSRR) model for predicting retention indices (RIs) of 169 constituents of essential oils is proposed. The proposed approach is represented in two steps. First, dimension reduction was performed using the proposed modified robust sure independence screening (MR-SIS) method. Second, prediction of RIs was made using the proposed robust sparse QSRR with smoothly clipped absolute deviation (SCAD) penalty (RSQSRR). The RSQSRR model was internally and externally validated based on [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], Y-randomization test, [Formula: see text], [Formula: see text], and the applicability domain. The validation results indicate that the model is robust and not due to chance correlation. The descriptor selection and prediction performance of the RSQSRR for training dataset outperform the other two used modelling methods. The RSQSRR shows the highest [Formula: see text], [Formula: see text], and [Formula: see text], and the lowest [Formula: see text]. For the test dataset, the RSQSRR shows a high external validation value ([Formula: see text]), and a low value of [Formula: see text] compared with the other methods, indicating its higher predictive ability. In conclusion, the results reveal that the proposed RSQSRR is an efficient approach for modelling high dimensional QSRRs and the method is useful for the estimation of RIs of essential oils that have not been experimentally tested.
Affiliation(s)
- A M Al-Fakih
- Department of Chemistry, Universiti Teknologi Malaysia, Johor, Malaysia
- Department of Chemistry, Faculty of Science, Sana'a University, Sana'a, Yemen
| | - Z Y Algamal
- Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
| | - M H Lee
- Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia
| | - M Aziz
- Department of Chemistry, Universiti Teknologi Malaysia, Johor, Malaysia
- Advanced Membrane Technology Centre, Universiti Teknologi Malaysia, Johor, Malaysia
| |
|
38
|
Habib G. Chemical and optical properties of PM 2.5 from on-road operation of light duty vehicles in Delhi city. The Science of the Total Environment 2017; 586:900-916. [PMID: 28238373 DOI: 10.1016/j.scitotenv.2017.02.070] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 10/15/2016] [Revised: 01/26/2017] [Accepted: 02/08/2017] [Indexed: 06/06/2023]
Abstract
This study reports emission factors of PM2.5, elemental carbon (EC), organic carbon (OC), ions, trace elements and mass absorption cross-sections (MAC) of aerosol emitted from the on-road operation of light duty vehicles of different vintages. A portable dilution system was used to achieve complete quenching of aerosol at near ambient condition. The particles were collected on the filters and analyzed for chemical and light absorbing properties of aerosol. The diesel-powered passenger cars emitted higher PM2.5 (56-356 mg km⁻¹) with a large fraction of EC (37-65%), while emissions from gasoline (46-78 mg km⁻¹), and CNG vehicles (33-34 mg km⁻¹) were low and contained low EC (5-15%) and remarkably high OC (46-91%). The MAC of aerosols for diesel vehicles (32-208 m² g⁻¹ of PM2.5) were well explained by EC content (31-62%) and showed similarity with MAC values reported for wood fuel combustion in cooking stoves indicating the two sources cannot be resolved on the basis of light absorption properties in source apportionment studies. Ionic contributions to PM2.5 were highest for 4W-gasoline (11-19%) compared to 4W-diesel (7-11%), and CNG (9-10%). The abundance of ions such as Na⁺, Ca²⁺, SO₄²⁻, NO₃⁻, and NH₄⁺ could be due to use of lubricant oil and abrasive nature of engine of old vehicles. Trace elements (Al, Fe, Zn, Pb, and Cu) emitted from after-treatment devices, additives in lube oil, and wearing of engine components, were found to be 2-14%, 3-8% and 11-12% of total PM2.5 for 4W of diesel, gasoline, and CNG respectively. This study indicates that aerosol emissions from on-road vehicles show a strong dependency on vehicle maintenance, engine type and after-treatment techniques.
|
39
|
Fu G, Wang G, Dai X. An adaptive threshold determination method of feature screening for genomic selection. BMC Bioinformatics 2017; 18:212. [PMID: 28403836 PMCID: PMC5389084 DOI: 10.1186/s12859-017-1617-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/09/2016] [Accepted: 03/28/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and is important to the prediction of the trait. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand, but poses challenges. We propose a backward elimination iterative distance correlation (BE-IDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also solving the unclear threshold issue for traditional feature screening approaches. Results: Verified through six simulations, the adaptive threshold estimated by BE-IDC performed uniformly better than the fixed threshold methods used in the current literature. We also applied BE-IDC to an Arabidopsis thaliana genome-wide dataset. Out of 216,130 SNPs, BE-IDC selected four influential SNPs and confirmed the same FRIGIDA gene that was reported by two other traditional methods. Conclusions: BE-IDC accommodates both the prediction accuracy and the computational speed that are highly demanded in genomic selection.
Affiliation(s)
- Guifang Fu
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA.
| | - Gang Wang
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA
| | - Xiaotian Dai
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA
| |
|