151
|
Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M, Oualkacha K, Greenwood CMT. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 2020; 16:e1008766. [PMID: 32365090 PMCID: PMC7224575 DOI: 10.1371/journal.pgen.1008766] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/12/2019] [Revised: 05/14/2020] [Accepted: 04/08/2020] [Indexed: 12/23/2022]
Abstract
Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMM) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect called ggmix for simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models compared to the two-stage approach or principal component adjustment with better prediction accuracy. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are available in an R package available on CRAN (https://cran.r-project.org/package=ggmix).
Affiliation(s)
- Sahir R. Bhatnagar
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
- Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada
| | - Yi Yang
- Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada
| | - Tianyuan Lu
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
| | - Erwin Schurr
- Department of Medicine, McGill University, Montréal, Québec, Canada
| | - JC Loredo-Osti
- Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada
| | - Marie Forest
- École de Technologie Supérieure, Montréal, Québec, Canada
| | - Karim Oualkacha
- Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada
| | - Celia M. T. Greenwood
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
- Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
|
152
|
Detmer FJ, Cebral J, Slawski M. A note on coding and standardization of categorical variables in (sparse) group lasso regression. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2019.08.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
|
153
|
Evaluation of secondary ions related to plant tissue using least absolute shrinkage and selection operator. Biointerphases 2020; 15:021010. [PMID: 32272844 DOI: 10.1116/6.0000010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
With regard to life sciences, it is important to understand biological functions such as metabolic reactions at the cellular level. Time-of-flight secondary ion mass spectrometry (TOF-SIMS) that can provide chemical mappings at 100 nm lateral resolutions is useful for obtaining three-dimensional maps of biological molecules in cells and tissues. TOF-SIMS spectra generally contain several hundred to several thousand secondary ion peaks that provide detailed chemical information. In order to manage such a large number of peaks, data analysis methods such as multivariate analysis techniques have been applied to TOF-SIMS data of complex samples. However, the interpretation of the data analysis results is sometimes still difficult, especially for biological samples. In this study, TOF-SIMS data of resin-embedded plant samples were analyzed using one of the sparse modeling methods, least absolute shrinkage and selection operator (LASSO), to directly select secondary ions related to biological structures such as cell walls and nuclei. The same sample was measured by optical microscopy and the same measurement area as TOF-SIMS was extracted in order to prepare a target image for LASSO. The same area of the TOF-SIMS and microscope data were fused to evaluate the influence of the image fusion on the TOF-SIMS spectrum information using principal component analysis. Specifically, the authors examined onion mycorrhizal root colonized with Gigaspora margarita (an arbuscular mycorrhizal fungus). The results showed that by employing this approach using LASSO, important secondary ions from biological samples were effectively selected and could be clearly distinguished from the embedding resin.
|
154
|
Investigating matrix effects of different combinations of lipids and peptides on TOF-SIMS data. Biointerphases 2020; 15:021008. [PMID: 32241114 DOI: 10.1116/6.0000036] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Matrix effects, which cause a change in ion intensity, occur in mass spectrometry methods including time-of-flight secondary ion mass spectrometry (TOF-SIMS). Matrix effects often cause large issues in quantitative analysis because secondary ions related to a particular molecule could be dramatically enhanced or suppressed regardless of the concentration. To investigate matrix effects in biological samples, the authors evaluated mixed lipid {POPC [1-palmitoyl-2-oleoyl-sn-glycero-3-phosphatidylcholine, molecular weight (MW) 759.6]}, peptide [leu-enkephalin, neo-leu-enkephalin (amino acid sequence: YAGFL, MW 569.3), and neo-angiotensin II (amino acid sequence: DRVYIHAF, MW 1019.5)] samples. Matrix effect features were investigated by analyzing the concentration dependence of secondary ions in lipid-peptide mixed samples to develop a method that enables quantitative analysis using TOF-SIMS. Matrix effects depended on the lipid-peptide combination. Interestingly, some secondary ions possessed an intensity that was highly dependent on concentration.
|
155
|
Classical and Deep Learning Paradigms for Detection and Validation of Key Genes of Risky Outcomes of HCV. ALGORITHMS 2020. [DOI: 10.3390/a13030073] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022]
Abstract
Hepatitis C virus (HCV) is one of the most dangerous viruses worldwide. It is the foremost cause of hepatic cirrhosis and hepatocellular carcinoma (HCC). Detecting new key genes that play a role in the growth of HCC in HCV patients using machine learning techniques paves the way for producing accurate antivirals. In this work, there are two phases: detecting the up/downregulated genes using classical univariate and multivariate feature selection methods, and validating the retrieved list of genes using in silico classifiers. However, classification algorithms in the medical domain frequently suffer from a shortage of training cases. Therefore, a deep neural network approach is proposed here to validate the significance of the retrieved genes in distinguishing HCV-infected samples from uninfected ones. The validation model is based on the artificial generation of new examples from the retrieved genes' expressions using sparse autoencoders. Subsequently, the generated gene expression data are used to train conventional classifiers. Our results in the first phase yielded a better retrieval of significant genes using Principal Component Analysis (PCA), a multivariate approach. The list of genes retrieved using PCA contained more HCC biomarkers than the lists retrieved by the univariate methods. In the second phase, the classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and uninfected samples.
|
156
|
Gupta S, Lee REC, Faeder JR. Parallel Tempering with Lasso for model reduction in systems biology. PLoS Comput Biol 2020; 16:e1007669. [PMID: 32150537 PMCID: PMC7082068 DOI: 10.1371/journal.pcbi.1007669] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/24/2019] [Revised: 03/19/2020] [Accepted: 01/20/2020] [Indexed: 01/08/2023]
Abstract
Systems Biology models reveal relationships between signaling inputs and observable molecular or cellular behaviors. The complexity of these models, however, often obscures key elements that regulate emergent properties. We use a Bayesian model reduction approach that combines Parallel Tempering with Lasso regularization to identify minimal subsets of reactions in a signaling network that are sufficient to reproduce experimentally observed data. The Bayesian approach finds distinct reduced models that fit data equivalently. A variant of this approach that uses Lasso to perform selection at the level of reaction modules is applied to the NF-κB signaling network to test the necessity of feedback loops for responses to pulsatile and continuous pathway stimulation. Taken together, our results demonstrate that Bayesian parameter estimation combined with regularization can isolate and reveal core motifs sufficient to explain data from complex signaling systems. Cells respond to diverse environmental cues using complex networks of interacting proteins and other biomolecules. Mathematical and computational models have become invaluable tools to understand these networks and make informed predictions to rationally perturb cell behavior. However, the complexity of detailed models that try to capture all known biochemical elements of signaling networks often makes it difficult to determine the key regulatory elements that are responsible for specific cell behaviors. Here, we present a Bayesian computational approach, PTLasso, to automatically extract minimal subsets of detailed models that are sufficient to explain experimental data. The method simultaneously calibrates and reduces models, and the Bayesian approach samples globally, allowing us to find alternate mechanistic explanations for the data if present. 
We demonstrate the method on both synthetic and real biological data and show that PTLasso is an effective method to isolate distinct parts of a larger signaling model that are sufficient for specific data.
Affiliation(s)
- Sanjana Gupta
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
| | - Robin E C Lee
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
| | - James R Faeder
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
|
157
|
Detmer FJ, Mut F, Slawski M, Hirsch S, Bijlenga P, Cebral JR. Incorporating variability of patient inflow conditions into statistical models for aneurysm rupture assessment. Acta Neurochir (Wien) 2020; 162:553-566. [PMID: 32008209 DOI: 10.1007/s00701-020-04234-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/13/2019] [Accepted: 01/18/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND Hemodynamic patterns have been associated with cerebral aneurysm instability. For patient-specific computational fluid dynamics (CFD) simulations, the inflow rates of a patient are typically not known. The aim of this study was to analyze the influence of inter- and intra-patient variations of cerebral blood flow on the computed hemodynamics through CFD simulations and to incorporate these variations into statistical models for aneurysm rupture prediction. METHODS Image data of 1820 aneurysms were used for patient-specific steady CFD simulations with nine different inflow rates per case, capturing inter- and intra-patient flow variations. Based on the computed flow fields, 17 hemodynamic parameters were calculated and compared for the different flow conditions. Next, statistical models for aneurysm rupture were trained in 1571 of the aneurysms including hemodynamic parameters capturing the flow variations either by defining hemodynamic "response variables" (model A) or repeatedly randomly selecting flow conditions by patients (model B) as well as morphological and patient-specific variables. Both models were evaluated in the remaining 249 cases. RESULTS All hemodynamic parameters were significantly different for the varying flow conditions (p < 0.001). Both the flow-independent "response" model A and the flow-dependent model B performed well with areas under the receiver operating characteristic curve of 0.8182 and 0.8174 ± 0.0045, respectively. CONCLUSIONS The influence of inter- and intra-patient flow variations on computed hemodynamics can be taken into account in multivariate aneurysm rupture prediction models achieving a good predictive performance. Such models can be applied to CFD data independent of the specific inflow boundary conditions.
Affiliation(s)
- Felicitas J Detmer
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA.
| | - Fernando Mut
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
| | - Martin Slawski
- Statistics Department, George Mason University, Fairfax, VA, USA
| | - Sven Hirsch
- Institute of Applied Simulation, ZHAW University of Applied Sciences, Wädenswil, Switzerland
| | - Philippe Bijlenga
- Neurosurgery, Clinical Neurosciences Department, Geneva University Hospital and Faculty of Medicine, Geneva University, Geneva, Switzerland
| | - Juan R Cebral
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
|
158
|
Xie X, Zhang H, Wang J, Chang Q, Wang J, Pal NR. Learning Optimized Structure of Neural Networks by Hidden Node Pruning With L1 Regularization. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1333-1346. [PMID: 31765323 DOI: 10.1109/tcyb.2019.2950105] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
We propose three different methods to determine the optimal number of hidden nodes based on L1 regularization for a multilayer perceptron network. The first two methods, respectively, use a set of multiplier functions and multipliers for the hidden-layer nodes and implement the L1 regularization on those, while the third method equipped with the same multipliers uses a smoothing approximation of the L1 regularization. Each of these methods begins with a given number of hidden nodes, then the network is trained to obtain an optimal architecture discarding redundant hidden nodes using the multiplier functions or multipliers. A simple and generic method, namely, the matrix-based convergence proving method (MCPM), is introduced to prove the weak and strong convergence of the presented smoothing algorithms. The performance of the three pruning methods has been tested on 11 different classification datasets. The results demonstrate the efficient pruning abilities and competitive generalization by the proposed methods. The theoretical results are also validated by the results.
|
159
|
Reps JM, Cepeda MS, Ryan PB. Wisdom of the CROUD: Development and validation of a patient-level prediction model for opioid use disorder using population-level claims data. PLoS One 2020; 15:e0228632. [PMID: 32053653 PMCID: PMC7017997 DOI: 10.1371/journal.pone.0228632] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/24/2019] [Accepted: 01/21/2020] [Indexed: 11/18/2022]
Abstract
OBJECTIVE Some patients who are given opioids for pain could develop opioid use disorder. If it was possible to identify patients who are at a higher risk of opioid use disorder, then clinicians could spend more time educating these patients about the risks. We develop and validate a model to predict a person's future risk of opioid use disorder at the point before being dispensed their first opioid. METHODS A cohort study patient-level prediction using four US claims databases with target populations ranging between 343,552 and 384,424 patients. The outcome was recorded diagnosis of opioid abuse, dependency or unspecified drug abuse as a proxy for opioid use disorder from 1 day until 365 days after the first opioid is dispensed. We trained a regularized logistic regression using candidate predictors consisting of demographics and any conditions, drugs, procedures or visits prior to the first opioid. We then selected the top predictors and created a simple 8 variable score model. RESULTS We estimated the percentage of new users of opioids with reported opioid use disorder within a year to range between 0.04%-0.26% across US claims data. We developed an 8 variable Calculator of Risk for Opioid Use Disorder (CROUD) score, derived from the prediction models to stratify patients into higher and lower risk groups. The 8 baseline variables were age 15-29, medical history of substance abuse, mood disorder, anxiety disorder, low back pain, renal impairment, painful neuropathy and recent ER visit. 1.8% of people were in the high risk group for opioid use disorder and had a score ≥ 23 with the model obtaining a sensitivity of 13%, specificity of 98% and PPV of 1.14% for predicting opioid use disorder. CONCLUSIONS CROUD could be used by clinicians to obtain personalized risk scores. CROUD could be used to further educate those at higher risk and to personalize new opioid dispensing guidelines such as urine testing. 
Due to the high false positive rate, it should not be used for contraindication or to restrict utilization.
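Note: the sensitivity, specificity and PPV quoted in this abstract are linked by Bayes' rule, so the outcome prevalence they jointly imply can be checked against the reported 0.04%-0.26% range. The sketch below is only an illustrative consistency check on the rounded figures from the abstract; the function names are hypothetical and are not taken from the paper or its software.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def implied_prevalence(sensitivity, specificity, ppv_value):
    """Invert the PPV formula to recover the prevalence it implies."""
    fpr = 1.0 - specificity  # false positive rate
    return ppv_value * fpr / (sensitivity * (1.0 - ppv_value) + ppv_value * fpr)


# Rounded figures quoted in the abstract: sensitivity 13%, specificity 98%, PPV 1.14%.
prev = implied_prevalence(0.13, 0.98, 0.0114)
print(f"implied prevalence: {prev:.3%}")
print(f"round-trip PPV:     {ppv(0.13, 0.98, prev):.3%}")
```

Because the inputs are rounded, the implied prevalence is approximate; it falls inside the range the abstract reports, which also illustrates why a rare outcome yields a low PPV even at 98% specificity.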
Affiliation(s)
- Jenna Marie Reps
- Janssen Research and Development Titusville, Titusville, NJ, United States of America
| | - M. Soledad Cepeda
- Janssen Research and Development Titusville, Titusville, NJ, United States of America
| | - Patrick B. Ryan
- Janssen Research and Development Titusville, Titusville, NJ, United States of America
|
160
|
Li Y, Sun C, Li P, Zhao Y, Mensah GK, Xu Y, Guo H, Chen J. Hypernetwork Construction and Feature Fusion Analysis Based on Sparse Group Lasso Method on fMRI Dataset. Front Neurosci 2020; 14:60. [PMID: 32116508 PMCID: PMC7029661 DOI: 10.3389/fnins.2020.00060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/30/2019] [Accepted: 01/15/2020] [Indexed: 01/21/2023]
Abstract
Recent works have shown that the resting-state brain functional connectivity hypernetwork, where multiple nodes can be connected, is an effective technique for brain disease diagnosis and classification research. The lasso method was used to construct hypernetworks by solving sparse linear regression models in previous research. However, constructing a hypernetwork based on the lasso method simply selects a single variable, in that it lacks the ability to interpret the grouping effect. Considering the group structure problem, the previous study proposed to create a hypernetwork based on the elastic net and the group lasso methods, and the results showed that the former method had the best classification performance. However, the highly correlated variables selected by the elastic net method were not necessarily in the active set in the group. Therefore, we extended our research to address this issue. Herein, we propose a new method that introduces the sparse group lasso method to improve the construction of the hypernetwork by solving the group structure problem of the brain regions. We used the traditional lasso, group lasso method, and sparse group lasso method to construct a hypernetwork in patients with depression and normal subjects. Meanwhile, other clustering coefficients (clustering coefficients based on pairs of nodes) were also introduced to extract features with traditional clustering coefficients. Two types of features with significant differences obtained after feature selection were subjected to multi-kernel learning for feature fusion and classification using each method, respectively. The network topology results revealed differences among the three networks, where the hypernetwork using the lasso method was the strictest; the group lasso, most lenient; and the sgLasso method, moderate. The network topology of the sparse group lasso method was similar to that of the group lasso method but different from the lasso method. 
The classification results show that the sparse group lasso method achieves the best classification accuracy by using multi-kernel learning, which indicates that better classification performance can be achieved when the group structure exists and is properly extended.
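Note: the abstract does not state the objective function, so the sketch below assumes the standard sparse group lasso penalty from the literature, λ1‖β‖₁ + λ2 Σ_g ‖β_g‖₂, without group-size weights. It shows the proximal step for that penalty (element-wise soft-thresholding followed by group-wise shrinkage), which is what lets the method zero out whole groups and also individual coefficients inside the groups it keeps. All names and numbers are hypothetical and do not come from the paper.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def sparse_group_lasso_prox(v, groups, lam1, lam2):
    """Proximal operator of lam1*||b||_1 + lam2*sum_g ||b_g||_2.

    v      : list of floats (point to be shrunk)
    groups : list of index lists forming a partition of range(len(v))
    """
    out = [0.0] * len(v)
    for idx in groups:
        # Step 1: lasso part, applied coefficient by coefficient.
        u = [soft_threshold(v[i], lam1) for i in idx]
        # Step 2: group lasso part, applied to the whole group at once.
        norm = math.sqrt(sum(x * x for x in u))
        if norm > lam2:
            scale = 1.0 - lam2 / norm
            for i, x in zip(idx, u):
                out[i] = scale * x
        # Otherwise the entire group stays at zero.
    return out


# Toy input: one strong group and one weak group (made-up values).
coef = sparse_group_lasso_prox([3.0, -0.5, 0.2, 0.1], [[0, 1], [2, 3]], 0.3, 0.5)
print(coef)
```

Setting `lam2 = 0` reduces this step to the plain lasso and setting `lam1 = 0` to the group lasso, which matches the abstract's description of the sparse group lasso as intermediate between the two.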
Affiliation(s)
- Yao Li
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Chao Sun
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Pengzu Li
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Yunpeng Zhao
- College of Arts, Taiyuan University of Technology, Taiyuan, China
| | - Godfred Kim Mensah
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Yong Xu
- Department of Psychiatry, First Hospital of Shanxi Medical University, Taiyuan, China
| | - Hao Guo
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Junjie Chen
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
|
161
|
|
162
|
Noorie Z, Afsari F. Sparse feature selection: Relevance, redundancy and locality structure preserving guided by pairwise constraints. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.105956] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/27/2022]
|
163
|
Prediction in Cancer Genomics Using Topological Signatures and Machine Learning. TOPOLOGICAL DATA ANALYSIS 2020. [DOI: 10.1007/978-3-030-43408-3_10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/07/2023]
|
164
|
Huang S, Garshick E, Weschler LB, Hong C, Li J, Li L, Qu F, Gao D, Zhou Y, Sundell J, Zhang Y, Koutrakis P. Home environmental and lifestyle factors associated with asthma, rhinitis and wheeze in children in Beijing, China. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2020; 256:113426. [PMID: 31672368 PMCID: PMC7050389 DOI: 10.1016/j.envpol.2019.113426] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/12/2019] [Revised: 10/15/2019] [Accepted: 10/15/2019] [Indexed: 05/04/2023]
Abstract
BACKGROUND The prevalence of asthma and allergic diseases has increased rapidly in urban China since 2000. There has been limited study of associations between home environmental and lifestyle factors with asthma and symptoms of allergic disease in China. METHODS In a cross-sectional analysis of 2214 children in Beijing, we applied a two-step hybrid Least Absolute Shrinkage and Selection Operator (LASSO) algorithm to identify environmental and lifestyle-related factors associated with asthma, rhinitis and wheeze from a wide range of candidates. We used group LASSO to select variables, using cross-validation as the criterion. Effect estimates were then calculated using adaptive LASSO. Model performance was assessed using Area Under the Curve (AUC) values. RESULTS We found a number of environmental and lifestyle-related factors significantly associated with asthma, rhinitis or wheeze, which changed the probability of asthma, rhinitis or wheeze from -5.76% (95%CI: -7.74%, -3.79%) to 27.4% (95%CI: 16.6%, 38.3%). The three factors associated with the largest change in probability of asthma were short birth length, carpeted floor and paternal allergy; for rhinitis they were maternal smoking during pregnancy, paternal allergy and living close to industrial area; and for wheeze they were carpeted floor, short birth length and maternal allergy. Other home environmental risk factors identified were living close to a highway, industrial area or river, sharing bedroom, cooking with gas, furry pets, cockroaches, incense, printer/photocopier, TV, damp, and window condensation in winter. Lifestyle-related risk factors were child caretakers other than parents, and age<3 for the day-care. Other risk factors included use of antibiotics, and mother's occupation. Major protective factors for wheeze were living in a rural/suburban region, air conditioner use, and mother's occupation in healthcare. 
CONCLUSIONS Our findings suggest that changes in lifestyle and indoor environments associated with the urbanization and industrialization of China are associated with asthma, rhinitis, and wheeze in children.
Affiliation(s)
- Shaodan Huang
- Department of Building Science, Tsinghua University, Beijing, 100084, China; Beijing Key Lab of Indoor Air Quality Evaluation and Control, Beijing, 100084, China; Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, 02115, USA
| | - Eric Garshick
- Pulmonary, Allergy, Sleep, and Critical Care Medicine Section, Medical Service, VA Boston Healthcare System, Boston, MA, 02132, USA; Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, 02115, USA
| | - Louise B Weschler
- Department of Building Science, Tsinghua University, Beijing, 100084, China; 161 Richdale Road, Colts Neck, NJ, 07722, USA
| | - Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
| | - Jing Li
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, 02115, USA.
| | - Linyan Li
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, 02115, USA
| | - Fang Qu
- Department of Building Science, Tsinghua University, Beijing, 100084, China; China Meteorological Administration Training Centre, China Meteorological Administration, Beijing, 100081, China
| | - Dewen Gao
- Beijing Key Lab of Indoor Air Quality Evaluation and Control, Beijing, 100084, China
| | - Yanmin Zhou
- School of Architecture, Tsinghua University, Beijing, 100084, China; Beijing Key Lab of Indoor Air Quality Evaluation and Control, Beijing, 100084, China
| | - Jan Sundell
- School of Environmental Science and Engineering, Tianjin University, Tianjin, 300072, China
| | - Yinping Zhang
- Department of Building Science, Tsinghua University, Beijing, 100084, China; Beijing Key Lab of Indoor Air Quality Evaluation and Control, Beijing, 100084, China
| | - Petros Koutrakis
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, 02115, USA
|
165
|
Zhang X, Zhang Q, Wang X, Ma S, Fang K. Structured sparse logistic regression with application to lung cancer prediction using breath volatile biomarkers. Stat Med 2019; 39:955-967. [PMID: 31880351 DOI: 10.1002/sim.8454] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/27/2019] [Revised: 09/24/2019] [Accepted: 11/21/2019] [Indexed: 11/10/2022]
Abstract
This article is motivated by a study of lung cancer prediction using breath volatile organic compound (VOC) biomarkers, where the challenge is that the predictors include not only high-dimensional time-dependent or functional VOC features but also the time-independent clinical variables. We consider a high-dimensional logistic regression and propose two different penalties: group spline-penalty or group smooth-penalty to handle the group structures of the time-dependent variables in the model. The new methods have the advantage for the situation where the model coefficients are sparse but change smoothly within the group, compared with other existing methods such as the group lasso and the group bridge approaches. Our methods are easy to implement since they can be turned into a group minimax concave penalty problem after certain transformations. We show that our fitting algorithm possesses the descent property and leads to attractive convergence properties. The simulation studies and the lung cancer application are performed to demonstrate the accuracy and stability of the proposed approaches.
Affiliation(s)
- Xiaochen Zhang
- Department of Statistics, School of Economics, Xiamen University, China
| | - Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, China; The Wang Yanan Institute for Studies in Economics, Xiamen University, China
| | - Xiaofeng Wang
- Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, Ohio
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
| | - Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, China
|
166
|
Hesamian G, Akbari MG. Fuzzy Lasso regression model with exact explanatory variables and fuzzy responses. Int J Approx Reason 2019. [DOI: 10.1016/j.ijar.2019.10.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
|
167
|
Groll A, Hambuckers J, Kneib T, Umlauf N. LASSO-type penalization in the framework of generalized additive models for location, scale and shape. Comput Stat Data Anal 2019. [DOI: 10.1016/j.csda.2019.06.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/30/2022]
|
168
|
Honda T, Ing CK, Wu WY. Adaptively weighted group Lasso for semiparametric quantile regression models. BERNOULLI 2019. [DOI: 10.3150/18-bej1091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
|
169
|
Zègre-Hemsey JK, Asafu-Adjei J, Fernandez A, Brice J. Characteristics of Prehospital Electrocardiogram Use in North Carolina Using a Novel Linkage of Emergency Medical Services and Emergency Department Data. PREHOSP EMERG CARE 2019; 23:772-779. [PMID: 30885071 PMCID: PMC6751030 DOI: 10.1080/10903127.2019.1597230] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/13/2018] [Revised: 03/14/2019] [Accepted: 03/14/2019] [Indexed: 10/27/2022]
Abstract
Objective: Prehospital electrocardiography (ECG) is recommended for patients with suspected acute coronary syndrome (ACS), yet only 20-80% of chest pain patients receive a prehospital ECG. Less is known about prehospital ECG use in patients with less common complaints (e.g., fatigue) suspicious for ACS who are transported by emergency medical services (EMS). The aims of this study were to determine: (1) the proportion of patients with chest pain and less typical complaints, and (2) patient characteristics associated with prehospital ECG use in patients transported by EMS to emergency departments across North Carolina. Methods: A novel linked database was created between prehospital and emergency department (ED) patient care data from the North Carolina Prehospital Medical Information System and the North Carolina Disease Event Tracking and Epidemiologic Collection Tool. Institutional review board approval and a data use agreement were received prior to the start of the study. Patients ≥21 transported during 2010-14 by EMS with select variables were included. We examined patients' complaints (symptoms), characteristics (e.g., race, ethnicity, final hospital diagnosis), and prehospital ECG use (yes/no). Analysis included descriptive statistics and mixed logistic regression. Results: During 2010-14, there were 1,967,542 patients with linked EMS-ED data (mean age: 56.9 [SD: 22.2], 43.2% male, 63.7% White). Of these, 643,174 (32.6%) received a prehospital ECG. Patients with prehospital ECG presented with the following complaints: 20% chest pain; 10% shortness of breath; 6% abdominal pain/problems; 6% altered level of consciousness; 5% syncope/dizziness; 4% palpitations; 12% other complaints; and 37% missing. Patients' presenting complaints were the strongest predictor of prehospital ECG use, adjusting for age, sex, race, ethnicity, urbanicity, and date and time of EMS dispatch. 
Conclusions: Patients with chest pain were significantly more likely to receive a prehospital ECG compared to those with less typical but suspicious complaints for ACS. Patients with less common presentations remain disadvantaged for early triage, risk stratification, and intervention prior to the hospital.
Affiliation(s)
- Jessica K. Zègre-Hemsey
- University of North Carolina at Chapel Hill, School of Nursing; 919-966-5490 (office), 919-966-7298 (fax)
| | | | - Antonio Fernandez
- University of North Carolina at Chapel Hill and EMS Performance Improvement Center
| | - Jane Brice
- University of North Carolina at Chapel Hill, Department of Emergency Medicine
| |
|
170
|
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022] Open
Abstract
Background: In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels between linked genes in the genetic networks tend to be highly correlated with each other. Results: We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. Conclusions: The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods. Electronic supplementary material: The online version of this article (10.1186/s12859-019-3040-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| |
|
171
|
Zhou S, Zhou J, Zhang B. Overlapping group lasso for high-dimensional generalized linear models. COMMUN STAT-THEOR M 2019. [DOI: 10.1080/03610926.2018.1500604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Affiliation(s)
- Shengbin Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
| | - Jingke Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
| | - Bo Zhang
- Department of Statistics, Harbin Normal University, Harbin, China
| |
|
172
|
Mining user interaction patterns in the darkweb to predict enterprise cyber incidents. SOCIAL NETWORK ANALYSIS AND MINING 2019. [DOI: 10.1007/s13278-019-0603-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
173
|
Detmer FJ, Lückehe D, Mut F, Slawski M, Hirsch S, Bijlenga P, von Voigt G, Cebral JR. Comparison of statistical learning approaches for cerebral aneurysm rupture assessment. Int J Comput Assist Radiol Surg 2019; 15:141-150. [PMID: 31485987 DOI: 10.1007/s11548-019-02065-2] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 01/16/2019] [Accepted: 08/29/2019] [Indexed: 11/29/2022]
Abstract
PURPOSE Incidental aneurysms pose a challenge to physicians who need to decide whether or not to treat them. A statistical model could potentially support such treatment decisions. The aim of this study was to compare a previously developed aneurysm rupture logistic regression probability model (LRM) to other machine learning (ML) classifiers for discrimination of aneurysm rupture status. METHODS Hemodynamic, morphological, and patient-related information of 1631 cerebral aneurysms characterized by computational fluid dynamics simulations were used to train support vector machines (SVMs) with linear and RBF kernel (RBF-SVM), k-nearest neighbors (kNN), decision tree, random forest, and multilayer perceptron (MLP) neural network classifiers for predicting the aneurysm rupture status. The classifiers' accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) were evaluated and compared to the LRM using 249 test cases obtained from two external cohorts. Additionally, important variables were determined based on the random forest and weights of the linear SVM. RESULTS The AUCs of the MLP, LRM, linear SVM, RBF-SVM, kNN, decision tree, and random forest were 0.83, 0.82, 0.80, 0.81, 0.76, 0.70, and 0.79, respectively. The accuracy ranged between 0.76 (decision tree) and 0.79 (linear SVM, RBF-SVM, and MLP). Important variables for predicting the aneurysm rupture status included aneurysm location, the mean surface curvature, and maximum flow velocity. CONCLUSION The performance of the LRM was overall comparable to that of the other ML classifiers, confirming its potential for aneurysm rupture assessment. To further improve the predictions, additional information, e.g., related to the aneurysm wall, might be needed.
Affiliation(s)
- Felicitas J Detmer
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA.
| | - Daniel Lückehe
- Computational Health Informatics, Leibniz University, Hannover, Germany
| | - Fernando Mut
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
| | - Martin Slawski
- Statistics Department, George Mason University, Fairfax, VA, USA
| | - Sven Hirsch
- Institute of Applied Simulation, ZHAW University of Applied Sciences, Wädenswil, Switzerland
| | - Philippe Bijlenga
- Neurosurgery, Clinical Neurosciences Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | | | - Juan R Cebral
- Bioengineering Department, Volgenau School of Engineering, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
| |
|
174
|
Bai H, Zhu R, An H, Zhou G, Huang H, Ren H, Zhang Y. Influence of wastewater sludge properties on the performance of electro-osmosis dewatering. ENVIRONMENTAL TECHNOLOGY 2019; 40:2853-2863. [PMID: 29557729 DOI: 10.1080/09593330.2018.1455744] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/07/2017] [Accepted: 03/14/2018] [Indexed: 06/08/2023]
Abstract
Although the properties of municipal wastewater sludge play key roles in the electro-osmosis dewatering process, it is still controversial which properties have the greatest effect on the dewatering performance. In this study, multiple regression models with the Group Lasso method were used to investigate the relationship between the final moisture content and the sludge properties, including pH, electrical conductivity (EC), volatile solids content, zeta potential (ζ), initial moisture content, extracellular polymeric substances (EPS), proteins of EPS (EPSPr), polysaccharides of EPS (EPSPo) and the ratio of EPSPr and EPSPo (EPSR). Under the optimal conditions (pressure = 100 kPa, voltage = 50 V and cake thickness = 15 mm), EPS, EC and ζ were significantly related to sludge dewaterability and EPS was the most important factor. Furthermore, the coefficient estimate of EPSPo was greater than that of EPSPr and the coefficient of EPSR was negative, indicating that EPSPo plays more important roles in electro-osmosis dewatering than EPSPr. Thus, reducing the EPS content of sludge, especially the EPSPo content, is necessary to improve the performance of electro-osmosis dewatering.
Affiliation(s)
- Hao Bai
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, People's Republic of China
| | - Rong Zhu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People's Republic of China
| | - Hao An
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, People's Republic of China
| | - Guoya Zhou
- Peng Yao Environmental Protection Institute, Yixing, People's Republic of China
| | - Hui Huang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, People's Republic of China
| | - Hongqiang Ren
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, People's Republic of China
| | - Yan Zhang
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, People's Republic of China
| |
|
175
|
Wang Y, Li X, Ruiz R. Weighted General Group Lasso for Gene Selection in Cancer Classification. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:2860-2873. [PMID: 29993764 DOI: 10.1109/tcyb.2018.2829811] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 06/08/2023]
Abstract
Relevant gene selection is crucial for analyzing cancer gene expression datasets including two types of tumors in cancer classification. Intrinsic interactions among selected genes cannot be fully identified by most existing gene selection methods. In this paper, we propose a weighted general group lasso (WGGL) model to select cancer genes in groups. A gene grouping heuristic method is presented based on weighted gene co-expression network analysis. To determine the importance of genes and groups, a method for calculating gene and group weights is presented in terms of joint mutual information. To implement the complex calculation process of WGGL, a gene selection algorithm is developed. Experimental results on both random and three cancer gene expression datasets demonstrate that the proposed model achieves better classification performance than two existing state-of-the-art gene selection methods.
|
176
|
Alquier P, Cottet V, Lecué G. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Ann Stat 2019. [DOI: 10.1214/18-aos1742] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/19/2022]
|
177
|
Grebla R, Setyawan J, Park C, Richards KM, Nwokeji ED, Pawaskar M, Haim Erder M, Lawson KA. Examining the heterogeneity of treatment patterns in attention deficit hyperactivity disorder among children and adolescents in the Texas Medicaid population: modeling suboptimal treatment response. J Med Econ 2019; 22:788-797. [PMID: 30983465 DOI: 10.1080/13696998.2019.1606814] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
Abstract
Objectives: To examine suboptimal responses (SR) in attention deficit hyperactivity disorder (ADHD) among pediatric patients in the Texas Medicaid program receiving osmotic-release oral system methylphenidate (OROS-MPH) or lisdexamfetamine (LDX) and apply an SR prediction model to identify patients most likely to experience an SR to either OROS-MPH or LDX therapies. Methods: A retrospective cohort study was conducted using Texas Medicaid claims data of ADHD children and adolescents (6-17 years of age) initiating OROS-MPH or LDX. Primary SR endpoints were drug discontinuation, switching, and augmentation 12-months post-ADHD drug initiation. Logistic regression models were developed to predict SR to OROS-MPH and LDX in 1:1 matched groups of children and adolescent cohorts. Results: A total of 3,633 children and 1,611 adolescents were matched for each cohort. SR was observed among more children (76.4% vs 72.3%; p < 0.001) and adolescents (82.7% vs 78.2%; p = 0.002) initiating OROS-MPH compared to LDX. Patient sub-groups with the highest predicted risk of OROS-MPH SR experienced significantly lower observed SR rates (p < 0.05) when initiating LDX (children: 80.6% for OROS-MPH vs 75.8% for LDX; OR = 0.75, 95% CI = 0.60-0.94; adolescents: 87.2% for OROS-MPH vs 80.6% for LDX; OR = 0.61, 95% CI = 0.41-0.89). For patients with highest predicted SR rates to LDX, observed SR rates were not significantly different between patients initiating LDX or OROS-MPH. Conclusions: This study demonstrated how a personalized medicine approach using administrative claims data can be used to identify sub-groups of child and adolescent ADHD patients with different risks for suboptimal response with OROS-MPH or LDX in a Medicaid population.
Affiliation(s)
- Regina Grebla
- Global Outcomes Research and Epidemiology, Shire, Lexington, MA, USA
| | - Juliana Setyawan
- Global Outcomes Research and Epidemiology, Shire, Lexington, MA, USA
| | - Chanhyun Park
- Health Outcomes Division, The University of Texas at Austin, College of Pharmacy, Austin, TX, USA
| | - Kristin M Richards
- Health Outcomes Division, The University of Texas at Austin, College of Pharmacy, Austin, TX, USA
| | - Esmond D Nwokeji
- Health Outcomes Division, The University of Texas at Austin, College of Pharmacy, Austin, TX, USA
| | - Manjiri Pawaskar
- Global Outcomes Research and Epidemiology, Shire, Lexington, MA, USA
| | - M Haim Erder
- Global Outcomes Research and Epidemiology, Shire, Lexington, MA, USA
| | - Kenneth A Lawson
- Health Outcomes Division, The University of Texas at Austin, College of Pharmacy, Austin, TX, USA
| |
|
178
|
Drumetz L, Meyer TR, Chanussot J, Bertozzi AL, Jutten C. Hyperspectral Image Unmixing With Endmember Bundles and Group Sparsity Inducing Mixed Norms. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 28:3435-3450. [PMID: 30716036 DOI: 10.1109/tip.2019.2897254] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 06/09/2023]
Abstract
Hyperspectral images provide much more information than conventional imaging techniques, allowing a precise identification of the materials in the observed scene, but because of the limited spatial resolution, the observations are usually mixtures of the contributions of several materials. The spectral unmixing problem aims at recovering the spectra of the pure materials of the scene (endmembers), along with their proportions (abundances) in each pixel. In order to deal with the intra-class variability of the materials and the induced spectral variability of the endmembers, several spectra per material, constituting endmember bundles, can be considered. However, the usual abundance estimation techniques do not take advantage of the particular structure of these bundles, organized into groups of spectra. In this paper, we propose to use group sparsity by introducing mixed norms in the abundance estimation optimization problem. In particular, we propose a new penalty, which simultaneously enforces group and within-group sparsity, to the cost of being nonconvex. All the proposed penalties are compatible with the abundance sum-to-one constraint, which is not the case with traditional sparse regression. We show on simulated and real datasets that well-chosen penalties can significantly improve the unmixing performance compared to classical sparse regression techniques or to the naive bundle approach.
|
179
|
Koster GT, Nguyen TTM, van Zwet EW, Garcia BL, Rowling HR, Bosch J, Schonewille WJ, Velthuis BK, van den Wijngaard IR, den Hertog HM, Roos YBWEM, van Walderveen MAA, Wermer MJH, Kruyt ND. Clinical prediction of thrombectomy eligibility: A systematic review and 4-item decision tree. Int J Stroke 2019; 14:530-539. [PMID: 30209989 PMCID: PMC6710617 DOI: 10.1177/1747493018801225] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/11/2018] [Accepted: 06/25/2018] [Indexed: 01/19/2023]
Abstract
BACKGROUND A clinical large anterior vessel occlusion (LAVO)-prediction scale could reduce treatment delays by allocating intra-arterial thrombectomy (IAT)-eligible patients directly to a comprehensive stroke center. AIM To subtract, validate and compare existing LAVO-prediction scales, and develop a straightforward decision support tool to assess IAT-eligibility. METHODS We performed a systematic literature search to identify LAVO-prediction scales. Performance was compared in a prospective, multicenter validation cohort of the Dutch acute Stroke study (DUST) by calculating area under the receiver operating curves (AUROC). With group lasso regression analysis, we constructed a prediction model, incorporating patient characteristics next to National Institutes of Health Stroke Scale (NIHSS) items. Finally, we developed a decision tree algorithm based on dichotomized NIHSS items. RESULTS We identified seven LAVO-prediction scales. From DUST, 1316 patients (35.8% LAVO-rate) from 14 centers were available for validation. FAST-ED and RACE had the highest AUROC (both >0.81, p < 0.01 for comparison with other scales). Group lasso analysis revealed a LAVO-prediction model containing seven NIHSS items (AUROC 0.84). With the GACE (Gaze, facial Asymmetry, level of Consciousness, Extinction/inattention) decision tree, LAVO is predicted (AUROC 0.76) for 61% of patients with assessment of only two dichotomized NIHSS items, and for all patients with four items. CONCLUSION External validation of seven LAVO-prediction scales showed AUROCs between 0.75 and 0.83. Most scales, however, appear too complex for Emergency Medical Services use with prehospital validation generally lacking. GACE is the first LAVO-prediction scale using a simple decision tree as such increasing feasibility, while maintaining high accuracy. Prehospital prospective validation is planned.
Affiliation(s)
- Gaia T Koster
- Department of Neurology, Leiden University Medical Center, Leiden, Netherlands
| | - T Truc My Nguyen
- Department of Neurology, Leiden University Medical Center, Leiden, Netherlands
| | - Erik W van Zwet
- Department of Medical Statistics, Leiden University Medical Center, Leiden, Netherlands
| | - Bjarty L Garcia
- Department of Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
| | - Hannah R Rowling
- Department of Neurology, Leiden University Medical Center, Leiden, Netherlands
| | - J Bosch
- Department of Research and Development, RAV Hollands Midden, Leiden, Netherlands
| | - Wouter J Schonewille
- Department of Neurology, St. Antonius Hospital, Nieuwegein, Netherlands; Department of Neurology and Neurosurgery, Brain Center Rudolf Magnus, Utrecht, Netherlands
| | - Birgitta K Velthuis
- Department of Radiology, University Medical Center Utrecht, Utrecht, Netherlands
| | | | - Heleen M den Hertog
- Department of Neurology, Medisch Spectrum Twente; Department of Neurology, Isala Clinics, Zwolle, Netherlands
| | - Yvo BWEM Roos
- Department of Neurology, Academic Medical Center, Amsterdam, Netherlands
| | | | - Marieke JH Wermer
- Department of Neurology, Leiden University Medical Center, Leiden, Netherlands
| | - Nyika D Kruyt
- Department of Neurology, Leiden University Medical Center, Leiden, Netherlands
| |
|
180
|
Komatsu S, Yamashita Y, Ninomiya Y. AIC for the group Lasso in generalized linear models. JAPANESE JOURNAL OF STATISTICS AND DATA SCIENCE 2019. [DOI: 10.1007/s42081-019-00052-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
|
181
|
Dang Y, Wang Q. Simultaneous variable and factor selection via sparse group lasso in factor analysis. J STAT COMPUT SIM 2019. [DOI: 10.1080/00949655.2019.1633324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
Affiliation(s)
- Yuanchu Dang
- Department of Mathematics and Statistics, Williams College, Williamstown, MA, USA
| | - Qing Wang
- Department of Mathematics, Wellesley College, Wellesley, MA, USA
| |
|
182
|
He Z, Fong Y. Maximum diversity weighting for biomarkers with application in HIV-1 vaccine studies. Stat Med 2019; 38:3936-3946. [PMID: 31215662 DOI: 10.1002/sim.8212] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/11/2018] [Revised: 02/15/2019] [Accepted: 05/08/2019] [Indexed: 11/07/2022]
Abstract
While studying the association between risk of HIV-1 infection and vaccine-elicited immune responses in preventative HIV-1 vaccine recipients, we encountered a need to combine a collection of biomarkers in an unsupervised fashion with the goal of preserving signal diversity within that collection. Inspired by methods for weighting protein sequences from the biological sequence analysis literature, we propose novel methods for weighting biomarkers, which we call maximum diversity weights. These weights are defined as the weights that maximize measures of signal diversity within a collection of biomarkers. While the optimization problems do not admit analytical solutions, they are convex and hence can be solved efficiently using iterative search algorithms. Through Monte Carlo studies and a real data example from HIV-1 vaccine research, we show that using maximum diversity weights in association studies can lead to an increase in power over other commonly used weights such as uniform weights or principal component-based weights.
Affiliation(s)
- Zonglin He
- Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Youyi Fong
- Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington; Department of Biostatistics, University of Washington, Seattle, Washington
| |
|
183
|
Wilder-Smith A, Wei Y, de Araújo TVB, VanKerkhove M, Turchi Martelli CM, Turchi MD, Teixeira M, Tami A, Souza J, Sousa P, Soriano-Arandes A, Soria-Segarra C, Sanchez Clemente N, Rosenberger KD, Reveiz L, Prata-Barbosa A, Pomar L, Pelá Rosado LE, Perez F, Passos SD, Nogueira M, Noel TP, Moura da Silva A, Moreira ME, Morales I, Miranda Montoya MC, Miranda-Filho DDB, Maxwell L, Macpherson CNL, Low N, Lan Z, LaBeaud AD, Koopmans M, Kim C, João E, Jaenisch T, Hofer CB, Gustafson P, Gérardin P, Ganz JS, Dias ACF, Elias V, Duarte G, Debray TPA, Cafferata ML, Buekens P, Broutet N, Brickley EB, Brasil P, Brant F, Bethencourt S, Benedetti A, Avelino-Silva VL, Ximenes RADA, Alves da Cunha A, Alger J. Understanding the relation between Zika virus infection during pregnancy and adverse fetal, infant and child outcomes: a protocol for a systematic review and individual participant data meta-analysis of longitudinal studies of pregnant women and their infants and children. BMJ Open 2019; 9:e026092. [PMID: 31217315 PMCID: PMC6588966 DOI: 10.1136/bmjopen-2018-026092] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 08/17/2018] [Revised: 02/11/2019] [Accepted: 05/09/2019] [Indexed: 12/14/2022] Open
Abstract
INTRODUCTION Zika virus (ZIKV) infection during pregnancy is a known cause of microcephaly and other congenital and developmental anomalies. In the absence of a ZIKV vaccine or prophylactics, principal investigators (PIs) and international leaders in ZIKV research have formed the ZIKV Individual Participant Data (IPD) Consortium to identify, collect and synthesise IPD from longitudinal studies of pregnant women that measure ZIKV infection during pregnancy and fetal, infant or child outcomes. METHODS AND ANALYSIS We will identify eligible studies through the ZIKV IPD Consortium membership and a systematic review and invite study PIs to participate in the IPD meta-analysis (IPD-MA). We will use the combined dataset to estimate the relative and absolute risk of congenital Zika syndrome (CZS), including microcephaly and late symptomatic congenital infections; identify and explore sources of heterogeneity in those estimates and develop and validate a risk prediction model to identify the pregnancies at the highest risk of CZS or adverse developmental outcomes. The variable accuracy of diagnostic assays and differences in exposure and outcome definitions means that included studies will have a higher level of systematic variability, a component of measurement error, than an IPD-MA of studies of an established pathogen. We will use expert testimony, existing internal and external diagnostic accuracy validation studies and laboratory external quality assessments to inform the distribution of measurement error in our models. We will apply both Bayesian and frequentist methods to directly account for these and other sources of uncertainty. ETHICS AND DISSEMINATION The IPD-MA was deemed exempt from ethical review. We will convene a group of patient advocates to evaluate the ethical implications and utility of the risk stratification tool. 
Findings from these analyses will be shared via national and international conferences and through publication in open access, peer-reviewed journals. TRIAL REGISTRATION NUMBER PROSPERO International prospective register of systematic reviews (CRD42017068915).
Affiliation(s)
- Annelies Wilder-Smith
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
| | - Yinghui Wei
- Centre for Mathematical Sciences, University of Plymouth, Plymouth, UK
| | | | - Maria VanKerkhove
- Health Emergencies Programme, Organisation mondiale de la Sante, Geneve, Switzerland
| | | | - Marília Dalva Turchi
- Institute of Tropical Pathology and Public Health, Federal University of Goias, Goiânia, Brazil
| | - Mauro Teixeira
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Adriana Tami
- Department of Medical Microbiology, University Medical Center Groningen, Groningen, The Netherlands
| | - João Souza
- Department of Social Medicine, University of São Paulo, São Paulo, Brazil
| | - Patricia Sousa
- Reference Center for Neurodevelopment, Assistance, and Rehabilitation of Children, State Department of Health of Maranhão, Sao Luís, Brazil
| | | | | | | | - Kerstin Daniela Rosenberger
- Department of Infectious Diseases, Section Clinical Tropical Medicine, UniversitatsKlinikum Heidelberg, Heidelberg, Germany
| | - Ludovic Reveiz
- Evidence and Intelligence for Action in Health, Pan American Health Organization, Washington, District of Columbia, USA
| | - Arnaldo Prata-Barbosa
- Department of Pediatrics, D’Or Institute for Research & Education, Rio de Janeiro, Brazil
| | - Léo Pomar
- Department of Obstetrics and Gynecology, Centre Hospitalier de l’Ouest Guyanais, Saint-Laurent du Maroni, French Guiana
| | | | - Freddy Perez
- Communicable Diseases and Environmental Determinants of Health Department, Pan American Health Organization, Washington, District of Columbia, USA
| | | | - Mauricio Nogueira
- Faculdade de Medicina de Sao Jose do Rio Preto, Department of Dermatologic Diseases, São José do Rio Preto, Brazil
| | - Trevor P. Noel
- Windward Islands Research and Education Foundation, St. George’s University, True Blue Point, Grenada
| | - Antônio Moura da Silva
- Department of Public Health, Universidade Federal do Maranhão – São Luís, São Luís, Brazil
| | | | - Ivonne Morales
- Department of Infectious Diseases, Section Clinical Tropical Medicine, UniversitatsKlinikum Heidelberg, Heidelberg, Germany
| | | | | | - Lauren Maxwell
- Reproductive Health and Research, World Health Organization, Geneva, Switzerland
- Hubert Department of Global Health, Emory University, Atlanta, Georgia, USA
| | - Calum N. L. Macpherson
- Windward Islands Research and Education Foundation, St. George’s University, True Blue Point, Grenada
| | - Nicola Low
- Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland
| | - Zhiyi Lan
- McGill University Health Centre, McGill University, Montréal, Canada
| | | | - Marion Koopmans
- Department of Virology, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Caron Kim
- Department of Reproductive Health and Research, World Health Organization, Geneva, Switzerland
| | - Esaú João
- Department of Infectious Diseases, Hospital Federal dos Servidores do Estado, Rio de Janeiro, Brazil
| | - Thomas Jaenisch
- Department of Infectious Diseases, Section Clinical Tropical Medicine, UniversitatsKlinikum Heidelberg, Heidelberg, Germany
| | - Cristina Barroso Hofer
- Instituto de Puericultura e Pediatria Martagão Gesteira, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
| | - Paul Gustafson
- Statistics, University of British Columbia, British Columbia, Vancouver, Canada
| | - Patrick Gérardin
- INSERM CIC1410 Clinical Epidemiology, CHU La Réunion, Saint Pierre, Réunion
- UM 134 PIMIT (CNRS 9192, INSERM U1187, IRD 249, Université de la Réunion), Universite de la Reunion, Sainte Clotilde, Réunion
| | | | - Ana Carolina Fialho Dias
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Vanessa Elias
- Sustainable Development and Environmental Health, Pan American Health Organization, Washington, District of Columbia, USA
| | - Geraldo Duarte
- Department of Gynecology and Obstetrics, University of São Paulo, São Paulo, Brazil
| | - Thomas Paul Alfons Debray
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
| | - María Luisa Cafferata
- Mother and Children Health Research Department, Instituto de Efectividad Clinica y Sanitaria, Buenos Aires, Argentina
| | - Pierre Buekens
- School of Public Health and Tropical Medicine, Tulane University, New Orleans, USA
| | - Nathalie Broutet
- Department of Reproductive Health and Research, World Health Organization, Geneva, Switzerland
| | - Elizabeth B. Brickley
- Department of Infectious Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| | - Patrícia Brasil
- Instituto de pesquisa Clínica Evandro Chagas, Fundacao Oswaldo Cruz, Rio de Janeiro, Brazil
| | - Fátima Brant
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Sarah Bethencourt
- Facultad de Ciencias de la Salud, Universidad de Carabobo, Valencia, Carabobo, Bolivarian Republic of Venezuela
| | - Andrea Benedetti
- Departments of Medicine and of Epidemiology, Biostatistics & Occupational Health, McGill University, Montreal, Quebec, Canada
| | - Vivian Lida Avelino-Silva
- Department of Infectious and Parasitic Diseases, Faculdade de Medicina da Universidade de Sao Paulo, São Paulo, Brazil
| | | | | | - Jackeline Alger
- Facultad de Ciencias Médicas, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras
| | | |
|
184
|
Sung CL, Wang W, Plumlee M, Haaland B. Multiresolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1595630] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/27/2022]
Affiliation(s)
- Chih-Li Sung
- Department of Statistics and Probability, Michigan State University, East Lansing, MI
| | - Wenjia Wang
- Statistical and Applied Mathematical Sciences Institute, Raleigh, NC
| | - Matthew Plumlee
- Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL
| | - Benjamin Haaland
- Department of Population Health Sciences, University of Utah, Salt Lake City, UT
- School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA
| |
|
185
|
|
186
|
Luu TD, Fadili J, Chesneau C. PAC-Bayesian risk bounds for group-analysis sparse regression by exponential weighting. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2018.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
187
|
Liu M, Zhang J, Adeli E, Shen D. Joint Classification and Regression via Deep Multi-Task Multi-Channel Learning for Alzheimer's Disease Diagnosis. IEEE Trans Biomed Eng 2019; 66:1195-1206. [PMID: 30222548 PMCID: PMC6764421 DOI: 10.1109/tbme.2018.2869989] [Citation(s) in RCA: 105] [Impact Index Per Article: 21.0] [Indexed: 12/19/2022]
Abstract
In the field of computer-aided Alzheimer's disease (AD) diagnosis, jointly identifying brain diseases and predicting clinical scores using magnetic resonance imaging (MRI) has attracted increasing attention since these two tasks are highly correlated. Most existing joint learning approaches require hand-crafted feature representations for MR images. Since hand-crafted features of MRI and classification/regression models may not coordinate well with each other, conventional methods may lead to sub-optimal learning performance. Demographic information (e.g., age, gender, and education) of subjects may also be related to brain status, and thus can help improve the diagnostic performance. However, conventional joint learning methods seldom incorporate such demographic information into the learning models. To this end, we propose a deep multi-task multi-channel learning (DM²L) framework for simultaneous brain disease classification and clinical score regression, using MRI data and demographic information of subjects. Specifically, we first identify the discriminative anatomical landmarks from MR images in a data-driven manner, and then extract multiple image patches around these detected landmarks. We then propose a deep multi-task multi-channel convolutional neural network for joint classification and regression. Our DM²L framework can not only automatically learn discriminative features for MR images, but also explicitly incorporate the demographic information of subjects into the learning process. We evaluate the proposed method on four large multi-center cohorts with 1984 subjects, and the experimental results demonstrate that DM²L is superior to several state-of-the-art joint learning methods in both the tasks of disease classification and clinical score regression.
|
188
|
Zhong H, Kim S, Zhi D, Cui X. Predicting gene expression using DNA methylation in three human populations. PeerJ 2019; 7:e6757. [PMID: 31106051 PMCID: PMC6500370 DOI: 10.7717/peerj.6757] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/19/2018] [Accepted: 03/10/2019] [Indexed: 12/30/2022] Open
Abstract
Background DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially the negative correlation in the promoter region. However, its correlation with gene expression across the genome at the human population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her/his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts with DNA samples rather than RNA samples. Results We examined DNA methylation in the gene region for predicting gene expression across individuals in non-cancer tissues of three human population datasets: adipose tissue from the Multiple Tissue Human Expression Resource Project (MuTHER), peripheral blood mononuclear cells (PBMC) from asthma and normal control study participants, and lymphoblastoid cell lines (LCL) from healthy individuals. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, the prediction power is generally low and varies across datasets. Only 30 and 42 genes were found to have cross-validation R² greater than 0.3 in the PBMC and Adipose datasets, respectively. A substantially larger number of genes (258) were identified in the LCL dataset, which was generated from a more homogeneous cell line sample source. We also demonstrated that prediction power is better when no CpG probe is excluded because of cross hybridization or SNP effect.
Conclusion In our three population analyses, DNA methylation of CpG sites in the gene region has limited prediction power for gene expression across individuals with linear regression models. The prediction power potentially varies depending on tissue, cell type, and data sources. In our analyses, the combination of LASSO regression and all probes on the methylation array, without excluding any, provides the best prediction for gene expression.
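The cross-validation R² used to score each gene in this abstract can be illustrated with a short sketch. It assumes the simplest of the three models named above (single linear regression on one CpG site), folds assigned by index, and R² defined as 1 − SSE/SST over held-out predictions. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_bar - slope * x_bar, slope


def cross_validated_r2(x, y, n_folds=5):
    """k-fold cross-validated R^2: 1 - SSE/SST using held-out predictions."""
    n = len(y)
    predictions = [0.0] * n
    for fold in range(n_folds):
        # Assumption: folds are assigned by index modulo n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        intercept, slope = fit_simple_ols(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        for i in test_idx:
            predictions[i] = intercept + slope * x[i]
    y_bar = mean(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


# Toy data (hypothetical): expression falls as promoter methylation rises.
methylation = [0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.70, 0.80, 0.90]
expression = [9.8, 9.1, 8.7, 7.9, 6.5, 6.4, 5.2, 4.9, 3.8, 3.1]
r2 = cross_validated_r2(methylation, expression, n_folds=5)
print(f"cross-validated R^2 = {r2:.3f}")
```

Because every prediction comes from a model that never saw that sample, this score can be much lower than the in-sample R², and it can be negative when the model predicts worse than the mean.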
Affiliation(s)
- Huan Zhong
- Department of Biology, Hong Kong Baptist University, Hong Kong, China
| | - Soyeon Kim
- School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| | - Degui Zhi
- School of Biomedical Informatics, University of Texas Health Center at Houston, Houston, TX, United States of America
| | - Xiangqin Cui
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States of America
| |
|
189
|
Improved Reconstruction of MR Scanned Images by Using a Dictionary Learning Scheme. SENSORS 2019; 19:s19081918. [PMID: 31018597 PMCID: PMC6514997 DOI: 10.3390/s19081918] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/11/2019] [Revised: 04/16/2019] [Accepted: 04/21/2019] [Indexed: 11/17/2022]
Abstract
The application of compressed sensing (CS) to biomedical imaging is sensational since it permits a rationally accurate reconstruction of images by exploiting the image sparsity. The quality of CS reconstruction methods largely depends on the use of various sparsifying transforms, such as wavelets, curvelets or total variation (TV), to recover MR images. As per recently developed mathematical concepts of CS, the biomedical images with sparse representation can be recovered from randomly undersampled data, provided that an appropriate nonlinear recovery method is used. Due to high under-sampling, the reconstructed images have noise like artifacts because of aliasing. Reconstruction of images from CS involves two steps, one for dictionary learning and the other for sparse coding. In this novel framework, we choose Simultaneous code word optimization (SimCO) patch-based dictionary learning that updates the atoms simultaneously, whereas Focal underdetermined system solver (FOCUSS) is used for sparse representation because of a soft constraint on sparsity of an image. Combining SimCO and FOCUSS, we propose a new scheme called SiFo. Our proposed alternating reconstruction scheme learns the dictionary, uses it to eliminate aliasing and noise in one stage, and afterwards restores and fills in the k-space data in the second stage. Experiments were performed using different sampling schemes with noisy and noiseless cases of both phantom and real brain images. Based on various performance parameters, it has been shown that our designed technique outperforms the conventional techniques, like K-SVD with OMP, used in dictionary learning based MRI (DLMRI) reconstruction.
|
190
|
Ijaz M, Asghar Z, Gul A. Ensemble of penalized logistic models for classification of high-dimensional data. COMMUN STAT-SIMUL C 2019. [DOI: 10.1080/03610918.2019.1595647] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
Affiliation(s)
- Musarrat Ijaz
- Department of Statistics, Quaid-i-Azam University, Islamabad, Pakistan
- Department of Statistics, Shaheed Benazir Bhutto Women University, Peshawar, Pakistan
| | - Zahid Asghar
- Department of Economics, Quaid-i-Azam University, Islamabad, Pakistan
| | - Asma Gul
- Department of Statistics, Shaheed Benazir Bhutto Women University, Peshawar, Pakistan
| |
|
191
|
Song H, Raskutti G. PUlasso: High-Dimensional Variable Selection With Presence-Only Data. J Am Stat Assoc 2019; 115:334-347. [PMID: 32255883 PMCID: PMC7133715 DOI: 10.1080/01621459.2018.1546587] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/22/2017] [Revised: 10/13/2018] [Accepted: 10/29/2018] [Indexed: 10/27/2022]
Abstract
In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article we study variable selection in the context of presence only responses where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this article, we develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm involves using the majorization-minimization framework which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.
Affiliation(s)
- Hyebin Song
- Department of Statistics, University of Wisconsin-Madison, Madison, WI
| | - Garvesh Raskutti
- Department of Statistics, University of Wisconsin-Madison, Madison, WI
| |
|
192
|
Qi Z, Liu D, Fu H, Liu Y. Multi-Armed Angle-Based Direct Learning for Estimating Optimal Individualized Treatment Rules With Various Outcomes. J Am Stat Assoc 2019; 115:678-691. [PMID: 34219848 DOI: 10.1080/01621459.2018.1529597] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 10/27/2022]
Abstract
Estimating an optimal individualized treatment rule (ITR) based on patients' information is an important problem in precision medicine. An optimal ITR is a decision function that optimizes patients' expected clinical outcomes. Many existing methods in the literature are designed for binary treatment settings with the interest of a continuous outcome. Much less work has been done on estimating optimal ITRs in multiple treatment settings with good interpretations. In this article, we propose angle-based direct learning (AD-learning) to efficiently estimate optimal ITRs with multiple treatments. Our proposed method can be applied to various types of outcomes, such as continuous, survival, or binary outcomes. Moreover, it has an interesting geometric interpretation on the effect of different treatments for each individual patient, which can help doctors and patients make better decisions. Finite sample error bounds have been established to provide a theoretical guarantee for AD-learning. Finally, we demonstrate the superior performance of our method via an extensive simulation study and real data applications. Supplementary materials for this article are available online.
Affiliation(s)
- Zhengling Qi
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC
| | - Dacheng Liu
- Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, CT
| | - Haoda Fu
- Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN
| | - Yufeng Liu
- Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC
| |
|
193
|
Jackknife Model Averaging Prediction Methods for Complex Phenotypes with Gene Expression Levels by Integrating External Pathway Information. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2019; 2019:2807470. [PMID: 31089389 PMCID: PMC6476151 DOI: 10.1155/2019/2807470] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/13/2019] [Accepted: 03/20/2019] [Indexed: 01/03/2023]
Abstract
Motivation In the past few years many prediction approaches have been proposed and widely employed in high dimensional genetic data for disease risk evaluation. However, those approaches typically ignore in model fitting the important group structures that naturally exist in genetic data. Methods In the present study, we applied a novel model-averaging approach, called jackknife model averaging prediction (JMAP), for high dimensional genetic risk prediction while incorporating pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a cross validation criterion in a jackknife way. Compared with previous approaches, one of the primary features of JMAP is to allow model weights to vary from 0 to 1 but without the limitation that the summation of weights is equal to one. We evaluated the performance of JMAP using extensive simulation studies and compared it with existing methods. We finally applied JMAP to four real cancer datasets that are publicly available from TCGA. Results The simulations showed that compared with other existing approaches (e.g., gsslasso), JMAP performed best or was among the best methods across a range of scenarios. For example, among 14 out of 16 simulation settings with PVE = 0.3, JMAP has an average of 0.075 higher prediction accuracy compared with gsslasso. We further found that in the simulation, the model weights for the true candidate models have much smaller chances to be zero compared with those for the null candidate models and are substantially greater in magnitude. In the real data application, JMAP also behaves comparably or better compared with the other methods for continuous phenotypes. For example, for the COAD, CRC, and PAAD datasets, the average gains of predictive accuracy of JMAP are 0.019, 0.064, and 0.052 compared with gsslasso.
Conclusion The proposed method JMAP is a novel model-averaging approach for high dimensional genetic risk prediction while incorporating external useful group structures into the model specification.
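The weighting step described in this abstract (weights between 0 and 1, not forced to sum to one, chosen by a jackknife cross-validation criterion) can be sketched as follows. This is only an illustration under stated assumptions: each candidate model is a one-predictor least-squares fit standing in for a pathway-based candidate, the jackknife step is leave-one-out prediction, and the optimizer is projected coordinate descent. None of these choices is given in the abstract, and all names are hypothetical.

```python
def loo_predictions(x, y):
    """Leave-one-out predictions from a one-predictor least-squares line."""
    n = len(y)
    preds = []
    for i in range(n):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        x_bar = sum(xs) / (n - 1)
        y_bar = sum(ys) / (n - 1)
        sxx = sum((a - x_bar) ** 2 for a in xs)
        sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        preds.append(y_bar + slope * (x[i] - x_bar))
    return preds


def jackknife_weights(loo_preds, y, n_sweeps=200):
    """Minimise sum_i (y_i - sum_m w_m * p_mi)^2 subject to 0 <= w_m <= 1.

    The weights are not forced to sum to one. Each coordinate update is the
    exact one-dimensional minimiser, clipped to the interval [0, 1].
    """
    n_models = len(loo_preds)
    weights = [0.0] * n_models
    for _ in range(n_sweeps):
        for m in range(n_models):
            # Residual with model m's own contribution left out.
            partial = [
                yi - sum(weights[k] * loo_preds[k][i]
                         for k in range(n_models) if k != m)
                for i, yi in enumerate(y)
            ]
            denom = sum(p * p for p in loo_preds[m])
            if denom == 0:
                continue
            raw = sum(p * r for p, r in zip(loo_preds[m], partial)) / denom
            weights[m] = min(1.0, max(0.0, raw))
    return weights


# Toy data (hypothetical): the phenotype depends on x1 only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]
y = [2.0 * v for v in x1]
weights = jackknife_weights([loo_predictions(x1, y), loo_predictions(x2, y)], y)
print(weights)
```

In this toy case the informative candidate receives a weight close to one and the uninformative candidate a weight close to zero, which mirrors the abstract's observation that null candidate models are more likely to receive zero weight.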
|
194
|
Qian W, Li W, Sogawa Y, Fujimaki R, Yang X, Liu J. An Interactive Greedy Approach to Group Sparsity in High Dimensions. Technometrics 2019. [DOI: 10.1080/00401706.2018.1537897] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Affiliation(s)
- Wei Qian
- Department of Applied Economics and Statistics, University of Delaware
| | - Wending Li
- Department of Computer Science, University of Rochester
| | | | | | - Xitong Yang
- Department of Computer Science, University of Rochester
| | - Ji Liu
- Department of Computer Science, University of Rochester
| |
|
195
|
Group Lasso Regularized Deep Learning for Cancer Prognosis from Multi-Omics and Clinical Features. Genes (Basel) 2019; 10:genes10030240. [PMID: 30901858 PMCID: PMC6471789 DOI: 10.3390/genes10030240] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 02/19/2019] [Revised: 03/12/2019] [Accepted: 03/18/2019] [Indexed: 12/17/2022] Open
Abstract
Accurate prognosis of patients with cancer is important for the stratification of patients, the optimization of treatment strategies, and the design of clinical trials. Both clinical features and molecular data can be used for this purpose, for instance, to predict the survival of patients censored at specific time points. Multi-omics data, including genome-wide gene expression, methylation, protein expression, copy number alteration, and somatic mutation data, are becoming increasingly common in cancer studies. To harness the rich information in multi-omics data, we developed GDP (Group lasso regularized Deep learning for cancer Prognosis), a computational tool for survival prediction using both clinical and multi-omics data. GDP integrated a deep learning framework and Cox proportional hazard model (CPH) together, and applied group lasso regularization to incorporate gene-level group prior knowledge into the model training process. We evaluated its performance in both simulated and real data from The Cancer Genome Atlas (TCGA) project. In simulated data, our results supported the importance of group prior information in the regularization of the model. Compared to the standard lasso regularization, we showed that group lasso achieved higher prediction accuracy when the group prior knowledge was provided. We also found that GDP performed better than CPH for complex survival data. Furthermore, analysis on real data demonstrated that GDP performed favorably against other methods in several cancers with large-scale omics data sets, such as glioblastoma multiforme, kidney renal clear cell carcinoma, and bladder urothelial carcinoma. In summary, we demonstrated that GDP is a powerful tool for prognosis of patients with cancer, especially when large-scale molecular features are available.
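The difference between group lasso and standard lasso regularization mentioned in this abstract comes down to penalizing the Euclidean norm of each group of weights instead of each weight separately. The sketch below shows the unweighted penalty and its proximal (group soft-thresholding) step. It is an illustration only: the abstract does not say how GDP optimizes the penalty, many formulations also scale each group by a size-dependent factor, and the grouping and function names here are hypothetical.

```python
import math


def group_norms(weights, groups):
    """Euclidean norm of the weights in each group (groups hold indices)."""
    return [math.sqrt(sum(weights[j] ** 2 for j in idx)) for idx in groups]


def group_lasso_penalty(weights, groups, lam):
    """Unweighted group lasso penalty: lam * sum_g ||w_g||_2."""
    return lam * sum(group_norms(weights, groups))


def group_soft_threshold(weights, groups, threshold):
    """Proximal step of the penalty: shrink each group towards zero as a block.

    A group whose norm is at most `threshold` is set to zero entirely, so all
    features of that group are dropped together.
    """
    shrunk = list(weights)
    for idx, norm in zip(groups, group_norms(weights, groups)):
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for j in idx:
            shrunk[j] = scale * weights[j]
    return shrunk


# Hypothetical example: two genes, each with two omics features.
weights = [3.0, 4.0, 0.1, -0.1]
groups = [[0, 1], [2, 3]]
penalty = group_lasso_penalty(weights, groups, lam=1.0)
shrunk = group_soft_threshold(weights, groups, threshold=1.0)
print(penalty, shrunk)
```

The first group (norm 5) is scaled by 0.8 and kept, while the second group (norm about 0.14) falls below the threshold and is removed as a whole, which is the block-wise selection behavior that distinguishes group lasso from the element-wise lasso.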
|
196
|
Greb F, Steffens J, Schlotz W. Modeling Music-Selection Behavior in Everyday Life: A Multilevel Statistical Learning Approach and Mediation Analysis of Experience Sampling Data. Front Psychol 2019; 10:390. [PMID: 30941066 PMCID: PMC6433931 DOI: 10.3389/fpsyg.2019.00390] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/25/2018] [Accepted: 02/07/2019] [Indexed: 12/05/2022] Open
Abstract
Music listening has become a highly individualized activity with smartphones and music streaming services providing listeners with absolute freedom to listen to any kind of music in any situation. Until now, little has been written about the processes underlying the selection of music in daily life. The present study aimed to disentangle some of the complex processes among the listener, situation, and functions of music listening involved in music selection. Utilizing the experience sampling method, data were collected from 119 participants using a smartphone application. For 10 consecutive days, participants received 14 prompts using stratified-random sampling throughout the day and reported on their music-listening behavior. Statistical learning procedures on multilevel regression models and multilevel structural equation modeling were used to determine the most important predictors and analyze mediation processes between person, situation, functions of listening, and music selection. Results revealed that the features of music selected in daily life were predominantly determined by situational characteristics, whereas consistent individual differences were of minor importance. Functions of music listening were found to act as a mediator between characteristics of the situation and music-selection behavior. We further observed several significant random effects, which indicated that individuals differed in how situational variables affected their music selection behavior. Our findings suggest a need to shift the focus of music-listening research from individual differences to situational influences, including potential person-situation interactions.
Affiliation(s)
- Fabian Greb
- Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany; Audio Communication Group, Technische Universität Berlin, Berlin, Germany
| | - Jochen Steffens
- Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany; Audio Communication Group, Technische Universität Berlin, Berlin, Germany
| | - Wolff Schlotz
- Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany; Institute of Psychology, Goethe University, Frankfurt am Main, Germany
| |
|
197
|
van de Wiel MA, Te Beest DE, Münch MM. Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scand Stat Theory Appl 2019; 46:2-25. [PMID: 31007342 PMCID: PMC6472625 DOI: 10.1111/sjos.12335] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/09/2017] [Revised: 01/24/2018] [Accepted: 03/22/2018] [Indexed: 12/21/2022]
Abstract
Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
Affiliation(s)
- Mark A. van de Wiel
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
- Department of Mathematics, VU University, Amsterdam, The Netherlands
| | - Dennis E. Te Beest
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
| | - Magnus M. Münch
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
- Mathematical Institute, Faculty of Science, Leiden University, Leiden, The Netherlands
| |
|
198
|
Zhang J, Zhao Z, Zhang K, Wei Z. A Feature Sampling Strategy for Analysis of High Dimensional Genomic Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:434-441. [PMID: 29990199 DOI: 10.1109/tcbb.2017.2779492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
With the development of high throughput technology, it has become feasible and common to profile tens of thousands of gene activities simultaneously. These genomic data typically have a sample size of hundreds or fewer, which is much less than the feature size (number of genes). In addition, the genes, in particular the ones from the same pathway, are often highly correlated. These issues impose a great challenge for selecting meaningful genes from a large number of (correlated) candidates in many genomic studies. Quite a few methods have been proposed to attack this challenge. Among them, regularization-based techniques, e.g., lasso, become much more appealing, because they can do model fitting and variable selection at the same time. However, the lasso regression has its known limitations. One is that the number of genes selected by the lasso cannot exceed the number of samples. Another limitation is that, if causal genes are highly correlated, the lasso tends to select only one or a few of them. Biologists, however, desire to identify them all. To overcome these limitations, we present here a novel, robust, and stable variable selection method. Through simulation studies and a real application to the transcriptome data, we demonstrate the superiority of the proposed method in selecting highly correlated causal genes. We also provide some theoretical justifications for this feature sampling strategy based on the mean and variance analyses.
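The second lasso limitation named in this abstract can be reproduced in a few lines. The sketch below is not the feature sampling method the authors propose; it only illustrates the problem, assuming the usual objective (1/2n)·||y − Xb||² + λ·||b||₁ solved by cyclic coordinate descent from a zero start. The two "genes" are exact copies of each other, an extreme case of correlation, and all names and data are hypothetical.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator used in lasso coordinate updates."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, n_sweeps=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    `columns` is a list of feature columns, each a list of n values.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # equals y - X @ beta, and beta starts at zero
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            if scale == 0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(v * (r + beta[j] * v) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Hypothetical data: two perfectly correlated causal genes.
gene_a = [-1.5, -0.5, 0.0, 0.5, 1.5]
gene_b = list(gene_a)
y = [a + b for a, b in zip(gene_a, gene_b)]  # both genes contribute equally
beta = lasso_coordinate_descent([gene_a, gene_b], y, lam=0.1)
print(beta)
```

With identical columns the objective depends only on the sum of the two coefficients, so the solver has no reason to split the effect; starting from zero it assigns essentially all of it to whichever gene is visited first and leaves the other at (numerically) zero, even though both are causal by construction.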
|
199
|
Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinform 2019; 19:325-340. [PMID: 28011753 DOI: 10.1093/bib/bbw113] [Citation(s) in RCA: 126] [Impact Index Per Article: 25.2] [Received: 07/26/2016] [Indexed: 01/08/2023] Open
Abstract
Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.
Affiliation(s)
- Yifeng Li
  - Information and Communications Technologies, National Research Council Canada, Ottawa, Ontario, Canada
- Fang-Xiang Wu
  - Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Alioune Ngom
  - School of Computer Science, University of Windsor, Windsor, Ontario, Canada
|
200
|
Tang Z, Lei S, Zhang X, Yi Z, Guo B, Chen JY, Shen Y, Yi N. Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinformatics 2019; 20:94. [PMID: 30813883 PMCID: PMC6391807 DOI: 10.1186/s12859-019-2656-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/21/2018] [Accepted: 01/28/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND: Group structures among genes encoded in functional relationships or biological pathways are valuable and unique features in large-scale molecular data for survival analysis. However, most previous approaches to molecular data analysis ignore such group structures. It is desirable to develop powerful analytic methods that incorporate valuable pathway information for predicting disease survival outcomes and detecting associated genes.
RESULTS: We here propose a Bayesian hierarchical Cox survival model, called the group spike-and-slab lasso Cox (gsslasso Cox), for predicting disease survival outcomes and detecting associated genes by incorporating group structures of biological pathways. Our hierarchical model employs a novel prior on the coefficients of genes, i.e., the group spike-and-slab double-exponential distribution, to integrate group structures and to adaptively shrink the effects of genes. We have developed a fast and stable deterministic algorithm to fit the proposed models. We performed extensive simulation studies to assess the model fitting properties and the prognostic performance of the proposed method, and also applied our method to analyze three cancer data sets.
CONCLUSIONS: Both the theoretical and empirical studies show that the proposed method can induce weaker shrinkage on predictors in an active pathway, thereby incorporating the biological similarity of genes within the same pathway into the hierarchical modeling. Compared with several existing methods, the proposed method can more accurately estimate gene effects and can better predict survival outcomes. For the three cancer data sets, the results show that the proposed method generates more powerful models for survival prediction and detecting associated genes. The method has been implemented in a freely available R package BhGLM at https://github.com/nyiuab/BhGLM .
Affiliation(s)
- Zaixiang Tang
  - Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, 215123 China
  - Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Shufeng Lei
  - Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, 215123 China
  - Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Xinyan Zhang
  - Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA 30458 USA
- Zixuan Yi
  - Eastern Virginia Medical School, Norfolk, VA 23507 USA
- Boyi Guo
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Jake Y. Chen
  - Informatics Institute, School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294 USA
- Yueping Shen
  - Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, 215123 China
  - Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Nengjun Yi
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
|