101
Zahid FM, Faisal S, Heumann C. Multiple imputation with compatibility for high-dimensional data. PLoS One 2021; 16:e0254112. [PMID: 34237092 PMCID: PMC8266107 DOI: 10.1371/journal.pone.0254112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/02/2020] [Accepted: 06/20/2021] [Indexed: 11/18/2022] Open
Abstract
Multiple imputation (MI) is always challenging in high-dimensional settings. An imputation model built on a selected subset of predictors can be incompatible with the analysis model, leading to inconsistent and biased estimates. Although compatibility may not be achievable in such cases, one can obtain consistent and unbiased estimates using a semi-compatible imputation model. We propose to relax the lasso penalty so as to select a large set of variables (at most n). The substantive model, which also uses some formal variable selection procedure in high-dimensional structures, is then expected to be nested in this imputation model. The resulting imputation model will be semi-compatible with high probability. The likelihood estimates can be unstable and can face convergence issues as the number of variables becomes nearly as large as the sample size. To address these issues, we further propose to use a ridge penalty for obtaining the posterior distribution of the parameters based on the observed data. The proposed technique is compared with standard MI software and MI techniques available for high-dimensional data in simulation studies and a real-life dataset. Our results exhibit the superiority of the proposed approach over the existing MI approaches while addressing the compatibility issue.
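The stabilizing role of a ridge penalty mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' imputation procedure: it only shows, on hypothetical toy data with two perfectly collinear predictors, that the unpenalized normal equations have no unique solution while adding a penalty `lam` to the diagonal makes them solvable. The function name `ridge_2x2` and the data are assumptions made for illustration.

```python
def ridge_2x2(xtx, xty, lam):
    """Solve (X'X + lam * I) beta = X'y for two predictors by Cramer's rule."""
    a = xtx[0][0] + lam
    b = xtx[0][1]
    c = xtx[1][0]
    d = xtx[1][1] + lam
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular system: no unique least-squares solution")
    return [(d * xty[0] - b * xty[1]) / det,
            (a * xty[1] - c * xty[0]) / det]


# Hypothetical toy data: the second predictor duplicates the first.
x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# Cross-product matrix X'X and vector X'y.
xtx = [[sum(u * v for u, v in zip(x1, x1)), sum(u * v for u, v in zip(x1, x2))],
       [sum(u * v for u, v in zip(x2, x1)), sum(u * v for u, v in zip(x2, x2))]]
xty = [sum(u * v for u, v in zip(x1, y)), sum(u * v for u, v in zip(x2, y))]

# With lam > 0 the system is well posed; the two coefficients share the effect.
beta = ridge_2x2(xtx, xty, lam=1.0)
```

With `lam=0.0` the same call raises `ValueError`, because the determinant of X'X is exactly zero for these data; any positive `lam` yields a unique, slightly shrunken solution.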
Affiliation(s)
- Faisal Maqbool Zahid
- Department of Statistics, Government College University Faisalabad, Faisalabad, Pakistan
- Shahla Faisal
- Department of Statistics, Government College University Faisalabad, Faisalabad, Pakistan
- Christian Heumann
- Department of Statistics, Ludwig Maximilians University Munich, Munich, Germany
102
Mirzaei M, Furxhi I, Murphy F, Mullins M. A Machine Learning Tool to Predict the Antibacterial Capacity of Nanoparticles. NANOMATERIALS (BASEL, SWITZERLAND) 2021; 11:1774. [PMID: 34361160 PMCID: PMC8308172 DOI: 10.3390/nano11071774] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 05/20/2021] [Revised: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 12/22/2022]
Abstract
The emergence and rapid spread of multidrug-resistant bacterial strains are a public health concern. This emergence is caused by the overuse and misuse of antibiotics, leading to the evolution of antibiotic-resistant strains. Nanoparticles (NPs) are objects with all three external dimensions in the nanoscale, which ranges from 1 to 100 nm. Research on NPs with enhanced antimicrobial activity as alternatives to antibiotics has grown due to the increased incidence of nosocomial and community-acquired infections caused by pathogens. Machine learning (ML) tools have been used in the field of nanoinformatics with promising results. As a consequence of evident achievements on a wide range of predictive tasks, ML techniques are attracting significant interest across a variety of stakeholders. In this article, we present an ML tool that successfully predicts the antibacterial capacity of NPs, while the model's validation demonstrates encouraging results (R² = 0.78). The data were compiled after a literature review of 60 articles and consist of key physico-chemical (p-chem) properties and experimental conditions (exposure variables and bacterial clustering) from in vitro studies. Following data homogenization and pre-processing, we trained various regression algorithms and validated them using diverse performance metrics. Finally, an attribute importance evaluation, which ranks the attributes that are most important in predicting the outcome, was performed. The attribute importance revealed that NP core size, the exposure dose, and the species of bacterium are key variables in predicting the antibacterial effect of NPs. This tool assists various stakeholders and scientists in predicting the antibacterial effects of NPs based on their p-chem properties and diverse exposure settings. This concept also aids the safe-by-design paradigm by incorporating functionality tools.
Affiliation(s)
- Mahsa Mirzaei
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, V94PH93 Limerick, Ireland
- Irini Furxhi
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, V94PH93 Limerick, Ireland
- Transgero Limited, Cullinagh, Newcastle West, V42V384 Limerick, Ireland
- Finbarr Murphy
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, V94PH93 Limerick, Ireland
- Transgero Limited, Cullinagh, Newcastle West, V42V384 Limerick, Ireland
- Martin Mullins
- Department of Accounting and Finance, Kemmy Business School, University of Limerick, V94PH93 Limerick, Ireland
103
Köhler C, Robitzsch A, Fährmann K, von Davier M, Hartig J. A semiparametric approach for item response function estimation to detect item misfit. THE BRITISH JOURNAL OF MATHEMATICAL AND STATISTICAL PSYCHOLOGY 2021; 74 Suppl 1:157-175. [PMID: 33332585 DOI: 10.1111/bmsp.12224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2019] [Revised: 09/17/2020] [Indexed: 06/12/2023]
Abstract
When scaling data using item response theory, valid statements based on the measurement model are only permissible if the model fits the data. Most item fit statistics used to assess the fit between observed item responses and the item responses predicted by the measurement model show significant weaknesses, such as the dependence of fit statistics on sample size and number of items. In order to assess the size of misfit and to thus use the fit statistic as an effect size, dependencies on properties of the data set are undesirable. The present study describes a new approach and empirically tests it for consistency. We developed an estimator of the distance between the predicted item response functions (IRFs) and the true IRFs by semiparametric adaptation of IRFs. For the semiparametric adaptation, the approach of extended basis functions due to Ramsay and Silverman (2005) is used. The IRF is defined as the sum of a linear term and a more flexible term constructed via basis function expansions. The group lasso method is applied as a regularization of the flexible term, and determines whether all parameters of the basis functions are fixed at zero or freely estimated. Thus, the method serves as a selection criterion for items that should be adjusted semiparametrically. The distance between the predicted and semiparametrically adjusted IRF of misfitting items can then be determined by describing the fitting items by the parametric form of the IRF and the misfitting items by the semiparametric approach. In a simulation study, we demonstrated that the proposed method delivers satisfactory results in large samples (i.e., N ≥ 1,000).
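The all-or-nothing behaviour of the group lasso described in this abstract (all basis-function parameters of an item are either set to zero or kept) can be sketched with the standard block soft-thresholding operator. This is a generic illustration, not the authors' estimator: the function name `group_soft_threshold` and the numbers are assumptions, and in the plain group lasso the surviving coefficients are shrunk rather than freely estimated.

```python
import math


def group_soft_threshold(coefs, threshold):
    """Block soft-thresholding: zero the whole group or shrink it as a unit."""
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= threshold:
        # The group's Euclidean norm is too small: every coefficient is dropped.
        return [0.0] * len(coefs)
    # Otherwise the whole group is kept and scaled toward zero.
    scale = 1.0 - threshold / norm
    return [scale * c for c in coefs]


# Hypothetical basis coefficients for one item (Euclidean norm 5).
dropped = group_soft_threshold([3.0, 4.0], threshold=5.0)
kept = group_soft_threshold([3.0, 4.0], threshold=2.5)
```

An item whose flexible term is zeroed out is described by the parametric form alone, which is how the penalty can act as a selection criterion for items needing semiparametric adjustment.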
Affiliation(s)
- Carmen Köhler
- DIPF - Leibniz Institute for Research and Information in Education, Frankfurt, Germany
- Alexander Robitzsch
- IPN - Leibniz Institute for Science and Mathematics Education, Kiel, Germany
- Centre for International Student Assessment (ZIB), Munich, Germany
- Katharina Fährmann
- DIPF - Leibniz Institute for Research and Information in Education, Frankfurt, Germany
- Johannes Hartig
- DIPF - Leibniz Institute for Research and Information in Education, Frankfurt, Germany
104
Kim Y, Lee S, Jang JY, Lee S, Park T. Identifying miRNA-mRNA Integration Set Associated With Survival Time. Front Genet 2021; 12:634922. [PMID: 34267778 PMCID: PMC8276759 DOI: 10.3389/fgene.2021.634922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2020] [Accepted: 04/06/2021] [Indexed: 11/26/2022] Open
Abstract
In the “personalized medicine” era, one of the most difficult problems is the identification of combined markers from different omics platforms. Many methods have been developed to identify candidate markers for each type of omics data, but few methods facilitate the identification of multiple markers on multi-omics platforms. MicroRNAs (miRNAs) are well known to affect phenotypes only indirectly, by regulating mRNA expression and/or protein translation. To put this knowledge into practice, we suggest a miRNA-mRNA integration model for survival time analysis, called mimi-surv, which accounts for this biological relationship, to identify such integrated markers more efficiently. Through simulation studies, we found that the statistical power of mimi-surv was better than that of other models. Application to real datasets from Seoul National University Hospital and The Cancer Genome Atlas demonstrated that mimi-surv successfully identified miRNA-mRNA integration sets associated with progression-free survival of pancreatic ductal adenocarcinoma (PDAC) patients. Only mimi-surv found miR-96, a previously unidentified PDAC-related miRNA, in these two real datasets. Furthermore, mimi-surv was shown to identify more PDAC-related miRNAs than other methods because it used the known structure for miRNA-mRNA regularization. An implementation of mimi-surv is available at http://statgen.snu.ac.kr/software/mimi-surv.
Affiliation(s)
- Yongkang Kim
- Department of Statistics, Seoul National University, Seoul, South Korea
- Sungyoung Lee
- Center for Precision Medicine, Seoul National University Hospital, Seoul, South Korea; Department of Genomic Medicine, Seoul National University Hospital, Seoul, South Korea
- Jin-Young Jang
- Department of Surgery and Cancer Research Institute, Seoul National University College of Medicine, Seoul, South Korea
- Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Seoul, South Korea
- Taesung Park
- Department of Statistics, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
105
Li W, Chekouo T. Bayesian group selection with non-local priors. Comput Stat 2021. [DOI: 10.1007/s00180-021-01115-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
106
Chen D, Cremona MA, Qi Z, Mitra RD, Chiaromonte F, Makova KD. Human L1 Transposition Dynamics Unraveled with Functional Data Analysis. Mol Biol Evol 2021; 37:3576-3600. [PMID: 32722770 DOI: 10.1093/molbev/msaa194] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022] Open
Abstract
Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features (proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.) in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection (depleted of genes and noncoding most conserved elements). Intriguingly, our results suggest that L1 insertions modify the local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.
Affiliation(s)
- Di Chen
- Intercollege Graduate Degree Program in Genetics, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA
- Marzia A Cremona
- Department of Statistics, The Pennsylvania State University, University Park, PA; Department of Operations and Decision Systems, Université Laval, Québec, Canada
- Zongtai Qi
- Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Robi D Mitra
- Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Francesca Chiaromonte
- Department of Statistics, The Pennsylvania State University, University Park, PA; EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy; The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
- Kateryna D Makova
- The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA; Department of Biology, The Pennsylvania State University, University Park, PA
107
Oh B, Hwangbo S, Jung T, Min K, Lee C, Apio C, Lee H, Lee S, Moon MK, Kim SW, Park T. Prediction Models for the Clinical Severity of Patients With COVID-19 in Korea: Retrospective Multicenter Cohort Study. J Med Internet Res 2021; 23:e25852. [PMID: 33822738 PMCID: PMC8054775 DOI: 10.2196/25852] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/18/2020] [Revised: 02/04/2021] [Accepted: 03/18/2021] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Limited information is available about the present characteristics and dynamic clinical changes that occur in patients with COVID-19 during the early phase of the illness. OBJECTIVE This study aimed to develop and validate machine learning models based on clinical features to assess the risk of severe disease and triage for COVID-19 patients upon hospital admission. METHODS This retrospective multicenter cohort study included patients with COVID-19 who were released from quarantine until April 30, 2020, in Korea. A total of 5628 patients were included in the training and testing cohorts to train and validate the models that predict clinical severity and the duration of hospitalization, and the clinical severity score was defined at four levels: mild, moderate, severe, and critical. RESULTS Out of a total of 5601 patients, 4455 (79.5%), 330 (5.9%), 512 (9.1%), and 301 (5.4%) were included in the mild, moderate, severe, and critical levels, respectively. As risk factors for predicting critical patients, we selected older age, shortness of breath, a high white blood cell count, low hemoglobin levels, a low lymphocyte count, and a low platelet count. We developed 3 prediction models to classify clinical severity levels. For example, the prediction model with 6 variables yielded a predictive power of >0.93 for the area under the receiver operating characteristic curve. We developed a web-based nomogram, using these models. CONCLUSIONS Our prediction models, along with the web-based nomogram, are expected to be useful for the assessment of the onset of severe and critical illness among patients with COVID-19 and triage patients upon hospital admission.
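The area under the receiver operating characteristic curve reported in this abstract can be computed for a binary outcome as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below assumes a two-class setting (for instance critical versus non-critical, which is an assumption, since the study uses four severity levels); the function name `auc` and the scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks and true labels (1 = positive class).
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise form is quadratic in the sample size, which is fine for a sketch; a rank-based formulation gives the same value more efficiently on large cohorts.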
Affiliation(s)
- Bumjo Oh
- Department of Family Medicine, Seoul Metropolitan Government Seoul National University Boramae Medical Center, Seoul, Republic of Korea
- Suhyun Hwangbo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Taeyeong Jung
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Kyungha Min
- Department of Family Medicine, Seoul Metropolitan Government Seoul National University Boramae Medical Center, Seoul, Republic of Korea
- Chanhee Lee
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Catherine Apio
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Hyejin Lee
- Department of Family Medicine, Seoul National University Bundang Hospital, Gyeonggi-do, Republic of Korea
- Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Seoul, Republic of Korea
- Min Kyong Moon
- Department of Internal Medicine, Seoul Metropolitan Government Seoul National University Boramae Medical Center, Seoul, Republic of Korea
- Shin-Woo Kim
- Department of Internal Medicine, Kyungpook National University, Daegu, Republic of Korea
- Taesung Park
- Department of Statistics, Seoul National University, Seoul, Republic of Korea
108
Gao W, Shu T, Liu Q, Ling S, Guan Y, Liu S, Zhou L. Predictive Modeling of Lignin Content for the Screening of Suitable Poplar Genotypes Based on Fourier Transform-Raman Spectrometry. ACS OMEGA 2021; 6:8578-8587. [PMID: 33817518 PMCID: PMC8015071 DOI: 10.1021/acsomega.1c00400] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/22/2021] [Accepted: 03/03/2021] [Indexed: 05/26/2023]
Abstract
The quick and non-invasive evaluation of lignin from biomass has been the focus of much attention. Several types of spectroscopies, for example, near-infrared (NIR) and Fourier transform-Raman (FT-Raman), have been successfully applied to build quantitative predictive lignin models based on chemometrics. However, due to the effect of sample moisture content and ambient humidity on its signals, NIR spectroscopy requires sophisticated pre-testing preparation. In addition, the current FT-Raman predictive models require large variations in the independent value inputs as restrictions in the corresponding mathematical algorithms prevent the effective biomass screening of suitable genotypes for lignin contents within a narrow range. In order to overcome the limitations associated with the current methods, in this paper, we employed Raman spectra excited using a 1064 nm laser, thus avoiding the impact of water and auto-fluorescence on NIR signals. The optimal baseline correction method, data type, mathematical algorithm, and internal reference were selected in order to build quantitative lignin models based on the data with limited variation. The resulting two predictive models, constructed through lasso and ridge regressions, respectively, proved to be effective in assessing the lignin content of poplar in large-scale breeding and genetic engineering programs.
Affiliation(s)
- Wenli Gao
- School of Forestry and Landscape Architecture, Anhui Agriculture University, Hefei 230036, Anhui, China
- Key Lab of State Forest and Grassland Administration on Wood Quality Improvement & High Efficient Utilization, Hefei 230036, Anhui, China
- Ting Shu
- School of Physical Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Shanghai 201210, China
- Qiang Liu
- School of Physical Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Shanghai 201210, China
- Shengjie Ling
- School of Physical Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Shanghai 201210, China
- Ying Guan
- School of Forestry and Landscape Architecture, Anhui Agriculture University, Hefei 230036, Anhui, China
- Shengquan Liu
- School of Forestry and Landscape Architecture, Anhui Agriculture University, Hefei 230036, Anhui, China
- Key Lab of State Forest and Grassland Administration on Wood Quality Improvement & High Efficient Utilization, Hefei 230036, Anhui, China
- Liang Zhou
- School of Forestry and Landscape Architecture, Anhui Agriculture University, Hefei 230036, Anhui, China
- Key Lab of State Forest and Grassland Administration on Wood Quality Improvement & High Efficient Utilization, Hefei 230036, Anhui, China
109
Huang J, Jiao Y, Kang L, Liu J, Liu Y, Lu X. GSDAR: a fast Newton algorithm for $\ell_0$ regularized generalized linear models with statistical guarantee. Comput Stat 2021. [DOI: 10.1007/s00180-021-01098-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
110
Lu Z, Lou W. Bayesian approaches to variable selection: a comparative study from practical perspectives. Int J Biostat 2021; 18:83-108. [PMID: 33761580 DOI: 10.1515/ijb-2020-0130] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/11/2020] [Accepted: 02/27/2021] [Indexed: 11/15/2022]
Abstract
In many clinical studies, researchers are interested in parsimonious models that simultaneously achieve consistent variable selection and optimal prediction. The resulting parsimonious models facilitate meaningful biological interpretation and scientific findings. Variable selection via Bayesian inference has seen significant advances in recent years. Despite its increasing popularity, there is limited practical guidance for implementing these Bayesian approaches and evaluating their comparative performance in clinical datasets. In this paper, we review several commonly used Bayesian approaches to variable selection, with emphasis on application and implementation through R software. These approaches can be roughly categorized into four classes: Bayesian model selection, spike-and-slab priors, shrinkage priors, and a hybrid of the latter two. To evaluate their variable selection performance under various scenarios, we compare these four classes of approaches using real and simulated datasets. These results provide practical guidance to researchers who are interested in applying Bayesian approaches for the purpose of variable selection.
Affiliation(s)
- Zihang Lu
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
- Wendy Lou
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
111
Guo G. Taylor quasi-likelihood for limited generalized linear models. J Appl Stat 2021; 48:669-692. [DOI: 10.1080/02664763.2020.1743650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Guangbao Guo
- Department of Statistics, Shandong University of Technology, Zibo, People's Republic of China
112
Wang C, Gonzalez Y, Shen C, Hrycushko B, Jia X. Simultaneous needle catheter selection and dwell time optimization for preplanning of high-dose-rate brachytherapy of prostate cancer. Phys Med Biol 2021; 66:055028. [PMID: 33264753 DOI: 10.1088/1361-6560/abd00e] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
PURPOSE Needle catheter positions critically affect the quality of treatment plans in prostate cancer high-dose-rate (HDR) brachytherapy. The current standard needle positioning approach is based on human intuition, which cannot guarantee a high-quality plan. This study proposed a method to simultaneously select needle catheter positions and determine dwell time for preplanning of HDR brachytherapy of prostate cancer. METHODS We formulated the needle catheter selection problem and inverse dwell time optimization problem in a unified framework. In addition to the dose objectives of the planning target volume (PTV) and organs at risk (OARs), the objective function incorporated a group-sparsity term with a needle-specific adaptive weighting scheme to generate high-quality plans with the minimal number of needle catheters. The optimization problem was solved by a fast iterative shrinkage-thresholding algorithm. For validation purposes, we tested the proposed algorithm on 10 patient cases previously treated at our institution and compared the resulting plans with plans generated using needle catheters selected manually. RESULTS Compared to the plan with manually selected needle catheters, when normalizing both plans to the same PTV coverage V100% = 95%, the plans generated by the proposed algorithm reduced median V125% from 65% to 64%, but increased median V150% from 35% to 38%, and V200% from 14% to 16%. All planning objectives were met. All clinically important dosimetric parameters of OARs were reduced. D1cc of bladder and rectum were reduced from 8.57 Gy to 8.50 Gy and from 7.24 Gy to 6.80 Gy, respectively. Dmax of urethra was reduced from 15.85 Gy to 15.77 Gy. The median number of selected needle catheters was reduced by two. The computational time for solving the proposed optimization problem was ∼90 s using MATLAB.
CONCLUSION The proposed algorithm was able to generate plans for prostate cancer HDR brachytherapy preplanning with an increased median conformity index (from 0.73 to 0.77) and a slightly lower median homogeneity index (from 0.64 to 0.62), with the number of selected needles reduced by two compared to the manual needle selection approach.
Affiliation(s)
- Chao Wang
- Innovative Technology Of Radiotherapy Computation and Hardware (iTORCH) Laboratory, Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States of America. Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75287, United States of America
113
Frias M, Moyano JM, Rivero-Juarez A, Luna JM, Camacho Á, Fardoun HM, Machuca I, Al-Twijri M, Rivero A, Ventura S. Classification Accuracy of Hepatitis C Virus Infection Outcome: Data Mining Approach. J Med Internet Res 2021; 23:e18766. [PMID: 33624609 PMCID: PMC7946589 DOI: 10.2196/18766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2020] [Revised: 11/02/2020] [Accepted: 12/17/2020] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The dataset from genes used to predict hepatitis C virus outcome was evaluated in a previous study using a conventional statistical methodology. OBJECTIVE The aim of this study was to reanalyze this same dataset using the data mining approach in order to find models that improve the classification accuracy of the genes studied. METHODS We built predictive models using different subsets of factors, selected according to their importance in predicting patient classification. We then evaluated each independent model and also a combination of them, leading to a better predictive model. RESULTS Our data mining approach identified genetic patterns that escaped detection using conventional statistics. More specifically, the partial decision trees and ensemble models increased the classification accuracy of hepatitis C virus outcome compared with conventional methods. CONCLUSIONS Data mining can be used more extensively in biomedicine, facilitating knowledge building and management of human diseases.
Affiliation(s)
- Mario Frias
- Department of Clinical Virology and Zoonoses, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Jose M Moyano
- Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain
- Knowledge Discovery and Intelligent Systems in Biomedicine Laboratory, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Antonio Rivero-Juarez
- Department of Clinical Virology and Zoonoses, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Jose M Luna
- Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain
- Knowledge Discovery and Intelligent Systems in Biomedicine Laboratory, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Ángela Camacho
- Department of Clinical Virology and Zoonoses, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Habib M Fardoun
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Isabel Machuca
- Department of Clinical Virology and Zoonoses, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Mohamed Al-Twijri
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Antonio Rivero
- Department of Clinical Virology and Zoonoses, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
- Sebastian Ventura
- Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain
- Knowledge Discovery and Intelligent Systems in Biomedicine Laboratory, Maimonides Biomedical Research Institute of Córdoba, Córdoba, Spain
114
Li Y, Tang C, Lu J, Wu J, Chang EF. Human cortical encoding of pitch in tonal and non-tonal languages. Nat Commun 2021; 12:1161. [PMID: 33608548 PMCID: PMC7896081 DOI: 10.1038/s41467-021-21430-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 09/10/2020] [Accepted: 01/26/2021] [Indexed: 11/09/2022] Open
Abstract
Languages can use a common repertoire of vocal sounds to signify distinct meanings. In tonal languages, such as Mandarin Chinese, pitch contours of syllables distinguish one word from another, whereas in non-tonal languages, such as English, pitch is used to convey intonation. The neural computations underlying language specialization in speech perception are unknown. Here, we use a cross-linguistic approach to address this. Native Mandarin- and English-speaking participants each listened to both Mandarin and English speech, while neural activity was directly recorded from the non-primary auditory cortex. Both groups show language-general coding of speaker-invariant pitch at the single electrode level. At the electrode population level, we find language-specific distribution of cortical tuning parameters in Mandarin speakers only, with enhanced sensitivity to Mandarin tone categories. Our results show that speech perception relies upon a shared cortical auditory feature processing mechanism, which may be tuned to the statistics of a given language. Different languages rely on different vocal sounds to convey meaning. Here the authors show that language-general coding of pitch occurs in the non-primary auditory cortex for both tonal (Mandarin Chinese) and non-tonal (English) languages, with some language specificity on the population level.
Affiliation(s)
- Yuanning Li
- Department of Neurological Surgery, University of California, San Francisco, CA, USA
- Center for Integrative Neuroscience, University of California, San Francisco, CA, USA
| | - Claire Tang
- Department of Neurological Surgery, University of California, San Francisco, CA, USA
- Center for Integrative Neuroscience, University of California, San Francisco, CA, USA
| | - Junfeng Lu
- Brain Function Laboratory, Neurosurgical Institute of Fudan University, Shanghai, China
- Shanghai Key Laboratory of Brain Function Restoration and Neural Regeneration, Shanghai, China
| | - Jinsong Wu
- Brain Function Laboratory, Neurosurgical Institute of Fudan University, Shanghai, China
- Shanghai Key Laboratory of Brain Function Restoration and Neural Regeneration, Shanghai, China
- Neurologic Surgery Department, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China
- Institute of Brain-Intelligence Technology, Zhangjiang Lab, Shanghai, China
| | - Edward F Chang
- Department of Neurological Surgery, University of California, San Francisco, CA, USA
- Center for Integrative Neuroscience, University of California, San Francisco, CA, USA
| |
|
115
|
Sperger J, Shah KS, Lu M, Zhang X, Ungaro RC, Brenner EJ, Agrawal M, Colombel JF, Kappelman MD, Kosorok MR. Development and validation of multivariable prediction models for adverse COVID-19 outcomes in IBD patients. medRxiv 2021. [PMID: 33501455 PMCID: PMC7836127 DOI: 10.1101/2021.01.15.21249889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Importance Risk calculators can facilitate shared medical decision-making [1]. Demographics, comorbidities, medication use, geographic region, and other factors may increase the risk for COVID-19-related complications among patients with IBD [2,3]. Objectives Develop an individualized prognostic risk prediction tool for predicting the probability of adverse COVID-19 outcomes in patients with IBD. Design, Setting, and Participants This study developed and validated prognostic penalized logistic regression models [4] using reports to Surveillance Epidemiology of Coronavirus Under Research Exclusion for Inflammatory Bowel Disease (SECURE-IBD) from March–October 2020. Model development was done using a training data set (85% of cases reported March 13–September 15, 2020), and model validation was conducted using a test data set (the remaining 15% of cases plus all cases reported September 16–October 20, 2020). Main Outcomes and Measures COVID-19 related:
- Hospitalization+: composite outcome of hospitalization, ICU admission, mechanical ventilation, or death
- ICU+: composite outcome of ICU admission, mechanical ventilation, or death
- Death
We assessed the resulting models’ discrimination using the area under the curve (AUC) of the receiver-operator characteristic (ROC) curves and reported the corresponding 95% confidence intervals (CIs). Results We included 2709 cases from 59 countries (mean age 41.2 years [s.d. 18], 50.2% male). A total of 633 (24%) were hospitalized, 137 (5%) were admitted to the ICU or intubated, and 69 (3%) died. 2009 patients comprised the training set and 700 the test set. The models demonstrated excellent discrimination, with a test set AUC (95% CI) of 0.79 (0.75, 0.83) for Hospitalization+, 0.88 (0.82, 0.95) for ICU+, and 0.94 (0.89, 0.99) for Death. Age, comorbidities, corticosteroid use, and male gender were associated with higher risk of death, while use of biologic therapies was associated with a lower risk. Conclusions and Relevance Prognostic models can effectively predict who is at higher risk for COVID-19-related adverse outcomes in a population of IBD patients. A free online risk calculator (https://covidibd.org/covid-19-risk-calculator/) is available for healthcare providers to facilitate discussion of risks due to COVID-19 with IBD patients. The tool numerically and visually summarizes the patient’s probabilities of adverse outcomes and associated CIs. Helping physicians identify their highest-risk patients will be important in the coming months as cases rise in the US and worldwide. This tool can also serve as a model for risk stratification in other chronic diseases.
|
116
|
A panel of two miRNAs correlated to systolic blood pressure is a good diagnostic indicator for stroke. Biosci Rep 2021; 41:227391. [PMID: 33345284 PMCID: PMC7805026 DOI: 10.1042/bsr20203458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/03/2020] [Revised: 12/07/2020] [Accepted: 12/10/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND We aimed to develop a diagnostic indicator of stroke based on serum miRNAs correlated to systolic blood pressure. METHODS Using miRNA expression profiles in GSE117604 from the Gene Expression Omnibus (GEO), we used weighted gene co-expression network analysis (WGCNA) to identify hub miRNAs correlated to systolic blood pressure (SBP). Differential analysis was applied to highlight hub differentially expressed miRNAs (DE-miRNAs), from which we built a miRNA-based diagnostic indicator for stroke using bootstrap ranking Least Absolute Shrinkage and Selection Operator (LASSO) regression with 10-fold cross-validation. The classification value of the indicator was validated with receiver operating characteristic (ROC) analysis in both the training set and test set, as well as quantitative real-time PCR (qRT-PCR) for the feature miRNAs. Further, target genes of hub miRNAs and hub DE-miRNAs were retrieved for functional enrichment. RESULTS A total of 447 hub miRNAs in the blue modules were significantly correlated with systolic blood pressure (r = 0.32, false discovery rate = 10^-6). Target genes predicted with the hub miRNAs were mostly implicated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) terms including mitogen-activated protein kinase (MAPK) pathway, senescence, and TGF-β signaling pathway. The diagnostic indicator with miR-4420 and miR-6793-5p showed remarkable performance in the training set (area under curve [AUC] = 0.953), as well as in the test set (AUC = 0.894). Results of qRT-PCR validated the diagnostic value of the two miRNAs embedded in the proposed indicator. CONCLUSIONS We developed a panel of two miRNAs, which is a good diagnostic indicator for stroke. These results require further investigation.
|
117
|
Dang X, Huang S, Qian X. Risk Factor Identification in Heterogeneous Disease Progression with L1-Regularized Multi-state Models. Journal of Healthcare Informatics Research 2021; 5:20-53. [DOI: 10.1007/s41666-020-00085-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2020] [Revised: 10/13/2020] [Accepted: 11/26/2020] [Indexed: 10/22/2022]
|
118
|
Kenney A, Chiaromonte F, Felici G. MIP-BOOST: Efficient and Effective L0 Feature Selection for Linear Regression. J Comput Graph Stat 2021; 30:566-577. [DOI: 10.1080/10618600.2020.1845184] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/23/2022]
Affiliation(s)
- Ana Kenney
- Department of Statistics, Penn State University, University Park, PA
| | - Francesca Chiaromonte
- Department of Statistics, Penn State University, University Park, PA
- Institute of Economics & EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy
| | - Giovanni Felici
- Istituto di Analisi dei Sistemi ed Informatica, Consiglio Nazionale delle Ricerche, Rome, Italy
| |
|
119
|
Zhu G, Zhao T. Deep-gKnock: Nonlinear group-feature selection with deep neural networks. Neural Netw 2021; 135:139-147. [PMID: 33385830 DOI: 10.1016/j.neunet.2020.12.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/08/2020] [Revised: 11/26/2020] [Accepted: 12/02/2020] [Indexed: 01/21/2023]
Abstract
Feature selection is central to contemporary high-dimensional data analysis. Group structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the group structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we design a new Deep Neural Network (DNN) architecture and integrate it with the recently proposed knockoff technique to perform nonlinear group-feature selection with controlled group-wise False Discovery Rate (gFDR). Experimental results on high-dimensional synthetic data demonstrate that our method achieves the highest power and accurate gFDR control compared with state-of-the-art methods. The performance of Deep-gKnock is especially superior in the following five situations: (1) nonlinear relationships; (2) dimension p greater than sample size n; (3) high between-group correlation; (4) high within-group correlation; (5) a large number of associated groups. Deep-gKnock is also shown to be robust to misspecification of the feature distribution and to changes in network architecture. Moreover, Deep-gKnock achieves scientifically meaningful group-feature selection results for cutting-edge real-world datasets.
Affiliation(s)
- Guangyu Zhu
- Department of Computer Science and Statistics, University of Rhode Island, United States of America.
| | - Tingting Zhao
- Department of Electrical and Computer Engineering, Northeastern University, United States of America
| |
|
120
|
An improved lasso regression model for evaluating the efficiency of intervention actions in a system reliability analysis. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05537-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 10/22/2022]
|
121
|
|
122
|
Li M, Kong L, Su Z. Double fused Lasso regularized regression with both matrix and vector valued predictors. Electron J Stat 2021. [DOI: 10.1214/21-ejs1829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Mei Li
- Department of Applied Mathematics, Beijing Jiaotong University
| | - Lingchen Kong
- Department of Applied Mathematics, Beijing Jiaotong University
| | - Zhihua Su
- Department of Statistics, University of Florida
| |
|
123
|
MOSS-Multi-Modal Best Subset Modeling in Smart Manufacturing. Sensors 2021; 21:s21010243. [PMID: 33401493 PMCID: PMC7796348 DOI: 10.3390/s21010243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2020] [Revised: 12/28/2020] [Accepted: 12/28/2020] [Indexed: 11/23/2022]
Abstract
Smart manufacturing, which integrates a multi-sensing system with physical manufacturing processes, has been widely adopted in industry to support online and real-time decision making to improve manufacturing quality. A multi-sensing system for each specific manufacturing process can efficiently collect the in situ process variables from different sensor modalities to reflect the process variations in real time. In practice, however, cost considerations usually limit the number of sensors that can be installed in each manufacturing process. Moreover, it is also important to better interpret the relationship between the sensing modalities and the quality variables based on the model. Therefore, it is necessary to model the quality-process relationship by selecting, from the multi-modal sensing system, the sensor modalities most relevant to the specific quality measurement. In this research, we adopted the concept of best subset variable selection and proposed a new model called Multi-mOdal beSt Subset modeling (MOSS). The proposed MOSS can effectively select the important sensor modalities and improve the modeling accuracy in quality-process modeling via functional norms that characterize the overall effects of individual modalities. The significance of sensor modalities can be used to determine the sensor placement strategy in smart manufacturing. Moreover, the selected modalities can better interpret the quality-process model by identifying the most correlated root cause of quality variations. The merits of the proposed model are illustrated by both simulations and a real case study in an additive manufacturing (i.e., fused deposition modeling) process.
|
124
|
Conditional score matching for high-dimensional partial graphical models. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
125
|
Qiu H, Li Y, Cheng S, Li J, He C, Li J. A Prognostic Microenvironment-Related Immune Signature via ESTIMATE (PROMISE Model) Predicts Overall Survival of Patients With Glioma. Front Oncol 2020; 10:580263. [PMID: 33425732 PMCID: PMC7793983 DOI: 10.3389/fonc.2020.580263] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 07/05/2020] [Accepted: 10/22/2020] [Indexed: 12/13/2022] Open
Abstract
Objective In the development of immunotherapies in gliomas, the tumor microenvironment (TME) needs to be investigated. We aimed to construct a prognostic microenvironment-related immune signature via ESTIMATE (PROMISE model) for glioma. Methods Stromal score (SS) and immune score (IS) were calculated via ESTIMATE for each glioma sample in The Cancer Genome Atlas (TCGA), and differentially expressed genes (DEGs) were identified between high-score and low-score groups. Prognostic DEGs were selected via univariate Cox regression analysis. Using the lower-grade glioma (LGG) data set in TCGA, we performed LASSO regression based on the prognostic DEGs and constructed a PROMISE model for glioma. The model was validated with survival analysis and the receiver operating characteristic (ROC) in TCGA glioma data sets (LGG, glioblastoma multiforme [GBM] and LGG+GBM) and the Chinese Glioma Genome Atlas (CGGA). A nomogram was developed to predict individual survival chances. Further, we explored the underlying mechanisms using gene set enrichment analysis (GSEA) and Cibersort analysis of tumor-infiltrating immune cells between risk groups as defined by the PROMISE model. Results We obtained 220 upregulated DEGs and 42 downregulated DEGs in both high-IS and high-SS groups. The Cox regression highlighted 155 prognostic DEGs, out of which we selected 4 genes (CD86, ANXA1, C5AR1, and CD5) to construct a PROMISE model. The model stratifies glioma patients in TCGA as well as in CGGA with distinct survival outcome (P<0.05, Hazard ratio [HR]>1) and acceptable predictive accuracy (AUCs>0.6). With the nomogram, an individualized survival chance could be predicted intuitively with specific age, tumor grade, isocitrate dehydrogenase (IDH) status, and the PROMISE risk score. ROC showed significant discrimination with areas under the curve (AUCs) of 0.917 and 0.817 in TCGA and CGGA, respectively.
GSEA between risk groups in both data sets showed significant enrichment in multiple immune-related pathways. The Cibersort analysis highlighted four immune cells, i.e., CD8 T cells, neutrophils, follicular helper T (Tfh) cells, and natural killer (NK) cells. Conclusions The PROMISE model can further stratify both LGG and GBM patients with distinct survival outcomes. These findings may help further our understanding of TME in gliomas and shed light on immunotherapies.
Affiliation(s)
- Huaide Qiu
- Center of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Yongqiang Li
- Center of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Shupeng Cheng
- Center of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Jiahui Li
- Center of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Chuan He
- Department of Rehabilitation Medicine, The Affiliated Jiangsu Shengze Hospital of Nanjing Medical University, Suzhou, China
| | - Jianan Li
- Center of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| |
|
126
|
|
127
|
Huang A, Wu F. Two-stage adaptive integration of multi-source heterogeneous data based on an improved random subspace and prediction of default risk of microcredit. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05489-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
|
128
|
Bastien B, Boukhobza T, Dumond H, Gégout-Petit A, Muller-Gueudin A, Thiébaut C. A statistical methodology to select covariates in high-dimensional data under dependence. Application to the classification of genetic profiles in oncology. J Appl Stat 2020; 49:764-781. [DOI: 10.1080/02664763.2020.1837083] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- B. Bastien
- Transgene S.A., Illkirch-Graffenstaden Cedex, France
| | - T. Boukhobza
- Université de Lorraine, CNRS, CRAN, Nancy, France
| | - H. Dumond
- Université de Lorraine, CNRS, CRAN, Nancy, France
| | | | | | - C. Thiébaut
- Université de Lorraine, CNRS, CRAN, Nancy, France
| |
|
129
|
AlJame M, Ahmad I, Imtiaz A, Mohammed A. Ensemble learning model for diagnosing COVID-19 from routine blood tests. Informatics in Medicine Unlocked 2020; 21:100449. [PMID: 33102686 PMCID: PMC7572278 DOI: 10.1016/j.imu.2020.100449] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 08/11/2020] [Revised: 09/28/2020] [Accepted: 10/07/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND AND OBJECTIVES The pandemic of novel coronavirus disease 2019 (COVID-19) has severely impacted human society with a massive death toll worldwide. There is an urgent need for early and reliable screening of COVID-19 patients to provide better and timely patient care and to combat the spread of the disease. In this context, recent studies have reported some key advantages of using routine blood tests for initial screening of COVID-19 patients. In this article, first we present a review of the emerging techniques for COVID-19 diagnosis using routine laboratory and/or clinical data. Then, we propose ERLX, an ensemble learning model for COVID-19 diagnosis from routine blood tests. METHOD The proposed model uses three well-known diverse classifiers at the first level (extra trees, random forest, and logistic regression), which have different architectures and learning characteristics, and then combines their predictions using a second-level extreme gradient boosting (XGBoost) classifier to achieve better performance. For data preparation, the proposed methodology employs a KNNImputer algorithm to handle null values in the dataset, isolation forest (iForest) to remove outlier data, and a synthetic minority oversampling technique (SMOTE) to balance data distribution. For model interpretability, feature importances are reported using the SHapley Additive exPlanations (SHAP) technique. RESULTS The proposed model was trained and evaluated using a publicly available data set from Albert Einstein Hospital in Brazil, which consisted of 5644 data samples with 559 confirmed COVID-19 cases. The ensemble model achieved outstanding performance with an overall accuracy of 99.88% [95% CI: 99.6-100], AUC of 99.38% [95% CI: 97.5-100], a sensitivity of 98.72% [95% CI: 94.6-100] and a specificity of 99.99% [95% CI: 99.99-100].
DISCUSSION The proposed model showed better performance than existing state-of-the-art studies (Banerjee et al., 2020; de Freitas Barbosa et al., 2020; de Moraes Batista et al., 2020; Soares et al., 2020) [3,22,56,71] on the same sets of features that they employed. Compared with the average accuracy of 95.159% for the best-performing Bayes Net model (de Freitas Barbosa et al., 2020) [22], ERLX achieved an average accuracy of 99.94%. Compared with the AUC of 85% reported for the SVM model (de Moraes Batista et al., 2020) [56], ERLX obtained an AUC of 99.77%, in addition to improvements in sensitivity and specificity. Compared with the average sensitivity of 70.25% and specificity of 85.98% of the ER-COV model (Soares et al., 2020) [71], the ERLX model achieved a sensitivity of 99.47% and a specificity of 99.99%. The ERLX model obtained considerably higher scores than the ANN model (Banerjee et al., 2020) [3] in all performance metrics. Therefore, the model presented is robust and can be deployed for reliable early and rapid screening of COVID-19 patients.
Affiliation(s)
- Maryam AlJame
- Computer Engineering Department, Kuwait University, Kuwait
| | - Imtiaz Ahmad
- Computer Engineering Department, Kuwait University, Kuwait
| | | | - Ameer Mohammed
- Computer Engineering Department, Kuwait University, Kuwait
| |
|
130
|
Zhou J, Viles WD, Lu B, Li Z, Madan JC, Karagas MR, Gui J, Hoen AG. Identification of microbial interaction network: zero-inflated latent Ising model based approach. BioData Min 2020; 13:16. [PMID: 33042226 PMCID: PMC7542390 DOI: 10.1186/s13040-020-00226-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/02/2020] [Accepted: 09/22/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Throughout their lifespans, humans continually interact with the microbial world, including those organisms which live in and on the human body. Research in this domain has revealed the extensive links between the human-associated microbiota and health. In particular, the microbiota of the human gut plays essential roles in digestion, nutrient metabolism, immune maturation and homeostasis, neurological signaling, and endocrine regulation. Microbial interaction networks are frequently estimated from data and are an indispensable tool for representing and understanding the conditional correlation between the microbes. In this high-dimensional setting, the zero-inflation and unit-sum constraints of relative abundance data pose challenges to the reliable estimation of microbial interaction networks. METHODS AND RESULTS To identify the microbial interaction network, the zero-inflated latent Ising (ZILI) model is proposed, which assumes that the distribution of relative abundance relies only on finite latent states and provides a novel way to solve issues induced by the unit-sum and zero-inflation constraints. A two-step algorithm is proposed for the model selection of ZILI. ZILI is evaluated through simulated data and subsequently applied to an infant gut microbiota dataset from the New Hampshire Birth Cohort Study. The results are compared with results from the Gaussian graphical model (GGM) and the dichotomous Ising model (DIS). Provided that ZILI is the true data-generating model, the simulation studies show that the two-step algorithm can identify the graphical structure effectively and is robust to a range of parameter settings. For the infant gut microbiota dataset, the final estimated networks from GGM and ZILI turn out to have significant overlap, with ZILI tending to select a sparser network than GGM. From the shared subnetwork, a hub taxon, Lachnospiraceae, is identified whose involvement in human disease development has been discovered recently in the literature.
CONCLUSIONS Constraints induced by relative abundance of microbiota, such as zero inflation and unit sum, render the conditional correlation analysis unreliable for conventional methods such as GGM. The proposed optimal categoricalization based ZILI model provides an alternative yet elegant way to deal with these difficulties. The results from ZILI have reasonable biological interpretation. This model can also be used to study the microbial interaction in other body parts.
Affiliation(s)
- Jie Zhou
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Weston D. Viles
- Department of Mathematics and Statistics, University of Southern Maine, Portland, ME USA
| | - Boran Lu
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Zhigang Li
- Department of Biostatistics, University of Florida, Gainesville, FL USA
| | - Juliette C. Madan
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Margaret R. Karagas
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Anne G. Hoen
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| |
|
131
|
Culos A, Tsai AS, Stanley N, Becker M, Ghaemi MS, McIlwain DR, Fallahzadeh R, Tanada A, Nassar H, Espinosa C, Xenochristou M, Ganio E, Peterson L, Han X, Stelzer IA, Ando K, Gaudilliere D, Phongpreecha T, Marić I, Chang AL, Shaw GM, Stevenson DK, Bendall S, Davis KL, Fantl W, Nolan GP, Hastie T, Tibshirani R, Angst MS, Gaudilliere B, Aghaeepour N. Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions. Nat Mach Intell 2020; 2:619-628. [PMID: 33294774 PMCID: PMC7720904 DOI: 10.1038/s42256-020-00232-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 01/13/2020] [Accepted: 08/26/2020] [Indexed: 12/17/2022]
Abstract
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.
Affiliation(s)
- Anthony Culos
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- These authors contributed equally: Anthony Culos, Amy S. Tsai
| | - Amy S Tsai
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally: Anthony Culos, Amy S. Tsai
| | - Natalie Stanley
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Martin Becker
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Mohammad S Ghaemi
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Digital Technologies Research Centre, National Research Council Canada, Toronto, Ontario, Canada
| | - David R McIlwain
- Department of Microbiology and Immunology, Baxter Laboratory in Stem Cell Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - Ramin Fallahzadeh
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Athena Tanada
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Huda Nassar
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Camilo Espinosa
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Maria Xenochristou
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Edward Ganio
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Laura Peterson
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Xiaoyuan Han
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Ina A Stelzer
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Kazuo Ando
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Dyani Gaudilliere
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Thanaphong Phongpreecha
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Ivana Marić
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Alan L Chang
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Gary M Shaw
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- David K Stevenson
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Sean Bendall
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Kara L Davis
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Wendy Fantl
- Department of Microbiology and Immunology, Baxter Laboratory in Stem Cell Biology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Obstetrics and Gynecology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Urology, Stanford University School of Medicine, Stanford, CA, USA
- Garry P Nolan
- Department of Pathology, Stanford University School of Medicine, Stanford, CA, USA
- Trevor Hastie
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Department of Statistics, Stanford University, Stanford, CA, USA
- Robert Tibshirani
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Department of Statistics, Stanford University, Stanford, CA, USA
- Martin S Angst
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- These authors jointly supervised this work: Martin S. Angst, Brice Gaudilliere, Nima Aghaeepour
- Brice Gaudilliere
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- These authors jointly supervised this work: Martin S. Angst, Brice Gaudilliere, Nima Aghaeepour
- Nima Aghaeepour
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- These authors jointly supervised this work: Martin S. Angst, Brice Gaudilliere, Nima Aghaeepour
132
Davagdorj K, Pham VH, Theera-Umpon N, Ryu KH. XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2020; 17:ijerph17186513. [PMID: 32906777 PMCID: PMC7558165 DOI: 10.3390/ijerph17186513] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 07/22/2020] [Revised: 08/28/2020] [Accepted: 09/05/2020] [Indexed: 12/23/2022]
Abstract
Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and cause of death globally. In the last decade, numerous studies have proposed using artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models are rather challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework incorporating a hybrid feature selection (HFS) method for SiNCDs prediction among the general population in South Korea and the United States. Initially, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis serves to obtain dissimilar features; (III) final selection of the best representative features is done based on the least absolute shrinkage and selection operator (LASSO). Then, the selected features are fed into the XGBoost predictive model. The experimental results show that our proposed model outperforms several existing baseline models. In addition, the proposed model also provides important features in order to enhance the interpretability of the SiNCDs prediction model. Consequently, the XGBoost based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
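The first two stages of the hybrid feature selection described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: `welch_t`, `filter_features`, and the thresholds `t_min` and `r_max` are hypothetical names and values, a fixed cut-off on the Welch t statistic stands in for the t-test and chi-square p-values, and pairwise correlation pruning stands in for the multicollinearity analysis. The LASSO stage and the XGBoost model are omitted.

```python
import numpy as np

def welch_t(x0, x1):
    """Welch t statistic per column for two groups of samples."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    v0, v1 = x0.var(axis=0, ddof=1), x1.var(axis=0, ddof=1)
    return (m1 - m0) / np.sqrt(v0 / len(x0) + v1 / len(x1))

def filter_features(X, y, t_min=2.0, r_max=0.9):
    """Return indices of features that pass both filter stages (assumed thresholds)."""
    # Stage I: keep features whose class means differ clearly, strongest first.
    t = np.abs(welch_t(X[y == 0], X[y == 1]))
    candidates = [j for j in np.argsort(-t) if t[j] >= t_min]
    # Stage II: greedily drop any feature that is highly correlated
    # with a stronger feature that has already been kept.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in candidates:
        if all(corr[j, i] < r_max for i in kept):
            kept.append(j)
    return sorted(int(j) for j in kept)

# Synthetic demonstration data (made up for illustration only).
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
f0 = 1.5 * y + rng.standard_normal(n)    # informative
f1 = f0 + 0.01 * rng.standard_normal(n)  # near-duplicate of f0
f2 = -1.0 * y + rng.standard_normal(n)   # informative, weakly related to f0
f3 = rng.standard_normal(n)              # pure noise
X = np.column_stack([f0, f1, f2, f3])

selected = filter_features(X, y)
print(selected)
```

In this toy setup only one of the two near-duplicate columns survives stage II, while the independent informative column is retained.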
Affiliation(s)
- Khishigsuren Davagdorj
- Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
- Van Huy Pham
- Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
- Nipon Theera-Umpon
- Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand
- Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand
- Keun Ho Ryu
- Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
- Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand
- Correspondence: Tel.: +82-10-4930-1500
133
Detmer FJ, Hadad S, Chung BJ, Mut F, Slawski M, Juchler N, Kurtcuoglu V, Hirsch S, Bijlenga P, Uchiyama Y, Fujimura S, Yamamoto M, Murayama Y, Takao H, Koivisto T, Frösen J, Cebral JR. Extending statistical learning for aneurysm rupture assessment to Finnish and Japanese populations using morphology, hemodynamics, and patient characteristics. Neurosurg Focus 2020; 47:E16. [PMID: 31261120 DOI: 10.3171/2019.4.focus19145] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/28/2019] [Accepted: 04/09/2019] [Indexed: 11/06/2022]
Abstract
OBJECTIVE: Incidental aneurysms pose a challenge for physicians, who need to weigh the rupture risk against the risks associated with treatment and its complications. A statistical model could potentially support such treatment decisions. A recently developed aneurysm rupture probability model performed well in the US data used for model training and in data from two European cohorts for external validation. Because Japanese and Finnish patients are known to have a higher aneurysm rupture risk, the authors' goals in the present study were to evaluate this model using data from Japanese and Finnish patients and to compare it with new models trained with Finnish and Japanese data. METHODS: Patient and image data on 2129 aneurysms in 1472 patients were used. Of these aneurysm cases, 1631 had been collected mainly from US hospitals, 249 from European (other than Finnish) hospitals, 147 from Japanese hospitals, and 102 from Finnish hospitals. Computational fluid dynamics simulations and shape analyses were conducted to quantitatively characterize each aneurysm's shape and hemodynamics. Next, the previously developed model's discrimination was evaluated using the Finnish and Japanese data in terms of the area under the receiver operating characteristic curve (AUC). Models with and without interaction terms between patient population and aneurysm characteristics were trained and evaluated including data from all four cohorts obtained by repeatedly randomly splitting the data into training and test data. RESULTS: The US model's AUC was reduced to 0.70 and 0.72, respectively, in the Finnish and Japanese data compared to 0.82 and 0.86 in the European and US data. When training the model with Japanese and Finnish data, the average AUC increased only slightly for the Finnish sample (to 0.76 ± 0.16) and Finnish and Japanese cases combined (from 0.74 to 0.75 ± 0.14) and decreased for the Japanese data (to 0.66 ± 0.33). In models including interaction terms, the AUC in the Finnish and Japanese data combined increased significantly to 0.83 ± 0.10. CONCLUSIONS: Developing an aneurysm rupture prediction model that applies to Japanese and Finnish aneurysms requires including data from these two cohorts for model training, as well as interaction terms between patient population and the other variables in the model. When including this information, the performance of such a model with Japanese and Finnish data is close to its performance with US or European data. These results suggest that population-specific differences determine how hemodynamics and shape associate with rupture risk in intracranial aneurysms.
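The AUC used above to measure discrimination can be read as the fraction of (ruptured, unruptured) pairs that a model ranks in the right order, with ties counted as half. A minimal sketch of that pairwise definition follows; the function name `auc_score` is hypothetical and the labels and scores are made up, not taken from the study.

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]    # predicted probabilities for ruptured cases
    neg = scores[~labels]   # predicted probabilities for unruptured cases
    diff = pos[:, None] - neg[None, :]
    # A pair is concordant when the positive scores higher; ties count half.
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Illustrative values only: 1 = ruptured, 0 = unruptured.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

The pairwise form is quadratic in the number of cases, which is fine for cohorts of this size; a rank-based formula would be preferable for very large samples.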
Affiliation(s)
- Bong Jae Chung
- Department of Mathematical Sciences, Montclair State University, Montclair, New Jersey
- Martin Slawski
- Statistics Department, George Mason University, Fairfax, Virginia
- Norman Juchler
- Institute of Applied Simulation, ZHAW University of Applied Sciences, Wädenswil, Switzerland
- The Interface Group, Institute of Physiology, University of Zürich, Switzerland
- Vartan Kurtcuoglu
- The Interface Group, Institute of Physiology, University of Zürich, Switzerland
- Sven Hirsch
- Institute of Applied Simulation, ZHAW University of Applied Sciences, Wädenswil, Switzerland
- Philippe Bijlenga
- Clinical Neurosciences Department, University of Geneva, Switzerland
- Yuya Uchiyama
- Graduate School of Mechanical Engineering, Tokyo University of Science, Tokyo, Japan
- Department of Innovation for Medical Information Technology
- Soichiro Fujimura
- Graduate School of Mechanical Engineering, Tokyo University of Science, Tokyo, Japan
- Department of Innovation for Medical Information Technology
- Makoto Yamamoto
- Department of Mechanical Engineering, Tokyo University of Science, Tokyo, Japan
- Yuichi Murayama
- Department of Neurosurgery, The Jikei University of Medicine, Tokyo, Japan
- Hiroyuki Takao
- Graduate School of Mechanical Engineering, Tokyo University of Science, Tokyo, Japan
- Departments of Innovation for Medical Information Technology and Neurosurgery, The Jikei University of Medicine, Tokyo, Japan
- Timo Koivisto
- Hemorrhagic Brain Pathology Research Group, Department of Neurosurgery, Kuopio University Hospital, Kuopio, Finland
- Juhana Frösen
- Hemorrhagic Brain Pathology Research Group, Department of Neurosurgery, Kuopio University Hospital, Kuopio, Finland
134
Yao Y, Luo XL. Improving vertical positioning accuracy with the weighted multinomial logistic regression classifier. SN APPLIED SCIENCES 2020. [DOI: 10.1007/s42452-020-03240-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022] Open
135
Stieger M, Eck M, Rüegger D, Kowatsch T, Flückiger C, Allemand M. Who wants to become more conscientious, more extraverted, or less neurotic with the help of a digital intervention? JOURNAL OF RESEARCH IN PERSONALITY 2020. [DOI: 10.1016/j.jrp.2020.103983] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
136
Alhamzawi R, Taha Mohammad Ali H. A new Gibbs sampler for Bayesian lasso. COMMUN STAT-SIMUL C 2020. [DOI: 10.1080/03610918.2018.1508699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Rahim Alhamzawi
- Department of Statistics, University of Al-Qadisiyah, Al-Qadisiyah, Iraq
137
Jiang J, Wang C, Wu J, Qin W, Xu M, Yin E. Temporal Combination Pattern Optimization Based on Feature Selection Method for Motor Imagery BCIs. Front Hum Neurosci 2020; 14:231. [PMID: 32714167 PMCID: PMC7344307 DOI: 10.3389/fnhum.2020.00231] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 03/14/2020] [Accepted: 05/25/2020] [Indexed: 11/19/2022] Open
Abstract
Common spatial pattern (CSP) method is widely used for spatial filtering and brain pattern extraction from electroencephalogram (EEG) signals in motor imagery (MI)-based brain-computer interfaces (BCIs). The participant-specific time window relative to the visual cue has a significant impact on the effectiveness of the CSP. However, the time window is usually selected experientially or manually. To solve this problem, we propose a novel feature selection approach for MI-based BCIs. Specifically, multiple time segments were obtained by decomposing each EEG sample of the MI task. Furthermore, the features were extracted by CSP from each time segment and were combined to form a new feature vector. Finally, the optimal temporal combination patterns for the new feature vector were selected based on four feature selection algorithms, i.e., mutual information, least absolute shrinkage and selection operator, principal component analysis and stepwise linear discriminant analysis (denoted as MUIN, LASSO, PCA, and SWLDA, respectively), and the classification algorithm was employed to evaluate the average classification accuracy. With three BCI competition datasets, the results of the four proposed algorithms were compared with traditional CSP algorithm in classification accuracy. Experimental results show that compared with traditional algorithm, the proposed methods significantly improve performance. Specifically, the LASSO achieved the highest accuracy (88.58%) among the proposed methods. Importantly, the average classification accuracies using the proposed approaches significantly improved 10.14% (MUIN), 11.40% (LASSO), 6.08% (PCA), and 10.25% (SWLDA) compared to that using CSP. These results indicate that the proposed approach is expected to be practical in MI-based BCIs.
Affiliation(s)
- Jing Jiang
- National Key Laboratory of Human Factors Engineering, China Astronaut Research and Training Center, Beijing, China
- Chunhui Wang
- National Key Laboratory of Human Factors Engineering, China Astronaut Research and Training Center, Beijing, China
- Jinghan Wu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Wei Qin
- Unmanned Systems Research Center, National Innovation Institute of Defense Technology, Academy of Military Sciences China, Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
- Minpeng Xu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Erwei Yin
- Unmanned Systems Research Center, National Innovation Institute of Defense Technology, Academy of Military Sciences China, Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
138
An B, Zhang B. Logistic regression with image covariates via the combination of L1 and Sobolev regularizations. PLoS One 2020; 15:e0234975. [PMID: 32589677 PMCID: PMC7319310 DOI: 10.1371/journal.pone.0234975] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/31/2019] [Accepted: 06/06/2020] [Indexed: 11/19/2022] Open
Abstract
The use of image covariates to build a classification model has lots of impact in various fields, such as computer science, medicine, and so on. The aim of this paper is to develop an estimation method for logistic regression model with image covariates. We propose a novel regularized estimation approach, where the regularization is a combination of L1 regularization and Sobolev norm regularization. The L1 penalty can perform variable selection, while the Sobolev norm penalty can capture the shape edges information of image data. We develop an efficient algorithm for the optimization problem. We also establish a nonasymptotic error bound on parameter estimation. Simulated studies and a real data application demonstrate that our proposed method performs very well.
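One plausible way to write down the kind of estimator the abstract describes is shown below. This is an assumed illustration, not the formulation from the paper: the symbols $X_i$ (image covariate), $B$ (coefficient image), $\lambda_1$, $\lambda_2$, and the use of a discrete first-difference penalty as the Sobolev term are all choices made here for concreteness, and the paper's exact norm may differ.

```latex
\hat{B} \;=\; \arg\min_{B \in \mathbb{R}^{p \times q}}
  \; -\frac{1}{n} \sum_{i=1}^{n}
      \Big[ y_i \langle X_i, B \rangle
            - \log\!\big( 1 + e^{\langle X_i, B \rangle} \big) \Big]
  \;+\; \lambda_1 \sum_{j,k} \lvert B_{jk} \rvert
  \;+\; \lambda_2 \sum_{j,k}
      \Big[ (B_{j+1,k} - B_{jk})^2 + (B_{j,k+1} - B_{jk})^2 \Big]
```

The first term is the negative logistic log-likelihood with $\langle X_i, B \rangle = \sum_{j,k} X_{i,jk} B_{jk}$, the $\lambda_1$ term encourages sparsity (variable selection), and the $\lambda_2$ term penalizes differences between neighboring pixels so that the estimated coefficient image varies smoothly.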
Affiliation(s)
- Baiguo An
- School of Statistics, Capital University of Economics and Business, Beijing, China
- Beibei Zhang
- School of Statistics, Capital University of Economics and Business, Beijing, China
139
Stanciu A, Banciu M, Sadighi A, Marshall KA, Holland NR, Abedi V, Zand R. A predictive analytics model for differentiating between transient ischemic attacks (TIA) and its mimics. BMC Med Inform Decis Mak 2020; 20:112. [PMID: 32552700 PMCID: PMC7302339 DOI: 10.1186/s12911-020-01154-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/07/2020] [Accepted: 06/12/2020] [Indexed: 12/22/2022] Open
Abstract
Background: Transient ischemic attack (TIA) is a brief episode of neurological dysfunction resulting from cerebral ischemia not associated with permanent cerebral infarction. TIA is associated with high diagnostic errors because of the subjective nature of findings and the lack of clinical and imaging biomarkers. The goal of this study was to design and evaluate a novel multinomial classification model, based on a combination of feature selection mechanisms coupled with logistic regression, to predict the likelihood of TIA, TIA mimics, and minor stroke. Methods: We conducted our modeling on consecutive patients who were evaluated in our health system with an initial diagnosis of TIA in a 9-month period. We established the final diagnoses after the clinical evaluation by independent verification from two stroke neurologists. We used Recursive Feature Elimination (RFE) and Least Absolute Shrinkage and Selection Operator (LASSO) for prediction modeling. Results: The RFE-based classifier correctly predicts 78% of the overall observations. In particular, the classifier correctly identifies 68% of the cases labeled as “TIA mimic” and 83% of the “TIA” discharge diagnosis. The LASSO classifier had an overall accuracy of 74%. Both the RFE and LASSO-based classifiers tied or outperformed the ABCD2 score and the Diagnosis of TIA (DOT) score. With respect to predicting TIA, the RFE-based classifier has 61.1% accuracy, the LASSO-based classifier has 79.5% accuracy, whereas the DOT score applied to the dataset yields an accuracy of 63.1%. Conclusion: The results of this pilot study indicate that a multinomial classification model, based on a combination of feature selection mechanisms coupled with logistic regression, can be used to effectively differentiate between TIA, TIA mimics, and minor stroke.
Affiliation(s)
- Alia Stanciu
- Freeman College of Management, Bucknell University, 1 Dent Drive, Lewisburg, PA, 17837-2005, USA
- Mihai Banciu
- Freeman College of Management, Bucknell University, 1 Dent Drive, Lewisburg, PA, 17837-2005, USA.
- Alireza Sadighi
- Department of Neurology, Division of Cerebrovascular Diseases, Geisinger Medical Center, 100 N Academy Ave, Danville, PA, 17822, USA
- Kyle A Marshall
- Department of Emergency Medicine, Medicine Institute, Geisinger Medical Center, 100 N Academy Ave, Danville, PA, 17822, USA
- Geisinger Commonwealth School of Medicine, 525 Pine St., Scranton, PA, 18509, USA
- Neil R Holland
- Department of Neurology, Division of Cerebrovascular Diseases, Geisinger Medical Center, 100 N Academy Ave, Danville, PA, 17822, USA
- Geisinger Commonwealth School of Medicine, 525 Pine St., Scranton, PA, 18509, USA
- Vida Abedi
- Department of Molecular and Functional Genomics, Weis Center for Research, Geisinger Health System, 100 N Academy Ave, Danville, PA, 17822, USA
- Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, Blacksburg, Virginia, 24061, USA
- Ramin Zand
- Department of Neurology, Division of Cerebrovascular Diseases, Geisinger Medical Center, 100 N Academy Ave, Danville, PA, 17822, USA
140
Jiang L, Greenwood CMT, Yao W, Li L. Bayesian Hyper-LASSO Classification for Feature Selection with Application to Endometrial Cancer RNA-seq Data. Sci Rep 2020; 10:9747. [PMID: 32546735 PMCID: PMC7297975 DOI: 10.1038/s41598-020-66466-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/18/2019] [Accepted: 04/29/2020] [Indexed: 11/30/2022] Open
Abstract
Feature selection is demanded in many modern scientific research problems that use high-dimensional data. A typical example is to identify gene signatures that are related to a certain disease from high-dimensional gene expression data. The expression of genes may have grouping structures, for example, a group of co-regulated genes that have similar biological functions tend to have similar expressions. Thus it is preferable to take the grouping structure into consideration to select features. In this paper, we propose a Bayesian Robit regression method with Hyper-LASSO priors (shortened by BayesHL) for feature selection in high dimensional genomic data with grouping structure. The main features of BayesHL include that it discards more aggressively unrelated features than LASSO, and it makes feature selection within groups automatically without a pre-specified grouping structure. We apply BayesHL in gene expression analysis to identify subsets of genes that contribute to the 5-year survival outcome of endometrial cancer (EC) patients. Results show that BayesHL outperforms alternative methods (including LASSO, group LASSO, supervised group LASSO, penalized logistic regression, random forest, neural network, XGBoost and knockoff) in terms of predictive power, sparsity and the ability to uncover grouping structure, and provides insight into the mechanisms of multiple genetic pathways leading to differentiated EC survival outcome.
Affiliation(s)
- Lai Jiang
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada.
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Canada.
- Celia M T Greenwood
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Canada
- Gerald Bronfman Department of Oncology, McGill University, Montreal, Canada
- Weixin Yao
- Department of Statistics, University of California, Riverside, US
- Longhai Li
- Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, Canada.
141
Yang Z, Chen Z, Wang C. An accelerated stochastic variance-reduced method for machine learning problems. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
142
Chu BB, Keys KL, German CA, Zhou H, Zhou JJ, Sobel EM, Sinsheimer JS, Lange K. Iterative hard thresholding in genome-wide association studies: Generalized linear models, prior weights, and double sparsity. Gigascience 2020; 9:giaa044. [PMID: 32491161 PMCID: PMC7268817 DOI: 10.1093/gigascience/giaa044] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/19/2019] [Revised: 02/27/2020] [Accepted: 04/14/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: Consecutive testing of single nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression. RESULTS: We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models, prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and exhibits a 2-3 orders of magnitude decrease in false-positive rates compared with lasso regression. We also test IHT on the UK Biobank hypertension phenotypes and the Northern Finland Birth Cohort of 1966 cardiovascular phenotypes. We find that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identified by previous studies. CONCLUSIONS: Our real data analysis and simulation studies suggest that IHT can (i) recover highly correlated predictors, (ii) avoid over-fitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, (v) exploit prior information and group-sparsity, and (vi) be used with biobank-sized datasets. Although these advances are studied for genome-wide association studies inference, our extensions are pertinent to other regression problems with large numbers of predictors.
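The core idea of iterative hard thresholding can be illustrated with a small least-squares example: take a gradient step, then keep only the k largest coefficients in magnitude. The sketch below is a plain NumPy illustration under simplifying assumptions (Gaussian linear model, fixed step size 1/L, synthetic data); the function names are hypothetical, and it does not include the generalized linear models, prior weights, or group sparsity that the paper adds.

```python
import numpy as np

def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argpartition(np.abs(beta), -k)[-k:]
        out[keep] = beta[keep]
    return out

def iht_least_squares(X, y, k, n_iter=200):
    """Projected gradient descent for min 0.5 * ||y - X b||^2 with at most k nonzeros."""
    p = X.shape[1]
    # Step size 1/L, where L is the largest eigenvalue of X^T X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = hard_threshold(beta - step * grad, k)
    return beta

# Synthetic data: 3 true predictors out of 50 (illustrative values only).
rng = np.random.default_rng(0)
n, p, k = 500, 50, 3
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[[4, 17, 30]] = [2.0, -3.0, 2.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = iht_least_squares(X, y, k)
support = np.flatnonzero(beta_hat)
print(support, beta_hat[support])
```

Unlike the lasso, the projection step does not shrink the surviving coefficients, which is one intuition for the abstract's claim about less biased estimates.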
Affiliation(s)
- Benjamin B Chu
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Kevin L Keys
- Department of Medicine, University of California, San Francisco, 1701 Divisadero St, San Francisco, CA, 94115, USA
- Berkeley Institute of Data Science, University of California, Berkeley, 190 Doe Library, Berkeley, CA 94720, USA
- Christopher A German
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Jin J Zhou
- Division of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave. Tucson, AZ, 85724, USA
- Eric M Sobel
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA
- Janet S Sinsheimer
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA
- Kenneth Lange
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA
143
Zhang C, Yu Z, Fu H, Zhu P, Chen L, Hu Q. Hybrid Noise-Oriented Multilabel Learning. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:2837-2850. [PMID: 30762579 DOI: 10.1109/tcyb.2019.2894985] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/09/2023]
Abstract
For real-world applications, multilabel learning usually suffers from unsatisfactory training data. Typically, features may be corrupted or class labels may be noisy or both. Ignoring noise in the learning process tends to result in an unreasonable model and, thus, inaccurate prediction. Most existing methods only consider either feature noise or label noise in multilabel learning. In this paper, we propose a unified robust multilabel learning framework for data with hybrid noise, that is, both feature noise and label noise. The proposed method, hybrid noise-oriented multilabel learning (HNOML), is simple but rather robust for noisy data. HNOML simultaneously addresses feature and label noise by bi-sparsity regularization bridged with label enrichment. Specifically, the label enrichment matrix explores the underlying correlation among different classes which improves the noisy labeling. Bridged with the enriching label matrix, the structured sparsity is imposed to jointly handle the corrupted features and noisy labeling. We utilize the alternating direction method (ADM) to efficiently solve our problem. Experimental results on several benchmark datasets demonstrate the advantages of our method over the state-of-the-art ones.
144
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
|
145
|
Lee K, Cao X. Bayesian group selection in logistic regression with application to MRI data analysis. Biometrics 2020; 77:391-400. [PMID: 32365231 DOI: 10.1111/biom.13290] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/25/2019] [Revised: 04/24/2020] [Accepted: 04/27/2020] [Indexed: 12/22/2022]
Abstract
We consider Bayesian logistic regression models with group-structured covariates. In high-dimensional settings, it is often assumed that only a small portion of groups are significant, and thus, consistent group selection is of significant importance. While consistent frequentist group selection methods have been proposed, theoretical properties of Bayesian group selection methods for logistic regression models have not been investigated yet. In this paper, we consider a hierarchical group spike and slab prior for logistic regression models in high-dimensional settings. Under mild conditions, we establish strong group selection consistency of the induced posterior, which is the first theoretical result in the Bayesian literature. Through simulation studies, we demonstrate that the proposed method outperforms existing state-of-the-art methods in various settings. We further apply our method to a magnetic resonance imaging data set for predicting Parkinson's disease and show its benefits over other contenders.
Affiliation(s)
- Kyoungjae Lee
- Department of Statistics, Inha University, Incheon, South Korea
- Xuan Cao
- Department of Mathematical Sciences, University of Cincinnati, Cincinnati, Ohio
|
146
|
A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10093307] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/16/2022]
Abstract
Smoking is a major public health issue and a significant contributor to premature death. In recent years, numerous decision support systems based on machine learning methods have been developed to support smoking cessation. However, the inevitable class imbalance is considered a major challenge in deploying such systems. In this paper, we present an empirical comparison of machine learning techniques for dealing with the class imbalance problem in predicting the outcome of a smoking cessation intervention among the Korean population. The objective of this paper is to improve prediction performance by using two synthetic oversampling techniques: the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The experimental design comprises three components. First, the best representative features are selected in two phases: the lasso method and multicollinearity analysis. Second, newly balanced data are generated using the SMOTE and ADASYN techniques. Third, machine learning classifiers are applied to construct prediction models for all subjects and for each gender. To assess the effectiveness of the prediction models, the f-score, type I error, type II error, balanced accuracy and geometric mean indices are used. Comprehensive analysis demonstrates that Gradient Boosting Trees (GBT), Random Forest (RF) and multilayer perceptron neural network (MLP) classifiers achieved the best performance for all subjects and for each gender when SMOTE and ADASYN were used. The SMOTE with GBT and RF models also provide feature importance scores that enhance the interpretability of the decision-support system. In addition, the results show that the presented synthetic oversampling techniques combined with machine learning models outperformed baseline models in smoking cessation prediction.
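The oversampling step described in this abstract can be illustrated with a minimal sketch of the SMOTE interpolation idea, assuming only NumPy. The function name `smote_sketch` and its parameters are hypothetical and are not taken from the paper; the sketch covers the basic SMOTE step only (ADASYN additionally weights how many points are generated per sample) and is not the authors' implementation.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    X_min : array of shape (n, d) holding only the minority-class rows (n >= 2).
    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbours
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # seed samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because every synthetic row is a convex combination of two observed minority rows, the new points stay inside the bounding box of the minority class. In a real analysis, oversampling would be applied to the training split only, so that evaluation data remain untouched.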
|
147
|
Arayeshgari M, Tapak L, Roshanaei G, Poorolajal J, Ghaleiha A. Application of group smoothly clipped absolute deviation method in identifying correlates of psychiatric distress among college students. BMC Psychiatry 2020; 20:198. [PMID: 32366242 PMCID: PMC7199302 DOI: 10.1186/s12888-020-02591-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/14/2019] [Accepted: 04/07/2020] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND College students are at an increased risk of psychiatric distress. Identifying its important correlates using more reliable statistical models, instead of inefficient traditional variable selection methods such as stepwise regression, is therefore of great importance. The objective of this study was to investigate correlates of psychiatric distress among college students in Iran using the group smoothly clipped absolute deviation (SCAD) method. METHODS A total of 1259 college students voluntarily participated in this cross-sectional study (Jan-May 2016) at Hamadan University of Medical Sciences, Iran. The data were collected using a self-administered questionnaire consisting of demographic information, a behavioral risk factors checklist and the GHQ-28 questionnaire (with a cut-off of 23 to measure psychiatric distress, as recommended for the Iranian version of the questionnaire). Penalized logistic regression with a group-SCAD regularization method was used to analyze the data (α = 0.05). RESULTS The majority of students were aged 18-25 (87.61%), and 60.76% of them were female. About 41% of students had psychiatric distress. Significant correlates of psychiatric distress among college students selected by group-SCAD included the average grade, educational level, being optimistic about the future, having a boy/girlfriend, having an emotional breakup, the average daily number of cigarettes, substance abuse during the previous month and ever having suicidal thoughts (P < 0.05). CONCLUSIONS Penalized logistic regression methods such as group-SCAD and group-Adaptive-LASSO should be considered as plausible alternatives to stepwise regression for identifying correlates of a binary response. Several behavioral variables were associated with psychological distress, which highlights the need for interventional programs that address multiple factors and behavioral changes.
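For readers unfamiliar with the penalty named in this abstract, the following is a minimal sketch of the standard SCAD penalty and a group version that applies it to the Euclidean norm of each coefficient group. The function names, the conventional default `a = 3.7`, and the absence of any group-size weighting are assumptions made for illustration; the abstract does not state which variant or software the authors used.

```python
import numpy as np


def scad_penalty(t, lam, a=3.7):
    """Standard SCAD penalty evaluated at |t| (elementwise), with a > 2.

    Linear (lasso-like) for |t| <= lam, quadratic for lam < |t| <= a*lam,
    and constant beyond a*lam, so large coefficients are not shrunk further.
    """
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    constant = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))


def group_scad_penalty(beta, groups, lam, a=3.7):
    """Sum of SCAD penalties applied to the L2 norm of each coefficient group.

    beta   : coefficient vector.
    groups : integer group label for each coefficient (same length as beta).
    No group-size weighting is applied here (an assumption of this sketch).
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    norms = [np.linalg.norm(beta[groups == g]) for g in np.unique(groups)]
    return float(np.sum(scad_penalty(norms, lam, a)))
```

Penalizing the group norm means that all coefficients of a multi-level categorical predictor (for example, the dummy variables for educational level) enter or leave the model together, which is why group penalties suit questionnaire data of this kind.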
Affiliation(s)
- Mahya Arayeshgari
- Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Leili Tapak
- Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran; Modeling of Noncommunicable diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran
- Ghodratollah Roshanaei
- Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran; Modeling of Noncommunicable diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran
- Jalal Poorolajal
- Department of Epidemiology, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran; Research Center for Health Sciences, Hamadan University of Medical Sciences, Hamadan, Iran
- Ali Ghaleiha
- Department of Psychiatry, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran; Research Center for Behavioral Disorders and Substance Abuse, Hamadan University of Medical Sciences, Hamadan, Iran
|
148
|
Jiang H, Fan X. A consistent variable screening procedure with family-wise error control. J STAT COMPUT SIM 2020. [DOI: 10.1080/00949655.2020.1724291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Hangjin Jiang
- Center for Data Science, Zhejiang University, Hangzhou, People's Republic of China
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong, People's Republic of China
- Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong, People's Republic of China
|
149
|
Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M, Oualkacha K, Greenwood CMT. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 2020; 16:e1008766. [PMID: 32365090 PMCID: PMC7224575 DOI: 10.1371/journal.pgen.1008766] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/12/2019] [Revised: 05/14/2020] [Accepted: 04/08/2020] [Indexed: 12/23/2022] Open
Abstract
Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMM) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect, called ggmix, for simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models with better prediction accuracy than the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are implemented in an R package available on CRAN (https://cran.r-project.org/package=ggmix).
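As a rough guide to what a "penalized LMM with a single random effect" means, a generic form can be written as follows. This is a textbook-style sketch under assumed notation (the symbols $K$, $\sigma_g^2$, $\sigma_e^2$ and the plain lasso penalty are choices made here for illustration) and is not necessarily the exact parameterization or penalty used by ggmix.

```latex
% Generic linear mixed model with one random effect b whose covariance
% is proportional to a known kinship (relatedness) matrix K:
\[
  y = X\beta + b + \varepsilon, \qquad
  b \sim \mathcal{N}(0, \sigma_g^2 K), \qquad
  \varepsilon \sim \mathcal{N}(0, \sigma_e^2 I_n),
\]
% so that, marginally, y has covariance V:
\[
  V = \sigma_g^2 K + \sigma_e^2 I_n .
\]
% Penalized estimation minimizes the negative log-likelihood (up to an
% additive constant) plus a sparsity penalty on the fixed effects:
\[
  \min_{\beta,\, \sigma_g^2,\, \sigma_e^2}\;
  \tfrac{1}{2}\log\lvert V\rvert
  + \tfrac{1}{2}\,(y - X\beta)^{\top} V^{-1} (y - X\beta)
  + \lambda \lVert \beta \rVert_1 .
\]
```

Estimating $\beta$ and the variance components jointly is what distinguishes this formulation from the two-stage approach criticized in the abstract, where residuals from a null model are passed to a separate penalized regression.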
Affiliation(s)
- Sahir R. Bhatnagar
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
- Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada
- Yi Yang
- Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada
- Tianyuan Lu
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
- Erwin Schurr
- Department of Medicine, McGill University, Montréal, Québec, Canada
- JC Loredo-Osti
- Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada
- Marie Forest
- École de Technologie Supérieure, Montréal, Québec, Canada
- Karim Oualkacha
- Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada
- Celia M. T. Greenwood
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
- Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
- Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
|
150
|
Detmer FJ, Cebral J, Slawski M. A note on coding and standardization of categorical variables in (sparse) group lasso regression. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2019.08.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
|