1
|
Guo S, Zhang J, Wu Y, McLain AC, Hardin JW, Olatosi B, Li X. Functional Multivariable Logistic Regression With an Application to HIV Viral Suppression Prediction. Biom J 2024; 66:e202300081. [PMID: 38966906 DOI: 10.1002/bimj.202300081] [Received: 03/11/2023] [Revised: 01/21/2024] [Accepted: 01/24/2024] [Indexed: 07/06/2024]
Abstract
Motivated by improving the prediction of the human immunodeficiency virus (HIV) suppression status using electronic health records (EHR) data, we propose a functional multivariable logistic regression model, which accounts for the longitudinal binary process and continuous process simultaneously. Specifically, the longitudinal measurements for either binary or continuous variables are modeled by functional principal components analysis, and their corresponding functional principal component scores are used to build a logistic regression model for prediction. The longitudinal binary data are linked to underlying Gaussian processes. Estimation uses penalized splines for both the longitudinal continuous and binary data. Group lasso is used to select longitudinal processes, and multivariate functional principal components analysis is proposed to revise the functional principal component scores to account for the correlation among processes. The method is evaluated via comprehensive simulation studies and then applied to predict viral suppression using EHR data for people living with HIV in South Carolina.
Affiliation(s)
- Siyuan Guo
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, South Carolina, USA
| | - Jiajia Zhang
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, South Carolina, USA
| | - Yichao Wu
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA
| | - Alexander C McLain
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, South Carolina, USA
| | - James W Hardin
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, South Carolina, USA
| | - Bankole Olatosi
- Department of Health Services Policy and Management, University of South Carolina, Columbia, South Carolina, USA
| | - Xiaoming Li
- Department of Health Promotion, Education, and Behavior, University of South Carolina, Columbia, South Carolina, USA
| |
|
2
|
Lü C, Wang T, Xi X, Wang M, Wang J, Zhilenko A, Li L. A novel temporal-frequency combination pattern optimization approach based on information fusion for motor imagery BCIs. Comput Methods Biomech Biomed Engin 2024:1-13. [PMID: 38946233 DOI: 10.1080/10255842.2024.2371036] [Received: 12/26/2023] [Accepted: 06/16/2024] [Indexed: 07/02/2024]
Abstract
Motor imagery (MI) stands as a powerful paradigm within Brain-Computer Interface (BCI) research due to its ability to induce changes in brain rhythms detectable through common spatial patterns (CSP). However, the raw feature sets captured often contain redundant and invalid information, potentially hindering CSP performance. We propose the Information Fusion for Optimizing Temporal-Frequency Combination Pattern (IFTFCP) algorithm to enhance raw feature optimization. Initially, preprocessed data undergo simultaneous processing in both time and frequency domains via sliding overlapping time windows and filter banks. Subsequently, we introduce the Pearson-Fisher combinational method along with Discriminant Correlation Analysis (DCA) for joint feature selection and fusion. These steps aim to refine raw electroencephalogram (EEG) features. For precise classification of binary MI problems, a Radial Basis Function (RBF)-kernel Support Vector Machine classifier is trained. To validate the efficacy of IFTFCP and evaluate it against other techniques, we conducted experimental investigations using two EEG datasets. Results indicate superior classification performance, with average accuracies of 78.14% and 85.98% on dataset 1 and dataset 2, respectively, exceeding those of the other methods compared in the article. The study's findings suggest potential benefits for the advancement of MI-based BCI strategies, particularly in the domain of feature fusion.
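The time-domain step mentioned in this abstract (sliding overlapping time windows) can be sketched as follows. This is only a minimal illustration and not the authors' IFTFCP code: the function name `sliding_windows`, the sampling rate, and the window and step lengths are all hypothetical choices made for the example.

```python
def sliding_windows(n_samples, window, step):
    """Return (start, stop) index pairs for fixed-length windows.

    Consecutive windows overlap whenever step < window. The name and
    the parameter values used below are illustrative assumptions, not
    values taken from the paper.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, step)]


# Hypothetical trial: 4 s of EEG at 250 Hz (1000 samples), cut into
# 2 s windows (500 samples) shifted by 0.5 s (125 samples).
windows = sliding_windows(1000, 500, 125)
```

Each index pair would then be used to slice a trial before band-pass filtering and feature extraction; a smaller step gives more windows at the cost of more redundant features.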
Affiliation(s)
- Chenyang Lü
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
- Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China
| | - Ting Wang
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
- Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China
| | - Xugang Xi
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
- Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China
| | - Maofeng Wang
- Affiliated Dongyang Hospital of Wenzhou Medical University, Dongyang, China
| | - Jian Wang
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
- Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China
| | - Anton Zhilenko
- Department of Cyber-Physical Systems, St. Petersburg State Marine Technical University, Saint-Petersburg, Russia
| | - Lihua Li
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
- Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China
| |
|
3
|
Mohammed I, Elbashir MK, Faggad AS. Singular Value Decomposition-Based Penalized Multinomial Regression for Classifying Imbalanced Medulloblastoma Subgroups Using Methylation Data. J Comput Biol 2024; 31:458-471. [PMID: 38752890 DOI: 10.1089/cmb.2023.0198] [Indexed: 05/23/2024] Open
Abstract
Medulloblastoma (MB) is a molecularly heterogeneous brain malignancy with large differences in clinical presentation. According to genomic studies, there are at least four distinct molecular subgroups of MB: sonic hedgehog (SHH), wingless/INT (WNT), Group 3, and Group 4. The treatment and outcomes depend on appropriate classification. It is difficult for classification algorithms to identify these subgroups from an imbalanced MB genomic data set, where the samples are not distributed equally among the MB subgroups. To overcome this problem, we used singular value decomposition (SVD) and group lasso techniques to find DNA methylation probe features that maximize the separation between the different imbalanced MB subgroups. We used multinomial regression as a classification method to classify the four different molecular subgroups of MB using the reduced DNA methylation data. Coordinate descent is used to minimize the loss function with the group lasso penalty, which promotes sparsity. By using SVD, we were able to reduce the 321,174 probe features to just 200 features. Fewer than 40 features were selected after applying the group lasso, which we then used as predictors for our classification models. Our proposed method achieved an average overall accuracy of 99% based on fivefold cross-validation. Our approach produces improved classification performance compared with the state-of-the-art methods for classifying MB molecular subgroups.
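To make concrete how a group lasso penalty promotes sparsity, the sketch below shows the group-wise shrinkage step (block soft-thresholding) that group lasso solvers commonly build on. It is a generic textbook operator written under stated assumptions, not the authors' implementation; the function name `group_soft_threshold` and the numeric values are hypothetical.

```python
import math


def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2.

    A whole group of coefficients is set to zero when its Euclidean
    norm does not exceed the threshold; otherwise the group is shrunk
    toward zero while keeping its direction.
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]


# A weak group (norm 0.5) is removed entirely; a strong group
# (norm 5.0) survives, scaled by 1 - 1/5 = 0.8.
weak = group_soft_threshold([0.3, -0.4], 1.0)
strong = group_soft_threshold([3.0, 4.0], 1.0)
```

Because coefficients are zeroed one group at a time, all coefficients tied to a feature (for example, one per class in a multinomial model) enter or leave the model together.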
Affiliation(s)
- Isra Mohammed
- Department of Statistics, Faculty of Mathematical and Computer Sciences, University of Gezira, Wad Madani, Sudan
| | - Murtada K Elbashir
- Department of Information Systems, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia
- Department of Computer Science, Faculty of Mathematical and Computer Sciences, University of Gezira, Wad Madani, Sudan
| | - Areeg S Faggad
- Department of Molecular Biology, National Cancer Institute-University of Gezira, Wad Madani, Sudan
| |
|
4
|
Kong X, Wu C, Chen S, Wu T, Han J. Efficient Feature Learning Model of Motor Imagery EEG Signals with L1-Norm and Weighted Fusion. Biosensors 2024; 14:211. [PMID: 38785685 PMCID: PMC11117874 DOI: 10.3390/bios14050211] [Received: 01/24/2024] [Revised: 03/24/2024] [Accepted: 04/18/2024] [Indexed: 05/25/2024]
Abstract
Brain-computer interface (BCI) for motor imagery is an advanced technology used in the field of medical rehabilitation. However, due to the poor accuracy of electroencephalogram feature classification, BCI systems often misrecognize user commands. Although many state-of-the-art feature selection methods aim to enhance classification accuracy, they usually overlook the interrelationships between individual features, indirectly impacting the accuracy of feature classification. To overcome this issue, we propose an adaptive feature learning model that employs a Riemannian geometric approach to generate a feature matrix from electroencephalogram signals, serving as the model's input. By integrating the enhanced adaptive L1 penalty and weighted fusion penalty into the sparse learning model, we select the most informative features from the matrix. Specifically, we measure the importance of features using mutual information and introduce an adaptive weight construction strategy to penalize regression coefficients corresponding to each variable adaptively. Moreover, the weighted fusion penalty balances weight differences among correlated variables, reducing the model's overreliance on specific variables and enhancing accuracy. The performance of the proposed method was validated on BCI Competition IV datasets IIa and IIb using the support vector machine. Experimental results demonstrate the effectiveness and superiority of the proposed model compared to the existing models.
Affiliation(s)
- Xiangzeng Kong
- College of Mechanical and Electrical Engineering, Fujian Agriculture and Forestry University, Fuzhou 350108, China
| | - Cailin Wu
- School of Future Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Shimiao Chen
- School of Future Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
| | - Tao Wu
- College of Mechanical and Electrical Engineering, Fujian Agriculture and Forestry University, Fuzhou 350108, China
| | - Junfeng Han
- College of Mechanical and Electrical Engineering, Fujian Agriculture and Forestry University, Fuzhou 350108, China
| |
|
5
|
Yang Y, McMahan CS, Wang YB, Ouyang Y. Estimation of l0 Norm Penalized Models: A Statistical Treatment. Comput Stat Data Anal 2024; 192:107902. [PMID: 38222104 PMCID: PMC10785287 DOI: 10.1016/j.csda.2023.107902] [Indexed: 01/16/2024]
Abstract
Fitting penalized models for the purpose of merging the estimation and model selection problem has become commonplace in statistical practice. Of the various regularization strategies that can be leveraged to this end, the use of the l0 norm to penalize parameter estimation poses the most daunting model fitting task. In fact, this particular strategy requires an end user to solve a non-convex NP-hard optimization problem regardless of the underlying data model. For this reason, the use of the l0 norm as a regularization strategy has been woefully underutilized. To obviate this difficulty, a strategy to solve such problems that is generally accessible by the statistical community is developed. The approach can be adopted to solve l0 norm penalized problems across a very broad class of models, can be implemented using existing software, and is computationally efficient. The performance of the method is demonstrated through in-depth numerical experiments and through using it to analyze several prototypical data sets.
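The combinatorial nature of l0 penalization can be seen in a toy special case. The sketch below assumes an identity design matrix, so the objective is 0.5 * ||z - b||^2 + lam * ||b||_0, and compares an exhaustive search over all 2^p supports with the hard-thresholding rule that solves this special case in closed form. It does not reproduce the strategy developed in the paper; every function name and number here is a hypothetical illustration.

```python
from itertools import combinations
import math


def l0_objective(z, support, lam):
    """0.5 * ||z - b||^2 + lam * ||b||_0, with b = z on `support` and 0 elsewhere."""
    residual = sum(v * v for j, v in enumerate(z) if j not in support)
    return 0.5 * residual + lam * len(support)


def best_subset(z, lam):
    """Exhaustive search over all 2^p supports; feasible only for tiny p."""
    p = len(z)
    candidates = (set(c) for k in range(p + 1)
                  for c in combinations(range(p), k))
    return min(candidates, key=lambda s: l0_objective(z, s, lam))


def hard_threshold_support(z, lam):
    """Closed form for the identity-design case: keep z_j when |z_j| > sqrt(2 * lam)."""
    cutoff = math.sqrt(2.0 * lam)
    return {j for j, v in enumerate(z) if abs(v) > cutoff}


z = [2.0, -0.3, 0.9, -1.7]
lam = 0.5  # cutoff is sqrt(2 * 0.5) = 1.0
brute = best_subset(z, lam)
closed = hard_threshold_support(z, lam)
```

With four coefficients the search visits 16 supports; the count doubles with every added coefficient, and for a general design matrix no such closed-form shortcut is available, which is where the difficulty described in the abstract comes from.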
Affiliation(s)
- Yuan Yang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
| | - Christopher S McMahan
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
| | - Yu-Bo Wang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
| | - Yuyuan Ouyang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
| |
|
6
|
Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform 2024; 152:104629. [PMID: 38552994 DOI: 10.1016/j.jbi.2024.104629] [Received: 01/03/2024] [Revised: 02/26/2024] [Accepted: 03/25/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND In health research, multimodal omics data analysis is widely used to address important clinical and biological questions. Traditional statistical methods rely on strong distributional assumptions. Statistical methods such as hypothesis testing and differential expression analysis are commonly used in omics analysis. Deep learning, on the other hand, is an advanced computer science technique that is powerful in mining high-dimensional omics data for prediction tasks. Recently, integrative frameworks or methods have been developed for omics studies that combine statistical models and deep learning algorithms. METHODS AND RESULTS The aim of these integrative frameworks is to combine the strengths of both statistical methods and deep learning algorithms to improve prediction accuracy while also providing interpretability and explainability. This review discusses the current state-of-the-art integrative frameworks, their limitations, and potential future directions in survival and time-to-event longitudinal analysis, dimension reduction and clustering, regression and classification, feature selection, and causal and transfer learning.
Affiliation(s)
- Leann Lac
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada
| | - Carson K Leung
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada
| | - Pingzhao Hu
- Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Biochemistry, Western University, London, Ontario, Canada; Department of Computer Science, Western University, London, Ontario, Canada; Department of Oncology, Western University, London, Ontario, Canada; Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada; The Children's Health Research Institute, Lawson Health Research Institute, London, Ontario, Canada.
| |
|
7
|
Choi H, Choi B, Han S, Lee M, Shin GT, Kim H, Son M, Kim KH, Kwon JM, Park RW, Park I. Applicable Machine Learning Model for Predicting Contrast-induced Nephropathy Based on Pre-catheterization Variables. Intern Med 2024; 63:773-780. [PMID: 37558487 PMCID: PMC11008999 DOI: 10.2169/internalmedicine.1459-22] [Received: 12/16/2022] [Accepted: 07/02/2023] [Indexed: 08/11/2023] Open
Abstract
Objective Contrast agents used for radiological examinations are an important cause of acute kidney injury (AKI). We developed and validated a machine learning and clinical scoring prediction model to stratify the risk of contrast-induced nephropathy, considering the limitations of current classical and machine learning models. Methods This retrospective study included 38,481 percutaneous coronary intervention cases from 23,703 patients in a tertiary hospital. We divided the cases into development and internal test sets (8:2). Using the development set, we trained a gradient boosting machine prediction model (complex model). We then developed a simple model using seven variables based on variable importance. We validated the performance of the models using an internal test set and tested them externally in two other hospitals. Results The complex model had the best area under the receiver operating characteristic (AUROC) curve at 0.885 [95% confidence interval (CI) 0.876-0.894] in the internal test set and 0.837 (95% CI 0.819-0.854) and 0.850 (95% CI 0.781-0.918) in two different external validation sets. The simple model showed an AUROC of 0.795 (95% CI 0.781-0.808) in the internal test set and 0.766 (95% CI 0.744-0.789) and 0.782 (95% CI 0.687-0.877) in the two different external validation sets. This was higher than the value in the well-known scoring system (Mehran criteria, AUROC=0.67). The seven precatheterization variables selected for the simple model were age, known chronic kidney disease, hematocrit, troponin I, blood urea nitrogen, base excess, and N-terminal pro-brain natriuretic peptide. The simple model is available at http://52.78.230.235:8081/. Conclusions We developed an AKI prediction machine learning model with reliable performance. This can aid in bedside clinical decision making.
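The AUROC values reported in this abstract have a simple interpretation: the probability that a randomly chosen case with the outcome receives a higher predicted risk than a randomly chosen case without it. The sketch below computes that quantity directly by pairwise comparison. It is a generic illustration on made-up numbers, not the study's evaluation code; the function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This O(n_pos * n_neg) version is meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive-negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(labels, scores)
```

On this reading, an AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for comparing the complex model, the simple model, and the Mehran criteria.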
Affiliation(s)
- Heejung Choi
- Department of Nephrology, Ajou University School of Medicine, Korea
| | - Byungjin Choi
- Department of Biomedical Informatics, Ajou University School of Medicine, Korea
| | | | - Minjeong Lee
- Department of Nephrology, Ajou University School of Medicine, Korea
| | - Gyu-Tae Shin
- Department of Nephrology, Ajou University School of Medicine, Korea
| | - Heungsoo Kim
- Department of Nephrology, Ajou University School of Medicine, Korea
| | - Minkook Son
- Department of Physiology, College of Medicine, Dong-A University, Korea
| | - Kyung-Hee Kim
- Department of Cardiology, Cardiovascular Center, Incheon Sejong Hospital, Korea
| | - Joon-Myoung Kwon
- Department of Critical Care and Emergency Medicine, Incheon Sejong Hospital, Korea
- Artificial Intelligence and Big Data Research Center, Sejong Medical Research Institute, Korea
- Medical Research Team, Medical AI, Korea
| | - Rae Woong Park
- Department of Biomedical Informatics, Ajou University School of Medicine, Korea
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Korea
| | - Inwhee Park
- Department of Nephrology, Ajou University School of Medicine, Korea
| |
|
8
|
Cao X, Liang X, Zhang S, Sha Q. Gene selection by incorporating genetic networks into case-control association studies. Eur J Hum Genet 2024; 32:270-277. [PMID: 36529820 PMCID: PMC10923938 DOI: 10.1038/s41431-022-01264-x] [Received: 04/08/2022] [Revised: 11/27/2022] [Accepted: 11/30/2022] [Indexed: 12/23/2022] Open
Abstract
Large-scale genome-wide association studies (GWAS) have been successfully applied to a wide range of genetic variants underlying complex diseases. The network-based regression approach has been developed to incorporate a biological genetic network and to overcome the computational challenges of analyzing high-dimensional genomic data. In this paper, we propose a gene selection approach by incorporating genetic networks into case-control association studies for DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as principal component analyses and supervised principal component analyses, we use a linear combination of genotypes at SNPs or methylation values at CpG sites in a gene to capture gene-level signals. We employ three linear combination approaches: optimally weighted sum (OWS), beta-based weighted sum (BWS), and LD-adjusted polygenic risk score (LD-PRS). OWS and LD-PRS are supervised approaches that depend on the effect of each SNP or CpG site on the case-control status, while BWS can be extracted without using the case-control status. After using one of the linear combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that the proposed approaches have higher true positive rates than traditional dimension reduction techniques. We also apply our approaches to DNA methylation data and UK Biobank DNA sequence data for analyzing rheumatoid arthritis. The results show that the proposed methods can select genes potentially related to rheumatoid arthritis that are missed by existing methods.
Affiliation(s)
- Xuewei Cao
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
| | - Xiaoyu Liang
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA
| | - Shuanglin Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
| | - Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA.
| |
|
9
|
Chang RI, Lin JY, Hung YH. Cloud-Based Machine Learning Methods for Parameter Prediction in Textile Manufacturing. Sensors (Basel, Switzerland) 2024; 24:1304. [PMID: 38400462 PMCID: PMC10891737 DOI: 10.3390/s24041304] [Received: 12/22/2023] [Revised: 02/15/2024] [Accepted: 02/16/2024] [Indexed: 02/25/2024]
Abstract
In traditional textile manufacturing, downstream manufacturers use raw materials, such as nylon and cotton yarns, to produce textile products. The manufacturing process involves warping, sizing, beaming, weaving, and inspection. Staff members typically use a trial-and-error approach to adjust the appropriate production parameters in the manufacturing process, which can be time-consuming and wasteful of resources. To enhance the efficiency and effectiveness of textile manufacturing economically, this study proposes a query-based learning method in regression analytics using existing manufacturing data. Query-based learning allows the model training to evolve its decision-making process through dynamic interactions with its solution space. In this study, predefined target parameters of quality factors were first used to validate the training results and create new training patterns. These new patterns were then imported into the solution space of the training model. In predicting product quality, the results show that the proposed query-based regression algorithm has a mean squared error of 0.0153, which is better than those of the original regression-related methods (average mean squared error = 0.020). The trained model was deployed as an application programming interface (API) for cloud-based analytics and an extensive auto-notification service.
Affiliation(s)
- Ray-I Chang
- Department of Engineering Science and Ocean Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
| | - Jia-Ying Lin
- Department of Engineering Science and Ocean Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
| | - Yu-Hsin Hung
- Department of Industrial Engineering and Management, National Yunlin University of Science and Technology, Yunlin 64002, Taiwan
| |
|
10
|
Chen YF, Chawla S, Mousa-Doust D, Nichol A, Ng R, Isaac KV. Machine Learning to Predict the Need for Postmastectomy Radiotherapy after Immediate Breast Reconstruction. Plastic and Reconstructive Surgery-Global Open 2024; 12:e5599. [PMID: 38322813 PMCID: PMC10846766 DOI: 10.1097/gox.0000000000005599] [Received: 10/02/2023] [Accepted: 12/15/2023] [Indexed: 02/08/2024]
Abstract
Background Postmastectomy radiotherapy (PMRT) is an independent predictor of reconstructive complications. PMRT may alter the timing and type of reconstruction recommended. This study aimed to create a machine learning model to predict the probability of requiring PMRT after immediate breast reconstruction (IBR). Methods In this retrospective study, breast cancer patients who underwent IBR from January 2017 to December 2020 were reviewed and data were collected on 81 preoperative characteristics. The primary outcome was recommendation for PMRT. Four algorithms were compared to maximize performance and clinical utility: logistic regression, elastic net (EN), logistic lasso, and random forest (RF). The cohort was split into a development dataset (75% of the cohort, for training-validation) and a test dataset (25%). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), precision-recall curves, and calibration plots. Results In a total of 800 patients, 325 (40.6%) patients were recommended to undergo PMRT. With the training-validation dataset (n = 600), model performance was logistic regression 0.73 AUC [95% confidence interval (CI) 0.65-0.80]; RF 0.77 AUC (95% CI, 0.74-0.81); EN 0.77 AUC (95% CI, 0.73-0.81); logistic lasso 0.76 AUC (95% CI, 0.72-0.80). Without significantly sacrificing performance, 81 predictive factors were reduced to 12 for prediction with the EN method. With the test dataset (n = 200), performance of the EN prediction model was confirmed [0.794 AUC (95% CI, 0.730-0.858)]. Conclusion A parsimonious accurate machine learning model for predicting PMRT after IBR was developed, tested, and translated into a clinically applicable online calculator for providers and patients.
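The elastic net and logistic lasso models compared in this abstract differ only in the penalty added to the logistic loss. The sketch below writes out one common parameterization of that penalty, in which a mixing weight `alpha` interpolates between a pure lasso penalty (`alpha = 1`) and a pure ridge penalty (`alpha = 0`). The abstract does not say which parameterization or software the authors used, so the formula, the function name `elastic_net_penalty`, and the numbers are assumptions made for illustration.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """lam * (alpha * ||b||_1 + 0.5 * (1 - alpha) * ||b||_2^2).

    alpha = 1 gives the lasso penalty and alpha = 0 gives the ridge
    penalty. This parameterization is an assumed convention, not one
    stated in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficient vector with one coefficient already at zero.
coefs = [1.0, -2.0, 0.0]
lasso_part = elastic_net_penalty(coefs, 0.1, 1.0)
ridge_part = elastic_net_penalty(coefs, 0.1, 0.0)
mixed = elastic_net_penalty(coefs, 0.1, 0.5)
```

The l1 term is what drives coefficients exactly to zero, which is consistent with the reduction from 81 candidate predictors to 12 reported for the EN model.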
Affiliation(s)
- Yi-Fu Chen
- Department of Computer Science, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
| | - Sahil Chawla
- Department of Surgery, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Dorsa Mousa-Doust
- Department of Surgery, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Alan Nichol
- Department of Radiation Oncology, BC Cancer, Vancouver, British Columbia, Canada
| | - Raymond Ng
- Department of Computer Science, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
| | - Kathryn V Isaac
- Department of Surgery, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Department of Computer Science, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
| |
|
11
|
Jiang W, Wang H, Dong X, Zhao Y, Long C, Chen D, Yan B, Cheng J, Lin Z, Zhuo S, Wang H, Yan J. Association of the pathomics-collagen signature with lymph node metastasis in colorectal cancer: a retrospective multicenter study. J Transl Med 2024; 22:103. [PMID: 38273371 PMCID: PMC10811897 DOI: 10.1186/s12967-024-04851-2] [Received: 07/30/2023] [Accepted: 01/02/2024] [Indexed: 01/27/2024] Open
Abstract
BACKGROUND Lymph node metastasis (LNM) is a prognostic biomarker and affects therapeutic selection in colorectal cancer (CRC). Current evaluation methods are not adequate for estimating LNM in CRC. H&E images contain much pathological information, and collagen also affects the biological behavior of tumor cells. Hence, the objective of the study is to investigate whether a fully quantitative pathomics-collagen signature (PCS) in the tumor microenvironment can be used to predict LNM. METHODS Patients with histologically confirmed stage I-III CRC who underwent radical surgery were included in the training cohort (n = 329), the internal validation cohort (n = 329), and the external validation cohort (n = 315). Fully quantitative pathomics features and collagen features were extracted from digital H&E images and multiphoton images of specimens, respectively. LASSO regression was utilized to develop the PCS. Then, a PCS-nomogram was constructed incorporating the PCS and clinicopathological predictors for estimating LNM in the training cohort. The performance of the PCS-nomogram was evaluated via calibration, discrimination, and clinical usefulness. Furthermore, the PCS-nomogram was tested in internal and external validation cohorts. RESULTS By LASSO regression, the PCS was developed based on 11 pathomics and 9 collagen features. A significant association was found between the PCS and LNM in the three cohorts (P < 0.001). Then, the PCS-nomogram based on PCS, preoperative CEA level, lymphadenectasis on CT, venous emboli and/or lymphatic invasion and/or perineural invasion (VELIPI), and pT stage achieved AUROCs of 0.939, 0.895, and 0.893 in the three cohorts. The calibration curves identified good agreement between the nomogram-predicted and actual outcomes. Decision curve analysis indicated that the PCS-nomogram was clinically useful. Moreover, the PCS was still an independent predictor of LNM at station Nos. 1, 2, and 3. 
The PCS nomogram displayed AUROCs of 0.849-0.939 for the training cohort, 0.837-0.902 for the internal validation cohort, and 0.851-0.895 for the external validation cohorts in the three nodal stations. CONCLUSIONS This study proposed that PCS integrating pathomics and collagen features was significantly associated with LNM, and the PCS-nomogram has the potential to be a useful tool for predicting individual LNM in CRC patients.
Affiliation(s)
- Wei Jiang
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
- School of Science, Jimei University, Xiamen, Fujian, 361021, People's Republic of China
| | - Huaiming Wang
- Department of General Surgery (Colorectal Surgery), The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, 510655, People's Republic of China
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, 510655, People's Republic of China
- Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, 510655, People's Republic of China
| | - Xiaoyu Dong
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
| | - Yandong Zhao
- Department of Pathology, The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, 510655, People's Republic of China
| | - Chenyan Long
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
- Division of Colorectal and Anal Surgery, Department of Gastrointestinal Surgery, Guangxi Medical University Cancer Hospital, Nanning, 530000, People's Republic of China
| | - Dexin Chen
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
| | - Botao Yan
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
| | - Jiaxin Cheng
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China
| | - Zexi Lin
- School of Science, Jimei University, Xiamen, Fujian, 361021, People's Republic of China
| | - Shuangmu Zhuo
- School of Science, Jimei University, Xiamen, Fujian, 361021, People's Republic of China.
| | - Hui Wang
- Department of General Surgery (Colorectal Surgery), The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, 510655, People's Republic of China.
| | - Jun Yan
- Department of General Surgery, Guangdong Provincial Key Laboratory of Precision Medicine for Gastrointestinal Tumor, Nanfang Hospital, The First School of Clinical Medicine, Southern Medical University, Guangzhou, Guangdong, 510515, People's Republic of China.
- Department of Gastrointestinal Surgery, Shenzhen People's Hospital, Second Clinical Medical College of Jinan University, First Affiliated Hospital of Southern University of Science and Technology, Shenzhen, Guangdong, 518020, People's Republic of China.
| |
Collapse
|
12
|
Hoogland J, Debray TPA, Crowther MJ, Riley RD, IntHout J, Reitsma JB, Zwinderman AH. Regularized parametric survival modeling to improve risk prediction models. Biom J 2024; 66:e2200319. [PMID: 37775946 DOI: 10.1002/bimj.202200319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 04/30/2023] [Accepted: 09/17/2023] [Indexed: 10/01/2023]
Abstract
We propose to combine the benefits of flexible parametric survival modeling and regularization to improve risk prediction modeling in the context of time-to-event data. To this end, we introduce ridge, lasso, elastic net, and group lasso penalties for both log hazard and log cumulative hazard models. The log (cumulative) hazard in these models is represented by a flexible function of time that may depend on the covariates (i.e., covariate effects may be time-varying). We show that the optimization problem for the proposed models can be formulated as a convex optimization problem and provide a user-friendly R implementation for model fitting and penalty parameter selection based on cross-validation. Simulation study results show the advantage of regularization in terms of increased out-of-sample prediction accuracy and improved calibration and discrimination of predicted survival probabilities, especially when the sample size was relatively small with respect to model complexity. An applied example illustrates the proposed methods. In summary, our work provides both a foundation for and an easily accessible implementation of regularized parametric survival modeling and suggests that it improves out-of-sample prediction performance.
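As a rough illustration of the four penalty types named in this abstract, the Python sketch below evaluates each penalty for a given coefficient vector. The function names, the elastic net mixing parameter `alpha`, and the square-root-of-group-size weights are common textbook conventions assumed here for illustration; the abstract does not state the exact scaling used, and this is not the authors' R implementation.

```python
import math

def ridge_penalty(beta, lam):
    """Ridge: lam times the sum of squared coefficients."""
    return lam * sum(b * b for b in beta)

def lasso_penalty(beta, lam):
    """Lasso: lam times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net (assumed convention): alpha=1 is the lasso,
    alpha=0 is one half of the ridge penalty."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)

def group_lasso_penalty(beta, groups, lam):
    """Group lasso: sum over groups of sqrt(group size) times the
    Euclidean norm of that group's coefficients.
    `groups` is a list of index lists (hypothetical input format)."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total
```

All four functions are convex in `beta`, which is consistent with the abstract's statement that the fitting problem can be posed as a convex optimization problem. The group lasso penalty is zero for a group only when every coefficient in that group is zero, which is what lets it select or drop whole groups of terms.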
Affiliation(s)
- J Hoogland
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- T P A Debray
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- M J Crowther
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- R D Riley
  - School for Medicine, Keele University, Keele, Staffordshire, UK
- J IntHout
  - Radboud Institute for Health Sciences (RIHS), Radboud University Medical Center, Nijmegen, The Netherlands
- J B Reitsma
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- A H Zwinderman
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
13
Naik AK, Kuppili V. An embedded feature selection method based on generalized classifier neural network for cancer classification. Comput Biol Med 2024; 168:107677. [PMID: 37988786 DOI: 10.1016/j.compbiomed.2023.107677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Revised: 10/26/2023] [Accepted: 11/06/2023] [Indexed: 11/23/2023]
Abstract
The selection of relevant genes plays a vital role in classifying high-dimensional microarray gene expression data. Sparse group Lasso and its variants have been employed for gene selection to capture the interactions of genes within a group. Most of the embedded methods are linear sparse learning models that fail to capture non-linear interactions. Additionally, very little attention is given to solving multi-class problems. The existing methods create overlapping groups, which further increases dimensionality. The paper proposes a neural network-based embedded feature selection method that can represent the non-linear relationship. In an effort toward an explainable model, a generalized classifier neural network (GCNN) is adopted as the model for the proposed embedded feature selection. GCNN has a well-defined architecture in terms of the number of layers and neurons within each layer. Each layer has a distinct functionality, eliminating the obscure nature of most neural networks. The paper proposes a feature selection approach called Weighted GCNN (WGCNN) that embeds feature weighting as a part of training the neural network. Since gene expression data comprise a large number of features, a statistically guided dropout is implemented at the input layer to avoid overfitting. The proposed method works for binary as well as multi-class classification problems. Experimental validation is carried out on seven microarray datasets with three learning models and compared with six state-of-the-art methods that are popularly employed for feature selection. The WGCNN performs well in terms of the F1 score and the number of features selected.
Affiliation(s)
- Akshata K Naik
  - Department of Computer Science and Engineering, National Institute of Technology, Farmagudi, Ponda, Goa, India
- Venkatanareshbabu Kuppili
  - Department of Computer Science and Engineering, National Institute of Technology, Farmagudi, Ponda, Goa, India
14
Lyu R, Qu Y, Divaris K, Wu D. Methodological Considerations in Longitudinal Analyses of Microbiome Data: A Comprehensive Review. Genes (Basel) 2023; 15:51. [PMID: 38254941 PMCID: PMC11154524 DOI: 10.3390/genes15010051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 12/22/2023] [Accepted: 12/26/2023] [Indexed: 01/24/2024] Open
Abstract
Biological processes underlying health and disease are inherently dynamic and are best understood when characterized in a time-informed manner. In this comprehensive review, we discuss challenges inherent in time-series microbiome data analyses and compare available approaches and methods to overcome them. Appropriate handling of longitudinal microbiome data can shed light on important roles, functions, patterns, and potential interactions between large numbers of microbial taxa or genes in the context of health, disease, or interventions. We present a comprehensive review and comparison of existing microbiome time-series analysis methods, for both preprocessing and downstream analyses, including differential analysis, clustering, network inference, and trait classification. We posit that the careful selection and appropriate utilization of computational tools for longitudinal microbiome analyses can help advance our understanding of the dynamic host-microbiome relationships that underlie health-maintaining homeostases, progressions to disease-promoting dysbioses, as well as phases of physiologic development like those encountered in childhood.
Affiliation(s)
- Ruiqi Lyu
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yixiang Qu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Kimon Divaris
  - Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Di Wu
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Division of Oral and Craniofacial Health Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
15
Tang VH, Duong STM, Nguyen CDT, Huynh TM, Duc VT, Phan C, Le H, Bui T, Truong SQH. Wavelet radiomics features from multiphase CT images for screening hepatocellular carcinoma: analysis and comparison. Sci Rep 2023; 13:19559. [PMID: 37950031 PMCID: PMC10638447 DOI: 10.1038/s41598-023-46695-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 11/03/2023] [Indexed: 11/12/2023] Open
Abstract
Early detection of liver malignancy based on medical image analysis plays a crucial role in patient prognosis and personalized treatment. This task, however, is challenging due to several factors, including medical data scarcity and limited training samples. This paper presents a study of three important aspects of radiomics features from multiphase computed tomography (CT) for classifying hepatocellular carcinoma (HCC) and other focal liver lesions: wavelet-transformed feature extraction, relevant feature selection, and radiomics feature-based classification with inadequate training samples. Our analysis shows that combining radiomics features extracted from the wavelet and original CT domains enhances the classification performance significantly, compared with using those extracted from the wavelet or original domain only. To facilitate the multi-domain and multiphase radiomics feature combination, we introduce a logistic sparsity-based model for feature selection with Bayesian optimization and find that the proposed model yields more discriminative and relevant features than several existing methods, including filter-based, wrapper-based, or other model-based techniques. In addition, we present an analysis and performance comparison with several recent deep convolutional neural network (CNN)-based feature models proposed for hepatic lesion diagnosis. The results show that in the inadequate-data scenario, the proposed wavelet radiomics feature model produces performance metrics comparable to, if not higher than, those of the CNN-based feature models in terms of area under the curve.
Affiliation(s)
- Van Ha Tang
  - VinBrain JSC., 458 Minh Khai, Hanoi, 11619, Vietnam
  - Le Quy Don Technical University, 236 Hoang Quoc Viet, Hanoi, 11917, Vietnam
- Soan T M Duong
  - VinBrain JSC., 458 Minh Khai, Hanoi, 11619, Vietnam
  - Le Quy Don Technical University, 236 Hoang Quoc Viet, Hanoi, 11917, Vietnam
- Chanh D Tr Nguyen
  - VinBrain JSC., 458 Minh Khai, Hanoi, 11619, Vietnam
  - VinUniversity, Vinhomes Ocean Park, Hanoi, 12406, Vietnam
- Thanh M Huynh
  - VinBrain JSC., 458 Minh Khai, Hanoi, 11619, Vietnam
  - VinUniversity, Vinhomes Ocean Park, Hanoi, 12406, Vietnam
- Vo T Duc
  - University Medical Center Ho Chi Minh City, 215 Hong Bang, Ho Chi Minh City, 12406, Vietnam
- Chien Phan
  - University Medical Center Ho Chi Minh City, 215 Hong Bang, Ho Chi Minh City, 12406, Vietnam
- Huyen Le
  - University Medical Center Ho Chi Minh City, 215 Hong Bang, Ho Chi Minh City, 12406, Vietnam
- Trung Bui
  - Adobe Research, San Francisco, CA, 94103, USA
- Steven Q H Truong
  - VinBrain JSC., 458 Minh Khai, Hanoi, 11619, Vietnam
  - VinUniversity, Vinhomes Ocean Park, Hanoi, 12406, Vietnam
16
Lee YH, Thaweethai T, Sheu YH, Feng YCA, Karlson EW, Ge T, Kraft P, Smoller JW. Impact of selection bias on polygenic risk score estimates in healthcare settings. Psychol Med 2023; 53:7435-7445. [PMID: 37226828 DOI: 10.1017/s0033291723001186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
BACKGROUND Hospital-based biobanks are being increasingly considered as a resource for translating polygenic risk scores (PRS) into clinical practice. However, since these biobanks originate from patient populations, there is a possibility of bias in polygenic risk estimation due to overrepresentation of patients with higher frequency of healthcare interactions. METHODS PRS for schizophrenia, bipolar disorder, and depression were calculated using summary statistics from the largest available genomic studies for a sample of 24 153 European ancestry participants in the Mass General Brigham (MGB) Biobank. To correct for selection bias, we fitted logistic regression models with inverse probability (IP) weights, which were estimated using 1839 sociodemographic, clinical, and healthcare utilization features extracted from electronic health records of 1 546 440 non-Hispanic White patients eligible to participate in the Biobank study at their first visit to the MGB-affiliated hospitals. RESULTS Case prevalence of bipolar disorder among participants in the top decile of bipolar disorder PRS was 10.0% (95% CI 8.8-11.2%) in the unweighted analysis but only 6.2% (5.0-7.5%) when selection bias was accounted for using IP weights. Similarly, case prevalence of depression among those in the top decile of depression PRS was reduced from 33.5% (31.7-35.4%) to 28.9% (25.8-31.9%) after IP weighting. CONCLUSIONS Non-random selection of participants into volunteer biobanks may induce clinically relevant selection bias that could impact implementation of PRS in research and clinical settings. As efforts to integrate PRS in medical practice expand, recognition and mitigation of these biases should be considered and may need to be optimized in a context-specific manner.
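To make the inverse probability (IP) weighting idea in this abstract concrete, here is a minimal Python sketch. It is a simplification: the study fitted logistic regression models with IP weights, whereas this sketch only computes an IP-weighted prevalence. The function names and the toy numbers are illustrative assumptions, and the selection probabilities are taken as given rather than estimated from health record features.

```python
def ip_weights(selection_probs):
    """Weight each participant by 1 / P(selected into the sample)."""
    for p in selection_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("selection probabilities must lie in (0, 1]")
    return [1.0 / p for p in selection_probs]

def weighted_prevalence(is_case, weights):
    """Weighted share of cases; with equal weights this is the raw prevalence."""
    if len(is_case) != len(weights):
        raise ValueError("is_case and weights must have the same length")
    return sum(w * c for c, w in zip(is_case, weights)) / sum(weights)

# Toy data (invented): cases are twice as likely to be sampled as non-cases.
cases = [1, 1, 0, 0]
probs = [0.5, 0.5, 0.25, 0.25]

raw = weighted_prevalence(cases, [1.0] * len(cases))
adjusted = weighted_prevalence(cases, ip_weights(probs))
```

In this toy example the raw prevalence is 0.5 while the IP-weighted prevalence is one third, because the over-sampled cases are down-weighted. This mirrors the direction of the adjustment reported in the abstract, although the numbers here are invented.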
Affiliation(s)
- Younga Heather Lee
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
- Tanayott Thaweethai
  - Harvard Medical School, Boston, Massachusetts, USA
  - Biostatistics Center, Massachusetts General Hospital, Boston, Massachusetts, USA
- Yi-Han Sheu
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
- Yen-Chen Anne Feng
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
  - Analytic and Translational Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Division of Biostatistics and Data Science, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Elizabeth W Karlson
  - Harvard Medical School, Boston, Massachusetts, USA
  - Division of Rheumatology, Immunity, and Inflammation, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Tian Ge
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
  - Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
- Peter Kraft
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
- Jordan W Smoller
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
  - Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
17
Wen Z, Long J, Zhu L, Liu S, Zeng X, Huang D, Qiu X, Su L. Associations of dietary, sociodemographic, and anthropometric factors with anemia among the Zhuang ethnic adults: a cross-sectional study in Guangxi Zhuang Autonomous Region, China. BMC Public Health 2023; 23:1934. [PMID: 37803356 PMCID: PMC10557179 DOI: 10.1186/s12889-023-16697-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2022] [Accepted: 09/04/2023] [Indexed: 10/08/2023] Open
Abstract
BACKGROUND After decades of rapid economic development, anemia remains a significant public health challenge globally. This study aimed to estimate the associations of sociodemographic, dietary, and body composition factors with anemia among the Zhuang in Guangxi Zhuang Autonomous Region, China. METHODS Our study population from the baseline survey of the Guangxi ethnic minority Cohort Study of Chronic Diseases consisted of 13,465 adults (6,779 women and 6,686 men) aged 24-82 years. A validated interviewer-administered laptop-based questionnaire system was used to collect information on participants' sociodemographic, lifestyle, and dietary factors. Each participant underwent a physical examination, and hematological indices were measured. Least absolute shrinkage and selection operator (LASSO) regression was used to select the variables, and logistic regression was applied to estimate the associations of independent risk factors with anemia. RESULTS The overall prevalences of anemia in men and women were 9.63% (95% CI: 8.94-10.36%) and 18.33% (95% CI: 17.42-19.28%), respectively. LASSO and logistic regression analyses showed that age was positively associated with anemia for both women and men. For diet in women, red meat consumption for 5-7 days/week (OR = 0.79, 95% CI: 0.65-0.98, p = 0.0290) and corn/sweet potato consumption for 5-7 days/week (OR = 0.73, 95% CI: 0.55-0.96, p = 0.0281) were negatively associated with anemia. For men, fruit consumption for 5-7 days/week (OR = 0.75, 95% CI: 0.60-0.94, p = 0.0130) and corn/sweet potato consumption for 5-7 days/week (OR = 0.66, 95% CI: 0.46-0.91, p = 0.0136) were negatively correlated with anemia. Compared with a normal body water percentage (55-65%), a body water percentage below normal (< 55%) was negatively related to anemia (OR = 0.68, 95% CI: 0.53-0.86, p = 0.0014). Conversely, a body water percentage above normal (> 65%) was positively correlated with anemia in men (OR = 1.73, 95% CI: 1.38-2.17, p < 0.0001).
CONCLUSIONS Anemia remains a moderate public health problem for premenopausal women and the elderly population in the Guangxi Zhuang minority region. The prevention of anemia at the population level requires multifaceted intervention measures according to sex and age, with a focus on dietary factors and the control of body composition.
Affiliation(s)
- Zheng Wen
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
- Jianxiong Long
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
- Lulu Zhu
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
- Shun Liu
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
  - Department of Maternal, Child and Adolescent Health, School of Public Health, Guangxi Medical University, Nanning, Guangxi, China
- Xiaoyun Zeng
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
- Dongping Huang
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
  - Department of Sanitary Chemistry, School of Public Health, Guangxi Medical University, Nanning, Guangxi, China
- Xiaoqiang Qiu
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
- Li Su
  - Department of Epidemiology and Health Statistics, School of Public Health, Guangxi Medical University, 22 Shuangyong Road, Nanning, 530021, Guangxi, China
  - Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, Guangxi Medical University, Nanning, 530021, Guangxi, China
18
Yang S, Cai Z. Cross Domain Lifelong Learning Based on Task Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:11612-11623. [PMID: 37195848 DOI: 10.1109/tpami.2023.3276991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2023]
Abstract
Humans gradually learn a sequence of cross-domain tasks and seldom experience catastrophic forgetting. In contrast, deep neural networks achieve good performance only in specific tasks within a single domain. To equip the network with lifelong learning capabilities, we propose a Cross-Domain Lifelong Learning (CDLL) framework that fully explores task similarities. Specifically, we employ a Dual Siamese Network (DSN) to learn the essential similarity features of tasks across different domains. To further understand similarity information across domains, we introduce a Domain-Invariant Feature Enhancement Module (DFEM) to better extract domain-invariant features. Moreover, we propose a Spatial Attention Network (SAN) that assigns different weights to various tasks based on the learned similarity features. Ultimately, to maximize the use of model parameters for learning new tasks, we propose a Structural Sparsity Loss (SSL) that can make the SAN as sparse as possible while ensuring accuracy. Experimental results show that our method effectively reduces catastrophic forgetting compared with state-of-the-art methods when continuously learning multiple tasks across different domains. It is worth noting that the proposed method scarcely forgets old knowledge while consistently enhancing the performance of learned tasks, more closely aligning with human learning.
19
Zhou T, Ren Z, Ma Y, He L, Liu J, Tang J, Zhang H. Early identification of bloodstream infection in hemodialysis patients by machine learning. Heliyon 2023; 9:e18263. [PMID: 37519767 PMCID: PMC10375788 DOI: 10.1016/j.heliyon.2023.e18263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 07/08/2023] [Accepted: 07/12/2023] [Indexed: 08/01/2023] Open
Abstract
Background Bloodstream infection (BSI) is a prevalent cause of admission in hemodialysis (HD) patients and is associated with increased morbidity and mortality. This study aimed to establish a diagnostic, predictive model for the early identification of BSI in HD patients. Methods HD patients who underwent blood culture testing between August 2018 and March 2022 were enrolled in this study. Machine learning algorithms, including stepwise logistic regression (SLR), Lasso logistic regression (LLR), support vector machine (SVM), decision tree, random forest (RF), and gradient boosting machine (XGBoost), were used to predict the risk of developing BSI from the patient's clinical data. The accuracy (ACC) and area under the receiver operating characteristic curve (AUC) were used to evaluate the performance of these models. The Shapley Additive Explanation (SHAP) values were used to explain each feature's predictive value on the models' output. Finally, a simplified nomogram for predicting BSI was devised. Results A total of 391 HD patients were enrolled in this study, of whom 74 (18.9%) were diagnosed with BSI. The XGBoost model achieved the highest AUC (0.914, 95% confidence interval [CI]: 0.861-0.964) and ACC (86.3%) for BSI prediction. The four most significant covariables in both the significance matrix plot of the XGBoost model variables and the SHAP summary plot were body temperature, dialysis access via a non-arteriovenous fistula (non-AVF), procalcitonin (PCT) level, and neutrophil-lymphocyte ratio (NLR). Conclusions This study created an effective machine-learning model for predicting BSI in HD patients. The model could be used to detect BSI at an early stage and hence guide antibiotic treatment in HD patients.
Affiliation(s)
- Tong Zhou
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Zhouting Ren
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Yimei Ma
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Linqian He
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Jiali Liu
  - Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
- Jincheng Tang
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Heping Zhang
  - Department of Nephrology, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
20
Won JH, Lange K, Xu J. A unified analysis of convex and non-convex ℓp-ball projection problems. Optimization Letters 2023; 17:1133-1159. [PMID: 38516636 PMCID: PMC10956251 DOI: 10.1007/s11590-022-01919-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2020] [Accepted: 07/29/2022] [Indexed: 03/23/2024]
Abstract
The task of projecting onto ℓp norm balls is ubiquitous in statistics and machine learning, yet the availability of actionable algorithms for doing so is largely limited to the special cases of p ∈ {0, 1, 2, ∞}. In this paper, we introduce novel, scalable methods for projecting onto the ℓp-ball for general p > 0. For p ≥ 1, we solve the univariate Lagrangian dual via a dual Newton method. For p < 1, we carefully design a bisection approach, presenting theoretical and empirical evidence of zero or a small duality gap in the non-convex case. Our contributions are thoroughly assessed empirically and applied to large-scale regularized multi-task learning and compressed sensing. The code implementing our methods is publicly available on GitHub.
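For orientation, the Python sketch below shows one of the well-known special cases the abstract mentions, Euclidean projection onto the ℓ1-ball (p = 1), using the standard sort-and-threshold approach. It is not the dual Newton or bisection method proposed in the paper, and the function name and default radius are illustrative assumptions.

```python
import math

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : sum(|x_i|) <= radius}.

    Standard sort-based scheme: find a threshold theta such that
    soft-thresholding |v_i| by theta gives an l1 norm equal to radius.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already inside the ball

    u = sorted((abs(x) for x in v), reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - radius) / j
        if uj - t > 0:
            theta = t   # the j largest entries stay nonzero
        else:
            break       # remaining entries are thresholded to zero

    # Shrink magnitudes by theta, keep signs, clip at zero.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]
```

For example, projecting `[3, -1, 0.5]` onto the ball of radius 2 yields a threshold of 1, so only the first coordinate survives and the result has ℓ1 norm exactly 2. The threshold plays the role of the univariate dual variable; for general p the abstract indicates that this one-dimensional quantity is found numerically rather than by sorting.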
Affiliation(s)
- Joong-Ho Won
  - Department of Statistics, Seoul National University, Seoul, Republic of Korea
- Kenneth Lange
  - Departments of Computational Medicine, Human Genetics and Statistics, University of California, Los Angeles, California, USA
- Jason Xu
  - Department of Statistical Science, Duke University, Durham, North Carolina, USA
21
Reuter A, Smolić Š, Bärnighausen T, Sudharsanan N. Predicting missed health care visits during the COVID-19 pandemic using machine learning methods: evidence from 55,500 individuals from 28 European countries. BMC Health Serv Res 2023; 23:544. [PMID: 37231416 PMCID: PMC10209940 DOI: 10.1186/s12913-023-09473-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Accepted: 04/28/2023] [Indexed: 05/27/2023] Open
Abstract
BACKGROUND Pandemics such as the COVID-19 pandemic and other severe health care disruptions put individuals at risk of missing essential care. Machine learning models that predict which patients are at greatest risk of missing care visits can help health administrators prioritize retention efforts towards patients with the most need. Such approaches may be especially useful for efficiently targeting interventions for health systems overburdened during states of emergency. METHODS We use data on missed health care visits from over 55,500 respondents of the Survey of Health, Ageing and Retirement in Europe (SHARE) COVID-19 surveys (June - August 2020 and June - August 2021) with longitudinal data from waves 1-8 (April 2004 - March 2020). We compare the performance of four machine learning algorithms (stepwise selection, lasso, random forest, and neural networks) to predict missed health care visits during the first COVID-19 survey based on common patient characteristics available to most health care providers. We test the prediction accuracy, sensitivity, and specificity of the selected models for the first COVID-19 survey by employing 5-fold cross-validation, and test the out-of-sample performance of the models by applying them to the data from the second COVID-19 survey. RESULTS Within our sample, 15.5% of the respondents reported any missed essential health care visit due to the COVID-19 pandemic. All four machine learning methods perform similarly in their predictive power. All models have an area under the curve (AUC) of around 0.61, outperforming random prediction. This performance is sustained for data from the second COVID-19 wave one year later, with an AUC of 0.59 for men and 0.61 for women.
When classifying all men (women) with a predicted risk of 0.135 (0.170) or higher as being at risk of missing care, the neural network model correctly identifies 59% (58%) of the individuals with missed care visits, and 57% (58%) of the individuals without missed care visits. As the sensitivity and specificity of the models are strongly related to the risk threshold used to classify individuals, the models can be calibrated depending on users' resource constraints and targeting approach. CONCLUSIONS Pandemics such as COVID-19 require rapid and efficient responses to reduce disruptions in health care. Based on characteristics available to health administrators or insurance providers, simple machine learning algorithms can be used to efficiently target efforts to reduce missed essential care.
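The abstract's closing point, that sensitivity and specificity depend on the risk threshold used to classify individuals, can be illustrated with a minimal sketch. The data here are synthetic and the helper name `sensitivity_specificity` is hypothetical; nothing below reproduces the SHARE data or the authors' models.

```python
import random


def sensitivity_specificity(risks, outcomes, threshold):
    """Flag everyone with predicted risk >= threshold and score the flags.

    risks: predicted probabilities; outcomes: 1 = missed care, 0 = did not.
    Returns (sensitivity, specificity).
    """
    pairs = list(zip(risks, outcomes))
    tp = sum(1 for r, y in pairs if r >= threshold and y == 1)
    fn = sum(1 for r, y in pairs if r < threshold and y == 1)
    tn = sum(1 for r, y in pairs if r < threshold and y == 0)
    fp = sum(1 for r, y in pairs if r >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic illustration only (assumed prevalence and risk distributions).
rng = random.Random(0)
outcomes = [1 if rng.random() < 0.155 else 0 for _ in range(5000)]
risks = [min(max(rng.gauss(0.18 if y else 0.15, 0.05), 0.0), 1.0) for y in outcomes]

for threshold in (0.10, 0.135, 0.17, 0.20):
    sens, spec = sensitivity_specificity(risks, outcomes, threshold)
    print(f"threshold={threshold:.3f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Raising the threshold can only lower sensitivity and raise specificity, which is the trade-off a user would tune against resource constraints.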
Affiliation(s)
- Anna Reuter
- Heidelberg Institute of Global Health, Heidelberg University, Heidelberg, Germany.
- Department of Economics, University of Göttingen, Göttingen, Germany.
| | - Šime Smolić
- Faculty of Economics and Business, University of Zagreb, Zagreb, Croatia
| | - Till Bärnighausen
- Heidelberg Institute of Global Health, Heidelberg University, Heidelberg, Germany
| | - Nikkil Sudharsanan
- Heidelberg Institute of Global Health, Heidelberg University, Heidelberg, Germany
- Professorship of Behavioral Science for Disease Prevention and Health Care, Technical University of Munich, Munich, Germany
|
22
|
Francis DP, Laustsen M, Dossi E, Treiberg T, Hardy I, Shiv SH, Hansen BS, Mogensen J, Jakobsen MH, Alstrøm TS. Machine learning methods for the detection of explosives, drugs and precursor chemicals gathered using a colorimetric sniffer sensor. ANALYTICAL METHODS : ADVANCING METHODS AND APPLICATIONS 2023; 15:2343-2354. [PMID: 37157832 DOI: 10.1039/d3ay00247k] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/10/2023]
Abstract
Colorimetric sensing technology for the detection of explosives, drugs, and their precursor chemicals is an important and effective approach. In this work, we use various machine learning models to detect these substances from colorimetric sensing experiments conducted in controlled environments. The detection experiments, based on the response of a colorimetric chip containing 26 chemo-responsive dyes, indicate that homemade explosives (HMEs) such as hexamethylene triperoxide diamine (HMTD), triacetone triperoxide (TATP), and methyl ethyl ketone peroxide (MEKP) used in improvised explosive devices are detected with true positive rates (TPR) of 70-75%, 73-90% and 60-82%, respectively. Time series classifiers such as Convolutional Neural Networks (CNN) are explored, and the results indicate that improvements can be achieved by using the kinetics of the chemical responses. The use of CNNs is limited, however, to scenarios where a large number of measurements, typically in the range of a few hundred, of each analyte are available. Feature selection of important dyes using the Group Lasso (GPLASSO) algorithm indicated that certain dyes are more important in discriminating an analyte from ambient air. This information could be used for optimizing the colorimetric sensor and extending the detection to more analytes.
Affiliation(s)
- Deena P Francis
- DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
| | | | - Eleftheria Dossi
- Centre for Defence Chemistry, Cranfield University, Defence Academy of United Kingdom, Shrivenham, SN6 8LA, UK
| | - Tuule Treiberg
- DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Iona Hardy
- Centre for Defence Chemistry, Cranfield University, Defence Academy of United Kingdom, Shrivenham, SN6 8LA, UK
| | - Shai Hvid Shiv
- DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | | | - Jesper Mogensen
- Danish Emergency Management Agency, Chemical Division, Nørre Allé 67, 2100 Copenhagen, Denmark
| | - Mogens H Jakobsen
- DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Tommy S Alstrøm
- DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
|
23
|
Wang J, Huang S, Wang Z, Huang D, Qin J, Wang H, Wang W, Liang Y. A calibrated SVM based on weighted smooth GL1/2 for Alzheimer’s disease prediction. Comput Biol Med 2023; 158:106752. [PMID: 37003069 DOI: 10.1016/j.compbiomed.2023.106752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 01/17/2023] [Accepted: 03/06/2023] [Indexed: 03/31/2023]
Abstract
Alzheimer's disease (AD) is currently one of the most common diseases of old age worldwide. Predicting the early stage of AD is a key problem. Low recognition accuracy for AD and high redundancy among brain lesions are the main obstacles. Traditionally, the Group Lasso method can achieve good sparsity, but it ignores redundancy inside each group. This paper proposes an improved smooth classification framework that combines the weighted smooth GL1/2 (wSGL1/2) as the feature selection method with a calibrated support vector machine (cSVM) as the classifier. wSGL1/2 can make features sparse both between groups and within groups, and the group weights can further improve the efficiency of the model. cSVM can enhance the speed and stability of the model by adding a calibrated hinge function. Before feature selection, an anatomical boundary-based clustering, called ac-SLIC-AAL, is designed to gather adjacent similar voxels into one group, accommodating the overall differences across all data. The cSVM model offers fast convergence, high accuracy, and good interpretability on AD classification, AD early diagnosis, and MCI transition prediction. In experiments, all steps are tested separately, including classifier comparison, feature selection verification, generalization verification, and comparison with state-of-the-art methods. The results are supportive and satisfactory, and the superiority of the proposed model is verified overall. At the same time, the algorithm can point out the important brain areas in the MRI, which has important reference value for the doctor's predictive work. The source code and data are available at http://github.com/Hu-s-h/c-SVMForMRI.
Affiliation(s)
- Jinfeng Wang
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou, 510642, Guangdong, China.
| | - Shuaihui Huang
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou, 510642, Guangdong, China
| | - Zhiwen Wang
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou, 510642, Guangdong, China
| | - Dong Huang
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou, 510642, Guangdong, China
| | - Jing Qin
- Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Hui Wang
- School of EEECS, Queen's University Belfast, Belfast, UK
| | - Wenzhong Wang
- College of Economics and Management, South China Agricultural University, Guangzhou, 510642, Guangdong, China
| | - Yong Liang
- Peng Cheng Laboratory, 518005, Shenzhen, Guangdong, China
|
24
|
Jiang S, Cao J, Colditz GA. Identifying regions of interest in mammogram images. Stat Methods Med Res 2023; 32:895-903. [PMID: 36951095 PMCID: PMC10247406 DOI: 10.1177/09622802231160551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/24/2023]
Abstract
Screening mammography is the primary preventive strategy for early detection of breast cancer and an essential input to breast cancer risk prediction and application of prevention/risk management guidelines. Identifying regions of interest within mammogram images that are associated with 5- or 10-year breast cancer risk is therefore clinically meaningful. The problem is complicated by the irregular boundary issue posed by the semi-circular domain of the breast area within mammograms. Accommodating the irregular domain is especially crucial when identifying regions of interest, as the true signal comes only from the semi-circular domain of the breast region, and noise elsewhere. We address these challenges by introducing a proportional hazards model with imaging predictors characterized by bivariate splines over triangulation. The model sparsity is enforced with the group lasso penalty function. We apply the proposed method to the motivating Joanne Knight Breast Health Cohort to illustrate important risk patterns and show that the proposed method is able to achieve higher discriminatory performance.
Affiliation(s)
- Shu Jiang
- Division of Public Health Sciences, Washington University School of Medicine, St Louis, MO, USA
| | - Jiguo Cao
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
| | - Graham A. Colditz
- Division of Public Health Sciences, Washington University School of Medicine, St Louis, MO, USA
|
25
|
van Nee MM, Wessels LFA, van de Wiel MA. ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinformatics 2023; 24:172. [PMID: 37101151 PMCID: PMC10134536 DOI: 10.1186/s12859-023-05289-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2022] [Accepted: 04/12/2023] [Indexed: 04/28/2023] Open
Abstract
BACKGROUND High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed. RESULTS Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper. 
CONCLUSIONS The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on ( https://cran.r-project.org/web/packages/ecpc/ ).
Affiliation(s)
- Mirrelijn M van Nee
- Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands.
| | - Lodewyk F A Wessels
- Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands
- Intelligent Systems, Delft University of Technology, Delft, The Netherlands
| | - Mark A van de Wiel
- Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
|
26
|
Alramadeen W, Ding Y, Costa C, Si B. A Novel Sparse Linear Mixed Model for Multi-Source Mixed-Frequency Data Fusion in Telemedicine. IISE TRANSACTIONS ON HEALTHCARE SYSTEMS ENGINEERING 2023; 13:215-225. [PMID: 37635864 PMCID: PMC10454975 DOI: 10.1080/24725579.2023.2202877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/29/2023]
Abstract
Digital health and telemonitoring have resulted in a wealth of information being collected to monitor, manage, and improve human health. The multi-source mixed-frequency health data overwhelm the modeling capacity of existing statistical and machine learning models, due to many challenging properties. Although predictive analytics for big health data plays an important role in telemonitoring, there is a lack of rigorous prediction models that can automatically predict patients' health conditions, e.g., Disease Severity Indicators (DSIs), from multi-source mixed-frequency data. Sleep disorder is a prevalent condition that is characterized by abnormal respiratory patterns during sleep. Although wearable devices are available to administer sleep studies at home, the manual scoring process to generate the DSI remains a bottleneck in automated monitoring and diagnosis of sleep disorder. To address the multi-fold challenges for precise prediction of the DSI from high-dimensional multi-source mixed-frequency data in sleep disorder, we propose a sparse linear mixed model that combines the modified Cholesky decomposition with group lasso penalties to enable joint group selection of fixed effects and random effects. A novel Expectation Maximization (EM) algorithm integrated with an efficient Majorization Maximization (MM) algorithm is developed for model estimation of the proposed sparse linear mixed model with group variable selection. The proposed method was applied to the SHHS data for telemonitoring and diagnosis of sleep disorder and found a few significant feature groups that are consistent with prior medical studies on sleep disorder. The proposed method also outperformed a few benchmark methods, achieving the highest prediction accuracy.
Affiliation(s)
- Wesam Alramadeen
- Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, NY 13902, USA
| | - Yu Ding
- Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, NY 13902, USA
| | - Carlos Costa
- IBM T. J. Watson Research Center, Yorktown Heights, NY 10510, USA
| | - Bing Si
- Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, NY 13902, USA
|
27
|
Frndak S, Yu G, Oulhote Y, Queirolo EI, Barg G, Vahter M, Mañay N, Peregalli F, Olson JR, Ahmed Z, Kordas K. Reducing the complexity of high-dimensional environmental data: An analytical framework using LASSO with considerations of confounding for statistical inference. Int J Hyg Environ Health 2023; 249:114116. [PMID: 36805184 PMCID: PMC10977870 DOI: 10.1016/j.ijheh.2023.114116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Revised: 01/10/2023] [Accepted: 01/17/2023] [Indexed: 02/19/2023]
Abstract
PURPOSE Frameworks for selecting exposures in high-dimensional environmental datasets, while considering confounding, are lacking. We present a two-step approach for exposure selection with subsequent confounder adjustment for statistical inference. METHODS We measured cognitive ability in 338 children using the Woodcock-Muñoz General Intellectual Ability (GIA) score, and potential associated features across several environmental domains. Initially, 111 variables theoretically associated with GIA score were introduced into a Least Absolute Shrinkage and Selection Operator (LASSO) in a 50% feature selection subsample. Effect estimates for selected features were subsequently modeled in linear regressions in a 50% inference (hold out) subsample, first adjusting for sex and age and later for covariates selected via directed acyclic graphs (DAGs). All models were adjusted for clustering by school. RESULTS Of the 15 LASSO selected variables, eleven were not associated with GIA score following our inference modeling approach. Four variables were associated with GIA scores, including: serum ferritin adjusted for inflammation (inversely), mother's IQ (positively), father's education (positively), and hours per day the child works on homework (positively). Serum ferritin was not in the expected direction. CONCLUSIONS Our two-step approach moves high-dimensional feature selection a step further by incorporating DAG-based confounder adjustment for statistical inference.
Affiliation(s)
- Seth Frndak
- Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA.
| | - Guan Yu
- Department of Biostatistics: University of Pittsburgh, USA
| | - Youssef Oulhote
- Department of Epidemiology, University of Massachusetts Amherst, USA
| | - Elena I Queirolo
- Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
| | - Gabriel Barg
- Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
| | - Marie Vahter
- Department of Environmental Medicine: Karolinska Institute, Sweden
| | - Nelly Mañay
- Faculty of Chemistry, University of the Republic of Uruguay (UDELAR), Montevideo, Uruguay
| | - Fabiana Peregalli
- Department of Neuroscience and Learning, Catholic University of Uruguay, Montevideo, Uruguay
| | - James R Olson
- Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA
| | - Zia Ahmed
- Research and Education in eNergy, Environment and Water (RENEW) Institute University at Buffalo, The State University of New York, USA
| | - Katarzyna Kordas
- Department of Epidemiology and Environmental Health: University at Buffalo, The State University of New York, USA
|
28
|
Song X, Liang K, Li J. WGRLR: A Weighted Group Regularized Logistic Regression for Cancer Diagnosis and Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1563-1573. [PMID: 36044492 DOI: 10.1109/tcbb.2022.3203167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Sparse regressions applied to cancer diagnosis face difficulties with noise reduction, gene grouping, and group significance evaluation. This paper presents the weighted group regularized logistic regression (WGRLR) to address these problems. Clean data were separated from noisy gene expression profile data, and gene grouping and model building were then performed on the clean data. An interpretable gene group significance evaluation criterion was proposed based on symmetrical uncertainty and module eigengene. A group-wise individual gene significance evaluation criterion was also presented. The performance of the proposed method was compared with WGGL, ASGL-CMI, SGL, GL, Elastic Net, and lasso on acute leukemia and brain cancer data. Experimental results demonstrate that the proposed method is superior to the other six methods in cancer diagnosis accuracy and gene selection.
|
29
|
Choi G, Kim W, Koo J. Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants. BIOTECHNOL BIOPROC E 2023. [DOI: 10.1007/s12257-022-0330-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
|
30
|
Bernardi M, Guidolin M. The determinants of Airbnb prices in New York City: a spatial quantile regression approach. J R Stat Soc Ser C Appl Stat 2023. [DOI: 10.1093/jrsssc/qlad001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
Abstract
In this paper, we study the price determinants of Airbnb rentals, for the case of New York City, by developing a new dataset, which combines attributes of the property and of the related service, with other information available as open data. This dataset is employed within a spatial quantile semiparametric regression model, able to handle the intrinsic heterogeneity of house prices. The results confirm that property and service attributes play a significant role in determining rental prices, while some variables exert a different impact on prices in magnitude and sign, depending on the quantile considered.
Affiliation(s)
- Mauro Bernardi
- Department of Statistical Science, University of Padova, Via Cesare Battisti 241, Padova, Italy
| | - Mariangela Guidolin
- Department of Statistical Science, University of Padova, Via Cesare Battisti 241, Padova, Italy
|
31
|
Multi-modality data-driven analysis of diagnosis and treatment of psoriatic arthritis. NPJ Digit Med 2023; 6:13. [PMID: 36732611 PMCID: PMC9895430 DOI: 10.1038/s41746-023-00757-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Accepted: 01/16/2023] [Indexed: 02/04/2023] Open
Abstract
Psoriatic arthritis (PsA) is associated with psoriasis and is characterized by irreversible joint symptoms. Despite its significant impact on the healthcare system, it is still challenging to leverage machine learning or statistical models to predict PsA and its progression, or to analyze drug efficacy. With 3961 patients' clinical records, we developed machine learning models for PsA diagnosis and for analysis of PsA progression risk. Furthermore, generalized additive models (GAMs) and the Kaplan-Meier (KM) method were applied to analyze the efficacy of various drugs in treating psoriasis and inhibiting PsA progression. The independent experiment on the PsA prediction model demonstrates outstanding prediction performance, with an AUC score of 0.87 and an AUPR score of 0.89, and the Jackknife validation test on the PsA progression prediction model also suggests superior performance, with an AUC score of 0.80 and an AUPR score of 0.83. We also identified that interleukin-17 inhibitors were more effective for severe psoriasis than other drugs, and that methotrexate had a lower effect in inhibiting PsA progression. The results demonstrate that machine learning and statistical approaches enable accurate early prediction of PsA and its progression, and analysis of drug efficacy.
|
32
|
Han Y, Tsay RS, Wu WB. High dimensional generalized linear models for temporal dependent data. BERNOULLI 2023. [DOI: 10.3150/21-bej1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Affiliation(s)
- Yuefeng Han
- Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA
| | - Ruey S. Tsay
- Booth School of Business, University of Chicago, Chicago, IL 60637, USA
| | - Wei Biao Wu
- Department of Statistics, University of Chicago, Chicago, IL 60637, USA
|
33
|
Lin FC, Shih YS, Yu YB. A nonparametric method for classification trees using grouped covariates. Biom J 2023; 65:e2100107. [PMID: 36161314 PMCID: PMC9925394 DOI: 10.1002/bimj.202100107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2021] [Revised: 03/11/2022] [Accepted: 04/03/2022] [Indexed: 11/11/2022]
Abstract
A group of variables are commonly seen in diagnostic medicine when multiple prognostic factors are aggregated into a composite score to represent the risk profile. A model selection method considers these covariates as all-in or all-out types. Model selection procedures for grouped covariates and their applications have thrived in recent years, in part because of the development of genetic research in which gene-gene or gene-environment interactions and regulatory network pathways are considered groups of individual variables. However, little has been discussed on how to utilize grouped covariates to grow a classification tree. In this paper, we propose a nonparametric method to address the selection of split variables for grouped covariates and their following selection of split points. Comprehensive simulations were implemented to show the superiority of our procedures compared to a commonly used recursive partition algorithm. The practical use of our method is demonstrated through a real data analysis that uses a group of prognostic factors to classify the successful mobilization of peripheral blood stem cells.
Affiliation(s)
- Feng-Chang Lin
- Department of Biostatistics, University of North Carolina, McGavran-Greenberg Hall, Chapel Hill, North Carolina, USA
| | - Yu-Shan Shih
- Department of Mathematics, National Chung Cheng University, Chia-Yi, Taiwan ROC
| | - Yuan-Bin Yu
- Division of Oncology and Hematology, Department of Medicine, Far Eastern Memorial Hospital, Banciao Dist, New Taipei City, Taiwan ROC
|
34
|
Saha A, Sundaram R. Variable selection for discrete survival model with frailty in presence of left truncation and right censoring: Studying association of environmental toxicants on time-to-pregnancy. Stat Med 2023; 42:193-208. [PMID: 36457137 DOI: 10.1002/sim.9609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Revised: 09/11/2022] [Accepted: 11/07/2022] [Indexed: 12/05/2022]
Abstract
Understanding the association between mixtures of environmental toxicants and time-to-pregnancy (TTP) is an important scientific question as sufficient evidence has emerged about the impact of individual toxicants on reproductive health and that individuals are exposed to a whole host of toxicants rather than an individual toxicant. Assessing mixtures of chemical effects on TTP poses significant statistical challenges, namely (i) TTP being a discrete survival outcome, typically subject to left truncation and right censoring, (ii) chemical exposures being strongly correlated, (iii) appropriate transformation to account for some lipid-binding chemicals, (iv) non-linear effects of some chemicals, and (v) high percentage of concentration below the limit of detection (LOD) for some chemicals. We propose a discrete frailty modeling framework (named Discnet) that allows selection of correlated covariates while appropriately addressing the methodological issues mentioned above. Discnet is shown to have better and stable false negative and false positive rates compared to alternative methods in various simulation settings. We did a detailed analysis of the pre-conception endocrine disrupting chemicals and TTP from the LIFE study and found that older females, female exposure to cotinine (smoking), DDT conferred a delay in getting pregnant, which was consistent across various approaches to account for LOD as well as non-linear associations.
Affiliation(s)
- Abhisek Saha
- Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, USA
| | - Rajeshwari Sundaram
- Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, USA
|
35
|
Li X, Ma Y, Pan Q. Standardization of continuous and categorical covariates in sparse penalized regressions. Stat Methods Med Res 2023; 32:41-54. [PMID: 36189470 DOI: 10.1177/09622802221129042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
Abstract
In sparse penalized regressions, candidate covariates measured in different units need to be standardized beforehand so that the coefficient sizes are directly comparable and reflect their relative impacts, which leads to fairer variable selection. However, when covariates of mixed data types (e.g. continuous, binary or categorical) exist in the same dataset, the commonly used standardization methods may lead to different selection probabilities even when the covariates have the same impact on or level of association with the outcome. In this paper, we propose a novel standardization method that aims to generate comparable selection probabilities in sparse penalized regressions for continuous, binary or categorical covariates with the same impact. We illustrate the advantages of the proposed method in simulation studies, and apply it to the National Ambulatory Medical Care Survey data to select factors related to opioid prescription in the US.
Affiliation(s)
- Xiang Li
- Statistics Department, George Washington University, Washington, DC, USA
| | - Yong Ma
- Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD, USA
| | - Qing Pan
- Statistics Department, George Washington University, Washington, DC, USA
|
36
|
Applied machine learning to identify differential risk groups underlying externalizing and internalizing problem behaviors trajectories: A case study using a cohort of Asian American children. PLoS One 2023; 18:e0282235. [PMID: 36867610 PMCID: PMC9983857 DOI: 10.1371/journal.pone.0282235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Accepted: 02/09/2023] [Indexed: 03/04/2023] Open
Abstract
BACKGROUND Internalizing and externalizing problems account for over 75% of the mental health burden in children and adolescents in the US, with higher burden among minority children. While complex interactions of multilevel factors are associated with these outcomes and may enable early identification of children in higher risk, prior research has been limited by data and application of traditional analysis methods. In this case example focused on Asian American children, we address the gap by applying data-driven statistical and machine learning methods to study clusters of mental health trajectories among children, investigate optimal predictions of children at high-risk cluster, and identify key early predictors. METHODS Data from the US Early Childhood Longitudinal Study 2010-2011 were used. Multilevel information provided by children, families, teachers, schools, and care-providers were considered as predictors. Unsupervised machine learning algorithm was applied to identify groups of internalizing and externalizing problems trajectories. For prediction of high-risk group, ensemble algorithm, Superlearner, was implemented by combining several supervised machine learning algorithms. Performance of Superlearner and candidate algorithms, including logistic regression, was assessed using discrimination and calibration metrics via crossvalidation. Variable importance measures along with partial dependence plots were utilized to rank and visualize key predictors. FINDINGS We found two clusters suggesting high- and low-risk groups for both externalizing and internalizing problems trajectories. While Superlearner had overall best discrimination performance, logistic regression had comparable performance for externalizing problems but worse for internalizing problems. Predictions from logistic regression were not well calibrated compared to those from Superlearner, however they were still better than few candidate algorithms. 
Important predictors identified were a combination of test scores, child factors, teacher-rated scores, and contextual factors, which showed non-linear associations with predicted probabilities. CONCLUSIONS We demonstrated the application of a data-driven analytical approach to predict mental health outcomes among Asian American children. Findings from the cluster analysis can inform the critical age for early intervention, while prediction analysis has the potential to inform intervention programming prioritization decisions. However, to better understand the external validity, replicability, and value of machine learning in broader mental health research, more studies applying similar analytical approaches are needed.
|
37
|
Hyperchloremia and association with acute kidney injury in critically ill children. Pediatr Nephrol 2022:10.1007/s00467-022-05823-8. [PMID: 36409366 DOI: 10.1007/s00467-022-05823-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/14/2022] [Revised: 10/30/2022] [Accepted: 11/07/2022] [Indexed: 11/22/2022]
Abstract
BACKGROUND Hyperchloremia has been associated with acute kidney injury (AKI) in critically ill adult patients. Data are limited in pediatric patients. Our study sought to determine if an association exists between hyperchloremia and AKI in pediatric patients admitted to the intensive care unit (PICU). METHODS This is a single-center retrospective cohort study of pediatric patients admitted to the PICU for greater than 24 h and who received intravenous fluids. Patients were excluded if they had a diagnosis of kidney disease or required kidney replacement therapy (KRT) within 6 h of admission. The exposure was hyperchloremia (serum chloride ≥ 110 mmol/L) within the first 7 days of PICU admission. The primary outcome was the development of AKI using the Kidney Disease Improving Global Outcomes (KDIGO) criteria. Secondary outcomes included time on mechanical ventilation, new KRT, PICU length of stay, and mortality. Outcomes were analyzed using multivariate logistic regression. RESULTS There were 407 patients included in the study, 209 in the hyperchloremic group and 198 in the non-hyperchloremic group. Univariate analysis demonstrated AKI in 108 (51.7%) patients in the hyperchloremic group vs. 54 (27.3%) in the non-hyperchloremic group (p < .001). On multivariate analysis, the odds ratio of AKI with hyperchloremia was 2.24 (95% CI 1.39-3.61) (p = .001). Hyperchloremia was not associated with increased odds of mortality, need for KRT, time on mechanical ventilation, or length of stay. CONCLUSION Hyperchloremia was associated with AKI in critically ill pediatric patients. Further pediatric clinical trials are needed to determine the benefit of a chloride restrictive vs. liberal fluid strategy. A higher resolution version of the Graphical abstract is available as Supplementary information.
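As a rough illustration of how the univariate counts in this abstract relate to an odds ratio, the sketch below computes the unadjusted odds ratio from the reported 2×2 counts (108 of 209 vs. 54 of 198 with AKI). The function name and the Wald-type 95% interval are assumptions made for illustration; the abstract does not say how its intervals were computed, and the adjusted estimate it reports (2.24) comes from a multivariate model that this sketch does not reproduce.

```python
import math

def unadjusted_odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Unadjusted odds ratio with a Wald 95% interval (hypothetical helper).

    Assumes all four cells of the 2x2 table are non-zero.
    """
    a = events_exposed                   # exposed, with outcome
    b = n_exposed - events_exposed       # exposed, without outcome
    c = events_unexposed                 # unexposed, with outcome
    d = n_unexposed - events_unexposed   # unexposed, without outcome
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.96  # normal quantile for a two-sided 95% interval
    ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, ci_low, ci_high

# Counts taken from the abstract: AKI in 108/209 vs. 54/198 patients.
crude_or, ci_low, ci_high = unadjusted_odds_ratio(108, 209, 54, 198)
print(f"crude OR = {crude_or:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The crude estimate is larger than the adjusted odds ratio in the abstract, which is what one would expect if some of the association is explained by the covariates in the multivariate model.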
|
38
|
van Nee MM, van de Brug T, van de Wiel MA. Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022; 32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]
Abstract
Elastic net penalization is widely used in high-dimensional prediction and variable selection settings. Auxiliary information on the variables, for example, groups of variables, is often available. Group-adaptive elastic net penalization exploits this information to potentially improve performance by estimating group penalties, thereby penalizing important groups of variables less than other groups. Estimating these group penalties is, however, hard due to the high dimension of the data. Existing methods are computationally expensive or not generic in the type of response. Here we present a fast method for estimation of group-adaptive elastic net penalties for generalized linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that this marginal likelihood approximates that of elastic net models. The ridge group penalties are then transformed to elastic net group penalties by matching the ridge prior variance to the elastic net prior variance as function of the group penalties. The method allows for overlapping groups and unpenalized variables, and is easily extended to other penalties. For a model-based simulation study and two cancer genomics applications we demonstrate a substantially decreased computation time and improved or matching performance compared to other methods. Supplementary materials for this article are available online.
Affiliation(s)
- Mirrelijn M. van Nee
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| | - Tim van de Brug
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| | - Mark A. van de Wiel
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| |
|
39
|
Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023] Open
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
Affiliation(s)
- Nikita Bhandari
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
| | - Rahee Walambe
- Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
| | - Ketan Kotecha
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
| | - Satyajeet P. Khare
- Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
| |
|
40
|
Prediction Model for 30-Day Mortality after Non-Cardiac Surgery Using Machine-Learning Techniques Based on Preoperative Evaluation of Electronic Medical Records. J Clin Med 2022; 11:jcm11216487. [PMID: 36362715 PMCID: PMC9659244 DOI: 10.3390/jcm11216487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 10/24/2022] [Accepted: 10/25/2022] [Indexed: 11/06/2022] Open
Abstract
Background: Machine-learning techniques are useful for creating prediction models in clinical practice. This study aimed to construct a prediction model of postoperative 30-day mortality based on an automatically extracted electronic preoperative evaluation sheet. Methods: We used data from 276,341 consecutive adult patients who underwent non-cardiac surgery between January 2011 and December 2020 at a tertiary center for model development and internal validation, and another dataset from 63,384 patients between January 2011 and October 2021 at another center for external validation. Postoperative 30-day mortality was 0.16%. We developed an extreme gradient boosting (XGB) prediction model using only variables from preoperative evaluation sheets. Results: The model yielded an area under the curve of 0.960 and an area under the precision and recall curve of 0.216, which were 0.932 and 0.122, respectively, in the external validation set. The optimal threshold calculated by Youden’s J statistic had a sensitivity of 0.885 and specificity of 0.914. In an additional analysis with balanced distribution, the model showed a similar predictive value. Conclusion: We presented a machine-learning prediction model for 30-day mortality after non-cardiac surgery using preoperative variables automatically extracted from electronic medical records and validated the model in a multi-center setting. Our model may help clinicians predict postoperative outcomes.
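To make the threshold-selection step in this abstract concrete, the sketch below shows Youden's J statistic, J = sensitivity + specificity − 1, and a simple search for the score cutoff that maximizes it. The helper names and the toy scores and labels are hypothetical; only the sensitivity (0.885) and specificity (0.914) come from the abstract, and the authors' actual implementation is not described there.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic for one operating point."""
    return sensitivity + specificity - 1.0

def best_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J (hypothetical helper).

    A case is predicted positive when score >= threshold. Assumes labels
    are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = (None, -1.0)
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = youden_j(true_pos / positives, true_neg / negatives)
        if j > best[1]:  # ties keep the lowest threshold
            best = (t, j)
    return best

# J implied by the sensitivity and specificity reported in the abstract.
reported_j = youden_j(0.885, 0.914)

# Toy data, invented purely to exercise the search.
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_threshold, toy_j = best_threshold(toy_scores, toy_labels)
print(f"reported J = {reported_j:.3f}; toy threshold = {toy_threshold}, J = {toy_j:.3f}")
```

Because J weights sensitivity and specificity equally, it ignores class prevalence; with a 30-day mortality of 0.16%, a cutoff chosen this way can still produce many false positives, which is consistent with the modest area under the precision-recall curve reported above.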
|
41
|
Gao R, Sarkka S, Claveria-Vega R, Godsill S. Autonomous Tracking and State Estimation With Generalized Group Lasso. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:12056-12070. [PMID: 34166218 DOI: 10.1109/tcyb.2021.3085426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
We address the problem of autonomous tracking and state estimation for marine vessels, autonomous vehicles, and other dynamic signals under a (structured) sparsity assumption. The aim is to improve the tracking and estimation accuracy with respect to the classical Bayesian filters and smoothers. We formulate the estimation problem as a dynamic generalized group Lasso problem and develop a class of smoothing-and-splitting methods to solve it. The Levenberg-Marquardt iterated extended Kalman smoother-based multiblock alternating direction method of multipliers (LM-IEKS-mADMMs) algorithms are based on the alternating direction method of multipliers (ADMMs) framework. This leads to minimization subproblems with an inherent structure to which three new augmented recursive smoothers are applied. Our methods can deal with large-scale problems without preprocessing for dimensionality reduction. Moreover, the methods allow one to solve nonsmooth nonconvex optimization problems. We then prove that under mild conditions, the proposed methods converge to a stationary point of the optimization problem. By simulated and real-data experiments, including multisensor range measurement problems, marine vessel tracking, autonomous vehicle tracking, and audio signal restoration, we show the practical effectiveness of the proposed methods.
|
42
|
Xu Y, Tao T, Li S, Tan S, Liu H, Zhu X. Prognostic model and immunotherapy prediction based on molecular chaperone-related lncRNAs in lung adenocarcinoma. Front Genet 2022; 13:975905. [PMID: 36313456 PMCID: PMC9606628 DOI: 10.3389/fgene.2022.975905] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/22/2022] [Accepted: 09/21/2022] [Indexed: 11/17/2022] Open
Abstract
Introduction: Molecular chaperones and long non-coding RNAs (lncRNAs) have been confirmed to be closely related to the occurrence and development of tumors, especially lung cancer. Our study aimed to construct a molecular chaperone-related long non-coding RNAs (MCRLncs) marker to accurately predict the prognosis of lung adenocarcinoma (LUAD) patients and find new immunotherapy targets. Methods: In this study, we acquired molecular chaperone genes from two databases, GeneCards and the Molecular Signatures Database (MSigDB). We then downloaded transcriptome data, clinical data, and mutation information of LUAD patients from The Cancer Genome Atlas (TCGA). MCRLncs were determined by Spearman correlation analysis. We used univariate, least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analysis to construct risk models. Kaplan-Meier (KM) analysis was used to understand the difference in survival between high and low-risk groups. Nomogram, calibration curve, concordance index (C-index) curve, and receiver operating characteristic (ROC) curve were used to evaluate the accuracy of the risk model prediction. In addition, we used Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses to explore the potential biological functions of MCRLncs. Immune microenvironmental landscapes were constructed by using single-sample gene set enrichment analysis (ssGSEA), tumor immune dysfunction and exclusion (TIDE) algorithm, “pRRophetic” R package, and “IMvigor210” dataset. The stem cell index based on mRNAsi expression was used to further evaluate the patient’s prognosis. Results: Sixteen MCRLncs were identified as independent prognostic indicators in patients with LUAD. Patients in the high-risk group had significantly worse overall survival (OS). ROC curve suggested that the prognostic features of MCRLncs had a good predictive ability for OS.
Immune system activation was more pronounced in the high-risk group. Prognostic features of the high-risk group were strongly associated with exclusion and cancer-associated fibroblasts (CAF). According to this prognostic model, a total of 15 potential chemotherapeutic agents were screened for the treatment of LUAD. Immunotherapy analysis showed that the selected chemotherapeutic drugs had potential application value. Stem cell index mRNAsi correlates with prognosis in patients with LUAD. Conclusion: Our study established a novel MCRLncs marker that can effectively predict OS in LUAD patients and provided a new model for the application of immunotherapy in clinical practice.
Affiliation(s)
- Yue Xu
- Marine Medical Research Institute, Guangdong Medical University, Zhanjiang, China
| | - Tao Tao
- Department of Gastroscope, Zibo Central Hospital, Zibo, China
| | - Shi Li
- Guangdong Provincial Key Laboratory of Systems Biology and Synthetic Biology for Urogenital Tumors, Shenzhen Key Laboratory of Genitourinary Tumor, Department of Urology, The First Affiliated Hospital of Shenzhen University, Shenzhen Second People’s Hospital (Shenzhen Institute of Translational Medicine), Shenzhen, China
| | - Shuzhen Tan
- Department of Dermatology, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
| | - Haiyan Liu
- Department of Cardiovascular Medicine, Nanchong Central Hospital, The Affiliated Nanchong Central Hospital of North Sichuan Medical College, Nanchong, China
- *Correspondence: Haiyan Liu; Xiao Zhu
| | - Xiao Zhu
- Marine Medical Research Institute, Guangdong Medical University, Zhanjiang, China
- Guangdong Provincial Key Laboratory of Systems Biology and Synthetic Biology for Urogenital Tumors, Shenzhen Key Laboratory of Genitourinary Tumor, Department of Urology, The First Affiliated Hospital of Shenzhen University, Shenzhen Second People’s Hospital (Shenzhen Institute of Translational Medicine), Shenzhen, China
- Laboratory of Molecular Diagnosis, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- *Correspondence: Haiyan Liu; Xiao Zhu
| |
|
43
|
Li P, Jiao Y, Lu X, Kang L. A data-driven line search rule for support recovery in high-dimensional data analysis. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
44
|
Park S, Lee ER, Zhao H. Low-rank regression models for multiple binary responses and their applications to cancer cell-line encyclopedia data. J Am Stat Assoc 2022; 119:202-216. [PMID: 38481466 PMCID: PMC10928550 DOI: 10.1080/01621459.2022.2105704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 07/16/2022] [Indexed: 10/16/2022]
Abstract
In this paper, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated from many biomedical studies with correlated multiple responses such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation framework based on marginal model likelihood, and we develop an efficient computational algorithm for inference. We establish a novel high-dimensional theory for this nonlinear multivariate regression. Our theory is general, allowing for potential correlations between the binary responses. We propose a new type of nuclear norm penalty using the smooth clipped absolute deviation, filling the gap in the related non-convex penalization literature. We theoretically demonstrate that the proposed approach improves estimation accuracy by considering multiple responses jointly through the proposed estimator when the underlying coefficient matrix is low-rank and row-wise sparse. In particular, we establish the non-asymptotic error bounds, and both rank and row support consistency of the proposed method. Moreover, we develop a consistent rule to simultaneously select the rank and row dimension of the coefficient matrix. Furthermore, we extend the proposed methods and theory to a joint Ising model, which accounts for the dependence relationships. In our analysis of both simulated data and the cancer cell line encyclopedia data, the proposed methods outperform the existing methods in better predicting responses.
Affiliation(s)
- Seyoung Park
- Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
| | - Eun Ryung Lee
- Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, CT, 06511, USA
| |
|
45
|
Wang Y, Huang X, Chen S, Jiang H, Rao H, Lu L, Wen F, Pei J. In Silico Identification and Validation of Cuproptosis-Related LncRNA Signature as a Novel Prognostic Model and Immune Function Analysis in Colon Adenocarcinoma. Curr Oncol 2022; 29:6573-6593. [PMID: 36135086 PMCID: PMC9497598 DOI: 10.3390/curroncol29090517] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/31/2022] [Revised: 09/13/2022] [Accepted: 09/14/2022] [Indexed: 11/25/2022] Open
Abstract
Background: Colon adenocarcinoma (COAD) is the most common subtype of colon cancer, and cuproptosis is a recently defined form of cell death that plays an important role in the development of several malignant cancers. However, studies of cuproptosis-related lncRNAs (CRLs) involved in regulating colon adenocarcinoma are limited. The purpose of this study is to develop a new prognostic CRLs signature of colon adenocarcinoma and explore its underlying biological mechanism. Methods: In this study, we downloaded RNA-seq profiles, clinical data and tumor mutational burden (TMB) data from the TCGA database, identified cuproptosis-associated lncRNAs using univariate Cox, lasso regression analysis and multivariate Cox analysis, and constructed a prognostic model with risk score based on these lncRNAs. COAD patients were divided into high- and low-risk subgroups based on the risk score. Cox regression was also used to test whether they were independent prognostic factors. The accuracy of this prognostic model was further validated by receiver operating characteristic curve (ROC), C-index and nomogram. In addition, the lncRNA/miRNA/mRNA competing endogenous RNA (ceRNA) network and protein−protein interaction (PPI) network were constructed based on the weighted gene co-expression network analysis (WGCNA). Results: We constructed a prognostic model based on 15 cuproptosis-associated lncRNAs. The validation results showed that the risk score of the model (HR = 1.003, 95% CI = 1.001−1.004; p < 0.001) could serve as an independent prognostic factor with accurate and credible predictive power. The risk score had the highest AUC (0.793) among various factors such as risk score, stage, gender and age, also indicating that the model we constructed to predict patient survival was better than other clinical characteristics.
Meanwhile, the possible biological mechanisms of colon adenocarcinoma were explored based on the lncRNA/miRNA/mRNA ceRNA network and PPI network constructed by WGCNA. Conclusion: The prognostic model based on 15 cuproptosis-related lncRNAs has accurate and reliable predictive power to effectively predict clinical outcomes in colon adenocarcinoma patients.
Affiliation(s)
| | - Jin Pei
- Correspondence: (F.W.); (J.P.)
| |
|
46
|
Gao W, Zhou L, Liu S, Guan Y, Gao H, Hu J. Machine learning algorithms for rapid estimation of holocellulose content of poplar clones based on Raman spectroscopy. Carbohydr Polym 2022; 292:119635. [DOI: 10.1016/j.carbpol.2022.119635] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/06/2022] [Revised: 05/08/2022] [Accepted: 05/16/2022] [Indexed: 11/02/2022]
|
47
|
Ouhourane M, Yang Y, Benedet AL, Oualkacha K. Group penalized quantile regression. STAT METHOD APPL-GER 2022. [DOI: 10.1007/s10260-021-00580-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
48
|
Belvederi Murri M, Cattelani L, Chesani F, Palumbo P, Triolo F, Alexopoulos GS. Risk Prediction Models for Depression in Community-Dwelling Older Adults. Am J Geriatr Psychiatry 2022; 30:949-960. [PMID: 35821215 DOI: 10.1016/j.jagp.2022.05.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/21/2022] [Revised: 05/26/2022] [Accepted: 05/30/2022] [Indexed: 12/23/2022]
Abstract
OBJECTIVE To develop streamlined Risk Prediction Models (Manto RPMs) for late-life depression. DESIGN Prospective study. SETTING The Survey of Health, Ageing and Retirement in Europe (SHARE) study. PARTICIPANTS Participants were community residing adults aged 55 years or older. MEASUREMENTS The outcome was presence of depression at a 2-year follow up evaluation. Risk factors were identified after a literature review of longitudinal studies. Separate RPMs were developed in the 29,116 participants who were not depressed at baseline and in the combined sample of 39,439 of non-depressed and depressed subjects. Models derived from the combined sample were used to develop a web-based risk calculator. RESULTS The authors identified 129 predictors of late-life depression after reviewing 227 studies. In non-depressed participants at baseline, the RPMs based on regression and Least Absolute Shrinkage and Selection Operator (LASSO) penalty (34 and 58 predictors, respectively) and the RPM based on Artificial Neural Networks (124 predictors) had a similar performance (AUC: 0.730-0.743). In the combined depressed and non-depressed participants at baseline, the RPM based on neural networks (35 predictors; AUC: 0.807; 95% CI: 0.80-0.82) and the model based on linear regression and LASSO penalty (32 predictors; AUC: 0.81; 95% CI: 0.79-0.82) had satisfactory accuracy. CONCLUSIONS The Manto RPMs can identify community-dwelling older individuals at risk for developing depression over 2 years. A web-based calculator based on the streamlined Manto model is freely available at https://manto.unife.it/ for use by individuals, clinicians, and policy makers and may be used to target prevention interventions at the individual and the population levels.
Affiliation(s)
- Martino Belvederi Murri
- Department of Neuroscience and Rehabilitation, Institute of Psychiatry, University of Ferrara (MBM), Ferrara, Italy
| | - Luca Cattelani
- Department of Computer Science and Engineering, University of Bologna (LC, FC), Bologna, Italy; Faculty of Medicine and Health Technologies, Tampere University (LC), Tampere, Finland; Institute of Biomedicine, University of Eastern Finland (LC), Kuopio, Finland
| | - Federico Chesani
- Department of Computer Science and Engineering, University of Bologna (LC, FC), Bologna, Italy
| | - Pierpaolo Palumbo
- Department of Electrical, Electronic and Information Engineering "Guglielmo Marconi", University of Bologna (PP), Bologna, Italy
| | - Federico Triolo
- Aging Research Center, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet (FT), Stockholm, Sweden
| | - George S Alexopoulos
- Weill Cornell Institute of Geriatric Psychiatry, Weill Cornell Medicine (GA), White Plains, NY.
| |
|
49
|
Wang K, Li X, Liu Y, Kang L. A communication-efficient method for generalized linear regression with ℓ 0 regularization. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2115072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Kunpeng Wang
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
| | - Xuerui Li
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
| | - Yanyan Liu
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
| | - Lican Kang
- Center for Quantitative Medicine Duke-NUS Medical School, Singapore, Singapore
| |
|
50
|
Li Y, Hsu W. A classification for complex imbalanced data in disease screening and early diagnosis. Stat Med 2022; 41:3679-3695. [PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Revised: 04/11/2022] [Accepted: 05/10/2022] [Indexed: 11/09/2022]
Abstract
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Typically, traditional classification methods often perform poorly when a severely skewed class distribution is observed, not to mention under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, it is expected that imbalanced classification in disease diagnosis may encounter an additional level of difficulty that is imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Technically, the functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with group LASSO penalty is then adopted into the classification procedure in high-dimensional settings. Along with a good improvement in imbalanced classification, our approach provides a meaningful feature selection for interpretation while enjoying a remarkably lower computational complexity. The proposed method is illustrated on the real data application of Alzheimer's disease early detection and its empirical performance in finite sample size is extensively evaluated by simulations.
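The feature-selection idea in this abstract, where each longitudinal variable contributes a block of functional principal component scores and a group LASSO penalty keeps or drops the block as a whole, can be illustrated with the block soft-thresholding step that many group LASSO solvers use. This is a generic sketch, not the authors' algorithm: the function name, the square-root-of-group-size weighting, and the toy coefficients are all assumptions.

```python
import math

def group_soft_threshold(coefs, groups, lam):
    """Proximal step for lam * sum_g sqrt(p_g) * ||beta_g||_2 (hypothetical helper).

    coefs  : list of coefficients
    groups : list of index lists, one per group (assumed non-overlapping)
    lam    : non-negative penalty strength
    """
    out = list(coefs)
    for idx in groups:
        norm = math.sqrt(sum(coefs[i] ** 2 for i in idx))
        # Weighting by sqrt(group size) is a common convention, assumed here.
        cutoff = lam * math.sqrt(len(idx))
        # A group whose norm falls below the cutoff is zeroed entirely;
        # otherwise every coefficient in it is shrunk by the same factor.
        scale = 0.0 if norm <= cutoff else 1.0 - cutoff / norm
        for i in idx:
            out[i] = scale * coefs[i]
    return out

# Toy example: two groups of two "FPC score" coefficients each.
toy_coefs = [3.0, 4.0, 0.1, -0.2]
toy_groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(toy_coefs, toy_groups, lam=1.0)
print(shrunk)
```

In the toy run the weak second group is removed completely while the first group is shrunk but retained, which mirrors how a group penalty selects whole longitudinal features rather than individual component scores.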
Affiliation(s)
- Yiming Li
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
| | - Wei‐Wen Hsu
- Division of Biostatistics and Bioinformatics, Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, USA
| | | |
|