1
Zhang M, Zheng Y, Maidaiti X, Liang B, Wei Y, Sun F. Integrating Machine Learning into Statistical Methods in Disease Risk Prediction Modeling: A Systematic Review. Health Data Science 2024; 4:0165. [PMID: 39050273 PMCID: PMC11266123 DOI: 10.34133/hds.0165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 06/20/2024] [Indexed: 07/27/2024]
Abstract
Background: Disease prediction models often use statistical methods or machine learning, each suited to its own application scenarios, which raises the risk of errors when either is used alone. Integrating machine learning into statistical methods may yield robust prediction models. This systematic review aims to comprehensively assess the current development of integrated disease prediction models worldwide. Methods: The PubMed, Embase, Web of Science, CNKI, VIP, WanFang, and SinoMed databases were searched for studies on prediction models integrating machine learning into statistical methods, from database inception to May 1, 2023. Information including basic characteristics of studies, integrating approaches, application scenarios, modeling details, and model performance was extracted. Results: A total of 20 eligible studies in English and 1 in Chinese were included. Five studies concentrated on diagnostic models, while 16 studies concentrated on predicting disease occurrence or prognosis. Integrating strategies for classification models included majority voting, weighted voting, stacking, and model selection (when statistical methods and machine learning disagreed). Regression models adopted strategies including simple statistics, weighted statistics, and stacking. Integration models achieved an AUROC above 0.75 and performed better than statistical methods and machine learning in most studies. Stacking was used in situations with >100 predictors and needed a relatively larger amount of training data. Conclusion: Research on integrating machine learning into statistical methods in prediction models remains limited, but some studies have shown great potential for integration models to outperform single models. This study provides insights for the selection of integration methods for different scenarios. Future research could emphasize the improvement and validation of integrating strategies.
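As a rough illustration of two of the classification strategies named in the abstract, majority voting and weighted voting, the following Python sketch combines predictions from one statistical model and two machine learning models. The function names, the example probabilities, and the weights (imagined here as validation AUROCs) are all hypothetical and are not taken from any reviewed study.

```python
from typing import Sequence, Tuple


def majority_vote(labels: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the models predict 1, else 0."""
    return 1 if sum(labels) * 2 > len(labels) else 0


def weighted_vote(
    probs: Sequence[float],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Combine predicted probabilities with a weighted average.

    Returns the combined probability and the label obtained by
    comparing it with `threshold`.
    """
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = sum(p * w for p, w in zip(probs, weights)) / total
    return combined, int(combined >= threshold)


# Hypothetical predicted risks for one patient from three models:
# logistic regression, random forest, gradient boosting.
example_probs = [0.40, 0.70, 0.65]
# Hypothetical weights, e.g. each model's AUROC on a validation set.
example_weights = [0.70, 0.80, 0.75]

# Majority voting works on thresholded labels.
example_labels = [int(p >= 0.5) for p in example_probs]
majority_label = majority_vote(example_labels)

# Weighted voting works on the probabilities themselves.
combined_prob, weighted_label = weighted_vote(example_probs, example_weights)
```

Stacking differs from both in that a second-level model is trained on the base models' out-of-fold predictions rather than applying a fixed rule, which is consistent with the abstract's remark that it needed more training data.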
Affiliation(s)
- Meng Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
  - Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
- Yongqi Zheng
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
  - Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
- Baosheng Liang
  - Department of Biostatistics, School of Public Health, Peking University, Beijing, China
- Yongyue Wei
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
  - Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
- Feng Sun
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
  - Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, China
2
Su L, Hounye AH, Pan Q, Miao K, Wang J, Hou M, Xiong L. Explainable cancer factors discovery: Shapley additive explanation for machine learning models demonstrates the best practices in the case of pancreatic cancer. Pancreatology 2024; 24:404-423. [PMID: 38342661 DOI: 10.1016/j.pan.2024.02.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 01/07/2024] [Accepted: 02/05/2024] [Indexed: 02/13/2024]
Abstract
Pancreatic cancer is one of the digestive tract cancers with a high mortality rate. Despite the wide range of available treatments and improvements in surgery, chemotherapy, and radiation therapy, the five-year prognosis for individuals diagnosed with pancreatic cancer remains poor. Whether immunotherapy can be used to treat pancreatic cancer still requires research. The goals of our research were to understand the tumor microenvironment of pancreatic cancer, find a useful biomarker to assess patient prognosis, and investigate its biological relevance. In this paper, machine learning methods such as random forest were combined with weighted gene co-expression networks to screen hub immune-related genes (hub-IRGs). A LASSO regression model was then applied to refine the selection, yielding eight hub-IRGs. Based on the hub-IRGs, we created a prognosis risk prediction model for pancreatic adenocarcinoma (PAAD) that can stratify patients accurately and produce a prognostic risk score (IRG_Score) for each patient. In the raw data set and the validation data set, the five-year area under the curve (AUC) for this model was 0.9 and 0.7, respectively. Shapley additive explanation (SHAP) was then used to rank the importance of the factors influencing prognostic risk prediction from a machine learning perspective and to identify the most influential genes or clinical factors. The five most important factors were TRIM67, CORT, PSPN, SCAMP5, and RFXAP, all of which are genes. In summary, the eight hub-IRGs had accurate risk prediction performance and biological significance, which was validated in other cancers. The SHAP results helped clarify the molecular mechanism of pancreatic cancer.
Affiliation(s)
- Liuyan Su
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Qi Pan
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Kexin Miao
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Jiaoju Wang
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Muzhou Hou
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Li Xiong
  - Department of General Surgery, The Second Xiangya Hospital, Central South University, Changsha, 410011, China
  - Hunan Clinical Research Center for Intelligent General Surgery, Changsha, 410011, China
3
Çalışkan M, Tazaki K. AI/ML advances in non-small cell lung cancer biomarker discovery. Front Oncol 2023; 13:1260374. [PMID: 38148837 PMCID: PMC10750392 DOI: 10.3389/fonc.2023.1260374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Accepted: 11/16/2023] [Indexed: 12/28/2023]
Abstract
Lung cancer is the leading cause of cancer deaths among both men and women, representing approximately 25% of cancer fatalities each year. The treatment landscape for non-small cell lung cancer (NSCLC) is rapidly evolving due to the progress made in biomarker-driven targeted therapies. While advancements in targeted treatments have improved survival rates for NSCLC patients with actionable biomarkers, long-term survival remains low, with an overall 5-year relative survival rate below 20%. Artificial intelligence/machine learning (AI/ML) algorithms have shown promise in biomarker discovery, yet NSCLC-specific studies capturing the clinical challenges targeted and emerging patterns identified using AI/ML approaches are lacking. Here, we employed a text-mining approach and identified 215 studies that reported potential biomarkers of NSCLC using AI/ML algorithms. We catalogued these studies with respect to BEST (Biomarkers, EndpointS, and other Tools) biomarker sub-types and summarized emerging patterns and trends in AI/ML-driven NSCLC biomarker discovery. We anticipate that our comprehensive review will contribute to the current understanding of AI/ML advances in NSCLC biomarker research and provide an important catalogue that may facilitate clinical adoption of AI/ML-derived biomarkers.
Affiliation(s)
- Minal Çalışkan
  - Translational Science Department, Precision Medicine Function, Daiichi Sankyo, Inc., Basking Ridge, NJ, United States
- Koichi Tazaki
  - Translational Science Department I, Precision Medicine Function, Daiichi Sankyo, Tokyo, Japan
4
Dessie EY, Gautam Y, Ding L, Altaye M, Beyene J, Mersha TB. Development and validation of asthma risk prediction models using co-expression gene modules and machine learning methods. Sci Rep 2023; 13:11279. [PMID: 37438356 DOI: 10.1038/s41598-023-35866-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/24/2022] [Accepted: 05/25/2023] [Indexed: 07/14/2023]
Abstract
Asthma is a heterogeneous respiratory disease characterized by airway inflammation and obstruction. Despite recent advances, the genetic regulation of asthma pathogenesis is still largely unknown. Gene expression profiling techniques are well suited to studying complex diseases, including asthma. In this study, differentially expressed gene (DEG) analysis followed by weighted gene co-expression network analysis (WGCNA) and machine learning techniques were applied to datasets generated from airway epithelial cells (AECs) and nasal epithelial cells (NECs) to identify candidate genes and pathways and to develop asthma classification and predictive models. The models were validated using bronchial epithelial cells (BECs), airway smooth muscle (ASM), and whole blood (WB) datasets. DEG analysis and WGCNA followed by the least absolute shrinkage and selection operator (LASSO) method identified signatures of 30 and 34 genes, and these gene signatures, combined with a support vector machine (SVM), discriminated asthmatic subjects from controls in AECs (area under the curve, AUC = 1) and NECs (AUC = 1), respectively. We further validated the AEC-derived gene signature in BECs (AUC = 0.72), ASM (AUC = 0.74), and WB (AUC = 0.66). Similarly, the NEC-derived gene signature was validated in BECs (AUC = 0.75), ASM (AUC = 0.82), and WB (AUC = 0.69). Both the AEC-based and NEC-based gene signatures showed strong diagnostic performance with high sensitivity and specificity. Functional annotation showed that the gene signatures from AECs and NECs were enriched in pathways associated with IL-13, PI3K/AKT, and apoptosis signaling. Several asthma-related genes were prioritized, including SERPINB2 and CTSC, which showed functional relevance in multiple tissue/cell types and are related to asthma pathogenesis. Taken together, an epithelium gene signature-based model could serve as a robust surrogate model for hard-to-get tissues, including BECs, to improve understanding of the molecular etiology of asthma.
Affiliation(s)
- Eskezeia Y Dessie
  - Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
- Yadu Gautam
  - Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
- Lili Ding
  - Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
- Mekibib Altaye
  - Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
- Joseph Beyene
  - Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Canada
- Tesfaye B Mersha
  - Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, OH, USA
5
Du X, Zhao Y. Multimodal adversarial representation learning for breast cancer prognosis prediction. Comput Biol Med 2023; 157:106765. [PMID: 36963355 DOI: 10.1016/j.compbiomed.2023.106765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Revised: 02/27/2023] [Accepted: 03/07/2023] [Indexed: 03/17/2023]
Abstract
With the increasing incidence of breast cancer, accurate prognosis prediction for breast cancer patients is a key issue in current cancer research, and it is also of great significance for patients' psychological rehabilitation and for assisting clinical decision-making. Many studies that integrate data from heterogeneous modalities, such as gene expression profiles, clinical data, and copy number alteration, have achieved greater success in prognostic prediction than those using only one modality. However, many existing approaches fail to substantially reduce the modality gap by aligning multimodal distributions. Therefore, it is crucial to develop a method that fully considers a modality-invariant embedding space to effectively integrate multimodal data. In this study, we propose a multimodal data adversarial representation framework (MDAR) that reduces modal heterogeneity by translating source modalities into distributions for the target modality. Additionally, we apply reconstruction and classification losses to the embedding space to further constrain it. We then design a multi-scale bilinear convolutional neural network (MS-B-CNN) for each single modality to improve its feature expression ability. In addition, the embedding space generates predictions that serve as stacked feature inputs to an extremely randomized trees classifier. With 10-fold cross-validation, our results show that the proposed adversarial representation learning improves prognostic performance. A comparative study of this method and other existing methods on the METABRIC dataset (1980 patients) showed that the Matthews correlation coefficient (MCC) was significantly enhanced by 7.4% in the prognosis prediction of breast cancer patients.
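For reference, the Matthews correlation coefficient reported above is computed from the four cells of a binary confusion matrix. The sketch below is a minimal Python version of that standard formula; the function name `mcc` and the example counts are illustrative assumptions, not values from the METABRIC experiments.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

    By the usual convention, 0.0 is returned when the denominator
    is zero (a whole row or column of the matrix is empty).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a binary prognosis classifier.
example_mcc = mcc(tp=40, tn=30, fp=10, fn=20)
```

Because the coefficient uses all four cells, it stays informative on imbalanced outcome classes, where accuracy alone can look good for a classifier that mostly predicts the majority class.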
Affiliation(s)
- Xiuquan Du
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei, China
  - School of Computer Science and Technology, Anhui University, Hefei, China
- Yuefan Zhao
  - School of Computer Science and Technology, Anhui University, Hefei, China