1
Li S, Yi H, Leng Q, Wu Y, Mao Y. New perspectives on cancer clinical research in the era of big data and machine learning. Surg Oncol 2024; 52:102009. [PMID: 38215544 DOI: 10.1016/j.suronc.2023.102009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 10/16/2023] [Indexed: 01/14/2024]
Abstract
In the 21st century, the development of medical science has entered the era of big data, and machine learning has become an essential tool for mining medical big data. The establishment of the SEER database has provided a wealth of epidemiological data for cancer clinical research, and the number of studies based on SEER and machine learning has been growing in recent years. This article reviews recent research based on SEER and machine learning and finds that the current focus of such studies is primarily on the development and validation of models using machine learning algorithms, with the main directions being lymph node metastasis prediction, distant metastasis prediction, and prognosis-related research. Compared to traditional models, machine learning algorithms have the advantage of stronger adaptability, but also suffer from disadvantages such as overfitting and poor interpretability, which need to be weighed in practical applications. At present, machine learning algorithms, as the foundation of artificial intelligence, have just begun to emerge in the field of cancer clinical research. The future development of oncology will enter a more precise era of cancer research, characterized by larger data, higher dimensions, and more frequent information exchange. Machine learning is bound to shine brightly in this field.
Affiliation(s)
- Shujun Li
  - Department of Hematology, Xiangya Hospital, Central South University, Changsha, 410008, China; National Clinical Research Center for Geriatric Diseases (Xiangya Hospital), China; Hunan Hematology Oncology Clinical Medical Research Center, China
- Hang Yi
  - Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Qihao Leng
  - Xiangya School of Medicine, Central South University, Changsha, 410013, Hunan Province, China
- You Wu
  - Institute for Hospital Management, School of Medicine, Tsinghua University, 30 Shuangqing Rd, Haidian District, Beijing, China; Department of Health Policy and Management, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, 21205, USA
- Yousheng Mao
  - Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
2
Wu R, Luo J, Wan H, Zhang H, Yuan Y, Hu H, Feng J, Wen J, Wang Y, Li J, Liang Q, Gan F, Zhang G. Evaluation of machine learning algorithms for the prognosis of breast cancer from the Surveillance, Epidemiology, and End Results database. PLoS One 2023; 18:e0280340. [PMID: 36701415 PMCID: PMC9879508 DOI: 10.1371/journal.pone.0280340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Accepted: 12/26/2022] [Indexed: 01/27/2023]
Abstract
INTRODUCTION Many researchers have used machine learning (ML) to predict the prognosis of breast cancer (BC) patients and noticed that ML models had good individualized prediction performance. OBJECTIVE This cohort study was intended to establish a reliable data analysis model by comparing the performance of 10 common ML algorithms and the traditional American Joint Committee on Cancer (AJCC) stage, and to use this model in Web application development to provide a good individualized prediction for others. METHODS This study included 63145 BC patients from the Surveillance, Epidemiology, and End Results database. RESULTS Comparing the performance of the 10 ML algorithms and the 7th AJCC stage in the optimal test set, we found that in terms of 5-year overall survival, multivariate adaptive regression splines (MARS) had the highest area under the curve (AUC) value (0.831) and F1-score (0.608), and both sensitivity (0.737) and specificity (0.772) were relatively high. Moreover, MARS showed the highest AUC value (0.831, 95% confidence interval: 0.820-0.842) in comparison to the other ML algorithms and the 7th AJCC stage (all P < 0.05). MARS, the best performing model, was selected for web application development (https://w12251393.shinyapps.io/app2/). CONCLUSIONS This comparative study of multiple forecasting models on a large dataset found that the MARS-based model achieved much better performance than the other ML algorithms and the 7th AJCC stage in individualized estimation of survival of BC patients, which is very likely to be the next step towards precision medicine.
Affiliation(s)
- Ruiyang Wu
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Jing Luo
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Hangyu Wan
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Haiyan Zhang
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Yewei Yuan
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Huihua Hu
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Jinyan Feng
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Jing Wen
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Yan Wang
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Junyan Li
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Qi Liang
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Fengjiao Gan
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
- Gang Zhang
  - Department of Breast and Thyroid Surgery, Sichuan Provincial Hospital for Women and Children (Affiliated Women and Children’s Hospital of Chengdu Medical College), Chengdu, China
3
Xiao J, Mo M, Wang Z, Zhou C, Shen J, Yuan J, He Y, Zheng Y. Machine Learning Models for the Prediction of Breast Cancer Prognostic: Application and Comparison Based on a Retrospective Cohort Study (Preprint). JMIR Med Inform 2021; 10:e33440. [PMID: 35179504 PMCID: PMC8900909 DOI: 10.2196/33440] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/08/2021] [Revised: 12/15/2021] [Accepted: 01/02/2022] [Indexed: 11/17/2022]
Abstract
Background Over the recent years, machine learning methods have been increasingly explored in cancer prognosis because of the appearance of improved machine learning algorithms. These algorithms can use censored data for modeling, such as support vector machines for survival analysis and random survival forest (RSF). However, it is still debated whether traditional (Cox proportional hazard regression) or machine learning-based prognostic models have better predictive performance. Objective This study aimed to compare the performance of breast cancer prognostic prediction models based on machine learning and Cox regression. Methods This retrospective cohort study included all patients diagnosed with breast cancer and subsequently hospitalized in Fudan University Shanghai Cancer Center between January 1, 2008, and December 31, 2016. After all exclusions, a total of 22,176 cases with 21 features were eligible for model development. The data set was randomly split into a training set (15,523 cases, 70%) and a test set (6653 cases, 30%) for developing 4 models and predicting the overall survival of patients diagnosed with breast cancer. The discriminative ability of models was evaluated by the concordance index (C-index), the time-dependent area under the curve, and D-index; the calibration ability of models was evaluated by the Brier score. Results The RSF model revealed the best discriminative performance among the 4 models with 3-year, 5-year, and 10-year time-dependent area under the curve of 0.857, 0.838, and 0.781, a D-index of 7.643 (95% CI 6.542, 8.930) and a C-index of 0.827 (95% CI 0.809, 0.845). The statistical difference of the C-index was tested, and the RSF model significantly outperformed the Cox-EN (elastic net) model (C-index 0.816, 95% CI 0.796, 0.836; P=.01), the Cox model (C-index 0.814, 95% CI 0.794, 0.835; P=.003), and the support vector machine model (C-index 0.812, 95% CI 0.793, 0.832; P<.001). 
The 4 models’ 3-year, 5-year, and 10-year Brier scores were very close, ranging from 0.027 to 0.094 and less than 0.1, which meant all models had good calibration. In the context of feature importance, elastic net and RSF both indicated that TNM staging, neoadjuvant therapy, number of lymph node metastases, age, and tumor diameter were the top 5 important features for predicting the prognosis of breast cancer. A final online tool was developed to predict the overall survival of patients with breast cancer. Conclusions The RSF model slightly outperformed the other models on discriminative ability, revealing the potential of the RSF method as an effective approach to building prognostic prediction models in the context of survival analysis.
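The concordance index reported in the abstract above can be illustrated with a minimal sketch of Harrell's C-index for right-censored data. This is not the authors' code: the function name, the toy follow-up times, event flags, and risk scores are all assumptions made for illustration, and tied follow-up times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair is comparable when the subject with the shorter follow-up
    time experienced the event. A comparable pair is concordant when
    that subject also has the higher predicted risk; tied risk scores
    count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored, so the order is unknown
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = death,
# 0 = censored), and a model's predicted risk score.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.2, 0.6]

c_index = concordance_index(times, events, risk)
print(f"C-index: {c_index:.2f}")
```

In this toy cohort four pairs are comparable and three are concordant. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.812 to 0.827 should be read.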
Affiliation(s)
- Jialong Xiao
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Miao Mo
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Zezhou Wang
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Changming Zhou
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Jie Shen
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Jing Yuan
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Yulian He
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
- Ying Zheng
  - Department of Cancer Prevention, Fudan University Shanghai Cancer Center, Shanghai, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
  - Shanghai Engineering Research Center of Artificial Intelligence Technology for Tumor Diseases, Shanghai, China
4
Li J, Zhou Z, Dong J, Fu Y, Li Y, Luan Z, Peng X. Predicting breast cancer 5-year survival using machine learning: A systematic review. PLoS One 2021; 16:e0250370. [PMID: 33861809 PMCID: PMC8051758 DOI: 10.1371/journal.pone.0250370] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 01/21/2021] [Accepted: 04/06/2021] [Indexed: 12/24/2022]
Abstract
BACKGROUND Accurately predicting the survival rate of breast cancer patients is a major issue for cancer researchers. Machine learning (ML) has attracted much attention with the hope that it could provide accurate results, but its modeling methods and prediction performance remain controversial. The aim of this systematic review is to identify and critically appraise current studies regarding the application of ML in predicting the 5-year survival rate of breast cancer. METHODS In accordance with the PRISMA guidelines, two researchers independently searched the PubMed (including MEDLINE), Embase, and Web of Science Core databases from inception to November 30, 2020. The search terms included breast neoplasms, survival, machine learning, and specific algorithm names. Included studies used ML to build a breast cancer survival prediction model and reported model performance that could be measured from validation results. Studies in which the modeling process was not explained clearly or that had incomplete information were excluded. The extracted information included literature information, database information, data preparation and modeling process information, model construction and performance evaluation information, and candidate predictor information. RESULTS Thirty-one studies that met the inclusion criteria were included, most of which were published after 2013. The most frequently used ML methods were decision trees (19 studies, 61.3%), artificial neural networks (18 studies, 58.1%), support vector machines (16 studies, 51.6%), and ensemble learning (10 studies, 32.3%). The median sample size was 37256 (range 200 to 659820) patients, and the median number of predictors was 16 (range 3 to 625). The accuracy of 29 studies ranged from 0.510 to 0.971. The sensitivity of 25 studies ranged from 0.037 to 1. The specificity of 24 studies ranged from 0.008 to 0.993. The AUC of 20 studies ranged from 0.500 to 0.972. 
The precision of 6 studies ranged from 0.549 to 1. All of the models were internally validated, and only one was externally validated. CONCLUSIONS Overall, compared with traditional statistical methods, the performance of ML models does not necessarily show any improvement, and this area of research still faces limitations related to a lack of data preprocessing steps, the excessive differences of sample feature selection, and issues related to validation. Further optimization of the performance of the proposed model is also needed in the future, which requires more standardization and subsequent validation.
Affiliation(s)
- Jiaxin Li
  - School of Nursing, Jilin University, Jilin, China
- Zijun Zhou
  - Breast Surgery, Jilin Province Tumor Hospital, Jilin, China
- Jianyu Dong
  - School of Nursing, Jilin University, Jilin, China
- Ying Fu
  - School of Nursing, Jilin University, Jilin, China
- Yuan Li
  - School of Nursing, Jilin University, Jilin, China
- Ze Luan
  - School of Nursing, Jilin University, Jilin, China
- Xin Peng
  - School of Nursing, Jilin University, Jilin, China
5
Lotfnezhad Afshar H, Jabbari N, Khalkhali HR, Esnaashari O. Prediction of Breast Cancer Survival by Machine Learning Methods: An Application of Multiple Imputation. Iranian Journal of Public Health 2021; 50:598-605. [PMID: 34178808 PMCID: PMC8214598 DOI: 10.18502/ijph.v50i3.5606] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Background: Breast cancer survival rates in less developed countries are critically low. Machine learning techniques can predict cancer survival with high accuracy, but missing data are the most important limitation to using these techniques to their full potential. Multiple imputation (MI) was implemented and analyzed in detail to impute the missing data of a breast cancer dataset. Methods: The dataset came from the Omid Treatment and Research Center, Urmia, Iran, covered January 2006 to December 2012, and contained information on 856 women. Algorithms such as C5 and repeated incremental pruning to produce error reduction were applied to the imputed versions of the original dataset and to the non-imputed dataset to predict survival and to extract clinical rules, respectively. Results: The findings showed that the performance of C5 improved after imputation on all evaluation criteria, including accuracy (84.42%), sensitivity (92.21%), specificity (64%), Kappa statistic (59.06%), and the area under the receiver operator characteristic (ROC) curve (0.84). Conclusion: The dataset of the present study met the requirements for using the multiple imputation method. The rules extracted after the application of MI were more comprehensive and contained more clinical knowledge. However, the clinical value of the extracted rules did not noticeably increase after the missing data were filled in.
Affiliation(s)
- Hadi Lotfnezhad Afshar
  - Department of Health Information Technology, School of Paramedical, Urmia University of Medical Sciences, Urmia, Iran
- Nasrollah Jabbari
  - Department of Medical Physics, Solid Tumor Research Center, School of Paramedical, Urmia University of Medical Sciences, Urmia, Iran
- Hamid Reza Khalkhali
  - Department of Biostatistics and Epidemiology, Patient Safety Research Center, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
6
Yang L, Liu Q, Zhao Q, Zhu X, Wang L. Machine learning is a valid method for predicting prehospital delay after acute ischemic stroke. Brain Behav 2020; 10:e01794. [PMID: 32812396 PMCID: PMC7559608 DOI: 10.1002/brb3.1794] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/10/2020] [Revised: 07/15/2020] [Accepted: 07/20/2020] [Indexed: 12/27/2022]
Abstract
OBJECTIVES This study aimed to identify the influencing factors associated with long onset-to-door time and establish predictive models that could help to assess the probability of prehospital delay in populations with a high risk for stroke. MATERIALS AND METHODS Patients who were diagnosed with acute ischemic stroke (AIS) and hospitalized between 1 November 2018 and 31 July 2019 were interviewed, and their medical records were extracted for data analysis. Two machine learning algorithms (support vector machine and Bayesian network) were applied in this study, and their predictive performance was compared with that of the classical logistic regression models after using several variable selection methods. Timely admission (onset-to-door time < 3 hr) and prehospital delay (onset-to-door time ≥ 3 hr) were the outcome variables. We computed the area under curve (AUC) and the difference in the mean AUC values between the models. RESULTS A total of 450 patients with AIS were enrolled; 57 (12.7%) with timely admission and 393 (87.3%) patients with prehospital delay. All models, both those constructed by logistic regression and those by machine learning, performed well in predicting prehospital delay (range mean AUC: 0.800-0.846). The difference in the mean AUC values between the best performing machine learning model and the best performing logistic regression model was negligible (0.014; 95% CI: 0.013-0.015). CONCLUSIONS Machine learning algorithms were not inferior to logistic regression models for prediction of prehospital delay after stroke. All models provided good discrimination, thereby creating valuable diagnostic programs for prehospital delay prediction.
Affiliation(s)
- Li Yang
  - School of Nursing, Qingdao University, Qingdao, China
- Qinqin Liu
  - School of Nursing, The Second Affiliated Hospital of Harbin Medical University, Harbin Medical University, Harbin, China
- Qiuli Zhao
  - School of Nursing, The Second Affiliated Hospital of Harbin Medical University, Harbin Medical University, Harbin, China
- Xuemei Zhu
  - School of Nursing, The Second Affiliated Hospital of Harbin Medical University, Harbin Medical University, Harbin, China
- Ling Wang
  - School of Nursing, The Second Affiliated Hospital of Harbin Medical University, Harbin Medical University, Harbin, China
7
Iraji Z, Jafari Koshki T, Dolatkhah R, Asghari Jafarabadi M. Parametric survival model to identify the predictors of breast cancer mortality: An accelerated failure time approach. Journal of Research in Medical Sciences 2020; 25:38. [PMID: 32582344 PMCID: PMC7306232 DOI: 10.4103/jrms.jrms_743_19] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/03/2019] [Revised: 12/03/2019] [Accepted: 01/11/2020] [Indexed: 01/04/2023]
Abstract
Background: Breast cancer (BC) was the fifth cause of mortality worldwide in 2015 and the second cause of mortality in Iran in 2012. This study aimed to explore factors associated with survival of patients with BC using parametric survival models. Materials and Methods: Data were obtained for 1154 patients diagnosed with BC and recorded in the East Azerbaijan population-based cancer registry database between March 2007 and March 2016. A parametric survival model with an accelerated failure time (AFT) approach was used to assess the association of sex, age, grade, and morphology with time to death. Results: A total of 217 (18.8%) individuals had died of BC by the end of the study. Among the fitted parametric survival models, including exponential, Weibull, log-logistic, and log-normal models, the log-normal model was the best, with an Akaike information criterion of 1441.47 and a Bayesian information criterion of 1486.93. In this model, patients of higher age (time ratio [TR] = 0.693; 95% confidence interval [CI] = [0.531, 0.904]) and higher grade (TR = 0.350; 95% CI = [0.201, 0.608]) had significantly lower survival, while patients with the lobular carcinoma type of morphology (TR = 1.975; 95% CI = [1.049, 3.720]) had significantly higher survival. Conclusion: The log-normal model proved to be an optimal tool for modeling the survival of patients with BC in the current study. Age, grade, and morphology showed a significant association with time to death in patients with BC under the AFT model. This finding could inform planning and health policymaking for patients with BC. However, the impact of the models used for analysis on the significance and magnitude of the estimated effects should be acknowledged.
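The time ratios in the abstract above can be read through the standard accelerated failure time relationship, in which a covariate with coefficient beta multiplies survival time by TR = exp(beta). The sketch below only illustrates that interpretation and is not the authors' analysis: the helper names are assumptions, and the 60-month baseline median is a hypothetical value chosen for the example.

```python
import math


def aft_coefficient(time_ratio):
    """Return the AFT regression coefficient beta for a time ratio exp(beta)."""
    return math.log(time_ratio)


def scaled_survival_time(baseline_time, time_ratio):
    """Scale a baseline survival time (e.g. the median) by a time ratio."""
    return baseline_time * time_ratio


# Time ratios as reported in the abstract.
reported_time_ratios = {
    "higher age": 0.693,
    "higher grade": 0.350,
    "lobular carcinoma morphology": 1.975,
}

# Hypothetical baseline median survival, for illustration only.
BASELINE_MEDIAN_MONTHS = 60.0

for covariate, tr in reported_time_ratios.items():
    beta = aft_coefficient(tr)
    median = scaled_survival_time(BASELINE_MEDIAN_MONTHS, tr)
    print(f"{covariate}: beta = {beta:+.3f}, scaled median = {median:.1f} months")
```

A time ratio below 1 shortens survival time relative to the reference group (negative beta), while a ratio above 1 lengthens it, which matches the direction of the effects described for age, grade, and lobular morphology.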
Affiliation(s)
- Zeinab Iraji
  - Department of Statistics and Epidemiology, Faculty of Health, Tabriz University of Medical Sciences, Tabriz, Iran
- Tohid Jafari Koshki
  - Department of Statistics and Epidemiology, Faculty of Health, Tabriz University of Medical Sciences, Tabriz, Iran
- Roya Dolatkhah
  - Hematology and Oncology Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Asghari Jafarabadi
  - Department of Statistics and Epidemiology, Faculty of Health, Tabriz University of Medical Sciences, Tabriz, Iran
  - Road Traffic Injury Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
8
Chlioui I, Idri A, Abnane I. Data preprocessing in knowledge discovery in breast cancer: systematic mapping study. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 2020. [DOI: 10.1080/21681163.2020.1730974] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/29/2022]
Affiliation(s)
- Imane Chlioui
  - Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
- Ali Idri
  - Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
  - Complex Systems Engineering and Human Systems, University Mohammed VI Polytechnic, Ben Guerir, Morocco
- Ibtissam Abnane
  - Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
9
Moreau JT, Hankinson TC, Baillet S, Dudley RWR. Individual-patient prediction of meningioma malignancy and survival using the Surveillance, Epidemiology, and End Results database. NPJ Digit Med 2020; 3:12. [PMID: 32025573 PMCID: PMC6992687 DOI: 10.1038/s41746-020-0219-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/07/2018] [Accepted: 01/10/2020] [Indexed: 01/17/2023]
Abstract
Meningiomas are known to have relatively lower aggressiveness and better outcomes than other central nervous system (CNS) tumors. However, there is considerable overlap between clinical and radiological features characterizing benign, atypical, and malignant tumors. In this study, we developed methods and a practical app designed to assist with the diagnosis and prognosis of meningiomas. Statistical learning models were trained and validated on 62,844 patients from the Surveillance, Epidemiology, and End Results database. We used balanced logistic regression-random forest ensemble classifiers and proportional hazards models to learn multivariate patterns of association between malignancy, survival, and a series of basic clinical variables, such as tumor size, location, and surgical procedure. We demonstrate that our models are capable of predicting meaningful individual-specific clinical outcome variables and show good generalizability across 16 SEER registries. A free smartphone and web application is provided for readers to access and test the predictive models (www.meningioma.app). Future model improvements and prospective replication will be necessary to demonstrate true clinical utility. Rather than being used in isolation, we expect that the proposed models will be integrated into larger and more comprehensive models that integrate imaging and molecular biomarkers. Whether for meningiomas or other tumors of the CNS, the power of these methods to make individual-patient predictions could lead to improved diagnosis, patient counseling, and outcomes.
Affiliation(s)
- Jeremy T. Moreau
  - McConnell Brain Imaging Centre, Department of Neurology and Neurosurgery, Montreal Neurological Institute, McGill University, Montreal, QC Canada
  - Department of Pediatric Surgery, Division of Neurosurgery, Montreal Children’s Hospital, Montreal, QC Canada
- Todd C. Hankinson
  - Department of Pediatric Neurosurgery, Children’s Hospital Colorado, University of Colorado Anschutz Medical Campus, Aurora, CO USA
  - Morgan Adams Foundation Pediatric Brain Tumor Research Program, Aurora, CO USA
- Sylvain Baillet
  - McConnell Brain Imaging Centre, Department of Neurology and Neurosurgery, Montreal Neurological Institute, McGill University, Montreal, QC Canada
- Roy W. R. Dudley
  - Department of Pediatric Surgery, Division of Neurosurgery, Montreal Children’s Hospital, Montreal, QC Canada
10
Xiang Y, Sun Y, Liu Y, Han B, Chen Q, Ye X, Zhu L, Gao W, Fang W. Development and validation of a predictive model for the diagnosis of solid solitary pulmonary nodules using data mining methods. J Thorac Dis 2019; 11:950-958. [PMID: 31019785 DOI: 10.21037/jtd.2019.01.90] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/06/2022]
Abstract
Background The purpose of this study is to develop a predictive model to accurately predict the malignancy of solid solitary pulmonary nodule (SPN) by data mining methods. Methods A training cohort of 388 consecutive patients with solid SPNs was used to develop a predictive model to evaluate the malignancy of solid SPNs. By using SPSS Modeler, we utilized logistic regression (LR), artificial neural network (ANN), k-nearest neighbor (KNN), random forest (RF), and support vector machines (SVM) classifiers to build predictive models. Another cohort of 200 consecutive patients with solid SPNs was used to verify the accuracy of the predictive model. Predictive performance was evaluated using the area under the receiver operating characteristic curve (AUC). Results There was no significant difference in patients' characteristics between the training cohort and the validation cohort. The AUCs of LR, ANN, KNN, RF, and SVM models for the validation cohort were 0.874±0.0280 (P=0.605), 0.833±0.0351 (P=0.104), 0.792±0.0418 (P=0.014), 0.775±0.0400 (P=0.013), and 0.890±0.0323 (reference), respectively. The SVM algorithm had the highest AUC, and the best sensitivity (90.3%), specificity (80.4%), positive predictive value (93.9%), negative predictive value (71.2%) and accuracy (88.0%) for the validation cohort among the five models. Conclusions Data mining by SVM might be a useful auxiliary algorithm in predicting malignancy of solid SPNs.
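The sensitivity, specificity, predictive values, and accuracy quoted for the SVM model all derive from a single confusion matrix. The abstract does not report the underlying counts, so the counts in the sketch below are an assumption: they were back-calculated as one set of integers that is consistent with the reported percentages for a 200-patient validation cohort, and the actual counts in the paper may differ.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# ASSUMED counts (not given in the abstract): 154 malignant and
# 46 benign nodules, 200 patients in total.
metrics = diagnostic_metrics(tp=139, fn=15, tn=37, fp=9)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

With these assumed counts, the negative predictive value is the weakest figure because benign nodules are the minority class, so a modest number of false negatives weighs heavily against the true negatives.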
Affiliation(s)
- Yangwei Xiang
  - Department of Thoracic Surgery, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Yifeng Sun
  - Department of Thoracic Surgery, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Yuan Liu
  - Department of Statistics Center, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Baohui Han
  - Department of Pulmonary Medicine, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Qunhui Chen
  - Department of Radiology, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Xiaodan Ye
  - Department of Radiology, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Li Zhu
  - Department of Radiology, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
- Wen Gao
  - Department of Thoracic Surgery, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
  - Department of Thoracic Surgery, Shanghai Huadong Hospital, Fudan University School of Medicine, Shanghai 200030, China
- Wentao Fang
  - Department of Thoracic Surgery, Shanghai Chest Hospital, Shanghai Jiaotong University, Shanghai 200030, China
11
Xiong CZ, Su M, Jiang Z, Jiang W. Prediction of Hemodialysis Timing Based on LVW Feature Selection and Ensemble Learning. J Med Syst 2018; 43:18. [PMID: 30547238 DOI: 10.1007/s10916-018-1136-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/04/2018] [Accepted: 12/03/2018] [Indexed: 11/30/2022]
Abstract
In this paper, we propose an improved model based on an LVW embedded feature extractor and ensemble learning to improve the prediction accuracy of hemodialysis timing. To address the drawbacks of feature extraction models, we adopt an enhanced LVW embedded model that searches for a feature subset using a stochastic strategy, which can find the feature combination most beneficial to learner performance. In the model application, we present an improved ensemble of learners for model fusion to reduce errors caused by the overfitting of a single classifier. We run several state-of-the-art Q&A methods as comparison experiments. The experimental results show that the ensemble learning model based on LVW has better generalization ability (97.04%) and lower standard error (± 0.04). We adopt the model to make high-precision predictions of hemodialysis timing, and the experimental results show that our framework significantly outperforms several strong baselines. Our model provides strong clinical decision support for physician diagnosis and has important clinical implications.
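The stochastic subset search described above can be sketched with the basic Las Vegas Wrapper (LVW) loop. This is a minimal illustration and not the authors' enhanced variant, whose details the abstract does not give. The leave-one-out 1-nearest-neighbour evaluator, the toy data, and the names `loo_error_1nn`, `lvw`, and `max_stall` are all assumptions.

```python
import random


def loo_error_1nn(X, y, features):
    """Leave-one-out error of a 1-nearest-neighbour classifier using only `features`."""
    errors = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        if best_label != y[i]:
            errors += 1
    return errors / len(X)


def lvw(X, y, n_features, max_stall=50, seed=0):
    """Basic LVW: sample random feature subsets and keep the best one seen.

    A candidate replaces the incumbent if its error is lower, or if the
    error is equal and it uses fewer features. The search stops after
    `max_stall` consecutive candidates that bring no improvement.
    """
    rng = random.Random(seed)
    best_subset = list(range(n_features))
    best_error = loo_error_1nn(X, y, best_subset)
    stall = 0
    while stall < max_stall:
        size = rng.randint(1, n_features)
        candidate = sorted(rng.sample(range(n_features), size))
        error = loo_error_1nn(X, y, candidate)
        smaller_tie = error == best_error and len(candidate) < len(best_subset)
        if error < best_error or smaller_tie:
            best_subset, best_error, stall = candidate, error, 0
        else:
            stall += 1
    return best_subset, best_error


# Toy data (hypothetical): feature 0 separates the classes, features 1-2 are noise.
data_rng = random.Random(42)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        X.append([label * 10 + data_rng.gauss(0, 1),
                  data_rng.uniform(-50, 50),
                  data_rng.uniform(-50, 50)])
        y.append(label)

baseline_error = loo_error_1nn(X, y, [0, 1, 2])
subset, error = lvw(X, y, n_features=3)
print("all features:", baseline_error, "| LVW subset:", subset, "error:", error)
```

Since every candidate requires retraining and evaluating the learner, the wrapper is expensive, and a stall limit such as `max_stall` is one common way to bound the search.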
Affiliation(s)
- Chang-Zhu Xiong
- Department of electronic information, Sichuan University, Chengdu, China
- Minglian Su
- West China School of clinical medicine, Sichuan University, Chengdu, China
- Zitao Jiang
- Department of electronic information, Sichuan University, Chengdu, China
- Wei Jiang
- Department of electronic information, Sichuan University, Chengdu, China
12
Shukla N, Hagenbuchner M, Win KT, Yang J. Breast cancer data analysis for survivability studies and prediction. Computer Methods and Programs in Biomedicine 2018; 155:199-208. [PMID: 29512500 DOI: 10.1016/j.cmpb.2017.12.011] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 06/14/2017] [Revised: 11/08/2017] [Accepted: 12/11/2017] [Indexed: 06/08/2023]
Abstract
BACKGROUND Breast cancer is the most common cancer affecting females worldwide. Breast cancer survivability prediction is a challenging and complex research task. Existing approaches use statistical methods or supervised machine learning to assess or predict the survival prospects of patients. OBJECTIVE The main objective of this paper is to develop a robust data analytical model that can assist in (i) better understanding breast cancer survivability in the presence of missing data, (ii) providing better insights into factors associated with patient survivability, and (iii) establishing cohorts of patients that share similar properties. METHODS Unsupervised data mining methods, namely the self-organising map (SOM) and density-based spatial clustering of applications with noise (DBSCAN), are used to create patient cohort clusters. These clusters, with their associated patterns, are used to train a multilayer perceptron (MLP) model for improved patient survivability analysis. A large dataset available from the SEER program is used in this study to identify patterns associated with the survivability of breast cancer patients. Information gain is computed for variable selection. All of these methods are data-driven and require little (if any) input from users or experts. RESULTS SOM consolidated patients into cohorts with similar properties. From these, DBSCAN identified and extracted nine cohorts (clusters). Patients in each of the nine clusters were found to have different survival times. Separating patients into clusters improved the overall survival prediction accuracy of the MLP and revealed intricate conditions that affect the accuracy of a prediction. CONCLUSIONS A new, entirely data-driven approach based on unsupervised learning methods improves understanding and helps identify patterns associated with patient survivability. The results of the analysis can be used to segment historical patient data into clusters or subsets that share common variable values and survivability. The survivability prediction accuracy of an MLP is improved by using the identified patient cohorts rather than raw historical data. Analysis of variable values in each cohort provides better insights into the survivability of a particular subgroup of breast cancer patients.
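The information gain used for variable selection can be illustrated with a short worked example. This is a sketch under stated assumptions, not the authors' pipeline: the variable names and the four toy records are hypothetical and are not drawn from SEER.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after partitioning records by a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records: outcome plus two candidate variables.
outcome = ["survived", "survived", "died", "died"]
stage = ["I", "I", "III", "III"]                   # perfectly informative here
laterality = ["left", "right", "left", "right"]    # uninformative here

gain_stage = information_gain(stage, outcome)
gain_laterality = information_gain(laterality, outcome)
print("stage:", gain_stage, "laterality:", gain_laterality)
```

Ranking variables by this score and keeping the highest-scoring ones is one simple selection rule; the abstract does not state which cutoff the authors used.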
Affiliation(s)
- Nagesh Shukla
- School of Systems, Management and Leadership, Faculty of Engineering and Information Technology, University of Technology Sydney, NSW 2007, Australia
- Markus Hagenbuchner
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2500, Australia
- Khin Than Win
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2500, Australia
- Jack Yang
- SMART Infrastructure Facility, Faculty of Engineering and Information Sciences, University of Wollongong, Wollongong, NSW 2500, Australia
13
Al-Turaiki I, Alshahrani M, Almutairi T. Building predictive models for MERS-CoV infections using data mining techniques. J Infect Public Health 2016; 9:744-748. [PMID: 27641481 PMCID: PMC7102847 DOI: 10.1016/j.jiph.2016.09.007] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 06/23/2016] [Revised: 07/20/2016] [Accepted: 09/06/2016] [Indexed: 11/30/2022] Open
Abstract
Background Recently, the outbreak of MERS-CoV infections drew worldwide attention to Saudi Arabia. The novel virus belongs to the coronavirus family, which is responsible for causing mild to moderate colds. The control and command center of the Saudi Ministry of Health issues a daily report on MERS-CoV infection cases. Infection with MERS-CoV can lead to fatal complications; however, little is known about this novel virus. In this paper, we apply two data mining techniques to better understand the stability of, and the possibility of recovery from, MERS-CoV infections. Method The Naive Bayes classifier and the J48 decision tree algorithm were used to build our models. The dataset consists of 1082 records of cases reported between 2013 and 2015. To build our prediction models, we split the dataset into two groups. The first group combined recovery and death records. A new attribute was created to indicate the record type, so that the dataset could be used to predict recovery from MERS-CoV. The second group contained the new case records, used to predict the stability of the infection based on the current status attribute. Results The resulting recovery models indicate that healthcare workers are more likely to survive. This could be due to the vaccinations that healthcare workers are required to get on a regular basis. As for the stability models using J48, two attributes were found to be important for predicting stability: symptomatic and age. Older patients are at high risk of developing MERS-CoV complications. Finally, the performance of all the models was evaluated using three measures: accuracy, precision, and recall. In general, the accuracy of the models is between 53.6% and 71.58%. Conclusion We believe that the performance of the prediction models can be enhanced with the use of more patient data. As future work, we plan to contact hospitals in Riyadh directly to collect more information about patients with MERS-CoV infections.
Affiliation(s)
- Isra Al-Turaiki
- Information Technology Department, College of Computer and Information Sciences, King Saud University, Saudi Arabia
- Mona Alshahrani
- Information Technology Department, College of Computer and Information Sciences, King Saud University, Saudi Arabia
- Tahani Almutairi
- Information Technology Department, College of Computer and Information Sciences, King Saud University, Saudi Arabia
14
Applying Data Mining Techniques to Extract Hidden Patterns about Breast Cancer Survival in an Iranian Cohort Study. J Res Health Sci 2015; 16:31-5. [PMID: 27061994 PMCID: PMC7189091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2016] [Revised: 02/18/2016] [Accepted: 03/14/2016] [Indexed: 11/03/2022] Open
Abstract
BACKGROUND Breast cancer survival has been analyzed by many standard data mining algorithms, a group of which belong to the decision tree category. The ability of decision tree algorithms to visualize and formulate hidden patterns among study variables was the main reason for applying, in the current study, a decision tree algorithm that had not been studied previously. METHODS The classification and regression trees (CART) algorithm was applied to a breast cancer database containing information on 569 patients from 2007-2010. The Gini impurity measure, used for categorical target variables, was utilized. The classification error, which is a function of tree size, was measured by 10-fold cross-validation experiments. The performance of the resulting model was evaluated by accuracy, sensitivity and specificity. RESULTS The CART model produced a decision tree with 17 nodes, 9 of which were associated with a set of rules. The rules were clinically meaningful. They showed, in if-then format, that Stage was the most important variable for predicting breast cancer survival. The accuracy, sensitivity and specificity were 80.3%, 93.5% and 53%, respectively. CONCLUSIONS The current study's model, the first one created with CART, was able to extract useful hidden rules from a relatively small dataset.
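The Gini impurity criterion that CART uses to choose a split can be sketched as follows. This is a minimal illustration of a single split on one ordinal variable, not the study's model. The helper names and the eight toy records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_split(values, labels):
    """Try each observed value except the largest as a threshold; return the best."""
    candidates = sorted(set(values))[:-1]
    scored = [(split_impurity(values, labels, t), t) for t in candidates]
    impurity, threshold = min(scored)
    return threshold, impurity


# Hypothetical toy records: tumour stage (1-4) and survival (1 = survived).
stage = [1, 1, 2, 2, 3, 3, 4, 4]
survived = [1, 1, 1, 0, 0, 0, 0, 0]

parent_impurity = gini(survived)
threshold, child_impurity = best_split(stage, survived)
print(f"parent Gini {parent_impurity}, best rule: stage <= {threshold}, "
      f"children Gini {child_impurity}")
```

A full CART implementation would repeat this search over every variable at every node and then prune the tree, for instance with the cross-validated error mentioned in the abstract.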