1
Li S, Yi H, Leng Q, Wu Y, Mao Y. New perspectives on cancer clinical research in the era of big data and machine learning. Surg Oncol 2024; 52:102009. [PMID: 38215544 DOI: 10.1016/j.suronc.2023.102009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 10/16/2023] [Indexed: 01/14/2024]
Abstract
In the 21st century, the development of medical science has entered the era of big data, and machine learning has become an essential tool for mining medical big data. The establishment of the SEER database has provided a wealth of epidemiological data for cancer clinical research, and the number of studies based on SEER and machine learning has been growing in recent years. This article reviews recent research based on SEER and machine learning and finds that the current focus of such studies is primarily on the development and validation of models using machine learning algorithms, with the main directions being lymph node metastasis prediction, distant metastasis prediction, and prognosis-related research. Compared to traditional models, machine learning algorithms have the advantage of stronger adaptability, but also suffer from disadvantages such as overfitting and poor interpretability, which need to be weighed in practical applications. At present, machine learning algorithms, as the foundation of artificial intelligence, have just begun to emerge in the field of cancer clinical research. The future development of oncology will enter a more precise era of cancer research, characterized by larger data, higher dimensions, and more frequent information exchange. Machine learning is bound to shine brightly in this field.
Affiliation(s)
- Shujun Li: Department of Hematology, Xiangya Hospital, Central South University, Changsha, 410008, China; National Clinical Research Center for Geriatric Diseases (Xiangya Hospital), China; Hunan Hematology Oncology Clinical Medical Research Center, China
- Hang Yi: Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Qihao Leng: Xiangya School of Medicine, Central South University, Changsha, 410013, Hunan Province, China
- You Wu: Institute for Hospital Management, School of Medicine, Tsinghua University, 30 Shuangqing Rd, Haidian District, Beijing, China; Department of Health Policy and Management, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, 21205, USA
- Yousheng Mao: Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
2
Zhai Y, Lin X, Wei Q, Pu Y, Pang Y. Interpretable prediction of cardiopulmonary complications after non-small cell lung cancer surgery based on machine learning and SHapley additive exPlanations. Heliyon 2023; 9:e17772. [PMID: 37483738 PMCID: PMC10359813 DOI: 10.1016/j.heliyon.2023.e17772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2023] [Revised: 06/26/2023] [Accepted: 06/27/2023] [Indexed: 07/25/2023] Open
Abstract
Introduction: Lung cancer is a prevalent malignancy globally, with approximately 20% of patients developing cardiopulmonary complications after lobectomy. Preventing these complications requires an accurate and personalized prediction method based on machine learning (ML). Methods: We retrospectively analyzed the medical records of patients who underwent lobectomy for non-small cell lung cancer (NSCLC) between 2017 and 2021. We used logistic regression, decision tree (DT), random forest (RF), gradient boosting DT, and eXtreme gradient boosting to build ML models. Ten-fold cross-validation was used to evaluate the models on several metrics, including accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC). We also calculated the kappa value of each model. Each model's hyperparameters were optimized by grid search, and an interpretability method was then used to explain the model's decisions. Results: The study included 718 eligible patients, among whom the incidence of postoperative cardiopulmonary complications was 20.89%. The RF model showed the best overall performance among all models; its ten-fold cross-validation accuracy, precision, recall, F1 score, and AUC were (point estimate and 95% confidence interval [CI]) 0.786 (0.738-0.834), 0.803 (0.735-0.872), 0.738 (0.678-0.797), 0.766 (0.714-0.818), and 0.856 (0.815-0.898), respectively. The kappa value of the RF model was 0.696 (0.617-0.768). The SHAP method showed that gender, age, and intraoperative blood loss were closely associated with postoperative cardiopulmonary complications. Conclusion: ML methods performed well in predicting postoperative cardiopulmonary complications from the clinical data of patients with NSCLC. The results indicate that ML combined with the SHAP individualized interpretation method has practical clinical value.
Affiliation(s)
- Yihai Zhai: Guangxi Medical University Cancer Hospital, Department of Thoracic Surgery, Nanning, 530021, China
- Xue Lin: The Second Affiliated Hospital of Guangxi Medical University, Department of Oncology, Nanning, 530000, China
- Qiaolin Wei: Guangxi Medical University Cancer Hospital, Department of Interventional Therapy, Nanning, 530021, China
- Yuanjin Pu: Guangxi Medical University Cancer Hospital, Department of Thoracic Surgery, Nanning, 530021, China
- Yonghui Pang: Guangxi Medical University Cancer Hospital, Department of Thoracic Surgery, Nanning, 530021, China
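The evaluation metrics named in the Zhai et al. abstract (accuracy, precision, recall, F1 score, and the kappa value) can all be derived from a 2x2 confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' code; the function name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics for a 2x2 confusion matrix (hypothetical helper)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    # Chance agreement sums, over both classes, the product of the
    # predicted-class rate and the true-class rate.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Illustrative counts only; these are NOT data from the study.
m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In a ten-fold cross-validation such as the one the abstract describes, these quantities would be computed once per fold and then summarized; how the study aggregated folds into confidence intervals is not stated in the abstract.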
3
Appadurai JP, G S, Prabhu Kavin B, C K, Lai WC. Multi-Process Remora Enhanced Hyperparameters of Convolutional Neural Network for Lung Cancer Prediction. Biomedicines 2023; 11:biomedicines11030679. [PMID: 36979657 PMCID: PMC10045623 DOI: 10.3390/biomedicines11030679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 03/30/2023] Open
Abstract
In recent years, lung cancer prediction has become an essential topic for reducing mortality. Some of the approaches reviewed in the literature show reduced accuracy at the prediction stage. Hence, in this paper, we develop a Multi-Process Remora Optimized Hyperparameters of Convolutional Neural Network (MPROH-CNN) aimed at lung cancer prediction. The proposed technique can be applied to CT images of the human lung. It proceeds in four phases, including pre-processing, feature extraction and classification. Initially, the databases are collected from the open-source system. The collected CT images contain unwanted noise, which affects classification efficiency, so pre-processing techniques such as filtering and contrast enhancement are used to remove it from the input images. Following that, the essential features are extracted with feature extraction techniques such as histogram, texture and wavelet. The extracted features are passed to the classification stage. The proposed classifier is a combination of the Remora Optimization Algorithm (ROA) and a Convolutional Neural Network (CNN). In the CNN, the ROA is used for multi-process optimization, such as structure optimization and hyperparameter optimization. The proposed methodology is implemented in MATLAB, and its performance is evaluated using performance metrics such as accuracy, precision, recall, specificity, sensitivity and F-measure. To validate the proposed approach, it is compared with the traditional techniques CNN, CNN-Particle Swarm Optimization (PSO) and CNN-Firefly Algorithm (FA). In this analysis, the proposed method achieved an accuracy of 0.98 in lung cancer prediction.
Affiliation(s)
- Jothi Prabha Appadurai: Computer Science and Engineering Department, Kakatiya Institute of Technology and Science, Warangal 506015, Telangana, India
- Suganeshwari G: School of Computer Science and Engineering, Vellore Institute of Technology, Chennai 600127, Tamil Nadu, India
- Balasubramanian Prabhu Kavin: Department of Data Science and Business Systems, College of Engineering and Technology, SRM Institute of Science and Technology, SRM Nagar, Chengalpattu District, Chennai 603203, Tamil Nadu, India
- Kavitha C: Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai 600119, Tamil Nadu, India
- Wen-Cheng Lai: Bachelor Program in Industrial Projects, National Yunlin University of Science and Technology, Douliu 640301, Taiwan; Department of Electronic Engineering, National Yunlin University of Science and Technology, Douliu 640301, Taiwan
4
Sedighi-Maman Z, Heath JJ. An Interpretable Two-Phase Modeling Approach for Lung Cancer Survivability Prediction. Sensors (Basel, Switzerland) 2022; 22:6783. [PMID: 36146145 PMCID: PMC9503480 DOI: 10.3390/s22186783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2022] [Revised: 08/28/2022] [Accepted: 09/05/2022] [Indexed: 06/16/2023]
Abstract
Although lung cancer survival status and survival length predictions have primarily been studied individually, a scheme that leverages both fields in an interpretable way for physicians remains elusive. We propose a two-phase data analytic framework that is capable of classifying survival status for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points (phase I) and predicting the number of survival months within 3 years (phase II) using recent Surveillance, Epidemiology, and End Results data from 2010 to 2017. In this study, we employ three analytical models (general linear model, extreme gradient boosting, and artificial neural networks), five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE, borderline SMOTE, adaptive synthetic sampling, and majority weighted minority oversampling technique), two feature selection methods (least absolute shrinkage and selection operator (LASSO) and random forest), and the one-hot encoding approach. By implementing a comprehensive data preparation phase, we demonstrate that a computationally efficient and interpretable method such as GLM performs comparably to more complex models. Moreover, we quantify the effects of individual features in phase I and II by exploiting GLM coefficients. To the best of our knowledge, this study is the first to (a) implement a comprehensive data processing approach to develop performant, computationally efficient, and interpretable methods in comparison to black-box models, (b) visualize top factors impacting survival odds by utilizing the change in odds ratio, and (c) comprehensively explore short-term lung cancer survival using a two-phase approach.
Affiliation(s)
- Zahra Sedighi-Maman: Robert B. Willumstad School of Business, Adelphi University, Garden City, NY 11530, USA
- Jonathan J. Heath: McDonough School of Business, Georgetown University, Washington, DC 20057, USA
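The Sedighi-Maman and Heath abstract mentions quantifying feature effects through GLM coefficients and the change in odds ratio. As a rough illustration, and assuming a logistic (logit-link) model for the survival-status classification, a coefficient beta corresponds to an odds ratio of exp(beta) per unit increase in the feature. The feature names and coefficient values below are hypothetical placeholders, not results from the study.

```python
import math


def odds_ratio(beta, delta=1.0):
    """Odds ratio for a change of `delta` in a feature with logit coefficient `beta`."""
    return math.exp(beta * delta)


def percent_change_in_odds(beta, delta=1.0):
    """Percentage change in the odds implied by the same feature change."""
    return (odds_ratio(beta, delta) - 1.0) * 100.0


# Hypothetical coefficients, for illustration only.
coefficients = {"feature_a": 0.7, "feature_b": -0.3, "feature_c": 0.1}

# Rank features by the magnitude of their effect on the log-odds.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

for name in ranked:
    beta = coefficients[name]
    print(f"{name}: OR={odds_ratio(beta):.3f}, "
          f"change in odds={percent_change_in_odds(beta):+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (higher odds of the modeled outcome), a negative one gives an odds ratio below 1, and a zero coefficient leaves the odds unchanged. This reading applies only to a logit-link model; a model of survival months would need a different interpretation.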
5
Li Z, Li X, Jin M, Liu Y, He Y, Jia N, Cui X, Liu Y, Hu G, Yu Q. Identification of potential biomarkers and their correlation with immune infiltration cells in schizophrenia using combinative bioinformatics strategy. Psychiatry Res 2022; 314:114658. [PMID: 35660966 DOI: 10.1016/j.psychres.2022.114658] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/23/2021] [Revised: 05/17/2022] [Accepted: 05/29/2022] [Indexed: 10/18/2022]
Abstract
Many studies have identified changes in gene expression in the brains of schizophrenia patients and their altered molecular processes, but the findings across datasets were inconsistent and diverse. Here we performed the most comprehensive analysis of gene expression patterns to explore the underlying mechanisms and the potential biomarkers for early diagnosis of schizophrenia. We focused on 10 gene expression datasets from post-mortem human brain samples of schizophrenia patients, downloaded from the Gene Expression Omnibus (GEO) database, and applied integrated bioinformatics analyses including the robust rank aggregation (RRA) algorithm, weighted gene co-expression network analysis (WGCNA) and CIBERSORT. A machine learning algorithm was used to construct the risk prediction model for early diagnosis of schizophrenia. We identified 15 key genes (SLC1A3, AQP4, GJA1, ALDH1L1, SOX9, SLC4A4, EGR1, NOTCH2, PVALB, ID4, ABCG2, METTL7A, ARC, F3 and EMX2) in schizophrenia by performing multiple bioinformatics analysis algorithms. Moreover, the expression of these hub genes correlated with the immune infiltrating cells estimated by CIBERSORT. The risk prediction model, constructed using both these genes and the immune cells, reached an accuracy of 0.83 in the training set and an AUC of 0.77 in the test set. Our study identified several potential biomarkers for the diagnosis of schizophrenia based on multiple bioinformatics algorithms, and the risk prediction model constructed from these biomarkers achieved high accuracy. The results provide evidence for an improved understanding of the molecular mechanism of schizophrenia.
Affiliation(s)
- Zhijun Li: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Xinwei Li: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Mengdi Jin: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Yang Liu: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Yang He: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Ningning Jia: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Xingyao Cui: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Yane Liu: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Guoyan Hu: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
- Qiong Yu: Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, 130021, China
6
Golder S, O'Connor K, Wang Y, Stevens R, Gonzalez-Hernandez G. Best Practices on Big Data Analytics to Address Sex-Specific Biases in Our Understanding of the Etiology, Diagnosis, and Prognosis of Diseases. Annu Rev Biomed Data Sci 2022; 5:251-267. [PMID: 35562851 DOI: 10.1146/annurev-biodatasci-122120-025806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
A bias in health research to favor understanding diseases as they present in men can have a grave impact on the health of women. This paper reports on a conceptual review of the literature on machine learning or natural language processing (NLP) techniques to interrogate big data for identifying sex-specific health disparities. We searched Ovid MEDLINE, Embase, and PsycINFO in October 2021 using synonyms and indexing terms for (a) "women," "men," or "sex"; (b) "big data," "artificial intelligence," or "NLP"; and (c) "disparities" or "differences." From 902 records, 22 studies met the inclusion criteria and were analyzed. Results demonstrate that the inclusion by sex is inconsistent and often unreported, although the inclusion of men in these studies is disproportionately less than women. Even though artificial intelligence and NLP techniques are widely applied in health research, few studies use them to take advantage of unstructured text to investigate sex-related differences or disparities. Researchers are increasingly aware of sex-based data bias, but the process toward correction is slow. We reflect on best practices on using big data analytics to address sex-specific biases in understanding the etiology, diagnosis, and prognosis of diseases. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Su Golder: Department of Health Sciences, University of York, York, United Kingdom
- Karen O'Connor: Department of Biostatistics, Epidemiology and Informatics (DBEI), University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Yunwen Wang: Annenberg School for Communication and Journalism, University of Southern California, Los Angeles, California, USA
- Robin Stevens: Annenberg School for Communication and Journalism, University of Southern California, Los Angeles, California, USA
- Graciela Gonzalez-Hernandez: Department of Biostatistics, Epidemiology and Informatics (DBEI), University of Pennsylvania, Philadelphia, Pennsylvania, USA
7
Kumar N, Sharma M, Singh VP, Madan C, Mehandia S. An empirical study of handcrafted and dense feature extraction techniques for lung and colon cancer classification from histopathological images. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103596] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/30/2022]
8
Machine Learning and Feature Selection Methods for EGFR Mutation Status Prediction in Lung Cancer. Applied Sciences-Basel 2021. [DOI: 10.3390/app11073273] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 02/07/2023]
Abstract
The evolution of personalized medicine has shifted the therapeutic strategy from classical chemotherapy and radiotherapy to targeted therapy based on genetic alterations. Although biopsy is the traditional method to genetically characterize lung cancer tumors, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor may be more relevant for predicting the mutation status in lung cancer, and consequently may be used to significantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor (EGFR), the most frequently mutated gene in lung cancer with several approved targeted therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classification models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR status. The results show that a comprehensive approach using an ROI that included the lung with the nodule can capture relevant information and successfully predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented with 70% of variance in the feature set, were the best-performing classifiers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This holistic approach indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.
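The EGFR abstract above describes Principal Component Analysis "implemented with 70% of variance in the feature set". One common reading, assumed here, is that the leading components are kept until they jointly explain at least 70% of the total variance. The sketch below shows only that selection step; the function name `components_for_variance` and the eigenvalues are hypothetical, and the abstract does not say how the authors implemented it.

```python
def components_for_variance(eigenvalues, threshold=0.70):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `threshold` (hypothetical helper)."""
    if not eigenvalues or any(v < 0 for v in eigenvalues):
        raise ValueError("eigenvalues must be a non-empty list of non-negative numbers")
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")

    # Components are ranked by the variance they explain, largest first.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("total variance is zero")

    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical covariance eigenvalues, for illustration only.
hypothetical_eigenvalues = [5.0, 3.0, 1.5, 0.5]
k = components_for_variance(hypothetical_eigenvalues, threshold=0.70)
print(f"components kept: {k}")
```

With these illustrative values the first component explains 50% of the variance and the first two explain 80%, so two components are kept. The reduced features would then be passed to a classifier such as those named in the abstract.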