Tian Y, Zhao WY, Liu YR, Song WW, Lin QX, Gong YN, Deng YT, Gu DN, Tian L. Machine learning reveals CAT gene as a novel potential diagnostic and prognostic biomarker in non-small cell lung cancer. Discov Oncol 2024; 15:774. [PMID: 39692815 DOI: 10.1007/s12672-024-01670-1]
[Received: 07/13/2024] [Accepted: 12/04/2024] [Indexed: 12/19/2024]
Abstract
BACKGROUND
Non-small cell lung cancer (NSCLC) is one of the most prevalent forms of lung cancer, with a five-year survival rate of 21.7%. There is an urgent need to identify biomarkers that can inform tumor diagnosis and prognosis, particularly biomarkers that remain applicable across different age groups. Here, we apply machine learning methods to analyze biomarker applicability across age groups in NSCLC.
METHODS
Studies have shown a higher incidence of NSCLC in people over 40 years of age; because of data set limitations, individuals under 40 years of age were not included in this study. To approximate a human aging model as closely as possible, we gathered NSCLC samples from the UCSC Xena database according to patient age and categorized them into three groups: 40-60, 60-80, and over 80 years old. We then employed four machine learning methods (Random Forest, LASSO regression analysis, XGBoost, and GBM) to identify gene sets with significant diagnostic value for each age group. By taking the intersection of these sets, we identified the optimal gene and assessed its prognostic significance in NSCLC. The diagnostic value of the CAT gene was then validated using global public datasets: GSE32863, GSE43458, GSE68571, GSE10072, and GSE63459 from the Americas; GSE30219 and GSE102511 from Europe; and GSE31210 and GSE19804 from Asia. Furthermore, immunohistochemical staining was performed on a tissue microarray from an independent cohort, and cell culture and RT-qPCR were used for external validation.
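The grouping and intersection steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the handling of boundary ages (60 and 80), the order of intersection (across methods within each age group, then across groups), and every gene label other than CAT are assumptions made for illustration only. The per-method gene lists stand in for the output of the four feature-selection methods, which are not run here.

```python
from typing import Dict, Iterable, Optional, Set


def age_group(age: float) -> Optional[str]:
    """Assign a patient age to one of the three groups named in the abstract.

    Assumption: ages of exactly 60 and 80 fall into "60-80"; the abstract
    does not say how boundary ages were handled.
    """
    if age < 40:
        return None  # individuals under 40 were not included in the study
    if age < 60:
        return "40-60"
    if age <= 80:
        return "60-80"
    return "over 80"


def consensus_genes(selected: Dict[str, Dict[str, Iterable[str]]]) -> Set[str]:
    """Intersect gene sets across methods within each age group, then across groups.

    `selected` maps age group -> method name -> genes chosen by that method.
    """
    per_group = []
    for by_method in selected.values():
        method_sets = [set(genes) for genes in by_method.values()]
        if not method_sets:
            return set()  # a group with no selections leaves nothing in common
        per_group.append(set.intersection(*method_sets))
    return set.intersection(*per_group) if per_group else set()


# Hypothetical selections: GENE_A..GENE_D are placeholders, not reported results.
TOY_SELECTION = {
    "40-60": {
        "random_forest": ["CAT", "GENE_A", "GENE_B"],
        "lasso": ["CAT", "GENE_A"],
        "xgboost": ["CAT", "GENE_B"],
        "gbm": ["CAT", "GENE_A", "GENE_C"],
    },
    "60-80": {
        "random_forest": ["CAT", "GENE_B"],
        "lasso": ["CAT", "GENE_B", "GENE_D"],
        "xgboost": ["CAT", "GENE_B"],
        "gbm": ["CAT", "GENE_B"],
    },
    "over 80": {
        "random_forest": ["CAT", "GENE_C"],
        "lasso": ["CAT"],
        "xgboost": ["CAT", "GENE_D"],
        "gbm": ["CAT", "GENE_C"],
    },
}

if __name__ == "__main__":
    print(age_group(57), age_group(83))
    print(consensus_genes(TOY_SELECTION))
```

In this toy input, GENE_B survives the intersection within the 60-80 group but not across groups, which shows why a gene must be selected by every method in every age group to be retained.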
RESULTS
Using machine learning methods, we identified the catalase (CAT) gene. Individuals with high CAT expression had improved survival rates and elevated immune scores. We further found that the CAT gene synergizes with multiple components of neutrophils, including TLRs, FcRn, and the selective GEF of Rho-family GTPases. In addition, we identified a potential immune checkpoint, TNFSF15, which is applicable to the human aging model. We validated the diagnostic value of the CAT gene using datasets from the Americas, Europe, and Asia. External RT-qPCR validation showed that CAT expression was higher in BEAS-2B cells than in A549 cells. In an independent human cohort, we also confirmed that CAT is expressed at low levels in lung cancer tissues. Finally, higher CAT levels were associated with improved survival in the 40-60 and 60-80 age groups.
CONCLUSIONS
In our analysis of the NSCLC database, we pinpointed the CAT gene, which holds promise for potential diagnostic and prognostic applications in the context of human aging. Furthermore, it may offer insights into addressing age-related heterogeneity of NSCLC.