1
Uddin S, Lu H, Rahman A, Gao J. A novel approach for assessing fairness in deployed machine learning algorithms. Sci Rep 2024; 14:17753. [PMID: 39085344 PMCID: PMC11291763 DOI: 10.1038/s41598-024-68651-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Accepted: 07/25/2024] [Indexed: 08/02/2024] Open
Abstract
Fairness in machine learning (ML) emerges as a critical concern as AI systems increasingly influence diverse aspects of society, from healthcare decisions to legal judgments. Many studies show evidence of unfair ML outcomes. However, the current body of literature lacks a statistically validated approach that can evaluate the fairness of a deployed ML algorithm against a dataset. A novel evaluation approach is introduced in this research based on k-fold cross-validation and statistical t-tests to assess the fairness of ML algorithms. This approach was exercised across five benchmark datasets using six classical ML algorithms. Considering four fair ML definitions guided by the current literature, our analysis showed that the same dataset generates a fair outcome for one ML algorithm but an unfair result for another. Such an observation reveals complex, context-dependent fairness issues in ML, complicated further by the varied operational mechanisms of the underlying ML models. Our proposed approach enables researchers to check whether deploying any ML algorithm against a protected attribute within a dataset is fair. We also discuss the broader implications of the proposed approach, highlighting a notable variability in its fairness outcomes. Our discussion underscores the need for adaptable fairness definitions and the exploration of methods to enhance the fairness of ensemble approaches, aiming to advance fair ML practices and ensure equitable AI deployment across societal sectors.
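Illustrative note (not part of the abstract): the abstract names k-fold cross-validation and t-tests but gives no procedural detail, so the following is only a minimal sketch of one plausible reading. It assumes a binary protected attribute, predictions that are already available, the positive-prediction rate as the fairness quantity, contiguous unshuffled folds, and a paired t statistic across folds. All function names are hypothetical, and the critical value for the test would have to come from a t table or a statistics library.

```python
import math
import statistics


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def positive_rate(preds):
    """Fraction of predictions equal to 1 (0.0 for an empty list)."""
    return sum(preds) / len(preds) if preds else 0.0


def per_fold_group_rates(y_pred, group, k):
    """Positive-prediction rate per fold for group 0 and for group 1."""
    rates0, rates1 = [], []
    for fold in kfold_indices(len(y_pred), k):
        rates0.append(positive_rate([y_pred[i] for i in fold if group[i] == 0]))
        rates1.append(positive_rate([y_pred[i] for i in fold if group[i] == 1]))
    return rates0, rates1


def paired_t_statistic(a, b):
    """Paired t statistic for equal-length samples; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # All fold differences identical: t is 0 or unbounded.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))
```

A caller would pass the two per-fold rate lists to `paired_t_statistic` and compare the absolute value with the critical t value for k - 1 degrees of freedom; a value below the threshold would mean the sketch finds no significant group difference under this particular fairness definition.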
Affiliation(s)
- Shahadat Uddin
  - School of Project Management, Faculty of Engineering, The University of Sydney, Forest Lodge, Camperdown, NSW, 2037, Australia
- Haohui Lu
  - School of Project Management, Faculty of Engineering, The University of Sydney, Forest Lodge, Camperdown, NSW, 2037, Australia
- Junbin Gao
  - Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Camperdown, NSW, 2006, Australia
2
Sánchez-Marqués R, García V, Sánchez JS. A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data. Sci Rep 2024; 14:17195. [PMID: 39060383 PMCID: PMC11282236 DOI: 10.1038/s41598-024-68291-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Accepted: 07/22/2024] [Indexed: 07/28/2024] Open
Abstract
Accurate prediction and grading of gliomas play a crucial role in evaluating brain tumor progression, assessing overall prognosis, and treatment planning. In addition to neuroimaging techniques, identifying molecular biomarkers that can guide the diagnosis, prognosis and prediction of the response to therapy has aroused the interest of researchers in their use together with machine learning and deep learning models. Most of the research in this field has been model-centric, meaning it has been based on finding better performing algorithms. However, in practice, improving data quality can result in a better model. This study investigates a data-centric machine learning approach to determine its potential benefits in predicting glioma grades. We report six performance metrics to provide a complete picture of model performance. Experimental results indicate that standardization and oversizing the minority class increase the prediction performance of four popular machine learning models and two classifier ensembles applied to a low-imbalanced data set consisting of clinical factors and molecular biomarkers. The experiments also show that the two classifier ensembles significantly outperform three of the four standard prediction models. Furthermore, we conduct a comprehensive descriptive analysis of the glioma data set to identify relevant statistical characteristics and discover the most informative attributes using four feature ranking algorithms.
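Illustrative note (not part of the abstract): the two data-centric steps named above, standardization and enlarging the minority class, can be sketched as follows. The abstract does not say which resampling technique was used; this sketch assumes plain random oversampling by duplication and z-score standardization, and the function names are hypothetical. In practice both steps would normally be fitted on the training split only.

```python
import random
import statistics


def standardize(rows):
    """Z-score each column; constant columns are left centred (divisor 1.0)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

Duplication-based oversampling is the simplest option; synthetic approaches are a common alternative, and nothing in the abstract indicates which one the authors chose.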
Affiliation(s)
- Raquel Sánchez-Marqués
  - Fundación Estatal, Salud, Infancia y Bienestar Social, 28029, Madrid, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029, Madrid, Spain
- Vicente García
  - Dept. Electrical and Computer Engineering, Instituto de Ingeniería y Tecnología, Universidad Autónoma de Ciudad Juárez, 32310, Ciudad Juárez, Mexico
- J Salvador Sánchez
  - Dept. Computer Languages and Systems, Institute of New Imaging Technologies, Universitat Jaume I, 12071, Castelló, Spain
3
Li Y, Geng Y, Sheng H. An improved mountain gazelle optimizer based on chaotic map and spiral disturbance for medical feature selection. PLoS One 2024; 19:e0307288. [PMID: 39012921 PMCID: PMC11251600 DOI: 10.1371/journal.pone.0307288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 07/03/2024] [Indexed: 07/18/2024] Open
Abstract
Feature selection is an important solution for dealing with high-dimensional data in the fields of machine learning and data mining. In this paper, we present an improved mountain gazelle optimizer (IMGO) based on the newly proposed mountain gazelle optimizer (MGO) and design a binary version of IMGO (BIMGO) to solve the feature selection problem for medical data. First, the gazelle population is initialized using iterative chaotic map with infinite collapses (ICMIC) mapping, which increases the diversity of the population. Second, a nonlinear control factor is introduced to balance the exploration and exploitation components of the algorithm. Individuals in the population are perturbed using a spiral perturbation mechanism to enhance the local search capability of the algorithm. Finally, a neighborhood search strategy is used for the optimal individuals to enhance the exploitation and convergence capabilities of the algorithm. The superior ability of the IMGO algorithm to solve continuous problems is demonstrated on 23 benchmark datasets. Then, BIMGO is evaluated on 16 medical datasets of different dimensions and compared with 8 well-known metaheuristic algorithms. The experimental results indicate that BIMGO outperforms the competing algorithms in terms of the fitness value, number of selected features and sensitivity. In addition, the statistical results of the experiments demonstrate the significantly superior ability of BIMGO to select the most effective features in medical datasets.
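Illustrative note (not part of the abstract): chaotic population initialization with the ICMIC map can be sketched as below. The map is commonly written as x_{n+1} = sin(a / x_n) with values in [-1, 1]. The control parameter `a`, the seed `x0`, the linear rescaling to the search bounds, and the function names are assumptions made for illustration; the abstract does not state the settings used in IMGO.

```python
import math


def icmic_sequence(x0, a, n):
    """Generate n values of the ICMIC map x_{k+1} = sin(a / x_k), each in [-1, 1]."""
    xs, x = [], x0
    for _ in range(n):
        if x == 0.0:
            x = 1e-12  # avoid division by zero; assumed guard, not from the paper
        x = math.sin(a / x)
        xs.append(x)
    return xs


def init_population(n_individuals, n_dims, lower, upper, x0=0.3, a=2.0):
    """Build a population by rescaling chaotic values from [-1, 1] to [lower, upper]."""
    chaos = icmic_sequence(x0, a, n_individuals * n_dims)
    population = []
    for i in range(n_individuals):
        row = chaos[i * n_dims:(i + 1) * n_dims]
        population.append([lower + (x + 1.0) / 2.0 * (upper - lower) for x in row])
    return population
```

Compared with uniform random initialization, a chaotic sequence is deterministic for a given seed, which is what the abstract credits with increasing population diversity.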
Affiliation(s)
- Ying Li
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Yanyu Geng
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
- Huankun Sheng
  - College of Computer Science and Technology, Jilin University, Changchun, People’s Republic of China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
4
Tasci E, Shah Y, Jagasia S, Zhuge Y, Shephard J, Johnson MO, Elemento O, Joyce T, Chappidi S, Cooley Zgela T, Sproull M, Mackey M, Camphausen K, Krauze AV. MGMT ProFWise: Unlocking a New Application for Combined Feature Selection and the Rank-Based Weighting Method to Link MGMT Methylation Status to Serum Protein Expression in Patients with Glioblastoma. Int J Mol Sci 2024; 25:4082. [PMID: 38612892 PMCID: PMC11012706 DOI: 10.3390/ijms25074082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 04/02/2024] [Accepted: 04/03/2024] [Indexed: 04/14/2024] Open
Abstract
Glioblastoma (GBM) is a fatal brain tumor with limited treatment options. O6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation status is the central molecular biomarker linked to both the response to temozolomide, the standard chemotherapy drug employed for GBM, and to patient survival. However, MGMT status is captured on tumor tissue which, given the difficulty in acquisition, limits the use of this molecular feature for treatment monitoring. MGMT protein expression levels may offer additional insights into the mechanistic understanding of MGMT but, currently, they correlate poorly to promoter methylation. The difficulty of acquiring tumor tissue for MGMT testing drives the need for non-invasive methods to predict MGMT status. Feature selection aims to identify the most informative features to build accurate and interpretable prediction models. This study explores the new application of a combined feature selection (i.e., LASSO and mRMR) and the rank-based weighting method (i.e., MGMT ProFWise) to non-invasively link MGMT promoter methylation status and serum protein expression in patients with GBM. Our method provides promising results, reducing dimensionality (by more than 95%) when employed on two large-scale proteomic datasets (7k SomaScan® panel and CPTAC) for all our analyses. The computational results indicate that the proposed approach provides 14 shared serum biomarkers that may be helpful for diagnostic, prognostic, and/or predictive operations for GBM-related processes, given further validation.
Affiliation(s)
- Erdal Tasci
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Yajas Shah
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Sarisha Jagasia
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Ying Zhuge
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Jason Shephard
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Margaret O. Johnson
  - Department of Neurosurgery, Duke University, Durham, NC 27710, USA
  - National Tele-Oncology, Veterans Health Administration, Durham, NC 27710, USA
- Olivier Elemento
  - Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Thomas Joyce
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Shreya Chappidi
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Theresa Cooley Zgela
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Mary Sproull
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Megan Mackey
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Kevin Camphausen
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
- Andra Valentina Krauze
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, NIH, 9000 Rockville Pike, Building 10, CRC, Bethesda, MD 20892, USA
5
Chaudhry H, Lee H, Jagasia S, Tasci E, Camphausen K, Krauze AV. Advances in the field of developing biomarkers for re-irradiation: a how-to guide to small, powerful data sets and artificial intelligence. Expert Review of Precision Medicine and Drug Development 2024; 9:3-16. [PMID: 38550554 PMCID: PMC10972602 DOI: 10.1080/23808993.2024.2325936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 02/28/2024] [Indexed: 04/01/2024]
Abstract
Introduction: Patient selection remains challenging as the clinical use of re-irradiation (re-RT) increases. Re-RT data is limited to retrospective studies and small prospective single-institution reports, resulting in small, heterogeneous data sets. Validated prognostic and predictive biomarkers are derived from large-volume studies with long-term follow-up. This review aims to examine existing re-RT publications and available data sets and discuss strategies using artificial intelligence (AI) to approach small data sets to optimize the use of re-RT data.
Methods: Re-RT publications were identified where associated public data was present. The existing literature on small data sets to identify biomarkers was also explored.
Results: Publications with associated public data were identified, with glioma and nasopharyngeal cancers emerging as the most common tumor sites where the use of re-RT was the primary management approach. Existing and emerging AI strategies have been used to approach small data sets including data generation, augmentation, discovery, and transfer learning.
Conclusions: Further data is needed to generate adaptive frameworks, improve the collection of specimens for molecular analysis, and improve the interpretability of results in re-RT data.
Affiliation(s)
- Huma Chaudhry
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
- Hawon Lee
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
- Sarisha Jagasia
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
- Erdal Tasci
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
- Kevin Camphausen
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
- Andra Valentina Krauze
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD, 20892, United States
6
Tasci E, Jagasia S, Zhuge Y, Camphausen K, Krauze AV. GradWise: A Novel Application of a Rank-Based Weighted Hybrid Filter and Embedded Feature Selection Method for Glioma Grading with Clinical and Molecular Characteristics. Cancers (Basel) 2023; 15:4628. [PMID: 37760597 PMCID: PMC10526509 DOI: 10.3390/cancers15184628] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/06/2023] [Revised: 09/01/2023] [Accepted: 09/14/2023] [Indexed: 09/29/2023] Open
Abstract
Glioma grading plays a pivotal role in guiding treatment decisions, predicting patient outcomes, facilitating clinical trial participation and research, and tailoring treatment strategies. Current glioma grading in the clinic is based on tissue acquired at the time of resection, with tumor aggressiveness assessed from tumor morphology and molecular features. The increased emphasis on molecular characteristics as a guide for management and prognosis estimation is driven by the need for accurate and standardized grading systems that integrate molecular and clinical information in the grading process, and it carries the expectation of exposing molecular markers that go beyond prognosis to increase understanding of tumor biology as a means of identifying druggable targets. In this study, we introduce a novel application (GradWise) that combines rank-based weighted hybrid filter (i.e., mRMR) and embedded (i.e., LASSO) feature selection methods to enhance the performance of feature selection and machine learning models for glioma grading using both clinical and molecular predictors. We utilized the publicly available TCGA (from the UCI ML Repository) and CGGA datasets to identify the most effective scheme that allows for the selection of the minimum number of features with their names. Two popular feature selection methods with a rank-based weighting procedure were employed to conduct comprehensive experiments with the five supervised models. The computational results demonstrate that our proposed method achieves an accuracy rate of 87.007% with 13 features and an accuracy rate of 80.412% with five features on the TCGA and CGGA datasets, respectively. We also obtained four shared biomarkers for the glioma grading that emerged in both datasets and can be employed with transferable value to other datasets and data-based outcome analyses.
These findings are a significant step toward highlighting the effectiveness of our approach by offering pioneering results with novel markers with prospects for understanding and targeting the biologic mechanisms of glioma progression to improve patient outcomes.
Affiliation(s)
- Andra Valentina Krauze
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD 20892, USA
7
Tasci E, Jagasia S, Zhuge Y, Sproull M, Cooley Zgela T, Mackey M, Camphausen K, Krauze AV. RadWise: A Rank-Based Hybrid Feature Weighting and Selection Method for Proteomic Categorization of Chemoirradiation in Patients with Glioblastoma. Cancers (Basel) 2023; 15:2672. [PMID: 37345009 PMCID: PMC10216128 DOI: 10.3390/cancers15102672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 05/03/2023] [Accepted: 05/06/2023] [Indexed: 06/23/2023] Open
Abstract
Glioblastomas (GBM) are rapidly growing, aggressive, nearly uniformly fatal, and the most common primary type of brain cancer. They exhibit significant heterogeneity and resistance to treatment, limiting the ability to analyze dynamic biological behavior that drives response and resistance, which are central to advancing outcomes in glioblastoma. Analysis of the proteome aimed at signal change over time provides a potential opportunity for non-invasive classification and examination of the response to treatment by identifying protein biomarkers associated with interventions. However, data acquired using large proteomic panels must be more intuitively interpretable, requiring computational analysis to identify trends. Machine learning is increasingly employed; however, it requires feature selection, which has a critical and considerable effect on machine learning problems when applied to large-scale data to reduce the number of parameters, improve generalization, and find essential predictors. In this study, using 7k proteomic data generated from the analysis of serum obtained from 82 patients with GBM pre- and post-completion of concurrent chemoirradiation (CRT), we aimed to select the most discriminative proteomic features that define the proteomic alteration resulting from the administration of CRT. Thus, we present a novel rank-based feature weighting method (RadWise) to identify relevant proteomic parameters using two popular feature selection methods, least absolute shrinkage and selection operator (LASSO) and the minimum redundancy maximum relevance (mRMR). The computational results show that the proposed method yields outstanding results with very few selected proteomic features, with higher accuracy rate performance than methods that do not employ a feature selection process.
While the computational method identified several proteomic signals identical to those of the clinically intuitive (heuristic) approach, several heuristically identified proteomic signals were not selected, and other novel proteomic biomarkers that carry biological prognostic relevance in GBM, and that were not selected with the heuristic approach, emerged only with the novel method. The computational results show that the proposed method yields promising results, reducing 7k proteomic data to 7 selected proteomic features with a performance value of 93.921%, comparing favorably with techniques that do not employ feature selection.
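Illustrative note (not part of the abstract): the general idea of rank-based weighting over two feature rankings can be sketched as follows. This is not the RadWise procedure, whose weighting details are not given in the abstract; it is a hypothetical scheme in which each ranked list (for example one from LASSO and one from mRMR) gives a feature a weight that decreases linearly with its rank, and the weights are summed. The function names and the tie-breaking rule are assumptions.

```python
def rank_weights(ranking):
    """Map each feature to a weight: len(ranking) for rank 1 down to 1 for the last."""
    n = len(ranking)
    return {feature: n - pos for pos, feature in enumerate(ranking)}


def combine_rankings(ranking_a, ranking_b, top_k):
    """Sum rank weights from two rankings and return the top_k features.

    Features missing from one ranking get weight 0 from it; ties are broken
    alphabetically so the result is deterministic.
    """
    wa, wb = rank_weights(ranking_a), rank_weights(ranking_b)
    scores = {f: wa.get(f, 0) + wb.get(f, 0) for f in set(wa) | set(wb)}
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return ordered[:top_k]
```

With hypothetical rankings `["a", "b", "c"]` and `["b", "a", "c"]`, features `a` and `b` tie on combined weight and both rank above `c`, so a caller asking for two features would receive `a` and `b`.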
Affiliation(s)
- Andra Valentina Krauze
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD 20892, USA
8
Cost Matrix of Molecular Pathology in Glioma-Towards AI-Driven Rational Molecular Testing and Precision Care for the Future. Biomedicines 2022; 10:3029. [PMID: 36551786 PMCID: PMC9775648 DOI: 10.3390/biomedicines10123029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 11/09/2022] [Accepted: 11/19/2022] [Indexed: 11/27/2022] Open
Abstract
Gliomas are the most common and aggressive primary brain tumors. Gliomas carry a poor prognosis because of the tumor's resistance to radiation and chemotherapy leading to nearly universal recurrence. Recent advances in large-scale genomic research have allowed for the development of more targeted therapies to treat glioma. While precision medicine can target specific molecular features in glioma, targeted therapies are often not feasible due to the lack of actionable markers and the high cost of molecular testing. This review summarizes the clinically relevant molecular features in glioma and the current cost of care for glioma patients, focusing on the molecular markers and meaningful clinical features that are linked to clinical outcomes and have a realistic possibility of being measured, which is a promising direction for precision medicine using artificial intelligence approaches.