1
Zhu J, Pu S, He J, Su D, Cai W, Xu X, Liu H. Processing imbalanced medical data at the data level with assisted-reproduction data as an example. BioData Min 2024; 17:29. [PMID: 39232851 PMCID: PMC11373105 DOI: 10.1186/s13040-024-00384-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Accepted: 08/27/2024] [Indexed: 09/06/2024] Open
Abstract
OBJECTIVE Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios. METHODS We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness. RESULTS The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes. CONCLUSIONS The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
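The evaluation metrics named in this abstract can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the function name `imbalance_metrics` and the example counts are hypothetical and are not taken from the study. It shows why accuracy alone can look good at a low positive rate while recall, F1-Score, and G-mean stay poor.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Return common classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)                # geometric mean of both class recalls
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical counts: 1000 samples with a 5% positive rate.
metrics = imbalance_metrics(tp=10, fp=5, tn=945, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is high because the majority class dominates, whereas recall and G-mean expose how many positives are missed.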
Affiliation(s)
- Junliang Zhu
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Shaowei Pu
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Jiaji He
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Dongchao Su
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Weijie Cai
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Xueying Xu
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China
- Hongbo Liu
  - Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  - Key Lab of Environmental Stress and Chronic Disease Control & Prevention, China Medical University, No.77 Puhe Road, Shenyang North New Area, Shenyang, 110122, Liaoning Province, PR China.
2
Ferdowsi M, Hasan MM, Habib W. Responsible AI for cardiovascular disease detection: Towards a privacy-preserving and interpretable model. Computer Methods and Programs in Biomedicine 2024; 254:108289. [PMID: 38905988 DOI: 10.1016/j.cmpb.2024.108289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Revised: 06/10/2024] [Accepted: 06/16/2024] [Indexed: 06/23/2024]
Abstract
BACKGROUND AND OBJECTIVE Cardiovascular disease (CD) is a major global health concern, affecting millions with symptoms like fatigue and chest discomfort. Timely identification is crucial due to its significant contribution to global mortality. In healthcare, artificial intelligence (AI) holds promise for advancing disease risk assessment and treatment outcome prediction. However, machine learning (ML) evolution raises concerns about data privacy and biases, especially in sensitive healthcare applications. The objective is to develop and implement a responsible AI model for CD prediction that prioritizes patient privacy and security while ensuring transparency, explainability, fairness, and ethical adherence in healthcare applications. METHODS To predict CD while prioritizing patient privacy, our study employed data anonymization, which involved adding Laplace noise to sensitive features such as age and gender. The anonymized dataset underwent analysis using a differential privacy (DP) framework to preserve data privacy. DP ensured confidentiality while extracting insights. Compared with Logistic Regression (LR), Gaussian Naïve Bayes (GNB), and Random Forest (RF), the methodology integrated feature selection, statistical analysis, and SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) for interpretability. This approach facilitates transparent and interpretable AI decision-making, aligning with responsible AI development principles. Overall, it combines privacy preservation, interpretability, and ethical considerations for accurate CD predictions. RESULTS Our results from the DP framework with LR were promising, with an area under the curve (AUC) of 0.848 ± 0.03, an accuracy of 0.797 ± 0.02, a precision of 0.789 ± 0.02, a recall of 0.797 ± 0.02, and an F1 score of 0.787 ± 0.02, comparable to the performance of the non-privacy framework. The SHAP and LIME based results support clinical findings, show a commitment to transparent and interpretable AI decision-making, and align with the principles of responsible AI development. CONCLUSIONS Our study endorses a novel approach in predicting CD, amalgamating data anonymization, privacy-preserving methods, interpretability tools (SHAP and LIME), and ethical considerations. This responsible AI framework ensures accurate predictions, privacy preservation, and user trust, underscoring the significance of comprehensive and transparent ML models in healthcare. Therefore, this research empowers the ability to forecast CD, providing a vital lifeline to millions of CD patients globally and potentially preventing numerous fatalities.
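As an illustration of the anonymization step described in the methods, the following is a minimal sketch of adding Laplace noise to a numeric feature. The abstract does not report the sensitivity or privacy budget used, so `sensitivity`, `epsilon`, the function name `add_laplace_noise`, and the example ages are all assumptions made here for demonstration. The noise is drawn as the difference of two exponential variates, which follows a Laplace distribution with the given scale.

```python
import random


def add_laplace_noise(values, sensitivity, epsilon, seed=None):
    """Add Laplace(0, sensitivity / epsilon) noise to each value.

    sensitivity and epsilon are hypothetical parameters; the source
    abstract does not state the settings the authors used.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


# Hypothetical ages, perturbed with an assumed sensitivity of 1 year.
ages = [54.0, 61.0, 47.0, 70.0]
print(add_laplace_noise(ages, sensitivity=1.0, epsilon=0.5, seed=0))
```

A smaller `epsilon` gives a larger noise scale, which strengthens privacy at the cost of more distortion in the feature.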
Affiliation(s)
- Mahbuba Ferdowsi
  - Department of Mechatronics and Biomedical Engineering, Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman (UTAR), Kajang, Selangor 43200, Malaysia.
- Md Mahmudul Hasan
  - School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Wafa Habib
  - Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya (UM), Kuala Lumpur 50603, Malaysia
3
Oka H, Kawahara D, Murakami Y. Radiomics-based prediction of recurrence for head and neck cancer patients using data imbalanced correction. Comput Biol Med 2024; 180:108879. [PMID: 39067154 DOI: 10.1016/j.compbiomed.2024.108879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Revised: 06/21/2024] [Accepted: 07/10/2024] [Indexed: 07/30/2024]
Abstract
OBJECTIVES To propose a radiomics-based prediction model for head and neck squamous cell carcinoma (HNSCC) recurrence after radiation therapy using a novel data imbalance correction method known as Gaussian noise upsampling (GNUS). MATERIALS AND METHODS The dataset includes 97 HNSCC patients treated with definitive radiotherapy alone or concurrent chemoradiotherapy at two institutions. We performed radiomics analysis using nine segmentations created on pretreatment positron emission tomography and computed tomography images. Feature selection was performed by the least absolute shrinkage and selection operator analysis via five-fold cross-validation. The proposed GNUS was compared with seven conventional data-imbalance correction methods. Classification models of HNSCC recurrence were constructed on oversampled features using the machine learning algorithms of linear regression. Their predictive performance was evaluated based on accuracy, sensitivity, specificity, and the area under the curve (AUC) of the receiver operating characteristic curve via five-fold cross-validation using the same combinations as for feature selection. RESULT The prediction model without data imbalance correction shows sensitivity, specificity, accuracy, and AUC values of 83 %, 96 %, 92 %, and 0.96, respectively. The conventional model with the best performance is the random over-sampler model, which shows sensitivity, specificity, accuracy, and AUC values of 93 %, 91 %, 92 %, and 0.97, respectively, whereas the GNUS model shows values of 93 %, 94 %, 94 %, and 0.98, respectively. CONCLUSION Oversampling methods can reduce sensitivity and specificity bias. The proposed GNUS can improve accuracy as well as reduce sensitivity and specificity bias.
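The abstract does not describe how GNUS generates synthetic samples, so the sketch below is not the authors' method. It shows only the general idea suggested by the name, under the assumption that new minority-class samples are created by copying existing ones and perturbing each feature with zero-mean Gaussian noise. The function name `gaussian_noise_upsample`, the `sigma` parameter, and the example feature vectors are hypothetical.

```python
import random


def gaussian_noise_upsample(minority, n_new, sigma, seed=None):
    """Create n_new synthetic rows by jittering randomly chosen minority rows.

    Assumed illustration only: each synthetic row is a copy of a real
    minority-class row with N(0, sigma^2) noise added to every feature.
    """
    if not minority:
        raise ValueError("minority must contain at least one sample")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        synthetic.append([x + rng.gauss(0.0, sigma) for x in base])
    return synthetic


# Hypothetical standardized radiomic features for three recurrence cases.
recurrence_cases = [[0.2, 1.1, -0.4], [0.5, 0.9, -0.1], [0.1, 1.4, -0.6]]
augmented = recurrence_cases + gaussian_noise_upsample(
    recurrence_cases, n_new=6, sigma=0.05, seed=1
)
print(len(augmented))
```

Setting `sigma` to zero reduces this to plain random oversampling, the strongest conventional baseline reported in the abstract. As with any oversampling, the resampling would normally be applied only to training folds.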
Affiliation(s)
- Hiroki Oka
  - Department of Radiation Oncology, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, 734-8551, Japan
- Daisuke Kawahara
  - Department of Radiation Oncology, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, 734-8551, Japan.
- Yuji Murakami
  - Department of Radiation Oncology, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, 734-8551, Japan
4
Paetkau O, Weppler S, Kwok J, Quon HC, Gomes da Rocha C, Smith W, Tchistiakova E, Kirkby C. Pharyngeal Constrictor Dose-Volume Histogram Metrics and Patient-Reported Dysphagia in Head and Neck Radiotherapy. Clin Oncol (R Coll Radiol) 2024; 36:173-182. [PMID: 38220581 DOI: 10.1016/j.clon.2024.01.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/18/2022] [Revised: 11/03/2023] [Accepted: 01/05/2024] [Indexed: 01/16/2024]
Abstract
AIMS Head and neck radiotherapy long-term survival continues to improve and the management of long-term side-effects is moving to the forefront of patient care. Dysphagia is associated with dose to the pharyngeal constrictors and can be measured using patient-reported outcomes to evaluate its effect on quality of life. The aim of the present study was to relate pharyngeal constrictor dose-volume parameters with patient-reported outcomes to identify prognostic dose constraints. MATERIALS AND METHODS A 64-patient training cohort and a 24-patient testing cohort of oropharynx and nasopharynx cancer patients treated with curative-intent chemoradiotherapy were retrospectively examined. These patients completed the MD Anderson Dysphagia Inventory outcome survey at 12 months post-radiotherapy to evaluate late dysphagia: a composite score lower than 60 indicated dysphagia. The pharyngeal constrictor muscles were subdivided into four substructures: superior, middle, inferior and cricopharyngeal. Dose-volume histogram (DVH) metrics for each of the structure combinations were extracted. A decision tree classifier was run for each DVH metric to identify dose constraints optimising the accuracy and sensitivity of the cohort. A 60% accuracy threshold and feature selection method were used to ensure statistically significant DVH metrics were identified. These dose constraints were then validated on the 24-patient testing cohort. RESULTS Existing literature dose constraints only had two dose constraints performing above 60% accuracy and sensitivity when evaluated on our training cohort. We identified two well-performing dose constraints: the pharyngeal constrictor muscle D63% < 55 Gy and the superior-middle pharyngeal constrictor combination structure V31Gy < 100%. Both dose constraints resulted in ≥73% mean accuracy and ≥80% mean sensitivity on the training and testing patient cohorts. 
In addition, a pharyngeal constrictor muscle mean dose <57 Gy resulted in a mean accuracy ≥74% and mean sensitivity ≥60%. CONCLUSION Mid-dose pharyngeal constrictor muscle and substructure combination dose constraints should be used in the treatment planning process to reduce late patient-reported dysphagia.
Affiliation(s)
- O Paetkau
  - Department of Physics and Astronomy, University of Calgary, Calgary, Alberta, Canada.
- S Weppler
  - Tom Baker Cancer Center, Calgary, Alberta, Canada
- J Kwok
  - Tom Baker Cancer Center, Calgary, Alberta, Canada; Division of Radiation Oncology, Department of Oncology, University of Calgary, Calgary, Alberta, Canada
- H C Quon
  - Tom Baker Cancer Center, Calgary, Alberta, Canada
- C Gomes da Rocha
  - Department of Physics and Astronomy, University of Calgary, Calgary, Alberta, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada; Institute for Quantum Science and Technology, University of Calgary, Calgary, Alberta, Canada
- W Smith
  - Varian Medical Systems - A Siemens Healthineers Company, Palo Alto, California, USA
- E Tchistiakova
  - Department of Physics and Astronomy, University of Calgary, Calgary, Alberta, Canada
- C Kirkby
  - Department of Physics and Astronomy, University of Calgary, Calgary, Alberta, Canada
5
Patkar S, Mannheimer J, Harmon S, Mazcko C, Choyke P, Brown GT, Turkbey B, LeBlanc A, Beck J. Large Scale Comparative Deconvolution Analysis of the Canine and Human Osteosarcoma Tumor Microenvironment Uncovers Conserved Clinically Relevant Subtypes. bioRxiv: the preprint server for biology 2023:2023.09.27.559797. [PMID: 37808704 PMCID: PMC10557692 DOI: 10.1101/2023.09.27.559797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Osteosarcoma is a relatively rare but aggressive cancer of the bones with a shortage of effective biomarkers. Although less common in humans, Osteosarcomas are fairly common in adult pet dogs and have been shown to share many similarities with their human analogs. In this work, we analyze bulk transcriptomic data of 213 primary and 100 metastatic Osteosarcoma samples from 210 pet dogs enrolled in nation-wide clinical trials to uncover three Tumor Microenvironment (TME)-based subtypes: Immune Enriched (IE), Immune Enriched Dense Extra-Cellular Matrix-like (IE-ECM) and Immune Desert (ID) with distinct cell type compositions, oncogenic pathway activity and chromosomal instability. Furthermore, leveraging bulk transcriptomic data of canine primary tumors and their matched metastases from different sites, we characterize how the Osteosarcoma TME evolves from primary to metastatic disease in a standard of care clinical setting and assess its overall impact on clinical outcomes of canines. Most importantly, we find that TME-based subtypes of canine Osteosarcomas are conserved in humans and predictive of progression free survival outcomes of human patients, independently of known prognostic biomarkers such as presence of metastatic disease at diagnosis and percent necrosis following chemotherapy. In summary, these results demonstrate the power of using canines to model the human Osteosarcoma TME and discover novel biomarkers for clinical translation.
Affiliation(s)
- Sushant Patkar
  - Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Josh Mannheimer
  - Comparative Oncology Program, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Stephanie Harmon
  - Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Christina Mazcko
  - Comparative Oncology Program, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Peter Choyke
  - Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- G Tom Brown
  - Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Baris Turkbey
  - Artificial Intelligence Resource, Molecular Imaging Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Amy LeBlanc
  - Comparative Oncology Program, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
- Jessica Beck
  - Comparative Oncology Program, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, USA
6
Jani J, Doshi J, Kheria I, Mehta K, Bhadane C, Karani R. LayNet-A multi-layer architecture to handle imbalance in medical imaging data. Comput Biol Med 2023; 163:107179. [PMID: 37354820 DOI: 10.1016/j.compbiomed.2023.107179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 06/02/2023] [Accepted: 06/11/2023] [Indexed: 06/26/2023]
Abstract
In an imbalanced dataset, a machine learning classifier using traditional imbalance handling methods may achieve good accuracy, but in highly imbalanced datasets, it may over-predict the majority class and ignore the minority class. In the medical domain, failing to correctly estimate the minority class might lead to a false negative, which is concerning in cases of life-threatening illnesses and infectious diseases like Covid-19. Currently, classification in deep learning has a single-layered architecture where a neural network is employed. This paper proposes a multilayer design entitled LayNet to address this issue. LayNet aims to lessen the class imbalance by dividing the classes among layers and achieving a balanced class distribution at each layer. To ensure that all the classes are being classified, minor classes are combined to form a single new 'hybrid' class at higher layers. The final layer has no hybrid class and only singleton (distinct) classes. Each layer of the architecture includes a separate model that determines if an input belongs to one class or a hybrid class. If it fits into the hybrid class, it advances to the following layer, where it is further categorized within the hybrid class. The method to divide the classes into various architectural levels is also introduced in this paper. The Ocular Disease Intelligent Recognition Dataset, Covid-19 Radiography Dataset, and Retinal OCT Dataset are used to evaluate this methodology. The LayNet architecture performs better on these datasets when the results of the traditional single-layer architecture and the proposed multilayered architecture are compared.
Affiliation(s)
- Jay Jani
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
- Jay Doshi
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
- Ishita Kheria
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
- Karishni Mehta
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
- Chetashri Bhadane
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
- Ruhina Karani
  - Computer Engineering Department, D.J. Sanghvi College of Engineering, Mumbai, India.
7
Rengma NS, Yadav M, Kalambukattu JG, Kumar S. Machine learning-based digital mapping of soil organic carbon and texture in the mid-Himalayan terrain. Environmental Monitoring and Assessment 2023; 195:994. [PMID: 37491644 DOI: 10.1007/s10661-023-11608-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 07/14/2023] [Indexed: 07/27/2023]
Abstract
Mountain soils have received significant attention due to their profound influence on ecological processes and environmental factors. However, mapping these soils with digital soil mapping techniques encounters several challenges, including high local variability, non-linear relationships between environmental covariates and soil properties, limited accessibility in complex topographical settings, and the absence of universally applicable covariates for soil formation. To address these issues, this study integrates soil-forming factors of the scorpan model to map soil organic carbon (SOC) and soil texture in the mid-Himalayas. By considering over 100 environmental covariates, with a focus on terrain parameters relevant to mountainous environments, the study aims to enhance the accuracy of ML regression models through augmentation techniques that overcome data insufficiency. Using augmented soil observations and covariates, a non-parametric random forest regression model is trained and applied to predict soil variables across the study area, generating a continuous fine-resolution map. The model's performance, evaluated against an unseen dataset, was strong, with R-square values of 0.80, 0.79, 0.72, and 0.84 for clay, sand, silt, and SOC, respectively. Furthermore, a sensitivity analysis of the environmental covariates and their impact on the model revealed that all the soil-forming factors make a significant contribution to the model's effectiveness. The insights gained from this research contribute to a better understanding of mountain soils and facilitate the development of effective conservation and sustainable management strategies for mountainous regions.
Affiliation(s)
- Nyenshu Seb Rengma
  - Geographic Information System (GIS) Cell, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, Uttar Pradesh, -211004, India
- Manohar Yadav
  - Geographic Information System (GIS) Cell, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, Uttar Pradesh, -211004, India.
- Justin George Kalambukattu
  - Agriculture & Soils Department, Indian Institute of Remote Sensing, Indian Space Research Organisation, Govt. of India, 4, Kalidas Road, Dehradun, Uttarakhand, -248001, India
- Suresh Kumar
  - Agriculture & Soils Department, Indian Institute of Remote Sensing, Indian Space Research Organisation, Govt. of India, 4, Kalidas Road, Dehradun, Uttarakhand, -248001, India
8
Klau JH, Maj C, Klinkhammer H, Krawitz PM, Mayr A, Hillmer AM, Schumacher J, Heider D. AI-based multi-PRS models outperform classical single-PRS models. Front Genet 2023; 14:1217860. [PMID: 37441549 PMCID: PMC10335560 DOI: 10.3389/fgene.2023.1217860] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/06/2023] [Accepted: 06/13/2023] [Indexed: 07/15/2023] Open
Abstract
Polygenic risk scores (PRS) calculate the risk for a specific disease based on the weighted sum of associated alleles from different genetic loci in the germline estimated by regression models. Recent advances in genetics made it possible to create polygenic predictors of complex human traits, including risks for many important complex diseases, such as cancer, diabetes, or cardiovascular diseases, typically influenced by many genetic variants, each of which has a negligible effect on overall risk. In the current study, we analyzed whether adding additional PRS from other diseases to the prediction models and replacing the regressions with machine learning models can improve overall predictive performance. Results showed that multi-PRS models outperform single-PRS models significantly on different diseases. Moreover, replacing regression models with machine learning models, i.e., deep learning, can also improve overall accuracy.
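The weighted-sum definition in the first sentence of this abstract can be written out in a few lines. The sketch below is a generic illustration, not the authors' pipeline: the function name `polygenic_risk_score`, the allele dosages, and the effect weights are hypothetical values chosen for demonstration.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per locus).

    weights are per-variant effect sizes, for example regression
    coefficients; the numbers used below are made up.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical individual genotyped at three loci.
dosages = [0, 1, 2]
weights = [0.10, -0.05, 0.20]
print(round(polygenic_risk_score(dosages, weights), 4))
```

In a multi-PRS setting as described in the abstract, several such scores, each computed with weights for a different disease, would serve together as input features to a downstream model.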
Affiliation(s)
- Jan Henric Klau
  - Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Carlo Maj
  - Center for Human Genetics, University of Marburg, Marburg, Germany
- Hannah Klinkhammer
  - Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany
  - Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany
- Peter M. Krawitz
  - Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany
- Andreas Mayr
  - Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany
- Axel M. Hillmer
  - Institute of Pathology, Faculty of Medicine, University of Cologne, Cologne, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
9
Marin L, Casado F. Prediction of prostate cancer biochemical recurrence by using discretization supports the critical contribution of the extra-cellular matrix genes. Sci Rep 2023; 13:10144. [PMID: 37349324 PMCID: PMC10287745 DOI: 10.1038/s41598-023-35821-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 05/24/2023] [Indexed: 06/24/2023] Open
Abstract
Due to its complexity, much effort has been devoted to the development of biomarkers for prostate cancer that have acquired the utmost clinical relevance for diagnosis and grading. However, all of these advances are limited due to the relatively large percentage of biochemical recurrence (BCR) and the limited strategies for follow up. This work proposes a methodology that uses discretization to predict prostate cancer BCR while optimizing the necessary variables. We used discretization of RNA-seq data to increase the prediction of biochemical recurrence and retrieve a subset of ten genes functionally known to be related to the tissue structure. Equal width and equal frequency data discretization methods were compared to isolate the contribution of the genes and their interval of action, simultaneously. Adding a robust clinical biomarker such as prostate specific antigen (PSA) improved the prediction of BCR. Discretization allowed classifying the cancer patients with an accuracy of 82% on testing datasets, and 75% on a validation dataset when a five-bin discretization by equal width was used. After data pre-processing, feature selection and classification, our predictions had a precision of 71% (testing dataset: MSKCC and GSE54460) and 69% (Validation dataset: GSE70769) should the patients present BCR up to 24 months after their final treatment. These results emphasize the use of equal width discretization as a pre-processing step to improve classification for a limited number of genes in the signature. Functionally, many of these genes have a direct or expected role in tissue structure and extracellular matrix organization. The processing steps presented in this study are also applicable to other cancer types to increase the speed and accuracy of the models in diverse datasets.
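The two discretization schemes compared in this abstract differ in how bin boundaries are chosen. The sketch below is a generic illustration under assumed names (`equal_width_bins`, `equal_frequency_bins`) and made-up expression values; it does not reproduce the authors' preprocessing. Equal width splits the value range into intervals of the same size, whereas equal frequency places roughly the same number of samples in each bin.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of values.

    Ties are broken by position, so equal values can land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical expression values for one gene, with a single outlier.
expression = [1, 2, 3, 4, 100]
print(equal_width_bins(expression, 5))
print(equal_frequency_bins(expression, 5))
```

With an outlier present, equal width pushes most samples into the lowest bin, while equal frequency spreads them evenly, which is one reason the two schemes can yield different gene signatures.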
Affiliation(s)
- Laura Marin
  - Department of Engineering, Pontificia Universidad Catolica del Peru, Av. Universitaria 1801, San Miguel, 15088, Lima, Peru
  - Institute of Omics Sciences and Applied Biotechnology, Pontificia Universidad Catolica del Peru, Av. Universitaria 1801, San Miguel, 15088, Lima, Peru
- Fanny Casado
  - Institute of Omics Sciences and Applied Biotechnology, Pontificia Universidad Catolica del Peru, Av. Universitaria 1801, San Miguel, 15088, Lima, Peru.
10
Mayor D, Steffert T, Datseris G, Firth A, Panday D, Kandel H, Banks D. Complexity and Entropy in Physiological Signals (CEPS): Resonance Breathing Rate Assessed Using Measures of Fractal Dimension, Heart Rate Asymmetry and Permutation Entropy. Entropy (Basel, Switzerland) 2023; 25:301. [PMID: 36832667 PMCID: PMC9955651 DOI: 10.3390/e25020301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2022] [Revised: 01/09/2023] [Accepted: 01/21/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND As technology becomes more sophisticated, more accessible methods of interpreting Big Data become essential. We have continued to develop Complexity and Entropy in Physiological Signals (CEPS) as an open access MATLAB® GUI (graphical user interface) providing multiple methods for the modification and analysis of physiological data. METHODS To demonstrate the functionality of the software, data were collected from 44 healthy adults for a study investigating the effects on vagal tone of breathing paced at five different rates, as well as self-paced and un-paced. Five-minute 15-s recordings were used. Results were also compared with those from shorter segments of the data. Electrocardiogram (ECG), electrodermal activity (EDA) and Respiration (RSP) data were recorded. Particular attention was paid to COVID risk mitigation, and to parameter tuning for the CEPS measures. For comparison, data were processed using Kubios HRV, RR-APET and DynamicalSystems.jl software. We also compared findings for ECG RR interval (RRi) data resampled at 4 Hz (4R) or 10 Hz (10R), and non-resampled (noR). In total, we used around 190-220 measures from CEPS at various scales, depending on the analysis undertaken, with our investigation focused on three families of measures: 22 fractal dimension (FD) measures, 40 heart rate asymmetries or measures derived from Poincaré plots (HRA), and 8 measures based on permutation entropy (PE). RESULTS FDs for the RRi data differentiated strongly between breathing rates, whether data were resampled or not, increasing between 5 and 7 breaths per minute (BrPM). Largest effect sizes for RRi (4R and noR) differentiation between breathing rates were found for the PE-based measures. Measures that both differentiated well between breathing rates and were consistent across different RRi data lengths (1-5 min) included five PE-based (noR) and three FDs (4R).
Of the top 12 measures with short-data values consistently within ± 5% of their values for the 5-min data, five were FDs, one was PE-based, and none were HRAs. Effect sizes were usually greater for CEPS measures than for those implemented in DynamicalSystems.jl. CONCLUSION The updated CEPS software enables visualisation and analysis of multichannel physiological data using a variety of established and recently introduced complexity entropy measures. Although equal resampling is theoretically important for FD estimation, it appears that FD measures may also be usefully applied to non-resampled data.
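Permutation entropy, one of the three measure families highlighted in this abstract, can be sketched compactly. The code below follows the standard ordinal-pattern definition and is not the CEPS or DynamicalSystems.jl implementation; the function name `permutation_entropy`, its default `order` and `delay`, and the example series are assumptions for illustration.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy (base 2) of ordinal patterns in a time series.

    order is the pattern length and delay the spacing between samples;
    both defaults are illustrative choices, not values from the study.
    """
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for i in range(n_windows):
        window = [series[i + j * delay] for j in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        patterns[tuple(sorted(range(order), key=lambda k: window[k]))] += 1
    entropy = -sum(
        (c / n_windows) * math.log2(c / n_windows) for c in patterns.values()
    )
    if normalize:
        entropy /= math.log2(math.factorial(order))
    return entropy


# Short hypothetical RR-interval-like series (arbitrary units).
rr = [4, 7, 9, 10, 6, 11, 3]
print(round(permutation_entropy(rr, order=3, normalize=False), 4))
```

A strictly monotonic series contains a single ordinal pattern and therefore has zero permutation entropy, while a series in which all patterns are equally frequent reaches the normalized maximum of 1.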
Collapse
Affiliation(s)
- David Mayor: School of Health and Social Work, University of Hertfordshire, Hatfield AL10 9AB, UK
- Tony Steffert: MindSpire, Napier House, 14–16 Mount Ephraim Rd., Tunbridge Wells TN1 1EE, UK; School of Life, Health and Chemical Sciences, STEM, Walton Hall, The Open University, Milton Keynes MK7 6AA, UK
- George Datseris: Department of Mathematics and Statistics, University of Exeter, North Park Road, Exeter EX4 4QF, UK
- Andrea Firth: University Campus Football Business, Wembley HA9 0WS, UK
- Deepak Panday: School of Engineering and Computer Science, University of Hertfordshire, Hatfield AL10 9AB, UK
- Harikala Kandel: Department of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
- Duncan Banks: School of Life, Health and Chemical Sciences, STEM, Walton Hall, The Open University, Milton Keynes MK7 6AA, UK; Department of Physiology, Busitema University, Mbale P.O. Box 1966, Uganda
|
11
|
Saak S, Huelsmeier D, Kollmeier B, Buhl M. A flexible data-driven audiological patient stratification method for deriving auditory profiles. Front Neurol 2022; 13:959582. [PMID: 36188360 PMCID: PMC9520582 DOI: 10.3389/fneur.2022.959582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 08/11/2022] [Indexed: 11/13/2022]
Abstract
For characterizing the complexity of hearing deficits, it is important to consider different aspects of auditory functioning in addition to the audiogram. For this purpose, extensive test batteries have been developed aiming to cover all relevant aspects as defined by experts or model assumptions. However, as the assessment time of physicians is limited, such test batteries are often not used in clinical practice. Instead, fewer measures are used, which vary across clinics. This study aimed at proposing a flexible data-driven approach for characterizing distinct patient groups (patient stratification into auditory profiles) based on one prototypical database (N = 595) containing audiogram data, loudness scaling, speech tests, and anamnesis questions. To further maintain the applicability of the auditory profiles in clinical routine, we built random forest classification models based on a reduced set of audiological measures which are often available in clinics. Different parameterizations regarding binarization strategy, cross-validation procedure, and evaluation metric were compared to determine the optimum classification model. Our data-driven approach, involving model-based clustering, resulted in a set of 13 patient groups, which serve as auditory profiles. The 13 auditory profiles separate patients within certain ranges across audiological measures and are audiologically plausible. Both a normal hearing profile and profiles with varying extents of hearing impairments are defined. Further, a random forest classification model with a combination of a one-vs.-all and one-vs.-one binarization strategy, 10-fold cross-validation, and the kappa evaluation metric was determined as the optimal model. With the selected model, patients can be classified into 12 of the 13 auditory profiles with adequate precision (mean across profiles = 0.9) and sensitivity (mean across profiles = 0.84). 
The proposed approach, consequently, allows the generation of audiologically plausible and interpretable, data-driven clinical auditory profiles, providing an efficient way of characterizing hearing deficits while maintaining clinical applicability. The method should by design be applicable to all audiological data sets from clinics or research, and in addition be flexible enough to summarize information across databases by means of profiles, as well as to expand the approach toward aided measurements, fitting parameters, and further information from databases.
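The evaluation quantities named in this abstract (the kappa metric, and precision and sensitivity per profile) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code; the function names and the toy profile labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one label ever occurs
    return (observed - expected) / (1.0 - expected)


def precision_and_sensitivity(y_true, y_pred, label):
    """One-vs-all precision and sensitivity (recall) for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else float("nan")
    sensitivity = tp / actual if actual else float("nan")
    return precision, sensitivity


# Hypothetical profile labels, for illustration only.
true_profiles = ["P1", "P1", "P2", "P2", "P3", "P3"]
pred_profiles = ["P1", "P1", "P2", "P3", "P3", "P3"]
print(cohen_kappa(true_profiles, pred_profiles))
print(precision_and_sensitivity(true_profiles, pred_profiles, "P3"))
```

Averaging the per-label precision and sensitivity over all labels gives the kind of "mean across profiles" figure the abstract reports.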
|
12
|
Application of data augmentation techniques towards metabolomics. Comput Biol Med 2022; 148:105916. [DOI: 10.1016/j.compbiomed.2022.105916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Revised: 07/11/2022] [Accepted: 07/23/2022] [Indexed: 11/22/2022]
|
13
|
Vision for Improving Pregnancy Health: Innovation and the Future of Pregnancy Research. Reprod Sci 2022; 29:2908-2920. [PMID: 35534766 PMCID: PMC9537127 DOI: 10.1007/s43032-022-00951-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/04/2022] [Accepted: 04/15/2022] [Indexed: 10/25/2022]
Abstract
Understanding, predicting, and preventing pregnancy disorders have been a major research target. Nonetheless, the lack of progress is illustrated by research results related to preeclampsia and other hypertensive pregnancy disorders. These remain a major cause of maternal and infant mortality worldwide. There is a general consensus that the rate of progress toward understanding pregnancy disorders lags behind progress in other aspects of human health. In this presentation, we advance an explanation for this failure and suggest solutions. We propose that progress has been impeded by narrowly focused research training and limited imagination and innovation, resulting in the failure to think beyond conventional research approaches and analytical strategies. Investigations have been largely limited to hypothesis-generating approaches constrained by attempts to force poorly defined complex disorders into a single "unifying" hypothesis. Future progress could be accelerated by rethinking this approach. We advise taking advantage of innovative approaches that will generate new research strategies for investigating pregnancy abnormalities. Studies should begin before conception, assessing pregnancy longitudinally, before, during, and after pregnancy. Pregnancy disorders should be defined by pathophysiology rather than phenotype, and state-of-the-art agnostic assessment of data should be adopted to generate new ideas. Taking advantage of new approaches mandates emphasizing innovation, inclusion of large datasets, and use of state-of-the-art experimental and analytical techniques. A revolution in understanding pregnancy-associated disorders will depend on networks of scientists who are driven by an intense biological curiosity, a team spirit, and the tools to make new discoveries.
|
14
|
Danieli MG, Tonacci A, Paladini A, Longhi E, Moroncini G, Allegra A, Sansone F, Gangemi S. A machine learning analysis to predict the response to intravenous and subcutaneous immunoglobulin in inflammatory myopathies. A proposal for a future multi-omics approach in autoimmune diseases. Autoimmun Rev 2022; 21:103105. [PMID: 35452850 DOI: 10.1016/j.autrev.2022.103105] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 04/06/2022] [Accepted: 04/18/2022] [Indexed: 02/07/2023]
Abstract
OBJECTIVE To evaluate the response to treatment with intravenous (IVIg) and subcutaneous (20%SCIg) immunoglobulin in our series of patients with inflammatory idiopathic myopathies (IIM) by means of artificial intelligence. BACKGROUND IIM are rare diseases mainly involving the skeletal muscle, with particular clinical, laboratory and radiological characteristics. Artificial intelligence (AI) refers to computer processes that allow complex calculations and data analyses to be performed with minimal human intervention. Recently, the use of AI in medicine has expanded significantly, especially through machine learning (ML), which analyses huge amounts of information and makes decisions accordingly, and deep learning (DL), which uses artificial neural networks to analyse data and learn automatically. METHODS In this study, we employed AI in the evaluation of the response to treatment with IVIg and 20%SCIg in our series of patients with IIM. The diagnoses were based on the established EULAR/ACR criteria. The treatment response was evaluated using the following: serum creatine kinase levels, muscle strength (MMT8 score), disease activity (MITAX score) and disability (HAQ-DI score). We evaluated all the above parameters by applying, in R, different supervised ML algorithms, including Least Absolute Shrinkage and Selection Operator, Ridge, Elastic Net, Classification and Regression Trees and Random Forest, to estimate the most important predictors of a good response to IVIg and 20%SCIg treatment. RESULTS AND CONCLUSION By means of AI we have been able to identify the scores that best predict a good response to IVIg and 20%SCIg treatment. Muscle strength, as evaluated by the MMT8 score at follow-up, is predicted by the presence of dysphagia and of skin disorders, and by the myositis activity index (MITAX) at the beginning of treatment. 
The relationship between muscle strength and MITAX indicates a better action of IVIg therapy in patients with more active systemic disease. Considering our results, Elastic Net and similar approaches were seen to be the most viable, efficient, and effective ML methods for predicting the clinical outcome (MMT8 and MITAX at most) in myositis.
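To make the penalised-regression family named in this abstract concrete, here is a minimal pure-Python sketch of Elastic Net fitted by coordinate descent. It is not the authors' analysis, which was carried out in R; the function names, the `lam` and `alpha` parameters, and the toy data are assumptions. The sketch assumes a centred response, no intercept, and the objective (1/2n)·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²), so that `alpha=1` gives the LASSO and `alpha=0` gives Ridge.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate-descent Elastic Net for a centred response, no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
            beta[j] = new_beta
    return beta


# Toy data for illustration only: one strong and one weak predictor.
X_toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_toy = [2.0, -2.0, 0.1, -0.1]
print(elastic_net(X_toy, y_toy, lam=0.2, alpha=0.5))
```

With a non-zero L1 share, the weak predictor's coefficient is driven exactly to zero while the strong one is only shrunk, which is the property that makes this family usable for ranking predictors.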
Affiliation(s)
- Maria Giovanna Danieli: Clinica Medica, Dipartimento di Scienze Cliniche e Molecolari, Università Politecnica delle Marche, via Tronto 10/A, 60126 Torrette di Ancona, Italy; Postgraduate School of Allergy and Clinical Immunology, Università Politecnica delle Marche, via Tronto 10/A, 60126 Ancona, Italy
- Alessandro Tonacci: Institute of Clinical Physiology, National Research Council of Italy (IFC-CNR), Via G. Moruzzi 1, 56124 Pisa, Italy
- Alberto Paladini: PostGraduate School of Internal Medicine, Università Politecnica delle Marche, via Tronto 10/A, 60126 Ancona, Italy
- Eleonora Longhi: Scuola di Medicina e Chirurgia, Alma Mater Studiorum, Università degli Studi di Bologna, 40126 Bologna, Italy
- Gianluca Moroncini: Clinica Medica, Dipartimento di Scienze Cliniche e Molecolari, Università Politecnica delle Marche, via Tronto 10/A, 60126 Torrette di Ancona, Italy; PostGraduate School of Internal Medicine, Università Politecnica delle Marche, via Tronto 10/A, 60126 Ancona, Italy
- Alessandro Allegra: Division of Haematology, Department of Human Pathology in Adulthood and Childhood "Gaetano Barresi", University of Messina, Via Consolare Valeria 1, 98125 Messina, Italy
- Francesco Sansone: Institute of Clinical Physiology, National Research Council of Italy (IFC-CNR), Via G. Moruzzi 1, 56124 Pisa, Italy
- Sebastiano Gangemi: School and Operative Unit of Allergy and Clinical Immunology, Department of Clinical and Experimental Medicine, University of Messina, Via Consolare Valeria 1, 98125 Messina, Italy
|