1
Mylona E, Zaridis DI, Kalantzopoulos CN, Tachos NS, Regge D, Papanikolaou N, Tsiknakis M, Marias K, Fotiadis DI. Optimizing radiomics for prostate cancer diagnosis: feature selection strategies, machine learning classifiers, and MRI sequences. Insights Imaging 2024; 15:265. [PMID: 39495422 PMCID: PMC11535140 DOI: 10.1186/s13244-024-01783-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Accepted: 06/27/2024] [Indexed: 11/05/2024] Open
Abstract
OBJECTIVES Radiomics-based analyses encompass multiple steps, leading to ambiguity regarding the optimal approaches for enhancing model performance. This study compares the effect of several feature selection methods, machine learning (ML) classifiers, and sources of radiomic features on model performance for the diagnosis of clinically significant prostate cancer (csPCa) from bi-parametric MRI.
METHODS Two multi-centric datasets, with 465 and 204 patients, respectively, were used to extract 1246 radiomic features per patient and MRI sequence. Ten feature selection methods (including Boruta, mRMRe, ReliefF, recursive feature elimination (RFE), random forest (RF) variable importance, and L1-lasso), four ML classifiers (SVM, RF, LASSO, and boosted generalized linear model (GLM)), and three sets of radiomic features (derived from T2w images, ADC maps, and their combination) were used to develop predictive models of csPCa. Their performance was evaluated in a nested cross-validation and externally, using seven performance metrics.
RESULTS In total, 480 models were developed. In nested cross-validation, the best model combined Boruta with boosted GLM (AUC = 0.71, F1 = 0.76). In external validation, the best model combined L1-lasso with boosted GLM (AUC = 0.71, F1 = 0.47). Overall, Boruta, RFE, L1-lasso, and RF variable importance were the top-performing feature selection methods, while the choice of ML classifier did not significantly affect the results. The ADC-derived features showed the highest discriminatory power, T2w-derived features were less informative, and their combination did not improve performance.
CONCLUSION The choice of feature selection method and the source of radiomic features have a profound effect on model performance for csPCa diagnosis.
CRITICAL RELEVANCE STATEMENT This work may guide future radiomic research, paving the way for the development of more effective and reliable radiomic models, not only for advancing prostate cancer diagnostic strategies but also for informing broader applications of radiomics in different medical contexts.
KEY POINTS Radiomics is a growing field that can still be optimized. Feature selection method impacts radiomics models' performance more than ML algorithms. Best feature selection methods: RFE, LASSO, RF, and Boruta. ADC-derived radiomic features yield more robust models compared to T2w-derived radiomic features.
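The nested cross-validation mentioned in the abstract above can be illustrated with a minimal index-splitting sketch. This is not the authors' pipeline: the function names `k_fold_indices` and `nested_cv_splits`, the fold counts, and the round-robin fold assignment are all assumptions made for illustration. The point it shows is that feature selection and hyperparameter tuning would only ever see the inner splits, which are built from the outer training indices, so the outer test fold stays untouched.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(indices: List[int], k: int) -> List[Split]:
    """Split `indices` into k (train, test) pairs using round-robin folds."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def nested_cv_splits(
    n_samples: int, outer_k: int, inner_k: int
) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Hypothetical helper: inner splits are drawn only from outer_train, so
    feature selection and tuning never touch the outer test indices.
    """
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):
        yield outer_train, outer_test, k_fold_indices(outer_train, inner_k)


if __name__ == "__main__":
    for outer_train, outer_test, inner in nested_cv_splits(20, outer_k=5, inner_k=3):
        print(len(outer_train), len(outer_test), [len(v) for _, v in inner])
```

In a real study the folds would normally be stratified by class label and shuffled with a fixed seed; both are omitted here to keep the sketch short.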
Affiliation(s)
- Eugenia Mylona
  - Biomedical Research Institute, FORTH, GR 45110, Ioannina, Greece
  - Unit of Medical Technology Intelligent Information Systems, University of Ioannina, Ioannina, Greece
- Dimitrios I Zaridis
  - Biomedical Research Institute, FORTH, GR 45110, Ioannina, Greece
  - Unit of Medical Technology Intelligent Information Systems, University of Ioannina, Ioannina, Greece
  - Biomedical Engineering Laboratory, School of Electrical & Computer Engineering, National Technical University of Athens, Athens, Greece
- Charalampos N Kalantzopoulos
  - Biomedical Research Institute, FORTH, GR 45110, Ioannina, Greece
  - Unit of Medical Technology Intelligent Information Systems, University of Ioannina, Ioannina, Greece
- Nikolaos S Tachos
  - Biomedical Research Institute, FORTH, GR 45110, Ioannina, Greece
  - Unit of Medical Technology Intelligent Information Systems, University of Ioannina, Ioannina, Greece
- Daniele Regge
  - Department of Radiology, Candiolo Cancer Institute, FPO-IRCCS, Candiolo, Italy
- Manolis Tsiknakis
  - Computational Biomedicine Laboratory, Institute of Computer Science, FORTH, GR 70013, Heraklion, Greece
  - Department of Electrical and Computer Engineering, Hellenic Mediterranean University, GR 71004, Heraklion, Greece
- Kostas Marias
  - Computational Biomedicine Laboratory, Institute of Computer Science, FORTH, GR 70013, Heraklion, Greece
  - Department of Electrical and Computer Engineering, Hellenic Mediterranean University, GR 71004, Heraklion, Greece
- Dimitrios I Fotiadis
  - Biomedical Research Institute, FORTH, GR 45110, Ioannina, Greece.
  - Unit of Medical Technology Intelligent Information Systems, University of Ioannina, Ioannina, Greece.
2
Bao Z, Tom G, Cheng A, Watchorn J, Aspuru-Guzik A, Allen C. Towards the prediction of drug solubility in binary solvent mixtures at various temperatures using machine learning. J Cheminform 2024; 16:117. [PMID: 39468626 PMCID: PMC11520512 DOI: 10.1186/s13321-024-00911-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 09/28/2024] [Indexed: 10/30/2024] Open
Abstract
Drug solubility is an important parameter in the drug development process, yet it is often tedious and challenging to measure, especially for expensive drugs or those available in small quantities. To alleviate these challenges, machine learning (ML) has been applied to predict drug solubility as an alternative approach. However, the majority of existing ML research has focused on predicting aqueous solubility and/or solubility at specific temperatures, which restricts model applicability in pharmaceutical development. To bridge this gap, we compiled a dataset of 27,000 solubility datapoints, including solubility of small molecules measured in a range of binary solvent mixtures at various temperatures. Next, a panel of ML models was trained on this dataset, with hyperparameters tuned using Bayesian optimization. The resulting top-performing models, both gradient boosted decision trees (light gradient boosting machine and extreme gradient boosting), achieved mean absolute errors (MAE) of 0.33 for LogS (S in g/100 g) on the holdout set. These models were further validated through a prospective study, in which the solubility of four drug molecules was predicted by the models and then validated with in-house solubility experiments. This prospective study demonstrated that the models accurately predicted the solubility of solutes in specific binary solvent mixtures at different temperatures, especially for drugs whose features closely align with those of the solutes in the dataset (MAE < 0.5 for LogS). To support future research and facilitate advancements in the field, we have made the dataset and code openly available.
Scientific contribution Our research advances the state of the art in predicting solubility for small molecules by leveraging ML and a uniquely comprehensive dataset. Unlike existing ML studies that predominantly focus on solubility in aqueous solvents at fixed temperatures, our work enables prediction of drug solubility in a variety of binary solvent mixtures over a broad temperature range, providing practical insights on the modeling of solubility for realistic pharmaceutical applications. These advancements, along with the open-access dataset and code, support significant steps in the drug development process, including new molecule discovery, drug analysis, and formulation.
Affiliation(s)
- Zeqing Bao
  - Leslie Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, M5S 3M2, Canada
- Gary Tom
  - Department of Chemistry, University of Toronto, Toronto, ON, M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
- Austin Cheng
  - Department of Chemistry, University of Toronto, Toronto, ON, M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, Toronto, ON, M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
  - Acceleration Consortium, Toronto, ON, M5S 3H6, Canada
  - Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), Toronto, ON, M5S 1M1, Canada
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, M5S 3E5, Canada
  - Department of Materials Science and Engineering, University of Toronto, Toronto, ON, M5S 3E4, Canada
  - CIFAR Artificial Intelligence Research Chair, Vector Institute, Toronto, ON, M5S 1M1, Canada
- Christine Allen
  - Leslie Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, M5S 3M2, Canada.
  - Acceleration Consortium, Toronto, ON, M5S 3H6, Canada.
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, M5S 3E5, Canada.
3
Madakkatel I, Hyppönen E. LLpowershap: logistic loss-based automated Shapley values feature selection method. BMC Med Res Methodol 2024; 24:247. [PMID: 39448895 PMCID: PMC11515487 DOI: 10.1186/s12874-024-02370-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 10/14/2024] [Indexed: 10/26/2024] Open
Abstract
BACKGROUND Shapley values have been used extensively in machine learning, not only to explain black-box models but also, among other tasks, to conduct model debugging, sensitivity and fairness analyses, and to select important features for robust modelling and follow-up analyses. Shapley values satisfy certain axioms that promote fairness in distributing the contributions of features toward prediction or error reduction, after accounting for non-linear relationships and interactions when complex machine learning models are employed. Recently, feature selection methods using predictive Shapley values and p-values have been introduced, including powershap.
METHODS We present a novel feature selection method, LLpowershap, that builds on these recent advances by employing loss-based Shapley values to identify informative features with minimal noise among the selected sets of features. We also enhance the calculation of p-values and power to identify informative features and to estimate the number of iterations of model development and testing.
RESULTS Our simulation results show that LLpowershap not only identifies a higher number of informative features but also outputs fewer noise features compared to other state-of-the-art feature selection methods. Benchmarking results on four real-world datasets demonstrate higher or comparable predictive performance of LLpowershap compared to other Shapley-based wrapper methods and filter methods. LLpowershap also achieves the best mean ranking among the seven feature selection methods tested on the benchmark datasets.
CONCLUSION Our results demonstrate that LLpowershap is a viable wrapper feature selection method that can be used for feature selection in large biomedical datasets and other settings.
Affiliation(s)
- Iqbal Madakkatel
  - Australian Centre for Precision Health, Unit of Clinical and Health Sciences, University of South Australia, Adelaide, 5001, South Australia, Australia.
  - South Australian Health and Medical Research Institute (SAHMRI), Adelaide, 5001, South Australia, Australia.
- Elina Hyppönen
  - Australian Centre for Precision Health, Unit of Clinical and Health Sciences, University of South Australia, Adelaide, 5001, South Australia, Australia
  - South Australian Health and Medical Research Institute (SAHMRI), Adelaide, 5001, South Australia, Australia
4
Chan HTJ, Veas E. Importance estimate of features via analysis of their weight and gradient profile. Sci Rep 2024; 14:23532. [PMID: 39384831 PMCID: PMC11464895 DOI: 10.1038/s41598-024-72640-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Accepted: 09/09/2024] [Indexed: 10/11/2024] Open
Abstract
Understanding what is important and what is redundant within data can improve the modelling process of neural networks by reducing unnecessary model complexity, training time and memory storage. This information, however, is not always available a priori, nor is it trivial to obtain from neural networks. Existing feature selection methods utilise the internal workings of a neural network for selection, but further analysis and interpretation of the input features' significance is often limited. We propose an extension that estimates the significance of features by analysing the gradient descent of a pairwise layer within a model. The changes that occur in the weights and gradients throughout training provide a profile that can be used to better understand the importance hierarchy between the features for ranking and feature selection. Additionally, this method is transferable to existing fully or partially trained models, which is beneficial for understanding existing or active models. The proposed approach is demonstrated empirically in a study that uses benchmark datasets from libraries such as MNIST and scikit-feat, as well as a simulated dataset and an applied real-world dataset. The results are verified against the ground truth where available and, where not, via a comparison with fundamental feature selection methods, which include existing statistical and embedded neural-network-based feature selection methods, through the methodology of Reduce and Retrain.
Affiliation(s)
- Ho Tung Jeremy Chan
  - Interactive System and Data Science, Graz University of Technology, 8010, Graz, Austria.
  - Human AI Interaction, Know Center GmbH, 8010, Graz, Austria.
- Eduardo Veas
  - Interactive System and Data Science, Graz University of Technology, 8010, Graz, Austria
  - Human AI Interaction, Know Center GmbH, 8010, Graz, Austria
5
Sadeghpour A, Badal VD, Pogge DL, O'Donohue EM, Bigdeli T, Harvey PD. Using machine learning modeling to identify childhood abuse victims on the basis of personality inventory responses. J Psychiatr Res 2024; 180:8-15. [PMID: 39366273 DOI: 10.1016/j.jpsychires.2024.09.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2024] [Revised: 07/31/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Trauma is very common and associated with significant co-morbidity world-wide, particularly PTSD and frequently other mental health disorders. However, it can be challenging to identify victims of abuse, as self-reports can be difficult to elicit due to emotional distress. Better confirmation of a history of significant mistreatment can assist significantly in treatment planning. We evaluate an alternate approach based on machine-learning techniques applied to personality inventory data (Minnesota Multiphasic Personality Inventory, Adolescent Version; MMPI-A) obtained concurrently to examine convergence with reports of past trauma exposure. The Childhood Trauma Questionnaire (CTQ) was administered to 733 child and adolescent inpatients. Statistical and information-theory measures showed that each type of abuse (sexual, physical, and emotional) had a unique "fingerprint" of MMPI-A profiles. In contrast to our previous findings in terms of specific correlations with IQ, individuals positive for sexual abuse had the fewest MMPI-A elevations, followed by physical abuse, while those reporting emotional abuse had the greatest number of elevations. We developed an initial machine learning (ML) classifier for predicting a history of abuse that demonstrates sensitivity equivalent to that of other widely used screening measures. In addition, we show via PCA and cluster analysis that the different levels of severity of emotional abuse present with unique mixtures of personality trait characteristics. Thus, this type of ML-mediated analysis could permit at-scale detection of those at potentially high risk of a history of abuse by use of real-time information, using a variety of nontransparent data sources.
Affiliation(s)
- Angelo Sadeghpour
  - University of Miami Miller School of Medicine, 1600 NW 10th Ave, Miami, FL, 33136, USA; Research Service, Bruce W. Carter VA Medical Center, 1201 NW 16th St, Miami, FL, 33125, USA.
- Varsha D Badal
  - University of California San Diego, Department of Psychiatry, 9500 Gilman Dr, La Jolla, CA, 92093, USA.
- David L Pogge
  - Four Winds Hospital, 800 Cross River Rd, Katonah, NY, 10536, USA; Fairleigh Dickinson University, 1000 River Rd, Teaneck, NJ, 07666, USA.
- Elizabeth M O'Donohue
  - Four Winds Hospital, 800 Cross River Rd, Katonah, NY, 10536, USA; University of Toledo, 2801 Bancroft St, Toledo, OH, 43606, USA.
- Tim Bigdeli
  - SUNY Downstate Medical Center, 2801 Bancroft St, Toledo, OH, 43606, USA; New York Harbor VA Health Services Organization, 423 E 23rd St, New York, NY, 10010, USA.
- Philip D Harvey
  - University of Miami Miller School of Medicine, 1600 NW 10th Ave, Miami, FL, 33136, USA; Research Service, Bruce W. Carter VA Medical Center, 1201 NW 16th St, Miami, FL, 33125, USA.
6
Kim JI, Manuele A, Maguire F, Zaheer R, McAllister TA, Beiko RG. Identification of key drivers of antimicrobial resistance in Enterococcus using machine learning. Can J Microbiol 2024; 70:446-460. [PMID: 39079170 DOI: 10.1139/cjm-2024-0049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/03/2024]
Abstract
With antimicrobial resistance (AMR) rapidly evolving in pathogens, quick and accurate identification of genetic determinants of phenotypic resistance is essential for improving surveillance, stewardship, and clinical mitigation. Machine learning (ML) models show promise for AMR prediction in diagnostics but require a deep understanding of internal processes to use effectively. Our study utilised AMR gene, pangenomic, and predicted plasmid features from 647 Enterococcus faecium and Enterococcus faecalis genomes across the One Health continuum, along with corresponding resistance phenotypes, to develop interpretive ML classifiers. Vancomycin resistance could be predicted with 99% accuracy with AMR gene features, 98% with pangenome features, and 96% with plasmid clusters. Top pangenome features overlapped with the resistance genes of the vanA operon, which are often laterally transmitted via plasmids. Doxycycline resistance prediction achieved approximately 92% accuracy with pangenome features, with the top feature being elements of Tn916 conjugative transposon, a tet(M) carrier. Erythromycin resistance prediction models achieved about 90% accuracy, but top features were negatively correlated with resistance due to the confounding effect of population structure. This work demonstrates the importance of reviewing ML models' features to discern biological relevance even when achieving high-performance metrics. Our workflow offers the potential to propose hypotheses for experimental testing, enhancing the understanding of AMR mechanisms, which are crucial for combating the AMR crisis.
Affiliation(s)
- Jee In Kim
  - Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
  - Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
  - Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
- Alexander Manuele
  - Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
  - Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
- Finlay Maguire
  - Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
  - Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
  - Department of Community Health and Epidemiology, Dalhousie University, Faculty of Medicine, Halifax, NS, Canada
- Rahat Zaheer
  - Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
- Robert G Beiko
  - Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
  - Institute for Comparative Genomics, Dalhousie University, Halifax, NS, Canada
7
Robert Vincent ACS, Sengan S. Effective clinical decision support implementation using a multi filter and wrapper optimisation model for Internet of Things based healthcare data. Sci Rep 2024; 14:21820. [PMID: 39294200 PMCID: PMC11410983 DOI: 10.1038/s41598-024-71726-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Accepted: 08/30/2024] [Indexed: 09/20/2024] Open
Abstract
Feature Selection (FS) is essential in Internet of Things (IoT)-based Clinical Decision Support Systems (CDSS) to improve the accuracy and efficiency of the system. With the increasing number of sensors and devices used in healthcare, the volume of data generated is vast and complex. Selecting relevant features from these data is crucial for reducing computational overhead, improving the system's interpretability, and enhancing the quality of the Decision-Making System (DMS). FS also helps address data redundancy and noise, which can negatively impact the system's performance, and is therefore critical to developing practical and dependable CDSS in IoT-based healthcare sectors. This research proposes a two-phase FS model. Phase-I employs an ensemble of five Filter Methods (FM), followed by a Pearson Correlation Method (PCM). Phase-II uses the Binary Optimized Genetic Grey Wolf Optimization Algorithm (BOGGWOA) as a Wrapper Method (WM). The proposed model integrates the most valuable features identified by each filter. It then uses the Pearson Correlation Coefficient (PCC) to remove features that are not needed, a Support Vector Machine (SVM) to estimate classification accuracy (CA), and BOGGWOA as the WM to select the most essential features with the best CA.
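The Pearson-correlation step of Phase-I can be sketched as a simple redundancy filter. This is an illustrative reading of the abstract, not the authors' code: the helper names `pearson` and `drop_redundant`, the 0.9 threshold, and the assumption that features arrive already ranked by the filter ensemble are all hypothetical. The wrapper phase (BOGGWOA with an SVM) is not shown.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (sd_x * sd_y)


def drop_redundant(
    features: Dict[str, List[float]], ranked: List[str], threshold: float = 0.9
) -> List[str]:
    """Walk features in ranked order; keep one only if its absolute
    correlation with every already-kept feature is below `threshold`."""
    kept: List[str] = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    data = {
        "heart_rate": [1.0, 2.0, 3.0, 4.0, 5.0],
        "heart_rate_x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated copy
        "spo2": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
    print(drop_redundant(data, ["heart_rate", "heart_rate_x2", "spo2"]))
```

Because the filter is greedy, the ranking order decides which member of a correlated pair survives, so the quality of the upstream ranking matters.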
Affiliation(s)
- Sudhakar Sengan
  - Department of Computer Science and Engineering, PSN College of Engineering and Technology, Tirunelveli, Tamil Nadu, 627451, India.
8
Cantor E, Guauque-Olarte S, León R, Chabert S, Salas R. Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data. BioData Min 2024; 17:34. [PMID: 39256872 PMCID: PMC11389072 DOI: 10.1186/s13040-024-00388-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Accepted: 09/02/2024] [Indexed: 09/12/2024] Open
Abstract
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) represents a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF is a combination of two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connection and localization on a protein-protein interaction network. Then, each relevance is used to modify the selection probability to draw a gene as a candidate split-feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30) comparing knowledge-slanted RF against conventional RF and logistic lasso regression suggest improved precision in outcome prediction compared to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case to identify genes relevant to calcific aortic valve stenosis.
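The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function names, the restart probability of 0.3, the toy four-gene network, and the choice to sample candidates without replacement in proportion to the walk scores are all hypothetical. The first function runs a random walk with restart over an adjacency list; the second uses the resulting scores as weights when drawing candidate split-features.

```python
import random
from typing import Dict, List


def random_walk_with_restart(
    adj: Dict[str, List[str]],
    seeds: List[str],
    restart: float = 0.3,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence,
    where W spreads each node's mass equally over its neighbours."""
    nodes = list(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            walk_mass = (1.0 - restart) * p[u]
            if not adj[u]:
                nxt[u] += walk_mass  # isolated node keeps its mass (assumption)
                continue
            for v in adj[u]:
                nxt[v] += walk_mass / len(adj[u])
        converged = max(abs(nxt[v] - p[v]) for v in nodes) < tol
        p = nxt
        if converged:
            break
    return p


def sample_candidate_genes(
    scores: Dict[str, float], mtry: int, rng: random.Random
) -> List[str]:
    """Draw `mtry` distinct genes with probability proportional to score.
    Assumes the remaining scores are never all zero."""
    pool = dict(scores)
    chosen: List[str] = []
    for _ in range(min(mtry, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


if __name__ == "__main__":
    network = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
    relevance = random_walk_with_restart(network, seeds=["A"])
    print(relevance)
    print(sample_candidate_genes(relevance, mtry=2, rng=random.Random(0)))
```

Genes close to the seed set receive higher scores and are therefore proposed as split candidates more often, which is one plausible way to "slant" a forest toward prior knowledge; a conventional RF corresponds to uniform weights.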
Affiliation(s)
- Erika Cantor
  - Department of clinical epidemiology and biostatistics, Pontificia Universidad Javeriana, Bogotá, 110221, Colombia.
- Sandra Guauque-Olarte
  - Department of basic sciences and oral medicine, Universidad Nacional de Colombia, Bogotá, 16486, Colombia
- Roberto León
  - Department of Computer Science, Universidad Técnica Federico Santa María, Santiago de Chile, 8940897, Chile
- Steren Chabert
  - School of Biomedical Engineering, Universidad de Valparaiso, Valparaíso, 2360102, Chile
  - Millennium Science Initiative Intelligent Healthcare Engineering, Santiago de Chile, 7820436, Chile
  - Center of Interdisciplinary Biomedical and Engineering Research for Health - MEDING, Universidad de Valparaiso, Valparaíso, 2360102, Chile
- Rodrigo Salas
  - School of Biomedical Engineering, Universidad de Valparaiso, Valparaíso, 2360102, Chile
  - Millennium Science Initiative Intelligent Healthcare Engineering, Santiago de Chile, 7820436, Chile
  - Center of Interdisciplinary Biomedical and Engineering Research for Health - MEDING, Universidad de Valparaiso, Valparaíso, 2360102, Chile
9
Xu W, Huang M, Jiang Z, Qian Y. Graph-Based Unsupervised Feature Selection for Interval-Valued Information System. IEEE Trans Neural Netw Learn Syst 2024; 35:12576-12589. [PMID: 37067967 DOI: 10.1109/tnnls.2023.3263684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Feature selection has become one of the hot research topics in the era of big data. At the same time, as an extension of single-valued data, interval-valued data with its inherent uncertainty tends to be more applicable than single-valued data in some fields for characterizing inaccurate and ambiguous information, such as medical test results and qualified product indicators. However, there are relatively few studies on unsupervised attribute reduction for interval-valued information systems (IVISs), and it remains to be studied how to effectively control the dramatic increase in time cost when selecting features from large-sample datasets. For these reasons, we propose a feature selection method for IVISs based on graph theory. The model complexity can then be greatly reduced by using the properties of the matrix power series to optimize the calculation of the original model. Our approach can be divided into two steps. The first is feature ranking based on the principles of relevance and nonredundancy, and the second is selecting the top-ranked attributes when the number of features to keep is fixed a priori. In this article, experiments are performed on 14 public datasets with seven comparative algorithms. The results of the experiments verify that our algorithm is effective and efficient for feature selection in IVISs.
10
Vahed SZ, Khatibi SMH, Saadat YR, Emdadi M, Khodaei B, Alishani MM, Boostani F, Dizaj SM, Pirmoradi S. Introducing effective genes in lymph node metastasis of breast cancer patients using SHAP values based on the mRNA expression data. PLoS One 2024; 19:e0308531. [PMID: 39150915 PMCID: PMC11329117 DOI: 10.1371/journal.pone.0308531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Accepted: 07/24/2024] [Indexed: 08/18/2024] Open
Abstract
OBJECTIVE Breast cancer, a global concern predominantly impacting women, poses a significant threat when not identified early. While survival rates for breast cancer patients are typically favorable, the emergence of regional metastases markedly diminishes survival prospects. Detecting metastases and comprehending their molecular underpinnings are crucial for tailoring effective treatments and improving patient survival outcomes.
METHODS Various artificial intelligence methods and techniques were employed in this study to achieve accurate outcomes. Initially, the data were organized and underwent hold-out cross-validation, data cleaning, and normalization. Subsequently, feature selection was conducted using ANOVA and binary Particle Swarm Optimization (PSO). During the analysis phase, the discriminative power of the selected features was evaluated using machine learning classification algorithms. Finally, the SHAP algorithm was applied to the selected features to identify the most significant ones and thereby improve the decoding of the dominant molecular mechanisms in lymph node metastases.
RESULTS In this study, five main steps were followed for the analysis of mRNA expression data: reading, preprocessing, feature selection, classification, and the SHAP algorithm. The RF classifier utilized the candidate mRNAs to differentiate between negative and positive categories with an accuracy of 61% and an AUC of 0.6. During the SHAP process, intriguing relationships between the selected mRNAs and positive/negative lymph node status were discovered. The results indicate that GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2 are the five most impactful mRNAs based on their SHAP values.
CONCLUSION The prominent identified mRNAs, including GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2, are implicated in lymph node metastasis. This study holds promise for providing thorough insight into key candidate genes that could significantly impact early detection and tailored therapeutic strategies for lymph node metastasis in patients with breast cancer.
Affiliation(s)
- Seyed Mahdi Hosseiniyan Khatibi
  - Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran
- Manijeh Emdadi
  - Department of Computer Engineering, Abadan Branch, Islamic Azad University, Abadan, Iran
- Bahareh Khodaei
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Matin Alishani
  - Department of Computer Science, Faculty of Information Technology, University of Shahid Madani of Tabriz, Tabriz, Iran
- Farnaz Boostani
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Solmaz Maleki Dizaj
  - Dental and Periodontal Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Saeed Pirmoradi
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
11
Huang X, Xie X, Huang S, Wu S, Huang L. Predicting non-chemotherapy drug-induced agranulocytosis toxicity through ensemble machine learning approaches. Front Pharmacol 2024; 15:1431941. [PMID: 39206259 PMCID: PMC11349714 DOI: 10.3389/fphar.2024.1431941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 08/02/2024] [Indexed: 09/04/2024] Open
Abstract
Agranulocytosis, induced by non-chemotherapy drugs, is a serious medical condition that presents a formidable challenge in predictive toxicology due to its idiosyncratic nature and complex mechanisms. In this study, we assembled a dataset of 759 compounds and applied a rigorous feature selection process prior to employing ensemble machine learning classifiers to forecast non-chemotherapy drug-induced agranulocytosis (NCDIA) toxicity. The balanced bagging classifier combined with a gradient boosting decision tree (BBC + GBDT), utilizing the combined descriptor set of DS and RDKit comprising 237 features, emerged as the top-performing model, with an external validation AUC of 0.9164, ACC of 83.55%, and MCC of 0.6095. The model's predictive reliability was further substantiated by an applicability domain analysis. Feature importance, assessed through permutation importance within the BBC + GBDT model, highlighted key molecular properties that significantly influence NCDIA toxicity. Additionally, 16 structural alerts identified by SARpy software further revealed potential molecular signatures associated with toxicity, enriching our understanding of the underlying mechanisms. We also applied the constructed models to assess the NCDIA toxicity of novel drugs approved by FDA. This study advances predictive toxicology by providing a framework to assess and mitigate agranulocytosis risks, ensuring the safety of pharmaceutical development and facilitating post-market surveillance of new drugs.
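The ACC and MCC figures reported above are both derived from a binary confusion matrix. The sketch below shows the standard definitions on toy labels; the function names and the data are illustrative assumptions and do not reproduce the study's predictions.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical): 1 = toxic, 0 = non-toxic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), mcc(*counts))
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes are imbalanced.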
Affiliation(s)
- Xiaojie Huang
- Department of Clinical Pharmacy, Jieyang People’s Hospital, Jieyang, China
|
12
|
Mohtasham F, Pourhoseingholi M, Hashemi Nazari SS, Kavousi K, Zali MR. Comparative analysis of feature selection techniques for COVID-19 dataset. Sci Rep 2024; 14:18627. [PMID: 39128991 PMCID: PMC11317481 DOI: 10.1038/s41598-024-69209-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 08/01/2024] [Indexed: 08/13/2024] Open
Abstract
In the context of early disease detection, machine learning (ML) has emerged as a vital tool. Feature selection (FS) algorithms play a crucial role in ensuring the accuracy of predictive models by identifying the most influential variables. This study, focusing on a retrospective cohort of 4778 COVID-19 patients from Iran, explores the performance of various FS methods, including filter, embedded, and hybrid approaches, in predicting mortality outcomes. The researchers leveraged 115 routine clinical, laboratory, and demographic features and employed 13 ML models to assess the effectiveness of these FS methods based on classification accuracy, predictive accuracy, and statistical tests. The results indicate that a Hybrid Boruta-VI model combined with the Random Forest algorithm demonstrated superior performance, achieving an accuracy of 0.89, an F1 score of 0.76, and an AUC value of 0.95 on test data. Key variables identified as important predictors of adverse outcomes include age, oxygen saturation levels, albumin levels, neutrophil counts, platelet levels, and markers of kidney function. These findings highlight the potential of advanced FS techniques and ML models in enhancing early disease detection and informing clinical decision-making.
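For readers less familiar with the F1 score and AUC quoted above, here is a small sketch of how the two are computed. The AUC uses the pairwise (Mann-Whitney) formulation, with ties counted as half a win. The scores, the 0.5 threshold, and the function names are assumptions chosen for illustration only.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical): 1 = death, 0 = survival.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
print(f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

AUC is threshold-free, whereas F1 depends on the chosen cut-off, so the two can diverge on the same model.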
Affiliation(s)
- Farideh Mohtasham
- Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
- MohamadAmin Pourhoseingholi
- Hearing Sciences, Mental Health and Clinical Neurosciences, School of Medicine, National Institute for Health and Care Research (NIHR) Nottingham Biomedical Research Center, University of Nottingham, Nottingham, UK
- Seyed Saeed Hashemi Nazari
- Department of Epidemiology, School of Public Health & Safety, Shahid Beheshti University of Medical Sciences (SBMU), Tehran, Iran
- Kaveh Kavousi
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.
- Mohammad Reza Zali
- Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran
|
13
|
Li M, Guo H, Wang K, Kang C, Yin Y, Zhang H. AVBAE-MODFR: A novel deep learning framework of embedding and feature selection on multi-omics data for pan-cancer classification. Comput Biol Med 2024; 177:108614. [PMID: 38796884 DOI: 10.1016/j.compbiomed.2024.108614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 02/27/2024] [Accepted: 05/11/2024] [Indexed: 05/29/2024]
Abstract
Integration analysis of cancer multi-omics data for pan-cancer classification has the potential for clinical applications in various aspects such as tumor diagnosis, analyzing clinically significant features, and providing precision medicine. In these applications, the embedding and feature selection on high-dimensional multi-omics data is clinically necessary. Recently, deep learning algorithms have become the most promising methods for cancer multi-omics integration analysis, owing to their powerful capability of capturing nonlinear relationships. Developing effective deep learning architectures for cancer multi-omics embedding and feature selection remains a challenge for researchers in view of high dimensionality and heterogeneity. In this paper, we propose a novel two-phase deep learning model named AVBAE-MODFR for pan-cancer classification. AVBAE-MODFR achieves embedding by a multi2multi autoencoder based on the adversarial variational Bayes method and further performs feature selection utilizing a dual-net-based feature ranking method. AVBAE-MODFR utilizes AVBAE to pre-train the network parameters, which improves the classification performance and enhances feature ranking stability in MODFR. Firstly, AVBAE learns high-quality representation among multiple omics features for unsupervised pan-cancer classification. We design an efficient discriminator architecture to distinguish the latent distributions for updating forward variational parameters. Secondly, we propose MODFR to simultaneously evaluate multi-omics feature importance for feature selection by training a designed multi2one selector network, where the efficient evaluation approach based on the average gradient of random mask subsets can avoid bias caused by input feature drift. We conduct experiments on the TCGA pan-cancer dataset and compare the model with four state-of-the-art (SOTA) methods for each phase. The results show the superiority of AVBAE-MODFR over the SOTA methods.
Affiliation(s)
- Minghe Li
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Huike Guo
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Keao Wang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Chuanze Kang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Yanbin Yin
- Department of Food Science and Technology, University of Nebraska - Lincoln, NE, USA
- Han Zhang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China.
|
14
|
Zayed A, Belhadj N, Ben Khalifa K, Bedoui MH, Valderrama C. Efficient Generalized Electroencephalography-Based Drowsiness Detection Approach with Minimal Electrodes. Sensors (Basel) 2024; 24:4256. [PMID: 39001037 PMCID: PMC11244425 DOI: 10.3390/s24134256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Revised: 06/21/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024]
Abstract
Drowsiness is a main factor for various costly defects, even fatal accidents in areas such as construction, transportation, industry and medicine, due to the lack of monitoring vigilance in the mentioned areas. The implementation of a drowsiness detection system can greatly help to reduce the defects and accident rates by alerting individuals when they enter a drowsy state. This research proposes an electroencephalography (EEG)-based approach for detecting drowsiness. EEG signals are passed through a preprocessing chain composed of artifact removal and segmentation to ensure accurate detection followed by different feature extraction methods to extract the different features related to drowsiness. This work explores the use of various machine learning algorithms such as Support Vector Machine (SVM), the K nearest neighbor (KNN), the Naive Bayes (NB), the Decision Tree (DT), and the Multilayer Perceptron (MLP) to analyze EEG signals sourced from the DROZY database, carefully labeled into two distinct states of alertness (awake and drowsy). Segmentation into 10 s intervals ensures precise detection, while a relevant feature selection layer enhances accuracy and generalizability. The proposed approach achieves high accuracy rates of 99.84% and 96.4% for intra (subject by subject) and inter (cross-subject) modes, respectively. SVM emerges as the most effective model for drowsiness detection in the intra mode, while MLP demonstrates superior accuracy in the inter mode. This research offers a promising avenue for implementing proactive drowsiness detection systems to enhance occupational safety across various industries.
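The 10 s segmentation step can be illustrated with a short sketch. It splits a one-dimensional signal into non-overlapping fixed-length windows and discards any incomplete trailing window. The sampling rate used below (4 Hz) is a deliberately small, hypothetical value for readability; the actual rate should be taken from the dataset, and whether the authors used overlapping windows is not stated in the abstract.

```python
def segment(signal, fs_hz, window_s=10):
    """Split a 1-D signal into non-overlapping windows of window_s seconds.

    Samples that do not fill a complete final window are dropped.
    """
    size = int(fs_hz * window_s)
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, size)]


# Toy signal (hypothetical): 95 samples at an assumed 4 Hz sampling rate.
signal = list(range(95))
windows = segment(signal, fs_hz=4)
print(len(windows), [len(w) for w in windows])
```

Each window would then be passed to feature extraction and labelled as awake or drowsy.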
Affiliation(s)
- Aymen Zayed
- Technology and Medical Imaging Laboratory, Faculty of Medicine Monastir, University of Monastir, Monastir 5019, Tunisia
- National Engineering School of Sousse, University of Sousse, BP 264 Erriyadh, Sousse 4023, Tunisia
- Department of Electronics and Microelectronics (SEMi), University of Mons, 7000 Mons, Belgium
- Nidhameddine Belhadj
- Laboratory of Electronics and Microelectronics, Faculty of Sciences of Monastir, Monastir 5019, Tunisia
- Khaled Ben Khalifa
- Technology and Medical Imaging Laboratory, Faculty of Medicine Monastir, University of Monastir, Monastir 5019, Tunisia
- Higher Institute of Applied Science and Technology of Sousse, University of Sousse, Sousse 4003, Tunisia
- Mohamed Hedi Bedoui
- Technology and Medical Imaging Laboratory, Faculty of Medicine Monastir, University of Monastir, Monastir 5019, Tunisia
- Carlos Valderrama
- Department of Electronics and Microelectronics (SEMi), University of Mons, 7000 Mons, Belgium
|
15
|
Dai J, Li W, Dong G. Dung Beetle Optimizer Algorithm and Machine Learning-Based Genome Analysis of Lactococcus lactis: Predicting Electronic Sensory Properties of Fermented Milk. Foods 2024; 13:1958. [PMID: 38998464 PMCID: PMC11241492 DOI: 10.3390/foods13131958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 06/11/2024] [Accepted: 06/19/2024] [Indexed: 07/14/2024] Open
Abstract
In the global food industry, fermented dairy products are valued for their unique flavors and nutrients. Lactococcus lactis is crucial in developing these flavors during fermentation. Meeting diverse consumer flavor preferences requires the careful selection of fermentation agents. Traditional assessment methods are slow, costly, and subjective. Although electronic-nose and -tongue technologies provide objective assessments, they are mostly limited to laboratory environments. Therefore, this study developed a model to predict the electronic sensory characteristics of fermented milk. This model is based on the genomic data of Lactococcus lactis, using the DBO (Dung Beetle Optimizer) optimization algorithm combined with 10 different machine learning methods. The research results show that the combination of the DBO optimization algorithm and multi-round feature selection with a ridge regression model significantly improved the performance of the model. In the 10-fold cross-validation, the R2 values of all the electronic sensory phenotypes exceeded 0.895, indicating an excellent performance. In addition, a deep analysis of the electronic sensory data revealed an important phenomenon: the correlation between the electronic sensory phenotypes is positively related to the number of features jointly selected. Generally, a higher correlation among the electronic sensory phenotypes corresponds to a greater number of features being jointly selected. Specifically, phenotypes with high correlations exhibit from 2 to 60 times more jointly selected features than those with low correlations. This suggests that our feature selection strategy effectively identifies the key features impacting multiple phenotypes, likely originating from their regulation by similar biological pathways or metabolic processes. Overall, this study proposes a more efficient and cost-effective method for predicting the electronic sensory characteristics of milk fermented by Lactococcus lactis. 
It helps to screen and optimize fermenting agents with desirable flavor characteristics, thereby driving innovation and development in the dairy industry and enhancing the product quality and market competitiveness.
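The evaluation protocol mentioned above (10-fold cross-validation scored by the coefficient of determination) can be sketched as follows. This is a generic illustration under stated assumptions: folds are contiguous and unshuffled, the function names are invented for this example, and the ridge model and DBO search themselves are not shown.

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy usage (hypothetical sizes): 25 strains split into 10 folds.
folds = list(kfold_indices(25, k=10))
print([len(test) for _, test in folds])
print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]))
```

In practice the samples would usually be shuffled before splitting, and any feature selection would be repeated inside each training fold to avoid optimistic estimates.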
Affiliation(s)
- Jinhui Dai
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010011, China
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot 010011, China
- Weicheng Li
- Key Laboratory of Dairy Biotechnology and Engineering (IMAU), Ministry of Education, Inner Mongolia Agricultural University, Hohhot 010018, China
- Key Laboratory of Dairy Products Processing, Ministry of Agriculture and Rural Affairs, Inner Mongolia Agricultural University, Hohhot 010018, China
- Inner Mongolia Key Laboratory of Dairy Biotechnology and Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
- Collaborative Innovative Center for Lactic Acid Bacteria and Fermented Dairy Products, Ministry of Education, Inner Mongolia Agricultural University, Hohhot 010018, China
- Gaifang Dong
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010011, China
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot 010011, China
|
16
|
Saini R, Tiwari AK, Nath A, Singh P, Maurya SP, Shah MA. Covering assisted intuitionistic fuzzy bi-selection technique for data reduction and its applications. Sci Rep 2024; 14:13568. [PMID: 38866851 DOI: 10.1038/s41598-024-62099-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 05/13/2024] [Indexed: 06/14/2024] Open
Abstract
The dimension and size of data are growing rapidly with the extensive applications of computer science and lab-based engineering in daily life. Vagueness, later uncertainty, redundancy, irrelevancy, and noise in such data raise concerns in building effective learning models. Fuzzy rough sets and their extensions have been applied to deal with these issues through various data reduction approaches. However, constructing a model that can cope with all these issues simultaneously remains a challenging task, and no study to date has addressed all of them at once. This paper investigates a method based on the notions of intuitionistic fuzzy (IF) and rough sets to overcome these obstacles simultaneously by putting forward a data reduction technique. To accomplish this task, firstly, a novel IF similarity relation is introduced. Secondly, we establish an IF rough set model on the basis of this similarity relation. Thirdly, an IF granular structure is presented by using the established similarity relation and the lower approximation. Next, mathematical theorems are used to validate the proposed notions. Then, the importance-degree of the IF granules is employed for redundant size elimination. Further, significance-degree-preserved dimensionality reduction is discussed. Hence, simultaneous instance and feature selection for large volumes of high-dimensional datasets can be performed to eliminate redundancy and irrelevancy in both dimension and size, where vagueness and later uncertainty are handled with rough and IF sets respectively, whilst noise is tackled with the IF granular structure. Thereafter, a comprehensive experiment is carried out over benchmark datasets to demonstrate the effectiveness of the simultaneous feature and data point selection methods. Finally, a framework based on the proposed methodology is discussed to enhance regression performance for the IC50 of antiviral peptides.
Affiliation(s)
- Rajat Saini
- Department of Mathematics, School of Basic Sciences, Central University of Haryana, Mahendergarh, 123031, India
- Anoop Kumar Tiwari
- Department of Computer Science and Information Technology, Central University of Haryana, Mahendergarh, 123031, India.
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India
- Phool Singh
- Department of Mathematics (SoET), Central University of Haryana, Mahendergarh, 123031, India
- S P Maurya
- Department of Geophysics, Institute of Science, Banaras Hindu University, Varanasi, 221005, India
- Mohd Asif Shah
- Department of Economics, Kebri Dehar University, 250, Kebri Dehar, Somali, Ethiopia.
- Division of Research and Development, Lovely Professional University, Phagwara, Punjab, 144001, India.
- Department of Economics, Kardan University, Parwan e Du, Kabul, 1001, Afghanistan.
|
17
|
Iqbal A, Amin R, Alsubaei FS, Alzahrani A. Anomaly detection in multivariate time series data using deep ensemble models. PLoS One 2024; 19:e0303890. [PMID: 38843255 PMCID: PMC11156414 DOI: 10.1371/journal.pone.0303890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 05/03/2024] [Indexed: 06/09/2024] Open
Abstract
Anomaly detection in time series data is essential for fraud detection and intrusion monitoring applications. However, it poses challenges due to data complexity and high dimensionality. Industrial applications struggle to process high-dimensional, complex data streams in real time despite existing solutions. This study introduces deep ensemble models to improve traditional time series analysis and anomaly detection methods. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks effectively handle variable-length sequences and capture long-term relationships. Convolutional Neural Networks (CNNs) are also investigated, especially for univariate or multivariate time series forecasting. The Transformer, an architecture based on Artificial Neural Networks (ANN), has demonstrated promising results in various applications, including time series prediction and anomaly detection. Graph Neural Networks (GNNs) identify time series anomalies by capturing temporal connections and interdependencies between periods, leveraging the underlying graph structure of time series data. A novel feature selection approach is proposed to address challenges posed by high-dimensional data, improving anomaly detection by selecting different or more critical features from the data. This approach outperforms previous techniques in several aspects. Overall, this research introduces state-of-the-art algorithms for anomaly detection in time series data, offering advancements in real-time processing and decision-making across various industrial sectors.
Affiliation(s)
- Amjad Iqbal
- Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan
- Rashid Amin
- Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan
- Department of Computer Science and Information Technology, University of Chakwal, Chakwal, Pakistan
- Faisal S. Alsubaei
- Department of Cybersecurity, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
- Abdulrahman Alzahrani
- Department of Information System and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
|
18
|
Rostamzadeh S, Abouhossein A, Alam K, Vosoughi S, Sattari SS. Exploratory analysis using machine learning algorithms to predict pinch strength by anthropometric and socio-demographic features. Int J Occup Saf Ergon 2024; 30:518-531. [PMID: 38553890 DOI: 10.1080/10803548.2024.2322888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Objectives. This study examines the role of different machine learning (ML) algorithms to determine which socio-demographic factors and hand-forearm anthropometric dimensions can be used to accurately predict hand function. Methods. The cross-sectional study was conducted with 7119 healthy Iranian participants (3525 males and 3594 females) aged 10-89 years. Seventeen hand-forearm anthropometric dimensions were measured by JEGS digital caliper and a measuring tape. Tip-to-tip, key and three-jaw chuck pinches were measured using a calibrated pinch gauge. Subsequently, 21 features pertinent to socio-demographic factors and hand-forearm anthropometric dimensions were used for classification. Furthermore, 12 well-known classifiers were implemented and evaluated to predict pinches. Results. Among the 21 features considered in this study, hand length, stature, age, thumb length and index finger length were found to be the most relevant and effective components for each of the three pinch predictions. The k-nearest neighbor, adaptive boosting (AdaBoost) and random forest classifiers achieved the highest classification accuracy of 96.75, 86.49 and 84.66% to predict three pinches, respectively. Conclusions. Predicting pinch strength and determining the predictive hand-forearm anthropometric and socio-demographic characteristics using ML may pave the way to designing an enhanced tool handle and reduce common musculoskeletal disorders of the hand.
Affiliation(s)
- Sajjad Rostamzadeh
- Department of Ergonomics, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Alireza Abouhossein
- Department of Ergonomics, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Khurshid Alam
- Department of Mechanical and Industrial Engineering, College of Engineering, Sultan Qaboos University, Muscat, Oman
- Shahram Vosoughi
- Department of Occupational Health Engineering, School of Public Health, Iran University of Medical Sciences, Tehran, Iran
|
19
|
Park JY, Lee SH, Kim YJ, Kim KG, Lee GJ. Machine learning model based on radiomics features for AO/OTA classification of pelvic fractures on pelvic radiographs. PLoS One 2024; 19:e0304350. [PMID: 38814948 PMCID: PMC11139281 DOI: 10.1371/journal.pone.0304350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Accepted: 05/10/2024] [Indexed: 06/01/2024] Open
Abstract
Depending on the degree of fracture, pelvic fracture can be accompanied by vascular damage, and in severe cases, it may progress to hemorrhagic shock. Pelvic radiography can quickly diagnose pelvic fractures, and the Association for Osteosynthesis Foundation and Orthopedic Trauma Association (AO/OTA) classification system is useful for evaluating pelvic fracture instability. This study aimed to develop a radiomics-based machine-learning algorithm to quickly diagnose fractures on pelvic X-ray and classify their instability. The data used were pelvic anteroposterior radiographs of 990 adults over 18 years of age diagnosed with pelvic fractures, and 200 normal subjects. A total of 93 radiomics features were extracted: 18 first-order, 24 GLCM, 16 GLRLM, 16 GLSZM, 5 NGTDM, and 14 GLDM features. To improve the performance of machine learning, the feature selection methods RFE, SFS, LASSO, and Ridge were used, and the machine learning models were LR, SVM, RF, XGB, MLP, KNN, and LGBM. Performance was evaluated by the area under the receiver operating characteristic curve (AUC). The machine learning models were trained on the features selected by each of the four feature-selection methods. When the RFE feature selection method was used, the average AUC was higher than that of the other methods. Among the RFE-based models, the combination with the SVM showed the best performance, with an average AUC of 0.75±0.06. By obtaining a feature-importance graph for the combination of RFE and SVM, it is possible to identify features with high importance. The AO/OTA classification of normal pelvic rings and pelvic fractures on pelvic AP radiographs using a radiomics-based machine learning model showed the highest AUC when using the SVM classification combination. Further research on the radiomic features of each part of the pelvic bone constituting the pelvic ring is needed.
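Recursive feature elimination (RFE), the best-performing selector in this abstract, follows a simple loop: score the remaining features, drop the weakest, and repeat. The skeleton below is a sketch under loud assumptions: `toy_importance` is a stand-in for refitting a model such as an SVM and reading its weights, and the feature names, groups, and base weights are hypothetical.

```python
def rfe(features, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain."""
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        # Importances are recomputed each round on the surviving features.
        scores = importance_fn(remaining)
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


# Hypothetical stand-in for a fitted model: features in the same group are
# treated as redundant and split their base weight between them.
BASE = {"tex_a1": 1.0, "tex_a2": 0.9, "tex_b": 0.6, "tex_c": 0.2}
GROUP = {"tex_a1": "A", "tex_a2": "A", "tex_b": "B", "tex_c": "C"}


def toy_importance(remaining):
    counts = {}
    for f in remaining:
        counts[GROUP[f]] = counts.get(GROUP[f], 0) + 1
    return {f: BASE[f] / counts[GROUP[f]] for f in remaining}


kept, dropped = rfe(list(BASE), toy_importance, n_keep=2)
print(kept, dropped)
```

Because importances are recomputed after every removal, a feature's rank can change once a redundant partner is gone, which is the point of the recursion.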
Affiliation(s)
- Jun Young Park
- Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences and Technology (GAIHST), Gachon University, Incheon, Republic of Korea
- Seung Hwan Lee
- Department of Trauma Surgery, Gachon University Gil Medical Center, Gachon University, Incheon, Republic of Korea
- Department of Traumatology, Gachon University College of Medicine, Gachon University, Incheon, Republic of Korea
- Young Jae Kim
- Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences and Technology (GAIHST), Gachon University, Incheon, Republic of Korea
- Department of Medical Devices R&D Center, Gachon University Gil Medical Center, Gachon University, Incheon, Republic of Korea
- Department of Biomedical Engineering, Pre-medical Course, College of Medicine, Gachon University, Incheon, Republic of Korea
- Kwang Gi Kim
- Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences and Technology (GAIHST), Gachon University, Incheon, Republic of Korea
- Department of Medical Devices R&D Center, Gachon University Gil Medical Center, Gachon University, Incheon, Republic of Korea
- Department of Biomedical Engineering, Pre-medical Course, College of Medicine, Gachon University, Incheon, Republic of Korea
- Gil Jae Lee
- Department of Trauma Surgery, Gachon University Gil Medical Center, Gachon University, Incheon, Republic of Korea
- Department of Traumatology, Gachon University College of Medicine, Gachon University, Incheon, Republic of Korea
|
20
|
Canero FM, Rodriguez-Galiano V, Aragones D. Machine Learning and Feature Selection for soil spectroscopy. An evaluation of Random Forest wrappers to predict soil organic matter, clay, and carbonates. Heliyon 2024; 10:e30228. [PMID: 38707402 PMCID: PMC11066688 DOI: 10.1016/j.heliyon.2024.e30228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Revised: 04/19/2024] [Accepted: 04/22/2024] [Indexed: 05/07/2024] Open
Abstract
Soil spectroscopy estimates soil properties using the absorption features in soil spectra. However, modelling soil properties with soil spectroscopy is challenging due to the high dimensionality of spectral data. Feature Selection wrapper methods are promising approaches to reduce the dimensionality but are barely used in soil spectroscopy. The aim of this study is to evaluate the performance of two feature selection wrapper methods, Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) built using the Random Forest (RF) algorithm, for dimensionality reduction of spectral data and predictive modelling of soil organic matter (SOM), clay and carbonates. The reflectance of 100 soil samples, acquired from Sierra de las Nieves (Spain), was measured under laboratory conditions using an ASD FieldSpec Pro JR. Four different datasets were obtained after applying two spectral preprocessing methods to raw spectra: raw spectra, Continuum Removal (CR), Multiplicative Scatter Correction (MSC), and a so-called "Global" dataset composed of raw, CR and MSC features. The performance of RF models built with feature selection methods was compared to that of Partial Least Squares Regression (PLSR) and RF (alone). RF models built with SFS and SFFS outperformed PLSR and RF alone models: the best RF models with feature selection had a respective ratio of performance to interquartile distance of 1.93, 0.38 and 2.56. PLSR models had an accuracy of 1.41, 0.29 and 1.81 for SOM, carbonates, and clay, respectively. RF alone had a respective performance of 1.29, 0.29 and 1.81. The application of feature selection wrapper methods reduced the number of features to less than 1 % of the starting features. Features were selected across all spectra for SOM and clay, and around 900 nm, 1900 nm, and 2350 nm for carbonates. 
However, feature selection highlighted features around 1100 nm in SOM modelling, as well as other features around 2200 nm, which is considered a main absorption feature of clay. The application of feature selection with Random Forest was very important in improving modelling accuracy, reducing the redundant features and avoiding the curse of dimensionality or Hughes effect. Thus, this research showed an alternative to dimensionality reduction approaches that have been applied to date to model soil properties with spectroscopy and paves the way for further scientific investigation based on feature selection methods and machine learning.
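Sequential Forward Selection, one of the two wrappers evaluated above, can be outlined in a few lines. The sketch greedily adds the candidate that most improves a score and stops when nothing helps. Everything specific here is an assumption for illustration: `toy_score` stands in for a cross-validated Random Forest score, and the candidate wavelengths and the set of informative bands are hypothetical, loosely borrowing band positions mentioned in the abstract.

```python
def sfs(candidates, score_fn, max_features):
    """Greedy forward selection; stops when no candidate improves the score."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        trials = [(score_fn(selected + [c]), c)
                  for c in candidates if c not in selected]
        if not trials:
            break
        score, choice = max(trials)  # ties resolved by the larger candidate
        if score <= best_score:
            break
        selected.append(choice)
        best_score = score
    return selected, best_score


# Hypothetical scoring: reward each informative band, penalise subset size.
INFORMATIVE = {900, 1900, 2350}


def toy_score(subset):
    return len(INFORMATIVE.intersection(subset)) - 0.01 * len(subset)


selected, best = sfs([900, 1100, 1900, 2350], toy_score, max_features=4)
print(sorted(selected), best)
```

The floating variant (SFFS) extends this loop with conditional backward steps that can remove a previously added feature when doing so improves the score; that part is not shown here.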
Affiliation(s)
- Francisco M. Canero
- Department of Physical Geography and Regional Geographic Analysis, Universidad de Sevilla, 41004, Seville, Spain
- Victor Rodriguez-Galiano
- Department of Physical Geography and Regional Geographic Analysis, Universidad de Sevilla, 41004, Seville, Spain
- David Aragones
- Remote Sensing and Geographic Information Systems Lab (LAST-EBD), Doñana Biological Station, C.S.I.C., 41092, Seville, Spain
| |
Collapse
|
21
|
Tiwari AK, Saini R, Nath A, Singh P, Shah MA. Hybrid similarity relation based mutual information for feature selection in intuitionistic fuzzy rough framework and its applications. Sci Rep 2024; 14:5958. [PMID: 38472266 PMCID: PMC10933482 DOI: 10.1038/s41598-024-55902-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2023] [Accepted: 02/28/2024] [Indexed: 03/14/2024] Open
Abstract
Fuzzy rough entropy, established in the notion of fuzzy rough set theory, has been effectively and efficiently applied to feature selection to handle the uncertainty in real-valued datasets. Further, fuzzy rough mutual information has been presented by integrating information entropy with fuzzy rough sets to measure the importance of features. However, none of the methods to date can handle noise, uncertainty and vagueness simultaneously due to both judgement and identification, which degrades the overall performance of the learning algorithms as the number of mixed-valued conditional features increases. In the current study, these issues are tackled by presenting a novel intuitionistic fuzzy (IF) assisted mutual information concept along with an IF granular structure. Initially, a hybrid IF similarity relation is introduced. Based on this relation, an IF granular structure is introduced. Then, IF rough conditional and joint entropies are established. Further, mutual information based on these concepts is discussed. Next, mathematical theorems are proved to demonstrate the validity of the given notions. Thereafter, the significance of a feature subset is computed using this mutual information, and a corresponding feature selection method is suggested to delete the irrelevant and redundant features. The current approach effectively handles noise and subsequent uncertainty in both nominal and mixed data (including both nominal and category variables). Moreover, comprehensive experimental performances are evaluated on real-valued benchmark datasets to demonstrate the practical validation and effectiveness of the addressed technique. Finally, an application of the proposed method is exhibited to improve the prediction of phospholipidosis positive molecules. RF(h2o) produces the most effective results to date based on our proposed methodology, with sensitivity, accuracy, specificity, MCC, and AUC of 86.7%, 90.1%, 93.0%, 0.808, and 0.922, respectively.
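For readers unfamiliar with mutual-information-based feature ranking, the sketch below computes the classical Shannon mutual information between discrete features and a class label. This is a deliberately simpler, swapped-in technique: it is not the intuitionistic fuzzy rough mutual information proposed in the paper, and the feature names and data are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Shannon mutual information I(X; Y) in bits for two discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += p_ab * log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Hypothetical toy data: one feature determines the label, the other is unrelated.
labels = [0, 0, 1, 1]
features = {"informative": [5, 5, 9, 9], "noise": [1, 2, 1, 2]}

ranking = sorted(features, key=lambda name: mutual_information(features[name], labels), reverse=True)
print(ranking)  # ['informative', 'noise']
```

A ranking like this only captures relevance to the label. Removing redundant features, as the abstract describes, would also require comparing candidate features against those already selected.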
Affiliation(s)
- Anoop Kumar Tiwari
- Department of Computer Science and Information Technology, Central University of Haryana, Mahendergarh, 123031, India
- Rajat Saini
- Department of Mathematics, School of Basic Sciences, Central University of Haryana, Mahendergarh, 123031, India.
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India
- Phool Singh
- Department of Mathematics (SoET), Central University of Haryana, Mahendergarh, 123031, India
- Mohd Asif Shah
- Department of Economics, Kebri Dehar University, 250, Kebri Dehar, Somali, Ethiopia.
- Centre of Research Impact and Outcome, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, 140401, Punjab, India.
- Division of Research and Development, Lovely Professional University, Phagwara, 144001, Punjab, India.
|
22
|
Zhou W, Yan Z, Zhang L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci Rep 2024; 14:5905. [PMID: 38467662 PMCID: PMC10928191 DOI: 10.1038/s41598-024-55243-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 02/21/2024] [Indexed: 03/13/2024] Open
Abstract
To explore a robust tool for advancing digital breeding practices through an artificial intelligence-driven phenotype prediction expert system, we undertook a thorough analysis of 11 non-linear regression models. Our investigation specifically emphasized the significance of Support Vector Regression (SVR) and SHapley Additive exPlanations (SHAP) in predicting soybean branching. By using branching data (phenotype) of 1918 soybean accessions and 42k SNP (Single Nucleotide Polymorphism) polymorphic data (genotype), this study systematically compared 11 non-linear regression AI models, including four deep learning models (DBN (deep belief network) regression, ANN (artificial neural network) regression, Autoencoders regression, and MLP (multilayer perceptron) regression) and seven machine learning models (SVR (support vector regression), XGBoost (eXtreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, GPs (Gaussian processes) regression, Decision Tree regression, and Polynomial regression). After evaluation with four metrics, R2 (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error), it was found that SVR, Polynomial Regression, DBN, and Autoencoder outperformed the other models and obtained better prediction accuracy when used for phenotype prediction. In the assessment of deep learning approaches, we exemplified the SVR model, conducting analyses on feature importance and gene ontology (GO) enrichment to provide comprehensive support. After comprehensively comparing four feature importance algorithms, namely Variable Ranking, Permutation, SHAP, and Correlation Matrix, no notable distinction was observed in their feature importance ranking scores, but the SHAP value could provide rich information on genes with negative contributions, and SHAP importance was chosen for feature selection.
The results of this study offer valuable insights into AI-mediated plant breeding, addressing challenges faced by traditional breeding programs. The method developed has broad applicability in phenotype prediction, minor QTL (quantitative trait loci) mining, and plant smart-breeding systems, contributing significantly to the advancement of AI-based breeding practices and transitioning from experience-based to data-based breeding.
Affiliation(s)
- Wei Zhou
- Florida Agricultural and Mechanical University, Tallahassee, FL, 32307, USA.
- Zhengxiao Yan
- Florida State University, Tallahassee, FL, 32306, USA
- Liting Zhang
- Florida State University, Tallahassee, FL, 32306, USA
|
23
|
Atimbire SA, Appati JK, Owusu E. Empirical exploration of whale optimisation algorithm for heart disease prediction. Sci Rep 2024; 14:4530. [PMID: 38402276 PMCID: PMC10894250 DOI: 10.1038/s41598-024-54990-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 02/19/2024] [Indexed: 02/26/2024] Open
Abstract
Heart Diseases have the highest mortality worldwide, necessitating precise predictive models for early risk assessment. Much existing research has focused on improving model accuracy with single datasets, often neglecting the need for comprehensive evaluation metrics and utilization of different datasets in the same domain (heart disease). This research introduces a heart disease risk prediction approach by harnessing the whale optimization algorithm (WOA) for feature selection and implementing a comprehensive evaluation framework. The study leverages five distinct datasets, including the combined dataset comprising the Cleveland, Long Beach VA, Switzerland, and Hungarian heart disease datasets. The others are the Z-AlizadehSani, Framingham, South African, and Cleveland heart datasets. The WOA-guided feature selection identifies optimal features, subsequently integrated into ten classification models. Comprehensive model evaluation reveals significant improvements across critical performance metrics, including accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve. These enhancements consistently outperform state-of-the-art methods using the same dataset, validating the effectiveness of our methodology. The comprehensive evaluation framework provides a robust assessment of the model's adaptability, underscoring the WOA's effectiveness in identifying optimal features in multiple datasets in the same domain.
Affiliation(s)
- Ebenezer Owusu
- Department of Computer Science, University of Ghana, Accra, Ghana
|
24
|
Yang K, Liu L, Wen Y. The impact of Bayesian optimization on feature selection. Sci Rep 2024; 14:3948. [PMID: 38366092 PMCID: PMC10873405 DOI: 10.1038/s41598-024-54515-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Accepted: 02/13/2024] [Indexed: 02/18/2024] Open
Abstract
Feature selection is an indispensable step for the analysis of high-dimensional molecular data. Despite its importance, consensus is lacking on how to choose the most appropriate feature selection methods, especially when the performance of the feature selection methods themselves depends on hyper-parameters. Bayesian optimization has demonstrated its advantages in automatically configuring hyper-parameter settings for various models. However, it remains unclear whether Bayesian optimization can benefit feature selection methods. In this research, we conducted extensive simulation studies to compare the performance of various feature selection methods, with a particular focus on the impact of Bayesian optimization on those where hyper-parameter tuning is needed. We further utilized the gene expression data obtained from the Alzheimer's Disease Neuroimaging Initiative to predict various brain imaging-related phenotypes, where various feature selection methods were employed to mine the data. We found through simulation studies that feature selection methods with hyper-parameters tuned using Bayesian optimization often yield better recall rates, and the analysis of transcriptomic data further revealed that Bayesian optimization-guided feature selection can improve the accuracy of disease risk prediction models. In conclusion, Bayesian optimization can facilitate feature selection methods when hyper-parameter tuning is needed and has the potential to substantially benefit downstream tasks.
Affiliation(s)
- Kaixin Yang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi, China
- Long Liu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi, China.
- Yalu Wen
- Department of Statistics, University of Auckland, 38 Princes Street, Auckland Central, Auckland, 1010, New Zealand.
|
25
|
Lu M, Yin R, Chen XS. Ensemble methods of rank-based trees for single sample classification with gene expression profiles. J Transl Med 2024; 22:140. [PMID: 38321494 PMCID: PMC10848444 DOI: 10.1186/s12967-024-04940-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Accepted: 01/27/2024] [Indexed: 02/08/2024] Open
Abstract
Building Single Sample Predictors (SSPs) from gene expression profiles presents challenges, notably due to the lack of calibration across diverse gene expression measurement technologies. However, recent research indicates the viability of classifying phenotypes based on the order of expression of multiple genes. Existing SSP methods often rely on Top Scoring Pairs (TSP), which are platform-independent and easy to interpret through the concept of "relative expression reversals". Nevertheless, TSP methods face limitations in classifying complex patterns involving comparisons of more than two gene expressions. To overcome these constraints, we introduce a novel approach that extends TSP rules by constructing rank-based trees capable of encompassing extensive gene-gene comparisons. This method is bolstered by incorporating two ensemble strategies, boosting and random forest, to mitigate the risk of overfitting. Our implementation of ensemble rank-based trees employs boosting with LogitBoost cost and random forests, addressing both binary and multi-class classification problems. In a comparative analysis across 12 cancer gene expression datasets, our proposed methods demonstrate superior performance over both the k-TSP classifier and nearest template prediction methods. We have further refined our approach to facilitate variable selection and the generation of clear, precise decision rules from rank-based trees, enhancing interpretability. The cumulative evidence from our research underscores the significant potential of ensemble rank-based trees in advancing disease classification via gene expression data, offering a robust, interpretable, and scalable solution. Our software is available at https://CRAN.R-project.org/package=ranktreeEnsemble .
Affiliation(s)
- Min Lu
- Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA.
- Ruijie Yin
- Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA
- X Steven Chen
- Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA.
- Sylvester Comprehensive Cancer Center, Miller School of Medicine, University of Miami, 1475 NW 12th Ave, Miami, FL, 33136, USA.
|
26
|
Sheng J, Lam S, Zhang J, Zhang Y, Cai J. Multi-omics fusion with soft labeling for enhanced prediction of distant metastasis in nasopharyngeal carcinoma patients after radiotherapy. Comput Biol Med 2024; 168:107684. [PMID: 38039891 DOI: 10.1016/j.compbiomed.2023.107684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 10/06/2023] [Accepted: 11/06/2023] [Indexed: 12/03/2023]
Abstract
Omics fusion has emerged as a crucial preprocessing approach in medical image processing, significantly assisting several studies. One of the challenges encountered in integrating omics data is the unpredictability arising from disparities in data sources and medical imaging equipment. Due to these differences, the distribution of omics features exhibits spatial heterogeneity, diminishing their capacity to enhance subsequent tasks. To overcome this challenge and facilitate their joint application to specific medical objectives, this study aims to develop a fusion methodology for nasopharyngeal carcinoma (NPC) distant metastasis prediction to mitigate the disparities inherent in omics data. The multi-kernel late-fusion method can reduce the impact of these differences by mapping the features using the most suitable single-kernel function and then combining them in a high-dimensional space that can effectively represent the data. The proposed approach in this study employs a distinctive framework incorporating a label-softening technique alongside a multi-kernel-based radial basis function (RBF) neural network to address these limitations. An efficient representation of the data may be achieved by utilizing the multi-kernel to map the inherent features and then merging them in a space with many dimensions. However, the inflexibility of label fitting poses a constraint on using multi-kernel late-fusion methods in complex NPC datasets, hence affecting the efficacy of general classifiers in dealing with high-dimensional characteristics. The label softening increases the disparity between the two cohorts, providing a more flexible structure for allocating labels. The proposed model is evaluated on multi-omics datasets, and the results demonstrate its strength and effectiveness in predicting distant metastasis of NPC patients.
Affiliation(s)
- Jiabao Sheng
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China; Research Institute for Smart Ageing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China.
- SaiKit Lam
- Research Institute for Smart Ageing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China; Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China.
- Jiang Zhang
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China.
- Yuanpeng Zhang
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China; The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China.
- Jing Cai
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China; Research Institute for Smart Ageing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China; The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China.
|
27
|
Hosseiniyan Khatibi SM, Rahbar Saadat Y, Hejazian SM, Sharifi S, Ardalan M, Teshnehlab M, Zununi Vahed S, Pirmoradi S. Decoding the Possible Molecular Mechanisms in Pediatric Wilms Tumor and Rhabdoid Tumor of the Kidney through Machine Learning Approaches. Fetal Pediatr Pathol 2023; 42:825-844. [PMID: 37548233 DOI: 10.1080/15513815.2023.2242979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 07/26/2023] [Indexed: 08/08/2023]
Abstract
Objective: Wilms tumor (WT) and Rhabdoid tumor (RT) are pediatric renal tumors and their differentiation is based on histopathological and molecular analysis. The present study aimed to introduce the panels of mRNAs and microRNAs involved in the pathogenesis of these cancers using deep learning algorithms. Methods: Filter, graph, and association rule mining algorithms were applied to the mRNAs/microRNAs data. Results: Candidate miRNAs and mRNAs with high accuracy (AUC: 97%/93% and 94%/97%, respectively) could differentiate the WT and RT classes in training and test data. Let-7a-2 and C19orf24 were identified in the WT, while miR-199b and RP1-3E10.2 were detected in the RT by analysis of Association Rule Mining. Conclusion: The application of the machine learning methods could identify mRNA/miRNA patterns to discriminate WT from RT. The identified miRNAs/mRNAs panels could offer novel insights into the underlying molecular mechanisms that are responsible for the initiation and development of these cancers. They may provide further insight into the pathogenesis, prognosis, diagnosis, and molecular-targeted therapy in pediatric renal tumors.
Affiliation(s)
- Seyed Mahdi Hosseiniyan Khatibi
- Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Simin Sharifi
- Dental and Periodontal Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Teshnehlab
- Department of Electrical and Computer Engineering, K.N. Toosi University of Technology, Tehran, Iran
- Saeed Pirmoradi
- Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
|
28
|
Sun S, Alkahtani ME, Gaisford S, Basit AW, Elbadawi M, Orlu M. Virtually Possible: Enhancing Quality Control of 3D-Printed Medicines with Machine Vision Trained on Photorealistic Images. Pharmaceutics 2023; 15:2630. [PMID: 38004607 PMCID: PMC10674815 DOI: 10.3390/pharmaceutics15112630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 11/01/2023] [Accepted: 11/14/2023] [Indexed: 11/26/2023] Open
Abstract
Three-dimensional (3D) printing is an advanced pharmaceutical manufacturing technology, and concerted efforts are underway to establish its applicability to various industries. However, for any technology to achieve widespread adoption, robustness and reliability are critical factors. Machine vision (MV), a subset of artificial intelligence (AI), has emerged as a powerful tool to replace human inspection with unprecedented speed and accuracy. Previous studies have demonstrated the potential of MV in pharmaceutical processes. However, training models using real images proves to be both costly and time consuming. In this study, we present an alternative approach, where synthetic images were used to train models to classify the quality of dosage forms. We generated 200 photorealistic virtual images that replicated 3D-printed dosage forms, where seven machine learning techniques (MLTs) were used to perform image classification. By exploring various MV pipelines, including image resizing and transformation, we achieved remarkable classification accuracies of 80.8%, 74.3%, and 75.5% for capsules, tablets, and films, respectively, for classifying stereolithography (SLA)-printed dosage forms. Additionally, we subjected the MLTs to rigorous stress tests, evaluating their scalability to classify over 3000 images and their ability to handle irrelevant images, where accuracies of 66.5% (capsules), 72.0% (tablets), and 70.9% (films) were obtained. Moreover, model confidence was also measured, and Brier scores ranged from 0.20 to 0.40. Our results demonstrate promising proof of concept that virtual images exhibit great potential for image classification of SLA-printed dosage forms. By using photorealistic virtual images, which are faster and cheaper to generate, we pave the way for accelerated, reliable, and sustainable AI model development to enhance the quality control of 3D-printed medicines.
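The Brier scores quoted in this abstract are easier to interpret with the formula written out. The sketch below assumes the binary form of the score (mean squared difference between predicted probability and the 0/1 outcome); the abstract does not state which variant the authors used, and the probabilities and labels are hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Binary Brier score: mean of (predicted probability - observed outcome)^2."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return sum(squared_errors) / len(probabilities)


# Hypothetical predictions for three dosage forms (1 = acceptable quality).
confident = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
# A model that always answers 0.5 carries no information about the outcome.
uninformative = brier_score([0.5, 0.5], [0, 1])
print(round(confident, 4), uninformative)  # 0.07 0.25
```

Lower is better, with 0 for perfect confident predictions. Under this binary definition a constant 0.5 forecast scores 0.25, which gives a rough reference point for the reported range of 0.20 to 0.40.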
Affiliation(s)
- Siyuan Sun
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
- Manal E. Alkahtani
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
- Department of Pharmaceutics, College of Pharmacy, Prince Sattam bin Abdulaziz University, Alkharj 11942, Saudi Arabia
- Simon Gaisford
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
- Abdul W. Basit
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
- Moe Elbadawi
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
- School of Biological and Behavioural Sciences, Queen Mary University of London, Mile End Road, London E1 4DQ, UK
- Mine Orlu
- UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London WC1N 1AX, UK; (S.S.); (M.E.A.); (S.G.)
|
29
|
Alahdab F, El Shawi R, Ahmed AI, Han Y, Al-Mallah M. Patient-level explainable machine learning to predict major adverse cardiovascular events from SPECT MPI and CCTA imaging. PLoS One 2023; 18:e0291451. [PMID: 37967112 PMCID: PMC10651041 DOI: 10.1371/journal.pone.0291451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 08/30/2023] [Indexed: 11/17/2023] Open
Abstract
BACKGROUND Machine learning (ML) has shown promise in improving the risk prediction in non-invasive cardiovascular imaging, including SPECT MPI and coronary CT angiography. However, most algorithms used remain black boxes to clinicians in how they compute their predictions. Furthermore, objective consideration of the multitude of available clinical data, along with the visual and quantitative assessments from CCTA and SPECT, are critical for optimal patient risk stratification. We aim to provide an explainable ML approach to predict MACE using clinical, CCTA, and SPECT data. METHODS Consecutive patients who underwent clinically indicated CCTA and SPECT myocardial imaging for suspected CAD were included and followed up for MACEs. A MACE was defined as a composite outcome that included all-cause mortality, myocardial infarction, or late revascularization. We employed an Automated Machine Learning (AutoML) approach to predict MACE using clinical, CCTA, and SPECT data. Various mainstream models with different sets of hyperparameters have been explored, and critical predictors of risk are obtained using explainable techniques on the global and patient levels. Ten-fold cross-validation was used in training and evaluating the AutoML model. RESULTS A total of 956 patients were included (mean age 61.1 ±14.2 years, 54% men, 89% hypertension, 81% diabetes, 84% dyslipidemia). Obstructive CAD on CCTA and ischemia on SPECT were observed in 14% of patients, and 11% experienced MACE. ML prediction's sensitivity, specificity, and accuracy in predicting a MACE were 69.61%, 99.77%, and 96.54%, respectively. 
The top 10 global predictive features included 8 CCTA attributes (segment involvement score, number of vessels with severe plaque ≥70, ≥50% stenosis in the left marginal coronary artery, calcified plaque, ≥50% stenosis in the left circumflex coronary artery, plaque type in the left marginal coronary artery, stenosis degree in the second obtuse marginal of the left circumflex artery, and stenosis category in the marginals of the left circumflex artery) and 2 clinical features (past medical history of MI or left bundle branch block, being an ever smoker). CONCLUSION ML can accurately predict risk of developing a MACE in patients suspected of CAD undergoing SPECT MPI and CCTA. ML feature-ranking can also show, at a sample- as well as at a patient-level, which features are key in making such a prediction.
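Sensitivity, specificity, and accuracy, as reported in this abstract, all derive from the same confusion matrix. The sketch below uses hypothetical counts chosen only to show how a low event rate lets accuracy stay high while sensitivity remains moderate; they are not the study's data.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    positives, negatives = tp + fn, tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "sensitivity": tp / positives,            # share of events detected
        "specificity": tn / negatives,            # share of non-events cleared
        "accuracy": (tp + tn) / (positives + negatives),
    }


# Hypothetical cohort of 100 patients with a 10% event rate.
metrics = classification_metrics(tp=7, fn=3, tn=88, fp=2)
print({name: round(value, 3) for name, value in metrics.items()})
# {'sensitivity': 0.7, 'specificity': 0.978, 'accuracy': 0.95}
```

Because non-events dominate such a cohort, accuracy is driven mostly by specificity, so sensitivity is worth reading alongside it when events are rare.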
Affiliation(s)
- Fares Alahdab
- Houston Methodist DeBakey Heart & Vascular Center, Houston, TX, United States of America
- Radwa El Shawi
- Institute of Computer Science, University of Tartu, Tartu, Estonia
- Ahmed Ibrahim Ahmed
- Houston Methodist DeBakey Heart & Vascular Center, Houston, TX, United States of America
- Yushui Han
- Houston Methodist DeBakey Heart & Vascular Center, Houston, TX, United States of America
- Mouaz Al-Mallah
- Houston Methodist DeBakey Heart & Vascular Center, Houston, TX, United States of America
|
30
|
Connor M, Salans M, Karunamuni R, Unnikrishnan S, Huynh-Le MP, Tibbs M, Qian A, Reyes A, Stasenko A, McDonald C, Moiseenko V, El-Naqa I, Hattangadi-Gluth JA. Fine Motor Skill Decline After Brain Radiation Therapy-A Multivariate Normal Tissue Complication Probability Study of a Prospective Trial. Int J Radiat Oncol Biol Phys 2023; 117:581-593. [PMID: 37150258 PMCID: PMC10911396 DOI: 10.1016/j.ijrobp.2023.04.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 03/20/2023] [Accepted: 04/29/2023] [Indexed: 05/09/2023]
Abstract
PURPOSE Brain radiation therapy can impair fine motor skills (FMS). Fine motor skills are essential for activities of daily living, enabling hand-eye coordination for manipulative movements. We developed normal tissue complication probability (NTCP) models for the decline in FMS after fractionated brain radiation therapy (RT). METHODS AND MATERIALS On a prospective trial, 44 patients with primary brain tumors received fractionated RT and underwent high-resolution volumetric magnetic resonance imaging, diffusion tensor imaging, and comprehensive FMS assessments (Delis-Kaplan Executive Function System Trail Making Test Motor Speed [DKEFS-MS]; and Grooved Pegboard dominant/nondominant hands) at baseline and 6 months post-RT. Regions of interest subserving motor function (including cortex, superficial white matter, thalamus, basal ganglia, cerebellum, and white matter tracts) were autosegmented using validated methods and manually verified. Dosimetric and clinical variables were included in multivariate NTCP models using automated bootstrapped logistic regression, least absolute shrinkage and selection operator logistic regression, and random forests with nested cross-validation. RESULTS Half of the patients showed a decline on the grooved pegboard test of nondominant hands, 17 of 42 (40.4%) on the grooved pegboard test of dominant hands, and 11 of 44 (25%) on DKEFS-MS. Automated bootstrapped logistic regression selected a 1-term model including maximum dose to dominant postcentral white matter. The least absolute shrinkage and selection operator logistic regression selected this term and steroid use. The top 5 variables in the random forest were all dosimetric: maximum dose to dominant thalamus, mean dose to dominant caudate, mean and maximum dose to the dominant corticospinal tract, and maximum dose to dominant postcentral white matter. This technique performed best with an area under the curve of 0.69 (95% CI, 0.68-0.70) on nested cross-validation.
CONCLUSIONS We present the first NTCP models for FMS impairment after brain RT. Dose to several supratentorial motor-associated regions of interest correlated with a decline in dominant-hand fine motor dexterity in patients with primary brain tumors in multivariate models, outperforming clinical variables. These data can guide prospective fine motor-sparing strategies for brain RT.
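The nested cross-validation mentioned in this abstract separates model tuning (inner folds) from performance estimation (outer folds). The sketch below only generates the index splits, under simplifying assumptions: no shuffling or stratification, and fold counts chosen arbitrarily. The function names are hypothetical and the abstract does not describe the authors' exact scheme.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(indices: List[int], k: int) -> Iterator[Split]:
    """Yield (train, held_out) index lists for each of k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_cv_splits(
    n_samples: int, outer_k: int, inner_k: int
) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """For each outer fold, yield its training set, its test set and the inner splits.

    Inner splits are drawn only from the outer training set, so the outer
    test samples never influence hyper-parameter or feature selection.
    """
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_indices, outer_k):
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner_splits in nested_cv_splits(12, outer_k=3, inner_k=2):
    print(len(outer_train), len(outer_test), len(inner_splits))
```

Tuning would run on each inner split, and the chosen configuration would then be refit on the outer training set and scored once on the outer test set. With a cohort as small as the 44 patients described, stratified or repeated splits would usually be preferable to the plain interleaving used here.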
Affiliation(s)
- Michael Connor
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Mia Salans
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Roshan Karunamuni
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Soumya Unnikrishnan
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Michelle Tibbs
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Alexander Qian
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Anny Reyes
- Department of Psychiatry, University of California San Diego, San Diego, California
- Alena Stasenko
- Department of Psychiatry, University of California San Diego, San Diego, California
- Carrie McDonald
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California; Department of Psychiatry, University of California San Diego, San Diego, California
- Vitali Moiseenko
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California
- Issam El-Naqa
- Department of Radiation Oncology, Moffitt Cancer Center and Research Institute, Tampa, Florida
- Jona A Hattangadi-Gluth
- Department of Radiation Medicine and Applied Sciences, University of California San Diego, San Diego, California.
|
31
|
Ajmal M, Khan MA, Akram T, Alqahtani A, Alhaisoni M, Armghan A, Althubiti SA, Alenezi F. BF2SkNet: best deep learning features fusion-assisted framework for multiclass skin lesion classification. Neural Comput Appl 2023; 35:22115-22131. [DOI: 10.1007/s00521-022-08084-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/08/2022] [Accepted: 11/22/2022] [Indexed: 12/14/2022]
|
32
|
Fu X, Song C, Zhang R, Shi H, Jiao Z. Multimodal Classification Framework Based on Hypergraph Latent Relation for End-Stage Renal Disease Associated with Mild Cognitive Impairment. Bioengineering (Basel) 2023; 10:958. [PMID: 37627843 PMCID: PMC10451373 DOI: 10.3390/bioengineering10080958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 08/09/2023] [Accepted: 08/10/2023] [Indexed: 08/27/2023] Open
Abstract
Combined arterial spin labeling (ASL) and functional magnetic resonance imaging (fMRI) can reveal the spatiotemporal and quantitative properties of brain networks more comprehensively. Imaging markers of end-stage renal disease associated with mild cognitive impairment (ESRDaMCI) will be sought from these properties. Current multimodal classification methods often neglect high-order relationships between brain regions and fail to remove noise from the feature matrix. A multimodal classification framework using hypergraph latent relation (HLR) is proposed to address this issue. A brain functional network with hypergraph structural information is constructed from fMRI data. The feature matrix is obtained through graph theory (GT). The cerebral blood flow (CBF) from ASL is selected as the second modal feature matrix. Then, the adaptive similarity matrix is constructed by learning the latent relation between feature matrices. Latent relation adaptive similarity learning (LRAS) is introduced into multi-task feature learning to construct a multimodal feature selection method based on latent relation (LRMFS). The experimental results show that the best classification accuracy (ACC) reaches 88.67%, at least 2.84% better than the state-of-the-art methods. The proposed framework preserves more valuable information between brain regions and reduces noise among feature matrices. It provides an essential reference value for ESRDaMCI recognition.
Affiliation(s)
- Xidong Fu
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China
- Chaofan Song
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China
- Rupu Zhang
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China
- Haifeng Shi
  - Department of Radiology, The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou 213003, China
- Zhuqing Jiao
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China
|
33
|
Wang H, Doumard E, Soule-Dupuy C, Kemoun P, Aligon J, Monsarrat P. Explanations as a New Metric for Feature Selection: A Systematic Approach. IEEE J Biomed Health Inform 2023; 27:4131-4142. [PMID: 37220033 DOI: 10.1109/jbhi.2023.3279340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
With the extensive use of Machine Learning (ML) in the biomedical field, there is an increasing need for Explainable Artificial Intelligence (XAI) to improve transparency and reveal complex hidden relationships between variables for medical practitioners, while meeting regulatory requirements. Feature Selection (FS) is widely used as part of a biomedical ML pipeline to significantly reduce the number of variables while preserving as much information as possible. However, the choice of FS method affects the entire pipeline, including the final prediction explanations, yet very few works investigate the relationship between FS and model explanations. Through a systematic workflow performed on 145 datasets and an illustration on medical data, the present work demonstrates the promising complementarity of two metrics based on explanations (using ranking and influence changes) in addition to accuracy and retention rate to select the most appropriate FS/ML models. Measuring how much explanations differ with and without FS is particularly promising for recommending FS methods. While reliefF generally performs the best on average, the optimal choice may vary for each dataset. Positioning FS methods in a tridimensional space, integrating explanations-based metrics, accuracy, and retention rate, would allow the user to choose the priority given to each dimension. In biomedical applications, where each medical condition may have its own preferences, this framework will make it possible to offer the healthcare professional the appropriate FS technique and to select the variables that have an important explainable impact, even if this comes at the expense of a limited drop in accuracy.
|
34
|
Ribeiro C, Farmer CK, de Magalhães JP, Freitas AA. Predicting lifespan-extending chemical compounds for C. elegans with machine learning and biologically interpretable features. Aging (Albany NY) 2023; 15:6073-6099. [PMID: 37450404 PMCID: PMC10373959 DOI: 10.18632/aging.204866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2022] [Accepted: 06/19/2023] [Indexed: 07/18/2023]
Abstract
Recently, there has been a growing interest in the development of pharmacological interventions targeting ageing, as well as in the use of machine learning for analysing ageing-related data. In this work, we use machine learning methods to analyse data from DrugAge, a database of chemical compounds (including drugs) modulating lifespan in model organisms. To this end, we created four types of datasets for predicting whether or not a compound extends the lifespan of C. elegans (the most frequent model organism in DrugAge), using four different types of predictive biological features, based on: compound-protein interactions, interactions between compounds and proteins encoded by ageing-related genes, and two types of terms annotated for proteins targeted by the compounds, namely Gene Ontology (GO) terms and physiology terms from the WormBase's Phenotype Ontology. To analyse these datasets, we used a combination of feature selection methods in a data pre-processing phase and the well-established random forest algorithm for learning predictive models from the selected features. In addition, we interpreted the most important features in the two best models in light of the biology of ageing. One noteworthy feature was the GO term "Glutathione metabolic process", which plays an important role in cellular redox homeostasis and detoxification. We also predicted the most promising novel compounds for extending lifespan from a list of previously unlabelled compounds. These include nitroprusside, which is used as an antihypertensive medication. Overall, our work opens avenues for future work in employing machine learning to predict novel life-extending compounds.
Affiliation(s)
- Caio Ribeiro
  - School of Computing, University of Kent, Canterbury, Kent, UK
- João Pedro de Magalhães
  - Genomics of Ageing and Rejuvenation Lab, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
- Alex A. Freitas
  - School of Computing, University of Kent, Canterbury, Kent, UK
|
35
|
Rostamzadeh S, Abouhossein A, Saremi M, Taheri F, Ebrahimian M, Vosoughi S. A comparative investigation of machine learning algorithms for predicting safety signs comprehension based on socio-demographic factors and cognitive sign features. Sci Rep 2023; 13:10843. [PMID: 37407611 DOI: 10.1038/s41598-023-38065-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2022] [Accepted: 07/02/2023] [Indexed: 07/07/2023] Open
Abstract
This study examines whether the socio-demographic factors and cognitive sign features can be used for envisaging safety signs comprehensibility using predictive machine learning (ML) techniques. This study will determine the role of different machine learning components such as feature selection and classification to determine suitable factors for safety construction signs comprehensibility. A total of 2310 participants were requested to guess the meaning of 20 construction safety signs (four items for each of the mandatory, prohibition, emergency, warning, and firefighting signs) using the open-ended method. Moreover, the participants were asked to rate the cognitive design features of each sign in terms of familiarity, concreteness, simplicity, meaningfulness, and semantic closeness on a 0-100 rating scale. Subsequently, all eight features (age, experience, education level, familiarity, concreteness, meaningfulness, semantic closeness, and simplicity) were used for classification. Furthermore, the 14 most popular supervised classifiers were implemented and evaluated for safety sign comprehensibility prediction using these eight features. Also, filter and wrapper methods were used as feature selection techniques. Results of feature selection techniques indicate that among the eight features considered in this study, familiarity, simplicity, and meaningfulness are found to be the most relevant and effective components in predicting the comprehensibility of selected safety signs. Further, when these three features are used for classification, the K-NN classifier achieves the highest classification accuracy of 94.369% followed by medium Gaussian SVM which achieves a classification accuracy of 76.075% under hold-out data division protocol. The machine learning (ML) technique was adopted as a promising approach to addressing the issue of comprehensibility, especially in terms of determining factors affecting the safety signs' comprehension. 
The cognitive sign features of familiarity, simplicity, and meaningfulness can provide useful information in terms of designing user-friendly safety signs.
Affiliation(s)
- Sajjad Rostamzadeh
  - Department of Ergonomics, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Alireza Abouhossein
  - Department of Ergonomics, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mahnaz Saremi
  - Department of Ergonomics, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Fereshteh Taheri
  - Occupational Health Research Center, Iran University of Medical Sciences, Shahid Hemmat Highway, Tehran, 1449614535, Iran
- Mobin Ebrahimian
  - Department of Health in Disasters and Emergencies, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran
- Shahram Vosoughi
  - Occupational Health Research Center, Iran University of Medical Sciences, Shahid Hemmat Highway, Tehran, 1449614535, Iran.
|
36
|
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
  - Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
  - Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
  - Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorksa, Koper, Slovenia
  - Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
  - Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
  - Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA.
|
37
|
Francis DP, Laustsen M, Dossi E, Treiberg T, Hardy I, Shiv SH, Hansen BS, Mogensen J, Jakobsen MH, Alstrøm TS. Machine learning methods for the detection of explosives, drugs and precursor chemicals gathered using a colorimetric sniffer sensor. ANALYTICAL METHODS : ADVANCING METHODS AND APPLICATIONS 2023; 15:2343-2354. [PMID: 37157832 DOI: 10.1039/d3ay00247k] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/10/2023]
Abstract
Colorimetric sensing technology for the detection of explosives, drugs, and their precursor chemicals is an important and effective approach. In this work, we use various machine learning models to detect these substances from colorimetric sensing experiments conducted in controlled environments. The detection experiments, based on the response of a colorimetric chip containing 26 chemo-responsive dyes, indicate that homemade explosives (HMEs) such as hexamethylene triperoxide diamine (HMTD), triacetone triperoxide (TATP), and methyl ethyl ketone peroxide (MEKP) used in improvised explosive devices are detected with a true positive rate (TPR) of 70-75%, 73-90%, and 60-82%, respectively. Time series classifiers such as Convolutional Neural Networks (CNN) are explored, and the results indicate that improvements can be achieved by using the kinetics of the chemical responses. The use of CNNs is limited, however, to scenarios where a large number of measurements of each analyte, typically in the range of a few hundred, are available. Feature selection of important dyes using the Group Lasso (GPLASSO) algorithm indicated that certain dyes are more important in discriminating an analyte from ambient air. This information could be used for optimizing the colorimetric sensor and extending the detection to more analytes.
Affiliation(s)
- Deena P Francis
  - DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
- Eleftheria Dossi
  - Centre for Defence Chemistry, Cranfield University, Defence Academy of United Kingdom, Shrivenham, SN6 8LA, UK
- Tuule Treiberg
  - DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Iona Hardy
  - Centre for Defence Chemistry, Cranfield University, Defence Academy of United Kingdom, Shrivenham, SN6 8LA, UK
- Shai Hvid Shiv
  - DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Jesper Mogensen
  - Danish Emergency Management Agency, Chemical Division, Nørre Allé 67, 2100 Copenhagen, Denmark
- Mogens H Jakobsen
  - DTU Chemistry, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Tommy S Alstrøm
  - DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
|
38
|
Doherty T, Dempster E, Hannon E, Mill J, Poulton R, Corcoran D, Sugden K, Williams B, Caspi A, Moffitt TE, Delany SJ, Murphy TM. A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator. BMC Bioinformatics 2023; 24:178. [PMID: 37127563 PMCID: PMC10152624 DOI: 10.1186/s12859-023-05282-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/29/2022] [Accepted: 04/11/2023] [Indexed: 05/03/2023] Open
Abstract
BACKGROUND The field of epigenomics holds great promise in understanding and treating disease, with advances in machine learning (ML) and artificial intelligence being vitally important in this pursuit. Increasingly, research now utilises DNA methylation measures at cytosine-guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Given the challenge of high dimensionality of DNA methylation data, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons. RESULTS We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. In contrast, the baseline model of elastic net regression with no prior feature reduction stage performed less well in general, suggesting that a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few common CpGs, are associated with many of the same biological entities. CONCLUSIONS The variance in performance across tested approaches shows that estimators are sensitive to data set heterogeneity and the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study. 
Moreover, our methodology which utilises a range of feature-selection approaches and ML algorithms could be applied to other biological markers and disease phenotypes, to examine their relationship with DNA methylation and predictive value.
Affiliation(s)
- Trevor Doherty
  - School of Biological, Health and Sports Sciences, Technological University Dublin, Dublin, Ireland.
  - SFI Centre for Research Training in Machine Learning, Technological University Dublin, Dublin, Ireland.
- Emma Dempster
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Eilis Hannon
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Jonathan Mill
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Richie Poulton
  - Department of Psychology, University of Otago, Dunedin, 9016, New Zealand
- David Corcoran
  - Center for Genomic and Computational Biology, Duke University, Durham, NC, 27708, USA
- Karen Sugden
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Ben Williams
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Avshalom Caspi
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Terrie E Moffitt
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Sarah Jane Delany
  - School of Computer Science, Technological University Dublin, Dublin, Ireland
- Therese M Murphy
  - School of Biological, Health and Sports Sciences, Technological University Dublin, Dublin, Ireland
|
39
|
Pan Q, Hu W, He D, He C, Zhang L, Shi Q. Machine-learning assisted molecular formula assignment to high-resolution mass spectrometry data of dissolved organic matter. Talanta 2023; 259:124484. [PMID: 37001397 DOI: 10.1016/j.talanta.2023.124484] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 12/27/2022] [Revised: 02/22/2023] [Accepted: 03/22/2023] [Indexed: 03/29/2023]
Abstract
High-resolution mass spectrometry (HRMS) provides molecular compositional information of dissolved organic matter (DOM) through isotopic assignment from the molecular mass. However, due to the inevitable deviation of molecular mass measurement and the limitation of resolving power, multiple possible solutions frequently occur for a given molecular mass. Lowering the mass deviation threshold and adding assignment restriction rules are often applied to exclude the incorrect solutions, which generally involves time-consuming manual post-processing of mass data. To improve the result accuracy in an automated manner, we developed a molecular formula assignment algorithm based on machine-learning technology. The method integrated a logistic regression model using manually corrected isotopic composition and the peak features of HRMS data (m/z, signal-to-noise ratio, isotope type and number, etc.) as training data. The developed model can evaluate the correctness of a candidate formula for a given mass peak based on the peak features. The method was verified on FT-ICR MS data (direct-infusion, negative-mode electrospray) from various DOM samples, achieving ∼90% accuracy (compared to the traditional approach) for formula assignment. The method was applied to a series of NOM samples and showed a significant improvement in formula assignment compared with the mass matching method.
|
40
|
Wang XW, Wang T, Schaub DP, Chen C, Sun Z, Ke S, Hecker J, Maaser-Hecker A, Zeleznik OA, Zeleznik R, Litonjua AA, DeMeo DL, Lasky-Su J, Silverman EK, Liu YY, Weiss ST. Benchmarking omics-based prediction of asthma development in children. Respir Res 2023; 24:63. [PMID: 36842969 PMCID: PMC9969629 DOI: 10.1186/s12931-023-02368-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/04/2023] [Accepted: 02/16/2023] [Indexed: 02/27/2023] Open
Abstract
BACKGROUND Asthma is a heterogeneous disease with high morbidity. Advancement in high-throughput multi-omics approaches has enabled the collection of molecular assessments at different layers, providing a complementary perspective of complex diseases. Numerous computational methods have been developed for omics-based patient classification or disease outcome prediction. Yet, a systematic benchmarking of those methods using various combinations of omics data for the prediction of asthma development is still lacking. OBJECTIVE We aimed to investigate the computational methods in disease status prediction using multi-omics data. METHOD We systematically benchmarked 18 computational methods using all 63 combinations of six omics data types (GWAS, miRNA, mRNA, microbiome, metabolome, DNA methylation) collected in The Vitamin D Antenatal Asthma Reduction Trial (VDAART) cohort. We evaluated each method using standard performance metrics for each of the 63 omics combinations. RESULTS Our results indicate that overall Logistic Regression, Multi-Layer Perceptron, and MOGONET display superior performance, and the combination of transcriptional, genomic, and microbiome data achieves the best prediction. Moreover, we find that including the clinical data can further improve the prediction performance for some but not all of the omics combinations. CONCLUSIONS Specific omics combinations can reach the optimal prediction of asthma development in children, and certain computational methods outperformed others.
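The figure of 63 combinations follows from counting every non-empty subset of the six omics layers (2^6 − 1 = 63). A short sketch of that enumeration is given below; it uses only the layer names listed in the abstract, and the variable names are illustrative rather than taken from the study's code.

```python
from itertools import combinations

# The six omics layers named in the abstract.
omics = ["GWAS", "miRNA", "mRNA", "microbiome", "metabolome", "DNA methylation"]

# Every non-empty subset: sizes 1 through 6.
all_combinations = [
    combo
    for size in range(1, len(omics) + 1)
    for combo in combinations(omics, size)
]

print(len(all_combinations))  # 63
```

Benchmarking 18 methods on each of these subsets therefore implies 18 × 63 = 1134 method and data pairings before any resampling.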
Affiliation(s)
- Xu-Wen Wang
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Tong Wang
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Darius P Schaub
  - Department of Mathematics, University of Hamburg, 21109, Hamburg, Germany
- Can Chen
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Zheng Sun
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Shanlin Ke
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Julian Hecker
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Anna Maaser-Hecker
  - Genetics and Aging Research Unit, Department of Neurology, McCance Center for Brain Health, Mass General Institute for Neurodegenerative Disease, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA, USA
- Oana A Zeleznik
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Roman Zeleznik
  - Department of Radiation Oncology, Brigham and Women's Hospital, Boston, MA, USA
- Augusto A Litonjua
  - Division of Pediatric Pulmonology, Golisano Children's Hospital, Rochester, NY, USA
- Dawn L DeMeo
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Jessica Lasky-Su
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Edwin K Silverman
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Yang-Yu Liu
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA.
  - Center for Artificial Intelligence and Modeling, The Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA.
- Scott T Weiss
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA.
|
41
|
Mohiuddin S, Sheikh KH, Malakar S, Velásquez JD, Sarkar R. A hierarchical feature selection strategy for deepfake video detection. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08201-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2023]
|
42
|
Ensemble filters with harmonize PSO-SVM algorithm for optimal hearing disorder prediction. Neural Comput Appl 2023; 35:10473-10496. [PMID: 36747886 PMCID: PMC9894525 DOI: 10.1007/s00521-023-08244-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 01/06/2023] [Indexed: 02/05/2023]
Abstract
Discovering a hearing disorder early is critical for reducing the effects of hearing loss, so that approaches to increase the remaining hearing ability can be implemented to support the successful development of human communication. Recently, the explosive growth in dataset features has made it harder for audiologists to decide the proper treatment for the patient. In most cases, irrelevant features and improper classifier parameters strongly affect the accuracy of the audiometry system. Feature selection and parameter tuning depend on each other, so classification accuracy can worsen if the two processes are conducted independently. Although a filter algorithm is capable of eliminating irrelevant features, it lacks the ability to consider feature dependence, which results in a poor selection of significant features. Improper kernel parameter settings may also contribute to poor accuracy. In this paper, an ensemble-filter feature selection based on Information Gain (IG), Gain Ratio (GR), Chi-squared (CS), and Relief-F (RF), with harmonized optimization of Particle Swarm Optimization (PSO) and Support Vector Machine (SVM), is presented to mitigate these problems. Ensemble filters are utilized so that the initial top dominant features relevant for classification can be considered. Then, PSO and SVM are optimized simultaneously to achieve the optimal solution. The results on a standard Audiology dataset show that the proposed method produces 96.50% accuracy with the optimal solution compared to classical SVM, which indicates that the proposed method is effective in handling high-dimensional data for hearing disorder prediction.
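One simple way to combine several filter methods is to rank the features under each filter and average the ranks. The sketch below illustrates that idea only; the abstract does not say how the four filters are aggregated, so the mean-rank rule, the function names `rank_features` and `ensemble_top_k`, and all scores are hypothetical.

```python
def rank_features(scores):
    """Map feature name -> rank (1 = best) for one filter's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def ensemble_top_k(filter_scores, k):
    """Average the per-filter ranks and return the k best-ranked features."""
    features = list(next(iter(filter_scores.values())))
    ranks = [rank_features(scores) for scores in filter_scores.values()]
    mean_rank = {f: sum(r[f] for r in ranks) / len(ranks) for f in features}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(features, key=lambda f: (mean_rank[f], f))[:k]


# Made-up scores for three made-up features under the four filters
# named in the abstract (higher score = more relevant).
filter_scores = {
    "information_gain": {"f1": 0.9, "f2": 0.2, "f3": 0.5},
    "gain_ratio": {"f1": 0.7, "f2": 0.1, "f3": 0.8},
    "chi_squared": {"f1": 12.0, "f2": 3.0, "f3": 9.0},
    "relief_f": {"f1": 0.4, "f2": 0.05, "f3": 0.3},
}

top2 = ensemble_top_k(filter_scores, 2)
print(top2)  # ['f1', 'f3']
```

Working on ranks rather than raw scores avoids having to put filters with different scales (for example chi-squared statistics and information gain) on a common footing.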
|
43
|
Chen Y, Liu Y, Zuo X, Zhao Q, Sun M, Cui M, Zhao X, Du Y. Identification of significant imaging features for sensing oocyte viability. Microsc Res Tech 2023; 86:181-192. [PMID: 36278826 DOI: 10.1002/jemt.24248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/26/2022] [Accepted: 10/06/2022] [Indexed: 01/21/2023]
Abstract
The evaluation of oocyte viability in the laboratory is limited to morphological assessment with the naked eye, but the realization that most normal-appearing oocytes may conceal abnormalities prompts the search for automated approaches that can detect abnormalities imperceptible to the naked eye. In this study, we developed an image processing pipeline applicable to bright-field microscope images to quantify the causal relationship between quantitative imaging features and the developmental potential of oocytes. We acquired 19 imaging features from approximately 700 oocytes and identified two imaging subtypes, viable and nonviable, that correlated closely with a viability fluorescence indicator and cleavage rates. The causal relationship between these imaging features and oocyte viability was derived from a viability-oriented Bayesian network that was developed based on the Bayesian information criterion and Tabu search. Our experimental results revealed that entropy, together with mean Gray Level Co-Occurrence Matrix energy, which describe the uniformity and texture roughness of the cytoplasm, were salient features for the automated selection of promising oocytes that exhibited excellent developmental potential.
Affiliation(s)
- Yizhe Chen
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Yaowei Liu
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Xiaoying Zuo
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Qili Zhao
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Mingzhu Sun
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Maosheng Cui
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
  - Innovation Team of Pig Feeding, Institute of Animal Science and Veterinary of Tianjin, Tianjin, China
- Xin Zhao
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
- Yue Du
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
  - Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Tianjin, China
|
44
|
An improved feature selection approach using global best guided Gaussian artificial bee colony for EMG classification. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
45
|
Zhang M, Wang JS, Liu Y, Wang M, Li XD, Guo FJ. Feature selection method based on stochastic fractal search henry gas solubility optimization algorithm. Journal of Intelligent & Fuzzy Systems 2023. [DOI: 10.3233/jifs-221036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
Abstract
In most data mining tasks, feature selection is an essential preprocessing stage. Henry’s Gas Solubility Optimization (HGSO) is a physics-inspired heuristic algorithm based on Henry’s law, which simulates how the solubility of a gas in a liquid varies with temperature. In this paper, an improved Henry’s Gas Solubility Optimization based on stochastic fractal search (SFS-HGSO) is proposed for feature selection and engineering optimization. Three stochastic fractal strategies, based on Gaussian walk, Lévy flight, and Brownian motion respectively, are adopted, and the diffusion starts from the high-quality solutions obtained by the original algorithm. Individuals with different fitness are assigned different energies, and the number of diffusing individuals is determined according to individual energy. This strategy increases the diversity of search strategies and enhances the local search ability. It substantially mitigates two shortcomings of the original HGSO: its single position-updating method and its slow convergence speed. The algorithm is applied to the feature selection problem, and a KNN classifier is used to evaluate the effectiveness of the selected features. To verify the performance of the proposed feature selection method, 20 standard UCI benchmark datasets are used, and the performance is compared with that of other swarm intelligence optimization algorithms, such as WOA, HHO, and HBA. The algorithm is also applied to the optimization of benchmark functions. Experimental results show that the three improved strategies effectively improve the performance of the HGSO algorithm and achieve excellent results in feature selection and engineering optimization problems.
Affiliation(s)
- Min Zhang
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
- Jie-Sheng Wang
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
- Yu Liu
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
- Min Wang
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
- Xu-Dong Li
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
- Fu-Jun Guo
  - School of Electronic and Information Engineering, University of Science & Technology Liaoning, Anshan, China
|
46
|
Hapfelmeier A, Hornung R, Haller B. Efficient permutation testing of variable importance measures by the example of random forests. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2022.107689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/08/2023]
|
47
|
Lap BQ, Phan TTH, Nguyen HD, Quang LX, Hang PT, Phi NQ, Hoang VT, Linh PG, Thanh Hang BT. Predicting Water Quality Index (WQI) by feature selection and machine learning: A case study of An Kim Hai irrigation system. Ecol Inform 2023. [DOI: 10.1016/j.ecoinf.2023.101991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
|
48
|
Signol F, Arnal L, Navarro-Cerdán JR, Llobet R, Arlandis J, Perez-Cortes JC. SEQENS: An ensemble method for relevant gene identification in microarray data. Comput Biol Med 2023; 152:106413. [PMID: 36521355 DOI: 10.1016/j.compbiomed.2022.106413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Revised: 11/25/2022] [Accepted: 12/03/2022] [Indexed: 12/12/2022]
Abstract
This paper describes an ensemble feature identification algorithm called SEQENS and measures its capability to identify the relevant variables in a case-control study using a gene expression microarray dataset. SEQENS uses Sequential Feature Search on multiple sample splits to select the variables showing the strongest relation with the target, and finally produces a variable relevance ranking. Although designed for feature identification, SEQENS could also serve as a basis for feature selection (classifier optimisation). Cliff, a ranking evaluation metric, is also presented and used to assess feature identification algorithms when a ground truth of relevant variables is available. To test performance, three types of synthetic ground truths emulating fictitious diseases are generated from ten randomly chosen variables following different target pattern distributions using the E-MTAB-3732 dataset. Several sample-to-dimensionality ratios, ranging from 300 to 3,000 observations and from 854 to 54,675 variables, are explored. SEQENS is compared with other state-of-the-art feature selection or identification methods. On average, the proposed algorithm identifies the relevant genes better and exhibits greater stability. The algorithm is available to the community.
Affiliation(s)
- François Signol
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
- Laura Arnal
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
- J Ramón Navarro-Cerdán
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
- Rafael Llobet
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
- Joaquim Arlandis
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
- Juan-Carlos Perez-Cortes
  - Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain.
|
49
|
Jia Z, Ou C, Sun S, Wang J, Liu J, Sun M, Ma W, Li M, Jia S, Mao P. Integrating optical imaging techniques for a novel approach to evaluate Siberian wild rye seed maturity. Frontiers in Plant Science 2023; 14:1170947. [PMID: 37152128 PMCID: PMC10157248 DOI: 10.3389/fpls.2023.1170947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 04/03/2023] [Indexed: 05/09/2023]
Abstract
Advances in optical imaging technology using rapid and non-destructive methods have led to improvements in the efficiency of seed quality detection. Accurately timing the harvest is crucial for maximizing the yield of higher-quality Siberian wild rye seeds by minimizing excessive shattering during harvesting. This research applied integrated optical imaging techniques and machine learning algorithms to develop different models for classifying Siberian wild rye seeds based on different maturity stages and grain positions. The multi-source fusion of morphological, multispectral, and autofluorescence data provided more comprehensive information but also increased the performance requirements of the equipment. Therefore, we employed three filtering algorithms, namely minimal joint mutual information maximization (JMIM), information gain, and Gini impurity, and set up two control methods (feature union and no-filtering) to assess the impact of retaining only 20% of the features on model performance. Both JMIM and information gain identified autofluorescence and morphological features (CIELab A, CIELab B, hue, and saturation), and these two filtering algorithms showed shorter run times. Furthermore, a strong correlation was observed between shoot length and morphological and autofluorescence spectral features. Machine learning models based on linear discriminant analysis (LDA), random forests (RF), and support vector machines (SVM) showed high performance (accuracy > 0.78) in classifying seeds at different maturity stages. Furthermore, considerable variation was found among the different grain positions at the maturity stage, and the K-means approach was used to improve model performance by 5.8%-9.24%.
In conclusion, our study demonstrated that feature filtering algorithms combined with machine learning algorithms offer high performance and low cost in identifying seed maturity stages, and that applying K-means techniques to seeds of inconsistent maturity improves classification accuracy. Therefore, this technique could be employed for the classification of seed maturity and superior physiological quality in Siberian wild rye seeds.
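The "retain only 20% of the features" filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: it scores each numeric feature by the largest Gini impurity decrease achievable with a single threshold split (one possible reading of a Gini impurity filter), and the feature names, values, and class labels are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_gini_decrease(values, labels):
    """Largest impurity decrease over all single-threshold splits of one feature."""
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def keep_top_fraction(features, labels, fraction=0.2):
    """Return the names of the best-scoring `fraction` of features (at least one)."""
    scores = {name: best_gini_decrease(col, labels) for name, col in features.items()}
    n_keep = max(1, math.ceil(fraction * len(features)))
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return ordered[:n_keep]


# Hypothetical toy data: six seeds, five features, two maturity classes.
labels = ["early", "early", "early", "late", "late", "late"]
features = {
    "cielab_a": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # separates the classes perfectly
    "area": [5, 1, 4, 2, 6, 3],
    "width": [2, 2, 2, 2, 2, 2],  # constant, carries no information
    "length": [1, 3, 5, 2, 4, 6],
    "gloss": [3, 1, 2, 1, 3, 2],
}
kept = keep_top_fraction(features, labels, fraction=0.2)
```

With five candidate features, a 20% budget keeps exactly one; a univariate filter of this kind is cheap to run but, unlike JMIM, does not account for redundancy between features.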
|
50
|
Parkinson E, Liberatore F, Watkins WJ, Andrews R, Edkins S, Hibbert J, Strunk T, Currie A, Ghazal P. Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data. Front Genet 2023; 14:1158352. [PMID: 37113992 PMCID: PMC10126415 DOI: 10.3389/fgene.2023.1158352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 03/29/2023] [Indexed: 04/29/2023] Open
Abstract
Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical, and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability but are typically targeted only at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. The effect of transcript-level filtering on the robustness and stability of ML-based RNA-Seq classification has yet to be systematically assessed. In this report we examine the impact of filtering out low-count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines, and random forests. We demonstrate that applying a systematic, objective strategy for removing uninformative and potentially biasing biomarkers, representing up to 60% of transcripts in datasets of different sample sizes, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
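The low-count filtering step mentioned in the abstract above can be illustrated with a minimal sketch. The rule used here (keep a transcript only if it reaches `min_count` reads in at least `min_fraction` of samples) and both threshold values are assumptions chosen for illustration; the abstract does not state the authors' actual criteria, and the outlier-based filter is not covered. Gene names and counts are hypothetical.

```python
def filter_low_count(counts, min_count=10, min_fraction=0.5):
    """Keep transcripts with at least `min_count` reads in at least
    `min_fraction` of the samples.

    `counts` maps a transcript name to its per-sample read counts.
    """
    kept = {}
    for gene, values in counts.items():
        n_expressed = sum(1 for v in values if v >= min_count)
        if n_expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Hypothetical read counts for three transcripts across four samples.
counts = {
    "GENE_A": [120, 95, 143, 110],  # consistently expressed
    "GENE_B": [0, 2, 1, 0],         # essentially unexpressed
    "GENE_C": [15, 3, 22, 8],       # expressed in half of the samples
}
filtered = filter_low_count(counts)
strict = filter_low_count(counts, min_fraction=0.75)
```

In a real workflow the filter would be fitted on the training split only, so that the choice of retained transcripts does not leak information from held-out samples into the classifier.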
Affiliation(s)
- Edward Parkinson
  - Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
  - *Correspondence: Edward Parkinson,
- Federico Liberatore
  - Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- W. John Watkins
  - Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
- Robert Andrews
  - Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
- Sarah Edkins
  - Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
- Julie Hibbert
  - Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
  - Medical School, University of Western Australia, Perth, WA, Australia
  - Centre for Molecular Medicine and Innovative Therapeutics, Murdoch University, Perth, WA, Australia
- Tobias Strunk
  - Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
  - Medical School, University of Western Australia, Perth, WA, Australia
  - Neonatal Directorate, Child and Adolescent Health Service, Perth, WA, Australia
- Andrew Currie
  - Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
  - Centre for Molecular Medicine and Innovative Therapeutics, Murdoch University, Perth, WA, Australia
- Peter Ghazal
  - Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
|