1
Wang H, Doumard E, Soule-Dupuy C, Kemoun P, Aligon J, Monsarrat P. Explanations as a New Metric for Feature Selection: A Systematic Approach. IEEE J Biomed Health Inform 2023; 27:4131-4142. [PMID: 37220033] [DOI: 10.1109/jbhi.2023.3279340]
Abstract
With the extensive use of Machine Learning (ML) in the biomedical field, there is an increasing need for Explainable Artificial Intelligence (XAI) to improve transparency and reveal complex hidden relationships between variables for medical practitioners, while meeting regulatory requirements. Feature Selection (FS) is widely used in biomedical ML pipelines to significantly reduce the number of variables while preserving as much information as possible. However, the choice of FS method affects the entire pipeline, including the final prediction explanations, yet very few works investigate the relationship between FS and model explanations. Through a systematic workflow performed on 145 datasets and an illustration on medical data, the present work demonstrates the promising complementarity of two explanation-based metrics (using ranking and influence changes), in addition to accuracy and retention rate, for selecting the most appropriate FS/ML models. Measuring how much explanations differ with and without FS is particularly promising for recommending FS methods. While reliefF generally performs best on average, the optimal choice may vary for each dataset. Positioning FS methods in a three-dimensional space integrating explanation-based metrics, accuracy, and retention rate allows the user to set priorities along each of the dimensions. In biomedical applications, where each medical condition may have its own preferences, this framework makes it possible to offer the healthcare professional the appropriate FS technique, selecting the variables with an important explainable impact, even at the expense of a limited drop in accuracy.
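The ranking-change idea above can be sketched as follows (a minimal illustration, not the authors' implementation: the rankings are hypothetical, and Kendall's tau is hand-rolled to stay dependency-free):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Rank correlation between two feature rankings (1.0 = identical order)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        a = rank_a[i] - rank_a[j]
        b = rank_b[i] - rank_b[j]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical explanation-based importance ranks for the same 5 features,
# computed once on the full model and once after feature selection.
rank_full = [1, 2, 3, 4, 5]
rank_fs = [2, 1, 3, 5, 4]
stability = kendall_tau(rank_full, rank_fs)  # near 1.0 -> explanations preserved
```

A value near 1.0 would indicate that feature selection left the model's explanations essentially intact, which is the property the metric rewards.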
2
Khan Mamun MMR, Elfouly T. Detection of Cardiovascular Disease from Clinical Parameters Using a One-Dimensional Convolutional Neural Network. Bioengineering (Basel) 2023; 10:796. [PMID: 37508823] [PMCID: PMC10376462] [DOI: 10.3390/bioengineering10070796]
Abstract
Heart disease is a significant public health problem, and early detection is crucial for effective treatment and management. Conventional invasive and noninvasive techniques are cumbersome, time-consuming, inconvenient, expensive, and unsuitable for frequent measurement or diagnosis. With the advance of artificial intelligence (AI), new noninvasive techniques emerging in research detect heart conditions using machine learning (ML) and deep learning (DL). Machine learning models have been used with publicly available heart-health datasets from the internet; in contrast, deep learning techniques have recently been applied to analyze electrocardiograms (ECG) or similar vital data to detect heart diseases. Significant limitations of these datasets are their small size, in terms of both patients and features, and the fact that many are imbalanced. Furthermore, trained models need to be more reliable and accurate for use in medical settings. This study proposes a hybrid one-dimensional convolutional neural network (1D CNN), which uses a large dataset accumulated from online survey data and features chosen by feature selection algorithms. The 1D CNN showed better accuracy than contemporary machine learning algorithms and artificial neural networks. The non-coronary heart disease (no-CHD) and CHD validation data showed accuracies of 80.1% and 76.9%, respectively. The model was compared with an artificial neural network, random forest, AdaBoost, and a support vector machine. Overall, the 1D CNN performed better in terms of accuracy, false negative rate, and false positive rate. Similar strategies were applied for four more heart conditions, and the analysis confirmed that the hybrid 1D CNN produced better accuracy.
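A toy sketch of the 1D-convolution step such a network applies to a vector of clinical parameters (all sizes and weights are hypothetical and untrained; a real model would learn them with a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, bias):
    """Valid 1D convolution: slide each kernel across the feature vector,
    then apply ReLU."""
    k = kernels.shape[1]
    out_len = x.shape[0] - k + 1
    out = np.empty((kernels.shape[0], out_len))
    for f, kern in enumerate(kernels):
        for t in range(out_len):
            out[f, t] = x[t:t + k] @ kern + bias[f]
    return np.maximum(out, 0.0)

# Hypothetical setup: 13 clinical parameters, 4 filters of width 3.
x = rng.normal(size=13)
kernels = rng.normal(size=(4, 3))
bias = np.zeros(4)
feature_maps = conv1d(x, kernels, bias)        # shape (4, 11)
logit = feature_maps.ravel() @ rng.normal(size=44)
prob = 1.0 / (1.0 + np.exp(-logit))            # sigmoid -> P(condition)
```

The point of the 1D convolution here is parameter sharing across neighboring tabular features, which is what distinguishes the hybrid 1D CNN from a plain fully connected network.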
Affiliation(s)
- Tarek Elfouly
- Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA
3
Bertolini R, Finch SJ. Stability of filter feature selection methods in data pipelines: a simulation study. International Journal of Data Science and Analytics 2022. [DOI: 10.1007/s41060-022-00373-6]
4
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Frontiers in Bioinformatics 2022; 2:927312. [PMID: 36304293] [PMCID: PMC9580915] [DOI: 10.3389/fbinf.2022.927312]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., a far larger number of features than samples). The generalizability of machine learning models therefore benefits from feature selection, which aims to retain only the most “informative” features and remove noisy, “non-informative,” irrelevant, and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
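One family such reviews cover, wrapper methods, can be illustrated with a toy greedy forward selection (the `score` function is a stand-in assumption for illustration; a real wrapper would score subsets by cross-validated model accuracy):

```python
import numpy as np

def score(X, y, cols):
    """Toy proxy for model accuracy: |correlation| of the column-sum with y."""
    s = X[:, cols].sum(axis=1)
    return abs(np.corrcoef(s, y)[0, 1])

def forward_select(X, y, k):
    """Greedy wrapper: repeatedly add the feature that most improves the score."""
    chosen = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: score(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 10))
X[:, 3] += 2.0 * y                     # plant signal in feature 3
selected = forward_select(X, y, k=2)   # feature 3 should be picked first
```

Filters would instead rank each feature once, independently of any model, which is cheaper but blind to feature interactions; that trade-off is exactly what the review surveys.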
Affiliation(s)
- Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
5
Chatzimparmpas A, Martins RM, Kucher K, Kerren A. FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches. IEEE Transactions on Visualization and Computer Graphics 2022; 28:1773-1791. [PMID: 34990365] [DOI: 10.1109/tvcg.2022.3141040]
Abstract
The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data (including complex feature engineering processes) to the presentation and improvement of results, with various algorithms to choose from in every step. Feature engineering in particular can be very beneficial for ML, leading to numerous improvements such as boosting the predictive results, decreasing computational times, reducing excessive noise, and increasing the transparency behind the decisions taken during the training. Despite that, while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), feature engineering support remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to choose the most important features, to transform the original features into powerful alternatives, and to experiment with different feature generation combinations. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subsets of features). The final outcome is the extraction of heavily engineered features, evaluated by multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
6
Leveraging AI and Machine Learning for National Student Survey: Actionable Insights from Textual Feedback to Enhance Quality of Teaching and Learning in UK’s Higher Education. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12010514]
Abstract
Students’ evaluation of teaching, for instance, through feedback surveys, constitutes an integral mechanism for quality assurance and enhancement of teaching and learning in higher education. These surveys usually comprise both the Likert scale and free-text responses. Since the discrete Likert scale responses are easy to analyze, they feature more prominently in survey analyses. However, the free-text responses often contain richer, detailed, and nuanced information with actionable insights. Mining these insights is more challenging, as it requires a higher degree of processing by human experts, making the process time-consuming and resource intensive. Consequently, the free-text analyses are often restricted in scale, scope, and impact. To address these issues, we propose a novel automated analysis framework for extracting actionable information from free-text responses to open-ended questions in student feedback questionnaires. By leveraging state-of-the-art supervised machine learning techniques and unsupervised clustering methods, we implemented our framework as a case study to analyze a large-scale dataset of 4400 open-ended responses to the National Student Survey (NSS) at a UK university. These analyses then led to the identification, design, implementation, and evaluation of a series of teaching and learning interventions over a two-year period. The highly encouraging results demonstrate our approach’s validity and broad (national and international) application potential—covering tertiary education, commercial training, and apprenticeship programs, etc., where textual feedback is collected to enhance the quality of teaching and learning.
7
Bommert A, Welchowski T, Schmid M, Rahnenführer J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform 2021; 23:6366322. [PMID: 34498681] [PMCID: PMC8769710] [DOI: 10.1093/bib/bbab354]
Abstract
Feature selection is crucial for the analysis of high-dimensional data, but benchmark studies for data with a survival outcome are rare. We compare 14 filter methods for feature selection based on 11 high-dimensional gene expression survival data sets. The aim is to provide guidance on the choice of filter methods for other researchers and practitioners. We analyze the accuracy of predictive models that employ the features selected by the filter methods. We also consider the run time, the number of selected features needed to fit models with high predictive accuracy, and the feature selection stability. We conclude that the simple variance filter outperforms all other considered filter methods. This filter selects the features with the largest variance and does not take the survival outcome into account. We also identify the correlation-adjusted regression scores filter as a more elaborate alternative that allows fitting models with similar predictive accuracy. Additionally, we investigate the filter methods based on feature rankings, finding groups of similar filters.
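The winning variance filter is simple enough to sketch directly (data shapes and the planted high-variance columns are hypothetical; note that, as the abstract says, the outcome never enters the computation):

```python
import numpy as np

def variance_filter(X, k):
    """Rank features by sample variance and keep the indices of the top k.

    Mirrors the benchmark's best filter: the survival outcome is ignored."""
    variances = X.var(axis=0, ddof=1)
    return np.argsort(variances)[::-1][:k]

rng = np.random.default_rng(42)
# Hypothetical expression matrix: 50 samples x 1000 genes, mostly low variance.
X = rng.normal(scale=0.1, size=(50, 1000))
X[:, [10, 500, 900]] = rng.normal(scale=5.0, size=(50, 3))  # high-variance genes
kept = variance_filter(X, k=3)
```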
Affiliation(s)
- Andrea Bommert
- Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany
- Thomas Welchowski
- Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), Medical Faculty, University of Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Matthias Schmid
- Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), Medical Faculty, University of Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany
8
Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2019.106839]
9
Xiao Q, Zhong X, Zhong C. Application Research of KNN Algorithm Based on Clustering in Big Data Talent Demand Information Classification. Int J Pattern Recogn 2019. [DOI: 10.1142/s0218001420500159]
Abstract
With the growth of massive data on the current mobile Internet, online recruitment is gradually becoming a new recruitment channel. Effectively mining the available information in massive online recruitment data has become a technical bottleneck for matching educational and social supply with demand. Talent demand information is updated every day, producing a large amount of text data, so managing this information well becomes increasingly important. Manual classification is time-consuming and laborious, and therefore unrealistic; automatic text categorization technology for classifying and managing this information thus becomes particularly important. To break through this bottleneck, a heuristic KNN text categorization algorithm based on ABC (artificial bee colony) is proposed to adjust feature weights, and the similarity between test and training observations is measured with a fuzzy distance measure. First, the recruitment information is segmented, and feature selection and noise elimination are carried out using the term frequency-inverse document frequency (TF-IDF) algorithm and the AP (affinity propagation) clustering algorithm. Finally, the text is classified using the KNN algorithm combined with heuristic search and fuzzy distance measurement. The experimental results show that this method effectively addresses the poor stability and low classification accuracy of the traditional KNN algorithm when categorizing talent demand texts.
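A minimal sketch of feature-weighted KNN with an inverse-distance ("fuzzy") vote, the core of the pipeline above (the vectors are hypothetical, and the weights here are fixed rather than tuned by the artificial bee colony step the paper describes):

```python
import numpy as np

def weighted_knn_predict(x, X_train, y_train, weights, k=3):
    """KNN with per-feature weights and an inverse-distance ('fuzzy') vote."""
    d = np.sqrt(((X_train - x) ** 2 * weights).sum(axis=1))
    nearest = np.argsort(d)[:k]
    votes = {}
    for idx in nearest:
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + 1.0 / (d[idx] + 1e-9)
    return max(votes, key=votes.get)

# Hypothetical TF-IDF-like vectors for 6 job postings in two categories.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.0],
                    [0.1, 0.9], [0.0, 0.8], [0.2, 0.7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
weights = np.array([1.0, 1.0])  # the paper tunes these with ABC search instead
label = weighted_knn_predict(np.array([0.85, 0.1]), X_train, y_train, weights)
```

Learning the `weights` vector (rather than leaving it uniform) is what lets the heuristic search suppress noisy terms that plain KNN treats as equally important.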
Affiliation(s)
- Qingtao Xiao
- Vocational Education Center, The Army Military University, Chongqing, P. R. China
- Xin Zhong
- Mental Health Education and Counseling Center, Chongqing Technology and Business University, Chongqing, P. R. China
- Chenghua Zhong
- College of Environment and Resources, Chongqing Technology and Business University, Chongqing, P. R. China
10
Sabbah T, Selamat A, Selamat MH, Al-Anzi FS, Viedma EH, Krejcar O, Fujita H. Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2017.04.069]
11
Park JH, Kim K. An Information Retrieval Approach for Robust Prediction of Road Surface States. Sensors (Basel) 2017; 17:262. [PMID: 28134859] [PMCID: PMC5335980] [DOI: 10.3390/s17020262]
Abstract
Due to the increasing importance of reducing severe vehicle accidents on roads (especially on highways), the automatic identification of road surface conditions, and the provisioning of such information to drivers in advance, have been gaining significant momentum as a proactive solution to decrease the number of vehicle accidents. In this paper, we first propose an information retrieval approach that identifies road surface states by combining conventional machine-learning techniques and moving average methods. Specifically, when signal information is received from a radar system, our approach estimates the current state of the road surface from similar instances observed previously, using a given similarity function. The estimated state is then calibrated using the recently estimated states to yield both effective and robust predictions. To validate the performance of the proposed approach, we established a real-world experimental setting on a section of actual highway in South Korea and conducted a comparison with conventional approaches in terms of accuracy. The experimental results show that the proposed approach successfully outperforms the previously developed methods.
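The retrieve-then-calibrate idea can be sketched as follows (the state encoding, similarity function, and window size are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def retrieve_estimate(signal, past_signals, past_states):
    """Estimate the road state from the most similar past radar signal."""
    sims = -np.abs(past_signals - signal).sum(axis=1)  # negative L1 distance
    return past_states[np.argmax(sims)]

def calibrate(estimates, window=3):
    """Smooth raw estimates with a moving average over recent predictions."""
    recent = estimates[-window:]
    return sum(recent) / len(recent)

# Hypothetical encoding: 0 = dry, 1 = icy; 3-dim radar feature vectors.
past_signals = np.array([[0.1, 0.2, 0.1], [0.9, 0.8, 0.9], [0.15, 0.1, 0.2]])
past_states = np.array([0.0, 1.0, 0.0])
raw = [retrieve_estimate(np.array(s), past_signals, past_states)
       for s in ([0.1, 0.15, 0.1], [0.12, 0.2, 0.15], [0.88, 0.85, 0.9])]
state = calibrate(raw)  # smoothed estimate, robust to a single noisy retrieval
```

The moving-average step is what gives the method its robustness: a single mis-retrieved instance only shifts the calibrated state by 1/window.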
Affiliation(s)
- Jae-Hyung Park
- ICT Convergence R & D Center, Metabuild Co., Ltd., 5F 1487-6 Seocho-3dong, Seocho-gu, Seoul 06708, Korea.
- Kwanho Kim
- Department of Industrial and Management Engineering, College of Engineering, Incheon National University, Incheon 22012, Korea.
12
Rosenkrantz AB, Doshi AM, Ginocchio LA, Aphinyanaphongs Y. Use of a Machine-learning Method for Predicting Highly Cited Articles Within General Radiology Journals. Acad Radiol 2016; 23:1573-1581. [PMID: 27692588] [DOI: 10.1016/j.acra.2016.08.011]
Abstract
RATIONALE AND OBJECTIVES This study aimed to assess the performance of a text classification machine-learning model in predicting highly cited articles within the recent radiological literature and to identify the model's most influential article features. MATERIALS AND METHODS We downloaded from PubMed the title, abstract, and medical subject heading terms for 10,065 articles published in 25 general radiology journals in 2012 and 2013. Three machine-learning models were applied to predict the top 10% of included articles in terms of the number of citations to the article in 2014 (reflecting the 2-year time window in conventional impact factor calculations). The model having the highest area under the curve was selected to derive a list of article features (words) predicting high citation volume, which was iteratively reduced to identify the smallest possible core feature list maintaining predictive power. Overall themes were qualitatively assigned to the core features. RESULTS The regularized logistic regression (Bayesian binary regression) model had highest performance, achieving an area under the curve of 0.814 in predicting articles in the top 10% of citation volume. We reduced the initial 14,083 features to 210 features that maintain predictivity. These features corresponded with topics relating to various imaging techniques (eg, diffusion-weighted magnetic resonance imaging, hyperpolarized magnetic resonance imaging, dual-energy computed tomography, computed tomography reconstruction algorithms, tomosynthesis, elastography, and computer-aided diagnosis), particular pathologies (prostate cancer; thyroid nodules; hepatic adenoma, hepatocellular carcinoma, non-alcoholic fatty liver disease), and other topics (radiation dose, electroporation, education, general oncology, gadolinium, statistics). 
CONCLUSIONS Machine learning can be successfully applied to create specific feature-based models for predicting articles likely to achieve high influence within the radiological literature.
Affiliation(s)
- Andrew B Rosenkrantz
- Department of Radiology, NYU Langone Medical Center, 660 First Avenue, 3rd Floor, New York, NY 10016.
- Ankur M Doshi
- Department of Radiology, NYU Langone Medical Center, 660 First Avenue, 3rd Floor, New York, NY 10016
- Luke A Ginocchio
- Department of Radiology, NYU Langone Medical Center, 660 First Avenue, 3rd Floor, New York, NY 10016
- Yindalon Aphinyanaphongs
- Center for Healthcare Innovation and Delivery Science, NYU Langone Medical Center, New York, New York
13
Surkis A, Hogle JA, DiazGranados D, Hunt JD, Mazmanian PE, Connors E, Westaby K, Whipple EC, Adamus T, Mueller M, Aphinyanaphongs Y. Classifying publications from the clinical and translational science award program along the translational research spectrum: a machine learning approach. J Transl Med 2016; 14:235. [PMID: 27492440] [PMCID: PMC4974725] [DOI: 10.1186/s12967-016-0992-8]
Abstract
BACKGROUND Translational research is a key area of focus of the National Institutes of Health (NIH), as demonstrated by the substantial investment in the Clinical and Translational Science Award (CTSA) program. The goal of the CTSA program is to accelerate the translation of discoveries from the bench to the bedside and into communities. Different classification systems have been used to capture the spectrum of basic to clinical to population health research, with substantial differences in the number of categories and their definitions. Evaluation of the effectiveness of the CTSA program and of translational research in general is hampered by the lack of rigor in these definitions and their application. This study adds rigor to the classification process by creating a checklist to evaluate publications across the translational spectrum and operationalizes these classifications by building machine learning-based text classifiers to categorize these publications. METHODS Based on collaboratively developed definitions, we created a detailed checklist for categories along the translational spectrum from T0 to T4. We applied the checklist to CTSA-linked publications to construct a set of coded publications for use in training machine learning-based text classifiers to classify publications within these categories. The training sets combined T1/T2 and T3/T4 categories due to low frequency of these publication types compared to the frequency of T0 publications. We then compared classifier performance across different algorithms and feature sets and applied the classifiers to all publications in PubMed indexed to CTSA grants. To validate the algorithm, we manually classified the articles with the top 100 scores from each classifier. RESULTS The definitions and checklist facilitated classification and resulted in good inter-rater reliability for coding publications for the training set. 
Very good performance was achieved for the classifiers as represented by the area under the receiver operating curves (AUC), with an AUC of 0.94 for the T0 classifier, 0.84 for T1/T2, and 0.92 for T3/T4. CONCLUSIONS The combination of definitions agreed upon by five CTSA hubs, a checklist that facilitates more uniform definition interpretation, and algorithms that perform well in classifying publications along the translational spectrum provide a basis for establishing and applying uniform definitions of translational research categories. The classification algorithms allow publication analyses that would not be feasible with manual classification, such as assessing the distribution and trends of publications across the CTSA network and comparing the categories of publications and their citations to assess knowledge transfer across the translational research spectrum.
Affiliation(s)
- Alisa Surkis
- Health Sciences Library, NYU School of Medicine, New York, USA
- Janice A. Hogle
- Institute for Clinical and Translational Research, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, USA
- Joe D. Hunt
- Indiana Clinical and Translational Sciences Institute, Indiana University School of Medicine, Indianapolis, USA
- Emily Connors
- Clinical and Translational Science Institute, Medical College of Wisconsin, Milwaukee, USA
- Kate Westaby
- Wisconsin Partnership Program, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, USA
- Elizabeth C. Whipple
- Ruth Lilly Medical Library, Indiana University School of Medicine, Indianapolis, USA
- Trisha Adamus
- Ebling Library for the Health Sciences, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, USA
- Meridith Mueller
- Population Health Sciences, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, USA
15
Ko Y. A new term-weighting scheme for text classification using the odds of positive and negative class probabilities. J Assoc Inf Sci Technol 2015. [DOI: 10.1002/asi.23338]
Affiliation(s)
- Youngjoong Ko
- Computer Engineering; Dong-A University; Busan 604-714 Korea
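An odds-based term weight of the kind Ko's title describes can be sketched as a smoothed log odds-ratio (this exact formula and the smoothing constant are assumptions for illustration; the paper's scheme may differ in its details):

```python
import math

def odds_weight(df_pos, df_neg, n_pos, n_neg, smooth=0.5):
    """Log odds-ratio weight: terms frequent in the positive class and rare in
    the negative class get large weights (smoothed to avoid division by zero)."""
    p = (df_pos + smooth) / (n_pos + 2 * smooth)   # P(term | positive class)
    q = (df_neg + smooth) / (n_neg + 2 * smooth)   # P(term | negative class)
    return math.log((p / (1 - p)) / (q / (1 - q)))

# Hypothetical document frequencies over 100 positive and 100 negative docs.
w_discriminative = odds_weight(80, 5, 100, 100)  # large positive weight
w_neutral = odds_weight(50, 50, 100, 100)        # 0.0: term carries no signal
```

Unlike TF-IDF, which only penalizes globally common terms, this weight is class-aware: a term common everywhere gets weight zero even if its document frequency is high.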