1
Baniecki H, Sobieski B, Szatkowski P, Bombinski P, Biecek P. Interpretable machine learning for time-to-event prediction in medicine and healthcare. Artif Intell Med 2025; 159:103026. [PMID: 39579416 DOI: 10.1016/j.artmed.2024.103026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Revised: 08/03/2024] [Accepted: 11/15/2024] [Indexed: 11/25/2024]
Abstract
Time-to-event prediction, e.g. cancer survival analysis or hospital length of stay, is a highly prominent machine learning task in medical and healthcare applications. However, only a few interpretable machine learning methods comply with its challenges. To facilitate a comprehensive explanatory analysis of survival models, we formally introduce time-dependent feature effects and global feature importance explanations. We show how post-hoc interpretation methods allow for finding biases in AI systems predicting length of stay using a novel multi-modal dataset created from 1235 X-ray images with textual radiology reports annotated by human experts. Moreover, we evaluate cancer survival models beyond predictive performance to include the importance of multi-omics feature groups based on a large-scale benchmark comprising 11 datasets from The Cancer Genome Atlas (TCGA). Model developers can use the proposed methods to debug and improve machine learning algorithms, while physicians can discover disease biomarkers and assess their significance. We contribute open data and code resources to facilitate future work in the emerging research direction of explainable survival analysis.
Affiliation(s)
- Hubert Baniecki
  - University of Warsaw, Warsaw, Poland; Warsaw University of Technology, Warsaw, Poland
- Bartlomiej Sobieski
  - University of Warsaw, Warsaw, Poland; Warsaw University of Technology, Warsaw, Poland
- Patryk Szatkowski
  - Warsaw University of Technology, Warsaw, Poland; Medical University of Warsaw, Warsaw, Poland
- Przemyslaw Bombinski
  - Warsaw University of Technology, Warsaw, Poland; Medical University of Warsaw, Warsaw, Poland
- Przemyslaw Biecek
  - University of Warsaw, Warsaw, Poland; Warsaw University of Technology, Warsaw, Poland
2
Jenul A, Stokmo HL, Schrunner S, Hjortland GO, Revheim ME, Tomic O. Novel ensemble feature selection techniques applied to high-grade gastroenteropancreatic neuroendocrine neoplasms for the prediction of survival. Comput Methods Programs Biomed 2024; 244:107934. [PMID: 38016391 DOI: 10.1016/j.cmpb.2023.107934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 09/05/2023] [Accepted: 11/17/2023] [Indexed: 11/30/2023]
Abstract
BACKGROUND AND OBJECTIVE Determining the most informative features for predicting the overall survival of patients diagnosed with high-grade gastroenteropancreatic neuroendocrine neoplasms is crucial to improve individual treatment plans for patients, as well as the biological understanding of the disease. The main objective of this study is to evaluate the use of modern ensemble feature selection techniques for this purpose with respect to (a) quantitative performance measures such as predictive performance, (b) clinical interpretability, and (c) the effect of integrating prior expert knowledge. METHODS The Repeated Elastic Net Technique for Feature Selection (RENT) and the User-Guided Bayesian Framework for Feature Selection (UBayFS) are recently developed ensemble feature selectors investigated in this work. Both allow the user to identify informative features in datasets with low sample sizes and focus on model interpretability. While RENT is purely data-driven, UBayFS can integrate expert knowledge a priori in the feature selection process. In this work, we compare both feature selectors on a dataset comprising 63 patients and 110 features from multiple sources, including baseline patient characteristics, baseline blood values, tumor histology, imaging, and treatment information. RESULTS Our experiments involve data-driven and expert-driven setups, as well as combinations of both. In a five-fold cross-validated experiment without expert knowledge, our results demonstrate that both feature selectors allow accurate predictions: A reduction from 110 to approximately 20 features (around 82%) delivers near-optimal predictive performances with minor variations according to the choice of the feature selector, the predictive model, and the fold. Thereafter, we use findings from clinical literature as a source of expert knowledge. 
In addition, expert knowledge has a stabilizing effect on the feature set (an increase in stability of approximately 40%), while the impact on predictive performance is limited. CONCLUSIONS The features WHO Performance Status, Albumin, Platelets, Ki-67, Tumor Morphology, Total MTV, Total TLG, and SUVmax are the most stable and predictive features in our study. Overall, this study demonstrated the practical value of feature selection in medical applications not only to improve quantitative performance but also to deliver potentially new insights to experts.
Affiliation(s)
- Anna Jenul
  - Department of Data Science, Norwegian University of Life Sciences, Universitetstunet 3, 1433 Ås, Norway
- Henning Langen Stokmo
  - Department of Nuclear Medicine, Division of Radiology and Nuclear Medicine, Oslo University Hospital, Oslo, Norway; Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Stefan Schrunner
  - Department of Data Science, Norwegian University of Life Sciences, Universitetstunet 3, 1433 Ås, Norway
- Mona-Elisabeth Revheim
  - Department of Nuclear Medicine, Division of Radiology and Nuclear Medicine, Oslo University Hospital, Oslo, Norway; Institute of Clinical Medicine, University of Oslo, Oslo, Norway; The Intervention Centre, Division of Technology and Innovation, Oslo University Hospital, Oslo, Norway
- Oliver Tomic
  - Department of Data Science, Norwegian University of Life Sciences, Universitetstunet 3, 1433 Ås, Norway
3
Rashad M, Afifi I, Abdelfatah M. RbQE: An Efficient Method for Content-Based Medical Image Retrieval Based on Query Expansion. J Digit Imaging 2023; 36:1248-1261. [PMID: 36702987 PMCID: PMC10287886 DOI: 10.1007/s10278-022-00769-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2022] [Revised: 12/18/2022] [Accepted: 12/19/2022] [Indexed: 01/27/2023] Open
Abstract
Systems for retrieving and managing content-based medical images are becoming more important, especially as medical imaging technology advances and the medical image database grows. In addition, these systems can also use medical images to better grasp and gain a deeper understanding of the causes and treatments of different diseases, not just for diagnostic purposes. For achieving all these purposes, there is a critical need for an efficient and accurate content-based medical image retrieval (CBMIR) method. This paper proposes an efficient method (RbQE) for the retrieval of computed tomography (CT) and magnetic resonance (MR) images. RbQE is based on expanding the features of querying and exploiting the pre-trained learning models AlexNet and VGG-19 to extract compact, deep, and high-level features from medical images. There are two searching procedures in RbQE: a rapid search and a final search. In the rapid search, the original query is expanded by retrieving the top-ranked images from each class and is used to reformulate the query by calculating the mean values for deep features of the top-ranked images, resulting in a new query for each class. In the final search, the new query that is most similar to the original query will be used for retrieval from the database. The performance of the proposed method has been compared to state-of-the-art methods on four publicly available standard databases, namely, TCIA-CT, EXACT09-CT, NEMA-CT, and OASIS-MRI. Experimental results show that the proposed method exceeds the compared methods by 0.84%, 4.86%, 1.24%, and 14.34% in average retrieval precision (ARP) for the TCIA-CT, EXACT09-CT, NEMA-CT, and OASIS-MRI databases, respectively.
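The two-stage search described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the deep features have already been extracted (the abstract names AlexNet and VGG-19, which are not reproduced here), and the use of cosine similarity, the function names `cosine_sim` and `rbqe_retrieve`, and the parameters `k_expand` and `k_final` are all assumptions made for the example.

```python
import numpy as np


def cosine_sim(query, feats):
    """Cosine similarity between one query vector and each row of feats."""
    q = query / np.linalg.norm(query)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ q


def rbqe_retrieve(query, feats, labels, k_expand=3, k_final=5):
    """Query-expansion retrieval in two stages (hypothetical sketch).

    Rapid search: for each class, average the features of the k_expand
    images most similar to the original query, giving one new query per class.
    Final search: keep the new query closest to the original query and rank
    the whole database against it.
    """
    sims = cosine_sim(query, feats)
    best_query, best_sim = None, -np.inf
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        top = idx[np.argsort(-sims[idx])[:k_expand]]   # top-ranked images of class c
        new_query = feats[top].mean(axis=0)            # reformulated query for class c
        s = cosine_sim(query, new_query[None, :])[0]   # closeness to the original query
        if s > best_sim:
            best_query, best_sim = new_query, s
    final_sims = cosine_sim(best_query, feats)
    return np.argsort(-final_sims)[:k_final]


# Toy database: three images per class, two-dimensional "deep" features.
feats = np.array([[1.0, 0.1], [0.9, 0.0], [1.0, 0.2],
                  [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
query = np.array([1.0, 0.05])
retrieved = rbqe_retrieve(query, feats, labels, k_expand=2, k_final=3)
```

Averaging the top-ranked features per class smooths out the noise of a single query image, which is one plausible reading of why the expanded query retrieves more precisely than the original.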
Affiliation(s)
- Metwally Rashad
  - Department of Computer Science, Faculty of Computers & Artificial Intelligence, Benha University, Benha, Egypt
  - Faculty of Artificial Intelligence, Delta University for Science and Technology, Gamasa, Egypt
- Ibrahem Afifi
  - Department of Information System, Faculty of Computers & Artificial Intelligence, Benha University, Benha, Egypt
- Mohammed Abdelfatah
  - Department of Information System, Faculty of Computers & Artificial Intelligence, Benha University, Benha, Egypt
4
Rajput D, Wang WJ, Chen CC. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics 2023; 24:48. [PMID: 36788550 PMCID: PMC9926644 DOI: 10.1186/s12859-023-05156-9] [Citation(s) in RCA: 53] [Impact Index Per Article: 26.5] [Received: 10/13/2022] [Accepted: 01/23/2023] [Indexed: 02/16/2023] Open
Abstract
BACKGROUND An appropriate sample size is essential for obtaining a precise and reliable outcome of a study. In machine learning (ML), studies with inadequate samples suffer from overfitting of data and have a lower probability of producing true effects, while the increment in sample size increases the accuracy of prediction but may not cause a significant change after a certain sample size. Existing statistical approaches using standardized mean difference, effect size, and statistical power for determining sample size are potentially biased due to miscalculations or lack of experimental details. This study aims to design criteria for evaluating sample size in ML studies. We examined the average and grand effect sizes and the performance of five ML methods using simulated datasets and three real datasets to derive the criteria for sample size. We systematically increase the sample size, starting from 16, by randomly sampling and examine the impact of sample size on classifiers' performance and both effect sizes. Tenfold cross-validation was used to quantify the accuracy. RESULTS The results demonstrate that the effect sizes and the classification accuracies increase while the variances in effect sizes shrink with the increment of samples when the datasets have a good discriminative power between two classes. By contrast, indeterminate datasets had poor effect sizes and classification accuracies, which did not improve by increasing sample size in both simulated and real datasets. A good dataset exhibited a significant difference in average and grand effect sizes. We derived two criteria based on the above findings to assess a decided sample size by combining the effect size and the ML accuracy. The sample size is considered suitable when it has appropriate effect sizes (≥ 0.5) and ML accuracy (≥ 80%). 
After an appropriate sample size, the increment in samples will not benefit as it will not significantly change the effect size and accuracy, thereby resulting in a good cost-benefit ratio. CONCLUSION We believe that these practical criteria can be used as a reference for both the authors and editors to evaluate whether the selected sample size is adequate for a study.
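The two thresholds quoted in this abstract (effect size ≥ 0.5 and ML accuracy ≥ 80%) can be turned into a small check. The sketch below is illustrative only: the abstract does not define how the average and grand effect sizes are computed, so the example assumes the mean absolute Cohen's d across features for a two-class dataset, and it takes the cross-validated accuracy as a number supplied by the caller. The function names are hypothetical.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def average_effect_size(X, y):
    """Mean absolute Cohen's d over all features (an assumed definition)."""
    X0, X1 = X[y == 0], X[y == 1]
    return float(np.mean([abs(cohens_d(X1[:, j], X0[:, j])) for j in range(X.shape[1])]))


def sample_size_is_suitable(effect_size, accuracy, min_effect=0.5, min_accuracy=0.80):
    """Apply the two thresholds quoted in the abstract."""
    return effect_size >= min_effect and accuracy >= min_accuracy


# Toy two-class dataset with a single feature; the class means differ by one
# pooled standard deviation, so the effect size is 1.0.
X = np.array([[1.0], [2.0], [3.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
effect = average_effect_size(X, y)
suitable = sample_size_is_suitable(effect, accuracy=0.85)
```

In practice the accuracy passed in would come from the tenfold cross-validation the abstract mentions, evaluated at the sample size under review.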
Affiliation(s)
- Daniyal Rajput
  - Institute of Cognitive Neuroscience, National Central University, Zhongda Rd, No. 300, Zhongli District, Taoyuan City, 320317, Taiwan, ROC
  - Taiwan International Graduate Program in Interdisciplinary Neuroscience, National Central University and Academia Sinica, Taipei, Taiwan, ROC
- Wei-Jen Wang
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan, ROC
- Chun-Chuan Chen
  - Institute of Cognitive Neuroscience, National Central University, Zhongda Rd, No. 300, Zhongli District, Taoyuan City, 320317, Taiwan, ROC
  - Department of Biomedical Sciences and Engineering, National Central University, Taoyuan, Taiwan, ROC
5
Abstract
Brain surgery offers the best chance of seizure-freedom for patients with focal drug-resistant epilepsy, but only 50% achieve sustained seizure-freedom. With the explosion of data collected during routine presurgical evaluations and recent advances in computational science, we now have a tremendous potential to achieve precision epilepsy surgery: a data-driven tailoring of surgical planning. This review highlights the clinical need, the relevant computational science focusing on machine learning, and discusses some specific applications in epilepsy surgery.
Affiliation(s)
- Lara Jehi
  - Cleveland Clinic Ringgold Standard Institution, Cleveland, OH, USA
6
Hu R, Zhou XJ, Li W. Computational Analysis of High-Dimensional DNA Methylation Data for Cancer Prognosis. J Comput Biol 2022; 29:769-781. [PMID: 35671506 PMCID: PMC9419965 DOI: 10.1089/cmb.2022.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
Developing cancer prognostic models using multiomics data is a major goal of precision oncology. DNA methylation provides promising prognostic biomarkers, which have been used to predict survival and treatment response in solid tumor or plasma samples. This review article presents an overview of recently published computational analyses on DNA methylation for cancer prognosis. To address the challenges of survival analysis with high-dimensional methylation data, various feature selection methods have been applied to screen a subset of informative markers. Using candidate markers associated with survival, prognostic models either predict risk scores or stratify patients into subtypes. The model's discriminatory power can be assessed by multiple evaluation metrics. Finally, we discuss the limitations of existing studies and present the prospects of applying machine learning algorithms to fully exploit the prognostic value of DNA methylation.
Affiliation(s)
- Ran Hu
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Bioinformatics Interdepartmental Graduate Program, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Xianghong Jasmine Zhou
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Wenyuan Li
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
7
Dogra V, Verma S, Kavita, Chatterjee P, Shafi J, Choi J, Ijaz MF. A Complete Process of Text Classification System Using State-of-the-Art NLP Models. Comput Intell Neurosci 2022; 2022:1883698. [PMID: 35720939 PMCID: PMC9203176 DOI: 10.1155/2022/1883698] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/04/2022] [Revised: 04/20/2022] [Accepted: 05/09/2022] [Indexed: 11/30/2022]
Abstract
With the rapid advancement of information technology, online information has been exponentially growing day by day, especially in the form of text documents such as news events, company reports, reviews on products, stocks-related reports, medical reports, tweets, and so on. Due to this, online monitoring and text mining has become a prominent task. During the past decade, significant efforts have been made on mining text documents using machine and deep learning models such as supervised, semisupervised, and unsupervised. Our area of the discussion covers state-of-the-art learning models for text mining or solving various challenging NLP (natural language processing) problems using the classification of texts. This paper summarizes several machine learning and deep learning algorithms used in text classification with their advantages and shortcomings. This paper would also help the readers understand various subtasks, along with old and recent literature, required during the process of text classification. We believe that readers would be able to find scope for further improvements in the area of text classification or to propose new techniques of text classification applicable in any domain of their interest.
Affiliation(s)
- Varun Dogra
  - School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
- Sahil Verma
  - Department of Computer Science and Engineering, Chandigarh University, Mohali 140413, India
  - Bio and Health Informatics Research Lab, Chandigarh University, Mohali 140413, India
- Kavita
  - Department of Computer Science and Engineering, Chandigarh University, Mohali 140413, India
  - Machine Learning and Data Science Research Lab, Chandigarh University, Mohali 140413, India
- Jana Shafi
  - Department of Computer Science, College of Arts and Science, Prince Sattam Bin Abdul Aziz University, Wadi Ad-Dwasir 11991, Saudi Arabia
- Jaeyoung Choi
  - School of Computing, Gachon University, Seongnam-si 13120, Republic of Korea
- Muhammad Fazal Ijaz
  - Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Republic of Korea
8
Jung JO, Crnovrsanin N, Wirsik NM, Nienhüser H, Peters L, Popp F, Schulze A, Wagner M, Müller-Stich BP, Büchler MW, Schmidt T. Machine learning for optimized individual survival prediction in resectable upper gastrointestinal cancer. J Cancer Res Clin Oncol 2022; 149:1691-1702. [PMID: 35616729 PMCID: PMC10097798 DOI: 10.1007/s00432-022-04063-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 05/09/2022] [Indexed: 11/29/2022]
Abstract
PURPOSE Surgical oncologists are frequently confronted with the question of expected long-term prognosis. The aim of this study was to apply machine learning algorithms to optimize survival prediction after oncological resection of gastroesophageal cancers. METHODS Eligible patients underwent oncological resection of gastric or distal esophageal cancer between 2001 and 2020 at Heidelberg University Hospital, Department of General Surgery. Machine learning methods such as multi-task logistic regression and survival forests were compared with usual algorithms to establish an individual estimation. RESULTS The study included 117 variables with a total of 1360 patients. The overall missingness was 1.3%. Out of eight machine learning algorithms, the random survival forest (RSF) performed best with a concordance index of 0.736 and an integrated Brier score of 0.166. The RSF demonstrated a mean area under the curve (AUC) of 0.814 over a time period of 10 years after diagnosis. The most important long-term outcome predictor was lymph node ratio with a mean AUC of 0.730. A numeric risk score was calculated by the RSF for each patient and three risk groups were defined accordingly. Median survival time was 18.8 months in the high-risk group, 44.6 months in the medium-risk group and above 10 years in the low-risk group. CONCLUSION The results of this study suggest that RSF is most appropriate to accurately answer the question of long-term prognosis. Furthermore, we could establish a compact risk score model with 20 input parameters and thus provide a clinical tool to improve prediction of oncological outcome after upper gastrointestinal surgery.
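For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C-index for right-censored data. It is a plain O(n²) illustration rather than the software used in the study; the function name and the convention of ignoring pairs with tied times are assumptions of this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (sketch).

    A pair (i, j) is comparable when subject i had an observed event and
    time_i < time_j. The pair is concordant when the subject who failed
    earlier was given the higher risk score; tied scores count as 0.5.
    Pairs with tied times are ignored in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Three patients: the first two died at months 1 and 2, the third is censored at month 3.
times = [1, 2, 3]
events = [1, 1, 0]
c_perfect = concordance_index(times, events, [3.0, 2.0, 1.0])   # risk ordering matches outcomes
c_reversed = concordance_index(times, events, [1.0, 2.0, 3.0])  # risk ordering is inverted
c_constant = concordance_index(times, events, [1.0, 1.0, 1.0])  # uninformative model
```

A value of 0.5 corresponds to an uninformative ranking and 1.0 to a perfect one, which puts the reported 0.736 in context.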
Affiliation(s)
- Jin-On Jung
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
  - Department of General, Visceral and Cancer Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937, Cologne, Germany
- Nerma Crnovrsanin
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Naita Maren Wirsik
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
  - Department of General, Visceral and Cancer Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937, Cologne, Germany
- Henrik Nienhüser
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Leila Peters
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Felix Popp
  - Department of General, Visceral and Cancer Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937, Cologne, Germany
- André Schulze
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Martin Wagner
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Beat Peter Müller-Stich
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Markus Wolfgang Büchler
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
- Thomas Schmidt
  - Department of General, Visceral and Transplantation Surgery, University Hospital of Heidelberg, Im Neuenheimer Feld 420, 69120, Heidelberg, Germany
  - Department of General, Visceral and Cancer Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937, Cologne, Germany
9
Fan Z, Chiong R, Hu Z, Keivanian F, Chiong F. Body fat prediction through feature extraction based on anthropometric and laboratory measurements. PLoS One 2022; 17:e0263333. [PMID: 35192644 PMCID: PMC8863283 DOI: 10.1371/journal.pone.0263333] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/01/2021] [Accepted: 01/17/2022] [Indexed: 01/15/2023] Open
Abstract
Obesity, associated with having excess body fat, is a critical public health problem that can cause serious diseases. Although a range of techniques for body fat estimation have been developed to assess obesity, these typically involve high-cost tests requiring special equipment. Thus, the accurate prediction of body fat percentage based on easily accessed body measurements is important for assessing obesity and its related diseases. By considering the characteristics of different features (e.g. body measurements), this study investigates the effectiveness of feature extraction for body fat prediction. It evaluates the performance of three feature extraction approaches by comparing four well-known prediction models. Experimental results based on two real-world body fat datasets show that the prediction models perform better on incorporating feature extraction for body fat prediction, in terms of the mean absolute error, standard deviation, root mean square error and robustness. These results confirm that feature extraction is an effective pre-processing step for predicting body fat. In addition, statistical analysis confirms that feature extraction significantly improves the performance of prediction methods. Moreover, the increase in the number of extracted features results in further, albeit slight, improvements to the prediction models. The findings of this study provide a baseline for future research in related areas.
Affiliation(s)
- Zongwen Fan
  - School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
  - College of Computer Science and Technology, Huaqiao University, Xiamen, China
- Raymond Chiong
  - School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
- Zhongyi Hu
  - School of Information Management, Wuhan University, Wuhan, China
- Farshid Keivanian
  - School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
10
Boškoski P, Perne M, Rameša M, Boshkoska BM. Variational Bayes survival analysis for unemployment modelling. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107335] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/20/2022]
11
Topolski M. Application of Feature Extraction Methods for Chemical Risk Classification in the Pharmaceutical Industry. Sensors (Basel) 2021; 21:s21175753. [PMID: 34502644 PMCID: PMC8434006 DOI: 10.3390/s21175753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/20/2021] [Revised: 08/20/2021] [Accepted: 08/21/2021] [Indexed: 11/25/2022]
Abstract
The features that are used in the classification process are acquired from sensor data on the production site (associated with toxic, physicochemical properties) and also a dataset associated with cybersecurity that may affect the above-mentioned risk. These are large datasets, so it is important to reduce them. The author’s motivation was to develop a method of assessing the dimensionality of features based on correlation measures and the discriminant power of features allowing for a more accurate reduction of their dimensions compared to the classical Kaiser criterion and assessment of scree plot. The method proved to be promising. The results obtained in the experiments demonstrate that the quality of classification after extraction is better than using classical criteria for estimating the number of components and features. Experiments were carried out for various extraction methods, demonstrating that the rotation of factors according to centroids of a class in this classification task gives the best risk assessment of chemical threats. The classification quality increased by about 7% compared to a model where feature extraction was not used and resulted in an improvement of 4% compared to the classical PCA method with the Kaiser criterion, with an evaluation of the scree plot. Furthermore, it has been shown that there is a certain subspace of cybersecurity features, which complemented with the features of the concentration of volatile substances, affects the risk assessment of chemical hazards. The identified cybersecurity factors are the number of packets lost, incorrect Logins, incorrect sensor responses, increased email spam, and excessive traffic in the computer network. To visualize the speed of classification in real-time, simulations were carried out for various systems used in Industry 4.0.
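The classical Kaiser criterion that this abstract uses as its baseline can be sketched in a few lines: retain the principal components of the correlation matrix whose eigenvalues exceed 1. The example shows only that baseline, not the author's proposed correlation- and discriminant-based method, and the function name and toy data are assumptions.

```python
import numpy as np


def kaiser_n_components(X):
    """Number of components kept by the Kaiser criterion (sketch).

    Computes the eigenvalues of the feature correlation matrix and counts
    those greater than 1, i.e. components that explain more variance than
    a single standardized feature.
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted from largest to smallest
    return int(np.sum(eigvals > 1.0)), eigvals


# Toy data: four features built from two independent signals, so two
# components should carry almost all of the variance.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([
    a, a + 0.01 * rng.normal(size=200),
    b, b + 0.01 * rng.normal(size=200),
])
n_keep, eigvals = kaiser_n_components(X)
```

A scree plot of `eigvals` would show the same drop after the second component, which is the visual counterpart the abstract refers to.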
Affiliation(s)
- Mariusz Topolski
  - Department of Systems and Computer Networks, Faculty of Electronics, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
12
Wang J, Chen N, Guo J, Xu X, Liu L, Yi Z. SurvNet: A Novel Deep Neural Network for Lung Cancer Survival Analysis With Missing Values. Front Oncol 2021; 10:588990. [PMID: 33552965 PMCID: PMC7855857 DOI: 10.3389/fonc.2020.588990] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/30/2020] [Accepted: 12/04/2020] [Indexed: 02/05/2023] Open
Abstract
Survival analysis is important for guiding further treatment and improving lung cancer prognosis. It is a challenging task because of the poor distinguishability of features and the missing values in practice. A novel multi-task based neural network, SurvNet, is proposed in this paper. The proposed SurvNet model is trained in a multi-task learning framework to jointly learn across three related tasks: input reconstruction, survival classification, and Cox regression. It uses an input reconstruction mechanism cooperating with incomplete-aware reconstruction loss for latent feature learning of incomplete data with missing values. Besides, the SurvNet model introduces a context gating mechanism to bridge the gap between survival classification and Cox regression. A new real-world dataset of 1,137 patients with IB-IIA stage non-small cell lung cancer is collected to evaluate the performance of the SurvNet model. The proposed SurvNet achieves a higher concordance index than the traditional Cox model and Cox-Net. The difference between high-risk and low-risk groups obtained by SurvNet is more significant than that of high-risk and low-risk groups obtained by the other models. Moreover, the SurvNet outperforms the other models even though the input data is randomly cropped and it achieves better generalization performance on the Surveillance, Epidemiology, and End Results Program (SEER) dataset.
Affiliation(s)
- Jianyong Wang
  - Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
- Nan Chen
  - Department of Thoracic Surgery, West China Hospital and West China School of Medicine, Sichuan University, Chengdu, China
- Jixiang Guo
  - Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
- Xiuyuan Xu
  - Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
- Lunxu Liu
  - Department of Thoracic Surgery, West China Hospital and West China School of Medicine, Sichuan University, Chengdu, China
- Zhang Yi
  - Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
13
Spooner A, Chen E, Sowmya A, Sachdev P, Kochan NA, Trollor J, Brodaty H. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Sci Rep 2020; 10:20410. [PMID: 33230128 PMCID: PMC7683682 DOI: 10.1038/s41598-020-77220-w] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Received: 01/14/2020] [Accepted: 11/05/2020] [Indexed: 12/22/2022] Open
Abstract
Data collected from clinical trials and cohort studies, such as dementia studies, are often high-dimensional, censored, heterogeneous and contain missing information, presenting challenges to traditional statistical analysis. There is an urgent need for methods that can overcome these challenges to model this complex data. At present there is no cure for dementia and no treatment that can successfully change the course of the disease. Machine learning models that can predict the time until a patient develops dementia are important tools in helping understand dementia risks and can give more accurate results than traditional statistical methods when modelling high-dimensional, heterogeneous, clinical data. This work compares the performance and stability of ten machine learning algorithms, combined with eight feature selection methods, capable of performing survival analysis of high-dimensional, heterogeneous, clinical data. We developed models that predict survival to dementia using baseline data from two different studies. The Sydney Memory and Ageing Study (MAS) is a longitudinal cohort study of 1037 participants, aged 70-90 years, that aims to determine the effects of ageing on cognition. The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a longitudinal study aimed at identifying biomarkers for the early detection and tracking of Alzheimer's disease. Using the concordance index as a measure of performance, our models achieve maximum performance values of 0.82 for MAS and 0.93 for ADNI.
Affiliation(s)
- Annette Spooner
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Emily Chen
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Arcot Sowmya
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Perminder Sachdev
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
- Nicole A Kochan
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
- Julian Trollor
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
  - Department of Developmental Disability Neuropsychiatry, School of Psychiatry, UNSW Sydney, Sydney, Australia
- Henry Brodaty
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
|
14
|
Keramati A, Lu P, Iranitalab A, Pan D, Huang Y. A crash severity analysis at highway-rail grade crossings: The random survival forest method. ACCIDENT; ANALYSIS AND PREVENTION 2020; 144:105683. [PMID: 32659490 DOI: 10.1016/j.aap.2020.105683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2019] [Revised: 05/21/2020] [Accepted: 07/06/2020] [Indexed: 06/11/2023]
Abstract
This paper proposes a machine learning approach, the random survival forest (RSF) for competing risks, to investigate highway-rail grade crossing (HRGC) crash severity during a 29-year analysis period. The benefits of the RSF approach are that it (1) is a special type of survival analysis able to accommodate the competing nature of multiple-event outcomes to the same event of interest (here the competing multiple events are crash severities), (2) is able to conduct an event-specific selection of risk factors, (3) has the capability to determine long-term cumulative effects of contributors with the cumulative incidence function (CIF), (4) provides high prediction performance, and (5) is effective in high-dimensional settings. The RSF approach is able to consider complexities in HRGC safety analysis, e.g., non-linear relationships between HRGC crash severities and the contributing factors and heterogeneity in data. The variable importance (VIMP) technique is adopted in this research for selecting the most predictive contributors for each crash-severity level. Moreover, marginal effect analysis results reveal the effectiveness of several HRGC countermeasures. Several insightful findings are discovered. For example, adding stop signs to HRGCs that already have a combination of gate, standard flashing lights, and audible devices will reduce the likelihood of property damage only (PDO) crashes for up to seven years; but after the seventh year, the crossings are more likely to have PDO crashes. Adding audible devices to crossings with gates and standard flashing lights will reduce crash likelihood, PDO, injury, and fatal crashes by 49 %, 52 %, 46 %, and 50 %, respectively.
Affiliation(s)
- Amin Keramati
  - Upper Great Plains Transportation Institute, Dept. 2880, North Dakota State University, Fargo, ND 58108-6050, USA.
- Pan Lu
  - Department of Transportation, Logistics, and Finance, Upper Great Plains Transportation Institute, North Dakota State University, Fargo, ND 58108-6050, USA.
- Amirfarrokh Iranitalab
  - Impact Research LLC, 10480 Little Patuxent Parkway, Suite 1050 (Corporate 40), Columbia, MD 21044, USA.
- Danguang Pan
  - Department of Civil Engineering, University of Science and Technology Beijing, Beijing 100083, China.
- Ying Huang
  - Department of Civil and Environmental Engineering, North Dakota State University, Fargo, ND 58108-6050, USA.
|
15
|
Machine Learning Applied to Diagnosis of Human Diseases: A Systematic Review. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10155135] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/20/2022]
Abstract
Human healthcare is one of the most important topics for society. It seeks correct, effective and robust disease detection as early as possible so that patients receive the appropriate care. Because this detection is often a difficult task, the medical field needs to seek support from other fields such as statistics and computer science. These disciplines are facing the challenge of exploring new techniques, going beyond the traditional ones. The large number of techniques that are emerging makes it necessary to provide a comprehensive overview that avoids very particular aspects. To this end, we propose a systematic review dealing with Machine Learning applied to the diagnosis of human diseases. This review focuses on modern techniques related to the development of Machine Learning applied to the diagnosis of human diseases in the medical field, in order to discover interesting patterns, make non-trivial predictions and support decision-making. In this way, this work can help researchers to discover and, if necessary, determine the applicability of machine learning techniques in their particular specialties. We provide some examples of the algorithms used in medicine, analysing some trends that are focused on the goal sought, the algorithm used, and the area of application. We detail the advantages and disadvantages of each technique to help choose the most appropriate in each real-life situation, as several authors have reported. The authors searched the Scopus, Journal Citation Reports (JCR), Google Scholar, and MedLine databases from the last decades (from approximately the 1980s) up to the present, with English language restrictions, for studies according to the objectives mentioned above. Based on a protocol for data extraction defined and evaluated by all authors using PRISMA methodology, 141 papers were included in this advanced review.
|
16
|
Gauthama Raman MR, Nivethitha S, Kannan K, Shankar Sriram VS. A hybrid approach using rough set theory and hypergraph for feature selection on high-dimensional medical datasets. Soft comput 2019. [DOI: 10.1007/s00500-019-03818-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/27/2022]
|
17
|
Xu X, Liang T, Zhu J, Zheng D, Sun T. Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.100] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 01/21/2023]
|
18
|
A Systematic Mapping Study of Data Preparation in Heart Disease Knowledge Discovery. J Med Syst 2018; 43:17. [PMID: 30542772 DOI: 10.1007/s10916-018-1134-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/26/2018] [Accepted: 12/03/2018] [Indexed: 01/25/2023]
Abstract
The increasing amount of data produced by various biomedical and healthcare systems has led to a need for methodologies related to knowledge data discovery. Data mining (DM) offers a set of powerful techniques that allow the identification and extraction of relevant information from medical datasets, thus enabling doctors and patients to greatly benefit from DM, particularly in the case of diseases with high mortality and morbidity rates, such as heart disease (HD). Nonetheless, the use of raw medical data implies several challenges, such as missing data, noise, redundancy and high dimensionality, which make the extraction of useful and relevant information difficult and challenging. Intensive research has, therefore, recently begun in order to prepare raw healthcare data before knowledge extraction. In any knowledge data discovery (KDD) process, data preparation is the step prior to DM that deals with data imperfectness in order to improve its quality so as to satisfy the requirements and improve the performance of DM techniques. The objective of this paper is to perform a systematic mapping study (SMS) on data preparation for KDD in cardiology so as to provide an overview of the quantity and type of research carried out in this respect. The SMS consisted of a set of 58 selected papers published between January 2000 and December 2017. The selected studies were analyzed according to six criteria: year and channel of publication, preparation task, medical task, DM objective, research type and empirical type. The results show that a large amount of data preparation research was carried out in order to improve the performance of DM-based decision support systems in cardiology. Researchers were mainly interested in the data reduction preparation task and particularly in feature selection. Moreover, the majority of the selected studies focused on classification for the diagnosis of HD. Two main research types were identified in the selected studies: solution proposal and evaluation research, and the most frequently used empirical type was that of historical-based evaluation.
|
19
|
Idri A, Benhar H, Fernández-Alemán JL, Kadi I. A systematic map of medical data preprocessing in knowledge discovery. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 162:69-85. [PMID: 29903496 DOI: 10.1016/j.cmpb.2018.05.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 11/08/2017] [Revised: 04/25/2018] [Accepted: 05/03/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE: Data mining (DM) has, over the last decade, received increased attention in the medical domain and has been widely used to analyze medical datasets in order to extract useful knowledge and previously unknown patterns. However, historical medical data can often comprise inconsistent, noisy, imbalanced, missing and high dimensional data. These challenges lead to a serious bias in predictive modeling and reduce the performance of DM techniques. Data preprocessing is, therefore, an essential step in knowledge discovery as regards improving the quality of data and making it appropriate and suitable for DM techniques. The objective of this paper is to review the use of preprocessing techniques in clinical datasets. METHODS: We performed a systematic map of studies regarding the application of data preprocessing to healthcare and published between January 2000 and December 2017. A search string was determined on the basis of the mapping questions and the PICO categories. The search string was then applied in digital databases covering the fields of computer science and medical informatics in order to identify relevant studies. The studies were initially selected by reading their titles, abstracts and keywords. Those that were selected at that stage were then reviewed using a set of inclusion and exclusion criteria in order to eliminate any that were not relevant. This process resulted in 126 primary studies. RESULTS: Selected studies were analyzed and classified according to their publication years and channels, research type, empirical type and contribution type. The findings of this mapping study revealed that researchers have paid a considerable amount of attention to preprocessing in medical DM in the last decade. A significant number of the selected studies used data reduction and cleaning preprocessing tasks. Moreover, the disciplines in which preprocessing has received most attention are: cardiology, endocrinology and oncology. CONCLUSIONS: Researchers should develop and implement standards for an effective integration of multiple medical data types. Moreover, we identified the need to perform literature reviews.
Affiliation(s)
- A Idri
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- H Benhar
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- J L Fernández-Alemán
  - Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
- I Kadi
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
|
20
|
Pang S, Orgun MA, Yu Z. A novel biomedical image indexing and retrieval system via deep preference learning. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 158:53-69. [PMID: 29544790 DOI: 10.1016/j.cmpb.2018.02.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/01/2017] [Revised: 11/23/2017] [Accepted: 02/02/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVES: The traditional biomedical image retrieval methods, as well as content-based image retrieval (CBIR) methods originally designed for non-biomedical images, either consider only pixel and low-level features to describe an image or use deep features to describe images but still leave a lot of room for improving both accuracy and efficiency. In this work, we propose a new approach, which exploits deep learning technology to extract high-level and compact features from biomedical images. The deep feature extraction process leverages multiple hidden layers to capture substantial feature structures of high-resolution images and represent them at different levels of abstraction, leading to an improved performance for indexing and retrieval of biomedical images. METHODS: We exploit the currently popular multi-layered deep neural networks, namely, stacked denoising autoencoders (SDAE) and convolutional neural networks (CNN), to represent the discriminative features of biomedical images by transferring the feature representations and parameters of pre-trained deep neural networks from another domain. Moreover, in order to index all the images for finding the similarly referenced images, we also introduce preference learning technology to train and learn a kind of preference model for the query image, which can output the similarity ranking list of images from a biomedical image database. To the best of our knowledge, this paper introduces preference learning technology for the first time into biomedical image retrieval. RESULTS: We evaluate the performance of two powerful algorithms based on our proposed system and compare them with those of popular biomedical image indexing approaches and existing regular image retrieval methods with detailed experiments over several well-known public biomedical image databases. Based on different criteria for the evaluation of retrieval performance, experimental results demonstrate that our proposed algorithms outperform the state-of-the-art techniques in indexing biomedical images. CONCLUSIONS: We propose a novel and automated indexing system based on deep preference learning to characterize biomedical images for developing computer aided diagnosis (CAD) systems in healthcare. Our proposed system shows an outstanding indexing ability and high efficiency for biomedical image retrieval applications and it can be used to collect and annotate the high-resolution images in a biomedical database for further biomedical image research and applications.
Affiliation(s)
- Shuchao Pang
  - College of Computer Science and Technology, Jilin University, Qianjin Street: 2699, Jilin Province, China; Department of Computing, Macquarie University, Sydney, NSW 2109, Australia.
- Mehmet A Orgun
  - Department of Computing, Macquarie University, Sydney, NSW 2109, Australia.
- Zhezhou Yu
  - College of Computer Science and Technology, Jilin University, Qianjin Street: 2699, Jilin Province, China.
|
21
|
Wang H, Li G. A Selective Review on Random Survival Forests for High Dimensional Data. QUANTITATIVE BIO-SCIENCE 2017; 36:85-96. [PMID: 30740388 PMCID: PMC6364686 DOI: 10.22283/qbs.2017.36.2.85] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Indexed: 11/24/2022]
Abstract
Over the past decades, there has been considerable interest in applying statistical machine learning methods in survival analysis. Ensemble-based approaches, especially random survival forests, have been developed in a variety of contexts due to their high precision and non-parametric nature. This article aims to provide a timely review on recent developments and applications of random survival forests for time-to-event data with high dimensional covariates. This selective review begins with an introduction to the random survival forest framework, followed by a survey of recent developments on splitting criteria, variable selection, and other advanced topics of random survival forests for time-to-event data in high dimensional settings. We also discuss potential directions for future research.
Affiliation(s)
- Hong Wang
  - School of Mathematics and Statistics, Central South University, Hunan 410083, China
- Gang Li
  - Department of Biostatistics and Biomathematics, School of Public Health, University of California at Los Angeles, CA 90095, USA
|