1. Saleem S, Asim MN, Van Elst L, Junker M, Dengel A. MLR-predictor: a versatile and efficient computational framework for multi-label requirements classification. Front Artif Intell 2024; 7:1481581. [PMID: 39664103 PMCID: PMC11632133 DOI: 10.3389/frai.2024.1481581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2024] [Accepted: 11/05/2024] [Indexed: 12/13/2024]
Abstract
Introduction: Requirements classification is essential to developing successful software because it helps ensure that all relevant aspects of users' needs are incorporated. It also aids in identifying project failure risks and helps teams reach project milestones more comprehensively. Several machine learning predictors have been developed for binary or multi-class requirements classification. However, only a few predictors are designed for multi-label classification, and their low predictive performance limits their practical use. Method: MLR-Predictor uses the OkapiBM25 model to transform requirements text into statistical vectors by computing informative word patterns. The predictor then transforms the multi-label requirements classification data into a multi-class classification problem and uses a logistic regression classifier to categorize requirements. Its performance is evaluated and compared with 123 machine learning and 9 deep learning-based predictive pipelines across three public benchmark requirements classification datasets using eight evaluation measures. Results: The large-scale experimental results demonstrate that the proposed MLR-Predictor outperforms the 123 machine learning and 9 deep learning predictive pipelines, as well as the state-of-the-art requirements classification predictor. Specifically, compared with the state-of-the-art predictor, it achieves a 13% improvement in macro F1-measure on the PROMISE dataset, a 1% improvement on the EHR-binary dataset, and a 2.5% improvement on the EHR-multiclass dataset. Discussion: As a case study, the generalizability of the proposed predictor is evaluated on software customer review classification data. In this context, the proposed predictor outperformed the state-of-the-art BERT language model by 1.4% in F1-score. These findings underscore the robustness and effectiveness of the proposed MLR-Predictor in various contexts, establishing its utility as a promising solution for the requirements classification task.
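The two preprocessing ideas named in this abstract, BM25 term weighting and turning a multi-label problem into a multi-class one, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state which BM25 variant or which multi-label transformation the authors use, so the sketch picks a common BM25 formula (with `k1` and `b` set to typical defaults) and the label-powerset transformation, in which each distinct label combination becomes one class. The example requirements and label names are invented.

```python
import math
from collections import Counter


def bm25_vectors(docs, k1=1.5, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    k1 and b are typical default values, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = {}
        for term, f in tf.items():
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = f + k1 * (1 - b + b * len(d) / avg_len)
            vec[term] = idf * f * (k1 + 1) / norm
        vectors.append(vec)
    return vectors


def label_powerset(label_sets):
    """Map each distinct label combination to a single class id."""
    class_ids = {}
    encoded = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        encoded.append(class_ids[key])
    return encoded, class_ids


# Invented requirements and labels, for illustration only.
docs = [
    "the system shall encrypt all stored passwords".split(),
    "the system shall respond within two seconds".split(),
    "the system shall encrypt traffic and respond within two seconds".split(),
]
labels = [{"security"}, {"performance"}, {"security", "performance"}]

vectors = bm25_vectors(docs)
y, class_ids = label_powerset(labels)
```

In this sketch a rare word such as "passwords" receives a higher weight than a word that appears in every requirement, and the resulting single-label targets `y` could be passed to any multi-class classifier, such as the logistic regression the abstract mentions.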
Affiliation(s)
- Summra Saleem
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, Germany
  - German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
- Ludger Van Elst
  - German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
- Markus Junker
  - German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
- Andreas Dengel
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, Germany
  - German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
2. Bagies T. Classifying software security requirements into confidentiality, integrity, and availability using machine learning approaches. PeerJ Comput Sci 2024; 10:e2554. [PMID: 39650452 PMCID: PMC11623117 DOI: 10.7717/peerj-cs.2554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Accepted: 11/05/2024] [Indexed: 12/11/2024]
Abstract
Security requirements are considered one of the most important non-functional requirements of software. The CIA (confidentiality, integrity, and availability) triad forms the basis for the development of security systems. Each dimension is expressed through many security requirements that should be designed, implemented, and tested. However, requirements are written in natural language and may suffer from ambiguity and inconsistency, which makes it harder to distinguish between different security dimensions. Recognizing the security dimensions in a requirements document should facilitate tracing the requirements and ensuring that a dimension has been implemented in a software system. This process should be automated to reduce time and effort for software engineers. In this paper, we propose to classify security requirements into the CIA triad dimensions using term frequency-inverse document frequency (TF-IDF) and sentence-transformer embeddings as two different feature extraction techniques. For both techniques, we developed five models using five well-known machine learning algorithms: (1) support vector machine (SVM), (2) K-nearest neighbors (KNN), (3) Random Forest (RF), (4) gradient boosting (GB), and (5) Bernoulli Naive Bayes (BNB). We also developed a web interface that facilitates real-time analysis and classifies security requirements into the CIA triad dimensions. Our results revealed that SVM with the sentence-transformer technique outperformed all other classifiers, achieving 87% accuracy in predicting the security dimension.
Affiliation(s)
- Taghreed Bagies
  - Information Technology, Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
3. Al-Fraihat D, Sharrab Y, Al-Ghuwairi AR, Sbaih N, Qahmash A. Detecting refactoring type of software commit messages based on ensemble machine learning algorithms. Sci Rep 2024; 14:21367. [PMID: 39266651 PMCID: PMC11392950 DOI: 10.1038/s41598-024-72307-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 09/05/2024] [Indexed: 09/14/2024]
Abstract
Refactoring is a well-established topic in contemporary software engineering, focusing on enhancing software's structural design without altering its external behavior. Commit messages play a vital role in tracking changes to the codebase. However, determining the exact refactoring required in the code can be challenging due to various refactoring types. Prior studies have attempted to classify refactoring documentation by type, achieving acceptable results in accuracy, precision, recall, F1-Score, and other performance metrics. Nevertheless, there is room for improvement. To address this, we propose a novel approach using four ensemble Machine Learning algorithms to detect refactoring types. Our experimentation utilized a dataset containing 573 commits, with text cleaning and preprocessing applied to address data imbalances. Various techniques, including hyperparameter optimization, feature engineering with TF-IDF and bag-of-words, and binary transformation using one-vs-one and one-vs-rest classifiers, were employed to enhance accuracy. Results indicate that the experiment involving feature engineering using the TF-IDF technique outperformed other methods. Notably, the XGBoost algorithm with the same technique achieved superior performance across all metrics, attaining 100% accuracy. Moreover, our results surpass the current state-of-the-art performance using the same dataset. Our proposed approach bears significant implications for software engineering, particularly in enhancing the internal quality of software.
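The "binary transformation using one-vs-one and one-vs-rest" mentioned in this abstract refers to two standard ways of decomposing a multi-class problem into binary ones. The sketch below shows only that decomposition step, under stated assumptions: the function names are hypothetical, and the refactoring type labels are invented because the abstract does not list the actual classes.

```python
from itertools import combinations


def one_vs_rest(labels):
    """One binary target vector per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_one(labels):
    """One binary sub-problem per class pair, restricted to samples of those two classes.

    Each value is (sample indices, binary targets), where 1 marks the first class of the pair.
    """
    classes = sorted(set(labels))
    problems = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return problems


# Invented refactoring-type labels for five commits, for illustration only.
labels = ["extract", "move", "rename", "move", "extract"]

ovr = one_vs_rest(labels)
ovo = one_vs_one(labels)
```

With k classes, one-vs-rest trains k binary classifiers on all samples, while one-vs-one trains k(k-1)/2 classifiers, each on the subset of samples belonging to one pair of classes.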
Affiliation(s)
- Dimah Al-Fraihat
  - Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, 11622, Jordan
- Yousef Sharrab
  - Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Isra University, Amman, Jordan
- Abdel-Rahman Al-Ghuwairi
  - Department of Software Engineering, Faculty of Prince Al-Hussien Bin Abdallah II for Information Technology, The Hashemite University, Zarqa, Jordan
- Nour Sbaih
  - Department of Software Engineering, Faculty of Prince Al-Hussien Bin Abdallah II for Information Technology, The Hashemite University, Zarqa, Jordan
- Ayman Qahmash
  - Department of Information Systems, King Khalid University, Abha, Saudi Arabia
4. Laison EKE, Hamza Ibrahim M, Boligarla S, Li J, Mahadevan R, Ng A, Muthuramalingam V, Lee WY, Yin Y, Nasri BR. Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis. J Med Internet Res 2023; 25:e47014. [PMID: 37843893 PMCID: PMC10616745 DOI: 10.2196/47014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 07/26/2023] [Accepted: 08/31/2023] [Indexed: 10/17/2023]
Abstract
BACKGROUND Lyme disease is among the most reported tick-borne diseases worldwide, making it a major ongoing public health concern. An effective Lyme disease case reporting system depends on timely diagnosis and reporting by health care professionals, and accurate laboratory testing and interpretation for clinical diagnosis validation. A lack of these can lead to delayed diagnosis and treatment, which can exacerbate the severity of Lyme disease symptoms. Therefore, there is a need to improve the monitoring of Lyme disease by using other data sources, such as web-based data. OBJECTIVE We analyzed global Twitter data to understand its potential and limitations as a tool for Lyme disease surveillance. We propose a transformer-based classification system to identify potential Lyme disease cases using self-reported tweets. METHODS Our initial sample included 20,000 tweets collected worldwide from a database of over 1.3 million Lyme disease tweets. After preprocessing and geolocating tweets, tweets in a subset of the initial sample were manually labeled as potential Lyme disease cases or non-Lyme disease cases using carefully selected keywords. Emojis were converted to sentiment words, which were then replaced in the tweets. This labeled tweet set was used for the training, validation, and performance testing of DistilBERT (distilled version of BERT [Bidirectional Encoder Representations from Transformers]), ALBERT (A Lite BERT), and BERTweet (BERT for English Tweets) classifiers. RESULTS The empirical results showed that BERTweet was the best classifier among all evaluated models (average F1-score of 89.3%, classification accuracy of 90.0%, and precision of 97.1%). However, for recall, term frequency-inverse document frequency and k-nearest neighbors performed better (93.2% and 82.6%, respectively). 
On using emojis to enrich the tweet embeddings, BERTweet had an increased recall (8% increase), DistilBERT had an increased F1-score of 93.8% (4% increase) and classification accuracy of 94.1% (4% increase), and ALBERT had an increased F1-score of 93.1% (5% increase) and classification accuracy of 93.9% (5% increase). The general awareness of Lyme disease was high in the United States, the United Kingdom, Australia, and Canada, with self-reported potential cases of Lyme disease from these countries accounting for around 50% (9939/20,000) of the collected English-language tweets, whereas Lyme disease-related tweets were rare in countries from Africa and Asia. The most reported Lyme disease-related symptoms in the data were rash, fatigue, fever, and arthritis, while symptoms such as lymphadenopathy, palpitations, swollen lymph nodes, neck stiffness, and arrhythmia were uncommon, in accordance with Lyme disease symptom frequency. CONCLUSIONS The study highlights the robustness of BERTweet and DistilBERT as classifiers for potential cases of Lyme disease from self-reported data. The results demonstrated that emojis are effective for enrichment, thereby improving the accuracy of tweet embeddings and the performance of classifiers. Specifically, emojis reflecting sadness, empathy, and encouragement can reduce false negatives.
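The emoji enrichment step described in the methods (emojis converted to sentiment words and substituted into the tweet text) might look roughly like the sketch below. The mapping table, the function name, and the sample tweet are all hypothetical; the abstract does not say which emoji lexicon or replacement rule the authors used.

```python
# Hypothetical emoji-to-sentiment-word table; the study's actual lexicon is not given.
EMOJI_TO_WORD = {
    "\U0001F622": "sad",      # crying face
    "\U0001F64F": "hopeful",  # folded hands
    "\U0001F4AA": "strong",   # flexed biceps
}


def replace_emojis(text, mapping=EMOJI_TO_WORD):
    """Replace each known emoji with its sentiment word and normalize whitespace."""
    for emoji, word in mapping.items():
        text = text.replace(emoji, f" {word} ")
    return " ".join(text.split())


# Invented example tweet.
tweet = "Diagnosed with Lyme today\U0001F622 keep fighting\U0001F4AA"
cleaned = replace_emojis(tweet)
```

The cleaned text can then be tokenized like any other tweet, so the sentiment carried by the emoji reaches a text-only classifier as an ordinary word.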
Affiliation(s)
- Elda Kokoe Elolo Laison
  - Département de médecine sociale et préventive, École de Santé Publique de l'Université de Montréal, Université de Montréal, Montréal, QC, Canada
- Srikanth Boligarla
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Jiaxin Li
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Raja Mahadevan
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Austen Ng
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Wee Yi Lee
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Yijun Yin
  - Harvard Extension School, Harvard University, Cambridge, MA, United States
- Bouchra R Nasri
  - Département de médecine sociale et préventive, École de Santé Publique de l'Université de Montréal, Université de Montréal, Montréal, QC, Canada
5. A machine learning approach for hierarchical classification of software requirements. Machine Learning with Applications 2023. [DOI: 10.1016/j.mlwa.2023.100457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/03/2023]
6. Huang X, Hu Y. Recognition of Continuous Music Segments Based on the Phase Space Reconstruction Method. Computational Intelligence and Neuroscience 2022; 2022:4099505. [PMID: 36238675 PMCID: PMC9553418 DOI: 10.1155/2022/4099505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 12/15/2021] [Indexed: 11/24/2022]
Abstract
Piano score recognition is an important research topic in music information retrieval and plays an important role in information processing. To reduce the influence of vocals on the progression of piano notes and restore the harmonic information corresponding to piano notes, the article models the harmonic and vocal information corresponding to piano notes in the frequency spectrum. We use the phase space reconstruction method to extract nonlinear feature parameters from the note audio, using part of the parameters as the training set to construct a support vector machine (SVM) classifier and the remainder as the test set to evaluate recognition performance. Adaptive signal decomposition combined with SVM is introduced into the signal preprocessing stage, and the corresponding recognition process is established. To improve the performance of the SVM, the article uses a metric learning method to obtain a learned distance metric, which replaces the Euclidean distance in the SVM's Gaussian kernel function. The SVM method with adaptive signal decomposition and the SVM method with principal component analysis are introduced into the preprocessing of the note signal; the preprocessed signal is then reconstructed in phase space, and the corresponding recognition process is established. The method of directly reconstructing the phase space of the original signal has higher accuracy and can be applied to note recognition in continuous music segments. The final experimental results show that, compared with currently popular piano score recognition algorithms, the proposed algorithm improves recognition accuracy by 3.5% to 12.2%.
Affiliation(s)
- Xuesheng Huang
  - School of Music and Dance, Quanzhou Normal University, Quanzhou, Fujian 362000, China
- YanQing Hu
  - Dean's Office, Quanzhou Normal University, Quanzhou, Fujian 362000, China
7. Research on Product Core Component Acquisition Based on Patent Semantic Network. Entropy 2022; 24:e24040549. [PMID: 35455212 PMCID: PMC9026476 DOI: 10.3390/e24040549] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/19/2022] [Revised: 04/06/2022] [Accepted: 04/07/2022] [Indexed: 02/01/2023]
Abstract
Patent data contain plenty of valuable information. Recently, a lack of innovative ideas has caused some enterprises to encounter bottlenecks in product research and development (R&D). Some enterprises point out that they do not sufficiently understand product components. To improve the efficiency of product R&D, this paper introduces natural-language processing (NLP) technology, which includes part-of-speech (POS) tagging and subject–action–object (SAO) classification. Our strategy first extracts patent keywords from products, then applies a complex network to obtain core components based on structural holes and eigenvector centrality algorithms. Finally, we use the example of US shower patents to verify the effectiveness and feasibility of the methodology. As a result, this paper examines the acquisition of core components and how they can help enterprises and designers clarify their R&D ideas and design priorities.
8. Khurshid I, Imtiaz S, Boulila W, Khan Z, Abbasi A, Javed AR, Jalil Z. Classification of Non-Functional Requirements From IoT Oriented Healthcare Requirement Document. Front Public Health 2022; 10:860536. [PMID: 35372217 PMCID: PMC8974737 DOI: 10.3389/fpubh.2022.860536] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/23/2022] [Accepted: 02/07/2022] [Indexed: 01/03/2023]
Abstract
The Internet of Things (IoT) involves a set of devices that help achieve a smart environment. IoT-oriented healthcare systems provide monitoring services for patients' data and help take immediate steps in an emergency. Currently, machine learning-based techniques are adopted to ensure security and other non-functional requirements in smart healthcare systems. However, no attention has been given to classifying the non-functional requirements from requirement documents. The manual process of classifying non-functional requirements from documents is error-prone and laborious. Missing non-functional requirements in the Requirement Engineering (RE) phase results in an IoT-oriented healthcare system with compromised security and performance. In this research, an experiment is performed in which non-functional requirements are classified from an IoT-oriented healthcare system's requirement document. The machine learning algorithms considered for classification are Logistic Regression (LR), Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), K-Nearest Neighbors (KNN), ensemble, Random Forest (RF), and hybrid KNN rule-based machine learning (ML) algorithms. The results show that our novel hybrid KNN rule-based machine learning algorithm outperforms the others, with an average classification accuracy of 75.9% in classifying non-functional requirements from IoT-oriented healthcare requirement documents. This research is novel not only in its concept of using a machine learning approach to classify non-functional requirements from IoT-oriented healthcare system requirement documents, but also in proposing a novel hybrid KNN rule-based machine learning algorithm that classifies with better accuracy. A new dataset is also created for classification purposes, comprising requirements related to IoT-oriented healthcare systems. However, because this dataset is small, consisting of only 104 requirements, the generalizability of the results of this research may be limited.
Affiliation(s)
- Iqra Khurshid
  - Department of Software Engineering, International Islamic University, Islamabad, Pakistan
- Salma Imtiaz
  - Department of Software Engineering, International Islamic University, Islamabad, Pakistan
- Wadii Boulila
  - Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
  - *Correspondence: Wadii Boulila
- Zahid Khan
  - Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
- Almas Abbasi
  - Department of Software Engineering, International Islamic University, Islamabad, Pakistan
- Abdul Rehman Javed
  - Department of Cyber Security, Air University, Islamabad, Pakistan
- Zunera Jalil
  - Department of Cyber Security, Air University, Islamabad, Pakistan
9. Peketi V, Satti S. ARCORE: A Requirements Dataset for Service Identification. Big Data Analytics 2022. [DOI: 10.1007/978-3-031-24094-2_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/29/2023]
10. One- and Two-Phase Software Requirement Classification Using Ensemble Deep Learning. Entropy 2021; 23:e23101264. [PMID: 34681988 PMCID: PMC8535052 DOI: 10.3390/e23101264] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/08/2021] [Revised: 09/27/2021] [Accepted: 09/27/2021] [Indexed: 12/11/2022]
Abstract
Recently, deep learning (DL) has been utilized successfully in different fields, achieving remarkable results. Thus, there is a noticeable focus on DL approaches to automate software engineering (SE) tasks such as maintenance, requirement extraction, and classification. An advanced utilization of DL is the ensemble approach, which aims to reduce error rates and learning time and improve performance. In this research, three ensemble approaches were applied: accuracy as a weight ensemble, mean ensemble, and accuracy per class as a weight ensemble. They were combined with four different DL models, namely long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), a gated recurrent unit (GRU), and a convolutional neural network (CNN), to classify software requirement (SR) specifications: the binary classification of SRs into functional requirements (FRs) or non-functional requirements (NFRs), and the multi-label classification of both FRs and NFRs into further experimental classes. The models were trained and tested on the PROMISE dataset. A one-phase classification system was developed to classify SRs directly into one of the 17 multi-classes of FRs and NFRs. In addition, a two-phase classification system was developed to classify SRs first into FRs or NFRs and to pass the output to the second phase of multi-class classification into 17 classes. The experimental results demonstrated that the proposed classification systems can lead to a competitive classification performance compared to the state-of-the-art methods. The two-phase classification system proved its robustness against the one-phase classification system, as it obtained a 95.7% accuracy in the binary classification phase and a 93.4% accuracy in the second phase of NFR and FR multi-class classification.
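Two of the ensemble strategies named in this abstract, the mean ensemble and the accuracy-as-weight ensemble, can be sketched as simple combinations of per-model class probabilities. This is an illustrative reading of those names rather than the authors' implementation: the function names are hypothetical, the probabilities and validation accuracies are made up, and the per-class accuracy variant is not shown.

```python
def mean_ensemble(probs):
    """Average the class-probability vectors produced by several models."""
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]


def accuracy_weighted_ensemble(probs, accuracies):
    """Weight each model's probabilities by its (validation) accuracy, then normalize."""
    total = sum(accuracies)
    n_classes = len(probs[0])
    return [
        sum(a * p[c] for p, a in zip(probs, accuracies)) / total
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up outputs of four models (e.g. LSTM, BiLSTM, GRU, CNN) for one requirement, 3 classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
]
# Made-up validation accuracies, in the same model order.
accuracies = [0.9, 0.6, 0.9, 0.6]

mean_pred = argmax(mean_ensemble(probs))
weighted_pred = argmax(accuracy_weighted_ensemble(probs, accuracies))
```

In this toy input the plain mean favors class 1, while weighting by accuracy shifts the decision to class 0 because the two more accurate models agree on it.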
11. Enhancing Software Feature Extraction Results Using Sentiment Analysis to Aid Requirements Reuse. Computers 2021. [DOI: 10.3390/computers10030036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Recently, feature extraction from user reviews has been used for requirements reuse to improve the software development process. However, research has yet to use sentiment analysis in the extraction for it to be well understood. The aim of this study is to improve software feature extraction results by using sentiment analysis. Our study's novelty lies in the correlation between feature extraction from user reviews and the results of sentiment analysis for requirements reuse. This study can inform system analysis in the requirements elicitation process. Our proposal uses user reviews for software feature extraction and incorporates sentiment analysis and similarity measures in the process. Experimental results show that the extracted features used to expand existing requirements may come from both positive and negative sentiments. However, extracted features with positive sentiment overall have better values than those with negative sentiment: 90% compared to 63% for the relevance value, 74% compared to 47% for prompting new features, and 55% compared to 26% for verbatim reuse as new requirements.
12. Dhindsa A, Bhatia S, Agrawal S, Sohi BS. An Improvised Machine Learning Model Based on Mutual Information Feature Selection Approach for Microbes Classification. Entropy (Basel, Switzerland) 2021; 23:257. [PMID: 33672252 PMCID: PMC7927045 DOI: 10.3390/e23020257] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/14/2021] [Revised: 02/10/2021] [Accepted: 02/20/2021] [Indexed: 12/11/2022]
Abstract
The accurate classification of microbes is critical in today's context for monitoring the ecological balance of a habitat. Hence, in this research work, a novel method to automate the process of identifying microorganisms has been implemented. To extract the bodies of microorganisms accurately, a generalized segmentation mechanism is proposed that combines a convolution filter (Kirsch) with a variance-based pixel clustering algorithm (Otsu). With exhaustive corroboration, a set of twenty-five features was identified to map the characteristics and morphology of all kinds of microbes. Multiple techniques for feature selection were tested, and it was found that mutual information (MI)-based models gave the best performance. Exhaustive hyperparameter tuning of multilayer perceptron (MLP), k-nearest neighbors (KNN), quadratic discriminant analysis (QDA), logistic regression (LR), and support vector machine (SVM) was done. It was found that SVM radial required further improvisation to attain the maximum possible level of accuracy. Comparative analysis between SVM and improvised SVM (ISVM) through a 10-fold cross validation method ultimately showed that ISVM resulted in a 2% higher performance in terms of accuracy (98.2%), precision (98.2%), recall (98.1%), and F1 score (98.1%).
Affiliation(s)
- Anaahat Dhindsa
  - Department of Electronics and Communication Engineering, Chandigarh University, Gharuan, Punjab 140413, India
  - University Institute of Engineering and Technology, Panjab University, Chandigarh 160014, India
- Sanjay Bhatia
  - Post Graduate Department of Zoology, University of Jammu, Kashmir 180006, India
- Sunil Agrawal
  - University Institute of Engineering and Technology, Panjab University, Chandigarh 160014, India
- Balwinder Singh Sohi
  - Department of Electronics and Communication Engineering, Chandigarh University, Gharuan, Punjab 140413, India
13. Assi K. Traffic Crash Severity Prediction-A Synergy by Hybrid Principal Component Analysis and Machine Learning Models. International Journal of Environmental Research and Public Health 2020; 17:E7598. [PMID: 33086567 PMCID: PMC7589286 DOI: 10.3390/ijerph17207598] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/25/2020] [Revised: 10/14/2020] [Accepted: 10/17/2020] [Indexed: 12/24/2022]
Abstract
The accurate prediction of road traffic crash (RTC) severity contributes to generating crucial information, which can be used to adopt appropriate measures to reduce the aftermath of crashes. This study aims to develop a hybrid system using principal component analysis (PCA) with multilayer perceptron neural networks (MLP-NN) and support vector machines (SVM) in predicting RTC severity. PCA shows that the first nine components have an eigenvalue greater than one. The cumulative variance percentage explained by these principal components was found to be 67%. The prediction accuracies of the models developed using the original attributes were compared with those of the models developed using principal components. It was found that the testing accuracies of MLP-NN and SVM increased from 64.50% and 62.70% to 82.70% and 80.70%, respectively, after using principal components. The proposed models would be beneficial to trauma centers in predicting crash severity with high accuracy so that they would be able to prepare for appropriate and prompt medical treatment.
Affiliation(s)
- Khaled Assi
  - Civil & Environmental Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi Arabia