1. Najadat H, Alzubaidi MA, Qarqaz I. Detecting Arabic Spam Reviews in Social Networks Based on Classification Algorithms. ACM T ASIAN LOW-RESO 2022. [DOI: 10.1145/3476115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Reviews or comments that users leave on social media have great importance for companies and business entities. New product ideas can be evaluated based on customer reactions. However, this use of social media is complicated by those who post spam on social media in the form of reviews and comments.
Designing methodologies to automatically detect and block social media spam is complicated by the fact that spammers continuously develop new ways to leave their spam comments. Researchers have proposed several methods to detect English spam reviews. However, few studies have been conducted to detect Arabic spam reviews. This article proposes a keyword-based method for detecting Arabic spam reviews. Keywords, or features, are subsets of words from the original text that are labelled as important. Term weights, the Term Frequency–Inverse Document Frequency (TF-IDF) matrix, and filter methods (such as information gain, chi-squared, deviation, correlation, and uncertainty) are used to extract keywords from Arabic text.
The method proposed in this article detects Arabic spam in Facebook comments. The dataset consists of 3,000 Arabic comments extracted from Facebook pages. Four machine learning algorithms are used in the detection process: C4.5, kNN, SVM, and Naïve Bayes. The results show that the Decision Tree classifier (C4.5) outperforms the other classification algorithms, with a detection accuracy of 92.63%.
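As a rough illustration of the TF-IDF weighting named in the abstract, the following minimal sketch scores terms and picks the highest-weighted ones as keywords. It is not the authors' pipeline: the toy English tokens, the unsmoothed `tf * log(N / df)` formula, and the helper names `tfidf` and `top_keywords` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def top_keywords(doc_weights, k=2):
    """Pick the k highest-weighted terms of one document (ties broken alphabetically)."""
    return sorted(doc_weights, key=lambda t: (-doc_weights[t], t))[:k]

# Hypothetical toy comments standing in for tokenized Arabic text.
docs = [
    "win free prize click link".split(),
    "great product fast delivery".split(),
    "free free offer click now".split(),
]
weights = tfidf(docs)
keywords = top_keywords(weights[2], k=2)
```

Terms that appear in many comments (such as `free` here) receive a lower inverse document frequency, so rarer terms tend to rank higher as keywords; a filter method such as chi-squared would additionally use the spam labels.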
Affiliation(s)
- Hassan Najadat: Jordan University of Science and Technology, Irbid, Jordan
- Islam Qarqaz: Jordan University of Science and Technology, Irbid, Jordan
2. Chen X, Yuan Y, Orgun MA. Using Bayesian networks with hidden variables for identifying trustworthy users in social networks. J Inf Sci 2019. [DOI: 10.1177/0165551519857590] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/16/2022]
Abstract
The popularity and broad accessibility of online social networks (OSNs) have facilitated effective communication among people, but such networks also pose potential risks that should not be ignored. Interaction through OSNs is complex and can be unsafe, as individuals can be contacted by strangers at any time. This makes the notion of trust a crucial issue in the use of OSNs. However, compared with decision-making processes associated with whether to trust a stranger encountered in everyday life, this task is more difficult to address with regard to OSNs due to the lack of face-to-face communication and prior knowledge between people. In this article, trust evaluation is formalised as a classification problem. We demonstrate how user profiles and historical records can be organised into a logical structure based on Bayesian networks to recognise the trustworthy people without the need to build trust relationships in OSNs. This is possible when a more detailed description of features denoted by hidden variables is considered. We compare the performance of our method with those of six other machine learning methods using Facebook and Twitter datasets, and our results show that our method achieves higher values in accuracy, recall and F1 score.
Affiliation(s)
- Xu Chen: Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, School of Software, Beijing University of Posts and Telecommunications, China
- Yuyu Yuan: Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, School of Software, Beijing University of Posts and Telecommunications, China
3. A unified score propagation model for web spam demotion algorithm. INFORM RETRIEVAL J 2017. [DOI: 10.1007/s10791-017-9307-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
4. Herzallah W, Faris H, Adwan O. Feature engineering for detecting spammers on Twitter: Modelling and analysis. J Inf Sci 2017. [DOI: 10.1177/0165551516684296] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
Abstract
Twitter is a social networking website that has gained a lot of popularity around the world in the last decade. This popularity made Twitter a common target for spammers and malicious users to spread unwanted advertisements, viruses and phishing attacks. In this article, we review the latest research works to determine the most effective features that were investigated for spam detection in the literature. These features are collected to build a comprehensive data set that can be used to develop more robust and accurate spammer detection models. The new data set is tested using popular classifiers (Naive Bayes, support vector machines, multilayer perceptron neural networks, Decision Trees, Random forests and k-Nearest Neighbour). The prediction performance of these classifiers is evaluated and compared based on different evaluation metrics. Moreover, a further analysis is carried out to identify the features that have higher impact on the accuracy of spam detection. Three different techniques are used and compared for this analysis: change of mean square error (CoM), information gain (IG) and Relief-F method. Top five features identified by each technique are used again to build the detection models. Experimental results show that most of the developed classifiers obtained high evaluation results based on the comprehensive data set constructed in this work. Experiments also reveal the important role of some features like the reputation of the account, average length of the tweet, average mention per tweet, age of the account, and the average time between posts in the process of identifying spammers in the social network.
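Information gain (IG), one of the three feature-ranking techniques mentioned in the abstract, can be sketched as the drop in label entropy after splitting on a feature. The example below is only an illustration under stated assumptions: the binary features `young_account` and `long_tweets`, the four toy accounts, and the function names are hypothetical and do not come from the paper's data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: four accounts, two discretized features.
labels = ["spam", "spam", "ham", "ham"]
features = {
    "young_account": [1, 1, 0, 0],  # separates the classes perfectly
    "long_tweets": [1, 0, 1, 0],    # carries no information about the class
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
```

Ranking features by this score and keeping the top few is one way to read the "top five features" step; continuous features would first need to be discretized, which this sketch does not cover.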
Affiliation(s)
- Wafa Herzallah: Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan
- Hossam Faris: Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan
- Omar Adwan: Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan
5. Al-Badarneh A, Al-Shawakfa E, Bani-Ismail B, Al-Rababah K, Shatnawi S. The impact of indexing approaches on Arabic text classification. J Inf Sci 2016. [DOI: 10.1177/0165551515625030] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 11/17/2022]
Abstract
This paper investigates the impact of using different indexing approaches (full-word, stem, and root) when classifying Arabic text. In this study, the naïve Bayes classifier is used to construct the multinomial classification models and is evaluated using stratified k-fold cross-validation (k ranges from 2 to 10). It also uses a corpus that consists of 1000 normalized Arabic documents. The results of one experiment in this study show that significant accuracy improvements occur when the full-word form is used in most k-folds. Further experiments show that the classifier achieves the highest accuracy in the eight-fold setting, using a 7/8–1/8 train–test ratio, regardless of the indexing approach used. The overall results of this study show that the classifier achieves a maximum micro-average accuracy of 99.36% using either the full-word form or the stem form. This indicates that the stem is the better choice when classifying Arabic text, because it makes the corpus smaller, which improves both processing time and storage utilization, while achieving the highest level of accuracy.
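The stratified k-fold procedure described in the abstract can be sketched as follows for k = 8, where each round trains on 7/8 of the documents and tests on the remaining 1/8. This is a minimal sketch, not the study's code: the class labels, the corpus size of 24, and the round-robin assignment without shuffling are assumptions chosen to keep the example small and deterministic.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split document indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's documents round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(labels, k):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical labels: 16 documents of one class and 8 of another.
labels = ["sports"] * 16 + ["politics"] * 8
splits = list(train_test_splits(labels, k=8))
```

With these toy labels every test fold holds two `sports` documents and one `politics` document, so the 2:1 class ratio of the whole corpus is preserved in each fold. A real evaluation would normally shuffle the indices within each class first.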
6. Hmeidi I, Al-Ayyoub M, Abdulla NA, Almodawar AA, Abooraig R, Mahyoub NA. Automatic Arabic text categorization: A comprehensive comparative study. J Inf Sci 2014. [DOI: 10.1177/0165551514558172] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 11/16/2022]
Abstract
Text categorization or classification (TC) is concerned with placing text documents in their proper category according to their contents. Owing to the various applications of TC and the large volume of text documents uploaded on the Internet daily, the need for such an automated method stems from the difficulty and tedium of performing such a process manually. The usefulness of TC is manifested in different fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with TC of Arabic articles. It contains a comparison of the five best known algorithms for TC. It also studies the effects of utilizing different Arabic stemmers (light and root-based stemmers) on the effectiveness of these classifiers. Furthermore, a comparison between different data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can be used in future as a baseline to compare with other unexplored classifiers and Arabic stemmers.