1
Zaman F, Kamiran F, Shardlow M, Hassan SU, Karim A, Aljohani NR. SATS: simplification aware text summarization of scientific documents. Front Artif Intell 2024; 7:1375419. [PMID: 39049961 PMCID: PMC11266102 DOI: 10.3389/frai.2024.1375419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 06/06/2024] [Indexed: 07/27/2024] Open
Abstract
Simplifying summaries of scholarly publications has been a popular method for conveying scientific discoveries to a broader audience. While text summarization aims to shorten long documents, simplification seeks to reduce the complexity of a document. To accomplish these tasks collectively, there is a need to develop machine learning methods to shorten and simplify longer texts. This study presents a new Simplification Aware Text Summarization model (SATS) based on future n-gram prediction. The proposed SATS model extends ProphetNet, a text summarization model, by enhancing the objective function using a word frequency lexicon for simplification tasks. We have evaluated the performance of SATS on a recently published text summarization and simplification corpus consisting of 5,400 scientific article pairs. Our results in terms of automatic evaluation demonstrate that SATS outperforms state-of-the-art models for simplification, summarization, and joint simplification-summarization across two datasets on ROUGE, SARI, and CSS1. We also provide human evaluation of summaries generated by the SATS model. We evaluated 100 summaries from eight annotators for grammar, coherence, consistency, fluency, and simplicity. The average human judgment for all evaluated dimensions lies between 4.0 and 4.5 on a scale from 1 to 5 where 1 means low and 5 means high.
Affiliation(s)
- Farooq Zaman: Scientometrics Lab, Information Technology University, Lahore, Pakistan
- Faisal Kamiran: Scientometrics Lab, Information Technology University, Lahore, Pakistan
- Matthew Shardlow: Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, United Kingdom
- Saeed-Ul Hassan: Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, United Kingdom
- Asim Karim: Department of Computer Science, Syed Babar Ali School of Science and Engineering (SBASSE), Lahore University of Management Sciences, Lahore, Pakistan
- Naif Radi Aljohani: Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
2
Multi-class sentiment analysis of Urdu text using multilingual BERT. Sci Rep 2022; 12:5436. [PMID: 35361890 PMCID: PMC8971433 DOI: 10.1038/s41598-022-09381-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/16/2021] [Accepted: 03/22/2022] [Indexed: 12/02/2022] Open
Abstract
Sentiment analysis (SA) is an important task because of its vital role in analyzing people’s opinions. However, existing research is mostly based on the English language, with limited work on low-resource languages. This study introduced a new multi-class Urdu dataset based on user reviews for sentiment analysis. This dataset is gathered from various domains such as food and beverages, movies and plays, software and apps, politics, and sports. Our proposed dataset contains 9312 reviews manually annotated by human experts into three classes: positive, negative and neutral. The main goal of this research study is to create a manually annotated dataset for Urdu sentiment analysis and to set baseline results using rule-based, machine learning (SVM, NB, AdaBoost, MLP, LR and RF) and deep learning (CNN-1D, LSTM, Bi-LSTM, GRU and Bi-GRU) techniques. Additionally, we fine-tuned Multilingual BERT (mBERT) for Urdu sentiment analysis. We used four text representations: word n-grams, char n-grams, pre-trained fastText and BERT word embeddings to train our classifiers. We trained these models on two different datasets for evaluation purposes. The findings show that the proposed mBERT model with BERT pre-trained word embeddings outperformed deep learning, machine learning and rule-based classifiers and achieved an F1 score of 81.49%.
3
Saeed A, Nawab RMA, Stevenson M. Investigating the Feasibility of Deep Learning Methods for Urdu Word Sense Disambiguation. ACM T ASIAN LOW-RESO 2022. [DOI: 10.1145/3477578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Word Sense Disambiguation (WSD), the process of automatically identifying the correct meaning of a word used in a given context, is a significant challenge in Natural Language Processing. A range of approaches to the problem has been explored by the research community. The majority of these efforts have focused on a relatively small set of languages, particularly English. Research on WSD for South Asian languages, particularly Urdu, is still in its infancy. In recent years, deep learning methods have proved to be extremely successful for a range of Natural Language Processing tasks. The main aim of this study is to apply, evaluate, and compare a range of deep learning approaches to Urdu WSD (both Lexical Sample and All-Words) including Simple Recurrent Neural Networks, Long-Short Term Memory, Gated Recurrent Units, Bidirectional Long-Short Term Memory, and Ensemble Learning. The evaluation was carried out on two benchmark corpora: (1) the ULS-WSD-18 corpus and (2) the UAW-WSD-18 corpus. Results (Accuracy = 63.25% and F1-Measure = 0.49) show that a deep learning approach outperforms previously reported results for the Urdu All-Words WSD task, whereas performance using deep learning approaches (Accuracy = 72.63% and F1-Measure = 0.60) is low in comparison to previously reported results for the Urdu Lexical Sample task.
Affiliation(s)
- Ali Saeed: Department of Software Engineering, The University of Lahore, and Department of Computer Sciences, COMSATS University Islamabad, Lahore, Punjab, Pakistan
- Rao Muhammad Adeel Nawab: Department of Computer Sciences, COMSATS University Islamabad, Lahore Campus, Lahore, Punjab, Pakistan
- Mark Stevenson: Department of Computer Sciences, University of Sheffield, Western Bank, Sheffield, UK
4
Deep Sentiment Analysis Using CNN-LSTM Architecture of English and Roman Urdu Text Shared in Social Media. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12052694] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 11/17/2022]
Abstract
Sentiment analysis (SA) has been an active research subject in the domain of natural language processing due to its important functions in interpreting people’s perspectives and drawing successful opinion-based judgments. On social media, Roman Urdu is one of the most extensively utilized dialects. Sentiment analysis of Roman Urdu is difficult due to its morphological complexities and varied dialects. The purpose of this paper is to evaluate the performance of various word embeddings for Roman Urdu and English dialects using the CNN-LSTM architecture with traditional machine learning classifiers. We introduce a novel deep learning architecture for Roman Urdu and English dialect SA based on two layers: LSTM for long-term dependency preservation and a one-layer CNN model for local feature extraction. To obtain the final classification, the feature maps learned by CNN and LSTM are fed to several machine learning classifiers. Various word embedding models support this concept. Extensive tests on four corpora show that the proposed model performs exceptionally well in Roman Urdu and English text sentiment classification, with an accuracy of 0.904, 0.841, 0.740, and 0.748 against MDPI, RUSA, RUSA-19, and UCL datasets, respectively. The results show that the SVM classifier and the Word2Vec CBOW (Continuous Bag of Words) model are more beneficial options for Roman Urdu sentiment analysis, but that BERT word embedding, two-layer LSTM, and SVM as a classifier function are more suitable options for English language sentiment analysis. The suggested model outperforms existing well-known advanced models on relevant corpora, improving the accuracy by up to 5%.
5
Iqbal S, Hassan SU, Aljohani NR, Alelyani S, Nawaz R, Bornmann L. A decade of in-text citation analysis based on natural language processing and machine learning techniques: an overview of empirical studies. Scientometrics 2021. [DOI: 10.1007/s11192-021-04055-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 10/21/2022]
6
Aljohani NR, Fayoumi A, Hassan SU. An in-text citation classification predictive model for a scholarly search system. Scientometrics 2021. [DOI: 10.1007/s11192-021-03986-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
7
Hassan SU, Shabbir M, Iqbal S, Said A, Kamiran F, Nawaz R, Saif U. Leveraging Deep Learning and SNA approaches for Smart City Policing in the Developing World. INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT 2021. [DOI: 10.1016/j.ijinfomgt.2019.102045] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 10/25/2022]
8
Sarwar R, Zia A, Nawaz R, Fayoumi A, Aljohani NR, Hassan SU. Webometrics: evolution of social media presence of universities. Scientometrics 2021. [DOI: 10.1007/s11192-020-03804-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 10/22/2022]
10
Liu L, Li W, Aljohani NR, Lytras MD, Hassan SU, Nawaz R. A framework to evaluate the interoperability of information systems – Measuring the maturity of the business process alignment. INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT 2020. [DOI: 10.1016/j.ijinfomgt.2020.102153] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
11
Sarwar R, Rutherford AT, Hassan SU, Rakthanmanon T, Nutanong S. Native Language Identification of Fluent and Advanced Non-Native Writers. ACM T ASIAN LOW-RESO 2020. [DOI: 10.1145/3383202] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/24/2022]
Abstract
Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require the learner corpora. This article performs NLI in a challenging context of the user-generated-content (UGC) where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on the content-specific/social-network features and may not be generalizable to other domains and datasets, (ii) are unable to capture the variations of the language-usage-patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since there is a sizable number of people who have acquired non-English second languages due to the economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of the language-usage-patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k nearest neighbors’ classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora where each corpus is written in a different language, namely, English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages.
Affiliation(s)
- Raheem Sarwar: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Wangchan, Rayong, Thailand
- Attapol T. Rutherford: Department of Linguistics at Faculty of Arts Chulalongkorn University, Pathumwan, Bangkok, Thailand
- Saeed-Ul Hassan: Department of Computer Science, Information Technology University, Lahore, Punjab, Pakistan
- Thanawin Rakthanmanon: Department of Computer Engineering, Kasetsart University, Thailand and School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Wangchan, Rayong, Thailand
- Sarana Nutanong: School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Wangchan, Rayong, Thailand
12
Mahmood Z, Safder I, Nawab RMA, Bukhari F, Nawaz R, Alfakeeh AS, Aljohani NR, Hassan SU. Deep sentiments in Roman Urdu text using Recurrent Convolutional Neural Network model. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2020.102233] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 11/15/2022]
13
Hassan SU, Saleem A, Soroya SH, Safder I, Iqbal S, Jamil S, Bukhari F, Aljohani NR, Nawaz R. Sentiment analysis of tweets through Altmetrics: A machine learning approach. J Inf Sci 2020. [DOI: 10.1177/0165551520930917] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/17/2022]
Abstract
The purpose of the study is to (a) contribute to annotating an Altmetrics dataset across five disciplines, (b) undertake sentiment analysis using various machine learning and natural language processing–based algorithms, (c) identify the best-performing model and (d) provide a Python library for sentiment analysis of an Altmetrics dataset. First, the researchers gave a set of guidelines to two human annotators familiar with the task of related tweet annotation of scientific literature. They duly labelled the sentiments, achieving an inter-annotator agreement (IAA) of 0.80 (Cohen’s Kappa). Then, the same experiments were run on two versions of the dataset: one with tweets in English and the other with tweets in 23 languages, including English. Using 6388 tweets about 300 papers indexed in Web of Science, the effectiveness of the employed machine learning and natural language processing models was measured by comparing them with well-known sentiment analysis models, that is, SentiStrength and Sentiment140, as the baseline. The results showed that Support Vector Machine with uni-gram outperformed all the other classifiers and baseline methods employed, with an accuracy of over 85%, followed by Logistic Regression at 83% accuracy and Naïve Bayes at 80%. The precision, recall and F1 scores for Support Vector Machine, Logistic Regression and Naïve Bayes were (0.89, 0.86, 0.86), (0.86, 0.83, 0.80) and (0.85, 0.81, 0.76), respectively.
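The inter-annotator agreement quoted in this abstract is Cohen's kappa. As a minimal sketch of how that statistic is computed for two annotators, the following Python may help; the function name `cohens_kappa` and the toy labels are illustrative assumptions and are not taken from the authors' Python library.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical toy annotations, not data from the study.
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "pos", "neg", "pos"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, so two annotators who agree on 75% of items can still score well below 0.75 when one label dominates.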
Affiliation(s)
- Saira Hanif Soroya: Department of Information Management, University of the Punjab, Pakistan
- Saqib Jamil: Department of Management Sciences, University of Okara, Pakistan
- Faisal Bukhari: Punjab University College for Information Technology (PUCIT), University of the Punjab, Pakistan
- Naif Radi Aljohani: Faculty of Computing and Information Technology, King Abdulaziz University, Kingdom of Saudi Arabia
- Raheel Nawaz: School of Computer Science, Manchester Metropolitan University, UK
14
Sarwar R, Porthaveepong T, Rutherford A, Rakthanmanon T, Nutanong S. StyloThai:. ACM T ASIAN LOW-RESO 2020. [DOI: 10.1145/3365832] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 10/24/2022]
Abstract
Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. The applications of this task can be found in several domains, such as law enforcement agencies and information retrieval. These application domains are not limited to a specific language, community, or ethnicity. However, most of the existing solutions are designed for English, and little attention has been paid to Thai. These existing solutions are not directly applicable to Thai due to the linguistic differences between these two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale when the size of the candidate authors set increases, and (iii) perform well when the number of writing samples for each candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on our feature space, we present an authorship identification solution that uses the probabilistic k nearest neighbors classifier by transforming each document into a collection of point sets. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution can overcome the limitations of the existing solution and outperforms all competitors with an accuracy level of 91.02%. Moreover, we investigate the effectiveness of each stylometric feature category with the help of an ablation study. We found that combining all categories of the stylometric features outperforms the other combinations. Finally, we cross compare the feature spaces and classification methods of all solutions. We found that (i) our solution can scale as the number of candidate authors increases, (ii) our method outperforms all the competitors, and (iii) our feature space provides better performance than the feature space used by the existing study.
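The abstract does not name the set distance or the exact form of the probabilistic classifier, so the following is only a rough sketch of the general idea under stated assumptions. Each document is a point set (a list of feature vectors). The distance is a partial Hausdorff distance that discards the farthest points as a simple form of outlier handling. The "probabilities" are plain vote shares among the k nearest documents. All function names, the `keep_fraction` parameter, and the toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def directed_partial_distance(set_a, set_b, keep_fraction=0.75):
    """Directed partial Hausdorff distance from set_a to set_b.

    Each point in set_a is matched to its nearest point in set_b; only the
    closest keep_fraction of those matches is considered, so a few outlying
    points in set_a cannot dominate the result (assumed outlier handling).
    """
    nearest = sorted(min(euclidean(p, q) for q in set_b) for p in set_a)
    kept = max(1, math.ceil(keep_fraction * len(nearest)))
    return nearest[kept - 1]


def set_distance(set_a, set_b, keep_fraction=0.75):
    """Symmetric set distance: the larger of the two directed distances."""
    return max(
        directed_partial_distance(set_a, set_b, keep_fraction),
        directed_partial_distance(set_b, set_a, keep_fraction),
    )


def knn_author_probabilities(query_set, corpus, k=3):
    """Vote shares per author among the k documents nearest to the query.

    corpus is a list of (author, point_set) pairs.
    """
    ranked = sorted(corpus, key=lambda item: set_distance(query_set, item[1]))
    votes = Counter(author for author, _ in ranked[:k])
    total = sum(votes.values())
    return {author: count / total for author, count in votes.items()}


if __name__ == "__main__":
    # Hypothetical two-dimensional stylometric features for four documents.
    corpus = [
        ("A", [(0.0, 0.0), (0.0, 1.0)]),
        ("A", [(0.1, 0.0), (0.0, 1.1)]),
        ("B", [(5.0, 5.0), (5.0, 6.0)]),
        ("B", [(6.0, 5.0), (6.0, 6.0)]),
    ]
    # The last point is a deliberate outlier that the partial distance ignores.
    query = [(0.0, 0.0), (0.0, 1.0), (0.1, 0.5), (9.0, 9.0)]
    print(knn_author_probabilities(query, corpus, k=3))
```

Representing a document as several points rather than one averaged vector is what lets a set distance ignore an unusual fragment instead of letting it shift the whole document.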
15
Linking Work-Family Conflict (WFC) and Talent Management: Insights from a Developing Country. SUSTAINABILITY 2020. [DOI: 10.3390/su12072861] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/16/2022]
Abstract
Considering the profound societal change taking place in several developing countries, the objective of this paper is to reflect on work-family conflict (WFC) both as a concept and a social phenomenon. Given that WFC is a concept rooted in academic debate focusing on developments in Western, largely individualistic, societies, this paper reconsiders WFC’s value added as applied in a context of a collectivist society in a developing country. The objective of this paper is thus threefold, i.e., (i) to assess WFC’s applicability in a context of a collectivist society in a developing country, where the position and role of women gradually changes; (ii) to develop a culturally adjusted/sensitive scale to measure the scope of WFC in Pakistan, whereby the latter is treated here as a case study; and (iii) to reflect on the possibility of devising a set of good practices that would allow a smooth inclusion of women in the formal workforce, while at the same time mitigating the scope and scale of WFC. The value added of this paper stems from these three objectives.
16
Measuring the Scale and Scope of Social Anxiety among Students in Pakistani Higher Education Institutions: An Alternative Social Anxiety Scale. SUSTAINABILITY 2020. [DOI: 10.3390/su12062164] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Social Anxiety Disorder (SAD) is neither just shyness, nor for most victims does it merely involve an inability to speak in public. For most sufferers of this disorder, it could be a pervasive, disabling condition that steals away opportunities for a richer, fuller life. Having an early onset and combining high prevalence rates with serious negative effects on functioning and quality of life, SAD is a public health problem of considerable magnitude. Hence, its assessment using a standardized measure and timely intervention can completely preempt or at least lessen the severity of this psychiatric illness. So far, SAD among students in higher education institutions is a less investigated area of study in Pakistan. Students generally avoid reporting difficulties they experience while interacting with people and quietly try to combat their fears in social settings. Proper and timely diagnosis and treatment of SAD are required, and for this purpose, the need of the hour is to create a culturally oriented measuring instrument for proper surveillance of the student population in Pakistan. This paper, drawing from a study conducted at Higher Education Institutions (HEI) across Pakistan, addresses this issue by devising an indigenous, comprehensive, well-founded and valid scale of social anxiety in the Urdu language. The use of this scale, both in general and patient care settings, would effectively screen individuals who could be at risk of being victimized by this disorder. This alternative Social Anxiety Scale (SAS) carefully evaluates social behaviors and attitudes while also ensuring that cultural perspectives are considered, which would also encourage clinicians to evaluate SAD in the Pakistani population.
18
Predicting academic performance of students from VLE big data using deep learning models. COMPUTERS IN HUMAN BEHAVIOR 2020. [DOI: 10.1016/j.chb.2019.106189] [Citation(s) in RCA: 113] [Impact Index Per Article: 28.3] [Indexed: 11/21/2022]
19
Aljohani NR, Fayoumi A, Hassan SU. Bot prediction on social networks of Twitter in altmetrics using deep graph convolutional networks. Soft comput 2020. [DOI: 10.1007/s00500-020-04689-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 10/25/2022]
20
Arshad N, Bakar A, Soroya SH, Safder I, Haider S, Hassan SU, Aljohani NR, Alelyani S, Nawaz R. Extracting scientific trends by mining topics from Call for Papers. LIBRARY HI TECH 2019. [DOI: 10.1108/lht-02-2019-0048] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
Abstract
Purpose: The purpose of this paper is to present a novel approach for mining scientific trends using topics from Call for Papers (CFP). The work contributes a valuable input for researchers, academics, funding institutes and research administration departments by sharing the trends to set directions of research path.
Design/methodology/approach: The authors procure an innovative CFP data set to analyse scientific evolution and prestige of conferences that set scientific trends using scientific publications indexed in DBLP. Using the Field of Research code 804 from Australian Research Council, the authors identify 146 conferences (from 2006 to 2015) into different thematic areas by matching the terms extracted from publication titles with the Association for Computing Machinery Computing Classification System. Furthermore, the authors enrich the vocabulary of terms from the WordNet dictionary and Growbag data set. To measure the significance of terms, the authors adopt the following weighting schemas: probabilistic, gram, relative, accumulative and hierarchal.
Findings: The results indicate the rise of “big data analytics” from CFP topics in the last few years. Whereas the topics related to “privacy and security” show an exponential increase, the topics related to “semantic web” show a downfall in recent years. While analysing publication output in DBLP that matches CFP indexed in ERA Core A* to C rank conference, the authors identified that A* and A tier conferences not merely set publication trends, since B or C tier conferences target similar CFP.
Originality/value: Overall, the analyses presented in this research are prolific for the scientific community and research administrators to study research trends and better data management of digital libraries pertaining to the scientific literature.
21
Measuring the Scale and Scope of Workplace Bullying: An Alternative Workplace Bullying Scale. SUSTAINABILITY 2019. [DOI: 10.3390/su11174634] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022]
Abstract
The toll of workplace bullying is immense, yet, as with an iceberg, its scope, scale and implications tend to remain underestimated. Several ways of assessing the prevalence of workplace bullying have been proposed in the literature. The most frequently discussed are the ‘subjective method’ assessing individuals’ perceptions of being a victim and the questionnaire, i.e., criterion-based, methods, including Negative Acts Questionnaire (NAQ) and Leymann Inventory of Psychological Terror (LIPT). Since in both cases culture plays a profound role as a mediating factor in the process of identifying, collecting, and processing data, the applicability of these methods across cultures and countries has several limitations. At this stage, it is impossible to determine the impact of the implicit cultural bias that these methods entail on the research outcomes. This would be possible if an alternative workplace bullying scale (WBS) were at hand and, consequently, a comparative analysis were conducted. This paper, drawing from a study conducted at higher education institutions (HEI) across Pakistan, addresses this issue by devising an alternative WBS. The value added of this paper is three-fold, i.e., it elaborates on the study and the specific methods employed to prove the validity and relevance of the alternative WBS. Moreover, by so doing, it addresses some of the limitations that other methods measuring the prevalence of workplace bullying display. As a result, it adds to the researchers’ and administrators’ toolkit as regards research and policies aimed at mitigating the scope and scale of bullying at HEIs across cultures and countries.
22
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018; 10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023] Open
Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. 
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/
Affiliation(s)
- Paul Thompson: National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Sophia Daikou: National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Kenju Ueno: Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Riza Batista-Navarro: National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Jun’ichi Tsujii: National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK; Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Sophia Ananiadou: National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
23
Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S. Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak 2018; 18:46. [PMID: 29940927 PMCID: PMC6019216 DOI: 10.1186/s12911-018-0639-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 08/10/2017] [Accepted: 06/11/2018] [Indexed: 01/05/2023] Open
Abstract
Background: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author’s intended knowledge gain) and New Knowledge (an author’s findings). The method incorporates various features, including a combination of simple MK dimensions.
Methods: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated.
Results: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method is able to improve upon the previously reported state of the art performance for an existing dimension, i.e., Knowledge Type. Secondly, we also demonstrate high F1-score in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR 0.802) and New Knowledge (GENIA: 0.829, EU-ADR 0.836).
Conclusion: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications.
Electronic supplementary material: The online version of this article (10.1186/s12911-018-0639-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Matthew Shardlow: National Centre for Text Mining, University of Manchester, Manchester, UK
- Paul Thompson: National Centre for Text Mining, University of Manchester, Manchester, UK
- Raheel Nawaz: National Centre for Text Mining, University of Manchester, Manchester, UK
- John McNaught: National Centre for Text Mining, University of Manchester, Manchester, UK
- Sophia Ananiadou: National Centre for Text Mining, University of Manchester, Manchester, UK