1. Lee B, Brownstein JS, Kohane IS. Geoinference of author affiliations using NLP-based text classification. Sci Rep 2024;14:24306. PMID: 39414801; PMCID: PMC11484971; DOI: 10.1038/s41598-024-73318-7.
Abstract
Author affiliations are essential in bibliometric studies, which requires extracting relevant information from free-text affiliation strings. Precisely determining an author's location from their affiliation is crucial for understanding research networks, collaborations, and geographic distribution. Existing geoparsing tools based on regular expressions have limitations with unstructured and ambiguous affiliations, resulting in erroneous location identification, especially for unconventional variations or misspellings. Moreover, their inefficient handling of big datasets hampers large-scale bibliometric studies. Though machine learning-based geoparsers exist, they depend on explicit location information, creating challenges when detailed geographic data is absent. To address these issues, we developed and evaluated a natural language processing model to predict the city, state, and country from an author's free-text affiliation. Our model automates location inference, overcoming the drawbacks of existing methods. Trained and tested with MapAffil, a publicly available geoparsed dataset of PubMed affiliations up to 2018, our model accurately retrieves high-resolution locations, even when the affiliation does not explicitly mention a city, state, or country. Leveraging NLP techniques and the LinearSVC algorithm, our machine learning model achieves superior accuracy on several validation datasets. This research demonstrates a practical application of text classification for inferring specific geographical locations from free-text affiliations, benefiting researchers and institutions in analyzing research output distribution.
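The pipeline itself is not reproduced in this listing, but a minimal sketch of the general approach the abstract describes, TF-IDF features over affiliation strings fed to scikit-learn's LinearSVC, might look like the following. The example affiliations and city labels are illustrative, not drawn from MapAffil.

```python
# Minimal sketch: predict a city label from free-text affiliations with TF-IDF + LinearSVC.
# The training examples below are illustrative; the paper trains on the MapAffil dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

affiliations = [
    "Dept. of Biomedical Informatics, Harvard Medical School, Boston, MA, USA",
    "Boston Children's Hospital, Massachusetts",
    "Chittagong University of Engineering and Technology, Bangladesh",
    "Universidad Politécnica del Valle de México, Tultitlán",
]
cities = ["boston", "boston", "chittagong", "tultitlan"]

# Character n-grams are one plausible way to tolerate the misspellings and
# unconventional variants of place names that the abstract mentions.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(affiliations, cities)

print(model.predict(["Harvard Med School, Dept of Pediatrics"]))  # likely ['boston'] on this toy data
```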
Affiliation(s)
- Brian Lee
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
- John S Brownstein
- Computational Epidemiology Lab, Boston Children's Hospital, Boston, MA, USA
- Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
2. Hossain MR, Hoque MM, Siddique N, Sarker IH. CovTiNet: Covid text identification network using attention-based positional embedding feature fusion. Neural Comput Appl 2023;35:13503-13527. PMCID: PMC10011801; DOI: 10.1007/s00521-023-08442-y.
Abstract
Covid text identification (CTI) is a crucial research concern in natural language processing (NLP). Social and electronic media are adding a large volume of Covid-related text to the World Wide Web, driven by effortless access to the Internet, electronic gadgets, and the Covid outbreak. Much of this text is uninformative and contains misinformation, disinformation and malinformation, creating an infodemic; Covid text identification is therefore essential for controlling societal distrust and panic. Although some Covid-related research (such as work on Covid disinformation, misinformation and fake news) has been reported in high-resource languages (e.g. English), CTI in low-resource languages (like Bengali) remains at a preliminary stage to date. Automatic CTI in Bengali text is challenging due to the deficit of benchmark corpora, complex linguistic constructs, extensive verb inflections and the scarcity of NLP tools, while manual processing of Bengali Covid texts is arduous and costly because of their messy, unstructured form. This research proposes a deep learning-based network (CovTiNet) to identify Covid text in Bengali. CovTiNet incorporates attention-based positional embedding feature fusion for text-to-feature representation and an attention-based CNN for Covid text identification. Experimental results show that the proposed CovTiNet achieved the highest accuracy of 96.61 ± 0.001% on the developed dataset (BCovC) compared with the other methods and baselines (i.e. BERT-M, IndicBERT, ELECTRA-Bengali, DistilBERT-M, BiLSTM, DCNN, CNN, LSTM, VDCNN and ACNN).
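CovTiNet's exact architecture is not given in this abstract; as a rough, simplified sketch of the general idea (token and positional embeddings fused by addition, a convolutional encoder, and an attention-style pooling before a classification head), one could write something like the following in PyTorch. Layer sizes and the attention formulation are placeholder assumptions, not the authors' design.

```python
# Simplified sketch (not the authors' CovTiNet): token + positional embeddings fused by
# addition, a 1D convolution, soft attention pooling over positions, and a binary head.
import torch
import torch.nn as nn

class TinyCovidTextNet(nn.Module):
    def __init__(self, vocab_size=30000, max_len=128, emb_dim=100, n_classes=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(max_len, emb_dim)
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
        self.attn = nn.Linear(128, 1)           # scores each position
        self.fc = nn.Linear(128, n_classes)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)          # fuse word + position
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)   # (batch, seq, 128)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)       # attention over positions
        pooled = (h * weights.unsqueeze(-1)).sum(dim=1)
        return self.fc(pooled)

logits = TinyCovidTextNet()(torch.randint(0, 30000, (4, 128)))
print(logits.shape)  # torch.Size([4, 2])
```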
Affiliation(s)
- Md. Rajib Hossain
- Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh
- Mohammed Moshiul Hoque
- Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh
- Nazmul Siddique
- School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, UK
- Iqbal H. Sarker
- Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh
- Security Research Institute, Edith Cowan University, Joondalup, WA 6027, Australia
3. Contreras Hernández S, Tzili Cruz MP, Espínola Sánchez JM, Pérez Tzili A. Deep Learning Model for COVID-19 Sentiment Analysis on Twitter. New Gener Comput 2023;41:189-212. PMID: 37229180; PMCID: PMC10010651; DOI: 10.1007/s00354-023-00209-2.
Abstract
The COVID-19 pandemic affected people's mood, and this was evident on social networks. These user posts are a source of information for measuring public opinion on social phenomena. In particular, Twitter is a resource of great value because of the volume of information, the geographical distribution of the posts, and their public availability. This work presents a study of public sentiment in Mexico during one of the waves that produced the most infections and deaths in the country. A mixed, semi-supervised approach was used: a lexicon-based technique labeled the data, which were then used to fine-tune a Transformer model pre-trained entirely on Spanish. Two Spanish-language models were trained by fine-tuning the Transformer network specifically for the COVID-19 sentiment analysis task. Ten multilingual Transformer models that include Spanish were also trained with the same dataset and parameters to compare their performance. In addition, other classifiers, such as Support Vector Machines, Naive Bayes, Logistic Regression, and Decision Trees, were trained and tested on the same dataset. Their performance was compared with that of the Spanish-only Transformer model, which achieved higher precision. Finally, this Spanish-only model was applied to new data to measure COVID-19 sentiment in the Mexican Twitter community.
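The fine-tuning recipe is only described in general terms; a condensed sketch of fine-tuning a Spanish pre-trained Transformer for tweet sentiment with the Hugging Face transformers library is shown below. The BETO checkpoint name, the three-class label scheme, and the tiny in-memory dataset are illustrative assumptions rather than the authors' exact setup.

```python
# Condensed sketch: fine-tune a Spanish pre-trained Transformer (BETO used as an example
# checkpoint) for 3-class COVID-19 tweet sentiment. The tiny dataset is purely illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"   # assumed Spanish-only model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Labels: 0 = negative, 1 = neutral, 2 = positive (lexicon-labeled in the paper's pipeline).
data = Dataset.from_dict({
    "text": ["Ya me vacunaron, qué alivio", "Otra ola de contagios, qué tristeza"],
    "label": [2, 0],
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length",
                                     max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-es", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
print(trainer.predict(data).predictions.argmax(axis=-1))
```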
Affiliation(s)
- Salvador Contreras Hernández
- Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán, Estado de México, Mexico
- María Patricia Tzili Cruz
- Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán, Estado de México, Mexico
- José Martín Espínola Sánchez
- Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán, Estado de México, Mexico
4. Asudani DS, Nagwani NK, Singh P. Impact of word embedding models on text analytics in deep learning environment: a review. Artif Intell Rev 2023;56:1-81. PMID: 36844886; PMCID: PMC9944441; DOI: 10.1007/s10462-023-10419-1.
Abstract
The selection of word embedding and deep learning models for better outcomes is vital. Word embeddings are n-dimensional distributed representations of text that attempt to capture the meanings of words. Deep learning models use multiple computing layers to learn hierarchical representations of data. Word embedding techniques based on deep learning have received much attention and are used in various natural language processing (NLP) applications, such as text classification, sentiment analysis, named entity recognition, and topic modeling. This paper reviews the representative methods of the most prominent word embedding and deep learning models. It presents an overview of recent research trends in NLP and a detailed understanding of how to use these models to achieve efficient results on text analytics tasks. The review summarizes, contrasts, and compares numerous word embedding and deep learning models and includes a list of prominent datasets, tools, APIs, and popular publications. A reference for selecting a suitable word embedding and deep learning approach is presented based on a comparative analysis of different techniques for text analytics tasks. This paper can serve as a quick reference for learning the basics, benefits, and challenges of various word representation approaches and deep learning models, their application to text analytics, and a future outlook on research. The findings suggest that domain-specific word embeddings and the long short-term memory (LSTM) model can be employed to improve overall text analytics task performance.
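The review's concluding recommendation, domain-specific embeddings feeding an LSTM classifier, can be sketched roughly as follows with gensim and Keras; the corpus, vector size, and hyperparameters are toy placeholders, not values from the review.

```python
# Rough sketch: train domain-specific word vectors with gensim, then use them to initialise
# the embedding layer of a Keras LSTM text classifier. All data and hyperparameters are toy values.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models, initializers

corpus = [["covid", "symptoms", "reported", "on", "twitter"],
          ["vaccine", "rollout", "discussed", "widely"]]          # domain corpus (toy)
w2v = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=10)

vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}     # index 0 reserved for padding
emb_matrix = np.zeros((len(vocab) + 1, 50))
for word, idx in vocab.items():
    emb_matrix[idx] = w2v.wv[word]

model = models.Sequential([
    layers.Input(shape=(20,)),                                    # sequences padded to length 20 (toy choice)
    layers.Embedding(len(vocab) + 1, 50,
                     embeddings_initializer=initializers.Constant(emb_matrix),
                     mask_zero=True),
    layers.LSTM(64),                        # sequence encoder recommended by the review
    layers.Dense(1, activation="sigmoid"),  # binary text-classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```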
Affiliation(s)
- Deepak Suresh Asudani
- Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
- Naresh Kumar Nagwani
- Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
- Pradeep Singh
- Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
5. Predictive Analysis of COVID-19 Symptoms in Social Networks through Machine Learning. Electronics 2022. DOI: 10.3390/electronics11040580.
Abstract
Social media are a rich source of data for analysis, since they provide ways for people to share emotions, feelings, ideas, and even symptoms of diseases. By the end of 2019, a global pandemic alert was raised concerning a virus with a high contagion rate that could cause respiratory complications. To help identify people who may have symptoms of this disease, or to detect those already infected, this paper analyzed the performance of eight machine learning algorithms (KNN, Naive Bayes, Decision Tree, Random Forest, SVM, a simple Multilayer Perceptron, Convolutional Neural Networks, and BERT) in the search for and classification of tweets that mention self-reported COVID-19 symptoms. The dataset was labeled using a set of disease symptom keywords provided by the World Health Organization. The tests showed that the Random Forest algorithm had the best results, closely followed by BERT and the Convolutional Neural Network, although traditional machine learning algorithms can also provide good results. This work could also aid in the selection of algorithms for identifying disease symptoms in social media content.
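A rough illustration of the procedure the abstract outlines (weakly labeling tweets by symptom-keyword matching, then training the best-performing classifier, Random Forest, on text features) is given below; the keyword set and tweets are stand-ins, not the WHO list or the paper's dataset.

```python
# Rough sketch: label tweets by symptom-keyword matching, then train a Random Forest
# (the paper's best performer) on TF-IDF features. Keywords and tweets are toy stand-ins.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

SYMPTOM_KEYWORDS = {"fever", "cough", "fatigue", "loss of smell"}   # illustrative, not the WHO list

def weak_label(tweet: str) -> int:
    """Return 1 if the tweet mentions any symptom keyword, else 0."""
    text = tweet.lower()
    return int(any(kw in text for kw in SYMPTOM_KEYWORDS))

tweets = [
    "Day three of fever and a dry cough, staying home.",
    "Loving the sunny weather this weekend!",
    "Total fatigue and loss of smell since Monday.",
    "New episode of my favourite show just dropped.",
]
labels = [weak_label(t) for t in tweets]

clf = make_pipeline(TfidfVectorizer(),
                    RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(tweets, labels)
print(clf.predict(["I think I lost my sense of smell, and I have a cough"]))  # likely [1] on this toy data
```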
6. Improving Sentiment Analysis for Social Media Applications Using an Ensemble Deep Learning Language Model. Arab J Sci Eng 2021;47:2499-2511. PMID: 34660170; PMCID: PMC8502794; DOI: 10.1007/s13369-021-06227-w.
Abstract
As data on social media grow rapidly through users' contributions, especially with the recent coronavirus pandemic, there is a strong demand for insight into user behavior. The opinions expressed in posts about the pandemic form the dataset examined in this study. Finding the most suitable classification algorithms for this kind of data is challenging. Within this context, deep learning models for sentiment analysis can offer richer representation capabilities and better performance than existing feature-based techniques. In this paper, we focus on enhancing the performance of sentiment classification using a customized deep learning model with an advanced word embedding technique and a long short-term memory (LSTM) network. Furthermore, we propose an ensemble model that combines our baseline classifier with other state-of-the-art classifiers used for sentiment analysis. The contributions of this paper are twofold. (1) We establish a robust framework based on word embedding and an LSTM network that learns the contextual relations among words and handles unseen or rare words in relatively emerging situations, such as the coronavirus pandemic, by recognizing suffixes and prefixes from the training data. (2) We capture and utilize the significant differences among state-of-the-art methods by proposing a hybrid ensemble model for sentiment analysis. We conduct several experiments using our own Twitter coronavirus hashtag dataset as well as public review datasets from Amazon and Yelp. Finally, a statistical analysis indicates that the proposed models surpass other models in classification accuracy.
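The hybrid ensemble is described only at a high level; one simple way to combine a neural sentiment classifier with classical baselines is soft voting over predicted class probabilities, sketched below with scikit-learn. An MLP stands in for the paper's LSTM baseline to keep the example short, and the texts and labels are toy values, so this illustrates the ensembling idea rather than the paper's exact architecture.

```python
# Sketch of the ensembling idea: soft-vote the class probabilities of several sentiment
# classifiers over TF-IDF features (1 = positive, 0 = negative). Toy data throughout.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = ["great product, highly recommend", "terrible service, never again",
         "absolutely loved it", "waste of money"]
labels = [1, 0, 1, 0]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        ],
        voting="soft",                      # average predicted probabilities across models
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["loved the product, great quality"]))  # likely [1] on this toy data
```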