1
|
Novo-Lourés M, Pavón R, Laza R, Méndez JR, Ruano-Ordás D. An enhanced algorithm for semantic-based feature reduction in spam filtering. PeerJ Comput Sci 2024; 10:e2206. [PMID: 39145211 PMCID: PMC11323001 DOI: 10.7717/peerj-cs.2206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 06/26/2024] [Indexed: 08/16/2024]
Abstract
With the advent and improvement of ontological dictionaries (WordNet, BabelNet), the use of synset-based text representations is gaining popularity in classification tasks. More recently, ontological dictionaries have been used to reduce dimensionality in this kind of representation (e.g., Semantic Dimensionality Reduction System (SDRS) (Vélez de Mendizabal et al., 2020)). These approaches combine semantically related columns by taking advantage of semantic information extracted from ontological dictionaries. Their main advantage is that they not only eliminate features but can also combine them, minimizing (low-loss) or avoiding (lossless) the loss of information. The most recent (and accurate) techniques in this group use evolutionary algorithms to find how many features can be grouped to reduce false positive (FP) and false negative (FN) errors. The main limitation of these evolutionary-based schemes is the computational cost of the optimization algorithms. The contribution of this study is a new lossless feature reduction scheme exploiting information from ontological dictionaries, which achieves slightly better accuracy (especially in FP errors) than optimization-based approaches while using far fewer computational resources. Instead of using computationally expensive evolutionary algorithms, our proposal determines whether two columns (synsets) can be combined by observing whether the instances in a dataset (e.g., the training dataset) that contain these synsets are mostly of the same class. The study includes experiments using three datasets and a detailed comparison with two previous optimization-based approaches.
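The class-based merging criterion described in the abstract can be sketched as follows. This is only an illustrative reading, not the authors' algorithm: the function names (`class_purity`, `can_merge`, `merge_columns`), the 0.9 threshold, and the choice to sum merged columns are all assumptions, and the candidate pairs are assumed to have already been identified as semantically related through an ontological dictionary.

```python
def class_purity(rows, labels, col):
    """Return (purity, majority_class) over the instances that contain `col`.

    `rows` is a list of feature vectors (synset counts), `labels` the class
    of each row. Purity is the share of those instances in the majority class.
    """
    hits = [lab for row, lab in zip(rows, labels) if row[col] > 0]
    if not hits:
        return 0.0, None
    counts = {}
    for lab in hits:
        counts[lab] = counts.get(lab, 0) + 1
    majority = max(counts, key=counts.get)
    return counts[majority] / len(hits), majority


def can_merge(rows, labels, col_a, col_b, threshold=0.9):
    """Hypothetical rule: merge two columns only if the instances containing
    each of them are mostly (>= threshold) of the same class."""
    purity_a, class_a = class_purity(rows, labels, col_a)
    purity_b, class_b = class_purity(rows, labels, col_b)
    return (
        class_a is not None
        and class_a == class_b
        and purity_a >= threshold
        and purity_b >= threshold
    )


def merge_columns(rows, col_a, col_b):
    """Add col_b into col_a and drop col_b (an assumed way of combining)."""
    merged_rows = []
    for row in rows:
        merged = list(row)
        merged[col_a] += merged[col_b]
        del merged[col_b]
        merged_rows.append(merged)
    return merged_rows
```

Under these assumptions, checking a pair costs one pass over the dataset per column, which is consistent with the abstract's point that no iterative optimization is needed.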
Affiliation(s)
- María Novo-Lourés
- CINBIO - Biomedical Research Centre, CINBIO, Vigo, Pontevedra, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Pontevedra, Spain
- Department of Computer Science, ESEI - Escola Superior de Enxeñaría Informática, Edificio Politécnico, Universidade de Vigo, Ourense, Ourense, Spain
- Reyes Pavón
- CINBIO - Biomedical Research Centre, CINBIO, Vigo, Pontevedra, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Pontevedra, Spain
- Department of Computer Science, ESEI - Escola Superior de Enxeñaría Informática, Edificio Politécnico, Universidade de Vigo, Ourense, Ourense, Spain
- Rosalía Laza
- CINBIO - Biomedical Research Centre, CINBIO, Vigo, Pontevedra, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Pontevedra, Spain
- Department of Computer Science, ESEI - Escola Superior de Enxeñaría Informática, Edificio Politécnico, Universidade de Vigo, Ourense, Ourense, Spain
- José R. Méndez
- CINBIO - Biomedical Research Centre, CINBIO, Vigo, Pontevedra, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Pontevedra, Spain
- Department of Computer Science, ESEI - Escola Superior de Enxeñaría Informática, Edificio Politécnico, Universidade de Vigo, Ourense, Ourense, Spain
- David Ruano-Ordás
- CINBIO - Biomedical Research Centre, CINBIO, Vigo, Pontevedra, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Pontevedra, Spain
- Department of Computer Science, ESEI - Escola Superior de Enxeñaría Informática, Edificio Politécnico, Universidade de Vigo, Ourense, Ourense, Spain
|
2
|
Jeyakodi G, Shanthi Bala P, Sruthi OT, Swathi K. MBORS: Mosquito vector Biocontrol Ontology and Recommendation System. J Vector Borne Dis 2024; 61:51-60. [PMID: 38648406 DOI: 10.4103/0972-9062.383640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 07/24/2023] [Indexed: 04/25/2024]
Abstract
BACKGROUND & OBJECTIVES Mosquito vectors are disease-causing insects, responsible for various life-threatening vector-borne diseases such as dengue, Zika, malaria, chikungunya, and lymphatic filariasis. In practice, synthetic insecticides are used to control the mosquito vector, but their continuous use is toxic to human health, resulting in communicable diseases. Non-toxic biocontrol agents such as bacteria, fungi, plants, and mosquito densoviruses play a vital role in controlling mosquitoes. Community awareness of mosquito biocontrol agents is required to control vector-borne diseases. A mosquito vector-based ontology facilitates mosquito biocontrol by providing information such as species names, pathogen-associated diseases, and biological controlling agents. It helps to explore the associations among the mosquitoes and their biocontrol agents in the form of rules. The Mosquito vector-based Biocontrol Ontology Recommendation System (MBORS) provides knowledge on mosquito-associated biocontrol agents to control the vector at the early stage of the mosquitoes such as eggs, larvae, pupae, and adults. This paper proposes MBORS for the prevention and effective control of vector-borne diseases. The Mosquito Vector Association ontology (MVAont) suggests the appropriate mosquito vector biocontrol agents (MosqVecRS) for related diseases. METHODS Natural Language Processing and data mining are employed to develop MBORS. Tokenization, Part-of-speech (POS) tagging, Named Entity Recognition (NER), and rule-based text mining techniques are used to identify the mosquito ontology concepts, while the Apriori data mining algorithm is used to predict the associations among them. RESULTS MBORS produces MVAont as a Web Ontology Language (OWL) representation and MosqVecRS as an Android application. The developed ontology and recommendation system are freely available on the web portal.
INTERPRETATION & CONCLUSION The MVAont predicts harmless biocontrol agents which help to diminish the rate of vector-borne diseases. In addition, the MosqVecRS system raises awareness of vectors and vector-borne diseases by recommending suitable biocontrol agents to the vector control community and researchers.
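As a rough illustration of how Apriori-style association mining yields rules of the kind mentioned in the METHODS section, the sketch below finds frequent itemsets level by level and derives rules with a minimum confidence. It is a generic textbook version, not the MBORS implementation; the function names and the support and confidence thresholds are assumptions, and any transactions passed to it here would be toy data rather than records from MVAont.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    current = [frozenset([item]) for item in sorted({i for t in tx for i in t})]
    result = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        level = {}
        for cand in current:
            support = sum(1 for t in tx if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets and prune any candidate
        # that has an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


def association_rules(freq, min_confidence):
    """Derive (lhs, rhs, support, confidence) rules from frequent itemsets."""
    rules = []
    for itemset, support in freq.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = support / freq[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules
```

Each transaction would be a set of co-occurring concepts (for example a species together with an agent reported against it), and a rule `lhs -> rhs` is kept when the share of transactions containing `lhs` that also contain `rhs` reaches the confidence threshold.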
Affiliation(s)
- G Jeyakodi
- Department of Computer Science, School of Engineering and Technology, Pondicherry University, Puducherry, India
|
4
|
Liu J, Li T, Montero J. Special issue on hybrid data and knowledge driven decision making under uncertainty (Hybrid DK for DM). Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.07.092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
6
|
Ibrahim M, Gauch S, Salman O, Alqahtani M. An automated method to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical resource. PeerJ Comput Sci 2021; 7:e668. [PMID: 34458573 PMCID: PMC8371999 DOI: 10.7717/peerj-cs.668] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/20/2021] [Accepted: 07/19/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical terminology, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. OBJECTIVE Many of the presented vocabularies are built manually or semi-automatically, requiring large investments of time and human effort and consequently causing these vocabularies to grow slowly. In this paper, we present an automatic method to enrich laymen's vocabularies that has the benefit of being applicable to vocabularies in any domain. METHODS Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies. Our approach further improves the consumer health vocabularies by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM), Open-Access Consumer Health Vocabulary (OAC CHV) and MedlinePlus Healthcare Vocabulary. RESULTS The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. In addition, the enhanced GloVe showed statistical significance over the two ground truth datasets with P < 0.001. CONCLUSIONS This paper presents an automatic approach to enrich consumer health vocabularies using the GloVe word embeddings and an auxiliary lexical source, WordNet.
Our approach was evaluated using healthcare text downloaded from MedHelp.org, a healthcare social media platform, and two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms' ability to automatically extract synonyms for those terms that appeared in the ground truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.
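The evaluation loop described above (pick a seed term, retrieve its nearest neighbours in embedding space, score them against the ground truth concept) might look roughly like the following. This is a minimal sketch under stated assumptions: the function names are hypothetical, plain cosine similarity over a dictionary of vectors stands in for trained GloVe embeddings, and no WordNet expansion is performed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_terms(seed, embeddings, k):
    """Return the k terms whose vectors are most similar to the seed's."""
    scored = [
        (cosine(embeddings[seed], vec), term)
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


def f_score(predicted, truth):
    """Harmonic mean of precision and recall for a set of predicted terms."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

In this framing, `embeddings` would map each corpus term to its vector, `truth` would be the set of laymen terms listed for the seed's concept in a reference vocabulary, and the reported F-scores would be averages of `f_score` over many seeds.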
|
7
|
Abstract
With the rapid increase in the world’s population, there is an ever-growing need for a sustainable food supply. Agriculture is one of the pillars for worldwide food provisioning, with fruits and vegetables being essential for a healthy diet. However, in the last few years the worldwide dispersion of virulent plant pests and diseases has caused significant decreases in the yield and quality of crops, in particular fruit, cereal and vegetables. Climate change and the intensification of global trade flows further accentuate the issue. Integrated Pest Management (IPM) is an approach to pest control that aims at maintaining pest insects at tolerable levels, keeping pest populations below an economic injury level. Under these circumstances, the early identification of pests and diseases becomes crucial. In this work, we present the first step towards a fully fledged, semantically enhanced decision support system for IPM. The ultimate goal is to build a complete agricultural knowledge base by gathering data from multiple, heterogeneous sources and to develop a system to assist farmers in decision making concerning the control of pests and diseases. The pest classifier framework has been evaluated in a simulated environment, obtaining an aggregated accuracy of 98.8%.
|
8
|
Abstract
The ontology sparse vector learning algorithm is essentially a dimensionality reduction technique: the key components of the p-dimensional vector are kept and the remaining components are set to zero, so as to obtain the key information in a given ontology application background. In the early stage of ontology data processing, assuming the relevant concepts and structural information of each ontology vertex are expressed as a p-dimensional vector, the goal of the algorithm is to find the location of the key components by learning from some ontology sample points. The ontology sparse vector itself contains a certain structure, such as the symmetry between components and the binding relationship between certain components, and the algorithm can also be used to uncover the correlations between components and the decisive components among them. In this paper, the graph structure is used to express these components and their interrelationships, and the optimal solution is obtained by using spectral graph theory and graph optimization techniques. The essence of the proposed ontology learning algorithm is to find the decisive vertices in the graph Gβ. Finally, two experiments show that the given ontology learning algorithm is effective in similarity calculation and ontology mapping in some specific engineering fields.
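The basic operation named at the start of this abstract, keeping the key components of a p-dimensional vector and zeroing the rest, can be illustrated with a simple top-k selection. This is only a stand-in: the paper locates the key components through learning on a graph, whereas this sketch assumes they are the k entries of largest magnitude, and the function name `sparsify` is hypothetical.

```python
def sparsify(vector, k):
    """Keep the k largest-magnitude components of `vector`; zero the others."""
    if k <= 0:
        return [0.0] * len(vector)
    # Indices of the k components with the largest absolute value.
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]
```

For example, `sparsify([0.1, -2.0, 0.5, 3.0], 2)` keeps the second and fourth components and zeroes the other two.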
|