1
Shamoug A, Cranefield S, Dick G. SEmHuS: a semantically embedded humanitarian space. Journal of International Humanitarian Action 2023; 8:3. [PMID: 37520288 PMCID: PMC9990040 DOI: 10.1186/s41018-023-00135-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2022] [Accepted: 02/15/2023] [Indexed: 08/01/2023]
Abstract
Humanitarian crises are unpredictable and complex environments, in which access to basic services and infrastructures is not adequately available. Computing in a humanitarian crisis environment is different from any other environment. In humanitarian environments the accessibility to electricity, internet, and qualified human resources is usually limited. Hence, advanced computing technologies in such an environment are hard to deploy and implement. Moreover, time and resources in those environments are also limited and devoted for life-saving activities, which makes computing technologies among the lowest priorities for those who operate there. In humanitarian crises, interests and preferences of decision-makers are driven by their original languages, cultures, education, religions, and political affiliations. Hence, decision-making in such environments is usually hard and slow because it solely depends on human capacity in absence of proper computing techniques. In this research, we are interested in overcoming the above challenges by involving machines in humanitarian response. This work proposes and evaluates a text classification and embedding technique to transform historical humanitarian records from human-oriented into a machine-oriented structure (in a vector space). This technique allows machines to extract humanitarian knowledge and use it to answer questions and classify documents. Having machines involved in those tasks helps decision-makers in speeding up humanitarian response, reducing its cost, saving lives, and easing human suffering.
Supplementary Information: The online version contains supplementary material available at 10.1186/s41018-023-00135-4.
Affiliation(s)
- Aladdin Shamoug
  - Department of Information Science, University of Otago, Dunedin, New Zealand
- Stephen Cranefield
  - Department of Information Science, University of Otago, Dunedin, New Zealand
- Grant Dick
  - Department of Information Science, University of Otago, Dunedin, New Zealand
2
Zhang J, Gui W, Wen J. China’s policy similarity evaluation using LDA model: An experimental analysis in Hebei province. J Inf Sci 2022. [DOI: 10.1177/01655515221097858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
This article proposes a combination model, which is composed of latent Dirichlet allocation model, TF-IDF feature extraction algorithm and Euclidean distance measurement method, to identify and judge whether the similarities between multiple policy texts exist or not. With the help of actual data result, this will drive the relevant government agencies to figure out problems in a timely manner and provide a decision-making basis for them to formulate and optimise appropriate economic policies. To this end, this article analyses and studies the four types of economic texts that are classified as Insurance, Banking, Tax and Finance from the Central Government of Hebei province and Shijiazhuang city levels. Also, we consider Beijing, Shanghai and Guangdong. Experimental results show that (1) the combination model can quickly and effectively recognise and determine whether there are similarities between multiple economic policy texts; (2) similarities exist or not between the central, provincial and municipal level policy texts depending on the comparison of the distance values across them; (3) the smaller the distance value between economic policy texts of the same kind, the higher the similarity in them; and (4) the distance values between the six policy texts in Finance, Insurance, Bank and Tax categories are ranked from low to high. In terms of similarity, the Finance category is the highest, followed by Insurance and Bank, and the Tax category is the lowest.
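The distance-based comparison described in this abstract can be illustrated with a minimal sketch. It covers only the TF-IDF and Euclidean distance steps; the LDA topic-modelling step is deliberately left out, and the three policy snippets, the smoothed IDF formula, and all function names are assumptions made for illustration, not the authors' data or implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF vector (list of floats) per tokenised document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant) so shared terms keep a non-zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vectors


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up policy snippets, used only to exercise the functions.
policies = {
    "tax_a": "reduce corporate tax rate for small enterprises".split(),
    "tax_b": "reduce corporate tax rate for export enterprises".split(),
    "bank_a": "raise reserve requirements for commercial banks".split(),
}
names = list(policies)
vecs = dict(zip(names, tfidf_vectors([policies[n] for n in names])))

d_same = euclidean(vecs["tax_a"], vecs["tax_b"])
d_diff = euclidean(vecs["tax_a"], vecs["bank_a"])
print(f"tax_a vs tax_b:  {d_same:.3f}")
print(f"tax_a vs bank_a: {d_diff:.3f}")
```

In this toy setting the two tax snippets end up closer to each other than either is to the banking snippet, which matches finding (3) of the abstract: a smaller distance indicates higher similarity.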
Affiliation(s)
- Junhuan Zhang
  - School of Economics and Management, Beihang University, China; Key Laboratory of Complex System Analysis, Management and Decision, Beihang University, Ministry of Education, China
- Wanbing Gui
  - School of Economics and Management, Beihang University, China
- Jiaqi Wen
  - School of Computer Science, University of Technology Sydney, Australia
3
Das B, Majumder M, Sekh AA, Phadikar S. Automatic question generation and answer assessment for subjective examination. Cogn Syst Res 2022. [DOI: 10.1016/j.cogsys.2021.11.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
4
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao
  - School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
  - Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
5
Krishna Siva Prasad M, Sharma P. Exploring intrinsic information content models for addressing the issues of traditional semantic measures to evaluate verb similarity. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2021.101280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
6
Alahmar A, AlMousa M, Benlamri R. Automated clinical pathway standardization using SNOMED CT-based semantic relatedness. Digit Health 2022; 8:20552076221089796. [PMID: 35392252 PMCID: PMC8980435 DOI: 10.1177/20552076221089796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2021] [Accepted: 03/09/2022] [Indexed: 11/22/2022] Open
Abstract
The increasing number of patients and heavy workload drive health care institutions to search for efficient and cost-effective methods to deliver optimal care. Clinical pathways are promising care plans that proved to be efficient in reducing costs and optimizing resource usage. However, most clinical pathways are circulated in paper-based formats. Clinical pathway computerization is an emerging research field that aims to integrate clinical pathways with health information systems. A key process in clinical pathway computerization is the standardization of clinical pathway terminology to comply with digital terminology systems. Since clinical pathways include sensitive medical terms, clinical pathway standardization is performed manually and is difficult to automate using machines. The objective of this research is to introduce automation to clinical pathway standardization. The proposed approach utilizes a semantic score-based algorithm that automates the search for SNOMED CT terms. The algorithm was implemented in a software system with a graphical user interface component that physicians can use to standardize clinical pathways by searching for and comparing relevant SNOMED CT retrieved automatically by the algorithm. The system has been tested and validated on SNOMED CT ontology. The experimental results show that the system reached a maximum search space reduction of 98.9% within any single iteration of the algorithm and an overall average of 71.3%. The system enables physicians to locate the proper terms precisely, quickly, and more efficiently. This is demonstrated using case studies, and the results show that human-guided automation is a promising methodology in the field of clinical pathway standardization and computerization.
Affiliation(s)
- Ayman Alahmar
  - Department of Software Engineering, Lakehead University, Thunder Bay, Ontario, Canada
- Mohannad AlMousa
  - Department of Software Engineering, Lakehead University, Thunder Bay, Ontario, Canada
- Rachid Benlamri
  - Department of Software Engineering, Lakehead University, Thunder Bay, Ontario, Canada
7
Abstract
In low-resource domains, it is challenging to achieve good performance using existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression because training requires sufficiently many data points for the possible values of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to 1. define a semantic similarity measure between categories, based on the hierarchy (we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used) and 2. use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes for single-valued and multi-valued categorical data. We perform experiments on three different use cases. We first compare existing similarity approaches with our approach on a word pair similarity use case. This is followed by creating word embeddings using different similarity approaches. A comparison with existing methods such as Google, Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when using knowledge-based embeddings. The third use case uses a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings. Significant improvement in performance of the downstream classification tasks is achieved by using semantic information.
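The two-step idea in this abstract (a hierarchy-based similarity, then a similarity-weighted one-hot vector) can be sketched as follows. The toy hierarchy, the depth-based similarity formula, and the function names are assumptions chosen for illustration; the abstract does not specify the authors' actual measure, and only the single-valued case is shown.

```python
# Hypothetical concept hierarchy: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "covid": "infection",
    "melanoma": "cancer",
}


def ancestors(concept):
    """Path from a concept up to the root, including the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def similarity(a, b):
    """Assumed depth-based similarity in (0, 1]:
    2 * depth(deepest common ancestor) / (depth(a) + depth(b)), root depth = 1."""
    path_a, path_b = ancestors(a), ancestors(b)
    in_b = set(path_b)
    common = next(c for c in path_a if c in in_b)  # deepest shared ancestor
    return 2 * len(ancestors(common)) / (len(path_a) + len(path_b))


# The categorical values that actually occur in the (hypothetical) dataset.
CATEGORIES = ["flu", "covid", "melanoma"]


def semantic_one_hot(value):
    """Modified one-hot vector: each slot holds the similarity to that category,
    so the slot for the value itself is 1.0 and related categories are non-zero."""
    return [similarity(value, c) for c in CATEGORIES]


print(semantic_one_hot("flu"))
```

Compared with a plain one-hot vector, the encoding of "flu" carries some weight in the "covid" slot, so a downstream model can generalise from one category to a closely related one even when training data for the latter is scarce.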
8
González-Eras A, Santos RD, Aguilar J, Lopez A. Ontological engineering for the definition of a COVID-19 pandemic ontology. Informatics in Medicine Unlocked 2021; 28:100816. [PMID: 34934805 PMCID: PMC8677430 DOI: 10.1016/j.imu.2021.100816] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/09/2021] [Revised: 12/09/2021] [Accepted: 12/10/2021] [Indexed: 11/30/2022] Open
Abstract
COVID-19 has generated a lot of information in different formats, and one of them is in the ontology format. Also, there are previous ontologies from other disciplines that can help to analyze the COVID-19 pandemic. Thus, due to the large quantity of COVID-19 information in the form of ontologies, approaches to ontology integration and interoperability could be beneficial. In this context, this research proposes a new ontology, called COVID-19 Pandemic ontology, which is the product of an ontological engineering process proposed in this research that allows the integration of several ontologies to cover all the aspects of this infectious disease. The ontological engineering process defines tasks of fusion, alignment, and linking for integrating the ontologies. The resulting pandemic ontology provides a simple repository for storing information about the COVID-19, reusing existing ontologies, to offer multiple views about the disease, including the social context. This ontology has been tested in different case studies to prove its capabilities to infer useful information about the COVID-19 pandemic.
Affiliation(s)
- Alexandra González-Eras
  - CEMISID, Facultad de Ingeniería- Universidad de Los Andes, 5101, Mérida, Venezuela
  - Departamento de Ciencias de la Computación y Electrónica - Universidad Técnica Particular de Loja, Cdla. Universitaria San Cayetano Alto, 1101608, Loja, Ecuador
  - Tepuy R+D Group. Artificial Intelligence Software Development. Mérida, Venezuela
- Ricardo Dos Santos
  - CEMISID, Facultad de Ingeniería- Universidad de Los Andes, 5101, Mérida, Venezuela
  - Tepuy R+D Group. Artificial Intelligence Software Development. Mérida, Venezuela
- Jose Aguilar
  - CEMISID, Facultad de Ingeniería- Universidad de Los Andes, 5101, Mérida, Venezuela
  - Tepuy R+D Group. Artificial Intelligence Software Development. Mérida, Venezuela
  - GIDITIC, Escuela de Ingeniería, Universidad EAFIT, Medellín, Colombia
  - Universidad de Alcala, Departamento de Automática, Spain
- Alberto Lopez
  - CEMISID, Facultad de Ingeniería- Universidad de Los Andes, 5101, Mérida, Venezuela
  - Tepuy R+D Group. Artificial Intelligence Software Development. Mérida, Venezuela
9
Slater K, Williams JA, Karwath A, Fanning H, Ball S, Schofield PN, Hoehndorf R, Gkoutos GV. Multi-faceted semantic clustering with text-derived phenotypes. Comput Biol Med 2021; 138:104904. [PMID: 34600327 PMCID: PMC8573608 DOI: 10.1016/j.compbiomed.2021.104904] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/23/2021] [Revised: 09/22/2021] [Accepted: 09/23/2021] [Indexed: 02/03/2023]
Abstract
Identification of ontology concepts in clinical narrative text enables the creation of phenotype profiles that can be associated with clinical entities, such as patients or drugs. Constructing patient phenotype profiles using formal ontologies enables their analysis via semantic similarity, in turn enabling the use of background knowledge in clustering or classification analyses. However, traditional semantic similarity approaches collapse complex relationships between patient phenotypes into a unitary similarity score for each pair of patients. Moreover, single scores may be based only on matching terms with the greatest information content (IC), ignoring other dimensions of patient similarity. This process necessarily leads to a loss of information in the resulting representation of patient similarity, and is especially apparent when using very large text-derived and highly multi-morbid phenotype profiles. Moreover, it renders finding a biological explanation for similarity very difficult; the black box problem. In this article, we explore the generation of multiple semantic similarity scores for patients based on different facets of their phenotypic manifestation, which we define through different sub-graphs in the Human Phenotype Ontology. We further present a new methodology for deriving sets of qualitative class descriptions for groups of entities described by ontology terms. Leveraging this strategy to obtain meaningful explanations for our semantic clusters alongside other evaluation techniques, we show that semantic clustering with ontology-derived facets enables the representation, and thus identification of, clinically relevant phenotype relationships not easily recoverable using overall clustering alone. In this way, we demonstrate the potential of faceted semantic clustering for gaining a deeper and more nuanced understanding of text-derived patient phenotypes.
Affiliation(s)
- Karin Slater
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- John A Williams
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Andreas Karwath
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Hilary Fanning
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Simon Ball
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
- Paul N Schofield
  - Dept of Physiology, Development, and Neuroscience, University of Cambridge, UK
- Robert Hoehndorf
  - Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Saudi Arabia
- Georgios V Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, UK; NIHR Experimental Cancer Medicine Centre, UK; NIHR Surgical Reconstruction and Microbiology Research Centre, UK; NIHR Biomedical Research Centre, UK; MRC Health Data Research UK (HDR UK) Midlands, UK; University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
10
Knowledge-based sentence semantic similarity: algebraical properties. Progress in Artificial Intelligence 2021. [DOI: 10.1007/s13748-021-00248-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Determining the extent to which two text snippets are semantically equivalent is a well-researched topic in the areas of natural language processing, information retrieval and text summarization. The sentence-to-sentence similarity scoring is extensively used in both generic and query-based summarization of documents as a significance or a similarity indicator. Nevertheless, most of these applications utilize the concept of semantic similarity measure only as a tool, without paying importance to the inherent properties of such tools that ultimately restrict the scope and technical soundness of the underlined applications. This paper aims to contribute to fill in this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer and Leacock and Chodorow, from both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. We have especially examined properties related to range and scope of the semantic similarity score, incremental monotonicity evolution, monotonicity with respect to hyponymy/hypernymy relationship as well as a set of interactive properties. Extension from word semantic similarity to sentence similarity has also been investigated using a pairwise canonical extension. Properties of the underlined sentence-to-sentence similarity are examined and scrutinized. Next, to overcome inherent limitations of WordNet semantic similarity in terms of accounting for various Part-of-Speech word categories, a WordNet “All word-To-Noun conversion” that makes use of Categorial Variation Database (CatVar) is put forward and evaluated using a publicly available dataset with a comparison with some state-of-the-art methods. The finding demonstrates the feasibility of the proposal and opens up new opportunities in information retrieval and natural language processing tasks.
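The three hierarchical measures named in this abstract can be sketched on a small hand-made taxonomy. This is an illustration under stated assumptions, not the paper's setup: the taxonomy is hypothetical rather than WordNet, depth is counted in nodes with the root at depth 1, and the formulas follow the commonly used definitions (path similarity as 1 / (1 + edge distance), Wu and Palmer as 2 * depth(lcs) / (depth(a) + depth(b)), Leacock and Chodorow as -log((edge distance + 1) / (2 * maximum depth))).

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def chain(node):
    """Nodes from `node` up to the root, inclusive."""
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(chain(node))


MAX_DEPTH = max(depth(n) for n in PARENT)


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    in_b = set(chain(b))
    return next(n for n in chain(a) if n in in_b)


def edge_distance(a, b):
    """Number of edges on the shortest path between a and b through their lcs."""
    c = lcs(a, b)
    return (depth(a) - depth(c)) + (depth(b) - depth(c))


def path_similarity(a, b):
    return 1.0 / (1.0 + edge_distance(a, b))


def wu_palmer(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def leacock_chodorow(a, b):
    return -math.log((edge_distance(a, b) + 1) / (2.0 * MAX_DEPTH))


for pair in [("dog", "dog"), ("dog", "cat"), ("dog", "car")]:
    print(pair,
          round(path_similarity(*pair), 3),
          round(wu_palmer(*pair), 3),
          round(leacock_chodorow(*pair), 3))
```

The sketch makes one of the range properties easy to see: path and Wu and Palmer scores stay within (0, 1], while the Leacock and Chodorow score of a concept with itself is log(2 * maximum depth), so its upper bound depends on the taxonomy.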
11
Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021; 22:bbaa199. [PMID: 33049044 PMCID: PMC8293838 DOI: 10.1093/bib/bbaa199] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 05/07/2020] [Revised: 08/03/2020] [Accepted: 08/04/2020] [Indexed: 12/13/2022] Open
Abstract
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
Affiliation(s)
- Xin Gao
  - Computational Bioscience Research Center and lead of the Structural and Functional Bioinformatics Group at King Abdullah University of Science and Technology
12
Qin Y, Qin X, Chen H, Li X, Lang W. Measuring cognitive proximity using semantic analysis: A case study of China's ICT industry. Scientometrics 2021. [DOI: 10.1007/s11192-021-04021-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
13
Yaghtin M, Sotudeh H, Nikseresht A, Mirzabeigi M. Modeling the co-citation dependence on semantic layers of co-cited documents. Online Information Review 2021. [DOI: 10.1108/oir-04-2020-0126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Purpose: Co-citation frequency, defined as the number of documents co-citing two articles, is considered as a quantitative, and thus, an efficient proxy of subject relatedness or prestige of the co-cited articles. Despite its quantitative nature, it is found effective in retrieving and evaluating documents, signifying its linkage with the related documents' contents. To better understand the dynamism of the citation network, the present study aims to investigate various content features giving rise to the measure.
Design/methodology/approach: The present study examined the interaction of different co-citation features in explaining the co-citation frequency. The features include the co-cited works' similarities in their full-texts, Medical Subject Headings (MeSH) terms, co-citation proximity, opinions and co-citances. A test collection is built using the CITREC dataset. The data were analyzed using natural language processing (NLP) and opinion mining techniques. A linear model was developed to regress the objective and subjective content-based co-citation measures against the natural log of the co-citation frequency.
Findings: The dimensions of co-citation similarity, either subjective or objective, play significant roles in predicting co-citation frequency. The model can predict about half of the co-citation variance. The interaction of co-opinionatedness and non-co-opinionatedness is the strongest factor in the model.
Originality/value: It is the first study in revealing that both the objective and subjective similarities could significantly predict the co-citation frequency. The findings re-confirm the citation analysis assumption claiming the connection between the cognitive layers of cited documents and citation measures in general and the co-citation frequency in particular.
Peer review: The peer review history for this article is available at https://publons.com/publon/10.1108/OIR-04-2020-0126.
14
Souza CM, Meireles MRG, Almeida PEM. A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 2020. [DOI: 10.1007/s11192-020-03732-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 10/23/2022]
15
Rashidi K, Sotudeh H, Mirzabeigi M, Nikseresht A. Determining the informativeness of comments: a natural language study of F1000Research open peer review reports. Online Information Review 2020. [DOI: 10.1108/oir-02-2020-0073] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Purpose: Social comments are rich in information and useful in evaluating, ranking or retrieving different kinds of materials. However, their merits in representing or providing added values to scientific articles have not yet been studied. Therefore, the present study investigates the informativeness of open review reports as a kind of social comments in a scholarly setting.
Design/methodology/approach: A test collection was built consisting of 100 randomly selected queries, 1,962 reviewed documents and their reviewers' open reports from F1000Research. They were analyzed using natural language techniques. The comments' salient words were compared to the documents' and also to the Medical Subject Headings (MeSH) salient words. The receiver operating characteristic (ROC) curve was used to test the accuracy of the comments in representing their related articles.
Findings: The papers' contents and comments have a considerable number of salient words in common. The comments' salient words are also largely found in the MeSH, signifying their consistency with the knowledge tree and their potential to add some complementary features to their related items. The ROC curves confirm the accuracy of the comments in retrieving their related papers.
Originality/value: This research is the first to reveal the merits of open review reports on scientific papers, in terms of their relatedness to their mother articles, in specific, and to the knowledge tree, in general. They are found informative in not only representing the reviewed papers but also in adding values to the contents of the papers.
16
Wang B, Fei T, Kang Y, Li M, Du Q, Han M, Dong N. Understanding the spatial dimension of natural language by measuring the spatial semantic similarity of words through a scalable geospatial context window. PLoS One 2020; 15:e0236347. [PMID: 32702022 PMCID: PMC7377466 DOI: 10.1371/journal.pone.0236347] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/20/2020] [Accepted: 07/03/2020] [Indexed: 11/19/2022] Open
Abstract
Measuring the semantic similarity between words is important for natural language processing tasks. The traditional models of semantic similarity perform well in most cases, but when dealing with words that involve geographical context, spatial semantics of implied spatial information are rarely preserved. Geographic information retrieval (GIR) methods have focused on this issue; however, they sometimes fail to solve the problem because the spatial and textual similarities of words are considered and calculated separately. In this paper, from the perspective of spatial context, we consider the two parts as a whole—spatial context semantics, and we propose a method that measures spatial semantic similarity using a sliding geospatial context window for geo-tagged words. The proposed method was first validated with a set of simulated data and then applied to a real-world dataset from Flickr. As a result, a spatial semantic similarity model at different scales is presented. We believe this model is a necessary supplement for traditional textual-language semantic analyses of words obtained by word-embedding technologies. This study has the potential to improve the quality of recommendation systems by considering relevant spatial context semantics, and benefits linguistic semantic research by emphasising the spatial cognition among words.
Affiliation(s)
- Bozhi Wang
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
- Teng Fei
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
- Yuhao Kang
  - Geospatial Data Science Lab, Department of Geography, University of Wisconsin, Madison, WI, United States of America
- Meng Li
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
- Qingyun Du
  - School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
- Meng Han
  - State Grid Beijing Electric Power Company, Beijing, China
- Ning Dong
  - State Grid Beijing Electric Power Company, Beijing, China
17
|
Dang D, Chen C, Yu W, Hu H. A semantic-aware collaborative filtering recommendation method for emergency plans in response to meteorological hazards. INTELL DATA ANAL 2020. [DOI: 10.3233/ida-194571] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
|
18
|
An approach for measuring semantic similarity between Wikipedia concepts using multiple inheritances. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2019.102188] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 11/21/2022]
|
19
|
Gątkowski M, Dietl M, Skrok Ł, Whalen R, Rockett K. Semantically-based patent thicket identification. RESEARCH POLICY 2020. [DOI: 10.1016/j.respol.2020.103925] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
|
20
|
Cardoso C, Sousa RT, Köhler S, Pesquita C. A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain. Database (Oxford) 2020; 2020:baaa078. [PMID: 33181823 PMCID: PMC7661097 DOI: 10.1093/database/baaa078] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2020] [Revised: 08/13/2020] [Accepted: 08/24/2020] [Indexed: 01/12/2023]
Abstract
The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein-protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein-protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
Affiliation(s)
- Carlota Cardoso: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Rita T Sousa: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Catia Pesquita: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
|
21
|
Rodriguez-Prieto O, Araujo L, Martinez-Romo J. Discovering related scientific literature beyond semantic similarity: a new co-citation approach. Scientometrics 2019. [DOI: 10.1007/s11192-019-03125-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022]
|
22
|
Abstract
In highly sophisticated network attacks, command-and-control (C&C) servers always use domain generation algorithms (DGAs) to dynamically produce several candidate domains instead of static hard-coded lists of IP addresses or domain names. Distinguishing the domains generated by DGAs from legitimate ones is critical for detecting the presence of malware or further locating the hidden attackers. The word-based DGAs disclosed in recent network attack events are significantly stealthier than traditional character-based DGAs. In word-based DGAs, two or more words are randomly chosen from one or more specific dictionaries to form a dynamic domain; these regularly generated domains aim to mimic the characteristics of a legitimate domain. Existing DGA detection schemes, including the state-of-the-art one based on deep learning, still cannot detect these domains accurately while maintaining an acceptable false alarm rate. In this study, we exploit the inter-word and inter-domain correlations using semantic analysis approaches, taking word embeddings and part-of-speech into consideration. Next, we propose a detection framework for word-based DGAs by incorporating the frequency distribution of the words and that of part-of-speech into the design of the feature set. Using an ensemble classifier constructed from Naive Bayes, Extra-Trees, and Logistic Regression, we benchmark the proposed scheme with malicious and legitimate domain samples extracted from public datasets. The experimental results show that the proposed scheme can achieve significantly higher detection accuracy for word-based DGAs when compared with three state-of-the-art DGA detection schemes.
|
23
|
Using Summarization Techniques on Patent Database Through Computational Intelligence. PROGRESS IN ARTIFICIAL INTELLIGENCE 2019. [DOI: 10.1007/978-3-030-30244-3_42] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
24
|
Zhou S, Kang H, Yao B, Gong Y. An automated pipeline for analyzing medication event reports in clinical settings. BMC Med Inform Decis Mak 2018; 18:113. [PMID: 30526590 PMCID: PMC6284273 DOI: 10.1186/s12911-018-0687-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 02/02/2023]
Abstract
BACKGROUND: Medication events in clinical settings are significant threats to patient safety. Analyzing and learning from the medication event reports is an important way to prevent the recurrence of these events. Currently, the analysis of medication event reports is ineffective and imposes a heavy workload on clinicians. An automated pipeline is proposed to help clinicians deal with the accumulated reports, extract valuable information and generate feedback from the reports. Thus, the strategy of medication event prevention can be further developed based on the lessons learned. METHODS: In order to build the automated pipeline, four classic machine learning classifiers (i.e., support vector machine, Naïve Bayes, random forest, and multi-layer perceptron) were compared to identify the event originating stages, event types, and event causes from the medication event reports. The precision, recall and F-1 measure were calculated to assess the performance of the classifiers. Further, a strategy to measure the similarity of medication event reports in our pipeline was established and evaluated by human subjects through a questionnaire. RESULTS: We developed three classifiers to identify the medication event originating stages, event types and causes, respectively. For the event originating stages, a support vector machine classifier obtains the best performance with an F-1 measure of 0.792. For the event types, a support vector machine classifier exhibits the best performance with an F-1 measure of 0.758. For the event causes, a random forest classifier reaches an F-1 measure of 0.925. The questionnaire results show that the similarity measurement is consistent with the domain experts in the task of identifying similar reports. CONCLUSION: We developed and evaluated an automated pipeline that could identify three attributes from the medication event reports and calculate the similarity scores between the reports based on the attributes. The pipeline is expected to improve the efficiency of analyzing the medication event reports and to learn from the reports in a timely manner.
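As a rough illustration of the evaluation metrics named in the abstract above, the following minimal sketch computes precision, recall and the F-1 measure for one class from paired true and predicted labels. The function name and the stage labels are hypothetical and are not taken from the paper, which does not describe its implementation here.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Return (precision, recall, F-1) for a single class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical event-originating-stage labels, for illustration only.
true_stages = ["prescribing", "dispensing", "prescribing", "administering", "prescribing"]
pred_stages = ["prescribing", "prescribing", "prescribing", "administering", "dispensing"]
precision, recall, f1 = precision_recall_f1(true_stages, pred_stages, "prescribing")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a multi-class task such as the ones described, per-class scores like these are usually averaged; the abstract does not say which averaging scheme was used.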
Affiliation(s)
- Sicheng Zhou: School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Hong Kang: School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Bin Yao: School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
- Yang Gong: School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, 77030, TX, USA
|
25
|
Smaili FZ, Gao X, Hoehndorf R. OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction. Bioinformatics 2018; 35:2133-2140. [DOI: 10.1093/bioinformatics/bty933] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 08/05/2018] [Revised: 11/02/2018] [Accepted: 11/07/2018] [Indexed: 12/11/2022]
Affiliation(s)
- Fatima Zohra Smaili: Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Xin Gao: Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Robert Hoehndorf: Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
|
26
|
|
27
|
Cardoso S, Reynaud-Delaître C, Da Silveira M, Lin YC, Groß A, Rahm E, Pruski C. Evolving semantic annotations through multiple versions of controlled medical terminologies. HEALTH AND TECHNOLOGY 2018. [DOI: 10.1007/s12553-018-0261-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
|
28
|
Multi-corpus-Based Model for Measuring the Semantic Relatedness in Short Texts (SRST). ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2018. [DOI: 10.1007/s13369-018-3232-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 10/17/2022]
|
29
|
Kulmanov M, Hoehndorf R. Evaluating the effect of annotation size on measures of semantic similarity. J Biomed Semantics 2017; 8:7. [PMID: 28193260 PMCID: PMC5307803 DOI: 10.1186/s13326-017-0119-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 10/15/2016] [Accepted: 02/01/2017] [Indexed: 01/29/2023]
Abstract
Background: Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity utilize ontologies to determine how similar two entities annotated with classes from ontologies are, and semantic similarity is increasingly applied in applications ranging from diagnosis of disease to investigation in gene networks and functions of gene products. Results: Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities, difference in annotation size and to the depth or specificity of annotation classes. We find that most similarity measures are sensitive to the number of annotations of entities, difference in annotation size as well as to the depth of annotation classes; well-studied and richly annotated entities will usually show higher similarity than entities with only few annotations even in the absence of any biological relation. Conclusions: Our findings may have significant impact on the interpretation of results that rely on measures of semantic similarity, and we demonstrate how the sensitivity to annotation size can lead to a bias when using semantic similarity to predict protein-protein interactions. Electronic supplementary material: The online version of this article (doi:10.1186/s13326-017-0119-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Maxat Kulmanov: Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia; Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Robert Hoehndorf: Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia; Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
|
30
|
Görnerup O, Gillblad D, Vasiloudis T. Domain-agnostic discovery of similarities and concepts at scale. Knowl Inf Syst 2016. [DOI: 10.1007/s10115-016-0984-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|