1. Automatically transforming full length biomedical articles into search queries for retrieving related articles. Egyptian Informatics Journal 2021. [DOI: 10.1016/j.eij.2020.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
2. Henry S, McInnes BT. Indirect association and ranking hypotheses for literature based discovery. BMC Bioinformatics 2019; 20:425. [PMID: 31416434 PMCID: PMC6694578 DOI: 10.1186/s12859-019-2989-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/24/2018] [Accepted: 07/09/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties. RESULTS We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank. CONCLUSIONS Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
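The evaluation step described in this abstract (rank term pairs, then apply a threshold at each rank) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy term pairs, and the gold set are all hypothetical.

```python
def precision_recall_at_each_rank(ranked_pairs, true_pairs):
    """Return one (precision, recall) tuple per rank cutoff.

    ranked_pairs: term pairs ordered from highest to lowest score.
    true_pairs:   pairs counted as correct (e.g. found in a later time slice).
    """
    true_pairs = set(true_pairs)
    hits = 0
    curve = []
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in true_pairs:
            hits += 1
        precision = hits / rank                      # correct among top `rank`
        recall = hits / len(true_pairs) if true_pairs else 0.0
        curve.append((precision, recall))
    return curve


# Hypothetical ranking of four candidate hypotheses for start term "a".
ranked = [("a", "c1"), ("a", "c2"), ("a", "c3"), ("a", "c4")]
gold = {("a", "c1"), ("a", "c3")}
curve = precision_recall_at_each_rank(ranked, gold)
```

Each tuple in `curve` corresponds to treating everything at or above that rank as a predicted hypothesis, which is one straightforward reading of "applying a threshold at each rank".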
Affiliation(s)
- Sam Henry
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
3. Henry S, McQuilkin A, McInnes BT. Association measures for estimating semantic similarity and relatedness between biomedical concepts. Artif Intell Med 2018; 93:1-10. [PMID: 30197305 DOI: 10.1016/j.artmed.2018.08.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/01/2017] [Revised: 03/08/2018] [Accepted: 08/24/2018] [Indexed: 12/26/2022]
Abstract
Association measures quantify the observed likelihood that a term pair co-occurs versus the co-occurrence predicted by chance. This is based both on the terms' individual occurrence frequencies and on their mutual co-occurrence frequencies. One application of association scores is estimating semantic relatedness, which is critical for many natural language processing applications, such as clustering of biomedical and clinical documents and the development of biomedical terminologies and ontologies. In this paper we propose a method of generating association scores between biomedical concepts to estimate semantic relatedness. We use co-occurrence statistics between Unified Medical Language System (UMLS) concepts to account for lexical variation at the synonymous level, and introduce a process of concept expansion that exploits hierarchical information from the UMLS to account for lexical variation at the hyponymous level. State of the art results are achieved on several standard evaluation datasets, and an in depth analysis of hyper-parameters is presented.
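To make the "observed versus expected by chance" idea concrete, here is a small sketch using pointwise mutual information (PMI). PMI is chosen here only as a familiar example of an association measure; the abstract does not say which measures the paper uses, and the counts below are invented.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information from raw counts.

    n_xy:    number of co-occurrences of terms x and y
    n_x/n_y: individual occurrence counts of x and y
    n_total: total number of observations
    """
    if n_xy == 0:
        return float("-inf")          # never observed together
    observed = n_xy / n_total         # observed joint probability
    expected = (n_x / n_total) * (n_y / n_total)  # joint probability under independence
    return math.log2(observed / expected)


# Hypothetical counts: the pair co-occurs 40 times more often than chance predicts.
strong = pmi(n_xy=20, n_x=100, n_y=50, n_total=10_000)
# Hypothetical counts: the pair co-occurs exactly as often as chance predicts.
neutral = pmi(n_xy=5, n_x=100, n_y=500, n_total=10_000)
```

A positive score means the pair is seen together more often than independent terms would be; a score near zero means the observed co-occurrence matches the chance expectation.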
Affiliation(s)
- Sam Henry
  - Virginia Commonwealth University, Richmond, VA, United States
- Alex McQuilkin
  - Virginia Commonwealth University, Richmond, VA, United States
4. Wilk S, Michalowski W, Slowinski R, Thomas R, Kadzinski M, Farion K, O'Sullivan D. Learning the Preferences of Physicians for the Organization of Result Lists of Medical Evidence Articles. Methods Inf Med 2018; 53:344-56. [DOI: 10.3414/me13-01-0085] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/24/2013] [Accepted: 02/24/2014] [Indexed: 11/09/2022]
Abstract
Summary
Background: Online medical knowledge repositories such as MEDLINE and The Cochrane Library are increasingly used by physicians to retrieve articles to aid with clinical decision making. The prevailing approach for organizing retrieved articles is in the form of a rank-ordered list, with the assumption that the higher an article is presented on a list, the more relevant it is.
Objectives: Despite this common list-based organization, it is seldom studied how physicians perceive the association between the relevance of articles and the order in which articles are presented. In this paper we describe a case study that captured physician preferences for 3-element lists of medical articles in order to learn how to organize medical knowledge for decision-making.
Methods: Comprehensive relevance evaluations were developed to represent 3-element lists of hypothetical articles that may be retrieved from an online medical knowledge source such as MEDLINE or The Cochrane Library. Comprehensive relevance evaluations assess not only an article's relevance for a query, but also whether it has been placed on the correct list position. In other words, an article may be relevant and correctly placed on a result list (e.g. the most relevant article appears first in the result list), an article may be relevant for a query but placed on an incorrect list position (e.g. the most relevant article appears second in a result list), or an article may be irrelevant for a query yet still appear in the result list. The relevance evaluations were presented to six senior physicians who were asked to express their preferences for an article's relevance and its position on a list by pairwise comparisons representing different combinations of 3-element lists. The elicited preferences were assessed using a novel GRIP (Generalized Regression with Intensities of Preference) method and represented as an additive value function. Value functions were derived for individual physicians as well as the group of physicians.
Results: The results show that physicians assign significant value to the 1st position on a list and they expect that the most relevant article is presented first. Whilst physicians still prefer obtaining a correctly placed article on position 2, they are also quite satisfied with a misplaced relevant article. Low consideration of the 3rd position was uniformly confirmed.
Conclusions: Our findings confirm the importance of placing the most relevant article on the 1st position on a list and show that the importance paid to position on a list diminishes significantly after the 2nd position. The derived value functions may be used by developers of clinical decision support applications to decide how best to organize medical knowledge for decision making and to create personalized evaluation measures that can augment typical measures used to evaluate information retrieval systems.
5. Henry S, Cuffy C, McInnes BT. Vector representations of multi-word terms for semantic relatedness. J Biomed Inform 2017; 77:111-119. [PMID: 29247788 DOI: 10.1016/j.jbi.2017.12.006] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 07/25/2017] [Revised: 10/09/2017] [Accepted: 12/12/2017] [Indexed: 11/28/2022]
Abstract
This paper presents a comparison between several multi-word term aggregation methods of distributional context vectors applied to the task of semantic similarity and relatedness in the biomedical domain. We compare the multi-word term aggregation methods of summation of component word vectors, mean of component word vectors, direct construction of compound term vectors using the compoundify tool, and direct construction of concept vectors using the MetaMap tool. Dimensionality reduction is critical when constructing high quality distributional context vectors, so these baseline co-occurrence vectors are compared against dimensionality reduced vectors created using singular value decomposition (SVD), and word2vec word embeddings using continuous bag of words (CBOW) and skip-gram models. We also find optimal vector dimensionalities for the vectors produced by these techniques. Our results show that none of the tested multi-word term aggregation methods is statistically significantly better than any other. This allows flexibility when choosing a multi-word term aggregation method, and means expensive corpora preprocessing may be avoided. Results are shown with several standard evaluation datasets, and state of the art results are achieved.
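The two simplest aggregation methods named in this abstract, summation and mean of component word vectors, can be sketched in a few lines. The three-dimensional vectors below are invented toy values, not real embeddings, and the helper names are hypothetical.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equally sized word vectors."""
    return [sum(components) for components in zip(*vectors)]


def mean_vectors(vectors):
    """Element-wise mean of equally sized word vectors."""
    n = len(vectors)
    return [s / n for s in sum_vectors(vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy word vectors (assumed values for illustration only).
word_vectors = {
    "heart": [1.0, 0.0, 2.0],
    "attack": [0.0, 1.0, 1.0],
    "infarction": [1.0, 1.0, 3.0],
}

parts = [word_vectors["heart"], word_vectors["attack"]]
term_sum = sum_vectors(parts)     # vector for "heart attack" by summation
term_mean = mean_vectors(parts)   # vector for "heart attack" by averaging

sim_sum = cosine(term_sum, word_vectors["infarction"])
sim_mean = cosine(term_mean, word_vectors["infarction"])
```

Because cosine similarity ignores vector length, the sum and the mean of the same component vectors give identical cosine scores. The two methods can only differ when a length-sensitive comparison is used.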
Affiliation(s)
- Sam Henry
  - Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA
- Clint Cuffy
  - Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA
- Bridget T McInnes
  - Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA
6. Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
7. Raja K, Patrick M, Gao Y, Madu D, Yang Y, Tsoi LC. A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries. Int J Genomics 2017; 2017:6213474. [PMID: 28331849 PMCID: PMC5346376 DOI: 10.1155/2017/6213474] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 10/07/2016] [Accepted: 02/09/2017] [Indexed: 12/13/2022]
Abstract
In the past decade, the volume of "omics" data generated by the different high-throughput technologies has expanded exponentially. The managing, storing, and analyzing of this big data have been a great challenge for researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of the high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyses by providing independent information to interpret and provide biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss the recent advancement in approaches that integrate results from omics data and information generated from text mining approaches to uncover novel biomedical information.
Affiliation(s)
- Kalpana Raja
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Matthew Patrick
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Yilin Gao
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Desmond Madu
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Yuyang Yang
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Lam C. Tsoi
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
8. Lucini FR, Fogliatto FS, da Silveira GJC, Neyeloff JL, Anzanello MJ, Kuchenbecker RS, Schaan BD. Text mining approach to predict hospital admissions using early medical records from the emergency department. Int J Med Inform 2017; 100:1-8. [PMID: 28241931 DOI: 10.1016/j.ijmedinf.2017.01.001] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Received: 07/15/2016] [Revised: 10/31/2016] [Accepted: 01/03/2017] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. DESIGN We try different approaches for pre-processing text records and predicting hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ² and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (linear kernel) and Nu-Support Vector Machine (linear kernel). MEASUREMENTS Prediction performance is evaluated by F1-scores. Precision and Recall values are also reported for all text mining methods tested. RESULTS Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. CONCLUSIONS The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams.
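The three set-of-words representations mentioned in this abstract (binary, term frequency, and term frequency-inverse document frequency over unigrams, bigrams and trigrams) can be sketched with the standard library alone. The record texts are invented, whitespace tokenization is a simplification, and `log(N / df)` is only one common IDF variant; none of this is taken from the paper itself.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """All unigrams, bigrams and trigrams (as strings) from a token list."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def represent(records):
    """Return binary, TF and TF-IDF feature dictionaries, one per record."""
    counts = [Counter(ngrams(r.lower().split())) for r in records]
    doc_freq = Counter(g for c in counts for g in c)   # records containing each n-gram
    n_docs = len(records)
    binary = [{g: 1 for g in c} for c in counts]
    tf = [dict(c) for c in counts]
    tfidf = [{g: f * math.log(n_docs / doc_freq[g]) for g, f in c.items()}
             for c in counts]
    return binary, tf, tfidf


# Two hypothetical free-text ED notes.
records = ["chest pain on arrival", "mild chest pain"]
binary, tf, tfidf = represent(records)
```

With this IDF variant, an n-gram that appears in every record (here "chest pain") gets weight zero, while an n-gram unique to one record keeps a positive weight.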
Affiliation(s)
- Filipe R Lucini
  - Industrial Engineering Department, Federal University of Rio Grande do Sul. Av. Osvaldo Aranha, 99, 5° Andar, 90035-190 Porto Alegre, RS, Brazil
- Flavio S Fogliatto
  - Industrial Engineering Department, Federal University of Rio Grande do Sul. Av. Osvaldo Aranha, 99, 5° Andar, 90035-190 Porto Alegre, RS, Brazil
- Giovani J C da Silveira
  - Haskayne School of Business, University of Calgary, 2500 University Dr NW, T2N 1N4 Calgary, AB, Canada
- Jeruza L Neyeloff
  - Hospital de Clínicas de Porto Alegre, Federal University of Rio Grande do Sul. Rua Ramiro Barcelos, 2350, 90035-903 Porto Alegre, RS, Brazil
- Michel J Anzanello
  - Industrial Engineering Department, Federal University of Rio Grande do Sul. Av. Osvaldo Aranha, 99, 5° Andar, 90035-190 Porto Alegre, RS, Brazil
- Ricardo S Kuchenbecker
  - Hospital de Clínicas de Porto Alegre, Federal University of Rio Grande do Sul. Rua Ramiro Barcelos, 2350, 90035-903 Porto Alegre, RS, Brazil
- Beatriz D Schaan
  - Hospital de Clínicas de Porto Alegre, Federal University of Rio Grande do Sul. Rua Ramiro Barcelos, 2350, 90035-903 Porto Alegre, RS, Brazil
9. Yu Z, Wallace BC, Johnson T, Cohen T. Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness. Stud Health Technol Inform 2017; 245:657-661. [PMID: 29295178 PMCID: PMC6464117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
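The general idea of retrofitting, pulling the vectors of linked concepts toward each other while staying close to the original distributional vectors, can be sketched with a commonly used iterative averaging update. The abstract does not give the exact update rule used in the paper, so the rule, the weights `alpha` and `beta`, the concept names, and the toy vectors below are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrofit(vectors, links, iterations=10, alpha=1.0):
    """Iteratively move each vector toward the average of its linked neighbours.

    vectors: {concept: original vector}; left unmodified.
    links:   {concept: [linked concepts]} taken from a knowledge resource.
    alpha:   weight on the original vector (assumed value).
    """
    new = {c: list(v) for c, v in vectors.items()}
    for _ in range(iterations):
        for concept, neighbours in links.items():
            nbrs = [n for n in neighbours if n in new]
            if concept not in new or not nbrs:
                continue
            beta = 1.0 / len(nbrs)            # equal weight per neighbour (assumed)
            new[concept] = [
                (alpha * vectors[concept][d] + beta * sum(new[n][d] for n in nbrs))
                / (alpha + beta * len(nbrs))
                for d in range(len(vectors[concept]))
            ]
    return new


# Two hypothetical linked concepts whose original vectors are orthogonal.
before = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0]}
links = {"concept_a": ["concept_b"], "concept_b": ["concept_a"]}
after = retrofit(before, links)
```

In this toy case the cosine similarity between the two linked concepts rises from zero to a positive value, which matches the stated goal that similarity between linked concepts is augmented.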
Affiliation(s)
- Zhiguo Yu
  - The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Byron C. Wallace
  - College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA
- Todd Johnson
  - The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Trevor Cohen
  - The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
10.
Abstract
With the rapid development of the Internet, more and more users utilize health communities (known as forums) to find health-related information, share their medical stories and experiences, or interact with other people in the communities. In this paper, we propose a framework to analyze the user-generated contents in a health community. The proposed framework contains three phases. First, we extract medical terms, including conditions, symptoms, treatments, effectiveness and side effects to form a virtual document for each question in the community. Next, we modify Latent Dirichlet Allocation (LDA) by adding a weighted scheme, called conLDA, to cluster virtual documents with similar medical term distributions into a conditional topic (C-topic). Finally, we analyze the clustered C-topics by sentiment polarities, and physiological and psychological sentiment. The experiment results show that conLDA outperforms the original LDA, and can cluster relevant medical terms and relevant questions together. The C-topics clustered by conLDA are more thematic than those clustered by the original LDA. The results of sentiment analysis may provide a quick reference and valuable insights for patients, caregivers and doctors.
11. Ji Y, Ying H, Tran J, Dews P, Massanari RM. Integrating unified medical language system and association mining techniques into relevance feedback for biomedical literature search. BMC Bioinformatics 2016; 17 Suppl 9:264. [PMID: 27453982 PMCID: PMC4959361 DOI: 10.1186/s12859-016-1129-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
Background Finding highly relevant articles from biomedical databases is challenging not only because it is often difficult to accurately express a user’s underlying intention through keywords but also because a keyword-based query normally returns a long list of hits with many citations being unwanted by the user. This paper proposes a novel biomedical literature search system, called BiomedSearch, which supports complex queries and relevance feedback. Methods The system employed association mining techniques to build a k-profile representing a user’s relevance feedback. More specifically, we developed a weighted interest measure and an association mining algorithm to find the strength of association between a query and each concept in the article(s) selected by the user as feedback. The top concepts were utilized to form a k-profile used for the next-round search. BiomedSearch relies on Unified Medical Language System (UMLS) knowledge sources to map text files to standard biomedical concepts. It was designed to support queries with any levels of complexity. Results A prototype of BiomedSearch software was made and it was preliminarily evaluated using the Genomics data from TREC (Text Retrieval Conference) 2006 Genomics Track. Initial experiment results indicated that BiomedSearch increased the mean average precision (MAP) for a set of queries. Conclusions With UMLS and association mining techniques, BiomedSearch can effectively utilize users’ relevance feedback to improve the performance of biomedical literature search.
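Mean average precision (MAP), the metric reported in this abstract, can be computed as sketched below. The function names, document identifiers, and relevance judgments are hypothetical; this uses the usual definition of average precision, where precision values at relevant ranks are summed and divided by the number of relevant documents.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank      # precision at this relevant rank
    return total / len(relevant)      # unretrieved relevant documents count as zero


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their ranked results and relevance judgments.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                     # AP = 1/2
]
map_score = mean_average_precision(runs)
```

A feedback round that moves relevant documents higher in each ranking raises the per-query average precision and therefore the MAP.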
Affiliation(s)
- Yanqing Ji
  - Department of Electrical and Computer Engineering, Gonzaga University, Spokane, WA, USA
- Hao Ying
  - Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI, USA
- John Tran
  - Frontier Behavioral Health, Spokane, WA, USA
- Peter Dews
  - Department of Medicine, St. Mary Mercy Hospital, Livonia, MI, USA
12. Denny JC, Spickard A, Speltz PJ, Porier R, Rosenstiel DE, Powers JS. Using natural language processing to provide personalized learning opportunities from trainee clinical notes. J Biomed Inform 2015; 56:292-9. [PMID: 26070431 DOI: 10.1016/j.jbi.2015.06.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/13/2014] [Revised: 06/01/2015] [Accepted: 06/03/2015] [Indexed: 12/20/2022]
Abstract
OBJECTIVE Assessment of medical trainee learning through pre-defined competencies is now commonplace in schools of medicine. We describe a novel electronic advisor system using natural language processing (NLP) to identify two geriatric medicine competencies from medical student clinical notes in the electronic medical record: advance directives (AD) and altered mental status (AMS). MATERIALS AND METHODS Clinical notes from third year medical students were processed using a general-purpose NLP system to identify biomedical concepts and their section context. The system analyzed these notes for relevance to AD or AMS and generated custom email alerts to students with embedded supplemental learning material customized to their notes. Recall and precision of the two advisors were evaluated by physician review. Students were given pre and post multiple choice question tests broadly covering geriatrics. RESULTS Of 102 students approached, 66 students consented and enrolled. The system sent 393 email alerts to 54 students (82%), including 270 for AD and 123 for AMS. Precision was 100% for AD and 93% for AMS. Recall was 69% for AD and 100% for AMS. Students mentioned ADs for 43 patients, with all mentions occurring after first having received an AD reminder. Students accessed educational links 34 times from the 393 email alerts. There was no difference in pre (mean 62%) and post (mean 60%) test scores. CONCLUSIONS The system effectively identified two educational opportunities using NLP applied to clinical notes and demonstrated a small change in student behavior. Use of electronic advisors such as these may provide a scalable model to assess specific competency elements and deliver educational opportunities.
Affiliation(s)
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States
- Anderson Spickard
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States
- Peter J Speltz
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
- Renee Porier
  - The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States
  - Office of Health Sciences Education, Vanderbilt University School of Medicine, Nashville, TN, United States
- Donna E Rosenstiel
  - The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States
- James S Powers
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States
  - The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States
  - The Meharry Consortium Geriatric Education Center, Meharry Medical Center, Nashville, TN, United States
  - The Tennessee Valley Geriatric Research Education and Clinical Center, Tennessee Valley Healthcare System, Nashville, TN, United States
13. Eikvil L, Jenssen TK, Holden M. Multi-focus cluster labeling. J Biomed Inform 2015; 55:116-23. [PMID: 25869415 DOI: 10.1016/j.jbi.2015.03.012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/18/2014] [Revised: 03/19/2015] [Accepted: 03/30/2015] [Indexed: 11/24/2022]
Abstract
Document collections resulting from searches in the biomedical literature, for instance, in PubMed, are often so large that some organization of the returned information is necessary. Clustering is an efficient tool for organizing search results. To help the user decide how to continue the search for relevant documents, the content of each cluster can be characterized by a set of representative keywords or cluster labels. As different users may have different interests, it can be desirable to have solutions that make it possible to produce labels from a selection of different topical categories. We therefore introduce the concept of multi-focus cluster labeling to give users the possibility to get an overview of the contents through labels from multiple viewpoints. The concept for multi-focus cluster labeling has been established and has been demonstrated on three different document collections. We illustrate that multi-focus visualizations can give an overview of clusters along axes that general labels are not able to convey. The approach is generic and should be applicable to any biomedical (or other) domain with any selection of foci where appropriate focus vocabularies can be established. A user evaluation also indicates that such a multi-focus concept is useful.
Affiliation(s)
- Line Eikvil
  - Norwegian Computing Center, P.O. Box 114 Blindern, NO-0314 Oslo, Norway
- Marit Holden
  - Norwegian Computing Center, P.O. Box 114 Blindern, NO-0314 Oslo, Norway
14. Shao W, Adams CE, Cohen AM, Davis JM, McDonagh MS, Thakurta S, Yu PS, Smalheiser NR. Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial. Methods 2015; 74:65-70. [PMID: 25461812 PMCID: PMC4339517 DOI: 10.1016/j.ymeth.2014.11.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 10/31/2013] [Revised: 10/27/2014] [Accepted: 11/05/2014] [Indexed: 11/27/2022]
Abstract
OBJECTIVE It is important to identify separate publications that report outcomes from the same underlying clinical trial, in order to avoid over-counting these as independent pieces of evidence. METHODS We created positive and negative training sets (comprised of pairs of articles reporting on the same condition and intervention) that were, or were not, linked to the same clinicaltrials.gov trial registry number. Features were extracted from MEDLINE and PubMed metadata; pairwise similarity scores were modeled using logistic regression. RESULTS Article pairs from the same trial were identified with high accuracy (F1 score=0.843). We also created a clustering tool, Aggregator, that takes as input a PubMed user query for RCTs on a given topic, and returns article clusters predicted to arise from the same clinical trial. DISCUSSION Although painstaking examination of full-text may be needed to be conclusive, metadata are surprisingly accurate in predicting when two articles derive from the same underlying clinical trial.
Affiliation(s)
- Weixiang Shao
- Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60612, USA
- Clive E Adams
- Division of Psychiatry, University of Nottingham, Nottingham, UK
- Aaron M Cohen
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
- John M Davis
- Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612, USA
- Marian S McDonagh
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
- Sujata Thakurta
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
- Philip S Yu
- Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60612, USA
- Neil R Smalheiser
- Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612, USA.
|
15
|
Evaluating semantic similarity and relatedness over the semantic grouping of clinical term pairs. J Biomed Inform 2014; 54:329-36. [PMID: 25523466 DOI: 10.1016/j.jbi.2014.11.014] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 06/04/2014] [Revised: 11/11/2014] [Accepted: 11/13/2014] [Indexed: 11/20/2022]
Abstract
INTRODUCTION This article explores how measures of semantic similarity and relatedness are impacted by the semantic groups to which the concepts they are measuring belong. Our goal is to determine if there are distinctions between homogeneous comparisons (where both concepts belong to the same group) and heterogeneous ones (where the concepts are in different groups). Our hypothesis is that the similarity measures will be significantly affected since they rely on hierarchical is-a relations, whereas relatedness measures should be less impacted since they utilize a wider range of relations. In addition, we also evaluate the effect of combining different measures of similarity and relatedness. Our hypothesis is that these combined measures will more closely correlate with human judgment, since they better reflect the rich variety of information humans use when assessing similarity and relatedness. METHOD We evaluate our method on four reference standards. Three of the reference standards were annotated by human judges for relatedness and one was annotated for similarity. RESULTS We found significant differences in the correlation of semantic similarity and relatedness measures with human judgment, depending on which semantic groups were involved. We also found that combining a definition based relatedness measure with an information content similarity measure resulted in significant improvements in correlation over individual measures. AVAILABILITY The semantic similarity and relatedness package is an open source program available from http://umls-similarity.sourceforge.net/. The reference standards are available at http://www.people.vcu.edu/~btmcinnes/downloads.html.
|
16
|
McInnes BT, Pedersen T, Liu Y, Melton GB, Pakhomov SV. U-path: An undirected path-based measure of semantic similarity. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2014; 2014:882-891. [PMID: 25954395 PMCID: PMC4419983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
In this paper, we present the results of a method using undirected paths to determine the degree of semantic similarity between two concepts in a dense taxonomy with multiple inheritance. The overall objective of this work was to explore methods that take advantage of dense multi-hierarchical taxonomies that are more graph-like than tree-like by incorporating the proximity of concepts with respect to each other within the entire is-a hierarchy. Our hypothesis is that the proximity of the concepts, regardless of how they are connected, is an indicator of the degree of their similarity. We evaluate our method using the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and four reference standards that have been manually tagged by human annotators. The overall results of our experiments show that, in SNOMED CT, the location of the concepts with respect to each other does indicate the degree to which they are similar.
Affiliation(s)
- Ying Liu
- The Advisory Board Company, San Francisco, CA
|
17
|
Moosavinasab S, Rastegar-Mojarad M, Liu H, Jonnalagadda SR. Towards Transforming Expert-based Content to Evidence-based Content. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2014; 2014:83-90. [PMID: 25954582 PMCID: PMC4419763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The goal of this paper is to find relevant citations for clinicians' written content and make it more reliable by adding scientific articles as references and enabling the clinicians to easily update it using new information. The proposed approach uses information retrieval and ranking techniques to extract and rank relevant citations from MEDLINE for any given sentence. Additionally, this system extracts snippets of relevant content from ranked citations. We assessed our approach on 4,697 MEDLINE papers and their corresponding full-text on the subject of Heart Failure. We implemented multi-level and weight ranking algorithms to rank the citations. We demonstrate that using journal relevance and study design type improves results obtained from only using content similarity by approximately 40%. We also show that using full-text, rather than abstract text, leads to extracting higher quality snippets.
Affiliation(s)
- Soheil Moosavinasab
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN; University of Wisconsin-Milwaukee, Milwaukee, WI
- Majid Rastegar-Mojarad
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN; University of Wisconsin-Milwaukee, Milwaukee, WI
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Siddhartha R. Jonnalagadda
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN; Northwestern University Feinberg School of Medicine, Chicago, IL
|
18
|
Raghupathi V, Raghupathi W. An Unstructured Information Management Architecture Approach to Text Analytics of Cancer Blogs. INTERNATIONAL JOURNAL OF HEALTHCARE INFORMATION SYSTEMS AND INFORMATICS 2014. [DOI: 10.4018/ijhisi.2014040102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
In this research the authors explore the potential of the Unstructured Information Management Architecture (UIMA) platform in text analytics of cancer blogs. The application is developed using the UIMA open source platform. They use the text analytics methods of categorization, clustering, taxonomic classification, and others to identify and analyze the patterns in cancer blog postings. The authors establish a comprehensive UIMA methodology for developing text analytics applications for the analysis of cancer blogs. Additional insights are extracted through the development of categories or keywords contained in the blogs, the development of a taxonomy and the examination of relationships among the categories. The application has the potential for generalizability and implementation with health content in other blogs and social media. It has the potential to provide insight and decision support for cancer management and to facilitate the efficient and relevant search for information on cancer.
Affiliation(s)
- Viju Raghupathi
- Brooklyn College, City University of New York, Brooklyn, New York, USA
|
19
|
Difficulties and Challenges Associated with Literature Searches in Operating Room Management, Complete with Recommendations. Anesth Analg 2013; 117:1460-79. [DOI: 10.1213/ane.0b013e3182a6d33b] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Indexed: 01/15/2023]
|
20
|
Demner-Fushman D, Mork JG, Aronson AR. Mining MEDLINE for problems associated with vitamin D. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2013; 2013:300-308. [PMID: 24551339 PMCID: PMC3900180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
This paper presents a two-step approach to generating comprehensive abstractive overviews for biomedical topics. It starts with a sensitivity-maximizing search of MEDLINE/PubMed and MeSH-based filtering of the results that are then processed using NLP methods to extract relations between entities of interest. We evaluate this approach in a case study based on the IOM report on the role of vitamin D in human health. The report defines disorders that serve as health indicators for the role of vitamin D. We evaluate the abstractive overviews generated using MeSH indexing and the extracted relations using the disorders listed in the IOM report as reference standard. We conclude that MeSH-based aggregation and filtering of the results is a useful and easy step in the generation of abstractive overviews. Although our relation extraction achieved 83.6% recall and 92.8% precision, only half of the disorders of interest participated in these relations.
Affiliation(s)
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, DHHS, Bethesda, MD
- James G Mork
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, DHHS, Bethesda, MD
- Alan R Aronson
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, DHHS, Bethesda, MD
|
21
|
Health-related hot topic detection in online communities using text clustering. PLoS One 2013; 8:e56221. [PMID: 23457530 PMCID: PMC3574139 DOI: 10.1371/journal.pone.0056221] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 09/12/2012] [Accepted: 01/07/2013] [Indexed: 11/24/2022] Open
Abstract
Recently, health-related social media services, especially online health communities, have rapidly emerged. Patients with various health conditions participate in online health communities to share their experiences and exchange healthcare knowledge. Exploring hot topics in online health communities helps us better understand patients’ needs and interest in health-related knowledge. However, the statistical topic analysis employed in previous studies is becoming impractical for processing the rapidly increasing amount of online data. Automatic topic detection based on document clustering is an alternative approach for extracting health-related hot topics in online communities. In addition to the keyword-based features used in traditional text clustering, we integrate medical domain-specific features to represent the messages posted in online health communities. Three disease discussion boards, including boards devoted to lung cancer, breast cancer and diabetes, from an online health community are used to test the effectiveness of topic detection. Experiment results demonstrate that health-related hot topics primarily include symptoms, examinations, drugs, procedures and complications. Further analysis reveals that there also exist some significant differences among the hot topics discussed on different types of disease discussion boards.
|
22
|
|
23
|
Shash SF, Mollá D. Clustering of Medical Publications for Evidence Based Medicine Summarisation. Artif Intell Med 2013. [DOI: 10.1007/978-3-642-38326-7_42] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
24
|
Abstract
Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.
Affiliation(s)
- Rui Xu
- Industrial Artificial Intelligence Laboratory, GE Global Research Center, Niskayuna, NY 12309, USA.
|
25
|
Nourbakhsh E, Nugent R, Wang H, Cevik C, Nugent K. Medical literature searches: a comparison of PubMed and Google Scholar. Health Info Libr J 2012; 29:214-22. [PMID: 22925384 DOI: 10.1111/j.1471-1842.2012.00992.x] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.3] [Received: 12/08/2010] [Accepted: 05/03/2012] [Indexed: 11/27/2022]
Abstract
BACKGROUND Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown. OBJECTIVES We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality. METHODS Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals' impact factors. RESULTS Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036). CONCLUSIONS PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.
Affiliation(s)
- Eva Nourbakhsh
- Department of Internal Medicine, Texas Tech University Health Sciences Center, Lubbock, TX, USA
|
26
|
Chen AT. Exploring online support spaces: using cluster analysis to examine breast cancer, diabetes and fibromyalgia support groups. PATIENT EDUCATION AND COUNSELING 2012; 87:250-257. [PMID: 21930359 DOI: 10.1016/j.pec.2011.08.017] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 01/31/2011] [Revised: 08/10/2011] [Accepted: 08/26/2011] [Indexed: 05/31/2023]
Abstract
OBJECTIVE This study sought to characterize and compare online discussion forums for three conditions: breast cancer, type 1 diabetes and fibromyalgia. Though there has been considerable work examining online support groups, few studies have considered differences in discussion content between health conditions. In addition, in contrast to the extant literature, this study sought to employ a semi-automated approach to examine health-related online communities. METHODS Online discussion content for the three conditions was compiled, pre-processed, and clustered at the thread level using the bisecting k-means algorithm. RESULTS Though the clusters for each condition differed, the clusters fell into a set of common categories: Generic, Support, Patient-Centered, Experiential Knowledge, Treatments/Procedures, Medications, and Condition Management. CONCLUSION The cluster analyses facilitate an increased understanding of various aspects of patient experience, including significant emotional and temporal aspects of the illness experience. PRACTICE IMPLICATIONS The clusters highlighted the changing nature of patients' information needs. Information provided to patients should be tailored to address their needs at various points during their illness. In addition, cluster analysis may be integrated into online support groups or other types of online interventions to assist patients in finding information.
Affiliation(s)
- Annie T Chen
- School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3360, USA.
|
27
|
Workman TE, Hurdle JF. Dynamic summarization of bibliographic-based data. BMC Med Inform Decis Mak 2011; 11:6. [PMID: 21284871 PMCID: PMC3042900 DOI: 10.1186/1472-6947-11-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 10/13/2010] [Accepted: 02/01/2011] [Indexed: 11/15/2022] Open
Abstract
Background Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural Language Processing applications strive to extract salient content from the excessive data. Semantic MEDLINE, a National Library of Medicine (NLM) natural language processing application, highlights relevant information in PubMed data. However, Semantic MEDLINE implements manually coded schemas, accommodating few information needs. Currently, there are only five such schemas, while many more would be needed to realistically accommodate all potential users. The aim of this project was to develop and evaluate a statistical algorithm that automatically identifies relevant bibliographic data; the new algorithm could be incorporated into a dynamic schema to accommodate various information needs in Semantic MEDLINE, and eliminate the need for multiple schemas. Methods We developed a flexible algorithm named Combo that combines three statistical metrics, the Kullback-Leibler Divergence (KLD), Riloff's RlogF metric (RlogF), and a new metric called PredScal, to automatically identify salient data in bibliographic text. We downloaded citations from a PubMed search query addressing the genetic etiology of bladder cancer. The citations were processed with SemRep, an NLM rule-based application that produces semantic predications. SemRep output was processed by Combo, in addition to the standard Semantic MEDLINE genetics schema and independently by the two individual KLD and RlogF metrics. We evaluated each summarization method using an existing reference standard within the task-based context of genetic database curation. Results Combo asserted 74 genetic entities implicated in bladder cancer development, whereas the traditional schema asserted 10 genetic entities; the KLD and RlogF metrics individually asserted 77 and 69 genetic entities, respectively. Combo achieved 61% recall and 81% precision, with an F-score of 0.69. 
The traditional schema achieved 23% recall and 100% precision, with an F-score of 0.37. The KLD metric achieved 61% recall, 70% precision, with an F-score of 0.65. The RlogF metric achieved 61% recall, 72% precision, with an F-score of 0.66. Conclusions Semantic MEDLINE summarization using the new Combo algorithm outperformed a conventional summarization schema in a genetic database curation task. It potentially could streamline information acquisition for other needs without having to hand-build multiple saliency schemas.
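The F-scores reported in this abstract can be checked against the reported recall and precision. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f1_score` and the `reported` dictionary are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score), copied from the abstract above.
reported = {
    "Combo": (0.61, 0.81, 0.69),
    "Traditional schema": (0.23, 1.00, 0.37),
    "KLD": (0.61, 0.70, 0.65),
    "RlogF": (0.61, 0.72, 0.66),
}

for method, (recall, precision, f_reported) in reported.items():
    f_computed = f1_score(precision, recall)
    print(f"{method}: computed {f_computed:.3f}, reported {f_reported:.2f}")
```

Under this assumption the recomputed values agree with the reported ones to within about 0.01; small differences are expected because the recall and precision figures are themselves rounded.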
Affiliation(s)
- T Elizabeth Workman
- Department of Biomedical Informatics, University of Utah, HSEB 5775, Salt Lake City, UT, USA.
|
28
|
Mining and modeling linkage information from citation context for improving biomedical literature retrieval. Inf Process Manag 2011. [DOI: 10.1016/j.ipm.2010.03.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
|
29
|
Al Zamil MGH, Betin Can A. A model based on multi-features to enhance healthcare and medical document retrieval. Inform Health Soc Care 2010; 36:100-15. [DOI: 10.3109/17538157.2010.506252] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
|
30
|
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS. BMC Bioinformatics 2010; 11 Suppl 2:S6. [PMID: 20406504 PMCID: PMC3165966 DOI: 10.1186/1471-2105-11-s2-s6] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 12/02/2022] Open
Abstract
Background Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and the users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed. Results RefMed supports multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user’s feedback and efficiently processes the function to return relevant articles in real time.
|
31
|
McInnes BT, Pedersen T, Pakhomov SVS. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2009; 2009:431-435. [PMID: 20351894 PMCID: PMC2815481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
A number of computational measures for determining semantic similarity between pairs of biomedical concepts have been developed using various standards and programming platforms. In this paper, we introduce two new open-source frameworks based on the Unified Medical Language System (UMLS). These frameworks consist of the UMLS-Similarity and UMLS-Interface packages. UMLS-Interface provides path information about UMLS concepts. UMLS-Similarity calculates the semantic similarity between UMLS concepts using several previously developed measures and can be extended to include new measures. We validate the functionality of these frameworks by reproducing the results from previous work. Our frameworks constitute a significant contribution to the field of biomedical Natural Language Processing by providing a common development and testing platform for semantic similarity measures based on the UMLS.
|
32
|
Abstract
Meta-analyses are seen as representing the pinnacle of a hierarchy of evidence used to inform clinical practice. Therefore, the potential importance of differences in the rigor with which they are conducted and reported warrants consideration. In this review, we use standardized instruments to describe the scientific and reporting quality of meta-analyses of randomized controlled trials of the treatment of anxiety disorders. We also use traditional and novel metrics of article impact to assess the influence of meta-analyses across a range of research fields in the anxiety disorders. Overall, although the meta-analyses that we examined had some flaws, their quality of reporting was generally acceptable. Neither the scientific nor reporting quality of the meta-analyses was predicted by any of the impact metrics. The finding that treatment meta-analyses were cited less frequently than quantitative reviews of studies in current "hot spots" of research (ie, genetics, imaging) points to the multifactorial nature of citation patterns. A list of the meta-analyses included in this review is available on an evidence-based website of anxiety and trauma-related disorders.
|
33
|
Zheng HT, Borchert C, Kim HG. GOClonto: an ontological clustering approach for conceptualizing PubMed abstracts. J Biomed Inform 2009; 43:31-40. [PMID: 19635585 DOI: 10.1016/j.jbi.2009.07.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/31/2008] [Revised: 05/21/2009] [Accepted: 07/20/2009] [Indexed: 10/20/2022]
Abstract
Concurrent with progress in biomedical sciences, an overwhelming amount of textual knowledge is accumulating in the biomedical literature. PubMed is the most comprehensive database collecting and managing biomedical literature. To help researchers easily understand collections of PubMed abstracts, numerous clustering methods have been proposed to group similar abstracts based on their shared features. However, most of these methods do not explore the semantic relationships among groupings of documents, which could help better illuminate the groupings of PubMed abstracts. To address this issue, we proposed an ontological clustering method called GOClonto for conceptualizing PubMed abstracts. GOClonto uses latent semantic analysis (LSA) and gene ontology (GO) to identify key gene-related concepts and their relationships as well as allocate PubMed abstracts based on these key gene-related concepts. Based on two PubMed abstract collections, the experimental results show that GOClonto is able to identify key gene-related concepts and outperforms the STC (suffix tree clustering) algorithm, the Lingo algorithm, the Fuzzy Ants algorithm, and the clustering based TRS (tolerance rough set) algorithm. Moreover, the two ontologies generated by GOClonto show significant informative conceptual structures.
Affiliation(s)
- Hai-Tao Zheng
- Biomedical Knowledge Engineering Laboratory, BK21 College of Dentistry, Seoul National University, 28 Yeongeon-dong, Jongro-gu, Seoul 110-749, Republic of Korea
|
34
|
Fiszman M, Demner-Fushman D, Kilicoglu H, Rindflesch TC. Automatic summarization of MEDLINE citations for evidence-based medical treatment: a topic-oriented evaluation. J Biomed Inform 2008; 42:801-13. [PMID: 19022398 DOI: 10.1016/j.jbi.2008.10.002] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Received: 05/27/2008] [Revised: 09/30/2008] [Accepted: 10/15/2008] [Indexed: 11/18/2022]
Abstract
As the number of electronic biomedical textual resources increases, it becomes harder for physicians to find useful answers at the point of care. Information retrieval applications provide access to databases; however, little research has been done on using automatic summarization to help navigate the documents returned by these systems. After presenting a semantic abstraction automatic summarization system for MEDLINE citations, we concentrate on evaluating its ability to identify useful drug interventions for 53 diseases. The evaluation methodology uses existing sources of evidence-based medicine as surrogates for a physician-annotated reference standard. Mean average precision (MAP) and a clinical usefulness score developed for this study were computed as performance metrics. The automatic summarization system significantly outperformed the baseline in both metrics. The MAP gain was 0.17 (p<0.01) and the increase in the overall score of clinical usefulness was 0.39 (p<0.05).
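Mean average precision (MAP), one of the two metrics named in this abstract, has a standard definition that can be sketched briefly. The example below is illustrative only: the drug and topic names are hypothetical toy data, not taken from the study, and the function names are my own.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-topic average precision values."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: one (ranked output, reference set) pair per topic.
runs = [
    (["drugA", "drugB", "drugC"], {"drugA", "drugC"}),
    (["drugX", "drugY"], {"drugY"}),
]
print(round(mean_average_precision(runs), 4))
```

In the study, each topic would correspond to one of the 53 diseases and the reference set to the drug interventions found in the evidence-based sources; how ties and unretrieved items were handled there is not stated in the abstract.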
Affiliation(s)
- Marcelo Fiszman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bldg 38A, Rm B1N-28J, Bethesda, MD 20894, USA.
|
35
|
Wren JD, Wilkins D, Fuscoe JC, Bridges S, Winters-Hilt S, Gusev Y. Proceedings of the 2008 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference. BMC Bioinformatics 2008; 9 Suppl 9:S1. [PMID: 18793454 PMCID: PMC2537572 DOI: 10.1186/1471-2105-9-s9-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022] Open
Affiliation(s)
- Jonathan D Wren
- Arthritis and Immunology Research Program, Oklahoma Medical Research Foundation; 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA.
|
36
|
Lin J. PageRank without hyperlinks: reranking with PubMed related article networks for biomedical text retrieval. BMC Bioinformatics 2008; 9:270. [PMID: 18538027 PMCID: PMC2442104 DOI: 10.1186/1471-2105-9-270] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 03/04/2008] [Accepted: 06/06/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Graph analysis algorithms such as PageRank and HITS have been successful in Web environments because they are able to extract important inter-document relationships from manually-created hyperlinks. We consider the application of these techniques to biomedical text retrieval. In the current PubMed(R) search interface, a MEDLINE(R) citation is connected to a number of related citations, which are in turn connected to other citations. Thus, a MEDLINE record represents a node in a vast content-similarity network. This article explores the hypothesis that these networks can be exploited for text retrieval, in the same manner as hyperlink graphs on the Web. RESULTS We conducted a number of reranking experiments using the TREC 2005 genomics track test collection in which scores extracted from PageRank and HITS analysis were combined with scores returned by an off-the-shelf retrieval engine. Experiments demonstrate that incorporating PageRank scores yields significant improvements in terms of standard ranked-retrieval metrics. CONCLUSION The link structure of content-similarity networks can be exploited to improve the effectiveness of information retrieval systems. These results generalize the applicability of graph analysis algorithms to text retrieval in the biomedical domain.
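The reranking idea in this abstract can be illustrated with a small sketch: compute PageRank over a related-citation network, then combine it with retrieval-engine scores. The abstract does not say how the two scores were combined, so the linear interpolation, the `weight` parameter, the damping factor of 0.85, and the toy `pmid` identifiers below are all assumptions made for illustration, not the paper's method.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed 'related citation' graph."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def rerank(retrieval_scores: Dict[str, float], link_scores: Dict[str, float],
           weight: float = 0.8) -> List[str]:
    """Order documents by an assumed linear mix of retrieval and link scores."""
    combined = {
        doc: weight * score + (1.0 - weight) * link_scores.get(doc, 0.0)
        for doc, score in retrieval_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical toy network: each citation points to its related citations.
related = {
    "pmid1": ["pmid2", "pmid3"],
    "pmid2": ["pmid3"],
    "pmid3": ["pmid1"],
    "pmid4": ["pmid3"],
}
# Hypothetical retrieval-engine scores, assumed to lie in [0, 1].
retrieval = {"pmid1": 0.90, "pmid2": 0.50, "pmid3": 0.85, "pmid4": 0.60}

link_scores = pagerank(related)
print(rerank(retrieval, link_scores))
```

In practice the two score types live on different scales, so some normalization would be needed before mixing them; that step is omitted here.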
Affiliation(s)
- Jimmy Lin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, USA.
|