1
Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020; 48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023]
Abstract
Thanks to recent efforts by the text mining community, biocurators now have access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed to support a specific curation workflow, and allow very limited control over the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion mapped biomedical entities, and are updated daily.
Affiliation(s)
- Julien Gobeill
  - To whom correspondence should be addressed. Tel: +41 22 388 17 86; Fax: +41 22 546 97 38.
- Déborah Caucheteur
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Pierre-André Michel
  - SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
- Luc Mottin
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Emilie Pasche
  - SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Patrick Ruch
  - Correspondence may also be addressed to Patrick Ruch. Tel: +41 22 388 17 81; Fax: +41 22 546 97 38.
2
Garcia-Pelaez J, Rodriguez D, Medina-Molina R, Garcia-Rivas G, Jerjes-Sánchez C, Trevino V. PubTerm: a web tool for organizing, annotating and curating genes, diseases, molecules and other concepts from PubMed records. Database (Oxford) 2019; 2019:5280306. [PMID: 30624653 PMCID: PMC6323318 DOI: 10.1093/database/bay137] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 10/14/2018] [Accepted: 12/02/2018] [Indexed: 11/13/2022]
Abstract
Background and objective: Analysis, annotation and curation of biomedical scientific literature is a recurrent task in biomedical research, database curation and clinics. Commonly, the reading is centered on concepts such as genes, diseases or molecules. Database curators may also need to annotate published abstracts related to a specific topic. However, few free and intuitive tools exist to assist users in this context. Therefore, we developed PubTerm, a web tool to organize, categorize, curate and annotate a large number of PubMed abstracts related to biological entities such as genes, diseases, chemicals, species, sequence variants and other related information. Methods: A variety of interfaces were implemented to facilitate curation and annotation, including the organization of abstracts by terms, by the co-occurrence of terms or by specific phrases. Information includes statistics on the occurrence of terms. The abstracts, terms and other related information can be annotated and categorized using user-defined categories. The session information can be saved and restored, and the data can be exported to other formats. Results: The pipeline in PubTerm starts by specifying a PubMed query or list of PubMed identifiers. Then, the user can specify three lists of categories and specify what information will be highlighted in which colors. The user then utilizes the 'term view' to organize the abstracts by gene, disease, species or other information to facilitate the annotation and categorization of terms or abstracts. Other views also facilitate the exploration of abstracts and connections between terms. We have used PubTerm to quickly and efficiently curate collections of more than 400 abstracts that mention more than 350 genes to generate revised lists of susceptibility genes for diseases. An example is provided for pulmonary arterial hypertension.
Conclusions: PubTerm saves time in literature review by assisting with annotation, organization and knowledge acquisition.
Affiliation(s)
- José Garcia-Pelaez
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- David Rodriguez
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- Roberto Medina-Molina
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- Gerardo Garcia-Rivas
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México; Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
- Carlos Jerjes-Sánchez
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México; Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
- Victor Trevino
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
3
Ji Y, Ying H, Tran J, Dews P, Massanari RM. Integrating unified medical language system and association mining techniques into relevance feedback for biomedical literature search. BMC Bioinformatics 2016; 17 Suppl 9:264. [PMID: 27453982 PMCID: PMC4959361 DOI: 10.1186/s12859-016-1129-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
Background: Finding highly relevant articles from biomedical databases is challenging not only because it is often difficult to accurately express a user's underlying intention through keywords but also because a keyword-based query normally returns a long list of hits, many of which are unwanted by the user. This paper proposes a novel biomedical literature search system, called BiomedSearch, which supports complex queries and relevance feedback. Methods: The system employed association mining techniques to build a k-profile representing a user's relevance feedback. More specifically, we developed a weighted interest measure and an association mining algorithm to find the strength of association between a query and each concept in the article(s) selected by the user as feedback. The top concepts were utilized to form a k-profile used for the next-round search. BiomedSearch relies on Unified Medical Language System (UMLS) knowledge sources to map text files to standard biomedical concepts. It was designed to support queries of any level of complexity. Results: A prototype of the BiomedSearch software was built and preliminarily evaluated using the data from the TREC (Text Retrieval Conference) 2006 Genomics Track. Initial experimental results indicated that BiomedSearch increased the mean average precision (MAP) for a set of queries. Conclusions: With UMLS and association mining techniques, BiomedSearch can effectively utilize users' relevance feedback to improve the performance of biomedical literature search.
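The mean average precision (MAP) reported in the abstract above is a standard ranked-retrieval metric. The following is a minimal sketch of how it is usually computed; it is not code from BiomedSearch, and the function names and toy document identifiers are illustrative assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked result list.

    Precision is sampled at every rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Hypothetical toy data: two queries with their rankings and relevance judgments.
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
toy_map = mean_average_precision(toy_runs)
```

A feedback round would be judged helpful under this metric if the MAP of the rankings produced after feedback is higher than the MAP of the initial rankings over the same query set.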
Affiliation(s)
- Yanqing Ji
  - Department of Electrical and Computer Engineering, Gonzaga University, Spokane, WA, USA.
- Hao Ying
  - Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI, USA
- John Tran
  - Frontier Behavioral Health, Spokane, WA, USA
- Peter Dews
  - Department of Medicine, St. Mary Mercy Hospital, Livonia, MI, USA
4
Thompson P, Madan JC, Moore JH. Prediction of relevant biomedical documents: a human microbiome case study. BioData Min 2015; 8:28. [PMID: 26361503 PMCID: PMC4564977 DOI: 10.1186/s13040-015-0061-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2014] [Accepted: 08/26/2015] [Indexed: 11/13/2022]
Abstract
Background: Retrieving relevant biomedical literature has become increasingly difficult due to the large volume and rapid growth of biomedical publication. A query to a biomedical retrieval system often retrieves hundreds of results. Since the searcher will not likely consider all of these documents, ranking the documents is important. Ranking by recency, as PubMed does, takes into account only one factor indicating potential relevance. This study explores the use of the searcher's relevance feedback judgments to support relevance ranking based on features more general than recency. Results: It was found that the researcher's relevance judgments could be used to accurately predict the relevance of additional documents, both using tenfold cross-validation and by training on publications from 2008–2010 and testing on documents from 2011. Conclusions: This case study has shown the promise of relevance feedback for improving biomedical document retrieval. A researcher's judgments as to which initially retrieved documents are relevant, or not, can be leveraged to predict additional relevant documents.
Affiliation(s)
- Paul Thompson
  - Program in Linguistics, Dartmouth College, Hanover, NH 03755 USA
- Juliette C Madan
  - Department of Pediatrics, Division of Neonatology, Dartmouth-Hitchcock Medical Center, One Medical Center Drive, Lebanon, NH 03756 USA
- Jason H Moore
  - Institute for Biomedical Informatics, Departments of Genetics and Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, 3535 Market Street, Philadelphia, PA 19104 USA
5
6
Khare R, Leaman R, Lu Z. Accessing biomedical literature in the current information landscape. Methods Mol Biol 2014; 1159:11-31. [PMID: 24788259 PMCID: PMC4593617 DOI: 10.1007/978-1-4939-0709-0_2] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 01/22/2023]
Abstract
Biomedical and life sciences literature is unique because of its exponentially increasing volume and interdisciplinary nature. Biomedical literature access is essential for several types of users including biomedical researchers, clinicians, database curators, and bibliometricians. In the past few decades, several online search tools and literature archives, generic as well as biomedicine specific, have been developed. We present this chapter in the light of three consecutive steps of literature access: searching for citations, retrieving full text, and viewing the article. The first section presents the current state of practice of biomedical literature access, including an analysis of the most frequently used search tools, such as PubMed, Google Scholar, Web of Science, Scopus, and Embase, and a study on biomedical literature archives such as PubMed Central. The next section describes current research and the state-of-the-art systems motivated by the challenges a user faces during query formulation and interpretation of search results. The research solutions are classified into five key areas related to text and data mining: text similarity search, semantic search, query support, relevance ranking, and clustering results. Finally, the last section describes some predicted future trends for improving biomedical literature access, such as searching and reading articles on portable devices, and adoption of the open access policy.
Affiliation(s)
- Ritu Khare
  - National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003B, 8600 Rockville Pike, Bethesda, MD 20894
- Robert Leaman
  - National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003E, 8600 Rockville Pike, Bethesda, MD 20894
- Zhiyong Lu
  - National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003A, 8600 Rockville Pike, Bethesda, MD 20894
7
8
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Indexed: 12/24/2022]
9
Literature retrieval and mining in bioinformatics: state of the art and challenges. Adv Bioinformatics 2012; 2012:573846. [PMID: 22778730 PMCID: PMC3388278 DOI: 10.1155/2012/573846] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 11/22/2011] [Revised: 05/18/2012] [Accepted: 05/18/2012] [Indexed: 11/29/2022]
Abstract
The way the world communicates, acquires, and stores information has changed profoundly. Hundreds of millions of people perform information retrieval tasks on a daily basis, in particular when using a Web search engine or searching their e-mail, which has made information retrieval the dominant form of information access, overtaking traditional database-style searching. How to handle this huge amount of information has now become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we present a survey of the main works on literature retrieval and mining in bioinformatics. While arguing that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges in demonstrating the effectiveness of these approaches in that setting.
10
Sondhi P, Sun J, Zhai C, Sorrentino R, Kohn MS. Leveraging medical thesauri and physician feedback for improving medical literature retrieval for case queries. J Am Med Inform Assoc 2012; 19:851-8. [PMID: 22437075 DOI: 10.1136/amiajnl-2011-000293] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 01/23/2023]
Abstract
OBJECTIVE: This paper presents a study of methods for medical literature retrieval for case queries, in which the goal is to retrieve literature articles similar to a given patient case. In particular, it focuses on analyzing the performance of state-of-the-art general retrieval methods and improving them by the use of medical thesauri and physician feedback. MATERIALS AND METHODS: The Kullback-Leibler divergence retrieval model with Dirichlet smoothing is used as the state-of-the-art general retrieval method. Pseudorelevance feedback and term weighting methods are proposed by leveraging MeSH and UMLS thesauri. Evaluation is performed on a test collection recently created for the ImageCLEF medical case retrieval challenge. RESULTS: Experimental results show that a well-tuned state-of-the-art general retrieval model achieves a mean average precision of 0.2754, but the performance can be improved by over 40% to 0.3980, through the proposed methods. DISCUSSION: The results over the ImageCLEF test collection, which is currently the best collection available for the task, are encouraging. There are, however, limitations due to the small evaluation set size. The analysis shows that further refinement of the methods is necessary before they can be really useful in a clinical setting. CONCLUSION: Medical case-based literature retrieval is a critical search application that presents a number of unique challenges. This analysis shows that the state-of-the-art general retrieval models are reasonably good for the task, but the performance can be significantly improved by developing new task-specific retrieval models that incorporate medical thesauri and physician feedback.
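For readers unfamiliar with the baseline named in the abstract above, the sketch below scores one document with a Kullback-Leibler divergence retrieval model using Dirichlet smoothing. It assumes a maximum-likelihood query model, in which case ranking by negative KL divergence is equivalent to ranking by the cross-entropy term computed here. The function name, the default `mu`, and the toy data are illustrative assumptions, not the authors' implementation or tuned settings.

```python
import math
from collections import Counter
from typing import Mapping, Sequence


def dirichlet_kl_score(
    query_terms: Sequence[str],
    doc_terms: Sequence[str],
    collection_tf: Mapping[str, int],
    collection_len: int,
    mu: float = 2000.0,
) -> float:
    """Sum over query terms of p(t | query) * log p(t | document).

    p(t | document) is Dirichlet-smoothed with the collection language model:
        (tf(t, d) + mu * p(t | collection)) / (|d| + mu)
    Higher scores mean a smaller KL divergence from the query model.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    query_tf = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, qtf in query_tf.items():
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption of this sketch: terms unseen in the collection are skipped.
            continue
        p_doc = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += (qtf / query_len) * math.log(p_doc)
    return score


# Hypothetical toy collection of two tokenized documents.
doc_a = ["fever", "cough", "fever"]
doc_b = ["fracture", "bone", "cast"]
toy_collection_tf = Counter(doc_a + doc_b)
toy_collection_len = len(doc_a) + len(doc_b)
```

With this toy data, a query for `fever` scores `doc_a` above `doc_b`, because `doc_b` receives only the smoothed collection probability for that term.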
Affiliation(s)
- Parikshit Sondhi
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801-2302, USA.
11
Song MH, Park DK, Lee YH. Medical informatics methods for the clinical evidence extraction. J Korean Med Assoc 2012. [DOI: 10.5124/jkma.2012.55.8.741] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/06/2022]
Affiliation(s)
- Mi Hwa Song
  - U-Healthcare Institute, Gachon University, Incheon, Korea
- Dong Kyun Park
  - U-Healthcare Center, Gachon University Gil Hospital, Incheon, Korea
- Young Ho Lee
  - IT Department, Gachon University, Incheon, Korea
12
Yoo S, Choi J. Evaluation of Term Ranking Algorithms for Pseudo-Relevance Feedback in MEDLINE Retrieval. Healthc Inform Res 2011; 17:120-30. [PMID: 21886873 PMCID: PMC3155169 DOI: 10.4258/hir.2011.17.2.120] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 01/17/2011] [Accepted: 04/29/2011] [Indexed: 11/23/2022]
Abstract
OBJECTIVES: The purpose of this study was to investigate the effects of query expansion algorithms for MEDLINE retrieval within a pseudo-relevance feedback framework. METHODS: A number of query expansion algorithms were tested using various term ranking formulas, focusing on query expansion based on pseudo-relevance feedback. The OHSUMED test collection, which is a subset of the MEDLINE database, was used as a test corpus. Various ranking algorithms were tested in combination with different term re-weighting algorithms. RESULTS: Our comprehensive evaluation showed that the local context analysis ranking algorithm, when used in combination with one of the re-weighting algorithms (Rocchio, the probabilistic model, and our variants), significantly outperformed other algorithm combinations by up to 12% (paired t-test; p < 0.05). In a pseudo-relevance feedback framework, effective query expansion would be achieved by the careful consideration of term ranking and re-weighting algorithm pairs, at least in the context of the OHSUMED corpus. CONCLUSIONS: Comparative experiments on term ranking algorithms were performed in the context of a subset of MEDLINE documents. With medical documents, local context analysis, which uses co-occurrence with all query terms, significantly outperformed various term ranking methods based on both frequency and distribution analyses. Furthermore, the results of the experiments demonstrated that the term rank-based re-weighting method contributed to a remarkable improvement in mean average precision.
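To make the pseudo-relevance feedback setting in the abstract above concrete, here is a minimal sketch of Rocchio-style query expansion, in which the top-ranked documents are assumed to be relevant. It ranks candidate terms by their average normalized frequency in the feedback documents, which is a plain frequency-based ranking and not the local context analysis the study found best. The function name, the `alpha` and `beta` defaults, and the toy data are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence


def rocchio_expand(
    query_weights: Mapping[str, float],
    feedback_docs: Sequence[Sequence[str]],
    alpha: float = 1.0,
    beta: float = 0.75,
    n_expansion_terms: int = 5,
) -> Dict[str, float]:
    """Positive-feedback Rocchio: new_q = alpha * q + beta * centroid(feedback docs).

    Original query terms are always kept; only the n highest-weighted new
    terms from the centroid are added to the query.
    """
    docs = [doc for doc in feedback_docs if doc]  # ignore empty documents
    if not docs:
        return dict(query_weights)

    # Centroid of length-normalized term-frequency vectors.
    centroid: Dict[str, float] = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            centroid[term] = centroid.get(term, 0.0) + (count / len(doc)) / len(docs)

    # Re-weight the original query terms.
    expanded = {
        term: alpha * weight + beta * centroid.get(term, 0.0)
        for term, weight in query_weights.items()
    }

    # Rank candidate expansion terms by centroid weight (ties broken alphabetically).
    candidates = sorted(
        (term for term in centroid if term not in query_weights),
        key=lambda term: (-centroid[term], term),
    )
    for term in candidates[:n_expansion_terms]:
        expanded[term] = beta * centroid[term]
    return expanded


# Hypothetical toy example: two top-ranked documents treated as relevant.
toy_feedback = [["asthma", "inhaler", "steroid", "inhaler"], ["asthma", "inhaler"]]
toy_expanded = rocchio_expand({"asthma": 1.0}, toy_feedback, n_expansion_terms=1)
```

Swapping the `candidates` ranking for another term ranking formula while keeping the re-weighting step is the kind of algorithm pairing the study compares.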
Affiliation(s)
- Sooyoung Yoo
  - Medical Information Center, Seoul National University Bundang Hospital, Seongnam, Korea
13
Lu Z. PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford) 2011; 2011:baq036. [PMID: 21245076 PMCID: PMC3025693 DOI: 10.1093/database/baq036] [Citation(s) in RCA: 222] [Impact Index Per Article: 17.1] [Indexed: 11/23/2022]
Abstract
The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools for their needs and service providers and developers in keeping current in the field. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search
Affiliation(s)
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD 20894, USA.