1
Abdelmageed N, Löffler F, Feddoul L, Algergawy A, Samuel S, Gaikwad J, Kazem A, König-Ries B. BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain. Biodivers Data J 2022; 10:e89481. [PMID: 36761617 PMCID: PMC9836593 DOI: 10.3897/bdj.10.e89481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Accepted: 09/07/2022] [Indexed: 11/12/2022] Open
Abstract
Background
Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. With this, a large amount of textual data (publications) and metadata (e.g. dataset description) has been generated. To support the management and analysis of these data, two techniques from computer science are of interest, namely Named Entity Recognition (NER) and Relation Extraction (RE). While the former enables better content discovery and understanding, the latter fosters the analysis by detecting connections between entities and, thus, allows us to draw conclusions and answer relevant domain-specific questions. To automatically predict entities and their relations, machine/deep learning techniques could be used. The training and evaluation of those techniques require labelled corpora.
New information
In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE) generated from biodiversity datasets metadata and abstracts that can be used as evaluation benchmarks for the development of new computer-supported tools that require machine learning or deep learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets. Moreover, we demonstrate the underlying ontology for the classes and relations used to annotate such corpora.
Affiliation(s)
- Nora Abdelmageed: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany; Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany
- Felicitas Löffler: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Leila Feddoul: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Alsayed Algergawy: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Sheeba Samuel: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany; Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany
- Jitendra Gaikwad: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Anahita Kazem: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany; German Center for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
- Birgitta König-Ries: Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany; Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany; German Center for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
2
Kruesi L, Burstein F, Tanner K. A knowledge management system framework for an open biomedical repository: communities, collaboration and corroboration. Journal of Knowledge Management 2020. [DOI: 10.1108/jkm-05-2020-0370] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022]
Abstract
Purpose
The purpose of this study is to assess the opportunity for a distributed, networked open biomedical repository (OBR) using a knowledge management system (KMS) conceptual framework. An innovative KMS conceptual framework is proposed to guide the transition from a traditional, siloed approach to a sustainable OBR.
Design/methodology/approach
This paper reports on a cycle of action research, involving literature review, interviews and focus group with leaders in biomedical research, open science and librarianship, and an audit of elements needed for an Australasian OBR; these, along with an Australian KM standard, informed the resultant KMS framework.
Findings
The proposed KMS framework aligns the requirements for an OBR with the people, process, technology and content elements of the KM standard. It identifies and defines nine processes underpinning biomedical knowledge – discovery, creation, representation, classification, storage, retrieval, dissemination, transfer and translation. The results comprise an explanation of these processes and examples of the people, process, technology and content dimensions of each process. While the repository is an integral cog within the collaborative, distributed open science network, its effectiveness depends on understanding the relationships and linkages between system elements and achieving an appropriate balance between them.
Research limitations/implications
The current research has focused on biomedicine. This research builds on the worldwide effort to reduce barriers, in particular paywalls to health knowledge. The findings present an opportunity to rationalize and improve a KMS integral to biomedical knowledge.
Practical implications
Adoption of the KMS framework for a distributed, networked OBR will facilitate open science through reducing duplication of effort, removing barriers to the flow of knowledge and ensuring effective management of biomedical knowledge.
Social implications
Achieving quality, permanency and discoverability of a region’s digital assets is possible through ongoing usage of the framework for researchers, industry and consumers.
Originality/value
The framework demonstrates the dependencies and interplay of elements and processes to frame an OBR KMS.
3
Roberts K, Alam T, Bedrick S, Demner-Fushman D, Lo K, Soboroff I, Voorhees E, Wang LL, Hersh WR. TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J Am Med Inform Assoc 2020; 27:1431-1436. [PMID: 32365190 PMCID: PMC7239098 DOI: 10.1093/jamia/ocaa091] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 04/28/2020] [Accepted: 05/01/2020] [Indexed: 11/17/2022] Open
Abstract
TREC-COVID is an information retrieval (IR) shared task initiated to support clinicians and clinical research during the COVID-19 pandemic. IR for pandemics breaks many normal assumptions, which can be seen by examining 9 important basic IR research questions related to pandemic situations. TREC-COVID differs from traditional IR shared task evaluations with special considerations for the expected users, IR modality considerations, topic development, participant requirements, assessment process, relevance criteria, evaluation metrics, iteration process, projected timeline, and the implications of data use as a post-task test collection. This article describes how all these were addressed for the particular requirements of developing IR systems under a pandemic situation. Finally, initial participation numbers are also provided, which demonstrate the tremendous interest the IR community has in this effort.
Affiliation(s)
- Kirk Roberts: University of Texas Health Science Center at Houston, Houston, Texas, USA
- Tasmeer Alam: National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Steven Bedrick: Oregon Health & Science University, Portland, Oregon, USA
- Kyle Lo: Allen Institute for AI, Seattle, Washington, USA
- Ian Soboroff: National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Ellen Voorhees: National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Lucy Lu Wang: Allen Institute for AI, Seattle, Washington, USA
4
Anjaria KA. Computational implementation and formalism of FAIR data stewardship principles. Data Technologies and Applications 2020. [DOI: 10.1108/dta-09-2019-0164] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Purpose
The progress of life science and social science research is contingent on effective modes of data storage, data sharing and data reproducibility. In the present digital era, data storage and data sharing play a vital role. For productive data-centric tasks, findable, accessible, interoperable and reusable (FAIR) principles have been developed as a standard convention. However, FAIR principles have specific challenges from computational implementation perspectives. The purpose of this paper is to identify the challenges related to computational implementations of FAIR principles. After identification of challenges, this paper aims to solve the identified challenges.
Design/methodology/approach
This paper deploys a Petri net-based formal model and Petri net algebra to implement and analyze FAIR principles. The proposed Petri net-based model, theorems and corollaries may assist computer system architects in implementing and analyzing FAIR principles.
Findings
To demonstrate the use of the derived Petri net-based theorems and corollaries, two existing data stewardship platforms, FAIRDOM and Dataverse, have been analyzed in this paper. Moreover, a data stewardship model, “Datalection”, has been developed and discussed in the present paper. Datalection has been designed based on the Petri net-based theorems and corollaries.
Originality/value
This paper aims to bridge information science and life science using the formalism of data stewardship principles. This paper not only provides new dimensions to data stewardship but also systematically analyzes two existing data stewardship platforms, FAIRDOM and Dataverse.
5
A content-based literature recommendation system for datasets to improve data reusability - A case study on Gene Expression Omnibus (GEO) datasets. J Biomed Inform 2020; 104:103399. [PMID: 32151769 DOI: 10.1016/j.jbi.2020.103399] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/30/2019] [Revised: 02/26/2020] [Accepted: 03/01/2020] [Indexed: 02/02/2023]
Abstract
Objective
The centrality of data to biomedical research is difficult to overstate, and the same is true for the importance of the biomedical literature in disseminating empirical findings to scientific questions made on such data. But the connections between the literature and related datasets are often weak, hampering the ability of scientists to easily move between existing datasets and existing findings to derive new scientific hypotheses. This work aims to recommend relevant literature articles for datasets with the ultimate goal of increasing the productivity of researchers. Our approach to literature recommendation for datasets is a part of the dataset reusability platform developed at the University of Texas Health Science Center at Houston for datasets related to gene expression. This platform incorporates datasets from Gene Expression Omnibus (GEO). An average of 34 datasets were added to GEO daily in the last five years (i.e. 2014 to 2018), demonstrating the need for automatic methods to connect these datasets with relevant literature. The relevant literature for a given dataset may describe that dataset, provide a scientific finding based on that dataset, or even describe prior and related work to the dataset's topic that is of interest to users of the dataset.
Materials and methods
We adopt an information retrieval paradigm for literature recommendation. In our experiments, distributional semantic features are created from the title and abstract of MEDLINE articles. Then, related articles are identified for datasets in GEO. We evaluate multiple distributional methods such as TF-IDF, BM25, Latent Semantic Analysis, Latent Dirichlet Allocation, word2vec, and doc2vec. Top similar papers are recommended for each dataset using cosine similarity between the dataset's vector representation and every paper's vector representation. We also propose several novel re-ranking and normalization methods over embeddings to improve the recommendations.
Results
The top-performing literature recommendation technique achieved a strict precision at 10 of 0.8333 and a partial precision at 10 of 0.9000 using BM25 based on a manual evaluation of 36 datasets. Evaluation on a larger, automatically-collected benchmark shows small but consistent gains by emphasizing the similarity of dataset and article titles.
Conclusion
This work is the first step toward developing a literature recommendation tool by recommending relevant literature for datasets. This will hopefully lead to better data reuse experience.
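The cosine-similarity ranking described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses plain TF-IDF (only one of the several representations the abstract lists), a whitespace tokenizer, and a smoothed IDF formula, all of which are assumptions made here for illustration. The function names `tfidf_vectors`, `cosine` and `recommend` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of texts."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of texts containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(dataset_text, articles, top_k=10):
    """Return (article_index, score) pairs, most similar first."""
    vectors = tfidf_vectors([dataset_text] + articles)
    query, article_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query, v), i) for i, v in enumerate(article_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]


articles = [
    "gene expression profiling in breast cancer",
    "soil moisture sensors for agriculture",
    "breast cancer gene expression microarray",
]
ranked = recommend("breast cancer gene expression", articles)
```

In this toy example the third article ranks first because it shares all four query terms and carries fewer unrelated terms than the first article.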
6
Patra BG, Roberts K, Wu H. A content-based dataset recommendation system for researchers - a case study on Gene Expression Omnibus (GEO) repository. Database: The Journal of Biological Databases and Curation 2020; 2020:1. [PMID: 33002137 PMCID: PMC7659921 DOI: 10.1093/database/baaa064] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/24/2020] [Revised: 07/19/2020] [Accepted: 07/27/2020] [Indexed: 11/13/2022]
Abstract
It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. The maximum normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (p@10) partial and p@10 strict of 0.89, 0.78 and 0.61, respectively, were obtained using the proposed method after manual evaluation by five researchers. To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, offset the researchers' workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/.
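For readers unfamiliar with the NDCG@10 figure reported above, the following Python sketch shows one common way to compute normalized discounted cumulative gain from graded relevance judgments. It is a sketch under stated assumptions, not the evaluation code used in the study: the gain is taken to be the raw grade, the discount is log2(rank + 1), and the ideal ranking is derived from the judged list itself. The grading scale (0, 1, 2) in the example is hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded judgments."""
    # Assumption: linear gain, discount of log2(rank + 1) with rank from 1.
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=10):
    """DCG normalized by the DCG of the best possible ordering."""
    # Assumption: the ideal ordering is built from the same judged items.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(gains, k) / ideal


# Hypothetical grades for one recommendation list, in ranked order:
# 2 = relevant, 1 = partially relevant, 0 = not relevant.
perfect_order = ndcg_at_k([2, 1, 0], k=3)
reversed_order = ndcg_at_k([0, 1, 2], k=3)
```

A list already sorted by grade scores 1.0, and any other ordering of the same grades scores lower, which is what makes the metric sensitive to rank position.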
Affiliation(s)
- Braja Gopal Patra: Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA
- Kirk Roberts: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Suite 600, Houston, TX, 77030, USA
- Hulin Wu: Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Suite 600, Houston, TX, 77030, USA
7
Chen X, Gururaj AE, Ozyurt B, Liu R, Soysal E, Cohen T, Tiryaki F, Li Y, Zong N, Jiang M, Rogith D, Salimi M, Kim HE, Rocca-Serra P, Gonzalez-Beltran A, Farcas C, Johnson T, Margolis R, Alter G, Sansone SA, Fore IM, Ohno-Machado L, Grethe JS, Xu H. DataMed - an open source discovery index for finding biomedical datasets. J Am Med Inform Assoc 2018; 25:300-308. [PMID: 29346583 PMCID: PMC7378878 DOI: 10.1093/jamia/ocx121] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 05/30/2017] [Revised: 09/20/2017] [Accepted: 09/28/2017] [Indexed: 12/17/2022] Open
Abstract
Objective
Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.
Materials and Methods
DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.
Results and Conclusion
Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results among the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community.
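The two ranking metrics named in this abstract can be made concrete with a short Python sketch. Two caveats apply: the abstract reports *inferred* average precision, which is estimated from sampled relevance judgments, whereas the function below computes plain average precision over fully judged lists; and the binary relevance labels in the example are invented for illustration. The function names are hypothetical.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are relevant.

    `relevance` is a ranked list of truthy/falsy judgments.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevance[:k] if r) / k


def average_precision(relevance, total_relevant=None):
    """Plain (not inferred) average precision for one ranked list.

    Assumption: if `total_relevant` is not given, only the relevant
    items that appear in the list are counted in the denominator.
    """
    hits = 0
    score = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant position
    denom = total_relevant if total_relevant is not None else hits
    return score / denom if denom else 0.0


# Hypothetical judgments for one query, in ranked order.
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
p10 = precision_at_k(judgments, k=10)
ap = average_precision([1, 0, 1])
```

Averaging `precision_at_k` over all benchmark queries gives a single P@10 figure of the kind reported above.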
Affiliation(s)
- Xiaoling Chen: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Anupama E Gururaj: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ruiling Liu: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ergin Soysal: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Trevor Cohen: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Firat Tiryaki: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Yueling Li: Center for Research in Biological Systems
- Nansu Zong: Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Min Jiang: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Deevakar Rogith: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Mandana Salimi: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Hyeon-Eui Kim: Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Claudiu Farcas: Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Todd Johnson: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ron Margolis: National Institutes of Health, Bethesda, MD, USA
- Ian M Fore: National Institutes of Health, Bethesda, MD, USA
- Lucila Ohno-Machado: Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Hua Xu: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
8
Karisani P, Qin ZS, Agichtein E. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval. Database (Oxford) 2018; 2018:4956082. [PMID: 29688379 PMCID: PMC5887275 DOI: 10.1093/database/bax104] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/16/2017] [Revised: 11/12/2017] [Accepted: 12/20/2017] [Indexed: 11/17/2022]
Abstract
The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie
Affiliation(s)
- Payam Karisani: Department of Computer Science, Mathematics & Science Center, Emory University, Suite W401, 400 Dowman Drive NE, Atlanta, Georgia 30322, USA
- Zhaohui S Qin: Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road NE, Atlanta, Georgia 30322-4201, USA
- Eugene Agichtein: Department of Computer Science, Mathematics & Science Center, Emory University, Suite W401, 400 Dowman Drive NE, Atlanta, Georgia 30322, USA
9
Cieslewicz A, Dutkiewicz J, Jedrzejek C. Baseline and extensions approach to information retrieval of complex medical data: Poznan's approach to the bioCADDIE 2016. Database (Oxford) 2018; 2018:4930756. [PMID: 29688372 PMCID: PMC5846287 DOI: 10.1093/database/bax103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2017] [Revised: 12/18/2017] [Accepted: 12/18/2017] [Indexed: 11/23/2022]
Abstract
Database URL https://biocaddie.org/benchmark-data.
Affiliation(s)
- Artur Cieslewicz: Department of Clinical Pharmacology, Poznan University of Medical Sciences, Dluga 1/2 Str., 61-848 Poznan, Poland
- Jakub Dutkiewicz: Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
- Czeslaw Jedrzejek: Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
10
Wei W, Ji Z, He Y, Zhang K, Ha Y, Li Q, Ohno-Machado L. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge. Database: The Journal of Biological Databases and Curation 2018; 2018:4939515. [PMID: 29688374 PMCID: PMC5861401 DOI: 10.1093/database/bay017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/16/2017] [Accepted: 01/30/2018] [Indexed: 01/28/2023]
Abstract
The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline
Affiliation(s)
- Wei Wei: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
- Zhanglong Ji: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
- Yupeng He: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
- Kai Zhang: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
- Yuanchi Ha: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
- Qi Li: Department of Computer Science, Northern Kentucky University, Nunn Drive, Highland Heights, KY 41099, USA
- Lucila Ohno-Machado: University of California, San Diego, 9500 Gilman Drive, MC 0728, La Jolla, CA 92093-0728, USA
11
Scerri A, Kuriakose J, Deshmane AA, Stanger M, Cotroneo P, Moore R, Naik R, de Waard A. Elsevier's approach to the bioCADDIE 2016 Dataset Retrieval Challenge. Database: The Journal of Biological Databases and Curation 2017; 2017:4090923. [PMID: 29220454 PMCID: PMC5737073 DOI: 10.1093/database/bax056] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/16/2017] [Accepted: 06/29/2017] [Indexed: 11/13/2022]
Abstract
Database URL https://data.mendeley.com/datasets/zd9dxpyybg/1.
Affiliation(s)
- Antony Scerri: Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- John Kuriakose: Infosys, Hosur Road, Electronics City, Bengaluru 560 100, India
- Mark Stanger: Search Technologies Corp, 1110 Herndon Parkway, Suite 306, Herndon, VA 20170, USA
- Peter Cotroneo: Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- Rebekah Moore: Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- Raj Naik: Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
12
Bouadjenek MR, Verspoor K. Multi-field query expansion is effective for biomedical dataset retrieval. Database: The Journal of Biological Databases and Curation 2017; 2017:4107606. [PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/20/2017] [Accepted: 07/31/2017] [Indexed: 01/01/2023]
Abstract
In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
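As a rough illustration of the Rocchio-style expansion this abstract refers to, the sketch below adds the highest-weighted terms from a set of feedback documents to a query. It is not the authors' implementation: the function name `rocchio_expand`, the `alpha` and `beta` values, the omission of a negative (non-relevant) term, and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, n_new_terms=3):
    """Return (expanded query terms, term weights).

    Hypothetical helper: applies the positive part of the Rocchio update,
    weight = alpha * query_tf + beta * mean_feedback_tf, and appends the
    n_new_terms highest-weighted terms that are not already in the query.
    """
    query_vec = Counter(query_terms)

    # Centroid of the feedback documents, using raw term frequencies.
    centroid = Counter()
    for doc in feedback_docs:
        for term, tf in Counter(doc).items():
            centroid[term] += tf / len(feedback_docs)

    weights = {
        term: alpha * query_vec.get(term, 0) + beta * centroid.get(term, 0)
        for term in set(query_vec) | set(centroid)
    }

    # Rank candidate terms by weight, breaking ties alphabetically.
    candidates = sorted(
        (term for term in weights if term not in query_vec),
        key=lambda term: (-weights[term], term),
    )
    return list(query_terms) + candidates[:n_new_terms], weights


# Toy feedback set standing in for the top-ranked results of an initial query.
docs = [
    ["cancer", "tumor", "gene", "expression"],
    ["tumor", "expression", "rna"],
]
expanded, weights = rocchio_expand(["cancer", "gene"], docs, n_new_terms=2)
print(expanded)
```

The need to run an initial query to obtain the feedback documents is the extra cost that the abstract contrasts with lexicon-based expansion.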
Affiliation(s)
- Mohamed Reda Bouadjenek
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia
|
13
|
Wang Y, Rastegar-Mojarad M, Komandur-Elayavilli R, Liu H. Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts. Database (Oxford) 2017; 2017:bax091. [PMID: 31725862 PMCID: PMC7243926 DOI: 10.1093/database/bax091] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 03/17/2017] [Revised: 10/17/2017] [Accepted: 11/14/2017] [Indexed: 11/16/2022]
Abstract
The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.
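To make the idea of query expansion with word embeddings concrete, here is a minimal sketch that adds each query term's nearest neighbour by cosine similarity. It is only an illustration under stated assumptions: the three-dimensional vectors are invented toy values rather than trained embeddings, and the helper names, `k`, and the `min_sim` threshold are hypothetical, not taken from the system described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_with_embeddings(query_terms, embeddings, k=1, min_sim=0.8):
    """Append up to k nearest neighbours of each query term (hypothetical helper)."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue  # Out-of-vocabulary terms are left unexpanded.
        scored = sorted(
            (
                (cosine(embeddings[term], vec), cand)
                for cand, vec in embeddings.items()
                if cand not in expanded
            ),
            key=lambda pair: (-pair[0], pair[1]),
        )
        expanded.extend(cand for sim, cand in scored[:k] if sim >= min_sim)
    return expanded


# Invented toy vectors; a real system would load trained embeddings instead.
toy_embeddings = {
    "tumor": [0.9, 0.1, 0.0],
    "neoplasm": [0.85, 0.15, 0.05],
    "mouse": [0.0, 0.2, 0.9],
    "murine": [0.05, 0.25, 0.85],
    "protein": [0.1, 0.9, 0.1],
}
print(expand_with_embeddings(["tumor", "mouse"], toy_embeddings))
```

The similarity threshold keeps loosely related terms out of the expanded query, which is one simple way to limit query drift.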
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
|