1. Maistro M, Breuer T, Schaer P, Ferro N. An in-depth investigation on the behavior of measures to quantify reproducibility. Inf Process Manag 2023. DOI: 10.1016/j.ipm.2023.103332

2. Göksel G, Arslan A, Dinçer BT. A selective approach to stemming for minimizing the risk of failure in information retrieval systems. PeerJ Comput Sci 2023; 9:e1175. PMID: 37346699; PMCID: PMC10280253; DOI: 10.7717/peerj-cs.1175
Abstract
Stemming is supposed to improve the average performance of an information retrieval system, but in practice, past experimental results show that this is not always the case. In this article, we propose a selective approach to stemming that decides, on a per-query basis, whether stemming should be applied. Our method aims at minimizing the risk of failure caused by stemming in retrieving semantically related documents. The work contributes to the IR literature an application of selective stemming together with a set of new features derived from the term frequency distributions of the candidate systems. The method combines query performance predictors and the derived features using a machine learning technique. It is comprehensively evaluated using three rule-based stemmers and eight query sets corresponding to four document collections from the standard TREC and NTCIR datasets. All but one of the collections consist of Web documents, ranging in size from 25 million to 733 million pages. The experimental results show that the method makes accurate selections that increase the robustness of the system and minimize the risk of failure (i.e., per-query performance losses) across queries. The results also show that the method attains a systematically higher average retrieval performance than the single systems for most query sets.

Affiliation(s)
- Gökhan Göksel, Computer Engineering, Eskişehir Technical University, Eskisehir, Turkey
- Ahmet Arslan, Computer Engineering, Eskişehir Technical University, Eskisehir, Turkey

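The per-query selection this abstract describes can be illustrated with a toy sketch. This is a hedged simplification, not the authors' actual feature set or learned model: a single pre-retrieval predictor (average IDF of the query terms) and a fixed threshold stand in for their derived features and machine-learning classifier.

```python
import math

def avg_idf(query_terms, doc_freqs, num_docs):
    """Mean inverse document frequency of the query terms."""
    idfs = [math.log(num_docs / (1 + doc_freqs.get(t, 0))) for t in query_terms]
    return sum(idfs) / len(idfs)

def choose_system(query_terms, doc_freqs, num_docs, threshold=2.0):
    """Route specific (high-IDF) queries to the unstemmed system and broad
    (low-IDF) queries to the stemmed one; in the paper this decision is
    learned rather than thresholded."""
    if avg_idf(query_terms, doc_freqs, num_docs) >= threshold:
        return "unstemmed"
    return "stemmed"
```

A rare term keeps the query on the unstemmed side; a query of common terms falls back to stemming, mirroring the idea that stemming helps most when the original terms match few documents.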
3. Focal elements of neural information retrieval models. An outlook through a reproducibility study. Inf Process Manag 2020. DOI: 10.1016/j.ipm.2019.102109

4. A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms. Inform Retrieval J 2019. DOI: 10.1007/s10791-018-9347-9

5. Yılmazel İB, Arslan A. An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets. J Inf Sci 2019. DOI: 10.1177/0165551519866551
Abstract
The ClueWeb09 dataset and its successor, the ClueWeb12 dataset, are two of the largest collections of Web pages released by the Text REtrieval Conference (TREC). The ClueWeb datasets were used in various TREC tracks run from 2009 through 2017. Each year, approximately 50 new queries were released, and a pool of Web pages was judged against these queries by human assessors as relevant, non-relevant, or spam. In this article, a ground truth for binary classification (spam vs. non-spam) is constructed from Web pages judged as spam or relevant, under the assumption that a Web page judged relevant for any query cannot be spam. Based on this ground truth, we evaluate the classification performance of the Waterloo spam rankings (Fusion, Britney, GroupX, and UK2006), which have traditionally been used to identify and filter spam pages in retrieval systems. The experimental results, in terms of universal binary classification evaluation measures, suggest that Fusion (with threshold = 11%) is the best for the ClueWeb09 dataset. Analysis of the frequency distributions of relevant/spam documents over spam scores reveals that GroupX is the most powerful at identifying relevant documents, whereas Fusion is the most powerful at identifying spam documents. It is also confirmed that the effectiveness of the Fusion spam ranking of the ClueWeb12 dataset is not as good as that of ClueWeb09.

Affiliation(s)
- Ahmet Arslan, Department of Computer Engineering, Eskişehir Technical University, Turkey

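The threshold-based evaluation sketched in this abstract can be made concrete as follows. This is an illustrative sketch under stated assumptions: Waterloo spam rankings assign percentile scores where lower means spammier, pages below the threshold are flagged as spam, and the function name is our own invention.

```python
def spam_metrics(scores, labels, threshold):
    """Precision and recall of a filter that flags every page whose
    percentile spam score is below `threshold` (lower = spammier).
    `labels[i]` is True when page i is spam in the ground truth."""
    tp = sum(1 for s, spam in zip(scores, labels) if s < threshold and spam)
    fp = sum(1 for s, spam in zip(scores, labels) if s < threshold and not spam)
    fn = sum(1 for s, spam in zip(scores, labels) if s >= threshold and spam)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `threshold` over the score range and recomputing these measures is one way to compare rankings such as Fusion and GroupX on the same ground truth.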
6. Arslan A. How sensitive are the term-weighting models of information retrieval to spam Web pages? Inform Process Lett 2019. DOI: 10.1016/j.ipl.2018.12.003

7. Roy D, Mitra M, Ganguly D. To Clean or Not to Clean. ACM J Data Inf Qual 2018. DOI: 10.1145/3242180
Abstract
Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup is handled during indexing. However, this question turns out to be important: through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.

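As a concrete illustration of the markup-filtering step this abstract discusses, here is a minimal tag stripper built only on the Python standard library. It is a sketch of the preprocessing choice, not the cleaning pipeline used in the paper, and real collections would also need encoding handling and boilerplate heuristics.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text only; drop tags and the contents of
    <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self._skip = 0          # depth inside script/style elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def strip_markup(html):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())
```

Indexing either `strip_markup(doc)` or the raw `doc` is exactly the choice whose retrieval-effectiveness consequences the paper measures.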
8. Ferro N, Fuhr N, Rauber A. Introduction to the Special Issue on Reproducibility in Information Retrieval. ACM J Data Inf Qual 2018. DOI: 10.1145/3268410

9. Yang P, Fang H, Lin J. Anserini: Reproducible Ranking Baselines Using Lucene. ACM J Data Inf Qual 2018.
Abstract
This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing systems, those systems are primarily designed to facilitate the publication of research papers. As such, they are often incomplete, inflexible, poorly documented, difficult to use, and slow, particularly in the context of modern web-scale collections. Furthermore, the growing complexity of modern software ecosystems and the resource constraints most academic research groups operate under make maintaining open-source systems a constant struggle. Except for a small number of companies (mostly commercial web search engines) that deploy custom infrastructure, Lucene has become the de facto platform in industry for building search applications. Lucene has an active developer base, a large audience of users, and diverse capabilities to work with heterogeneous collections at scale. However, it lacks systematic support for ad hoc experimentation using standard test collections. We describe Anserini, an information retrieval toolkit built on Lucene that fills this gap. Our goal is to simplify ad hoc experimentation and allow researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections. With Anserini, we demonstrate that Lucene provides a suitable framework for supporting information retrieval research. Experiments show that our system efficiently indexes large web collections, provides modern ranking models that are on par with research implementations in terms of effectiveness, and supports low-latency query evaluation to facilitate rapid experimentation.

Affiliation(s)
- Hui Fang, University of Delaware, Newark, DE, USA
- Jimmy Lin, University of Waterloo, Waterloo, ON, Canada

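The bag-of-words models this abstract refers to are classical term-weighting schemes; the BM25 term weight, for instance, can be written in a few lines. This is a self-contained sketch, not Anserini's implementation: it uses the non-negative IDF variant popularized by Lucene, and k1 = 0.9, b = 0.4 are, to our understanding, Anserini's default parameters.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
    """Okapi BM25 weight of one query term in one document.

    tf:           term frequency in the document
    df:           number of documents containing the term
    doc_len:      length of the document (in terms)
    avg_doc_len:  average document length in the collection
    num_docs:     collection size
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

A document's score for a query is the sum of these weights over the query terms, which is the shape of computation a toolkit like Anserini executes at scale over an inverted index.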
11
|
Ferro N, Fuhr N, Rauber A. Introduction to the Special Issue on Reproducibility in Information Retrieval. ACM JOURNAL OF DATA AND INFORMATION QUALITY 2018. [DOI: 10.1145/3268408] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
12. Ferro N. Reproducibility Challenges in Information Retrieval Evaluation. ACM J Data Inf Qual 2017. DOI: 10.1145/3020206

13. 3.5K runs, 5K topics, 3M assessments and 70M measures: What trends in 10 years of Adhoc-ish CLEF? Inf Process Manag 2017. DOI: 10.1016/j.ipm.2016.08.001

14. Silvello G, Bordea G, Ferro N, Buitelaar P, Bogers T. Semantic representation and enrichment of information retrieval experimental data. Int J Digit Libr 2016. DOI: 10.1007/s00799-016-0172-8