Retrieval Model and Experimental Setup
Having obtained the debiased embeddings, we now explain how we use them in a retrieval experiment. We select two different sets of pre-trained word embeddings: the standard Word2Vec embeddings pre-trained on Google News, as shared by [7], and the debiased version of these embeddings (as explained in the section above). As the dataset to test our model, we selected the TREC Robust04 test collection, which consists of news articles and therefore matches the domain of our embeddings. This test collection contains 250 queries (usually called topics in IR), with a total of 311,410 relevance judgments.

We removed stopwords from these queries using the NLTK stopword list and lowercased all query terms. We expand each query with k = 5 terms by computing, for each method, the five terms closest to the query embedding in the embedding space with respect to cosine similarity. To compute these terms, we use Gensim's most_similar function, where the input is the stopped, lowercased query terms and the output is the top-k closest words that do not occur among the input words. After this, we substitute the words of the query with the expanded terms and use these for retrieval. The scoring is based on the method used in [2], but not identical: we use cosine similarity instead of the dot product, and we only expand with words that do not occur in the original query.

To run our experiment, we used Anserini [9] and ranked the documents using BM25 and RM3. This gives us three ranking files: one with the regular queries (Standard), one with the biased expansions (Biased), and one with the debiased expansions (Debiased). To combine the biased or debiased embedding-based score with the standard retrieval score, we used Coordinate Ascent from the RankLib package:

score_total = λ · score_standard + (1 − λ) · score_(de)biased,  λ ∈ [0, 1]   (2)

We used 5-fold cross-validation and optimized with respect to NDCG@10 and ERR@10. Over all folds and both methods, this gave an average λ of 0.90 (σ = 0.04). As we can see in Table 1, there is no significant difference in score between biased and debiased query expansion. We also see no significant difference between the Expanded and the Regular versions. Table 1 has two columns: one where we evaluate with respect to the full Robust04 qrels file, and one where we evaluate only on the 48 queries which got different expansions. The expansions only differ in about 20% of the queries, so differences are more clearly visible on this subset.
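As an illustration of the expansion step described above, the following sketch shows how the top-k = 5 expansion terms could be obtained with Gensim and the NLTK stopword list. The file path and the expand_query helper are hypothetical, and the snippet is only a minimal approximation of the setup, not the exact code used in the experiments.

# Minimal sketch of the query expansion step (hypothetical path and helper name).
from gensim.models import KeyedVectors
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

# Pre-trained Google News vectors, or their debiased counterpart.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # hypothetical file name

stop = set(stopwords.words("english"))

def expand_query(query, k=5):
    """Return the k terms closest (by cosine similarity) to the stopped,
    lowercased query terms, excluding the input terms themselves."""
    terms = [t.lower() for t in query.split() if t.lower() not in stop]
    terms = [t for t in terms if t in vectors]  # skip out-of-vocabulary terms
    if not terms:
        return []
    # most_similar averages the input vectors and returns the top-k
    # nearest words that are not among the inputs.
    return [word for word, _ in vectors.most_similar(positive=terms, topn=k)]

print(expand_query("international organized crime"))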

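The interpolation of Eq. (2) can be sketched as follows. The document identifiers, scores, and the simple sweep over λ below are hypothetical placeholders; in the actual experiments λ was tuned with Coordinate Ascent from the RankLib package rather than by grid search.

# Illustrative sketch of the score combination in Eq. (2); all data is hypothetical.

def combine(score_standard, score_debiased, lam):
    """Eq. (2): convex combination of the standard and (de)biased scores."""
    return lam * score_standard + (1.0 - lam) * score_debiased

def rerank(standard_scores, debiased_scores, lam):
    """Combine per-document scores and return docids sorted by total score."""
    combined = {doc: combine(standard_scores[doc], debiased_scores.get(doc, 0.0), lam)
                for doc in standard_scores}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores for one query.
standard = {"d1": 12.3, "d2": 10.8, "d3": 9.4}
debiased = {"d1": 0.61, "d2": 0.72, "d3": 0.55}

# A coarse sweep over lambda stands in for the Coordinate Ascent step.
for lam in (0.5, 0.75, 0.9, 1.0):
    print(lam, rerank(standard, debiased, lam))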