Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes
Statistical Significance Tests • May 1st, 2009
Research has shown that little practical difference exists between the randomization, Student's paired t, and bootstrap tests of statistical significance for TREC ad-hoc retrieval experiments with 50 topics. We compared these three tests on runs with topic set sizes down to 10 topics. We found that the tests show increasing disagreement as the number of topics decreases. At smaller numbers of topics, the randomization test tended to produce smaller p-values than the t-test for p-values less than 0.1. The bootstrap exhibited a systematic bias towards p-values strictly less than those of the t-test, with this bias increasing as the number of topics decreased. We recommend the use of the randomization test, although the t-test appears to be suitable even when the number of topics is small.
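To make the recommended test concrete, here is a minimal sketch of a two-sided paired randomization (sign-flip) test on the mean per-topic difference between two runs. It is an illustration under stated assumptions, not the implementation used in the study: the function name `randomization_test`, its parameters, the Monte Carlo sample count, and the per-topic scores are all hypothetical.

```python
import random


def randomization_test(scores_a, scores_b, num_samples=10_000, seed=0):
    """Two-sided paired randomization test on the mean per-topic difference.

    Hypothetical helper for illustration; returns a Monte Carlo p-value estimate.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_samples):
        # Under the null hypothesis the two system labels are exchangeable
        # within each topic, so every difference is equally likely to have
        # either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / n) >= observed:
            extreme += 1
    return extreme / num_samples


# Made-up average precision scores for 10 topics (not data from the study).
run_a = [0.42, 0.31, 0.55, 0.18, 0.67, 0.29, 0.48, 0.36, 0.51, 0.22]
run_b = [0.38, 0.27, 0.49, 0.20, 0.60, 0.25, 0.41, 0.33, 0.47, 0.19]
p_value = randomization_test(run_a, run_b)
```

With only 10 topics there are just 2^10 = 1024 distinct sign assignments, so the exact permutation distribution could be enumerated instead of sampled; the sampling version is shown because it scales unchanged to 50 topics.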