Methods and Materials. For each of the 18820 pairs of the ad-hoc retrieval runs of TREC 3, 5–8, we computed the two-sided statistical significance (p-value) of the difference in the pair’s mean average precision using each of three tests: the randomization, shifted bootstrap, and Student’s paired t-test. Both the randomization and bootstrap are distribution-free tests. Space limitations prevent us from explaining the details of each of these well-known tests. For both the randomization and bootstrap, we performed 100,000 samples. For each pair of runs, we sampled topics without replacement to produce runs with 10, 20, 30, and 40 topics. To compare significance tests, we computed the root mean square error between each test and each other test’s p-values. The root mean square error is:

$$\mathrm{RMSE} = \left( \frac{1}{N} \sum_{i=1}^{N} (E_i - O_i)^2 \right)^{1/2}$$

Pairs of TREC runs with p-values ≥ 0.0001

Number of Topics     50      40      30      20      10
rand. vs. t-test     0.007   0.009   0.011   0.018   0.037
boot. vs. t-test     0.007   0.009   0.011   0.017   0.035
boot. vs. rand.      0.011   0.014   0.017   0.026   0.051

Run pairs with p-value p such that 0.0001 < p < 0.5

Number of Topics     50      40      30      20      10
rand. vs. t-test     0.005   0.006   0.008   0.012   0.027
boot. vs. t-test     0.008   0.010   0.013   0.020   0.041
boot. vs. rand.      0.010   0.013   0.016   0.024   0.047

Copyright is held by the author/owner(s). SIGIR’09, July 19–23, 2009, Boston, Massachusetts, USA. ACM 978-1-60558-483-6/09/07.
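The text does not spell out the tests, so the following is only a minimal Python sketch of how two of them and the RMSE comparison might be computed from per-topic score differences between two runs. The function names, the sign-flip formulation of the paired randomization test, the mean-shifting formulation of the bootstrap, and the default sample count are all assumptions made for illustration, not details taken from the paper. The paired t-test is left out because it needs a Student's t distribution, which the Python standard library does not provide.

```python
import math
import random


def randomization_p(diffs, samples=10000, rng=None):
    """Two-sided p-value from a paired randomization (sign-flip) test.

    diffs holds the per-topic differences in average precision between
    two runs. Flipping a sign corresponds to swapping the two runs'
    scores on that topic (an assumed, common formulation).
    """
    rng = rng or random.Random(0)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(samples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            hits += 1
    return hits / samples


def shifted_bootstrap_p(diffs, samples=10000, rng=None):
    """Two-sided p-value from a shifted bootstrap test.

    The differences are shifted to have zero mean so that resampling
    topics with replacement simulates the null hypothesis (an assumed
    formulation of the shift method).
    """
    rng = rng or random.Random(0)
    n = len(diffs)
    mean = sum(diffs) / n
    shifted = [d - mean for d in diffs]
    observed = abs(mean)
    hits = 0
    for _ in range(samples):
        resample = rng.choices(shifted, k=n)
        if abs(sum(resample) / n) >= observed:
            hits += 1
    return hits / samples


def rmse(estimated, observed):
    """Root mean square error between two equal-length lists of p-values."""
    n = len(estimated)
    squared = sum((e - o) ** 2 for e, o in zip(estimated, observed))
    return math.sqrt(squared / n)
```

Under these assumptions, computing one p-value per run pair with each test and passing the two resulting lists to `rmse` would give a single agreement figure of the kind reported in the tables. The defaults here use far fewer than the 100,000 samples mentioned in the text, purely to keep the sketch fast.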