Simulation settings and evaluation methods
Based on our simulation framework, the benchmark IRA K and the data generation are determined by the following parameters: the prevalence of the rating outcome (θ); the two raters’ marginal probabilities of feeling uncertain about an item (p1 and p2); the indirect correlation measure of the two raters encountering uncertainty (ρU); the two raters’ marginal probabilities of giving a wrong judgment when uncertain (m1 and m2); the indirect correlation measure of making a correct decision (ρC); and the total number of rating subjects (N). In this assessment, we considered 9 levels of prevalence, θ ∈ {0.1, 0.2, ..., 0.9}; 5 levels of the probability of feeling uncertain for each rater, p1, p2 ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; 5 levels of the latent correlation parameter for uncertainty, ρU ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; 5 levels of the misclassification rate given uncertainty, m1, m2 ∈ {0.1, 0.2, 0.3, 0.4, 0.5} (we assume that rational decisions perform no worse than random guessing, i.e., mi ≤ 0.5); 5 levels of the latent correlation parameter for making correct decisions, ρC ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; and 4 levels of the total number of rating subjects, N ∈ {25, 50, 100, 200}. For each parameter combination, our proposed framework allows us to calculate a “true” IRA K under that specific setting. We then generate 1,000 2 × 2 tables following the proposed data-generating process and calculate the corresponding 10 IRA measures for each simulated table. Of primary interest is the bias, calculated as the Monte Carlo mean of each IRA measure over the 1,000 simulated tables minus the setting-specific K value. We also compute, for each IRA measure under each parameter setting, the coverage probability of the “true” chance-corrected IRA, defined as the proportion of the 1,000 replications in which K falls within the 95% confidence interval of that IRA measure.
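As an illustration of the design described above, the following Python sketch enumerates the parameter levels and computes the two evaluation metrics from a set of replicate results. It is a minimal sketch, not the authors' implementation. It assumes that the levels are crossed in a full factorial design, which the text does not state explicitly, and the names parameter_grid, grid_size, bias, and coverage are hypothetical. The data-generating process and the calculation of the "true" K are not reproduced here because their details are not given in this section.

```python
import itertools
import math
from statistics import fmean

# Parameter levels as listed in the text.
THETA = [k / 10 for k in range(1, 10)]    # prevalence: 0.1, 0.2, ..., 0.9
P_UNCERTAIN = [0.1, 0.3, 0.5, 0.7, 0.9]   # p1 and p2
RHO = [0.1, 0.3, 0.5, 0.7, 0.9]           # rho_U and rho_C
M_WRONG = [0.1, 0.2, 0.3, 0.4, 0.5]       # m1 and m2
N_SUBJECTS = [25, 50, 100, 200]           # N

# Factor order: theta, p1, p2, rho_U, m1, m2, rho_C, N.
FACTORS = [THETA, P_UNCERTAIN, P_UNCERTAIN, RHO,
           M_WRONG, M_WRONG, RHO, N_SUBJECTS]


def parameter_grid():
    """Iterate over all settings (ASSUMPTION: full factorial crossing)."""
    return itertools.product(*FACTORS)


def grid_size():
    """Number of settings under the full factorial assumption."""
    return math.prod(len(levels) for levels in FACTORS)


def bias(estimates, true_k):
    """Monte Carlo mean of one IRA measure minus the setting-specific K."""
    return fmean(estimates) - true_k


def coverage(lowers, uppers, true_k):
    """Proportion of replications whose interval [lower, upper] contains K."""
    hits = [lo <= true_k <= hi for lo, hi in zip(lowers, uppers)]
    return sum(hits) / len(hits)
```

Under the full factorial assumption, grid_size() evaluates to 9 × 5^6 × 4 = 562,500 settings. For one setting, bias would be called with the 1,000 replicate estimates of a given IRA measure, and coverage with the 1,000 pairs of 95% confidence limits for that measure.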
The coverage probability assesses how well each IRA measure’s interval estimate captures the benchmark K; in the coverage calculations, the confidence intervals are constructed using the asymptotic variance estimation methods included in the review. In addition, we conducted an agglomerative hierarchical clustering on our extensive simulation data to estimate the similarity among the reviewed methods as well as our proposed “true” chance-corrected IRA K. Euclidean distance across different simulation settings and the average linkage method were used to quantify the between-IRA metho...
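The clustering step can be sketched with SciPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code. It assumes the simulation results are arranged as a matrix with one row per IRA measure (the reviewed measures plus the "true" K) and one column per simulation setting. The function name cluster_measures is hypothetical, and which summary value fills each cell (for example, the Monte Carlo mean) is not specified in this section.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def cluster_measures(values):
    """Agglomerative clustering of IRA measures.

    values: array-like of shape (n_measures, n_settings); each row holds
    one measure's summary value across the simulation settings
    (ASSUMED layout, not confirmed by the source).

    Returns the SciPy linkage matrix of shape (n_measures - 1, 4), built
    with Euclidean distance between rows and average linkage between
    clusters.
    """
    X = np.asarray(values, dtype=float)
    return linkage(X, method="average", metric="euclidean")
```

The returned linkage matrix can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree, with the row labels set to the names of the IRA measures.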
