
Datasets and Evaluation.

Natural Language Inference. We use MNLI [57] as the in-distribution (ID) dataset for NLI. MNLI consists of premise-hypothesis pairs whose relationship may be entailment, contradiction, or neutral. In MNLI, the word overlap between premise and hypothesis is strongly correlated with the entailment class. To address this, the out-of-distribution (OOD) HANS dataset [37] is built so that this correlation does not hold.

Paraphrase Identification. The ID dataset for paraphrase identification is QQP, which contains question pairs labelled as either duplicate or non-duplicate. In QQP, high lexical overlap is also strongly associated with the duplicate class. The OOD datasets PAWS-QQP and PAWS-Wiki [65] are built from sentences in Quora and Wikipedia, respectively. In PAWS, sentence pairs with high word overlap have a balanced distribution over the duplicate and non-duplicate classes.

Fact Verification. FEVER [52] is adopted as the ID dataset for fact verification, where the task is to assess whether the given evidence supports or refutes a claim, or whether there is not enough information to reach a conclusion. The OOD dataset FEVER-Symmetric (v1 and v2) [46] is designed to evaluate the influence of the claim-only bias (the label can be predicted correctly without the evidence).

For NLI and fact verification, we use accuracy as the evaluation metric; for paraphrase identification, we report the F1 score. More details of the datasets and evaluation are given in Appendix B.1.

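As a concrete illustration of this evaluation setup (not the authors' code), the sketch below loads the MNLI/HANS ID/OOD pair from the Hugging Face Hub and scores a user-supplied predictor with the metrics named above. The Hub identifiers, split names, and the `predict_fn` interface are assumptions made for illustration; PAWS-QQP and FEVER-Symmetric are typically obtained from their original repositories rather than the Hub.

```python
# Minimal sketch of the evaluation protocol, assuming the Hugging Face
# `datasets` library and scikit-learn. Dataset identifiers and the predictor
# interface are illustrative assumptions, not the paper's implementation.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

# ID / OOD pair for NLI: MNLI (matched validation split) and HANS.
mnli = load_dataset("glue", "mnli", split="validation_matched")
hans = load_dataset("hans", split="validation")

def eval_pair_classification(predict_fn, dataset, text_keys, metric):
    """Run a sentence-pair predictor over a dataset and score it.

    predict_fn: callable mapping (sentence_a, sentence_b) -> integer label id.
    text_keys:  the two text columns, e.g. ("premise", "hypothesis") for
                MNLI/HANS or ("question1", "question2") for QQP/PAWS.
    metric:     accuracy_score for NLI and fact verification, f1_score
                (duplicate class positive) for paraphrase identification.
    """
    a_key, b_key = text_keys
    preds = [predict_fn(ex[a_key], ex[b_key]) for ex in dataset]
    return metric(dataset["label"], preds)

# Example usage with a trivial baseline that always predicts label 0.
baseline = lambda a, b: 0
print("MNLI accuracy:", eval_pair_classification(
    baseline, mnli, ("premise", "hypothesis"), accuracy_score))
# Note: HANS is two-way (entailment vs. non-entailment), so three-way MNLI
# predictions must be collapsed to two classes before scoring on HANS.
print("HANS accuracy:", eval_pair_classification(
    baseline, hans, ("premise", "hypothesis"), accuracy_score))
```

The same helper covers QQP/PAWS by passing `("question1", "question2")` and `f1_score`, mirroring the metric choice described in the text.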