Training Details. Mask training and IMP use largely the same hyper-parameters (adopted from [55]) as full XXXX. One exception is that we train for longer, because we find that good subnetworks at high sparsity levels require more training to be discovered. Unless otherwise specified, we select the best checkpoints based on ID dev performance, without using any OOD information. All reported results are averaged over 4 runs. We defer training details for each dataset, and for each training and pruning setup, to Appendix B.3.
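The checkpoint-selection protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the checkpoint names, accuracy values, and dictionary structure are hypothetical, and it only shows that selection uses ID dev scores alone and that the reported metric is the mean over 4 runs.

```python
import statistics


def select_best_checkpoint(id_dev_scores):
    """Pick the checkpoint with the highest ID dev accuracy.

    `id_dev_scores` maps checkpoint name -> ID dev accuracy.
    OOD performance is deliberately never consulted here,
    matching the selection rule described in the text.
    """
    return max(id_dev_scores, key=id_dev_scores.get)


# Hypothetical example: four independent runs, each recording
# ID dev accuracy for a few checkpoints.
runs = [
    {"step_1000": 0.81, "step_2000": 0.84, "step_3000": 0.83},
    {"step_1000": 0.80, "step_2000": 0.85, "step_3000": 0.84},
    {"step_1000": 0.82, "step_2000": 0.83, "step_3000": 0.86},
    {"step_1000": 0.79, "step_2000": 0.84, "step_3000": 0.85},
]

# Select per-run best checkpoints on ID dev, then report the
# average of their scores over the 4 runs.
best_per_run = [select_best_checkpoint(r) for r in runs]
reported = statistics.mean(r[b] for r, b in zip(runs, best_per_run))
```

In a real setup the selected checkpoint would then be evaluated on the OOD test sets; only the selection step is restricted to ID dev data.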