Mask Training. Mask training treats the pruning mask $m$ as a trainable parameter. Following [35, 66, 42, 32], we achieve this through binarization in the forward pass and gradient estimation in the backward pass. Each weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, which is frozen during mask training, is associated with a binary mask $m \in \{0, 1\}^{d_1 \times d_2}$ and a real-valued mask $\hat{m} \in \mathbb{R}^{d_1 \times d_2}$. In the forward pass, $W$ is replaced with $m \odot W$, where $m$ is derived from $\hat{m}$ through binarization:

$$
m_{i,j} =
\begin{cases}
1 & \text{if } \hat{m}_{i,j} \geq \phi \\
0 & \text{otherwise}
\end{cases}
\tag{1}
$$

where $\phi$ is the threshold. In the backward pass, since the binarization operation is not differentiable, we use the straight-through estimator [3] to compute the gradients for $\hat{m}$ using the gradients of $m$, i.e., $\frac{\partial L}{\partial m}$, where $L$ is the loss. Then, $\hat{m}$ is updated as $\hat{m} \leftarrow \hat{m} - \eta \frac{\partial L}{\partial m}$, where $\eta$ is the learning rate. Following [42, 32], we initialize the real-valued masks according to the magnitude of the original weights. The complete mask training algorithm is summarized in Appendix A.1.2.
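The forward binarization and the straight-through update described above can be illustrated with a minimal pure-Python sketch. It is not the algorithm from Appendix A.1.2. The toy squared-error loss, the single-input setup, and the helper names (`init_from_magnitude`, `binarize`, `mask_training_step`) are assumptions made for illustration. In particular, initializing the real-valued mask to |W| is only one possible reading of "according to the magnitude of the original weights".

```python
# Toy sketch of mask training with a straight-through estimator (STE).
# Assumptions: one linear layer, one input vector, loss L = 0.5 * ||(m * W) x - y||^2.
# All function names and hyperparameters below are hypothetical.

def init_from_magnitude(W):
    """Assumed initialization: real-valued mask m_hat = |W|."""
    return [[abs(w) for w in row] for row in W]

def binarize(m_hat, phi):
    """Eq. (1): m[i][j] = 1 if m_hat[i][j] >= phi, else 0."""
    return [[1.0 if v >= phi else 0.0 for v in row] for row in m_hat]

def forward(W, m, x):
    """Compute (m * W) x, where * is the element-wise product."""
    return [sum(m[i][j] * W[i][j] * x[j] for j in range(len(x)))
            for i in range(len(W))]

def loss_and_mask_grad(W, m, x, y):
    """Return the toy loss and its gradient with respect to the binary mask m."""
    resid = [o - t for o, t in zip(forward(W, m, x), y)]
    loss = 0.5 * sum(r * r for r in resid)
    # dL/dm[i][j] = resid[i] * W[i][j] * x[j]
    grad_m = [[resid[i] * W[i][j] * x[j] for j in range(len(x))]
              for i in range(len(W))]
    return loss, grad_m

def mask_training_step(W, m_hat, x, y, phi, lr):
    """One update of m_hat; W is only read, never modified (frozen)."""
    m = binarize(m_hat, phi)
    loss, grad_m = loss_and_mask_grad(W, m, x, y)
    # STE: use dL/dm as the gradient for m_hat, i.e. m_hat <- m_hat - lr * dL/dm.
    new_m_hat = [[m_hat[i][j] - lr * grad_m[i][j] for j in range(len(m_hat[i]))]
                 for i in range(len(m_hat))]
    return new_m_hat, loss

# Usage: the second weight hurts the fit, so its mask score is pushed below phi.
W = [[1.0, -2.0]]
x, y = [1.0, 1.0], [1.0]
m_hat = init_from_magnitude(W)
for _ in range(3):
    m_hat, loss = mask_training_step(W, m_hat, x, y, phi=0.5, lr=0.2)
final_mask = binarize(m_hat, 0.5)  # [[1.0, 0.0]] for this toy problem
```

In this toy run the score of the harmful weight drops below the threshold after two updates, so the binary mask prunes it while `W` itself is unchanged. A practical implementation would more likely rely on an autodiff framework with a custom backward rule for the binarization step.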