Mask Training. Mask training treats the pruning mask $m$ as a trainable parameter. Following [35, 66, 42, 32], we achieve this through binarization in the forward pass and gradient estimation in the backward pass. Each weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, which is frozen during mask training, is associated with a binary mask $m \in \{0, 1\}^{d_1 \times d_2}$ and a real-valued mask $\hat{m} \in \mathbb{R}^{d_1 \times d_2}$. In the forward pass, $W$ is replaced with $m \odot W$, where $m$ is derived from $\hat{m}$ through binarization with threshold $\phi$:
$$m_{i,j} = \begin{cases} 1 & \text{if } \hat{m}_{i,j} \geq \phi \\ 0 & \text{otherwise} \end{cases}$$
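As a rough illustration of the forward binarization and a backward-pass update, the following pure-Python sketch may help. The excerpt does not say which gradient estimator is used, so the straight-through estimator here is an assumption, and the function names (`binarize`, `apply_mask`, `ste_update`), the toy values, and the learning rate are all hypothetical.

```python
def binarize(m_hat, phi):
    """Threshold the real-valued mask m_hat into a {0, 1} mask m."""
    return [[1 if s >= phi else 0 for s in row] for row in m_hat]


def apply_mask(m, W):
    """Element-wise product m ⊙ W; the frozen weights W are not modified."""
    return [[mi * wi for mi, wi in zip(m_row, w_row)]
            for m_row, w_row in zip(m, W)]


def ste_update(m_hat, W, grad_masked_W, lr):
    """One gradient step on m_hat under an ASSUMED straight-through estimator.

    grad_masked_W is dL/d(m ⊙ W). By the chain rule, dL/dm = grad_masked_W ⊙ W.
    The straight-through assumption treats binarization as the identity in the
    backward pass, so the same gradient is applied to m_hat.
    """
    return [[s - lr * g * w for s, g, w in zip(s_row, g_row, w_row)]
            for s_row, g_row, w_row in zip(m_hat, grad_masked_W, W)]


# Toy 2x2 example (all values are illustrative).
W = [[0.5, -1.0], [2.0, 0.25]]           # frozen weights
m_hat = [[0.02, 0.005], [0.01, -0.3]]    # real-valued mask
phi = 0.01                               # binarization threshold

m = binarize(m_hat, phi)                 # forward pass: derive binary mask
masked_W = apply_mask(m, W)              # weights actually used by the layer

# Pretend the loss gradient w.r.t. the masked weights is all ones.
grad = [[1.0, 1.0], [1.0, 1.0]]
new_m_hat = ste_update(m_hat, W, grad, lr=0.01)
new_m = binarize(new_m_hat, phi)
```

In this toy run only $\hat{m}$ changes while $W$ stays fixed, and an entry that was pruned in the first forward pass can cross the threshold after the update and become active again.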