Bitslicing and Efficient Clause Samples
Bitslicing and Efficient. Sbox Representation CPU architectures tend to operate best on their native word size or half-words and they encounter performance issues with bit-level manipulation. To deal with this issue, the Cortex-M4 features bit-banding support4, as well as a wide selection of bit-field instructions. However, applying them in the context of PRESENT requires extensive use of load and store instructions or numerous bit extractions/insertions, often resulting in poor performance. Bitslicing is a technique introduced by ▇▇▇▇▇ to tackle this inefficiency for DES [6]. Instead of using registers to store consecutive bits of a state, one uses them to hold one specific bit from several different states, effectively transforming bit-level operations into SIMD equivalents. In our implementation, we employ a bitsliced representation of factor 32, i.e. we process in parallel 32 cipher blocks, 64 bits each, resulting in 256 bytes per bitsliced encryption. Doing so, allows us to efficiently compute both the substi- tution and the permutation layer of PRESENT. Analytically, the Sbox can be decomposed into GF (2) operations which can be accelerated by via the SIMD- like instructions and it no longer requires the application of memory lookup tables.5 Similarly, the bit permutations can be accelerated by directly exchang- ing the memory contents of the corresponding bitsliced bits according to the permutation pattern, instead of relying on bit extraction, insertion and shifting. The GF (2) decomposition of the Sbox has sparked interest in the optimiza- tion of boolean circuits w.r.t. computational efficiency. In our implementation, we use the optimized boolean circuit suggested for PRESENT by ▇▇▇▇▇▇▇▇ et al. [21]. The optimized representation was generated by applying the ▇▇▇▇▇-▇▇▇▇▇▇▇ heuristic [12], which reduces the circuit’s gate complexity, i.e. the number of AND, OR, XOR, NOT operations. The representation is shown below. T1 = X2^X1; T2 = X1&T1; T3 = X0^T2; Y4 = X3^T3; T2 = T1&T3; T1 ^= Y4; T2 ^= X1; T4 = X3|T2; Y3 = T1^T4; X3 =~ X3; T2^ = X3; Y1 = Y3^T2; T2 |= T1; Y2 = T3^T2; Values X1–X4 represent an Sbox input, T1–T4 hold temporary values and Y1-Y4 are output values. The total cost is 14 operations, 4 non-linear (AND, OR) and 10 linear (XOR,NOT).
