Synthetic benchmark structure. The performance metrics considered here are the cluster IPC (IPCc, 0 < IPCc ≤ 16), calculated as the number of instructions executed by all the processors divided by the number of cluster execution cycles, and its average value per core.

Cold misses: The body of this benchmark consists of only ALU operations (i.e. mov r0, r0), leading to a theoretical IPCc = 16 (and average IPC = 1) for both architectures. The plot in Figure 7 shows the cluster average IPC on the Y-axis, while the X-axis reports how many times the loop is executed (N_LOOP). As N_LOOP increases, both architectures tend to the theoretical value, but the private architecture starts from a lower IPC due to the heavy impact of cold-miss serialization (16 cores contending for L3 access).

Conflict-free TCDM accesses: This benchmark adds the effect of TCDM accesses. As already mentioned, in the case of conflict-free access the TCDM latency is two cycles, leading to a single-cycle stall between two consecutive instruction fetches. The loop is iterated a fixed number of times (4K, in order to lower the cold-miss effect) and has a variable number of memory operations inside its body. We consider a banking factor of 1, allowing every core to access a different bank without conflicts. The plot in Figure 8 shows the average cluster IPC on the Y-axis, while the X-axis varies the percentage of memory instructions over the total number of instructions in the loop. Both architectures are affected in the same way, with IPC tending to the asymptotic value of 1/2 (see footnote 1) and the cluster IPC correspondingly to 8, in the absence of any conflict leading to misalignment.

Footnote 1: A program consisting of only ALU (1 cycle) or MEMORY (2 cycles for TCDM access) operations gives a per-core IPC = (Nalu + Nmem)/(1·Nalu + 2·Nmem). Increasing the Nmem/Nalu ratio leads to an asymptotic value of 1/2. The cluster IPC, in this case of perfectly aligned execution, is IPCc = 16·IPCi, and its average is equal to the IPC of a single core.
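The per-core IPC expression given above for programs made only of ALU and memory operations can be checked numerically. The following is a minimal sketch, not the authors' tooling: the function names `per_core_ipc` and `cluster_ipc` are hypothetical, and the model assumes perfectly aligned execution with 1-cycle ALU operations and 2-cycle conflict-free TCDM accesses, as stated in the text.

```python
from fractions import Fraction

N_CORES = 16    # cluster size stated in the text
ALU_CYCLES = 1  # cost of one ALU instruction
MEM_CYCLES = 2  # cost of one conflict-free TCDM access


def per_core_ipc(n_alu: int, n_mem: int) -> Fraction:
    """IPC = (Nalu + Nmem) / (1*Nalu + 2*Nmem) for a single core."""
    if n_alu < 0 or n_mem < 0 or n_alu + n_mem == 0:
        raise ValueError("need a non-empty, non-negative instruction mix")
    instructions = n_alu + n_mem
    cycles = ALU_CYCLES * n_alu + MEM_CYCLES * n_mem
    return Fraction(instructions, cycles)


def cluster_ipc(n_alu: int, n_mem: int, n_cores: int = N_CORES) -> Fraction:
    """IPCc = n_cores * IPCi, valid only when all cores stay aligned."""
    return n_cores * per_core_ipc(n_alu, n_mem)


if __name__ == "__main__":
    # Sweep the share of memory instructions in a 100-instruction loop body.
    for mem_pct in (0, 25, 50, 75, 100):
        ipc = per_core_ipc(100 - mem_pct, mem_pct)
        ipcc = cluster_ipc(100 - mem_pct, mem_pct)
        print(f"{mem_pct:3d}% memory: IPC = {float(ipc):.3f}, IPCc = {float(ipcc):.2f}")
```

With no memory instructions the model gives IPC = 1 and IPCc = 16; with only memory instructions it gives the asymptotic values 1/2 and 8.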
The private architecture has an initially lower IPC due to the cold-miss effect discussed in the previous paragraph.

As shown in Figure 10, assuming a cache line is made of four 32-bit words, there will be 4 groups of 4 processors accessing the same line (i.e. bank) but requesting instructions at different addresses. When this situation arises, the average hit time increases from 1 cycle (concurrent access) to 4 cycles (conflicting requests are served in a round-robin fashion). This particular case clearly shows how this architecture is sensitive to...
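The jump from a 1-cycle to a 4-cycle hit time can be reproduced with a small model. This is a minimal sketch under stated assumptions, not the arbitration logic of the actual hardware: the line index is taken to identify the bank, requests for the same address in a bank are served together, requests for different addresses in the same bank are served one per cycle in round-robin order, and the cores run in lockstep so that each core's steady-state fetch period equals the number of distinct addresses contending for its bank. The names `fetch_periods` and `average_hit_time` are hypothetical.

```python
from collections import defaultdict

WORDS_PER_LINE = 4  # a cache line holds four 32-bit words, as in the text


def fetch_periods(word_addresses, words_per_line=WORDS_PER_LINE):
    """Return, for each core, the cycles between two served fetches.

    word_addresses[i] is the word address requested by core i.
    Assumption: bank index == line index (address // words_per_line).
    """
    distinct_per_bank = defaultdict(set)
    for addr in word_addresses:
        distinct_per_bank[addr // words_per_line].add(addr)
    # Round-robin serves one distinct address per cycle in each bank.
    return [len(distinct_per_bank[addr // words_per_line])
            for addr in word_addresses]


def average_hit_time(word_addresses, words_per_line=WORDS_PER_LINE):
    """Average steady-state hit time in cycles over all requesting cores."""
    if not word_addresses:
        raise ValueError("at least one request is required")
    periods = fetch_periods(word_addresses, words_per_line)
    return sum(periods) / len(periods)


if __name__ == "__main__":
    same_address = [0] * 16            # all 16 cores fetch the same word
    staggered = list(range(16))        # 4 groups of 4 cores, one line each
    print(average_hit_time(same_address))
    print(average_hit_time(staggered))
```

Under these assumptions, 16 cores fetching the same word see a 1-cycle hit time, while 16 cores fetching 16 consecutive words form 4 groups of 4 and see 4 cycles each.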
