Implementation and Experiments. We summarize the results on probabilistic shielding of an agent for the arcade game Pac-Man. The task is to eat the food in a maze without being eaten by ghosts: Pac-Man achieves a high score if it eats all the food as quickly as possible while minimizing the number of times it is eaten by the ghosts. Each instance of the game is modeled as an arena in which Pac-Man is the avatar and the ghosts are adversaries. The safety specification is that the avatar is not eaten, with high probability. Tokens represent the food at each position in the maze; each token is either present or already eaten. Eating food earns a reward (+10), while each step incurs a small penalty (−1). A large reward (+500) is granted if Pac-Man eats all the food in the maze; if Pac-Man is eaten, a large penalty (−500) is imposed and the game restarts.

The ghost behavior is learned from the original Pac-Man game for each ghost. Transferring the resulting stochastic behavior to any arena (without tokens) yields the safety-relevant MDP. For that MDP, a shield is computed via the model checker STORM [7] for a horizon of 10 steps. The implementation uses an approximate Q-learning agent (with α = 0.2, γ = 0.8, and ε = 0.05) and the following feature vector: (1) the distance to the next food, (2) whether a ghost collision is imminent, and (3) whether a ghost is one step away.

Figure 5 (left) shows a screenshot from a series of videos¹. Each video compares how RL performs, shielded or unshielded, on a Pac-Man instance. In the shielded version, the risk of each potential decision is indicated by the colors green (low), orange (medium), and red (high). Figure 5 (right) depicts the scores obtained during RL, composed of the rewards and penalties mentioned above. Table 1 shows the results: it lists the number of model-checking calls, the time to construct the shield, the scores with and without the shield, and the winning rate.
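The combination of linear approximate Q-learning with a shield that restricts the action set can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the feature extractor, the `allowed` action sets (which a real shield would derive from STORM's 10-step safety probabilities), and all class and function names are assumptions; only the hyperparameters α = 0.2, γ = 0.8, ε = 0.05 and the update scheme of standard approximate Q-learning come from the text.

```python
import random

# Hyperparameters reported in the text.
ALPHA, GAMMA, EPSILON = 0.2, 0.8, 0.05  # learning rate, discount, exploration

class ApproximateQAgent:
    """Linear approximate Q-learning: Q(s, a) = sum_f w[f] * phi_f(s, a).

    `feature_fn(state, action)` returns a dict of named feature values,
    standing in for the paper's three features (distance to next food,
    imminent ghost collision, ghost one step away).
    """

    def __init__(self, feature_fn, actions):
        self.feature_fn = feature_fn
        self.actions = list(actions)
        self.weights = {}  # feature name -> learned weight

    def q_value(self, state, action):
        return sum(self.weights.get(f, 0.0) * v
                   for f, v in self.feature_fn(state, action).items())

    def choose_action(self, state, allowed=None, rng=random):
        # A shield passes `allowed`: the subset of actions whose 10-step
        # probability of being eaten stays below a threshold (hypothetical
        # interface). Unshielded RL considers all actions.
        candidates = list(allowed) if allowed is not None else self.actions
        if rng.random() < EPSILON:
            return rng.choice(candidates)          # epsilon-greedy exploration
        return max(candidates, key=lambda a: self.q_value(state, a))

    def update(self, state, action, reward, next_state, next_allowed=None):
        # Standard TD update on the linear weights.
        nxt = list(next_allowed) if next_allowed is not None else self.actions
        target = reward + GAMMA * max(self.q_value(next_state, a) for a in nxt)
        diff = target - self.q_value(state, action)
        for f, v in self.feature_fn(state, action).items():
            self.weights[f] = self.weights.get(f, 0.0) + ALPHA * diff * v
```

With this interface, the shielded and unshielded variants differ only in whether a restricted `allowed` set is passed to `choose_action` and `update`; the learning rule itself is unchanged.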
For all instances, we see a large difference in scores, because Pac-Man is often saved by the shield. For the two largest instances, with 3 and 4 ghosts, a shield that plans 10 steps ahead is not always sufficient to prevent Pac-Man from being encircled by the ghosts. Nevertheless, the shield still saves Pac-Man in many situations, leading to superior scores. Moreover, the shield helps to learn an optimal policy much faster, because fewer restarts are needed.
