Multi-armed bandit problem
Multi-armed bandit problem. The multi-armed bandit problem (MAB) is a sub-field of reinforcement learning, a class of sequential decision-making problems that has been widely studied in probability theory and machine learning [8]. In recent years, MAB has been used to model problems such as adaptive routing and server selection in networks, and click-through probabilities in online advertising. The classic MAB problem is formulated as a system with k different arms (actions). Each time the player chooses an arm, a numerical reward is received, drawn from a probability distribution associated with that arm which is unknown in advance [71]. The aim is to maximize the sum of rewards by repeatedly playing these arms over multiple rounds, so as to asymptotically approach the reward of always playing the arm that is optimal in hindsight. Some classic MAB problems are briefly summarized as follows [110]:
• Binary or Bernoulli MAB problem: Each arm issues a reward of one with some unknown probability, and a reward of zero otherwise.
• Restless bandit problem: Each arm represents an independent Markov machine; when an arm is played, the state of that machine advances according to its Markov transition probabilities, and in the restless variant the states of non-played arms may evolve as well.
• Adversarial bandit problem: At each iteration, an agent chooses an arm while an adversary simultaneously chooses the payoff structure for each arm. Because the adversarial bandit problem removes all distributional assumptions, it is a generalization of the stochastic bandit problem.
In addition, some other variants of MAB problems have been investigated, such as dueling bandits [135, 136], collaborative bandits [66, 46] and combinatorial bandits [43, 30, 86]. The central feature of MAB problems is the trade-off between exploration, i.e., trying new arms to gain further knowledge, and exploitation, i.e., selecting the best arm based on the experience accumulated so far.
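The exploration-exploitation trade-off above can be illustrated with a minimal sketch of the epsilon-greedy rule on a Bernoulli bandit. The arm success probabilities and all parameter values below are hypothetical, chosen only for illustration; the player sees only the sampled rewards, not the probabilities.

```python
import random

def epsilon_greedy_bernoulli(probs, rounds=10000, epsilon=0.1, seed=0):
    """Play a k-armed Bernoulli bandit with the epsilon-greedy rule.

    probs: hypothetical per-arm success probabilities (hidden from the player).
    Returns the total reward collected and the per-arm play counts.
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k            # number of times each arm was played
    means = [0.0] * k           # empirical mean reward of each arm
    total = 0
    for _ in range(rounds):
        if rng.random() < epsilon:          # explore: pick a random arm
            arm = rng.randrange(k)
        else:                               # exploit: best empirical arm so far
            arm = max(range(k), key=lambda a: means[a])
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running mean
        total += reward
    return total, counts

total, counts = epsilon_greedy_bernoulli([0.2, 0.5, 0.8])
# With enough rounds, the arm with success probability 0.8 is played most often.
```

Note that a fixed epsilon keeps exploring forever, so the regret grows linearly in T; the index-based algorithms discussed next shrink exploration over time to achieve logarithmic regret.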
The MAB problem is said to be solved if the regret of the proposed algorithm after T rounds matches the known lower bound on the regret, i.e., Regret = O(log T). Here the regret is the gap between the cumulative reward of always playing the optimal arm and the cumulative reward actually collected by the algorithm. Over the past decades, many distinct algorithms have been developed to solve MAB problems, such as the epsilon-greedy algorithm [110], upper confidence bound (UCB) [62], Thompson sampling (TS) [25] and EXP3 [96].
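As a concrete example of such an algorithm, the following is a minimal sketch of the UCB1 index rule on a Bernoulli bandit: each round plays the arm maximizing the empirical mean plus a confidence bonus sqrt(2 ln t / n_a), which shrinks as arm a is sampled more often. The arm probabilities and round count are hypothetical values chosen for illustration.

```python
import math
import random

def ucb1(probs, rounds=10000, seed=0):
    """UCB1 on a Bernoulli bandit.

    Plays the arm maximizing means[a] + sqrt(2*ln(t) / counts[a]); the bonus
    term forces under-sampled arms to be tried (exploration) while the mean
    term favors arms that have paid off (exploitation).
    probs: hypothetical per-arm success probabilities (hidden from the player).
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k
    means = [0.0] * k
    total = 0
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1          # initialization: play each arm once
        else:
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running mean
        total += reward
    return total, counts

total, counts = ucb1([0.3, 0.5, 0.7])
# Suboptimal arms are played only O(log T) times, so the best arm dominates.
```

Because each suboptimal arm is played only O(log T) times in expectation, the cumulative regret of UCB1 grows as O(log T), matching the lower bound mentioned above.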
