Multi-Armed Bandit
The foundation of RL is based on three topics:
- Markov decision process (MDP)
- Dynamic Programming (DP)
- Monte Carlo methods
The last two lie at heart of all algorithms that intend to solve MDPs.
The exploitation-exploration tradeoff is emphasized when tackling canonical problem of sequential decision making under uncertainty.
Multi-Armed Bandit Testing
In any MAB experiment you have to establish three elements:
- reward
- actions
- environment
Reward
- Reward is used to quantify performance.
- In order to do so, the result of an action must be measurable.
- Ideally, the metric should be as simple as possible.
Policy Evaluation: The Value Function
- In an environment, an agent takes a specific action, a from a set of possible actions, A.
- The environment rewards the agent r from a range of potential rewards, R.
- As a result of the action, the environment moves to a new state, s.
- All of these variables can be stochastic.
- Stochastic means if your agent requested exactly the same action in precisely the same situation, the environment may return a slightly different reward or state.
How to choose the reward metric?
- One way of quantifying performance is to calculate the average reward for certain number of customers, N, for each action a.
- You basically sum the observed rewards and divide it by the number of customers.
Drawbacks
- This works but not computationally efficient in terms of storage.
- Also, you’d have to wait until lots of agents have interacted with the environment before you can compute this.
Fix
- Do an online algorithm, i.e. parameters of the algorithm are updated in place, rather than waiting until the end of all the experiments.
- rNavg←rN−1avg+N1(rN−rN−1avg) or r←r+α(r−r′) ; for derivations check out the book
- The above is called “exponentially weighted moving average”.
- Next is to decide which action to take based on the calculated reward.
Policy Improvement: Choosing the Best Action
- Allowing an agent full control over the choices it makes distinguished RL from ML.
- In RL, we have the notion of exploration and exploitation.
- You need to observe the rewards from all of the actions multiple times before you know which one is the best.
- Several methods are devised in other to balance between exploration and exploitation. The ϵ-greedy algorithm is one of them.
- In e-greedy, upon every action there is some probability that the action will be a random choice, specified by parameter ϵ.
Bandit Algorithm

- In step 4, with a probability of 1−ϵ, the agent should choose choose the action that produces the current highest average reward. The agent choose a random action with probability ϵ.
- Step 6 increments a counter for this action for use next.
- The above algorithm is a bandit algorithm. Over time this algorithm allows you to automatically pick the “arm” on the bandit machine that provides the highest reward. It allows you to automatically tend toward the version that produces the highest reward.
Simulating the Environment
- The best environment is always real life. But it can be too expensive, dangerous, or complicated to develop against.
- Instead you can build a simulation, which is a simplified version of a domain (like a 3D gaming engine), or modeled from collected data.
- Sample pseudo-function for a simulator:
def environment(a, p(a)):
'''
args:
* a: action
* p(a): probability of action a
returns:
* r: reward
'''
- The point is, simulations don’t have to be complex to be useful. They essentially accept an action along with its probability and outputs a reward with that probability.
Running the Experiment
Firgure below illustrates the workflow of a typical bandit test on a website:

- The hyperparameter ϵ controls the tradeoff (between exploration and exploitation) in the ϵ-greedy algorithm.
- The learning rate depends on the value of ϵ. When ϵ is high, the agent chooses random actions more often and explore more.
- Different values of ϵ affect the eventual optimal reward.
- The reward curve is asymptotic, i.e. tending toward some final value.
- The asymptote of the optimal action is n1+n1−ϵ=n2−ϵ, where n is the number of actions.
Improving the ϵ-greedy Algorithm
- All algorithms are fundamentally limited by how effectively they explore and how quickly they exploit.
- The best algorithms are able to do both at the same time, but defining how to explore is largely problem dependent. For example, a maze and an optimal healthcare schedule would require different exploration strategies.
- The simulation also matters.
- If the reward difference between two actions are small, the agent has to sample these two outcomes a lot.
- It is often better to choose the action based upon the current estimates of the distribution of rewards.
- In other words, rather than returning a single action, the agent returns the probabilities of each action, weighted by the expected rewards of each state. This is called softmax function.
- When the explore action is not yielding much, you could remove the ϵ-greedy action.
- But it is more common to reduce the value of ϵ over time.
- This is called annealing.
- A third popular improvement revolves around the idea that it doesn’t make sense to explore randomly, especially in simple simulations.
- The agent will learn more by exploring states that it hasn’t seen before.
- These algorithms add a bonus to each action for inadequately sampled states.
- They’re called upper-confidence-bound (UCB) methods.
- UCB algorithms are useful because they have no hyperparameters like ϵ.