Multi-Armed Bandit

The foundation of RL is based on three topics:

The last two lie at the heart of all algorithms that intend to solve MDPs.

The exploitation-exploration tradeoff is emphasized when tackling the canonical problem of sequential decision-making under uncertainty.

Multi-Armed Bandit Testing

In any MAB experiment you have to establish three elements:

Reward

Policy Evaluation: The Value Function

How to choose the reward metric?

Drawbacks

Fix

Policy Improvement: Choosing the Best Action

Bandit Algorithm
[Figure: e_greedy_algo]
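
The two core steps of an ϵ-greedy bandit can be sketched as follows. This is a minimal illustration, not necessarily the exact algorithm in the figure: the function names select_action and update_estimate, the random tie-breaking, and the sample-average update are assumptions made for this example.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Pick an arm: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: choose any arm uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the arm with the highest estimated value,
    # breaking ties at random so no arm is favored by its index.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def update_estimate(q_values, counts, action, reward):
    """Incremental sample-average update: Q <- Q + (r - Q) / n."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```

With epsilon set to 0 the selection is purely greedy, and with epsilon set to 1 it is purely random; values in between trade exploitation against exploration.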

Simulating the Environment

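import random

def environment(a, p_a):
	'''
	args:
		* a: action (the arm that was pulled)
		* p_a: probability associated with action a, i.e. p(a)
	returns:
		* r: reward
	'''
	# action and reward calculations
	# (assumption: a Bernoulli reward, 1 with probability p_a, else 0)
	r = 1 if random.random() < p_a else 0
	return r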

Running the Experiment

The figure below illustrates the workflow of a typical bandit test on a website:
[Figure: running_exp]
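
A bandit test of this kind can be simulated end to end with a short loop. The sketch below assumes a workflow in which each visitor is shown one variant, a binary conversion is recorded as the reward, and the value estimate for that variant is updated. The function name run_bandit_test, the conversion rates, and the default parameters are hypothetical and chosen only for illustration.

```python
import random

def run_bandit_test(true_rates, n_visitors=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit test with Bernoulli rewards."""
    rng = random.Random(seed)       # seeded for reproducible runs
    n_arms = len(true_rates)
    counts = [0] * n_arms           # times each variant was shown
    values = [0.0] * n_arms         # estimated conversion rate per variant

    for _ in range(n_visitors):
        # 1. Choose a variant for this visitor (explore or exploit).
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: values[a])

        # 2. Observe the reward: a simulated conversion (1) or not (0).
        reward = 1 if rng.random() < true_rates[arm] else 0

        # 3. Update the running average for the chosen variant.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return counts, values

# Hypothetical conversion rates for three page variants.
counts, values = run_bandit_test([0.05, 0.10, 0.02])
print("times shown:", counts)
print("estimated rates:", [round(v, 3) for v in values])
```

In a live test the simulated reward in step 2 would be replaced by the visitor's actual response; the selection and update steps stay the same.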

Improving the ϵ-greedy Algorithm