Introduction

The most important feature distinguishing reinforcement learning (RL) from other types of learning is that it uses training information that evaluates the actions taken rather than instructing by giving the correct actions (as in supervised learning).
- This is what creates the need for active exploration, that is, an explicit search for good behavior.

A $k$-armed Bandit Problem

Consider the following problem:

- You repeatedly choose among $k$ different options (actions).
- After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action taken.
- The objective is to maximize the expected total reward over some time period.

Each of the $k$ actions has an expected or mean reward given that the action is selected (we call it the value of that action). The value of an arbitrary action $a$ is the expected reward given that $a$ is selected:

$q_{*} (a) \doteq \mathbb{E}[R_t|A_t=a]$
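To make the problem statement concrete, here is a minimal sketch of a $k$-armed bandit in Python. The class name `KArmedBandit`, the method `pull`, and the helper `empirical_mean_reward` are hypothetical names chosen for illustration. The Gaussian reward distributions (true values drawn once from $N(0, 1)$, rewards drawn from $N(q_{*}(a), 1)$) are an assumption; the notes only require that each action's reward distribution is stationary.

```python
import random


class KArmedBandit:
    """Hypothetical k-armed bandit with stationary Gaussian rewards."""

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        # Assumption: the true values q_*(a) are drawn once from N(0, 1)
        # and never change afterwards (stationarity).
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Return one reward R_t drawn from N(q_*(action), 1)."""
        return self.rng.gauss(self.q_star[action], 1.0)


def empirical_mean_reward(bandit, action, n):
    """Average n rewards for one action; approaches q_*(action) as n grows."""
    return sum(bandit.pull(action) for _ in range(n)) / n
```

Averaging many pulls of a single arm should land close to that arm's entry in `q_star`, which is what the definition of $q_{*}(a)$ as an expectation says.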

If you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem: you’d always select the action with the highest value. We assume that you don’t know the action values, although you might have some estimates. We want the estimated value of action $a$ at time step $t$, denoted by $Q_t (a)$, to be close to $q_{*} (a)$.

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. These are called the greedy actions.

If you choose the greedy action, you’re exploiting, and if you choose nongreedy actions, you’re exploring.
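The distinction between exploiting and exploring can be sketched in a few lines of Python. The function names `greedy_actions` and `is_exploiting` are hypothetical, and `Q` is assumed to be a plain list of current estimates indexed by action. Ties are kept, since more than one action can share the greatest estimate.

```python
def greedy_actions(Q):
    """Return every action whose estimated value is greatest (ties included)."""
    best = max(Q)
    return [a for a, q in enumerate(Q) if q == best]


def is_exploiting(Q, action):
    """True if the chosen action is greedy; otherwise the choice is exploring."""
    return action in greedy_actions(Q)
```

For example, with `Q = [0.2, 1.5, 1.5, -0.3]` the greedy actions are 1 and 2, so choosing either of them is exploiting, while choosing 0 or 3 is exploring.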

Exploiting is the right thing to do to maximize the expected reward on the current step.
Exploring may produce a greater total reward in the long run.

Whether to explore or exploit depends in a complex way on:

- the precise values of the estimates
- the uncertainties in those estimates
- the number of remaining steps

There are many methods for balancing exploration and exploitation, but most of them make strong assumptions about stationarity and prior knowledge, and those assumptions are hard to verify in most applications.

Action-value Methods

Action-value methods are methods for estimating the values of actions and for using those estimates to make action-selection decisions.

Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:

$Q_t (a) \doteq \frac{\text{sum of rewards when $a$ taken prior to $t$}}{\text{number of times $a$ taken prior to $t$}}=\frac{\sum\limits_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum\limits_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$
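The sample-average estimate can be computed directly from a history of actions and rewards. The sketch below is one straightforward reading of the formula. The function name `sample_average` and the `default` parameter are hypothetical, and returning a default value when the action has never been taken (so the denominator is zero) is an assumption, since the formula leaves that case undefined.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) from the actions A_1..A_{t-1} and rewards R_1..R_{t-1}."""
    total = 0.0  # numerator: sum of R_i over steps where A_i == a
    count = 0    # denominator: number of steps where A_i == a
    for A_i, R_i in zip(actions, rewards):
        if A_i == a:
            total += R_i
            count += 1
    # Assumption: fall back to `default` if action a was never selected.
    return total / count if count > 0 else default
```

With actions `[0, 1, 0, 2, 0]` and rewards `[1.0, 5.0, 2.0, -1.0, 3.0]`, action 0 was taken three times with rewards 1.0, 2.0, and 3.0, so its estimate is their mean, 2.0.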
