Consider the following problem:
Each of the actions has an expected or mean reward given that the action is selected (we call it the value of that action). The value of an arbitrary action is the expected rewards given that action:
If you knew the value of each action, then, it would be trivial to solve the -armed bandit problem: you’d always select the action with highest value. We assume that you don’t know the action values, although you might have some estimates. We want that estimated value, denoted by , to be close to .
If you maintain the estimated action values, then at any time step there is at least one action whose estimated value is greatest greedy actions.
If you choose the greedy action, you’re exploiting, and if you choose nongreedy actions, you’re exploring.
Wether to explore or exploit depends in a complex way on:
There are many methods for balancing exploration and exploitation, but most of them make strong assumption about the stationarity and prior knowledge, which makes it hard to verify in most applications.
Action-value methods includes methods of estimating the values of actions and using those estimates to make action selection decisions.
Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received: