Reinforcement learning is a branch of machine learning in which a software agent takes actions in an uncertain, potentially complex environment in order to solve a particular problem. Through these actions, the agent learns to achieve a goal.
Reinforcement learning is one of the three machine learning paradigms, alongside supervised and unsupervised learning.
The main difference between Reinforcement Learning and supervised learning is that it doesn’t need labelled data or corrections for intermediate steps to improve the accuracy of a model. Instead, the main focus is on finding the right balance between investigating the unexplored areas of the problem and leveraging current knowledge to solve it.
In the upcoming sections, we will slowly work through the fundamentals of Reinforcement Learning and the components that need to be understood before solving a problem with this approach.
In Reinforcement Learning, the agent generates its own data by interacting with the environment. This enables the agent to learn from its actions through trial and error, rather than being told what the right actions are.
Let’s dive a bit deeper into this evaluative aspect of Reinforcement Learning by focusing on the problem of decision making with bandits, and understand in detail how this method works.
The K-armed Bandit Problem
We will use the K-armed bandit problem to formalize the problem of decision making under uncertainty.
We will also use it to understand some important concepts of Reinforcement Learning: rewards, time steps and values.
To understand these concepts better, let’s take an example.
Let’s consider a medical setup where COVID-19 vaccine trials are being conducted on humans. Three vaccines are being tested in this case.

The doctor gives these vaccines to the patients and observes any changes in their health. The doctor then finds that the red vaccine seems to be working more effectively; however, the sample size taken so far is not a large one.
So, now the doctor needs to decide between sticking with the best-performing treatment and continuing the study further. If the doctor sticks to the red vaccine and prescribes only that, then he will no longer be able to collect data on the other vaccines.
Perhaps one of the other two vaccines might turn out to be better once the sample of patients increases. You never know!! It may only appear worse at this moment due to chance.
This medical trial exemplifies decision making under uncertainty, and it is an instance of the K-armed bandit problem.
So, what did we have in that example? Let’s see!!
- We had an agent, the doctor in our case,
- We had k different actions from which the agent chooses, 3 in our case, as the doctor chooses one of the 3 vaccines.
- We had a reward based on the action taken. Here the reward is the effectiveness of the vaccine on the patient.

For the doctor to decide which action yields the best results, we must define the value of taking each action. These values are called Action Values, and the function describing them is the Action-Value Function.
The value of selecting an action is simply the expected reward/outcome of taking that action. We can represent this in terms of probabilities:
q*(a) = E[ Rt | At = a ] = Σr p(r | a) · r,   for each action a in {1, 2, …, k}
- So, q*(a) is defined as the expectation of the reward Rt given that we selected action At = a, for each possible action a in {1, 2, …, k}.
- This conditional expectation is further defined as the sum over all possible rewards, where each possible reward is multiplied by the probability of observing that reward.
- We can also extend this equation to a continuous reward case where you are doing something for an extended period of time and getting the reward for it.
The goal of all this is to maximize the expected reward. If the agent selects the action with the highest value, it achieves the goal of maximizing the expected reward.
That action is the argmax of q*(a), i.e., the argument that maximizes the function q*(a).
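To make this concrete, here is a minimal Python sketch of the idea, assuming made-up reward distributions for the three vaccines (a reward of 1 if the patient improves, 0 otherwise); it computes q*(a) as the expected reward for each action and then takes the argmax:

```python
# A minimal sketch of q*(a) and the argmax over actions.
# The reward probabilities below are made up purely for illustration.

# p(r | a): each action maps the possible rewards to their probabilities
reward_dist = {
    "red":    {1: 0.7, 0: 0.3},   # hypothetical: patient improves 70% of the time
    "blue":   {1: 0.5, 0: 0.5},
    "yellow": {1: 0.4, 0: 0.6},
}

def q_star(action):
    """q*(a) = sum over all rewards r of p(r | a) * r."""
    return sum(prob * reward for reward, prob in reward_dist[action].items())

values = {a: q_star(a) for a in reward_dist}
best_action = max(values, key=values.get)   # argmax over a of q*(a)

print(values)       # {'red': 0.7, 'blue': 0.5, 'yellow': 0.4}
print(best_action)  # 'red'
```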

So, why did we consider the bandit problem in the first place?
Well, it’s always better to consider issues and algorithm-design choices in the simplest possible setting. For example, maximizing rewards and estimating values are important subproblems in both bandits and full Reinforcement Learning.

Learning Action Values – Reinforcement Learning
Before getting into the details, let’s first understand the fundamentals of Action Values and how we can estimate and use them to find a solution to the problem.
To do this, we return to the same COVID-19 vaccine example we used for the k-armed bandit problem.
Before going further, let’s recall the definition of the Action Value.
q*(a) = E[ Rt | At = a ]
As per the formula above, q*(a) is defined as the expected reward received after action a has been taken.
Here q*(a) is not known to the agent, just as the effectiveness of each vaccine is not known to the doctor. We have to estimate it by computing a sample average.
Sample-Average Method
Qt(a) = (sum of rewards received when action a was taken prior to t) / (number of times action a was taken prior to t)
In the sample-average method, we take the total reward received for action a and divide it by the number of times that particular action has been selected prior to time t.
Let’s apply this way of estimating the Action Value to our vaccine example:
The doctor will decide which of the vaccines works best for the patients based on the Action Value. If the patient gets better, the doctor records a reward of 1. Otherwise he records a reward of 0.
Let’s take an example where the doctor prescribes the vaccines to 8 different patients. Based on the effectiveness of the vaccines, we calculate the Sample-Average Value of each vaccine, as shown in the figure below.

So based on the results, the doctor would assign the vaccine that he currently thinks is the best, the Red Vaccine in the above case. Prescribing the vaccine with the highest current estimated effectiveness is called taking the Greedy action.
The greedy action is the action that currently has the largest estimated value, and by selecting the greedy action the agent is exploiting its current knowledge.
We compute the greedy action by taking the argmax of Q, i.e., the action whose estimated value is largest (0.66 for the Red Vaccine in the case above).
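Below is a small sketch of the sample-average estimate and the greedy choice. The 0/1 outcomes for the 8 patients are hypothetical, picked so that the Red Vaccine’s estimate comes out at 2/3 ≈ 0.66 as in the example above:

```python
# Sample-average estimates for the vaccine example.
# The 0/1 outcomes below are hypothetical, chosen so that the red vaccine's
# estimate works out to 2/3 (about 0.66), matching the value in the text.
outcomes = {
    "red":    [1, 0, 1],   # 3 patients, 2 improved
    "blue":   [0, 1, 0],   # 3 patients, 1 improved
    "yellow": [0, 1],      # 2 patients, 1 improved
}

def sample_average(rewards):
    """Q_t(a): sum of rewards for action a / number of times a was selected."""
    return sum(rewards) / len(rewards) if rewards else 0.0

Q = {a: sample_average(r) for a, r in outcomes.items()}
greedy_action = max(Q, key=Q.get)   # exploit current knowledge

print(Q)              # {'red': 0.666..., 'blue': 0.333..., 'yellow': 0.5}
print(greedy_action)  # 'red'
```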
However, the agent might instead choose to explore and select a non-greedy action. This means sacrificing the immediate reward it would have received from the greedy action, in the hope of gaining more information about the other actions in the long run.
Exploration-Exploitation Dilemma in Reinforcement Learning

You might be wondering: why can’t the agent exploit its current knowledge by choosing the action with the best estimate and, at the same time, keep exploring to gain more information about the other actions?
Well, that’s not possible: at each time step the agent can select only one action, so with that choice it must either exploit or explore. This is one of the fundamental problems in Reinforcement Learning.
This is called the Exploration-Exploitation Dilemma.
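One standard way to trade off the two, not covered in detail in this post, is ε-greedy selection: exploit most of the time, but explore with a small probability ε. A minimal sketch, reusing the hypothetical estimates from the previous snippet:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(Q))   # explore: try any vaccine at random
    return max(Q, key=Q.get)            # exploit: current best estimate

# Hypothetical estimates carried over from the previous sketch
Q = {"red": 0.67, "blue": 0.33, "yellow": 0.5}
print(epsilon_greedy(Q))  # usually 'red', occasionally one of the others
```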
Incremental Sample-Average Value Function / Action Values
To understand Incremental Action Values, let’s revisit what we have learned so far about calculating the Sample-Average Value Function, and see why the incremental form is worth using.
| Action | T0 | T1 | T2 | T3 | T4 | T5 | Sample-average estimate |
|--------|----|----|----|----|----|----|-------------------------|
| A0 | R0 | R1 | R2 | R3 | R4 | R5 | Q5 = (R0 + R1 + R2 + R3 + R4) / 5 |
We have developed a computer program (a bot) to play Atari games. The bot starts playing, and the table above shows the action A0 taken by the bot at time steps T0, T1, …, T5. The reward at each time step is R0, R1, …, R5.
Therefore, the Sample Average Value function would be:
Q5 = (R0 + R1 + R2 + R3 + R4) / 5
In our case, the action has been selected 5 times before the 6th time step, so the sample-average value Q5 is the average of those 5 rewards (R0 through R4), as shown above.
Well, that was simple, wasn’t it??

The reason it seemed easy is that our Atari game was perhaps way too simple, and the bot learned it in 5 time steps.
But things won’t always be so simple and straightforward. There might be cases where the bot goes through an extremely large number of time steps, actions and rewards before it plays the game efficiently.
Storing every reward and recomputing the average at each step would significantly increase the cost of computation and memory. To handle this, we can use a recursive form of the Sample-Average value function.
To derive the recursive form of the Sample-Average value function, let’s rewrite its formula and work from there:
Qn+1 = (R1 + R2 + … + Rn) / n
= (1/n) · (Rn + R1 + R2 + … + Rn−1)
= (1/n) · (Rn + (n − 1) · (R1 + R2 + … + Rn−1) / (n − 1))
= (1/n) · (Rn + (n − 1) · Qn)
= (1/n) · (Rn + n·Qn − Qn)
= Qn + (1/n) · (Rn − Qn)
The formula above solves the problem of computation cost. We no longer need to store all the rewards and recompute the average from them.
To find the (n+1)-th average value Qn+1, we just need the previous estimate Qn and the current reward Rn.
The formula above can be read as follows:
Qn+1 = Qn + (1/n) · (Rn − Qn),   i.e.   NewEstimate = OldEstimate + StepSize · (Target − OldEstimate)
- The error in the estimate is the difference between the new target and the old estimate. Taking a step toward the target, which here is the new reward Rn, produces a new estimate that reduces this error.
- The size of that step is determined by the step-size parameter (1/n).
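As a quick sanity check, here is a short sketch of the incremental update on a hypothetical reward sequence, confirming that it produces the same value as the batch average without storing every reward:

```python
# Incremental form of the sample average:
#   Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
# The reward sequence is hypothetical; the point is that the incremental
# estimate equals the batch average without storing every reward.
rewards = [1, 0, 1, 1, 0]

Q, n = 0.0, 0
for R in rewards:
    n += 1
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    Q = Q + (1.0 / n) * (R - Q)

print(Q)                             # approximately 0.6
print(sum(rewards) / len(rewards))   # 0.6 as well: same value, no stored history needed
```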
Now that we have understood the concept behind Incremental Action Values, let’s see how we can use it to solve a Non-Stationary Bandit Problem.
Non-Stationary Bandit Problem
This is an important setting: most of the Reinforcement Learning problems we will encounter take the form of a Non-Stationary Bandit Problem.
So, what exactly is this Non-Stationary Bandit Problem?
To understand that, let’s take the same COVID-19 example with a slight modification. Let’s say the Red Vaccine is more effective than the other vaccines in the summer months, so its effectiveness increases from 0.66 to 0.9.

This is an example of a non-stationary bandit problem, where the distribution of rewards changes over time. The doctor is unaware of this change in the reward distribution but would like his estimates to adapt to it.

To handle this, we use a fixed step size. If the step size is a constant, say α = 0.1, then the most recent rewards affect the estimate more than the older ones do.
As the graph above shows, the most recent reward receives the greatest weight, and the weight given to older rewards goes down as time passes; the weights fade exponentially with time.
We will now derive how the weights fade exponentially. To do that, let’s start from the update equation:
Qn+1 = Qn + α · (Rn − Qn)
= α·Rn + (1 − α)·Qn
= α·Rn + (1 − α)·[ α·Rn−1 + (1 − α)·Qn−1 ]
= α·Rn + (1 − α)·α·Rn−1 + (1 − α)^2·Qn−1
= …
= (1 − α)^n · Q1 + Σi=1..n α·(1 − α)^(n−i) · Ri
This final expression is the initial value of Q plus a weighted sum of the rewards over time: it relates the current estimate Qn+1 to Q1 and all of the observed rewards.
The first term, (1 − α)^n · Q1, says that the contribution of Q1 decreases exponentially with time.
The second term, α·(1 − α)^(n−i) · Ri, says that rewards which are older in time contribute exponentially less to the sum. The weight α·(1 − α)^(n−i) given to the reward Ri decreases as the number of rewards after it increases; in fact, the weight decays exponentially according to the exponent on (1 − α).
Taken together, these two terms show that the influence of the initial value Q1 and of older rewards shrinks toward zero as more and more data arrives. Therefore, the most recent rewards contribute the most to our current estimate.
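To see the effect, here is a small simulation sketch, assuming the Red Vaccine’s success probability jumps from 0.66 to 0.9 partway through (illustrative numbers only); it compares the sample-average estimate with a constant step-size estimate (α = 0.1):

```python
import random

random.seed(0)

# Non-stationary reward: the red vaccine's success probability is assumed to
# jump from 0.66 to 0.9 halfway through (illustrative numbers only).
def pull(t, change_at=500):
    p = 0.66 if t < change_at else 0.9
    return 1 if random.random() < p else 0

alpha = 0.1                 # constant step size: recent rewards get more weight
q_sample, q_const, n = 0.0, 0.0, 0
for t in range(1000):
    R = pull(t)
    n += 1
    q_sample += (1.0 / n) * (R - q_sample)  # sample average: all rewards weighted equally
    q_const  += alpha * (R - q_const)       # exponential recency-weighted average

print(round(q_sample, 2))  # roughly the average of the two regimes (around 0.78)
print(round(q_const, 2))   # tracks the recent regime, much closer to 0.9
```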