Reinforcement Learning – Exploration vs Exploitation Tradeoff

Reinforcement Learning is an area of machine learning in which an agent learns to take actions that maximize reward in a given situation. It is used in a variety of fields, from automobiles to medicine and many others.

In Reinforcement Learning, the agent does not initially know the different states, the actions available in each state, the associated rewards, or the transitions to resulting states. It learns more and more about them by interacting with the environment.

There is a significant difference between Reinforcement Learning and Supervised Learning. In supervised learning, the training data comes with labels, and the model learns from those correct labels. In Reinforcement Learning there is no correct label, and the agent itself decides how to perform the given task. In the absence of a training set, the agent has to learn from its own experience by performing the task many times. If you want to learn more about Supervised learning then please visit this post.

In this post we will go into detail on understanding the concept of Exploration and Exploitation tradeoff with the help of examples.

At any given point in time, an agent's knowledge of the states, actions, rewards and resulting states is only partial, and this gives rise to the Exploration-Exploitation Dilemma.

Exploration and Exploitation in Reinforcement Learning

  • Exploration

Exploration is about long-term benefit: by trying actions whose value is still uncertain, the agent improves its knowledge of each action, which can lead to higher reward in the long run.

  • Exploitation

Exploitation uses the agent’s current value estimates and takes the greedy action, the one with the highest estimated value, to get the most reward. However, the agent is being greedy with respect to the estimated value and not the actual value, so it may well not get the most reward.
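
The gap between estimated and actual values can be made concrete with a small sketch. The action names and all numbers below are made up purely for illustration: `estimated_values` stands for what the agent currently believes, and `true_values` for the rewards it would actually receive.

```python
# Hypothetical numbers, for illustration only.
# What the agent currently believes each action is worth.
estimated_values = {"action_a": 5.0, "action_b": 2.0}
# What each action is actually worth (unknown to the agent).
true_values = {"action_a": 3.0, "action_b": 8.0}


def greedy_action(values):
    """Return the action with the highest value in the given dict."""
    return max(values, key=values.get)


chosen = greedy_action(estimated_values)  # what exploitation picks
best = greedy_action(true_values)         # what would really be best

print("greedy choice:", chosen, "-> actual reward", true_values[chosen])
print("best choice:  ", best, "-> actual reward", true_values[best])
```

Here the greedy choice follows the estimates, so it settles on an action whose actual reward is lower than that of the action it never tried.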

Let’s take an example to understand Exploration-Exploitation properly.

Let’s say you and your friend are each digging at a different spot in the hope of finding a diamond. Your friend gets lucky, finds a diamond before you do, and walks off happily.

Seeing this, you get a bit greedy and think that you might also get lucky. So, you start digging at the same spot as your friend.

Your action is called the greedy action and the policy is called the greedy policy.

In this situation, however, the greedy policy would fail, because a bigger diamond is buried where you were digging in the beginning.

When your friend found the diamond, the only knowledge you gained was the depth at which it was buried. You do not know what lies beyond that depth. In reality the diamond may be where you were digging in the beginning, where your friend was digging, or somewhere else entirely.

With such partial knowledge about future states and future rewards, our reinforcement learning agent faces a dilemma: should it exploit its partial knowledge to receive some reward, or explore unknown actions that could result in much larger rewards?

However, we cannot explore and exploit at the same time.

A common way to handle the Exploration-Exploitation Dilemma is the Epsilon Greedy Policy.

Epsilon Greedy Policy

A very simple way to choose between exploration and exploitation is to choose randomly. This can be done by exploiting most of the time and exploring only occasionally.

Example 1

Suppose we roll a die: if it lands on 1 we explore, otherwise we exploit.

This method is called Epsilon-Greedy action selection, where Epsilon refers to the probability of choosing to explore.

In the above case the value of Epsilon is 1/6, which is the probability of getting a 1 when rolling a die.
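
As a quick sanity check of that 1/6, the die rule can be simulated. This is only a minimal sketch; the function name `explore_fraction`, the seed and the number of rolls are illustrative choices, not part of any standard recipe.

```python
import random


def explore_fraction(num_rolls, seed=0):
    """Roll a fair six-sided die num_rolls times; explore whenever it shows 1.

    Returns the fraction of rolls on which we explored.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    explores = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 1)
    return explores / num_rolls


fraction = explore_fraction(60_000)
print(f"explored on {fraction:.3f} of rolls; epsilon = 1/6 = {1/6:.3f}")
```

Over many rolls the observed fraction should settle close to 1/6, which is the sense in which Epsilon is the probability of exploring.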

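Epsilon-Greedy can be represented as follows:

$$A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}$$

Here $Q_t(a)$ denotes the agent's estimated value of action $a$ at time step $t$. The action $A_t$ that the agent selects at time step $t$ is the greedy action (exploit) with probability (1-epsilon), or a random action (explore) with probability epsilon.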
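
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy`, the list `estimates` of current value estimates (one per action) and the optional `rng` argument are all assumptions made for this example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon, return a uniformly random action (explore);
    otherwise return the action with the highest estimate (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    # exploit: index of the largest estimate (first one wins on ties)
    return max(range(len(estimates)), key=lambda a: estimates[a])


# Illustrative estimates for three actions; action 1 looks best right now.
estimates = [0.2, 0.9, 0.5]
rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [epsilon_greedy(estimates, 1 / 6, rng) for _ in range(10)]
print(picks)
```

Because the random action is drawn from all actions, it can happen to be the greedy one, so in this sketch the greedy action is chosen slightly more often than (1-epsilon) of the time.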

Let’s take another example to understand the above expression better:

Example 2

With partial or no knowledge about future rewards, the Epsilon-greedy approach often works well because it balances exploitation of current knowledge with exploration of unknown actions.

In the above example, your friend has found a diamond, and from seeing that you know roughly how deep one has to dig to find one. So, you choose to dig where your friend was digging with probability (1-epsilon). This means you are taking the greedy action: you exploit your knowledge that a diamond was found there.

Or, with probability epsilon, you explore: you dig at a spot where no diamond has been found yet, because you still want to keep learning about other spots. Here epsilon is a positive real number between 0 and 1.
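
To see how this plays out over many attempts, the digging story can be turned into a tiny simulation. Everything numeric here is an assumption made for illustration: two spots, a made-up average diamond size for each (1.0 at the friend's spot, 3.0 at your original spot), uniform noise on each dig, and a starting belief that favours the friend's spot because that is where a diamond was seen. Estimates are kept as running averages, which is one common choice rather than the only one.

```python
import random

FRIEND_SPOT, OWN_SPOT = 0, 1
# Hypothetical average diamond size at each spot (unknown to the digger).
TRUE_MEANS = [1.0, 3.0]


def run(epsilon, steps=2000, seed=0):
    """Dig `steps` times with an epsilon-greedy policy.

    Returns (average reward per dig, final value estimates).
    """
    rng = random.Random(seed)
    # Start out believing in the friend's spot, since a diamond was seen there.
    estimates = [1.0, 0.0]
    counts = [1, 0]
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            spot = rng.randrange(2)  # explore: pick a spot at random
        else:
            # exploit: dig where the current estimate is highest
            spot = max((FRIEND_SPOT, OWN_SPOT), key=lambda s: estimates[s])
        # Noisy diamond size: true mean plus or minus up to 0.5.
        reward = rng.uniform(TRUE_MEANS[spot] - 0.5, TRUE_MEANS[spot] + 0.5)
        counts[spot] += 1
        # Running average of the rewards seen at this spot.
        estimates[spot] += (reward - estimates[spot]) / counts[spot]
        total += reward
    return total / steps, estimates


greedy_avg, greedy_est = run(epsilon=0.0)
eps_avg, eps_est = run(epsilon=0.1)
print(f"pure greedy    (epsilon=0.0): average reward {greedy_avg:.2f}")
print(f"epsilon-greedy (epsilon=0.1): average reward {eps_avg:.2f}")
```

With these made-up numbers, the pure greedy digger never leaves the friend's spot, so its estimate for the original spot never changes. The epsilon-greedy digger occasionally tries the original spot, learns that it pays more, and from then on digs there most of the time.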

