Introduction to Machine Learning

This post is a mostly theoretical introduction to what Machine Learning is and to its different paradigms.

So, let’s take a typical definition of Machine Learning and then break it down to understand what it means:

“A machine or agent is said to learn from experience with respect to some class of tasks, and a performance measure P, if [the learner’s] performance at tasks in the class as measured by P, improves with experience.”

So, now let’s break down this definition into 3 components and see what the definition says:

  1. The first component is the class of tasks. Learning must be defined with respect to a specific class of tasks, such as answering questions in an exam or diagnosing patients with a specific illness.
  2. The second component is the performance measure P, which tells us whether learning is actually happening. Without it, we could only make vague statements with nothing solid to back them up. For an exam, the performance measure would be the marks you obtain; for treating an illness, it could be the number of patients who had no adverse reaction to the drugs you gave.
  3. The third component is experience: performance must improve with experience. Suppose you are learning to ride a bike: the more you ride, the better you get at it. Likewise, the more patients you look at, the better you become at diagnosing the illness.

Now let’s discuss the different paradigms of Machine Learning:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

1. Supervised Learning

Supervised learning is where you learn a mapping from inputs to outputs.

In supervised learning you are given some kind of input, for example the description of a patient who comes to a clinic, and an output, for example whether the patient has a certain disease or not. The machine learning model has to learn this input-output map.

As shown in the chart above, Supervised learning can be either Classification or Regression.

  • Classification: As discussed above, supervised learning is essentially about mapping an input to its required output. If that output is categorical, the problem is a classification problem. Ex: whether it’s going to rain or not, or whether the answer to a question is true or false.
  • Regression: If the output is a continuous value, the problem is a regression problem. Ex: what a person’s height will be, or how much rainfall to expect tomorrow.

2. Unsupervised Learning

Unlike supervised learning, where every input has an output mapped to it, the goal in unsupervised learning is not to produce an output in response to an input but to discover patterns in a given set of input data.

So, there is no desired output that we are trying to predict in unsupervised learning; it is all about finding patterns in the data.

So, with that let’s understand the types of Unsupervised Learning:

  • Clustering: This is a type of unsupervised learning in which we try to find cohesive groups within the input data. One example, as you can see in the graph below, is clustering Marvel superheroes into one category and DC superheroes into another.
  • Association Mining: Also known as frequent pattern mining, this is another kind of unsupervised learning, where we are interested in finding frequent co-occurrences of items in the given data. For example, the cartoon below illustrates association mining: customers who bought bananas also bought other items such as carrots, milk and bread.

3. Reinforcement Learning

This is the third machine learning paradigm, which is neither supervised nor unsupervised. These are problems where you are learning to control the behavior of a system. We will discuss this in detail later in the post.

Now that we have talked briefly about the different machine learning paradigms, let’s understand a bit more about the performance measure.

Performance Measure

  1. As discussed earlier, every task needs a performance measure. For a classification task, the performance measure would be the classification error. Ex: how many patients were diagnosed with the disease but did not actually have it, or how many patients were diagnosed as healthy but actually had it. This is the measure we would like to use, but we will see later that it is often not possible to learn directly from this kind of performance measure.
  2. Likewise, for regression the performance measure is the prediction error. Suppose the actual height of a person is 165cm but the predicted height is 190cm; that is a big prediction error.
  3. For clustering, it’s a bit trickier to define the performance measure. The challenge is that we do not know which clustering algorithm is good, because we are not sure how to measure the quality of clusters. One popular measure is the scatter or spread of a cluster, which tells us how spread out the points belonging to a single group are. If a group is not cohesive, that is, its members are not close together, then the clustering is of poor quality.
  4. For association mining, we can use support and confidence. We will discuss these later.
  5. Reinforcement learning tasks are about learning to control a system, and controlling the system incurs a cost. The performance measure in this case is that cost, and you try to minimize the cost you accrue while controlling the system.
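To make the first three measures concrete, here is a minimal sketch in Python. The exact formulas are assumptions chosen for illustration: the post does not fix a formula for prediction error or scatter, so the sketch uses mean squared error and the average squared distance from the cluster centre. The function names are hypothetical.

```python
from statistics import mean

def classification_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared gap between actual and predicted values (assumed measure)."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def cluster_scatter(points):
    """Average squared distance of 2-D points from their cluster centre (assumed measure)."""
    cx = mean(x for x, _ in points)
    cy = mean(y for _, y in points)
    return mean((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)

# One healthy patient out of four is wrongly diagnosed as sick.
print(classification_error(["sick", "healthy", "sick", "healthy"],
                           ["sick", "sick", "sick", "healthy"]))
# Actual height 165cm, predicted 190cm.
print(mean_squared_error([165], [190]))
# A tight group has a smaller scatter than a spread-out one.
print(cluster_scatter([(0, 0), (2, 0)]), cluster_scatter([(0, 0), (10, 0)]))
```

In all three cases a smaller number means better performance, which is what lets us say whether learning is happening.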

Challenges while implementing a Machine Learning Solution

Now that we have discussed the different machine learning paradigms and their performance measures, let’s see what challenges we could face while implementing them.

How good is a model?

How do I choose a model?

This is the most important question, now and going forward: given some data (the experience we have talked about so far), how do we choose a model that learns what we want to accomplish and improves with experience? And once we have chosen a model, how do we find the parameters that make it give the right answers?

Do we have enough data?

Do we have enough experience to say that the model is good?

Is the data of sufficient quality, or could there be errors in it?

For example, suppose I have medical data in which age is recorded as 225. This could mean anything: it could be 225 days, which is reasonable, or it could be 22.5 years or 22.5 months, which are reasonable too. But 225 years is not a reasonable age, so if that is what it means, something is wrong in the data.

How confident can I be of the results?

Am I describing the data correctly?

This is a domain-dependent question that you learn to answer only with time and experience as a machine learning or data science professional. However, there are typical questions that you would want to ask.

For example:  

a) Are age and income enough? Should we look at the gender as well?

b) How should age be represented: in numeric form, or in categorical form like young, middle age, old?
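As a small illustration of the data quality and representation questions, here is a sketch in Python. The cut-offs (30 and 60 for the age groups, 120 as the largest plausible age) and the function names are assumptions made up for this example, not values from any dataset.

```python
def is_plausible_age(age_years, max_age=120):
    """Flag values such as 225 that cannot be a human age in years."""
    return 0 <= age_years <= max_age

def age_group(age_years):
    """Map a numeric age to a coarse category; the cut-offs are illustrative only."""
    if age_years < 30:
        return "young"
    if age_years < 60:
        return "middle age"
    return "old"

for age in [22.5, 45, 71, 225]:
    if is_plausible_age(age):
        print(age, "->", age_group(age))
    else:
        print(age, "-> check the record, this is not a reasonable age in years")
```

Whether the numeric or the categorical form works better depends on the problem; the categorical form throws away detail but can be easier to interpret.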

1. Supervised Learning

Let’s understand supervised learning in a bit more detail by taking an example.

As discussed briefly above, in supervised learning our experience consists of data described by some set of attributes. In this case, let us assume that we have a customer database and we describe each customer by two attributes, age and income.

So, let’s say we know the age and the income level of each customer that comes to the shop.

Goal: Our goal is to predict whether a customer will buy a computer or not. Let’s say we are provided with labeled data as shown below and have to build a classifier from it. As discussed above, classification is where the output is a discrete value. In our case the discrete value is “yes, the customer will buy a computer” or “no, the customer will not buy a computer”.

And we describe the input as a set of attributes. In our case we are looking at age and income as the attributes that describe the customer.

The goal here is to come up with a function that takes age and income as input and outputs whether the person will buy a computer or not.

Classifier 1:

Since we are looking at a geometric interpretation of the data, with the data as points in space, one of the most natural ways of defining this function is to draw lines or curves in the input space. One possible example is to draw a line so that everything to its left, where the points are red, is classified as will not buy a computer, and everything to its right, where the points are blue, is classified as will buy a computer.

In this case the function says that if the income of a person is less than some value X, the person will not buy a computer, and if the income is greater than X, the person will buy a computer. So, this is a very simple function.

However, it completely ignores one of the variables, the age, and goes by income alone. This is not the best classifier we could have, but it still gets most of the points correct. We can do slightly better than this.
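Written as code, Classifier 1 is a one-line rule. This is only a sketch: the threshold value of 50,000 is a made-up stand-in for X, and the function name is hypothetical.

```python
INCOME_THRESHOLD = 50_000  # hypothetical value of X

def classifier_1(age, income, threshold=INCOME_THRESHOLD):
    """Predict from income alone; age is accepted but deliberately ignored."""
    return "will buy" if income > threshold else "will not buy"

print(classifier_1(age=25, income=30_000))
print(classifier_1(age=45, income=80_000))
```

Notice that changing the age never changes the answer, which is exactly the weakness described above.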

Classifier 2:

Now let’s check out this classifier. The 2 red points that the previous classifier put on the wrong side of the line are now correctly classified. Everything to the left of this line will not buy a computer and everything to the right will buy a computer. So, we have improved our performance measure compared to the previous classifier.

What is the difference between the previous classifier and the current classifier?

Earlier we took only income as an input to the classifier; now we also take age. For an older person, the income threshold for buying a computer is higher; for a younger person, it is lower.

As you can see from the graph above, the older you are, the further the income threshold shifts to the right. So the older you are, the higher your income must be before you buy a computer, while a younger person does not mind buying a computer even with a slightly lower income.
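In code, the only change from an income-only rule is that the threshold now depends on age. The two constants below are invented for illustration (they are not fitted to any data), and the function name is hypothetical.

```python
BASE_INCOME = 20_000      # hypothetical intercept of the slanted line
INCOME_PER_YEAR = 1_000   # hypothetical slope: extra income needed per year of age

def classifier_2(age, income):
    """Income threshold rises with age, so the boundary is a slanted line."""
    threshold = BASE_INCOME + INCOME_PER_YEAR * age
    return "will buy" if income > threshold else "will not buy"

# Same income, different ages, different predictions.
print(classifier_2(age=25, income=50_000))
print(classifier_2(age=45, income=50_000))
```

With these made-up numbers, a 25-year-old earning 50,000 is above the threshold of 45,000, while a 45-year-old with the same income is below the threshold of 65,000.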

Classifier 3:

This classifier gets everything correct except for one red point. We get much better performance, but at the cost of a more complex classifier.

If you think about it in geometric terms,

  1. First we had a line parallel to the y-axis, so we only needed to define its intercept on the x-axis.
  2. The second function was a slanting line, so we needed to define both the intercept and the slope.
  3. Now the boundary is quadratic, so we have to define three parameters, as in ax² + bx + c.
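The growing parameter count can be seen directly if we write the three boundaries as functions. This is a sketch under assumptions: x stands for income and y for age, and treating points below the line or curve as the "will buy" side is a convention chosen here, not something the figures dictate.

```python
def boundary_1(c):
    """Vertical line x = c: one parameter."""
    return lambda x, y: x > c

def boundary_2(m, c):
    """Slanted line y = m*x + c: two parameters."""
    return lambda x, y: y < m * x + c

def boundary_3(a, b, c):
    """Quadratic curve y = a*x**2 + b*x + c: three parameters."""
    return lambda x, y: y < a * x ** 2 + b * x + c

# Each call fixes the parameters and returns a classifier for points (x, y).
print(boundary_1(5)(6, 0))
print(boundary_2(1, 0)(3, 2))
print(boundary_3(1, 0, 0)(2, 3))
```

Every extra parameter gives the boundary more freedom to bend around the training points, which is why accuracy on the training data goes up as the classifier gets more complex.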

Classifier 4:

This classifier somehow doesn’t seem right. It seems to be too complex a function just to get that one point right.

Another thing to note is that this particular red point is surrounded by a sea of blue, so it is quite likely that there was some glitch in the data.

Summarizing the results above, these are the questions we should consider:

  1. What complexity of classifier would I like to have, versus what accuracy?
  2. How good is the classifier at actually recovering the right input-output map?
  3. Is the input data clean or is there noise in it? If there is, how do I handle that noise? These are the kinds of issues that we must look at.

Inductive Bias

So, what do we mean by inductive bias?

Let’s go back to the classifiers that we saw previously. The lines that we drew were based on an assumption that let us generalize from the data and say something about the entire space.

The data was in the form of discrete points in space, and from these discrete points we were supposed to generalize and say something about the entire input space in the form of a label, which in our case was “Will buy a computer” or “Will not buy a computer”.

If we make no assumptions about these lines, the only thing we can do is recognize repeats: if the same customer comes again, or somebody with exactly the same age and income as a previous customer, we can tell whether that person is going to buy a computer. We will not be able to say anything about cases outside of our experience.

So, the assumption we made is that everyone to the left of the line will not buy a computer and everyone to the right will buy a computer.

In other words, we assumed that lines (or curves) can separate the people who will buy from the people who will not.

That is an assumption about the distribution of the input data and the class labels.

Assumptions of this kind are known as inductive biases.

Inductive bias is of two types:

  1. Language Bias: Language bias tells us about the type of lines that we are going to draw: whether we draw straight lines or curves, what order of polynomial we look at, and so on.
  2. Search Bias: Search bias is the other form of inductive bias; it tells us in what order we are going to examine all those possible lines.

So, putting these together we will be able to generalize from a few training points to the entire space of inputs.

So, now that we have gone through the different classifiers and assumptions, let’s look at the whole process from an implementation point of view.

In the above figure we have a training set consisting of inputs X and outputs Y. So we have a set of inputs X1, X2, X3, X4 and outputs Y1, Y2, Y3, Y4, and this data is fed into a training algorithm.

The X’s are the input variables, which in this case hold the income and the age, so X1 is like 30,000 and 25, X2 is like 80,000 and 45, and so on. The Y’s are the labels, which correspond to the colors in the previous picture.

So Y1 is “does not buy a computer”, Y2 is “buys a computer”, and so on. This gives us the color coding: Y1 is red and Y2 is blue. Before doing any kind of classification, we need to normalize the numbers, because the values of X vary too much: incomes differ a lot from person to person, sometimes with a large variance, and they are on a very different scale from ages. So we typically normalize the values so that they fall in approximately the same range.

In the above figure you can see the normalized values; likewise, we have encoded “does not buy” as -1 and “buys a computer” as +1. These values are fed to the training algorithm, which outputs a classifier. However, we still do not know how good or bad our classifier is.
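Here is a minimal sketch of that preparation step in Python. The post only says the values are brought into approximately the same range; using the z-score (subtract the mean, divide by the standard deviation) is an assumption made for this example, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (assumed scheme).

    Assumes the values are not all identical, otherwise the deviation is zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def encode_label(label):
    """Map the two classes to -1 and +1."""
    return 1 if label == "buy" else -1

incomes = [30_000, 80_000]   # the X1 and X2 incomes from the text
ages = [25, 45]              # the X1 and X2 ages from the text
labels = ["not buy", "buy"]

print(standardize(incomes))
print(standardize(ages))
print([encode_label(label) for label in labels])
```

After this step income and age live on the same scale, so neither attribute dominates just because its raw numbers are bigger.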

To validate the quality of our classifier we use a test set or validation set, which is another set of x and y pairs like the ones in the training set.

In the test set we know what the labels are; we just do not show them to the training algorithm. We need the correct labels to evaluate whether the trained classifier is doing well or badly. The process by which this evaluation happens is called validation.

At the end of validation, if we are satisfied with the quality of the classifier, we can keep it. If we are not satisfied, we go back to the training algorithm: we can go over the data again and try to refine the parameter estimates, or we can change some parameter values and redo the training all over again.
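The validation step can be sketched in a few lines of Python. The helper names, the 75/25 split and the choice of accuracy as the score are all assumptions for illustration; real projects usually shuffle the data before splitting.

```python
def split(examples, train_fraction=0.75):
    """Hold out the tail of the data as a validation set (no shuffling, for clarity)."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

def accuracy(classifier, examples):
    """Fraction of (x, y) pairs for which classifier(x) equals the known label y."""
    correct = sum(1 for x, y in examples if classifier(x) == y)
    return correct / len(examples)

# Hypothetical one-dimensional data: (normalized income, label).
data = [(-1.2, -1), (0.9, 1), (-0.3, -1), (0.4, -1)]
train_set, validation_set = split(data)

# A stand-in for whatever the training algorithm produced from train_set.
classifier = lambda x: 1 if x > 0 else -1

print("validation accuracy:", accuracy(classifier, validation_set))
```

The key point is that the labels in the validation set are used only for scoring, never for training.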

Training Algorithm

Let’s now understand a bit about what happens inside the training algorithm.

Inside the training algorithm there is a learning agent which takes an input and produces an output Ŷ that it thinks is the correct output. It then compares Ŷ against the actual target Y that was provided for training.

So, during training we have a target Y; the agent compares its output Ŷ against the target Y, figures out what the error is, and uses the error to change itself so that it can produce the right output next time around.

This is essentially an iterative process: an input produces an output Ŷ, we compare Ŷ with the target Y, work out the error, and use the error to change the agent again.

This is by and large how most learning algorithms operate, including most classification and regression algorithms.
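One concrete instance of this loop is the classic perceptron update, sketched below. This is a technique swapped in for illustration, not necessarily the algorithm behind the figures in this post; the two training points are the normalized X1 and X2 with labels -1 and +1.

```python
def predict(weights, bias, x):
    """Output +1 or -1 depending on which side of the line x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1

def train_perceptron(examples, epochs=10, learning_rate=1.0):
    """Compare the output y_hat with the target y and nudge the agent by the error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = predict(weights, bias, x)
            error = y - y_hat  # 0 when correct, +2 or -2 when wrong
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# (normalized income, normalized age) with label -1 (not buy) or +1 (buy).
examples = [((-1.0, -1.0), -1), ((1.0, 1.0), 1)]
weights, bias = train_perceptron(examples)
print([predict(weights, bias, x) for x, _ in examples])
```

When the prediction is already correct the error is zero and nothing changes, so the agent only moves when it makes a mistake.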


Another supervised learning problem is prediction or regression, where the output that you are going to predict is no longer a discrete value but a continuous one.

Above is an example where we have recorded the temperature at different times of day. The input to the system is the “time of day” and the output is the “temperature” measured at that time.

Our experience, or training data, takes the form shown above. The blue points are the inputs and the red points are the outputs that we are expected to predict.

The points to the left are day and the points to the right are night.

Just like we did in classification, let’s start with a simple fit: a straight line that is as close as possible to the points. As you can see, there are certain points at which we are making large errors, which we should try to fix.

Let’s see if we could do a bit better than this.

As you can see above, the daytime temperatures are more or less fine, but something seems really off with the night-time ones. Let’s try fitting something more complex, as we did in classification to fit that one-off red dot.

As we discussed in the case of classification, this is probably not the right answer, and we are probably better off fitting the straight line.

In solutions like this we are fitting the noise in the data, that is, making the solution predict the noise in the training data correctly. This situation is known as overfitting, and it is something we try to avoid in machine learning.
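Overfitting is easy to reproduce on a toy example. The sketch below compares a least-squares straight line with a polynomial forced through every training point. All the numbers are made up for illustration: the temperatures fall by one degree per hour except for one glitch at hour 3, and the held-out reading follows the clean trend.

```python
def fit_line(points):
    """Least-squares straight line through (x, y) points; returns a predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_through_every_point(points):
    """Polynomial that passes through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

# Hypothetical (hour, temperature) readings; the 25.0 at hour 3 is the glitch.
train = [(0, 20.0), (1, 19.0), (2, 18.0), (3, 25.0), (4, 16.0)]
held_out = (3.5, 16.5)  # hypothetical reading not used for fitting

line = fit_line(train)
wiggly = fit_through_every_point(train)

print("line error on held-out point:  ", abs(line(held_out[0]) - held_out[1]))
print("wiggly error on held-out point:", abs(wiggly(held_out[0]) - held_out[1]))
```

The complex curve has zero error on the training data because it reproduces the glitch exactly, and that is precisely why it does worse than the straight line on the new point.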

Now that we have discussed supervised learning, let’s briefly go over unsupervised learning.

2. Unsupervised Learning

Below is a classification data set where red denotes one class and blue denotes the other class.

In unsupervised learning, however, we are given a lot of data that does not have any labels attached. Below you can see an unlabeled data set.

So, first we look at the problem of clustering where the goal is to find groups of coherent or cohesive data points in the input space. Below is an example of possible clusters.

So, we have identified four clusters in the above setup. One thing to note here is that even in something like clustering we need to have some form of bias.

In this case the bias is in the shape of the clusters: we are assuming that the clusters are all ellipsoids, and therefore we draw curves of that specific shape to represent them.

Also note that not all data points need to fall into clusters; there are a couple of points that do not fall into any of the clusters.

This is primarily an artifact of our assumption that the clusters are ellipsoids. However, there are still some points in the center which are far away from all the other points in the dataset.

These points are known as outliers.

So, when we do clustering, there are two things that we should be interested in:

  1. finding cohesive groups of points
  2. finding data points that do not conform to the patterns in the input, which are known as outliers.
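Both goals can be sketched with k-means, a simple clustering method swapped in here for illustration (its bias is towards roughly round clusters rather than general ellipsoids). The points, the starting centres and the distance cut-off of 10 are all made up. Note that k-means assigns every point to some cluster, so outliers need a separate distance check.

```python
from math import dist

def nearest(point, centres):
    """Index of the centre closest to point."""
    return min(range(len(centres)), key=lambda i: dist(point, centres[i]))

def k_means(points, centres, iterations=10):
    """Alternate between assigning points to centres and moving centres to the mean."""
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        centres = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else centre
            for g, centre in zip(groups, centres)
        ]
    return centres

def outliers(points, centres, max_distance):
    """Points that lie farther than max_distance from their nearest centre."""
    return [p for p in points if dist(p, centres[nearest(p, centres)]) > max_distance]

points = [(0, 0), (0, 2), (10, 10), (10, 12), (5, 30)]
centres = k_means(points, centres=[(0, 0), (10, 10)])
print("centres: ", centres)
print("outliers:", outliers(points, centres, max_distance=10))
```

The far-away point still drags its cluster centre towards itself, which is one reason why outliers are worth finding and handling explicitly.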

Association Rule Mining

In association rule mining we are interested in finding frequent patterns that occur in the input data, and then in the conditional dependencies among these patterns.

For example, if A and B occur together often then we could say something like if A happens then B will happen.

Suppose you have customers coming to your shop, and whenever customer A visits, customer B tags along. The next time we find customer A in the shop, we can say with very high confidence that B is also somewhere in the shop, maybe not right next to A, but somewhere in the shop.

So, in association rules we are mostly looking at conditional dependencies of the form “if A has come to the shop then B is also there”.

The Association rule mining process usually goes in two stages:

  • The first stage is to find all frequent patterns. Let’s take an example: A is a customer who comes to the store often, and we also find that A and B are a pair of customers who come to the store together often. If A comes often and A and B together come often, then we can derive associations from these frequent patterns. We will discuss associations mathematically going forward.
  • The second stage is to derive associations from the frequent patterns, that is, to derive rules relating events or patterns. Suppose you do some fault analysis by looking at the sequence of events that happened; you can figure out which event occurs more often together with a fault.

Next let’s look at mining transactions, which is also related to association mining:

Mining Transactions

Let’s first go over some terminology to understand the context.

What is a transaction?

A transaction is a collection of items that are bought together.

A set or subset of items is often called an itemset in association rule mining.

The first step in association mining is to find frequent itemsets.

We can then say that itemset A implies itemset B if both A and A ∪ B are frequent itemsets. (Itemset A => Itemset B, if both A and A ∪ B are frequent itemsets.)
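"Frequent" is usually made precise with support, and the strength of a rule with confidence, the two measures named earlier for association mining. Here is a minimal sketch with made-up shopping baskets; the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """Of the transactions containing A, the fraction that also contain B."""
    return support(a | b, transactions) / support(a, transactions)

# Hypothetical transactions, each a set of items bought together.
transactions = [
    {"bananas", "milk", "bread"},
    {"bananas", "milk"},
    {"bananas", "carrot"},
    {"milk", "bread"},
]

print("support(bananas)          =", support({"bananas"}, transactions))
print("support(bananas and milk) =", support({"bananas", "milk"}, transactions))
print("confidence(bananas => milk) =", confidence({"bananas"}, {"milk"}, transactions))
```

Here bananas appear in 3 of 4 baskets and bananas with milk in 2 of 4, so the rule "bananas => milk" holds in 2 out of the 3 baskets that contain bananas.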

Some of the applications of this are:

  1. Predicting co-occurrence
  2. Market Basket Analysis
  3. Time series Analysis
    • Trigger Events

3. Reinforcement Learning

So far, we have been looking at popular models of machine learning such as supervised and unsupervised learning.

In supervised learning we looked at the classification and regression problems, and in unsupervised learning we looked at clustering, frequent patterns and so on. Now let’s understand what reinforcement learning means.

Let’s understand Reinforcement learning by taking an example of how we learnt to cycle.

So how did you learn to cycle? Was it supervised learning or unsupervised learning?

If it were a supervised learning problem, there would have been someone telling you how many pounds of pressure to put with your left foot, what angle to lean at, and so on.

It is also not completely unsupervised, because you did not just watch people cycling, figure out the pattern of movements needed, and then magically get on a cycle and start cycling.

So, what was the crucial thing here? There is a trial and error component: you had to get on the cycle and try things out yourself before you could learn how to cycle in an acceptable manner.

So, you have some kind of feedback, which means it is not completely unsupervised.

For example, when we learnt to cycle as kids, we had someone standing by to give us feedback, someone who would encourage us even when we fell down.

So, there is some amount of trial and error, along with feedback that you get from the environment. This kind of learning, where you learn to control a system through trial and error and minimal feedback, is what reinforcement learning is.

So, in the RL framework you typically think of a learning agent.

We have already looked at learning agents: it could be a supervised learner or an unsupervised one. In this case we have a reinforcement learning agent that learns from close interaction with an environment that is outside the control of the agent.

Close interaction here means that the agent senses the state the environment is in and takes an action, which it applies to the environment, thereby causing the state of the environment to change and completing the interaction cycle.

So, the agent senses the state of the environment. In the case of a cycle, the agent senses the angle at which the cycle is tilting, the speed at which it is moving forward, how fast the rider is falling, and so on.

All of this constitutes the state of the environment. The agent then takes an appropriate action, which in our case may be to lean to the right or to push down with the right leg. This action is applied to the environment, which in turn changes the state of the environment.

Even though the agent learns from such close interaction with the environment, we typically assume that the environment is stochastic. This means that every time you take an action you are not going to get the same response from the environment; things could be slightly different.

There might be a small stone in the road that was not there the last time you rode over this spot, and so what was a smooth ride could suddenly turn bumpy. Cycling always has some amount of noise, and you have to react to that noise.

So apart from this interaction the mathematical abstraction also assumes that there is some kind of an evaluation signal/feedback that is available from the environment that gives you some measure of how well you are performing in this particular task.

While learning to cycle, too, we needed an evaluation of how we were doing, and we assume that this comes in the form of a scalar evaluation from the environment.

This could be someone encouraging us by clapping and saying that we are doing well, or it could be falling down and getting hurt; all of this would be translated to some kind of numeric scale.

So, the goal of the agent is to learn a policy, which is a mapping from the states that you sense to the actions that you apply, so as to maximize a measure of long-term performance. This means the goal is not just staying upright while cycling but cycling from point A to point B successfully, staying balanced through the entire duration of the ride.

This is the basic idea behind the reinforcement learning problem. In every reinforcement learning algorithm, the goal is to learn a policy that maximizes some measure of long-term performance.
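To see the pieces together, here is a toy sketch of the interaction loop in Python. Everything in it is invented for illustration: the state is just an integer tilt, the noise is a random bump of -1, 0 or +1, and the rewards (+1 per step upright, -10 for a fall) are arbitrary. It shows a fixed policy being evaluated, not a learning algorithm.

```python
import random

def step(tilt, action, rng):
    """Environment: apply the action plus random noise, return (new_tilt, reward, done)."""
    new_tilt = tilt + action + rng.choice([-1, 0, 1])
    fallen = abs(new_tilt) >= 2
    reward = -10.0 if fallen else 1.0
    return new_tilt, reward, fallen

def balancing_policy(tilt):
    """Policy: a mapping from the sensed state to an action (lean against the tilt)."""
    if tilt > 0:
        return -1
    if tilt < 0:
        return 1
    return 0

def run_episode(policy, steps=20, seed=0):
    """One ride: sense the state, act, receive a scalar reward, repeat."""
    rng = random.Random(seed)
    tilt, total = 0, 0.0
    for _ in range(steps):
        action = policy(tilt)
        tilt, reward, done = step(tilt, action, rng)
        total += reward
        if done:
            break
    return total

print("balancing policy return:", run_episode(balancing_policy))
print("always lean right return:", run_episode(lambda tilt: 1))
```

The total reward over the whole ride is the measure of long-term performance; a reinforcement learning algorithm would search for the policy that makes it as large as possible, despite the noise.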