So, Is Your Machine Learning Algorithm Not Working?

By Kimberly Cook | Dec 11, 2018

Take these few steps before you jump into the code to debug it.

Let's imagine it's a fine afternoon and you have written an even finer machine learning algorithm for your model, and you expect it to give you correct results. If you then find that something is terribly wrong with the predictions, you are in the right place.
Well, my friend, there is something you can do before you start playing Columbus and go searching for bugs in your giant-looking code.

Before jumping into the solution, let's look at some of the terms that we are going to use.
  1. Training set: The input you provide to your algorithm in order to train it and find the best fit for your data.
  2. Cross-validation (CV) set: The dataset on which you estimate the error of the model you have just trained on your training examples.
  3. Test set: The dataset on which you test your final model and find out its accuracy.

In general, if you are given a dataset, you can divide it into three parts: 60% for the training set, 20% for the validation set and 20% for the test set (this is just one way to divide it; you may or may not do this).
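The 60/20/20 split can be sketched in a few lines of Octave. This is only an illustration: the toy data and the names X_all, y_all, Xtest and ytest are assumptions made for the example, and shuffling with randperm is one reasonable choice rather than something the article prescribes.

```octave
% Toy data, made up for this sketch: 10 house sizes and their prices.
m_all = 10;
X_all = (1:m_all)' * 100;
y_all = 50 + 0.3 * X_all;

% Shuffle the row indices so that each part is a random sample.
idx = randperm(m_all);
n_train = round(0.6 * m_all);   % 60% for training
n_val   = round(0.2 * m_all);   % 20% for cross-validation, the rest for testing

train_idx = idx(1:n_train);
val_idx   = idx(n_train + 1 : n_train + n_val);
test_idx  = idx(n_train + n_val + 1 : end);

X     = X_all(train_idx, :);  y     = y_all(train_idx);
Xval  = X_all(val_idx, :);    yval  = y_all(val_idx);
Xtest = X_all(test_idx, :);   ytest = y_all(test_idx);

fprintf('%d training, %d validation, %d test examples\n', ...
        size(X, 1), size(Xval, 1), size(Xtest, 1));
```

Every example ends up in exactly one of the three sets, so the validation and test errors are measured on data the model never saw during training.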

Now, depending upon the dataset you have chosen, the features you have taken for your input, your regularization parameter, etc., you might have encountered one of the following situations:

  1. High bias (underfit): If your hypothesis performs poorly on the training set and also fails to generalize to new examples, it suffers from high bias.
  2. High variance (overfit): If your hypothesis performs well on the training set but poorly on the validation set or on any new example, it suffers from high variance.

A typical bias-variance trade-off for linear regression (say we are predicting housing prices from a single feature, the size of the house) is shown in the figure below. Each graph plots that single feature of a training example along the x-axis and the example's output along the y-axis. Figure (a) shows a hypothesis that fits the training set closely but, because the curve bends so sharply to pass through every point, fails to generalize to new examples. Figure (c) shows a hypothesis that neither fits the training set nor generalizes. Figure (b) shows the optimum fit, which performs well on both the training set and the validation set because it rises smoothly as the size of the house increases.

If you have understood the two problems above, let's deal with them and see how they can be resolved.

How to resolve high variance: High variance can be resolved by using regularization or by increasing the number of training examples. Regularization keeps the parameters of the hypothesis (the theta values for the best fit) small, so that the curve cannot bend sharply and we get a hypothesis which generalizes well.
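To see regularization keeping theta small, here is a minimal sketch on made-up data. It assumes a closed-form regularized least-squares solve in place of whatever optimizer you actually use. The degree-6 polynomial, the toy data and the helper name fit are illustrative assumptions, not part of the original code.

```octave
% Toy data, made up for this sketch: a straight line with a deterministic wiggle.
x = linspace(-1, 1, 8)';
y = 2 * x + 0.3 * sin(25 * x);

% Degree-6 polynomial features with a leading column of ones (the bias term).
p = 6;
X = [ones(numel(x), 1), bsxfun(@power, x, 1:p)];

% Closed-form regularized least squares; theta(1), the bias term, is not penalized.
L = eye(p + 1);
L(1, 1) = 0;
fit = @(lambda) (X' * X + lambda * L) \ (X' * y);

theta_free = fit(0);   % no regularization
theta_reg  = fit(1);   % lambda = 1

fprintf('norm of theta(2:end): %.3f without, %.3f with regularization\n', ...
        norm(theta_free(2:end)), norm(theta_reg(2:end)));
```

The regularized parameters have the smaller norm, which is what stops the fitted curve from bending through every training point.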

The figure below plots housing price against the size of the house, with regularization in action. The blue curve shows an overfitted hypothesis, which performs well on the training set but fails to generalize. The magenta curve, on the other hand, shows the regularized hypothesis, which performs well on both the training set and new examples.

If you increase the number of training examples, the curve finds it difficult to bend through every point and, as a result, generalizes pretty well.

How to resolve high bias: High bias can be resolved by increasing the number of features for the training examples, for instance by adding polynomial features of the size of the house or by introducing more features such as the location of the house, the number of BHKs, and many more. This prevents the curve from settling into a straight-line fit, so it performs well on both training and new examples.
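Adding polynomial features can be sketched as follows. The toy data, the degree p = 3 and the helper err are assumptions made for this illustration. The features are also normalized here, a common practical step that the article does not discuss, because powers of a raw house size grow very large.

```octave
% Toy data, made up for this sketch: prices that curve upward with size.
x = [1; 2; 3; 4; 5];
y = x .^ 2;

% Polynomial features: columns x, x.^2, x.^3.
p = 3;
X_poly = bsxfun(@power, x, 1:p);

% Normalize each column to zero mean and unit standard deviation.
mu = mean(X_poly);
sigma = std(X_poly);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X_poly, mu), sigma);

X_line  = [ones(5, 1) x];        % straight-line model (high bias on this data)
X_cubic = [ones(5, 1) X_norm];   % model with the added polynomial features

% Unregularized least-squares fit, then the usual squared-error cost.
err = @(X) sum((X * (X \ y) - y) .^ 2) / (2 * numel(y));

fprintf('training error: %.4f (line), %.4f (polynomial)\n', ...
        err(X_line), err(X_cubic));
```

On this data the straight line leaves a clear training error, while the model with the extra features fits the curved prices almost exactly.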

The figure given below shows different high bias models.


  1. Increasing the number of training examples does not help much with high bias.
  2. Making the value of the regularization parameter (lambda) too high should be avoided, because it puts a heavy penalty on the theta parameters and results in underfitting.

By now, if you are wondering how to implement these concepts in your code, you really need to know about learning curves!

Learning Curves
To understand learning curves let's look at a typical curve of high bias and analyze it.

The figure above shows a high bias learning curve for linear regression with only one feature. Learning curves are plotted with the error (on the training set and on the cross-validation set) along the y-axis and m (the number of training examples) along the x-axis. It is evident from the figure that if m is very small, the training error is quite low but the validation (CV) error is very high, because the model has seen too few examples to generalize. As we increase m, the training error increases, because a straight-line fit can no longer pass through a large number of examples, but the validation error decreases, because the model generalizes better than before (it has been trained on more data). On further increasing m, the error saturates for both the training and validation sets, because a straight line cannot fit the data no matter how many examples you give it.

How to code this concept: Let's look at the code for the concept mentioned above. I have written the code in the Matlab/Octave environment, but that's not a problem if all we want is to learn the concept.

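function [error_train, error_val] = ...
    learningCurve(X, y, Xval, yval, lambda)
  m = size(X, 1);             % number of training examples
  error_train = zeros(m, 1);  % preallocate the two error vectors
  error_val = zeros(m, 1);
  for i = 1:m
    % train on the first i examples only
    Xt = X(1:i, :);
    yt = y(1:i);
    theta = trainLinearReg(Xt, yt, lambda);
    % measure both errors without the regularization term (lambda = 0)
    error_train(i) = linearRegCostFunction(Xt, yt, theta, 0);
    % Xval and yval are the whole cross-validation set
    error_val(i) = linearRegCostFunction(Xval, yval, theta, 0);
  end
end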

We run a loop from 1 to the number of training examples (m). X is our original matrix containing all the training examples. Xt contains the first 'i' training examples and yt contains their corresponding outputs (here, the price of the house). Then we pass Xt, yt, and the regularization parameter (lambda) to the trainLinearReg function, which returns the optimum fit (theta values) for the given number of training examples, and we store it in theta. Finally, we fill in the error vectors for 'i' training examples: error_train(i) contains the error on the training set for 'i' training examples, and error_val(i) contains the error on the validation set at the 'i'th iteration. Note that we pass the whole validation set at each iteration, unlike the training set. To calculate the error on the training set or the validation set, we simply use the cost function, passing lambda as zero into linearRegCostFunction.
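The article calls trainLinearReg without showing it. The sketch below is a hypothetical stand-in, not the author's implementation: it assumes a closed-form regularized least-squares solve (an iterative optimizer would work just as well) and runs the same loop on made-up data, so the whole thing can be tried as a single script. The anonymous functions and the toy data are assumptions for illustration.

```octave
% Hypothetical stand-in for trainLinearReg: closed-form regularized least squares.
% pinv keeps the solve usable when there are fewer examples than parameters.
trainLinearReg = @(X, y, lambda) ...
    pinv(X' * X + lambda * diag([0; ones(size(X, 2) - 1, 1)])) * (X' * y);

% Squared-error cost without the regularization term (same as passing lambda = 0).
cost = @(X, y, theta) sum((X * theta - y) .^ 2) / (2 * size(X, 1));

% Toy data, made up for this sketch: curved prices, so a straight line underfits.
x = (1:6)';              y = x .^ 2;       X = [ones(6, 1) x];
xval = [1.5; 3.5; 5.5];  yval = xval .^ 2; Xval = [ones(3, 1) xval];

lambda = 0;
m = size(X, 1);
error_train = zeros(m, 1);
error_val = zeros(m, 1);
for i = 1:m
  theta = trainLinearReg(X(1:i, :), y(1:i), lambda);   % train on the first i examples
  error_train(i) = cost(X(1:i, :), y(1:i), theta);     % error on those i examples
  error_val(i) = cost(Xval, yval, theta);              % error on the whole validation set
end

disp([(1:m)' error_train error_val]);   % columns: m, training error, validation error
```

With one or two examples the straight line passes through every training point, so the training error starts at essentially zero and only rises once more examples are added, which is the start of the high bias curve described above.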

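function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
  m = length(y);  % number of training examples
  h = X * theta;  % predictions of the hypothesis
  sqrError = (h - y).^2;
  % squared-error cost plus the penalty on theta (the bias term theta(1) is not penalized)
  J = (1/(2*m)) * sum(sqrError) + (lambda/(2*m)) * sum(theta(2:end).^2);
  % gradient of the regularized cost
  grad = (1/m) .* (X' * (h - y)) + (lambda/m) .* theta;
  % remove the penalty from the bias term; assumes X(:, 1) is a column of ones
  grad(1) = (1/m) * sum(h - y);
  grad = grad(:);
end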

Once you have both error vectors (training and validation), plot them to visualize the result.

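m = size(X, 1);  % number of training examples
[error_train, error_val] = ...
    learningCurve([ones(m, 1) X], y, ...
                  [ones(size(Xval, 1), 1) Xval], yval, ...
                  lambda);  % lambda: the regularization parameter, set beforehand
plot(1:m, error_train, 1:m, error_val);
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
axis([0 13 0 150])  % axis limits chosen for this dataset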

The plot function takes the values from 1 to m and plots error_train and error_val against the respective number of training examples. After implementing this, you should get a curve similar to the one shown above for high bias.

For high variance, you would get a curve similar to the one shown below.

  1. There is a gap between the validation error and the training error, unlike in the high bias case.
  2. Increasing the number of training examples or the regularization parameter can help to reduce high variance.

By now you are in a position to recognize whether you are dealing with high bias or high variance, which is a head start in debugging your code.

Tuning the regularization parameter (lambda):
One last thing that I want to tell you about is the regularization parameter lambda. In particular, a model without regularization (λ = 0) fits the training set well but does not generalize. Conversely, a model with too much regularization fits neither the training set nor the test set well. A good choice of λ gives a good fit to the data. You can easily find the optimum value of lambda for your model if you have followed me so far. We do much the same thing as we did for learning curves, except that instead of plotting the error against the number of training examples, we vary lambda and plot the corresponding error on the training and validation sets against it. The region where both the training and validation errors are low gives us the optimum value of lambda. Let's see the code behind this concept.

error_train = zeros(length(lambda_vec), 1);
error_val = zeros(length(lambda_vec), 1);
for i = 1:length(lambda_vec)
  lambda = lambda_vec(i);
  % here X and y are the whole training set
  theta = trainLinearReg(X, y, lambda);
  % the errors are again measured with lambda = 0
  error_train(i) = linearRegCostFunction(X, y, theta, 0);
  error_val(i) = linearRegCostFunction(Xval, yval, theta, 0);
end

The vector lambda_vec contains the various values of lambda that we want to try; we loop over it to find the corresponding errors on the training set and the validation set. Let's see a graph of this concept and try to understand how it works.
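Once the two error vectors are filled in, picking lambda can be as simple as taking the entry with the lowest validation error. The sketch below runs the whole procedure on made-up data. The grid in lambda_vec, the degree-6 features, the toy data and the helpers poly, train and cost are all assumptions for illustration, with a closed-form solve standing in for trainLinearReg.

```octave
% Toy data, made up for this sketch.
p = 6;
poly = @(x) [ones(numel(x), 1), bsxfun(@power, x, 1:p)];   % bias column + x, x.^2, ...
x = linspace(-1, 1, 8)';         y = 2 * x + 0.3 * sin(25 * x);
xval = linspace(-0.9, 0.9, 5)';  yval = 2 * xval + 0.3 * cos(40 * xval);
X = poly(x);
Xval = poly(xval);

% Hypothetical stand-ins: closed-form regularized fit and unregularized cost.
train = @(X, y, lambda) ...
    (X' * X + lambda * diag([0; ones(size(X, 2) - 1, 1)])) \ (X' * y);
cost = @(X, y, theta) sum((X * theta - y) .^ 2) / (2 * size(X, 1));

lambda_vec = [0 0.01 0.1 1 10 100]';   % assumed grid of candidate values
error_train = zeros(length(lambda_vec), 1);
error_val = zeros(length(lambda_vec), 1);
for i = 1:length(lambda_vec)
  theta = train(X, y, lambda_vec(i));
  error_train(i) = cost(X, y, theta);
  error_val(i) = cost(Xval, yval, theta);
end

% Choose the lambda whose validation error is lowest.
[~, best] = min(error_val);
best_lambda = lambda_vec(best);
fprintf('lowest validation error %.4f at lambda = %g\n', error_val(best), best_lambda);
```

The training error can only stay the same or grow as lambda increases, so it is the validation error that decides the choice. A final, unbiased accuracy figure should then come from the test set, which played no part in picking lambda.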

It is evident from the graph that a value of lambda around 100 could be a good choice for training the model, because at this point both the cross-validation and training sets have a low error value. At this point, we have seen a number of possible problems that can make our algorithm perform poorly, and also how we can fix them. If you wish to see a whole implementation of these concepts, be sure to check out the link given below.

An exercise to merge all these concepts and make your own model.

Thus, learning curves can help you diagnose high bias and high variance and tune the regularization parameter of your model, which saves a lot of time. I hope this concept helps you debug your code, and that you liked it.

Source: HOB