Bias and Variance in Machine Learning

Overview

  • Learn to interpret Bias and Variance in a given model.
  • Understand the difference between Bias and Variance.
  • See how to achieve the Bias-Variance trade-off in a Machine Learning workflow.

Introduction

Let us talk about the weather. Suppose it rains only when it is a little humid, and never when it is windy, hot or freezing. How would you train a predictive model to forecast the weather with as few errors as possible? There are many learning algorithms to choose from, and they differ in many ways, but with every one of them there is a gap between what we expect and what the model predicts. Understanding that gap is where the Bias-Variance trade-off comes in.

Usually, the Bias-Variance trade-off is taught through dense mathematical formulas. In this article, I have tried to explain Bias and Variance as simply as possible!

My focus will be to walk you through understanding the problem statement and choosing the model for which the Bias and Variance errors are minimal.

For this, I have taken up the popular Pima Indians Diabetes dataset. The dataset consists of diagnostic measurements of adult female patients of Pima Indian heritage. We are going to focus on the “Outcome” variable, which indicates whether the patient has diabetes or not. This is a binary classification problem, and we are going to dive right in and learn how to go about it.


Evaluating your Machine Learning Model

The primary aim of a Machine Learning model is to learn from the given data and generate predictions based on the patterns observed during the learning process. However, our task doesn’t end there. We need to keep improving the model based on the results it generates. We quantify the model’s performance using metrics like Accuracy, Mean Squared Error (MSE) and F1-Score, and try to improve these metrics. This can get tricky when we have to keep the model flexible without compromising on its correctness.
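As a rough illustration of the metrics mentioned above, here is a minimal sketch in plain Python. The labels are made up for eight hypothetical patients and are not taken from the Pima dataset, and the helper functions are written out by hand here rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels (1 = diabetic, 0 = not diabetic), for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy(y_true, y_pred))
print("MSE:     ", mean_squared_error(y_true, y_pred))
print("F1-Score:", f1_score(y_true, y_pred))
```

With two mistakes out of eight (one false positive and one false negative), all three numbers can be checked by hand, which makes this a handy sanity test before moving to a real dataset.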

A supervised Machine Learning model trains on the input variables (X) so that its predicted values are as close as possible to the actual values (Y). The difference between the actual and predicted values is the error, and it is used to evaluate the model. The error of any supervised Machine Learning algorithm comprises three parts:

  1. Bias error
  2. Variance error
  3. The noise

While the noise is irreducible error that we cannot eliminate, the other two, Bias and Variance, are reducible errors that we can attempt to minimize.

What is Bias?

In the simplest terms, Bias is the difference between the model’s average prediction and the true value it is trying to predict. To explain further, the model makes certain assumptions when it trains on the data provided. When it is introduced to the testing/validation data, these assumptions may not always hold.

Suppose we fit a k-nearest neighbours (KNN) classifier to this dataset. If we use a large number of nearest neighbours, the model can behave as if some parameters are not important at all. For example, it may effectively decide that only the Glucose level and the Blood Pressure determine whether the patient has diabetes. Such a model makes very strong assumptions about the other parameters not affecting the outcome. You can also think of it as a model predicting a simple relationship when the data points clearly indicate a more complex one:

[Figure: a simple model fitted to data that follows a more complex relationship]

Mathematically, let the input variables be X and the target variable be Y. We map the relationship between the two using a function f.

Therefore,

Y = f(X) + e

Here ‘e’ is the error, which is assumed to be normally distributed with mean zero. The aim of our model f'(X) is to predict values as close to f(X) as possible. The Bias of the model is:

Bias[f'(X)] = E[f'(X) – f(X)]

As I explained above, a model with a high bias error generalizes too much: it is very simplistic and does not capture the variations in the data well. Since it does not learn the training data well, this is called Underfitting.

What is Variance?

In contrast to Bias, Variance describes how much the model’s predictions change when it is trained on different data. It arises when the model also takes into account the fluctuations in the data, i.e. the noise. So, what happens when our model has a high variance?

The model treats the noise as something to learn from. That is, it learns so much from the training data that, when confronted with new (testing) data, it is unable to predict accurately.

Mathematically, the variance error in the model is:

Variance[f'(X)] = E[f'(X)^2] – (E[f'(X)])^2

Since a model with high variance learns too much from the training data, this is called Overfitting.
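The bias and variance formulas above can be tied back to the three error components listed earlier. For squared-error loss, and assuming the noise e has mean zero and variance σ² (the symbol σ is notation introduced here, not in the original), the expected error at a point X splits into three terms. The expectations are taken over different training sets and over the noise:

```latex
\begin{aligned}
E\big[(Y - f'(X))^2\big]
  &= \underbrace{\big(E[f'(X)] - f(X)\big)^2}_{\text{Bias}^2}
   + \underbrace{E\big[(f'(X) - E[f'(X)])^2\big]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{Noise}}
\end{aligned}
```

The last term does not depend on the model at all, which is why the noise is called irreducible. This decomposition is stated for regression-style squared error; for classification scores it is best read as an intuition rather than an exact identity.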

In the context of our data, using very few nearest neighbours is like saying: if the number of pregnancies is more than 3, the glucose level is more than 78, the diastolic BP is less than 98, the skin thickness is less than 23 mm, and so on for every feature, then decide that the patient has diabetes, and all patients who don’t meet these criteria are not diabetic. While this may be true for one particular patient in the training set, what if these values are outliers or were even recorded incorrectly? Clearly, such a model could prove to be very costly!

Additionally, this model would have a high variance error because its predictions vary greatly with the training data we provide. Even changing the Glucose level to 75 would result in the model predicting that the patient does not have diabetes.

To put it simply, the model predicts a very complex relationship between the outcome and the input features when a quadratic equation would have sufficed. This is what a classification model looks like when there is a high variance error, i.e. when it is overfitting:

[Figure: a classification model with high variance (overfitting)]

To summarise,

  • A model with a high bias error underfits the data and makes very simplistic assumptions about it
  • A model with a high variance error overfits the data and learns too much from it
  • A good model is one where the Bias and Variance errors are balanced

Bias-Variance Trade-off

How do we relate the above concepts to our KNN model from earlier? Let’s find out!

In our model, say for k = 1, only the point closest to the data point in question is considered. The prediction can follow the training data very closely, so the bias error will be low.

However, the variance error will be high, since only the single nearest point is considered and the other nearby points are ignored. What scenario do you think this corresponds to? Yes, you are thinking right: our model is overfitting.

On the other hand, for higher values of k, many more points near the data point in question are considered. Averaging over so many neighbours means the model cannot learn the specifics of the training set, which results in a higher bias error and underfitting. In return, we can expect a lower variance error on the testing set, which holds unseen values.

To achieve a balance between the Bias error and the Variance error, we need a value of k such that the model neither learns from the noise (overfits the data) nor makes sweeping assumptions about the data (underfits the data). A balanced model would look like this:
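The effect of k on bias and variance can be checked with a small simulation. The sketch below is not run on the Pima dataset: it assumes a made-up true relationship `true_f` (a sine curve), a noise level of 0.3 and a one-dimensional KNN that averages its neighbours, all chosen only for illustration. It trains on many freshly drawn training sets and measures how the prediction at one fixed point behaves.

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical true relationship f(X); chosen only for illustration."""
    return math.sin(2 * math.pi * x)


def knn_predict(train, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_and_variance(k, x0=0.25, n_train=50, n_repeats=300, noise_sd=0.3, seed=0):
    """Estimate the bias and variance of the KNN prediction at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        # Draw a fresh training set: Y = f(X) + e, with e ~ Normal(0, noise_sd^2).
        xs = [rng.random() for _ in range(n_train)]
        train = [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]
        predictions.append(knn_predict(train, x0, k))
    bias = statistics.mean(predictions) - true_f(x0)  # E[f'(X)] - f(X)
    variance = statistics.pvariance(predictions)      # spread across training sets
    return bias, variance


for k in (1, 5, 25):
    bias, variance = bias_and_variance(k)
    print(f"k={k:>2}  bias^2={bias ** 2:.4f}  variance={variance:.4f}")
```

Under these assumptions, k = 1 should show a small squared bias but a variance close to the noise level, while k = 25 should show the opposite pattern, because averaging 25 neighbours smooths away both the noise and the peak of the curve.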

[Figure: a balanced model]

Though some points are classified incorrectly, the model generally fits most of the data points accurately. The balance between the Bias error and the Variance error is the Bias-Variance Trade-off.

The following bull’s-eye diagram explains the trade-off better:

[Figure: bull’s-eye diagram of bias and variance]

The centre, i.e. the bull’s eye, is the result we want to achieve: a model that predicts all the values correctly. As we move away from the bull’s eye, our model makes more and more wrong predictions.

A model with low bias and high variance predicts points that are around the centre generally, but pretty far away from each other. A model with high bias and low variance is pretty far away from the bull’s eye, but since the variance is low, the predicted points are closer to each other.

In terms of model complexity, we can use the following diagram to decide on the optimal complexity of our model.

[Figure: model complexity diagram]

So, what do you think is the optimum value for k?

From the above explanation, we can conclude that the k for which

  • the testing score is the highest, and
  • both the test score and the training score are close to each other

is the optimal value of k. So, even though we are compromising on a lower training score, we still get a high score on our testing data, which matters more, since the test data is, after all, unseen data.
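The selection rule above can be sketched as a small sweep over k. This is only an illustration under stated assumptions: the data is synthetic (two overlapping Gaussian classes standing in for the real dataset), the KNN classifier is a hand-written majority vote, and the candidate values of k are arbitrary odd numbers chosen to avoid tied votes.

```python
import random


def make_data(n, rng):
    """Synthetic two-feature, two-class data standing in for the real dataset."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.5 if label == 1 else 0.0  # classes overlap, so some error is irreducible
        point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        data.append((point, label))
    return data


def knn_classify(train, point, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    def distance(item):
        (x1, x2), _ = item
        return (x1 - point[0]) ** 2 + (x2 - point[1]) ** 2

    nearest = sorted(train, key=distance)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def score(train, data, k):
    """Accuracy of the KNN classifier on `data` when trained on `train`."""
    return sum(knn_classify(train, p, k) == y for p, y in data) / len(data)


rng = random.Random(42)
train, test = make_data(200, rng), make_data(200, rng)

results = {}
for k in (1, 3, 5, 11, 21, 41):
    results[k] = (score(train, train, k), score(train, test, k))
    print(f"k={k:>2}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")

# Highest test score first; ties are broken by the smallest train/test gap.
best_k = max(results, key=lambda k: (results[k][1], -abs(results[k][0] - results[k][1])))
print("chosen k:", best_k)
```

For k = 1 the training score is perfect because every training point is its own nearest neighbour, which is exactly the overfitting signature described earlier. In practice a separate validation set or cross-validation would usually play the role of the test set here, so that the final test score stays an honest estimate.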

Let us make a table for different values of k to illustrate this:

[Table: results for different values of k]

Conclusion

To summarize, in this article we learned that an ideal model is one where both the bias error and the variance error are low. In practice, we should aim for a model whose score on the training data is as close as possible to its score on the testing data.

That is how we figured out how to choose a model that is neither too complex (high variance and low bias), which would lead to overfitting, nor too simple (high bias and low variance), which would lead to underfitting.

Bias and Variance play an important role in deciding which predictive model to use. I hope this article explained the concept well.

Article by Revanth Yadama