Linear regression - still a Queen?

Linear Regression Unravelled: Demystifying the Algorithm and Leveraging its Benefits

Last year Data Juice Lab ran a Models' Comparison project on an exceptionally large data set of more than 50 million observations. The objective of the project was to compare the performance of different models. The competition was fierce, yet linear regression scored high against well-known rivals such as extreme gradient boosting.

So, I thought it would be helpful to briefly describe the algorithm behind linear regression and discuss its strong points.  

Business Advantages of Linear Regression 

While advanced Machine Learning algorithms have become popular and are passionately discussed in the circles of scientists and data enthusiasts, regression models are still widely used in the business community. The difference between the two approaches is not obvious, yet it is important: supervised Machine Learning models are focused strictly on producing the best prediction of the future, whereas regression also helps to understand the relationships between the dependent and independent features themselves.

In this story, we want to focus on the understanding behind the fundamental regression model, namely Linear Regression, which remains a building block of statistics thanks to its interpretable results and built-in statistical tests for the significance of the association between the examined variables.


Regression use cases   

When it comes to applying any model, whether it originates from the field of Machine Learning or Econometrics, it is crucial to accurately recognize the specifics of the problem, namely the data type of the outcome variable being examined. Based on that understanding, the correct model should be chosen. The family of regressions is quite wide, allowing the approach to be customized for many specifics of the dependent feature.


Speaking about Linear Regression (OLS) specifically, it is used to predict continuous numeric variables, which makes it a very versatile model. Financial analytics, marketing, advertising, healthcare, social sciences, real estate – these are just a few branches where OLS can be applied.

Here are a few more specific use cases of Linear Regression, provided by our specialists: 

  • Recovery rate estimation (approximating the amount of liabilities a defaulted borrower could repay),
  • Bonus/wage increase estimation for employees, 
  • Entity’s disposable income approximation. 

One fact that should be taken into account is that the algorithm behind Linear Regression is fairly simple, and thus it might not be the most accurate predictor out of the whole variety of models. Nevertheless, OLS is still favoured in many circles and should not be underrated as a tool for analysing data. One reason is that its simplicity lowers the time cost of modelling; another is that an interpretable model is a better choice when one seeks a deeper understanding of the associations in the data.

Such an approach is applied especially in medicine, as the Evidence Based Paradigm requires interpretation capability to be built into the regression modelling. In medical data analysis the biggest obstacle is the presence of confounding variables, which can never be fully eliminated due to the nature of the data itself (the complexity of the human body). Regression analysis is the main way of dealing with the effect of confounding variables, allowing a proper interpretation of the causal relationship between the investigated exposure (such as a new drug or device) and the outcome (such as patients' survival or disease progression).

Thus, linear regression serves as a basis for scoring models used in real-life medical practice to assess the risk of future adverse events, which can affect the course of treatment.
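As a rough illustration of that idea, here is a minimal sketch with hypothetical data and variable names (not from the project), using Python's statsmodels library: a confounder is adjusted for simply by including it as an additional regressor, and the fitted model returns interpretable coefficients together with their p-values.

```python
# Minimal sketch with hypothetical data: adjusting for a confounder (age)
# by including it as an additional regressor next to the exposure (treatment).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.normal(60, 10, n)                                           # confounder
treatment = (age / 100 + rng.normal(0, 0.3, n) > 0.6).astype(float)   # exposure, influenced by age
outcome = 2.0 * treatment - 0.05 * age + rng.normal(0, 1, n)          # outcome, influenced by both

X = sm.add_constant(np.column_stack([treatment, age]))  # intercept + exposure + confounder
model = sm.OLS(outcome, X).fit()

# Coefficients, confidence intervals and p-values come built in,
# which is what makes the result interpretable.
print(model.summary())
```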

Having said that, let us now uncover the algorithm behind the Ordinary Least Squares (OLS) regression.

Back to primary school 

To understand the concept behind OLS regression, let us first recall some simple math, namely linear functions. In fact, they express the whole philosophy behind prediction: if we know the values of a) the independent variable x and b) the parameters b₀ and b₁, we can obtain the value of the dependent variable y, following the formula below.

y = b₀ + b₁ · x

Usually, x stands for the feature that can be observed with more certainty and is more accessible, for example a person's gender or age. Those characteristics are the toolkit we have for making predictions, so we call them independent variables. On the contrary, y is the feature that we want to approximate (e.g. an employee's income), based on the relationship between x and y, by hypothesizing that the value of y depends on the values of x.


This relationship between x and y is captured in the equation by the parameters b₀ and b₁.

b₀ stands for the value of y when x is zero and is often referred to as the intercept. In turn, b₁ serves as a benchmark for the relationship between y and x. It equals Δy / Δx and reveals the unit change in y with respect to a unit change in x. In other words, b₁ is also called the slope of the linear function.

Summing up, it is possible to estimate the coefficients b₀ and b₁ (the intercept and the slope) from observed data, so as to approximate unobserved future values of y by knowing only the values of b₀, b₁ and x. This is, in brief, what Ordinary Least Squares allows us to do.
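As a minimal sketch (synthetic data, plain NumPy – not the project data), the two coefficients of a simple regression can be estimated with the textbook formulas b₁ = cov(x, y) / var(x) and b₀ = ȳ − b₁ · x̄:

```python
# Minimal sketch on synthetic data: estimating the slope and intercept
# of a simple linear regression with the closed-form formulas.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)                   # independent variable
y = 3.0 + 1.5 * x + rng.normal(0, 1, 100)     # dependent variable with noise

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope = cov(x, y) / var(x)
b0 = y.mean() - b1 * x.mean()                    # intercept = mean(y) - slope * mean(x)

print(f"intercept b0 ≈ {b0:.2f}, slope b1 ≈ {b1:.2f}")   # close to the true 3.0 and 1.5
```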

Transition to Linear Regression  

When it comes to estimating the coefficients, the issue we encounter is that in reality the true values of y never lie perfectly on the line – instead, they are dispersed around it. For this reason, the linear equation described above only approximates the true values of y but does not pass through them. So what can be done about that problem?


To express the exact value of the observed yᵢ for each specific value of xᵢ, an error term is added to the linear part of the equation. The error term equals the distance between the predicted value, denoted as ŷᵢ, and the true, observed value yᵢ.

This way, the formula of the linear regression looks as follows: 

yᵢ = b₀ + b₁ · xᵢ + eᵢ

where: 

  • (1) the linear part b₀ + b₁ · xᵢ, with the estimated b₀ and b₁ remaining constant for each pair of xᵢ and yᵢ,
  • (2) the error term eᵢ, changing with the values of xᵢ, equal to yᵢ − ŷᵢ.
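To make the notation concrete, here is a toy sketch (hypothetical numbers and assumed coefficients) of how the fitted values ŷᵢ and the error terms eᵢ are computed:

```python
# Toy sketch: fitted values and error terms for three observations,
# given hypothetical estimated coefficients b0 = 1.0 and b1 = 2.0.
import numpy as np

b0, b1 = 1.0, 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.2, 4.9, 7.1])   # observed values, scattered around the line

y_hat = b0 + b1 * x             # (1) the linear part: predicted values ŷᵢ
e = y - y_hat                   # (2) the error terms eᵢ = yᵢ − ŷᵢ
print(y_hat)                    # [3. 5. 7.]
print(e)                        # ≈ [ 0.2 -0.1  0.1]
```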

So why is the algorithm called Ordinary Least Squares?  

The purpose of OLS regression is to estimate the coefficients so that the fitted line best approximates the values of the dependent variable yᵢ. To do that, the sum of squared errors eᵢ (i.e. the distances between the true yᵢ and the predicted ŷᵢ) is minimized, so that the derived linear function approaches the data as closely as possible. Consequently, the cost function for linear regression looks as follows:

min over b₀, b₁:  Σᵢ eᵢ² = Σᵢ (yᵢ − ŷᵢ)²
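As a small sketch (again on synthetic data), the cost function can be written as a Python function, and the closed-form OLS coefficients can be checked to yield a lower sum of squared errors than any perturbed pair:

```python
# Small sketch: the OLS coefficients minimise the sum of squared errors.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 200)

def sse(b0, b1):
    """Cost function: sum of squared errors for a candidate line."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Closed-form OLS estimates (same formulas as in the earlier sketch)
b1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0_hat = y.mean() - b1_hat * x.mean()

print(sse(b0_hat, b1_hat))          # minimal SSE
print(sse(b0_hat + 0.5, b1_hat))    # any perturbation increases the cost
print(sse(b0_hat, b1_hat + 0.1))
```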

If we wanted to visualize the process of fitting the best approximation to real data, it would look as follows:

[Figure: fitting the regression line to the observed data]
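In place of the original animation, here is a rough sketch (synthetic data; scikit-learn and matplotlib assumed to be available) that fits the line and plots it against the observed points:

```python
# Rough sketch: fitting the line with scikit-learn and plotting the result.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 1.5 * x + rng.normal(0, 1.5, 100)

reg = LinearRegression().fit(x.reshape(-1, 1), y)    # OLS under the hood
y_hat = reg.predict(np.sort(x).reshape(-1, 1))       # predictions along the sorted x-axis

plt.scatter(x, y, s=12, alpha=0.6, label="observed data")
plt.plot(np.sort(x), y_hat, color="red", label="fitted OLS line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```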

So, a thorough understanding of the algorithm behind regression is critical in the modelling process, as the statistical measures used to evaluate and fine-tune the model rely on exactly this foundation.

Summary  

When it comes to constructing any model, it is essential to consider the trade-off between algorithm complexity and time cost, as this impacts both prediction accuracy and model stability. For problems focused on capturing the overall relationship or gaining a broad perspective on the data, simpler, less time-costly algorithms like Ordinary Least Squares can be a suitable choice. The simplicity of linear regression makes it a benchmark model often implemented as an initial step of preliminary data analysis, serving as a basis for comparison with more advanced models.

Still, even a simpler model can provide solid results if two key conditions are met. Firstly, selecting and gathering high-quality data for the predictors (independent variables) is crucial. Secondly, thorough data cleaning and processing are necessary to ensure accurate model development. These steps significantly affect the model's performance.

In the following article, we will delve into a specific example to demonstrate the practical implementation of these concepts. Stay tuned with Data Juice Lab!


________________________________________________

For more on t-test and p-value concepts, you might be interested to read: Freedman, D., et al. Statistics. 4th ed. W.W. Norton, 2007, p. 475.
