How can logistic regression save the day?
This was originally posted to my blog on 5th Feb 2022
Despite the chaotic nature of our world, there is beauty in it in the form of hidden patterns and systems. This is what makes the fields of data science and machine learning so alluring to many ambitious and talented individuals out there: the task of looking at a random, noisy dataset as something that hides a fascinating truth fulfills, I must confess, a Sherlock-Holmes-type detective fantasy for us. It is exciting to know that we can predict the salary of an individual by building a model on some available data, or predict the future stock price of a company from its historical data, and the only reason this is possible is that there really are hidden patterns and relationships among the variables involved in these datasets.
But what about more complex problems, like predicting whether a customer will buy the product we advertised, or whether a tumor is benign or malignant based on some available features of a patient? How can we apply this fascinating, detective-style approach there? As I mentioned in my previous post on linear regression, how can we use the simple, straight-line equation we learned in high school to deal with what is called a 'classification' problem? To understand this, let us look at an example problem: predicting whether an individual will accept a job offer or not using a single feature, the offered salary. This is a classification problem where we need to predict a label in the form of the digits 0 and 1, where 0 indicates that the individual did not accept the offer and 1 indicates that they did.
Let us use the below made-up dataset for this problem:
Now we can try to visualize this dataset in a two-dimensional coordinate system:
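If you would like to follow along in code, here is a minimal sketch of what such a visualization might look like. The salary figures and acceptance labels below are purely hypothetical stand-ins for the made-up dataset above, not the actual table:

```python
import matplotlib.pyplot as plt

# Hypothetical offered salaries (in thousands) and whether the offer was
# accepted (1) or not (0). These values are illustrative stand-ins only.
salaries = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
accepted = [0,  0,  0,  0,  1,  0,  1,  1,  1,  1]

plt.scatter(salaries, accepted)
plt.xlabel("Offered salary (in thousands)")
plt.ylabel("Offer accepted (0 = no, 1 = yes)")
plt.title("Job offer acceptance vs offered salary")
plt.show()
```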
Well, how are we even going to try and fit a straight line for this situation?
A regression algorithm doesn't have to fit only a straight line, of course; it can also fit non-linear curves that better suit our data. But in this case, even a second-order or third-order polynomial curve won't be enough to model the data. This is how we know that regression algorithms fail when it comes to classification problems. It seems we need to look at classification problems from a different perspective, since regression algorithms are ineffective here. But hold on for a second, because regression has one last trick that can help us build classification models, and this trick is called 'logistic regression'.
Now that we are finally introduced to the titular character and the main protagonist of our story, let us delve into the story of how logistic regression can save the day.
It starts with the famed normal distribution, a simple but brilliant concept that captures the fact that mediocrity is widespread in our society whereas excellence is rare. A dataset that follows this normal distribution exhibits a characteristic bell curve when its distribution is plotted graphically, i.e. when we plot the values of the dataset on the X-axis and the frequency with which these values appear (also called the probability density) on the Y-axis, we get a nice bell-shaped curve like the one below:
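A quick sketch like the one below (using numpy and matplotlib, with an arbitrary mean and standard deviation chosen just for illustration) reproduces this bell shape by plotting the normal probability density function:

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0.0, 1.0                 # arbitrary mean and standard deviation
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)

# Probability density function of the normal distribution
pdf = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

plt.plot(x, pdf)
plt.xlabel("Value")
plt.ylabel("Probability density")
plt.title("The bell curve of a normal distribution")
plt.show()
```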
Hence this bell curve is a visual representation of the fact that the most frequent values in a dataset are the average values (called the mean in statistical terms), while the exceptionally high and low values are rare and hence have low frequency. We can even apply a philosophical angle to this distribution and say that the motive of most people in the world is to travel from left to right on a bell curve.
But from where did philosophy pop up all of a sudden? (that's my bad, sorry!) and why does this normal distribution matter now?
It is because most of the things we try to predict using regression models are continuous outputs like salaries or test scores, which are believed to follow a normal distribution. The linear regression algorithm itself assumes this normal distribution in the dataset. But the labels of the classification problem we are trying to predict instead follow what is widely known as the Bernoulli distribution.
This Bernoulli distribution doesn't care about mediocrity or excellence; it only cares about two things, p and 1-p, where p is the probability that some event occurs and 1-p is naturally the probability that it doesn't. The classic coin toss experiment, where there are only two outcomes, Heads and Tails, is an ideal example of this distribution: p would be the probability that the outcome is Heads and 1-p the probability that it is Tails. Hence, to build a classification model, we need an algorithm that assumes a Bernoulli distribution of the labels instead of a normal distribution.
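To make this concrete, here is a tiny simulation sketch of a (possibly biased) coin, where p is the probability of Heads; the observed frequencies should hover around p and 1-p:

```python
import random

p = 0.7            # probability of Heads (an arbitrary choice for illustration)
n_tosses = 10_000

heads = sum(1 for _ in range(n_tosses) if random.random() < p)
tails = n_tosses - heads

print(f"Observed frequency of Heads: {heads / n_tosses:.3f} (expected ~{p})")
print(f"Observed frequency of Tails: {tails / n_tosses:.3f} (expected ~{1 - p})")
```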
But as long as we are talking in terms of a regression technique (i.e. a simple y = b + w*x), the normal distribution is inherently involved, and it cannot simply be swapped for a Bernoulli distribution at our wish. Logistic regression therefore achieves this by using something called a 'link function', which bridges the normal distribution with the Bernoulli distribution for modeling purposes. To understand how this is done, let us look at a classification problem of predicting whether some event happens or not, given a feature 'x'.
We will first look at it through the lens of linear regression. For the sake of simplicity, let us assume that our ML model was somehow able to calculate the parameters w and b. Then:
y = w*x + b
y is a continuous variable with a normal distribution, which is not of much use to us here. But suppose there is a function 'f' that transforms it into a categorical label with a Bernoulli distribution; then the equation of our model can be re-written as below:
f(y) = w*x+b
A logistic regression model uses something called the 'logit' function as its link function. To understand the logit function intuitively, let us first look at the term 'odds'. We commonly hear this term in sports betting, and we have probably used it ourselves when saying "what are the odds?" about some seemingly improbable event. So what do we actually mean by odds? It is simply the ratio of the number of instances in favor of an event to the number of instances against it. So when we say that the odds of me becoming a professional soccer player are 1 in 1 billion, it means there is only one possibility of me becoming a soccer player against a billion possibilities of that event not happening (I would say these are fair odds considering the impossibility of the task!).
So, in our case, we can define the odds as the ratio of p and 1-p. The logit function is just the logarithm of these odds, hence,
f(y) = log(p/(1-p))
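As a small numerical illustration (a minimal sketch using a few arbitrary probabilities), here is how a probability translates into odds and then into log-odds, i.e. the logit that our link function computes:

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

for p in [0.1, 0.5, 0.9]:
    odds = p / (1 - p)
    print(f"p = {p:.1f}  ->  odds = {odds:.2f}  ->  logit = {logit(p):+.2f}")
```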
Logistic regression mainly aims to calculate the probability 'p' of the occurrence of an event, so let us try to derive an equation for p starting from the original equation below:
log(p/(1-p)) = b + w*x
p/(1-p) = e^(b + w*x)
Writing b + w*x every time is annoying, so let us replace it with a variable t such that,
t = b + w*x
p/(1-p) = e^t
(1-p)/p = 1/(e^t)
(1/p)-1 = 1/(e^t)
(1/p) = 1 + (1/(e^t))
(1/p) = 1 + (e^-t)
p = 1/(1+(e^-t))
Well, after some rearrangement, we were able to find an equation for calculating p. The expression '1/(1+(e^-t))' is an interesting one: it is called the 'logistic' or 'sigmoid' function, and it is the inverse of our original logit link function.
One of the interesting things about this logistic function is the shape of the curve we obtain when we plot t against the logistic function of t (i.e. 1/(1+(e^-t))). It has the form of an 'S' curve, which looks like below:
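Here is a minimal sketch that plots this S-shaped curve, and also checks numerically that the sigmoid really is the inverse of the logit we started from:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    """The logistic (sigmoid) function, p = 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10, 10, 400)
plt.plot(t, sigmoid(t))
plt.xlabel("t")
plt.ylabel("1 / (1 + e^-t)")
plt.title("The logistic (sigmoid) curve")
plt.show()

# Sanity check: sigmoid(log(p / (1 - p))) should give p back
p = 0.8
print(sigmoid(np.log(p / (1 - p))))   # ~0.8
```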
Now, compare this curve with the plot we obtained for our example dataset in the two-dimensional coordinate system, and everything will fall into place.
Let us look at the variable t, which is really just a transformed version of y:
t = b + w*x
So t is the equation of a straight line, which is not helpful for modeling a dataset in a classification problem. Logistic regression therefore applies the logistic function (let us call it 'h') to the variable 't' to instead output an S-shaped curve for the data, which works like a charm:
h(t) = 1/(1+(e^-t))
An important characteristic of this logistic function is how it behaves at various ‘t’ values.
When t = 0, h(t) = 0.5;
when t is very large and positive, h(t) approaches 1;
when t is very large and negative, h(t) approaches 0.
This can be best explained with the below graph:
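We can also verify these values directly with a tiny self-contained check (the values in the comments are approximate):

```python
import math

def h(t):
    return 1.0 / (1.0 + math.exp(-t))

for t in (-10, 0, 10):
    print(f"h({t}) = {h(t):.5f}")
# h(-10) ≈ 0.00005  -> approaches 0
# h(0)   = 0.50000
# h(10)  ≈ 0.99995  -> approaches 1
```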
Before reaching our happy conclusion, an important point to note is that logistic regression actually treats the labels 0 and 1 as probabilities: label 0 corresponds to a 0 percent probability that a particular event occurs and label 1 corresponds to a 100 percent probability that it occurs. So, if we are trying to predict cancer using a logistic regression model, a label 0 instance might refer to a benign tumor whereas a label 1 instance might mean that the tumor is malignant. Since we know that it is impossible to predict anything with 100 percent certainty, logistic regression outputs values between 0 and 1, which again are just probabilities. To predict one class or the other, we need to fix some threshold and output the values that fall above it as label 1 and the values that fall below it as label 0. So, if we fix the threshold at 0.6 (or 60%) for a cancer prediction problem, an output of 0.3 would result in a prediction that the tumor is benign and an output of 0.75 would result in a prediction that the tumor is malignant.
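As a minimal sketch of this thresholding step (with made-up probability outputs), the conversion from probabilities to class labels might look like this:

```python
threshold = 0.6

# Hypothetical probabilities produced by a logistic regression model
predicted_probabilities = [0.30, 0.75, 0.55, 0.90]

# Probabilities above the threshold become label 1 (malignant), the rest label 0 (benign)
predicted_labels = [1 if p >= threshold else 0 for p in predicted_probabilities]
print(predicted_labels)   # [0, 1, 0, 1]
```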
We have seen plenty of evidence so far of how logistic regression, despite being built on top of the regression technique, can be used for classification problems that are completely different from regression problems. But we have completely bypassed the model parameters, the 'weights and biases', and how they are calculated in the first place. Well, the answer is that an ML model doesn't compute these parameters directly: it starts by assuming some random values for them, goes ahead with further computations to fit a model, and finally uses some metric to measure how far the model deviates from the actual data. It then repeats this entire process, tweaking the parameters until the best possible model, one with a relatively low deviation from the actual data, is achieved. The tale of this iterative process and the metric is again an intricate one, which I will try to narrate in a future blog post.
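In practice we rarely run this loop by hand; a library handles the iterative fitting for us. Here is a minimal sketch using scikit-learn on the same kind of hypothetical salary data as before (the numbers are illustrative, not real):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical offered salaries (in thousands) and acceptance labels
X = np.array([[30], [35], [40], [45], [50], [55], [60], [65], [70], [75]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# scikit-learn runs the iterative optimization internally to find w and b
model = LogisticRegression()
model.fit(X, y)

print("w (weight):", model.coef_[0][0])
print("b (bias):  ", model.intercept_[0])

# Predicted probability of acceptance for a new offer of 58k
print("P(accept | 58k):", model.predict_proba([[58]])[0][1])
```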