Machine learning
Gradient Descent / Line of Best Fit
(While this first one isn’t traditionally thought of as a machine-learning algorithm, understanding gradient descent is vital to understanding how many machine learning algorithms work and are optimized.)
Me-to-grandma:
“Basically, gradient descent helps us get the most accurate predictions based on some data.
Let me explain a bit more – let’s say you have a big list of the height and weight of every person you know. And let’s say you graph that data. It would probably look something like this:
Now let’s say there’s a local guessing competition where the person who guesses someone’s weight correctly, given their height, gets a cash prize. Besides using your eyes to size the person up, you’d have to rely pretty heavily on the list of heights and weights you have at your disposal, right?
So, based on the graph of your data above, you could probably make some pretty good predictions if only you had a line on the graph that showed the trend of the data. With such a line, if you were given someone’s height, you could just find that height on the x-axis, go up until you hit your trend line, and then see what the corresponding weight is on the y-axis, right?
But how in the world do you find that perfect line? You could probably do it manually, but it would take forever. That’s where gradient descent comes in!
Gradient descent finds that line by trying to minimize something called RSS (the residual sum of squares), which is basically the sum of the squares of the differences between our dots and our line, i.e. how far away our real data (the dots) is from our trend line. We get a smaller and smaller RSS by changing where our line sits on the graph, which makes intuitive sense: we want our line to be wherever it’s closest to the majority of our dots.
We can actually take this further and graph each different line’s parameters on something called a cost curve. Using gradient descent, we can get to the bottom of our cost curve. At the bottom of our cost curve is our lowest RSS!
There are more granular aspects of gradient descent, like the “learning rate” or “step size” (i.e. how big a step we take toward the bottom of our cost curve on each pass, while the gradient itself tells us which direction is downhill), but in essence: gradient descent gets our line of best fit by minimizing the space between our dots and our line. Our line of best fit, in turn, allows us to make predictions!”
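If you ever want to peek behind the curtain, here’s a rough Python sketch of that whole idea. The heights, weights, learning rate, and number of steps are all made up for illustration, but it shows the “take small steps downhill until the RSS stops shrinking” loop in action:

```python
import numpy as np

# Made-up heights (inches) and weights (lbs) -- purely illustrative numbers
heights = np.array([60, 62, 64, 66, 68, 70, 72, 74], dtype=float)
weights = np.array([115, 120, 130, 140, 150, 160, 175, 190], dtype=float)

# Centering the heights keeps gradient descent well behaved (a common practical trick)
x = heights - heights.mean()
y = weights

slope, intercept = 0.0, 0.0   # start with an arbitrary (bad) line
learning_rate = 0.01          # how far downhill we move on each step
n = len(x)

for _ in range(2000):
    predictions = slope * x + intercept
    residuals = predictions - y                  # distance from each dot to the line
    rss = np.sum(residuals ** 2)                 # the thing we are trying to minimize

    # The gradient tells us which direction is "downhill" on the cost curve
    grad_slope = (2 / n) * np.sum(residuals * x)
    grad_intercept = (2 / n) * np.sum(residuals)

    slope -= learning_rate * grad_slope          # take a small step downhill
    intercept -= learning_rate * grad_intercept

print(f"best-fit line: weight ~ {slope:.2f} * (height - {heights.mean():.0f}) + {intercept:.1f}")
print(f"final RSS: {rss:.1f}")
```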
Linear Regression
Me-to-grandma:
“Super simply, linear regression is a way we analyze the strength of the relationship between 1 variable (our “outcome variable”) and 1 or more other variables (our “independent variables”).
A hallmark of linear regression, like the name implies, is that the relationship between the independent variables and our outcome variable is linear. For our purposes, all that means is that when we plot the independent variable(s) against the outcome variable, we can see the points start to take on a line-like shape, like they do below.
(If you can’t plot your data, a good way to think about linearity is by answering the question: does a certain amount of change in my independent variable(s) result in the same amount of change in my outcome variable? If yes, your data is linear!)
Another important thing to know about linear regression is that the outcome variable, or the thing that changes depending on how we change our other variables, is always continuous. But what does that mean?
Let’s say we wanted to measure what effect elevation has on rainfall in New York State: our outcome variable (or the variable we care about seeing a change in) would be rainfall, and our independent variable would be elevation. With linear regression, that outcome variable would have to be specifically how many inches of rainfall, as opposed to just a True/False category indicating whether or not it rained at x elevation. That is because our outcome variable has to be continuous — meaning that it can be any number (including fractions) in a range of numbers.
The coolest thing about linear regression is that it can predict things using the line of best fit that we spoke about before! If we run a linear regression analysis on our rainfall vs. elevation scenario above, we can find the line of best fit like we did in the gradient descent section (this time shown in blue), and then we can use that line to make educated guesses as to how much rain one could reasonably expect at some elevation.”
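Here’s what that might look like in code, using scikit-learn’s off-the-shelf linear regression. The elevation and rainfall numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented elevation (feet) and rainfall (inches/year) numbers, just for illustration
elevation = np.array([[100], [500], [1000], [1500], [2000], [2500], [3000]])
rainfall = np.array([34.0, 36.5, 39.0, 42.0, 44.5, 47.0, 50.0])

model = LinearRegression().fit(elevation, rainfall)

# The coefficient is the slope of the line of best fit:
# "for every extra foot of elevation, expect roughly this much more rain"
print(f"slope: {model.coef_[0]:.4f} inches of rain per foot of elevation")
print(f"intercept: {model.intercept_:.1f} inches")

# Now we can make an educated guess for an elevation we never actually measured
print(f"predicted rainfall at 1,750 ft: {model.predict([[1750]])[0]:.1f} inches")
```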
Ridge & LASSO Regression
Me, continuing to hopefully-not-too-scared-grandma:
“So linear regression’s not that scary, right? It’s just a way to see what effect something has on something else. Cool.
Now that we know about simple linear regression, there are even cooler linear regression-like things we can discuss, like ridge regression.
Like gradient descent’s relationship to linear regression, there’s one back-story we need to cover to understand ridge regression, and that’s regularization.
Simply put, data scientists use regularization methods to make sure that their models only pay attention to independent variables that have a significant impact on their outcome variable.
You’re probably wondering why we care if our model uses independent variables that don’t have an impact. If they don’t have an impact, wouldn’t our regression just ignore them? The answer is no! We can get more into the details of machine learning later, but basically we create these models by feeding them a bunch of “training” data. Then, we see how good our models are by testing them on a bunch of “test” data. So, if we train our model on a bunch of independent variables, some that matter and some that don’t, our model will perform super well on our training data (because we’re tricking it into thinking everything we fed it matters), but super poorly on our test data. This is because our model isn’t flexible enough to work well on new data that doesn’t have every. single. little. thing we fed it during the training phase. When this happens, we say the model is “overfit.”
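If you like seeing things in code, here’s a tiny, made-up illustration of that train/test gap. We fit a plain linear regression on data where only two of the 22 variables actually matter, and the model typically scores noticeably better on the data it trained on than on data it has never seen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# 60 rows: 2 independent variables that matter, 20 that are pure noise
X = rng.normal(size=(60, 22))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=3.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print(f"score on training data: {model.score(X_train, y_train):.2f}")  # usually looks quite good
print(f"score on test data:     {model.score(X_test, y_test):.2f}")    # usually noticeably worse
```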
To understand over-fitting, let’s look at a (lengthy) example:
Let’s say you’re a new mother and your baby boy loves pasta. As the months go by, you make it a habit to feed your baby pasta with the kitchen window open because you like the breeze. Then your baby’s cousin gets him a onesie, and you start a tradition of only feeding him pasta when he’s in his special onesie. Then you adopt a dog who diligently sits beneath the baby’s highchair to catch the stray noodles while he’s eating his pasta. At this point, you only feed your baby pasta while he’s wearing the special onesie… and the kitchen window’s open… and the dog is underneath the highchair. As a new mom, you naturally correlate your son’s love of pasta with all of these features: the open kitchen window, the onesie, and the dog. Right now, your mental model of the baby’s feeding habits is pretty complex! But in reality, your baby just loves pasta. The window, the onesie, and the dog are noise, and if you stripped those extra details out of your mental model, you’d explain his eating habits just as well with something much simpler.
That is exactly what regularization can do for a machine learning model.
So, regularization helps your model only pay attention to what matters in your data and gets rid of the noise.
In all types of regularization, there is something called a penalty term, whose strength is controlled by the Greek letter lambda (λ). This penalty term is what mathematically shrinks the noise in our data.
In ridge regression, sometimes known as “L2 regression,” the penalty term is the sum of the squared values of the coefficients of your variables. (Coefficients in linear regression are basically just numbers attached to each independent variable that tell you how much of an effect each will have on the outcome variable. Sometimes we refer to them as “weights.”) In ridge regression, your penalty term shrinks the coefficients of your independent variables, but never actually does away with them totally. This means that with ridge regression, noise in your data will always be taken into account by your model a little bit.
Another type of regularization is LASSO, or “L1” regularization. In LASSO regularization, the penalty term is the sum of the absolute values of the coefficients rather than the sum of their squares. The big practical difference is that LASSO can shrink coefficients all the way to zero. This essentially deletes those features from your data set, because they now have a “weight” of zero (i.e. they’re essentially being multiplied by zero). With LASSO regression, your model has the potential to get rid of most of the noise in your dataset. This is super helpful in some scenarios!”
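Here’s a small, made-up sketch of that difference using scikit-learn. Only the first two variables actually drive the outcome; the other three are pure noise, and the alpha values are arbitrary choices standing in for the penalty term λ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)

# Fake data: 100 rows, 5 "independent variables" -- only the first two actually matter
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)  # columns 2-4 are noise

for name, model in [
    ("plain linear", LinearRegression()),
    ("ridge (L2)",   Ridge(alpha=10.0)),   # alpha plays the role of the penalty term (lambda)
    ("LASSO (L1)",   Lasso(alpha=0.5)),
]:
    model.fit(X, y)
    print(f"{name:14s} coefficients: {np.round(model.coef_, 2)}")

# Typical result: ridge shrinks the noise coefficients toward zero but keeps them,
# while LASSO pushes them all the way to exactly zero.
```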
Logistic Regression
Me-to-grandma:
“So, cool, we have linear regression down. Linear regression = what effect some variable(s) has on another variable, assuming that 1) the outcome variable is continuous and 2) the relationship(s) between the variable(s) and the outcome variable is linear.
But what if your outcome variable is “categorical”? That’s where logistic regression comes in!
Categorical variables are just variables that can only fall within a single category. A good example is days of the week: if you have a bunch of data points about things that happened on certain days of the week, there is no possibility that you’ll ever get a data point that happened sometime between Monday and Tuesday. If something happened on Monday, it happened on Monday, end of story.
But if we think of how our linear regression model works, how would it be possible for us to figure out a line of best fit for something categorical? It would be impossible! That is why logistic regression models output a probability of your datapoint being in one category or another, rather than a regular numeric value. That’s why logistic regression models are primarily used for classification.
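A quick, made-up sketch of what that looks like in practice: instead of predicting a number, the model hands back a probability for each category. The elevations (in thousands of feet) and the rained/didn’t-rain labels below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented example: did it rain (1) or not (0) at a given elevation (thousands of feet)?
elevation = np.array([[0.1], [0.3], [0.6], [0.9], [1.2], [1.5], [1.8], [2.1]])
rained = np.array([0, 0, 0, 1, 0, 1, 1, 1])   # a categorical (yes/no) outcome

model = LogisticRegression().fit(elevation, rained)

# Instead of a point on a line, the model gives a probability for each category
prob_rain = model.predict_proba([[1.0]])[0][1]
print(f"probability of rain at 1,000 ft: {prob_rain:.2f}")
print(f"predicted category: {model.predict([[1.0]])[0]}")   # 1 = rain, 0 = no rain
```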
But back to both linear regression and logistic regression being “linear.” If we can’t come up with a line of best fit in logistic regression, where does the linear part of logistic regression come in? Well, in the world of logistic regression, it’s the log-odds of the outcome variable that have a linear relationship with the independent variables.
But what in the world are the log-odds? Okay here we go….
Odds
The core of logistic regression = odds.
Intuitively, odds are something we understand: they are the ratio of the probability of success to the probability of failure. In other words, they compare the probability of something happening to the probability of it not happening.
For a concrete example of odds, we can think of a class of students. Let’s say the odds of women passing the test are 5:1, while the odds of men passing the test are 3:10. This means that, of 6 women, 5 are likely to pass the test, and that, of 13 men, 3 are likely to pass the test. The total class size here is 19 students (6 women+ 13 men).
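Just to sanity-check that arithmetic, here’s the same bookkeeping in a few lines of Python:

```python
# Odds of 5:1 for women means that, out of every 5 + 1 = 6 women, 5 are expected to pass
women_pass, women_fail = 5, 1
men_pass, men_fail = 3, 10

print(f"women: {women_pass} of {women_pass + women_fail} expected to pass")          # 5 of 6
print(f"men:   {men_pass} of {men_pass + men_fail} expected to pass")                # 3 of 13
print(f"class size: {(women_pass + women_fail) + (men_pass + men_fail)} students")   # 19
```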
So…aren’t odds just the same as probability?
Sadly, no! While probability measures the number of times something happened out of the total number of times everything happened (e.g. 10 heads out of 30 coin tosses), odds measure the ratio of the number of times something happened to the number of times it didn’t happen (e.g. 10 heads to 20 tails).
That means that while probability will always be confined to a scale of 0–1, odds can continuously grow from 0 to positive infinity! This presents a problem for our logistic regression model, because we know that our expected output is a probability (i.e. a number from 0–1).
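In code, the difference between the two is just which number you divide by:

```python
heads, tails = 10, 20
tosses = heads + tails

probability = heads / tosses   # 10 out of 30 -> about 0.33, always between 0 and 1
odds = heads / tails           # 10 to 20     -> 0.5, can grow without limit

print(f"probability of heads: {probability:.2f}")
print(f"odds of heads:        {odds:.2f}")
```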
So, how do we get from odds to probability?
Let’s think of a classification problem…say your favorite soccer team winning over another soccer team. You might say that the odds of your team losing are 1:6, or 0.17. And the odds of your team winning, because they’re a great team, are 6:1, or 6. You could represent those odds on a number line like below:
Now, you wouldn’t want your model to predict that your team will win on a future game just because the magnitude of the odds of them winning in the past is so much bigger than the magnitude of the odds of them losing in the past, right? There is so much more you want your model to take into account (maybe weather, maybe starting players, etc.)! So, to get the magnitude of the odds to be evenly distributed, or symmetrical, we calculate something called the log-odds.
Log-Odds
Log-odds is just shorthand for taking the natural logarithm of the odds. Taking the log of the odds puts them on a symmetrical scale: odds below 1 (unlikely) become negative numbers, odds above 1 (likely) become positive numbers, and even odds (1:1) land exactly at zero. That gives us a scale that’s super easy to work with.
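Here’s the soccer example again on the log-odds scale, just to show that symmetry:

```python
import numpy as np

odds_of_losing = 1 / 6    # 1:6, about 0.17
odds_of_winning = 6 / 1   # 6:1

# On the raw odds scale these look wildly lopsided (0.17 vs 6), but on the
# log-odds scale they sit the same distance from zero, just on opposite sides.
print(f"log-odds of losing:  {np.log(odds_of_losing):+.2f}")   # about -1.79
print(f"log-odds of winning: {np.log(odds_of_winning):+.2f}")  # about +1.79
```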