Stochastic Gradient Descent Overview
Points discussed:
What is the need for Optimization?
Any algorithm has an objective of reducing the error, reduction in error is achieved by optimization techniques.
Error: Cross-Entropy Loss in Logistic Regression, Sum of Squared Loss in Linear Regression
What is the Optimization function?
Optimization is a mathematical technique of either minimizing or maximizing some function f(x) by altering x. In many real-time scenarios, we will be minimizing f(x). Well, how do we maximize the f(x) if the thought comes up? It is just possible by minimizing the -f(x). Any function f(x) that we want to either minimize or maximize we call the objective function or criterion.
What is a Loss function?
Whenever we minimize our objective function f(x) we call it a loss function. Loss function takes various names such as Cost function or error function.
Derivative:
Let’s take an example of a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f’(x) or as dx/dy. The derivative f’(x) gives the slope of f(x) at the point x. In other words, it directs us to how a small change in the input will correspond to the change in output. The derivative is useful to minimize the loss because it tells us how to change x to reduce the error or to make a slight improvement in y.
Optimization Methods:
1. Unconstrained – Closed Form Solution:
Steepest descent convergence is when every element of the gradient is zero (at least very close to zero). In some cases, we may be able to avoid running an iterative algorithm and just jump to the critical point by solving the equation Δxfx=0 for x.
Why is the Iterative method more robust than the closed form?
Gradient Descent (First Order Iterative Method):
Gradient Descent is an iterative method. You start at some Gradient (or) Slope and based on the slope, take a step of the descent. The technique of moving x in small steps with the opposite sign of the derivative is called Gradient Descent. In other words, the positive gradient points direct uphill, and the negative gradient points direct downhill. We can decrease the value by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
A new point is proposed:
Where is the learning rate, a positive scalar determining the size of the step? Choose a small constant. A popular method to reach with optimum learning rate (ϵ) is by using the grid search or line search.
Choice of Learning Rate:
Gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making a best small move (either positive or negative) toward the better configurations can be generalized to discrete spaces.
Gradient Descent can be associated with the ball rolling down from a valley and the lowest point is the steepest descent, learning rate (ϵ) consider it as the steps taken by the ball to reach the lowest point of the valley.
For example, let’s consider the below function as the cost function:
Recommended by LinkedIn
Step-by-step approach:
The question arises when the derivative function or f’(x) = 0, in that situation the derivative provides no information about which direction to move. Points where f’(x) = 0 are known as critical points or stationary points.
Few other important terminologies to know before we move to SGD:
Local Minimum:
A local minimum is a point where f(x) is lower than all neighbouring points, so it is no longer possible to decrease f(x) by making baby steps. This is also called local minima (or) relative minimum.
Local Maximum:
Local Maximum is a point where f(x) is higher than at all neighbouring points so it is impossible to increase f(x) by taking baby steps. This is also called a local maxima (or) relative maximum.
Saddle Points:
Some critical points or stationary points are neither maxima nor minima, they are called Saddle points.
Global Minimum:
The global minimum is the smallest value of the function f(x) which is also called the absolute minimum. There would be only one global minimum whereas there could be more than one or more local minimum.
When the input to the function is multidimensional most of the optimization functions fail to find the global minimum as they have multiple local minima, local maxima and saddle points.
This is one of the greatest challenges in optimization, we often settle with the value of f that is minimal not necessarily global minimum in any formal sense.
What is the Jacobian Matrix?
When we have multiple inputs, we must use partial derivatives of each variable xi. The partial derivative a/axi f(x) measures how f changes as only the variable xi increases at point x. The gradient of f is the vector containing all the partial derivatives, denoted by xf(x). In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
Sometimes we need to find all derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as the Jacobian Matrix. Discussion of Jacobian and Hessian matrices is beyond the scope of the current discussion.
Challenges in Gradient Descent:
Stochastic Gradient Descent:
Stochastic Gradient Descent is the extension of Gradient Descent.
Any Machine Learning/ Deep Learning function works on the same objective function f(x) to reduce the error and generalize when new data comes in.
To overcome the challenges in Gradient Descent we are taking a small set of samples, specifically on each step of the algorithm, we can sample a minibatch drawn uniformly from the training set. The minibatch size is typically chosen to be a relatively small number of examples; it could be from one to a few hundred.
Using the examples from the minibatch. The SGD algorithm then follows the expected gradient downhill:
The Gradient Descent has often been regarded as slow or unreliable, it was not feasible to deal with non-convex optimization problems. Now with Stochastic Gradient Descent, machine learning algorithms work very well when trained, though it reaches the local minimum in a reasonable amount of time.
A crucial parameter for SGD is the learning rate, it is necessary to decrease the learning rate over time, so we now denote the learning rate at iteration k as Ek.