Gradient descent is an optimization algorithm used in machine learning to minimize a cost function. The cost function describes how well the model performs with a given set of parameters (weights and biases), and gradient descent iteratively updates those parameters to find the set that gives the lowest cost.
For example, the parameters are the coefficients in linear regression and the weights in neural networks.
What is a cost function? (Loss function)
A loss function quantifies the loss during the training phase as a single real number. Loss functions are used in supervised learning algorithms that rely on optimization techniques. Examples include:
- Squared loss, used in linear regression.
- Hinge loss, used in SVM (Support Vector Machine).
While a loss function refers to the error for a single training example, the cost function refers to the average of the losses over the entire dataset.
The cost function quantifies the error between predicted values and expected values as a single real number.
The main goal is to find the set of weights and biases that minimizes the cost. A commonly used cost function is the Mean Squared Error (MSE), which averages the squared difference between the actual value of y and the estimated value of y (the prediction).
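The distinction between a per-example loss and a dataset-level cost can be sketched in a few lines of Python. The function names and the sample numbers below are illustrative assumptions, not part of the original text.

```python
def squared_loss(y_true, y_pred):
    """Loss for one example: squared difference between target and prediction."""
    return (y_true - y_pred) ** 2


def hinge_loss(y_true, score):
    """Loss for one example with a label in {-1, +1} and a raw classifier score."""
    return max(0.0, 1.0 - y_true * score)


def cost(loss_fn, targets, predictions):
    """Cost: the average of the per-example losses over the whole dataset."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Made-up regression targets and predictions (assumed data).
targets = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

# Per-example squared losses are 1, 0 and 4, so the cost is their mean.
mse = cost(squared_loss, targets, predictions)
```

Passing `hinge_loss` instead of `squared_loss` to `cost` gives the averaged hinge loss for a set of labels and classifier scores.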
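Once the hypothesis has been evaluated with the initial parameters, the cost function is calculated. For MSE, this is the mean of the squared differences between the predictions and the actual values over all training examples.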
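A common way to write MSE in symbols is sketched below. The notation is an assumption: n is the number of training examples, (x_i, y_i) is the i-th example, and the hypothesis is taken to be linear with weight θ0 and bias θ1, the names used later in this text. Some texts scale the sum by 1/(2n) instead of 1/n, which does not change the location of the minimum.

```latex
\hat{y}_i = \theta_0 x_i + \theta_1,
\qquad
J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```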
Gradient Descent and its derivation.
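Gradient descent determines a weight vector that minimizes the error E by starting with an arbitrary initial weight vector and then repeatedly modifying it in small steps.
The direction of steepest descent along the error surface is found by computing the gradient of E with respect to the weight vector. The gradient points in the direction of steepest increase, so each step moves in the opposite direction.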
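In symbols, the gradient and the resulting update step can be sketched as follows. The notation is an assumption: w is the weight vector with components w_0 to w_d, and η is the learning rate, a small positive step size.

```latex
\nabla E(\mathbf{w}) =
\left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_d} \right],
\qquad
\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E(\mathbf{w})
```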
In batch gradient descent, the cost of an ML algorithm is calculated over the entire training dataset in every iteration of the algorithm. The batch is the set of samples used to calculate the gradient in one iteration. Here the batch is the whole dataset, which is where the name batch gradient descent comes from.
Derivation steps for MSE:
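A sketch of the derivation, assuming a linear hypothesis with weight θ0 and bias θ1 and n training examples (x_i, y_i). Each partial derivative follows from the chain rule: differentiate the outer square, then multiply by the derivative of the inner term with respect to the parameter.

```latex
J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (\theta_0 x_i + \theta_1) \bigr)^2

\frac{\partial J}{\partial \theta_0}
  = \frac{1}{n} \sum_{i=1}^{n} 2 \bigl( y_i - (\theta_0 x_i + \theta_1) \bigr) \cdot (-x_i)
  = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (\theta_0 x_i + \theta_1) \bigr)

\frac{\partial J}{\partial \theta_1}
  = \frac{1}{n} \sum_{i=1}^{n} 2 \bigl( y_i - (\theta_0 x_i + \theta_1) \bigr) \cdot (-1)
  = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (\theta_0 x_i + \theta_1) \bigr)
```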
The derivation applies partial derivatives and the chain rule, so it helps to be familiar with both.
Partial derivatives show how each individual parameter affects the Mean Squared Error (MSE). In each step of gradient descent, the algorithm iterates through the data points using the current weight ‘θ0’ and bias ‘θ1’ values and computes the partial derivative of the cost with respect to each of them. The resulting gradient gives the slope of the cost function at the current parameter values and therefore the direction in which the parameters should be updated. The size of each update is scaled by the learning rate.
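The loop described above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: it assumes a linear hypothesis with weight `theta0` and bias `theta1`, an MSE cost, and made-up data. The function name, the default learning rate, and the iteration count are all assumptions.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y ~ theta0 * x + theta1 by batch gradient descent on the MSE."""
    theta0, theta1 = 0.0, 0.0  # arbitrary starting weight and bias
    n = len(xs)
    for _ in range(iterations):
        # Residuals over the whole dataset with the current parameters.
        errors = [y - (theta0 * x + theta1) for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to each parameter.
        grad_theta0 = (-2.0 / n) * sum(x * e for x, e in zip(xs, errors))
        grad_theta1 = (-2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        theta0 -= learning_rate * grad_theta0
        theta1 -= learning_rate * grad_theta1
    return theta0, theta1


# Assumed toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the returned weight and bias should approach 2 and 1. Every iteration uses all five points, so this is the batch form of the algorithm.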
The learning rate is a hyperparameter used in the training of neural networks. It is a small positive value, typically in the range between 0.0 and 1.0.
The learning rate controls how quickly the model adapts to the problem. Smaller learning rates make smaller, more careful updates and therefore require more training epochs, whereas larger learning rates result in rapid changes and require fewer training epochs. A learning rate that is too large can overshoot the minimum and fail to converge.
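The effect of the learning rate can be seen on a deliberately simple problem. The sketch below assumes a one-parameter cost, cost(theta) = theta ** 2, whose gradient is 2 * theta, and counts how many updates are needed to get close to the minimum at 0. The function name, the starting point, the tolerance, and the three learning rates are all illustrative assumptions.

```python
def steps_to_converge(learning_rate, start=10.0, tolerance=1e-6, max_steps=2000):
    """Run gradient descent on cost(theta) = theta**2 and count the updates needed."""
    theta = start
    for step in range(max_steps):
        if abs(theta) < tolerance:
            return step  # close enough to the minimum at theta = 0
        gradient = 2.0 * theta
        theta -= learning_rate * gradient
    return None  # did not get within tolerance in max_steps updates


small = steps_to_converge(0.01)     # small steps: many updates
large = steps_to_converge(0.4)      # larger steps: far fewer updates
too_large = steps_to_converge(1.1)  # each step overshoots and moves further away
```

With these settings the larger rate should converge in far fewer updates than the smaller one, while the rate of 1.1 grows in magnitude on every step and returns `None`.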