# Gradient Descent and its Variants in Deep Learning

Gradient descent is an optimization algorithm used for finding the weights (parameters) in deep learning that produce the most accurate predictions. It works iteratively: it makes predictions on the training data, computes the loss (prediction error), and then computes new weights using that loss.

New weights are calculated by this formula:

```
new_weight = old_weight - learning_rate * gradient

learning_rate - the step size
gradient - tells us in which direction to change the weight to move the loss
           toward a minimum; it is calculated as the partial derivative of
           the loss function with respect to the weight
gradient = Δ error (loss) / Δ weight
Δ - change in (a small change in)
error = predicted_value - real_value
```

The gradient tells us the slope of the loss (cost) function at our current position and the direction in which we should move to update our parameters.

The gradient descent algorithm calculates the gradient of the loss (cost) function. If we can compute the derivative of a function, we know in which direction to step in order to minimize it.

The negative gradient points in the direction of the greatest decrease of the function. Because we are trying to minimize the loss, we always step along the negative gradient: stepping along the positive gradient would increase the loss.
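To make the update formula concrete, here is a minimal self-contained sketch (illustrative code, not from the original post) of a single update step for a one-weight model `predicted_value = weight * x` with a squared-error loss on one training record:

```python
# One gradient descent step for a one-weight model with squared-error loss.
old_weight = 0.5
learning_rate = 0.1
x, real_value = 2.0, 3.0

predicted_value = old_weight * x          # model prediction: 1.0
error = predicted_value - real_value      # error = predicted - real: -2.0
gradient = 2 * error * x                  # d(error^2)/d(weight): -8.0
new_weight = old_weight - learning_rate * gradient  # 0.5 - 0.1 * (-8.0) = 1.3
```

Because the gradient here is negative, the update increases the weight, which moves the prediction toward the real value.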

### 2. Variants of Gradient Descent

There are three types of gradient descent: Batch gradient descent, Stochastic gradient descent, and Mini-batch gradient descent.

#### 2.1. Batch Gradient Descent

In this type of gradient descent the whole dataset is used to calculate the gradient of the cost function. This means the error is calculated for every record of the training dataset, and the weights are updated only once the epoch is finished (the algorithm has passed through the entire training dataset).

Advantages:

• Decreased update frequency results in a more stable gradient and may lead to more stable convergence
• Fewer updates make training more computationally efficient than the many updates used in Stochastic gradient descent

Disadvantages:

• It requires the entire training dataset to be available to the algorithm in RAM
• Training may become very slow for large datasets
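As an illustration (a hypothetical from-scratch sketch, not the post's code), batch gradient descent for a one-weight linear model accumulates the gradient over the entire dataset and updates the weight only once per epoch:

```python
def batch_gradient_descent(x, y, learning_rate=0.01, epochs=100):
    """Fit predicted_value = w * x by full-batch gradient descent."""
    w = 0.0
    for _ in range(epochs):
        # Accumulate the mean-squared-error gradient over EVERY record...
        grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
        # ...then update the weight once per epoch.
        w -= learning_rate * grad
    return w

w = batch_gradient_descent(x=[1, 2, 3, 4], y=[3, 6, 9, 12])  # data follow y = 3x
print(round(w, 2))  # prints 3.0
```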

#### 2.1.1. Programming Implementation

Keras:

```python
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')
# Batch gradient descent: batch_size equals the length of the training dataset
model.fit(x, y, batch_size=len(x), epochs=10)
```

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

#### 2.2. Stochastic Gradient Descent

In this type of gradient descent one training record is used to calculate the gradient of the cost function. This means we calculate the error for one training record per iteration and then update the weights. The term “stochastic” indicates that the one example comprising each batch is chosen at random.

Advantages:

• Frequent updates immediately give insight into the performance of the model and its rate of improvement
• The increased model update frequency can result in faster learning on some problems

Disadvantages:

• Updating the model so frequently is more computationally expensive than other gradient descent variants
• Frequent updates produce a noisy gradient signal, which can cause the loss to fluctuate rather than decrease steadily
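The per-record update schedule can be sketched from scratch (illustrative code, not the post's): one randomly chosen record per update.

```python
import random

def stochastic_gradient_descent(x, y, learning_rate=0.01, epochs=100):
    """Fit predicted_value = w * x, updating after every single record."""
    w = 0.0
    for _ in range(epochs):
        for _ in range(len(x)):
            i = random.randrange(len(x))  # "stochastic": pick one record at random
            grad = 2 * (w * x[i] - y[i]) * x[i]
            w -= learning_rate * grad     # update immediately, per record
    return w

random.seed(0)  # for a reproducible run
w = stochastic_gradient_descent(x=[1, 2, 3], y=[2, 4, 6])  # data follow y = 2x
```

Each update uses only one record, so individual steps are noisy, but over many updates `w` drifts toward 2.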

#### 2.2.1. Programming Implementation

Keras:

```python
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')
# Stochastic gradient descent: one training record per update
model.fit(x, y, batch_size=1, epochs=10)
```

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

#### 2.3. Mini-batch Gradient Descent

In this type of gradient descent a mini-batch of two or more training records is used to calculate the gradient of the cost function. This means we calculate the error for that subset of training records per iteration and then update the weights.

It is a compromise between Batch gradient descent (BGD) and Stochastic gradient descent (SGD). Mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch reduces the amount of noise in SGD but is still more efficient than BGD.

Advantages:

• Leads to more stable convergence than SGD
• Gets closer to the minimum than SGD
• Does not require the whole training dataset in RAM
• Very well suited to parallel (vectorized) computation

Disadvantages:

• Requires an additional “mini-batch size” hyperparameter for the learning algorithm
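For comparison (again an illustrative from-scratch sketch, not the post's code), mini-batch gradient descent shuffles the data each epoch and updates the weight once per mini-batch:

```python
import random

def minibatch_gradient_descent(x, y, batch_size=2, learning_rate=0.01, epochs=100):
    """Fit predicted_value = w * x, one update per mini-batch."""
    w = 0.0
    data = list(zip(x, y))
    for _ in range(epochs):
        random.shuffle(data)  # new random mini-batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # The gradient is averaged over the mini-batch only.
            grad = sum(2 * (w * xi - yi) * xi for xi, yi in batch) / len(batch)
            w -= learning_rate * grad
    return w

random.seed(0)
w = minibatch_gradient_descent(x=[1, 2, 3, 4], y=[2, 4, 6, 8])  # data follow y = 2x
```

Only `batch_size` records are needed for each gradient computation, which is what makes this variant practical when the full dataset does not fit in RAM.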

#### 2.3.1. Programming Implementation

Keras:

```python
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')
# Mini-batch gradient descent: 32 records per update
model.fit(x, y, batch_size=32, epochs=10)
```

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

### 3. Conclusion

Gradient descent is an iterative optimization algorithm used for calculating the weights that give us the most accurate predictions.

If we have a small dataset and want good precision, we use Batch gradient descent.

If we have a large dataset and want to quickly see in which direction training is going, we use Stochastic gradient descent.

If we have a large dataset and want good precision but our RAM is limited, we use Mini-batch gradient descent.