Gradient Descent and its Variants in Deep Learning

1. Gradient Descent

Gradient descent is an optimization algorithm used in deep learning to find the weights (parameters) that produce the most accurate predictions. It works iteratively: it makes predictions on the training data, computes the loss (prediction error), and then uses that loss to compute new weights.

[Figure: Gradient descent diagram]

New weights are calculated by this formula:

new_weight = old_weight - learning_rate * gradient

learning_rate - the step size
gradient - the partial derivative of the loss (error) function with respect to the weight; it tells us how the loss changes as the weight changes
gradient = Δ error (loss) / Δ weight
Δ - a small change in
error = predicted_value - real_value
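
As a quick worked example with made-up numbers, one update step with a learning rate of 0.1 looks like this:

Python:

# One gradient descent step with illustrative (made-up) numbers
old_weight = 0.8
learning_rate = 0.1
gradient = 2.5                      # slope of the loss with respect to this weight

new_weight = old_weight - learning_rate * gradient
print(new_weight)                   # 0.8 - 0.1 * 2.5 = 0.55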

Learn more about the learning rate here: https://marko-kovacevic.com/blog/learning-rate-in-deep-learning/

The gradient tells us the slope of the loss (cost) function at our current position and the direction in which we should move to update our parameters.

The gradient descent algorithm calculates the gradient of the loss (cost) function. If we can compute the derivative of a function, we know in which direction to move to minimize it.

The negative gradient points in the direction of the steepest decrease of the function. Since we are trying to minimize the loss, we update the weights in the direction of the negative gradient; stepping in the direction of the positive gradient would increase the loss.
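
To make this concrete, here is a minimal sketch in plain Python (using an assumed toy loss, not a neural network) that repeatedly steps in the direction of the negative gradient of loss(w) = (w - 3)^2 and converges toward its minimum at w = 3:

Python:

# Minimal sketch: minimize loss(w) = (w - 3)**2 by following the negative gradient
def gradient(w):
    return 2 * (w - 3)          # derivative of (w - 3)**2 with respect to w

w = 0.0                         # starting weight
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * gradient(w)

print(w)                        # approximately 3.0, the minimum of the loss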

[Figure: Gradient in the loss function]

2. Variants of Gradient Descent

There are three types of gradient descent:

  1. Batch gradient descent
  2. Stochastic gradient descent
  3. Mini-batch gradient descent

2.1. Batch Gradient Descent

In this type of gradient descent, the whole dataset is used to calculate the gradient of the cost function. This means we calculate the error for every record of the training dataset, and only when the epoch is finished (the algorithm has passed through the entire training dataset) do we update the weights.

Advantages:

  • The decreased update frequency results in a more stable gradient and may result in more stable convergence
  • Fewer updates per epoch make it computationally more efficient than the many updates used in stochastic gradient descent

Disadvantages:

  • It requires the entire training dataset to be available to the algorithm in memory (RAM)
  • Training may become very slow for large datasets

2.1.1. Programming Implementation

Keras:

import tensorflow as tf

# x, y: training inputs and targets (placeholders here);
# batch_size equal to the size of the training dataset gives one weight update per epoch
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=training_dataset_length, epochs=10)

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
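
For intuition, the following is a minimal from-scratch sketch (NumPy, simple linear regression on assumed synthetic data, not part of the Keras example above) of batch gradient descent: the gradient is averaged over the entire training dataset and the weights are updated once per epoch.

Python (NumPy):

import numpy as np

# Assumed synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 2 * x + 1 + 0.1 * rng.normal(size=(100, 1))

w, b, learning_rate = 0.0, 0.0, 0.1
for epoch in range(100):
    error = (w * x + b) - y               # prediction error for ALL records
    grad_w = 2 * np.mean(error * x)       # MSE gradient averaged over the whole dataset
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w           # one weight update per epoch
    b -= learning_rate * grad_b

print(w, b)                               # approximately 2 and 1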

2.2. Stochastic Gradient Descent

In this type of gradient descent, a single training record is used to calculate the gradient of the cost function. This means we calculate the error for one training record per iteration and then update the weights. The term “stochastic” indicates that the one example comprising each batch is chosen at random.

Advantages:

  • Frequent updates immediately give an insight into the performance of the model and the rate of improvement
  • The increased model update frequency can result in faster learning on some problems

Disadvantages:

  • Updating the model so frequently is more computationally expensive than the other gradient descent variants
  • Frequent updates produce a noisy gradient estimate, which can cause the loss to fluctuate instead of decreasing smoothly

2.2.1. Programming Implementation

Keras:

import tensorflow as tf

# batch_size=1: the weights are updated after every single training record (SGD)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=1, epochs=10)

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
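
For comparison, here is the same from-scratch sketch (NumPy, assumed synthetic data as in the batch gradient descent sketch) rewritten as stochastic gradient descent: the records are visited in random order and the weights are updated after every single record.

Python (NumPy):

import numpy as np

# Assumed synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 2 * x + 1 + 0.1 * rng.normal(size=(100, 1))

w, b, learning_rate = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):         # visit the training records in random order
        xi, yi = x[i, 0], y[i, 0]
        error = (w * xi + b) - yi             # error for a single record
        w -= learning_rate * 2 * error * xi   # update the weights after every record
        b -= learning_rate * 2 * error

print(w, b)                                   # approximately 2 and 1, with some noise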

2.3. Mini-batch Gradient Descent

In this type of gradient descent, a small batch of two or more training records is used to calculate the gradient of the cost function. This means we calculate the error for a mini-batch of training records per iteration and then update the weights.

It is a compromise between batch gradient descent (BGD) and stochastic gradient descent (SGD). A mini-batch typically contains between 10 and 1,000 examples, chosen at random. Mini-batching reduces the amount of noise in SGD while still being more efficient than BGD.

Advantages:

  • Leads to more stable updates than SGD
  • Leads to more stable convergence
  • Gets closer to the minimum than SGD, because the averaged gradient is less noisy
  • Does not require the entire training dataset to fit in memory (RAM)
  • Works very well with parallel (vectorized) computation

Disadvantages:

  • Requires an additional “mini-batch size” hyperparameter for the learning algorithm

2.3.1. Programming Implementation

Keras:

import tensorflow as tf

# batch_size=32: the weights are updated after every mini-batch of 32 records
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=32, epochs=10)

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
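
And here is the corresponding from-scratch sketch of mini-batch gradient descent (NumPy, assumed synthetic data as before): the records are shuffled, split into mini-batches of 32, and the gradient is averaged over each mini-batch before the weights are updated. Setting batch_size to the dataset size recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent.

Python (NumPy):

import numpy as np

# Assumed synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 2 * x + 1 + 0.1 * rng.normal(size=(100, 1))

w, b, learning_rate, batch_size = 0.0, 0.0, 0.1, 32
for epoch in range(100):
    indices = rng.permutation(len(x))                    # shuffle the records every epoch
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]        # take the next mini-batch of records
        error = (w * x[batch] + b) - y[batch]
        w -= learning_rate * 2 * np.mean(error * x[batch])  # gradient averaged over the mini-batch
        b -= learning_rate * 2 * np.mean(error)

print(w, b)                                              # approximately 2 and 1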

3. Conclusion

Gradient descent is an iterative optimization algorithm used for calculating the weights that give us the most accurate predictions.

If we have a small dataset and want good precision, we will use batch gradient descent.

If we have a large dataset and want to quickly see in which direction training is going, we will use stochastic gradient descent.

If we have a large dataset, want good precision, and have limited memory (RAM), we will use mini-batch gradient descent.

Thanks for reading this post.

