1. Gradient Descent
Gradient descent is an optimization algorithm used for finding the weights (parameters) of a deep learning model that produce the most accurate predictions. It works iteratively: the model makes predictions on the training data, the loss (prediction error) is computed, and new weights are then calculated from that loss.

New weights are calculated by this formula:
new_weight = old_weight - learning_rate * gradient

- learning_rate - the step size
- gradient - tells us in which direction to move so that the loss moves toward a minimum; it is calculated using the partial derivative of the loss function:

gradient = Δ error (loss) / Δ weight

- Δ - change in (a small change in)
- error = predicted_value - real_value
Learn more about the learning rate here: https://marko-kovacevic.com/blog/learning-rate-in-deep-learning/
The gradient tells us the slope of the loss (cost) function at our current position and the direction in which we should move to update our parameters.
The gradient descent algorithm calculates the gradient of the loss (cost) function. If we can compute the derivative of a function, we know in which direction to proceed to minimize it.
The negative gradient points in the direction of greatest decrease of the function. That is why the update rule subtracts the gradient: a step along the negative gradient decreases the loss, while a step along the positive gradient would increase it.
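To make the update rule concrete, here is a minimal sketch of gradient descent for a one-parameter model y = w * x with a mean squared error loss. The toy data, learning rate and number of steps are made up for illustration; only the update rule itself comes from the formula above.

import numpy as np

# toy dataset generated by y = 2 * x, so the weight we want to recover is 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0              # initial weight
learning_rate = 0.1  # step size

for step in range(50):
    error = w * x - y                    # predicted_value - real_value
    loss = np.mean(error ** 2)           # mean squared error
    gradient = np.mean(2 * error * x)    # partial derivative of the loss with respect to w
    w = w - learning_rate * gradient     # new_weight = old_weight - learning_rate * gradient

print(w)  # converges towards 2.0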

2. Variants of Gradient Descent
There are three types of gradient descent:
- Batch gradient descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
2.1. Batch Gradient Descent
In this type of gradient descent the whole dataset is used to calculate the gradient of the cost function. This means we calculate the error for every record of the training dataset, and only once the epoch is finished (the algorithm has passed through the entire training set) do we update the weights.
Advantages:
- The decreased update frequency results in a more stable gradient and may result in more stable convergence
- Fewer updates make it computationally more efficient than the many updates used in stochastic gradient descent
Disadvantages:
- It requires the entire training dataset to be available to the algorithm in memory (RAM)
- Training may become very slow for large datasets
2.1.1. Programming Implementation
Keras:
import tensorflow as tf

# x, y - training inputs and targets (NumPy arrays), assumed to be defined earlier
training_dataset_length = len(x)  # the whole training set as one batch -> one weight update per epoch
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=training_dataset_length, epochs=10)
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
2.2. Stochastic Gradient Descent
In this type of gradient descent a single training record is used to calculate the gradient of the cost function. This means we calculate the error for one training record per iteration and then immediately update the weights. The term “stochastic” indicates that the one example comprising each batch is chosen at random.
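For intuition, here is a minimal from-scratch sketch of stochastic gradient descent, reusing the toy one-parameter model from section 1; the data, learning rate and number of epochs are again made up for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0
learning_rate = 0.05

for epoch in range(20):
    for i in np.random.permutation(len(x)):     # each training record is visited in random order
        error = w * x[i] - y[i]
        gradient = 2 * error * x[i]             # gradient computed from a single record
        w = w - learning_rate * gradient        # the weights are updated after every record

print(w)  # converges towards 2.0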
Advantages:
- Frequent updates immediately give an insight into the performance of the model and the rate of improvement
- The increased model update frequency can result in faster learning on some problems
Disadvantages:
- Updating the model so frequently is more computationally expensive than other gradient descent variants
- Frequent updates result in a noisy gradient signal, which can cause the loss to fluctuate instead of decreasing smoothly
2.2.1. Programming Implementation
Keras:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=1, epochs=10)  # batch_size=1 -> the weights are updated after every training record
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
2.3. Mini-batch Gradient Descent
In this type of gradient descent a small subset of two or more training records (a mini-batch) is used to calculate the gradient of the cost function. This means we calculate the error for the records of one mini-batch per iteration and then update the weights.
It is a compromise between batch gradient descent (BGD) and stochastic gradient descent (SGD). A mini-batch typically contains between 10 and 1,000 examples, chosen at random. Mini-batching reduces the amount of noise compared to SGD while still being more efficient than BGD.
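As a rough sketch of the same idea from scratch, again on the toy one-parameter model from section 1: the dataset is shuffled and the weights are updated once per mini-batch. The mini-batch size of 2 is only for illustration on this tiny dataset; real mini-batches are usually 10 to 1,000 examples, as noted above.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0
learning_rate = 0.05
batch_size = 2   # illustrative mini-batch size

for epoch in range(20):
    order = np.random.permutation(len(x))        # shuffle the training records
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # indices of one mini-batch
        error = w * x[idx] - y[idx]
        gradient = np.mean(2 * error * x[idx])   # gradient averaged over the mini-batch
        w = w - learning_rate * gradient         # one weight update per mini-batch

print(w)  # converges towards 2.0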
Advantages:
- Leads to more stable weight updates
- Leads to more stable convergence
- Gets closer to the minimum than pure SGD because the gradient estimate is less noisy
- Does not require the entire training dataset to be held in memory (RAM)
- Very well suited to parallel computation
Disadvantages:
- Requires an additional “mini-batch size” hyperparameter for the learning algorithm
2.3.1. Programming Implementation
Keras:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, batch_size=32, epochs=10)  # batch_size=32 -> the weights are updated after every mini-batch of 32 records
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
3. Conclusion
Gradient descent is an optimization algorithm used for calculating the weights that give us the most accurate predictions. It works iteratively.
If we have a small dataset and want good precision, we use batch gradient descent.
If we have a large dataset and want to quickly see in which direction training is going, we use stochastic gradient descent.
If we have a large dataset, want good precision, and our RAM is limited, we use mini-batch gradient descent.
Thanks for reading this post.
4. References
- Google Developers. 2020. Reducing Loss: An Iterative Approach | Machine Learning Crash Course. [online] Available at: <https://developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach> [Accessed 9 May 2020].
- Google Developers. 2020. Reducing Loss: Gradient Descent | Machine Learning Crash Course. [online] Available at: <https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent> [Accessed 9 May 2020].
- Google Developers. 2020. Reducing Loss: Stochastic Gradient Descent. [online] Available at: <https://developers.google.com/machine-learning/crash-course/reducing-loss/stochastic-gradient-descent> [Accessed 9 May 2020].
- Brownlee, J., 2020. A Gentle Introduction To Mini-Batch Gradient Descent And How To Configure Batch Size. [online] Machine Learning Mastery. Available at: <https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/> [Accessed 9 May 2020].
- mc.ai. 2020. Variants Of Gradient Descent. [online] Available at: <https://mc.ai/variants-of-gradient-descent/> [Accessed 9 May 2020].
- Medium. 2020. Understanding The Mathematics Behind Gradient Descent. [online] Available at: <https://towardsdatascience.com/understanding-the-mathematics-behind-gradient-descent-dde5dc9be06e> [Accessed 9 May 2020].