Nesterov Momentum in Deep Learning

Nesterov momentum is a variant of the momentum update. It is also called Nesterov Accelerated Gradient (NAG).

In practice it often works better than standard momentum.

The main idea is to look before you leap. If we know the velocity and direction of an object, we can predict where it will be at time T and calculate the gradient at that predicted location.

Difference between Momentum and Nesterov momentum update

Instead of blindly using momentum to keep going in the direction we were already going, we peek ahead by taking a big jump in the direction of the previous velocity and calculate the gradient from there. We then use that gradient to update the velocity.

Mathematical Implementation

x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v # v already contains the negative gradient step, so it is added

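x_ahead - the look-ahead weights (x shifted by mu * v)
x - the weights
dx_ahead - the gradient evaluated at x_ahead
v - the current velocity vector
mu - the momentum coefficient (commonly around 0.9)
learning_rate - the step size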
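To make the look-ahead step concrete, here is a minimal NumPy sketch of the update above. The toy loss f(x) = 0.5 * sum(x**2), the helper names grad and nesterov_step, and the starting point are assumptions chosen for illustration, not part of the original post.

```python
import numpy as np

def grad(x):
    # Gradient of the assumed toy loss f(x) = 0.5 * sum(x**2).
    return x

def nesterov_step(x, v, mu=0.9, learning_rate=0.1):
    x_ahead = x + mu * v                    # peek ahead along the current velocity
    dx_ahead = grad(x_ahead)                # gradient at the look-ahead point
    v = mu * v - learning_rate * dx_ahead   # velocity update uses that gradient
    x = x + v                               # move the real weights by the new velocity
    return x, v

x = np.array([5.0, -3.0])   # hypothetical starting weights
v = np.zeros_like(x)        # velocity starts at zero
for _ in range(200):
    x, v = nesterov_step(x, v)

final_norm = float(np.linalg.norm(x))
print("distance from the minimum:", final_norm)
```

On this simple bowl-shaped loss the weights should spiral in toward the minimum at the origin; a real network would replace grad with backpropagation.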

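In practice, people prefer to express the update so that it looks as similar as possible to vanilla stochastic gradient descent or to the standard momentum update. This is done by storing the look-ahead weights x_ahead in place of x (and simply calling them x), so that the gradient dx is evaluated at the stored weights.

The same update, written to resemble standard momentum: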

v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
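A quick numerical check can show that the rewritten form tracks the look-ahead weights of the first form. This is a sketch under assumptions: the toy quadratic with curvatures A, the hyperparameters, and the variable names x1, v1, x2, v2 are all invented for the comparison.

```python
import numpy as np

A = np.array([1.0, 10.0])  # assumed curvatures of a toy quadratic loss

def grad(x):
    # Gradient of f(x) = 0.5 * sum(A * x**2).
    return A * x

mu, learning_rate = 0.9, 0.05

# Form 1: store the actual weights and evaluate the gradient at x + mu * v.
x1 = np.array([5.0, -3.0])
v1 = np.zeros_like(x1)

# Form 2: store the look-ahead weights directly.
x2 = x1 + mu * v1
v2 = np.zeros_like(x2)

for _ in range(50):
    # Look-ahead form.
    dx_ahead = grad(x1 + mu * v1)
    v1 = mu * v1 - learning_rate * dx_ahead
    x1 = x1 + v1

    # Rewritten form.
    dx = grad(x2)
    v_prev = v2                                # back this up
    v2 = mu * v2 - learning_rate * dx          # velocity update stays the same
    x2 = x2 - mu * v_prev + (1 + mu) * v2      # position update changes form

# The stored weights of form 2 should equal the look-ahead point of form 1.
forms_match = bool(np.allclose(x2, x1 + mu * v1) and np.allclose(v1, v2))
print("forms match:", forms_match)
```

The two forms differ only in which point they store: form 2 keeps x + mu * v, which is why its gradient can be taken at the stored weights.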

Weight update with standard momentum:

v = mu * v - learning_rate * dx # gradient evaluated at the current x
x += v # integrate the position

Vanilla weight update (without Momentum):

x = x - learning_rate * dx
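The three update rules can be placed side by side on a small problem. The following is only an illustrative sketch: the quadratic loss, the train helper, the step count, and the hyperparameters are assumptions, and the velocity is added to the weights (x + v) in both momentum variants.

```python
import numpy as np

A = np.array([1.0, 10.0])  # assumed curvatures of a toy quadratic loss

def loss(x):
    return 0.5 * np.sum(A * x ** 2)

def grad(x):
    return A * x

def train(method, steps=300, learning_rate=0.05, mu=0.9):
    x = np.array([5.0, -3.0])   # hypothetical starting weights
    v = np.zeros_like(x)
    for _ in range(steps):
        if method == "sgd":
            # Vanilla update: step straight down the gradient.
            x = x - learning_rate * grad(x)
        elif method == "momentum":
            # Standard momentum: gradient taken at the current weights.
            v = mu * v - learning_rate * grad(x)
            x = x + v
        elif method == "nesterov":
            # Nesterov momentum: gradient taken at the look-ahead weights.
            v = mu * v - learning_rate * grad(x + mu * v)
            x = x + v
        else:
            raise ValueError(f"unknown method: {method}")
    return float(loss(x))

final_losses = {m: train(m) for m in ("sgd", "momentum", "nesterov")}
for name, value in final_losses.items():
    print(f"{name:9s} final loss = {value:.3e}")
```

The only difference between the last two branches is where the gradient is evaluated, which is the whole point of the Nesterov variant.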

Programming Implementation

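In Keras, Nesterov momentum is enabled with the nesterov flag of the SGD optimizer:

keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

In TensorFlow 1.x, the arguments below match the signature of tf.train.MomentumOptimizer, where use_nesterov=True enables Nesterov momentum:

    learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=True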

Thanks for reading this post.


  1. 2020. Cs231n Convolutional Neural Networks For Visual Recognition. [online] Available at: <> [Accessed 20 April 2020].
