# Nesterov Momentum in Deep Learning

Nesterov momentum is a modified version of the standard momentum update. It is also called Nesterov Accelerated Gradient (NAG).

In practice it tends to work better than standard momentum (read about standard momentum here: https://marko-kovacevic.com/blog/momentum-in-deep-learning/ ).

The main idea is to look before you leap: if we know an object's velocity and direction, we can predict its position a short time ahead and evaluate the gradient there.

Instead of blindly using momentum to keep moving in the direction we were already going, we peek ahead by taking a jump along the previous velocity and compute the gradient at that look-ahead point. We then use that gradient to update the velocity.

### Mathematical Implementation

```
x_ahead = x + mu * v                   # look ahead along the current velocity
v = mu * v - learning_rate * dx_ahead  # dx_ahead: gradient evaluated at x_ahead
x += v                                 # apply the velocity
```

x - weights (parameters)
v - current velocity vector
mu - momentum coefficient (hyperparameter, typically around 0.9)
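As a minimal sketch of these three lines, the look-ahead update can be run on the toy objective f(x) = x² (whose gradient is 2x). The names `grad` and `nesterov_step` are illustrative, not from any library:

```python
def grad(x):
    return 2.0 * x  # gradient of f(x) = x**2

def nesterov_step(x, v, mu=0.9, learning_rate=0.1):
    x_ahead = x + mu * v                         # peek ahead along the velocity
    v = mu * v - learning_rate * grad(x_ahead)   # update velocity with the look-ahead gradient
    x = x + v                                    # apply the velocity
    return x, v

x, v = 5.0, 0.0
for _ in range(100):
    x, v = nesterov_step(x, v)
print(x)  # converges toward the minimum at 0
```

With these settings the iterate contracts toward the minimum at every step, so after 100 iterations `x` is very close to 0.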

In practice, people prefer to express the update so that it looks as similar as possible to vanilla stochastic gradient descent or to the standard momentum update.

Same formula but written to be similar as standard momentum:

```
v_prev = v                         # back this up
v = mu * v - learning_rate * dx    # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v   # position update changes form
```
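To sanity-check this rewrite, a small script (toy gradient, illustrative variable names) can run both forms side by side. The rearranged form stores the look-ahead point, so its variable should stay equal to `x + mu * v` of the original form:

```python
def grad(x):
    return 2.0 * x  # gradient of the toy objective f(x) = x**2

mu, lr = 0.9, 0.1

x, v1 = 5.0, 0.0  # look-ahead form: parameters and velocity
y, v2 = 5.0, 0.0  # rearranged form: y_0 = x_0 + mu * v_0 = x_0 since v_0 = 0

for _ in range(50):
    # original look-ahead form
    x_ahead = x + mu * v1
    v1 = mu * v1 - lr * grad(x_ahead)
    x = x + v1

    # rearranged form
    v_prev = v2
    v2 = mu * v2 - lr * grad(y)
    y += -mu * v_prev + (1 + mu) * v2

# y tracks the look-ahead point x + mu * v of the original form
print(abs(y - (x + mu * v1)) < 1e-6)
```

The two trajectories agree up to floating-point rounding, which is why frameworks can implement Nesterov momentum without ever materializing `x_ahead`.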

Weight update with standard momentum:

```
v = mu * v - learning_rate * dx
x += v
```

Vanilla weight update (without Momentum):

```
x = x - learning_rate * dx
```

### Programming Implementation

Keras:

```python
keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```

https://keras.io/optimizers/

TensorFlow:

```python
tf.compat.v1.train.MomentumOptimizer(
    learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=True
)
```

https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/MomentumOptimizer