Eroxl's Notes
AdaGrad
Aliases: Adaptive Gradient Algorithm

AdaGrad is a variation of gradient descent that maintains an individual learning rate for each parameter (similarly to RMSProp). This helps normalize parameter updates, since the gradients for some parameters can be much larger than for others. AdaGrad does this by using the magnitude of a parameter's previous gradients to scale down that parameter's current learning rate.

Usually only the diagonal of the outer-product matrix $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$ is kept, since computing the next step with a diagonal matrix is much faster than with the full matrix.
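A minimal NumPy sketch (variable names are my own) showing why the diagonal is so cheap: the diagonal of the full outer-product accumulator is just the elementwise sum of squared gradients, so it can be stored as a vector instead of a matrix.

```python
import numpy as np

# Three example gradient vectors for a 4-parameter model (made up for illustration).
grads = [np.array([0.5, -1.0, 0.1, 2.0]),
         np.array([0.3, 0.4, -0.2, 1.5]),
         np.array([-0.1, 0.8, 0.05, 0.9])]

# Full outer-product accumulator: sum over steps of g g^T (O(d^2) memory).
G_full = sum(np.outer(g, g) for g in grads)

# Diagonal-only accumulator: running sum of squared gradients (O(d) memory).
G_diag = sum(g ** 2 for g in grads)

# The diagonal of the full matrix matches the cheap vector accumulator.
print(np.allclose(np.diag(G_full), G_diag))  # True
```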

This is then used to calculate the new parameters of the algorithm:

$$\theta_{t+1} = \theta_t - \eta \, \mathrm{diag}(G_t)^{-1/2} \, g_t$$

where $\theta_t$ are the parameters, $\eta$ is the global learning rate, and $g_t$ is the gradient at step $t$.

Note

It's possible for the denominator to be 0, so a small smoothing term $\epsilon$ is usually added (for example $\epsilon = 10^{-8}$).

This means that the parameter update can be rewritten formally, per parameter $i$, as:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$
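As a concrete sketch, here is a bare-bones AdaGrad loop in NumPy. The quadratic toy loss and the hyperparameter values are illustrative assumptions, not part of the note.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: effective per-parameter rate is lr / sqrt(accum + eps)."""
    accum += grad ** 2                       # running sum of squared gradients (diag of G_t)
    theta -= lr * grad / np.sqrt(accum + eps)
    return theta, accum

# Toy problem: minimize 0.5 * theta^T A theta, with very different curvature per coordinate.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)

for _ in range(500):
    grad = A @ theta                         # gradient of the quadratic
    theta, accum = adagrad_step(theta, grad, accum)

print(theta)  # both coordinates shrink toward the minimum at the origin
```

Note how the steep coordinate (curvature 100) accumulates a much larger denominator, so its effective learning rate is scaled down far more than the shallow coordinate's.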