Eroxl's Notes
AdaGrad
Aliases: Adaptive Gradient Algorithm

AdaGrad is a variation of gradient descent that maintains an individual learning rate for each parameter (similarly to RMSProp). This helps normalize parameter updates, since the gradients for some parameters can be much larger than for others. AdaGrad does this by using the magnitude of a parameter's previous gradients to scale down that parameter's current learning rate.

Usually only the diagonal of the outer-product matrix $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$ is kept, since computing the next step with a diagonal matrix is much faster than with the full matrix.
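A minimal NumPy sketch (variable names are my own) showing why the diagonal is so cheap: the diagonal of the full outer-product accumulator is just the elementwise sum of squared gradients, so it can be stored as a vector instead of a matrix.

```python
import numpy as np

# Three example gradient vectors for a 4-parameter model (made up for illustration).
grads = [np.array([0.5, -1.0, 0.1, 2.0]),
         np.array([0.3, 0.4, -0.2, 1.5]),
         np.array([-0.1, 0.8, 0.05, 0.9])]

# Full outer-product accumulator: sum over steps of g g^T (O(d^2) memory).
G_full = sum(np.outer(g, g) for g in grads)

# Diagonal-only accumulator: running sum of squared gradients (O(d) memory).
G_diag = sum(g ** 2 for g in grads)

# The diagonal of the full matrix matches the cheap vector accumulator.
print(np.allclose(np.diag(G_full), G_diag))  # True
```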

This is then used to calculate the new parameters of the algorithm:

$$\theta_{t+1} = \theta_t - \eta \, \mathrm{diag}(G_t)^{-1/2} \, g_t$$

where $\theta_t$ are the parameters, $\eta$ is the global learning rate, and $g_t$ is the gradient at step $t$.

Note

It's possible for the denominator to be 0, so a small smoothing term $\epsilon$ is usually added (for example $\epsilon = 10^{-8}$).

This means that the parameter update can be rewritten formally, per parameter $i$, as:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$
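As a concrete sketch, here is a bare-bones AdaGrad loop in NumPy. The quadratic toy loss and the hyperparameter values are illustrative assumptions, not part of the note.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: effective per-parameter rate is lr / sqrt(accum + eps)."""
    accum += grad ** 2                       # running sum of squared gradients (diag of G_t)
    theta -= lr * grad / np.sqrt(accum + eps)
    return theta, accum

# Toy problem: minimize 0.5 * theta^T A theta, with very different curvature per coordinate.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)

for _ in range(500):
    grad = A @ theta                         # gradient of the quadratic
    theta, accum = adagrad_step(theta, grad, accum)

print(theta)  # both coordinates shrink toward the minimum at the origin
```

Note how the steep coordinate (curvature 100) accumulates a much larger denominator, so its effective learning rate is scaled down far more than the shallow coordinate's.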