The Adam class implements the Adam optimization algorithm for updating model parameters. The Adam optimizer (Kingma & Ba, 2015) with optional L2 weight decay maintains first (m) and second (v) moment estimates and applies bias correction. Classical (non-decoupled) weight decay is applied by adding weightDecay * param to the raw gradient.
Value parameters
beta1
exponential decay rate for the first moment estimates.
beta2
exponential decay rate for the second moment estimates.
eps
small constant added for numerical stability.
lr
base Learning rate for updating the parameters.
parameters
indexed sequence of Variables representing model parameters.
Clip the gradients of all parameters by global norm. Scales gradients so that the total norm ≤ maxNorm. Math: Let g = √(∑_p ‖grad_p‖² ). If g > maxNorm, scale all gradients by (maxNorm / g).
Clip the gradients of all parameters by global norm. Scales gradients so that the total norm ≤ maxNorm. Math: Let g = √(∑_p ‖grad_p‖² ). If g > maxNorm, scale all gradients by (maxNorm / g).
def clipGradValue(minVal: Double, maxVal: Double): Unit
Clip the gradients of all parameters by value (element-wise). Each gradient entry smaller than minVal is set to minVal, and each entry larger than maxVal is set to maxVal.
Clip the gradients of all parameters by value (element-wise). Each gradient entry smaller than minVal is set to minVal, and each entry larger than maxVal is set to maxVal.
Reset gradients for all parameters. Typically called before the next forward/backward pass. Only parameters with non-null gradient buffers are updated.
Reset gradients for all parameters. Typically called before the next forward/backward pass. Only parameters with non-null gradient buffers are updated.