FTML optimizer implementation #9262
Conversation
python/mxnet/optimizer.py (outdated diff)
sigma_t = d_t - self.beta1 * prev_d
z_t = self.beta1 * prev_z + (1 - self.beta1) * grad - sigma_t * weight
# update weight
weight[:] = - z_t / d_t - lr * wd * weight
I think we should merge the wd term into the gradient. @szhengac could you help check this?
The rest of the formulas look good.
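For concreteness, a minimal standalone sketch of one FTML step with the wd term folded into the gradient, along the lines suggested above. The function name, the `prev_d`/`prev_v`/`prev_z` state layout, and the default hyperparameters are assumptions for illustration, not the PR's actual code:

```python
import mxnet as mx

def ftml_step(weight, grad, prev_d, prev_v, prev_z, t, lr,
              beta1=0.6, beta2=0.999, epsilon=1e-8,
              rescale_grad=1.0, wd=0.0):
    """Hypothetical sketch of one FTML update (t is the 1-based step count);
    the wd term is merged into the gradient instead of being applied separately."""
    grad = grad * rescale_grad + wd * weight          # fold the L2/weight-decay term into the gradient
    v_t = beta2 * prev_v + (1 - beta2) * grad * grad  # second-moment estimate
    d_t = (1 - beta1 ** t) / lr * (mx.nd.sqrt(v_t / (1 - beta2 ** t)) + epsilon)
    sigma_t = d_t - beta1 * prev_d
    z_t = beta1 * prev_z + (1 - beta1) * grad - sigma_t * weight
    weight[:] = - z_t / d_t                           # no separate "- lr * wd * weight" term any more
    return d_t, v_t, z_t
```

Whether the merged wd term should also pass through gradient clipping is discussed further down in this thread.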
Testing the optimizer is a difficult problem and we haven't found a good solution. For now I think this kind of test, which optimizes a simple problem and checks the error, should be enough: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L648-L672. Also, would you add the C++ version? If C++ is added, we can test it against the Python version.
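As an illustration of that kind of test, a hedged sketch that optimizes a small quadratic and checks the final error; it assumes the PR registers the optimizer under the name 'ftml' and that the keras-contrib defaults apply:

```python
import mxnet as mx
import numpy as np

def test_ftml_on_quadratic():
    """Sketch: minimize f(w) = ||w - target||^2 with FTML and check the error."""
    target = mx.nd.array([1.0, -2.0, 3.0])
    weight = mx.nd.zeros(3)
    # 'ftml' and its keyword arguments are assumed from this PR / the description
    opt = mx.optimizer.create('ftml', learning_rate=0.1,
                              beta1=0.6, beta2=0.999, epsilon=1e-8)
    state = opt.create_state(0, weight)
    for _ in range(500):
        grad = 2 * (weight - target)   # analytic gradient of the quadratic
        opt.update(0, weight, grad, state)
    err = np.max(np.abs((weight - target).asnumpy()))
    assert err < 1e-2
```

Once the C++ kernel exists, a stronger test can run the Python and C++ updates side by side on random weights and gradients and assert that the results match.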
Thanks for your comments.
@register
class FTML(Optimizer):
I thought we already had this.
We have FTRL (Follow the regularized leader). This PR adds FTML (Follow the moving leader).
grad = grad * self.rescale_grad
if self.clip_gradient is not None:
    grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)
grad += wd * weight
We should clip after adding the gradient of the L2 term. This is consistent with other optimizers.
It seems that the L2 term is outside the clipping in other optimizers, such as SGD in https://github.com/apache/incubator-mxnet/blob/master/src/operator/optimizer_op-inl.h#L76-L78.
In Adam, it's clipped outside. So our current optimizers have this kind of inconsistent behavior. I think clipping the gradient without wd is wrong.
Got it. Thanks. Now the WD part is added into the gradient.
using namespace mshadow_op;
const DType grad_i = clip_grad >= 0.0f
    ? (clip::Map(rescale_grad * grad[i], clip_grad) + wd * weight[i])
    : (rescale_grad * grad[i] + wd * weight[i]);
We should clip after adding the gradient of the L2 term. This is consistent with other optimizers.
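To make the two orderings in this discussion concrete, a small NDArray sketch (illustrative only; the function names are made up for the example):

```python
import mxnet as mx

def grad_wd_after_clip(grad, weight, rescale_grad, clip_gradient, wd):
    # clip the rescaled gradient first, then add the weight-decay term (as in the diff above)
    g = mx.nd.clip(grad * rescale_grad, -clip_gradient, clip_gradient)
    return g + wd * weight

def grad_wd_before_clip(grad, weight, rescale_grad, clip_gradient, wd):
    # add the weight-decay term first, then clip (the ordering asked for in the review)
    g = grad * rescale_grad + wd * weight
    return mx.nd.clip(g, -clip_gradient, clip_gradient)
```

The two only differ when the combined gradient exceeds the clipping threshold; the point of the review comments is that the Python and C++ paths should at least agree on one convention.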
* ftml implementation
* c++ version and test
* merge WD into gradients


Description
FTML optimizer implementation, requested in #9182
The default values of `beta1`, `beta2`, and `epsilon` are the same as in keras-team/keras-contrib#110.
How to add a test to verify the correctness of the implementation? @sxjscience
Here is the test for FTML in keras-contrib. Is that OK?
I have done only one experiment: FTML (val acc 0.756210 at the 10th epoch) converges faster than momentum SGD (val acc 0.684095 at the 10th epoch) on cifar10, both using `lr = 0.001`, `wd = 0` and `resnet18_v1`.
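For reference, a hedged sketch of how such a comparison could be set up with Gluon (the 'ftml' name assumes this PR's registration, the momentum value for the SGD baseline is a guess, and data loading plus the training loop are omitted):

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

def make_trainer(net, opt_name):
    """Build a Trainer with the settings described above (lr = 0.001, wd = 0)."""
    if opt_name == 'ftml':
        params = {'learning_rate': 0.001, 'wd': 0.0}
    else:
        # momentum SGD baseline; the momentum value is an assumption
        params = {'learning_rate': 0.001, 'wd': 0.0, 'momentum': 0.9}
    return mx.gluon.Trainer(net.collect_params(), opt_name, params)

net = vision.resnet18_v1(classes=10)   # cifar10 has 10 classes
net.initialize(mx.init.Xavier())
trainer = make_trainer(net, 'ftml')    # or 'sgd' for the baseline
```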
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments