[MXNET-58] Layer Normalization in C++ #10029
Conversation
Here's the new doc of InstanceNorm: http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-10029/4/api/python/gluon/nn.html#mxnet.gluon.nn.InstanceNorm @zhanghang1989

@sxjscience fantastic, thank you! We will definitely try this as soon as it's available!

Does anyone have time to review it? The doc page of the latest build is at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-10029/7/index.html

The docs look good to me 👍
using namespace mshadow;
CHECK_EQ(in_shape->size(), 3U) << "Input:[data, gamma, beta]";
const TShape &dshape = in_shape->at(layernorm::kData);
int axis = param.axis;
def test_layer_norm():
    for dtype in [np.float16, np.float32, np.float64]:
        check_layer_normalization((10, 12, 5), -1, 1E-3)
Is any axis allowed?
Can you check all possibilities (even if they theoretically overlap)? -2, -1, 0, 1, 2 (for 3D).
How about 1D and 2D? Are those relevant for this operator?
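For completeness, such a sweep can be written directly against the test helper; the following is only a sketch, assuming the check_layer_normalization(in_shape, axis, eps) helper defined in this PR's test file:

# Sketch: sweep every valid axis (negative and positive, even though
# they alias the same dimensions) for 1D, 2D, and 3D inputs.
# check_layer_normalization(in_shape, axis, eps) is the helper used
# in this PR's test suite.
for in_shape in [(10,), (10, 12), (10, 12, 5)]:
    ndim = len(in_shape)
    for axis in range(-ndim, ndim):
        check_layer_normalization(in_shape, axis, 1E-3)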
check_l2_normalization((nbatch, nchannel, height, width), mode)
def npy_layer_norm(data, gamma, beta, axis=1, eps=1E-5):
Can this be a nested function in check_layer_normalization?
exe.arg_dict['beta'][:] = beta
out_nd = exe.forward()[0]
out = npy_layer_norm(data, gamma, beta, axis, eps)
assert_allclose(out, out_nd.asnumpy(), 1E-4, 1E-4)
Is this the correctness test?
Yes, it compares the operator's output with a NumPy reference implementation.
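For reference, a self-contained NumPy version along the lines of npy_layer_norm might look like the sketch below (illustrative, not the exact helper from the test file):

import numpy as np

def npy_layer_norm(data, gamma, beta, axis=1, eps=1E-5):
    # Normalize over `axis`, then broadcast the per-channel scale
    # (gamma) and shift (beta) along that axis.
    if axis < 0:
        axis += data.ndim
    broadcast_shape = [1] * data.ndim
    broadcast_shape[axis] = data.shape[axis]
    mean = data.mean(axis=axis, keepdims=True)
    std = np.sqrt(data.var(axis=axis, keepdims=True) + eps)
    return ((data - mean) / std * gamma.reshape(broadcast_shape)
            + beta.reshape(broadcast_shape))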
check_layer_normalization((10, 12, 5), -1, 1E-3)
check_layer_normalization((10, 12, 5), 0, 1E-3)
check_layer_normalization((10, 12, 5), 1, 1E-3)
for in_shape in [(10, 6, 5), (5, 5), (2, 3, 3, 3)]:
             beta_initializer='zeros', gamma_initializer='ones',
             in_channels=0, prefix=None, params=None):
    super(LayerNorm, self).__init__(prefix=prefix, params=params)
    self._kwargs = {'eps': epsilon, 'axis': axis}
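For context, using the Gluon layer this constructor belongs to is straightforward; a minimal sketch (shapes illustrative):

import mxnet as mx
from mxnet.gluon import nn

# Normalize the last axis of a (batch, seq_len, hidden) tensor with
# the gluon.nn.LayerNorm added in this PR.
net = nn.LayerNorm(axis=-1, epsilon=1e-5)
net.initialize()
x = mx.nd.random.normal(shape=(2, 5, 10))
y = net(x)      # same shape as x, normalized along the last axis
print(y.shape)  # (2, 5, 10)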
src/operator/nn/layer_norm-inl.h
DMLC_DECLARE_FIELD(axis).set_default(-1)
.describe("The axis to perform layer normalization. "
          "Usually, this should be the axis of the channel dimension. "
          "Negative values mean indexing from right to left.");
src/operator/nn/layer_norm-inl.h
DMLC_DECLARE_FIELD(eps).set_default(1e-5f)
.describe("An `epsilon` parameter to prevent division by 0.");
DMLC_DECLARE_FIELD(output_mean_var).set_default(false)
.describe("Output the mean and std calculated along the given axis.");
Do you have any benchmarks regarding statement 1?
@marcoabreu Yes, here is the benchmark result. My reference implementation is the following LayerNorm that is implemented by stacking broadcasting/reducing operators:

import mxnet as mx
from mxnet.gluon import HybridBlock

class LayerNormStackSmallOp(HybridBlock):
    """Applies layer normalization to the n-dimensional input array.
    Stack bcast/reduce
    """
    def __init__(self, axis=1, epsilon=1e-5, center=True, scale=True,
                 beta_initializer='zeros', gamma_initializer='ones',
                 in_channels=0, prefix=None, params=None):
        super(LayerNormStackSmallOp, self).__init__(prefix=prefix, params=params)
        self._kwargs = {'eps': epsilon, 'axis': axis}
        self._axis = axis
        self._epsilon = epsilon
        self._center = center
        self._scale = scale
        assert in_channels != 0, "in_channels == 0 is currently not supported"
        # gamma scales and beta shifts; the center/scale flags gate each
        # parameter's gradient via grad_req.
        self.gamma = self.params.get('gamma', grad_req='write' if scale else 'null',
                                     shape=(in_channels,), init=gamma_initializer,
                                     allow_deferred_init=True)
        self.beta = self.params.get('beta', grad_req='write' if center else 'null',
                                    shape=(in_channels,), init=beta_initializer,
                                    allow_deferred_init=True)

    def moments(self, F, data):
        mean = F.mean(data=data, axis=self._axis, keepdims=True)
        var = F.mean(F.square(F.broadcast_minus(data, mean)),
                     axis=self._axis, keepdims=True)
        return mean, var

    def hybrid_forward(self, F, data, gamma, beta):
        if not self._center and not self._scale:
            return data
        mean, var = self.moments(F, data)
        norm_data = F.broadcast_minus(data, mean)
        # F.rsqrt (rather than mx.sym.rsqrt) keeps the block usable in
        # both imperative (NDArray) and symbolic mode.
        norm_data = F.broadcast_mul(norm_data, F.rsqrt(var + self._epsilon))
        norm_data = F.broadcast_mul(norm_data, gamma)
        norm_data = F.broadcast_add(norm_data, beta)
        return norm_data

I ran the layer normalization on data with shape=(128, 1024, 100), axis=-1.
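The timing setup itself is not quoted in the thread; a minimal harness along the following lines could reproduce the comparison (a sketch with a hypothetical bench helper; LayerNormStackSmallOp is the block above, nn.LayerNorm the native layer this PR adds):

import time
import mxnet as mx
from mxnet.gluon import nn

def bench(block, x, n_repeats=50):
    block.initialize()
    block.hybridize()
    block(x).wait_to_read()   # warm-up; triggers deferred init
    start = time.time()
    for _ in range(n_repeats):
        y = block(x)
    mx.nd.waitall()           # wait for MXNet's async engine to finish
    return (time.time() - start) / n_repeats

x = mx.nd.random.normal(shape=(128, 1024, 100))
print('stacked:', bench(LayerNormStackSmallOp(axis=-1, in_channels=100), x))
print('native :', bench(nn.LayerNorm(axis=-1, in_channels=100), x))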
Great numbers, thanks a lot. Good job!
* add layer_norm + fix batch_norm doc
* add test
* add layer normalization in Gluon
* update
* fix __repr__ + lint
* fix doc
* fix threshold
* fix doc
* fix bug
* enable inplace + fix test
* try to fix test
* fix doc
Is there a way to infer the in_channels? I am implementing a Scale layer, which has the same problem.

assert in_channels != 0, "in_channels == 0 is currently not supported"
Currently no. I'll try to support it soon.

@marvis, would you submit an issue describing the problem with some examples? I've rechecked the code and the LayerNorm layer should support in_channels=0.
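A quick way to confirm that: with the default in_channels=0, the gamma/beta shapes are deferred and inferred from the first input. A sketch (shapes illustrative):

import mxnet as mx
from mxnet.gluon import nn

net = nn.LayerNorm(axis=-1)  # in_channels left at its default of 0
net.initialize()
x = mx.nd.ones((4, 8, 16))
y = net(x)                   # deferred initialization happens here
print(net.gamma.shape)       # (16,), inferred from the last axis of x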
Description

Checklist

Essentials
* Passed code style checking (make lint)

Changes

Comments
We can improve the speed further by fusing the operators. This is left as future work.