Add contrib.rand_zipfian by eric-haibin-lin · Pull Request #9747 · apache/mxnet

eric-haibin-lin · 2018-02-09T00:29:29Z

Description

Add log-uniform distribution sampler similar to
https://www.tensorflow.org/api_docs/python/tf/nn/log_uniform_candidate_sampler
Note that the TF implementation supports sampling without replacement, which is not available in this PR.
@sxjscience

Checklist

Essentials

Passed code style checking (make lint)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

piiswrong · 2018-02-09T01:41:55Z

python/mxnet/ndarray/contrib.py

 except ImportError:
    pass

 __all__ = []


need to include it in __all__

piiswrong · 2018-02-09T01:43:21Z

python/mxnet/ndarray/contrib.py

+    true_classes = true_classes.as_in_context(ctx).astype('float64')
+    expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range
+    # cast sampled classes to fp64 to avoid interget division
+    sampled_cls_fp64 = sampled_classes.astype('float64')


why is the output always float64?

64-bit is adopted because this sampler is usually used with an extremely large number of classes. The returned samples are always int64. The fp64 here is used to calculate the probability of a particular class. (The limited precision of fp32 treats 50M - 1 and 50M - 2 as the same number, yielding nan when taking the log.)
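The precision argument above can be checked without MXNet. This is a minimal sketch, assuming IEEE-754 single precision emulated through Python's `struct` module; `to_fp32` is a hypothetical helper and not part of the PR:

```python
import math
import struct


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 50M - 1 and 50M - 2 round to the same fp32 value.
a = to_fp32(50000000.0 - 1.0)
b = to_fp32(50000000.0 - 2.0)
same_in_fp32 = (a == b)

# Log-ratio for class id x = 50M - 1, with every intermediate rounded to fp32.
x = 50000000.0 - 1.0
x32 = to_fp32(x)
ratio_fp32 = to_fp32(to_fp32(x32 + 2.0) / to_fp32(x32 + 1.0))
log_fp32 = math.log(ratio_fp32)

# The same quantity computed in fp64.
log_fp64 = math.log((x + 2.0) / (x + 1.0))

print(same_in_fp32, log_fp32, log_fp64)
```

In this emulation the fp32 log-ratio collapses to exactly 0 for class 50M - 1, while the fp64 value stays small but positive.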

piiswrong · 2018-02-09T01:43:38Z

python/mxnet/symbol/contrib.py

+    <NDArray 3x2 @cpu(0)>
+    """
+    assert(isinstance(true_classes, Symbol)), "unexpected type %s" % type(true_classes)
+    if ctx is None:


symbol doesn't need ctx

sxjscience · 2018-02-13T01:26:00Z

python/mxnet/ndarray/contrib.py

+    list of NDArrays
+        A 1-D `int64` `NDArray` for sampled candidate classes, a 1-D `float64` `NDArray` for \
+        the expected count for true classes, and a 1-D `float64` `NDArray` for the \
+        expected count for sampled classes.


We need to write the docstring as:

Returns
--------
samples : NDArray
    A 1-D `int64` `NDArray` for sampled candidate classes
exp_count_true : NDArray
    ...
exp_count_sample : NDArray
    ...

sxjscience · 2018-02-13T01:36:21Z

python/mxnet/ndarray/contrib.py

+    # cast sampled classes to fp64 to avoid interget division
+    sampled_cls_fp64 = sampled_classes.astype('float64')
+    expected_count_sampled = ((sampled_cls_fp64 + 2.0) / (sampled_cls_fp64 + 1.0)).log() / log_range
+    return [sampled_classes, expected_count_true, expected_count_sampled]


No need to return a list here.

sxjscience · 2018-02-13T01:41:38Z

python/mxnet/ndarray/contrib.py

-__all__ = []
+__all__ = ["rand_log_uniform"]
+
+def rand_log_uniform(true_classes, num_sampled, range_max, ctx=None):


I think it should not be called rand_log_uniform, because LogUniform has a specific meaning. It should be called something like rand_zipfian, or log_uniform_candidate_sampler as in TF.

sxjscience · 2018-02-13T01:47:59Z

python/mxnet/ndarray/contrib.py

+    sampled_classes = (rand.exp() - 1).astype('int64') % range_max
+
+    true_classes = true_classes.as_in_context(ctx).astype('float64')
+    expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range


I think it should be expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range * num_sampled. Otherwise it should be called something like prob_true_class.

You are right, I should either multiply it by num_sampled or change the name. Will do an update.
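The distinction between a probability and an expected count can be sketched in plain Python. This is an illustration under stated assumptions, not the MXNet implementation: `rand_zipfian_sketch` and `log_uniform_prob` are hypothetical names, `log_range` is assumed to be `log(range_max + 1)`, and `rand` in the diff is assumed to be uniform on `[0, log_range)`. With a hypothetical `range_max = 5`, the formula gives about 0.22629439 for class 1 and 0.12453879 for class 3, which matches the values in the docstring example quoted in this thread.

```python
import math
import random


def log_uniform_prob(cls, range_max):
    """Probability of drawing `cls` from the log-uniform distribution on [0, range_max)."""
    return math.log((cls + 2.0) / (cls + 1.0)) / math.log(range_max + 1.0)


def rand_zipfian_sketch(true_classes, num_sampled, range_max, seed=None):
    """Pure-Python sketch of the sampler (hypothetical, not the PR's code).

    Returns
    -------
    samples : list of int
        Sampled candidate classes.
    exp_count_true : list of float
        Expected count for each true class (probability * num_sampled).
    exp_count_sample : list of float
        Expected count for each sampled class.
    """
    rng = random.Random(seed)
    log_range = math.log(range_max + 1.0)
    # Draw u ~ Uniform(0, log_range), map it through exp(u) - 1, truncate to int.
    samples = [int(math.exp(rng.uniform(0.0, log_range)) - 1.0) % range_max
               for _ in range(num_sampled)]
    exp_count_true = [log_uniform_prob(c, range_max) * num_sampled
                      for c in true_classes]
    exp_count_sample = [log_uniform_prob(c, range_max) * num_sampled
                        for c in samples]
    # A tuple is returned so callers can unpack the three results directly.
    return samples, exp_count_true, exp_count_sample


samples, exp_count_true, exp_count_sample = rand_zipfian_sketch([3], 4, 5, seed=0)
print(samples, exp_count_true, exp_count_sample)
```

Without the multiplication by `num_sampled`, the second and third results would be per-draw probabilities, which is the naming mismatch raised in the comment above.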

sxjscience · 2018-02-13T01:49:02Z

python/mxnet/ndarray/contrib.py

+    [ 0.12453879]
+    <NDArray 1 @cpu(0)>
+    >>> exp_count_sample
+    [ 0.22629439  0.12453879  0.12453879  0.12453879]


The example output looks suspicious as it does not sum up to 1.

Sorry I've misunderstood the term. It should be correct.

It looked suspicious to me at first glance because the exp_count of 1 is larger than the exp_count of 3, yet the sampling result shows that 3 appears much more often than 1. We need to sample multiple times and test whether the empirical expectation matches the true expectation.

It's just a coincidence for the first 5 samples. If I sample 50 times, it returns:

1 3 3 3 2 0 0 0 0 1 3 1 1 3 0 2 0 4 0 3 1 3 1 2 2 1 1 2 0 1 0 2 0 0 0 0 0 0 4 1 1 4 0 4 2 0 0 2 1 0

0's = 19

1's = 12

2's = 8

3's = 7

4's = 4

OK, looks good
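The tally above can be reproduced and compared with the theoretical expectation. This is a small sketch, assuming `range_max = 5` (inferred from the class ids 0 to 4 in the samples, not stated in the thread) and the log-uniform probability `log((k + 2) / (k + 1)) / log(range_max + 1)`:

```python
import math
from collections import Counter

# The 50 samples quoted in the comment above.
samples_text = (
    "1 3 3 3 2 0 0 0 0 1 3 1 1 3 0 2 0 4 0 3 1 3 1 2 2 "
    "1 1 2 0 1 0 2 0 0 0 0 0 0 4 1 1 4 0 4 2 0 0 2 1 0"
)
samples = [int(tok) for tok in samples_text.split()]
counts = Counter(samples)

range_max = 5  # assumption: inferred from the observed class ids
probs = [math.log((k + 2.0) / (k + 1.0)) / math.log(range_max + 1.0)
         for k in range(range_max)]

# Empirical count next to the expected count for 50 draws.
for k in range(range_max):
    print(k, counts[k], round(probs[k] * len(samples), 2))
```

Under these assumptions the expected counts for 50 draws are roughly 19.3, 11.3, 8.0, 6.2 and 5.1, close to the observed 19, 12, 8, 7 and 4. The per-class probabilities sum to 1 over all classes; the values in the docstring example do not, because they are listed per sampled class rather than per distinct class.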

szha · 2018-02-17T19:41:42Z

python/mxnet/ndarray/contrib.py

+__all__ = ["log_uniform_candidate_sampler"]
+
+# pylint: disable=line-too-long
+def log_uniform_candidate_sampler(true_classes, num_sampled, range_max, ctx=None):


should it go under contrib.random? since other sampling methods in random just have the distribution as name, should we follow the same convention?

Name changed to rand_zipfian to follow the convention. Extra namespaces such as contrib.random might over-complicate APIs since there are just a few operators in nd.contrib.

szha · 2018-02-21T05:52:14Z

python/mxnet/symbol/contrib.py

+    sampled_cls_fp64 = sampled_classes.astype('float64')
+    expected_prob_sampled = ((sampled_cls_fp64 + 2.0) / (sampled_cls_fp64 + 1.0)).log() / log_range
+    expected_count_sampled = expected_prob_sampled * num_sampled
+    return [sampled_classes, expected_count_true, expected_count_sampled]


why a list?

Good catch, I forgot to update this

* draft * move to contrib * rename op * CR comments * Update contrib.py * Update contrib.py * Update random.py * update example in the doc * update example in symbol doc * CR comments * update op name * update op name * update op name in test * update test * Update contrib.py

ZiyueHuang added 3 commits February 2, 2018 22:23

draft

65fcf2f

move to contrib

4ed3ba4

rename op

0792162

eric-haibin-lin requested a review from szha as a code owner February 9, 2018 00:29

piiswrong reviewed Feb 9, 2018

piiswrong reviewed Feb 9, 2018

ZiyueHuang and others added 3 commits February 9, 2018 05:48

CR comments

6162c18

Update contrib.py

105c212

Update contrib.py

6be919c

eric-haibin-lin changed the title ~~Add contrib.rand_log_uniform~~ [WIP] Add contrib.rand_log_uniform Feb 9, 2018

eric-haibin-lin added 3 commits February 9, 2018 09:17

Update random.py

136defb

update example in the doc

4d128a7

update example in symbol doc

436543b

eric-haibin-lin changed the title ~~[WIP] Add contrib.rand_log_uniform~~ Add contrib.rand_log_uniform Feb 11, 2018

sxjscience reviewed Feb 13, 2018

CR comments

1cee16f

eric-haibin-lin requested a review from cjolivier01 as a code owner February 14, 2018 02:30

sxjscience approved these changes Feb 14, 2018

szha reviewed Feb 17, 2018

eric-haibin-lin added 4 commits February 18, 2018 17:41

Merge branch 'master' into log-uniform

2533576

update op name

a765d2f

update op name

c93af4a

update op name in test

c53c2a9

eric-haibin-lin changed the title ~~Add contrib.rand_log_uniform~~ Add contrib.rand_zipfian Feb 19, 2018

Merge remote-tracking branch 'upstream/master' into log-uniform

c17c215

update test

add866d

szha reviewed Feb 21, 2018

Update contrib.py

c48aebe

szha approved these changes Feb 21, 2018

piiswrong approved these changes Feb 22, 2018

Merge remote-tracking branch 'upstream/master' into log-uniform

90d684d

eric-haibin-lin merged commit 9158352 into apache:master Feb 23, 2018

eric-haibin-lin deleted the log-uniform branch September 18, 2018 23:33
