[MXNET-876] make CachedOp a normal operator #11641
eric-haibin-lin merged 16 commits into apache:master
Conversation
include/mxnet/c_api.h (Outdated)

    MXNET_DLL int MXSymbolCutSubgraph(SymbolHandle sym, SymbolHandle **inputs,
                                      int *input_size);

    int MXMakeSubgraph(SymbolHandle sym, SymbolHandle *input_symbols, mx_uint num_inputs,
python/mxnet/symbol/contrib.py (Outdated)

    return (outs, states)

    def make_subgraph(subg, *args):
src/c_api/c_api_symbolic.cc (Outdated)

    // Construct a node for this subgraph.
    std::vector<nnvm::NodeEntry> inputs(num_inputs);
    for (size_t i = 0; i < inputs.size(); i++) {
      nnvm::Symbol *s = static_cast<nnvm::Symbol*>(input_symbols[i]);
src/c_api/c_api_symbolic.cc (Outdated)

    // Create CachedOp for the node.
    std::vector<std::pair<std::string, std::string> > kwargs;
    kwargs.push_back(std::pair<std::string, std::string>("inline_limit", "0"));
Why not emplace_back? More efficient and less noisy...
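For illustration, a minimal sketch of the suggested change, with hypothetical standalone scaffolding around the two calls:

    #include <string>
    #include <utility>
    #include <vector>

    int main() {
      std::vector<std::pair<std::string, std::string>> kwargs;
      // push_back: constructs a temporary pair, then moves it into the vector.
      kwargs.push_back(std::pair<std::string, std::string>("inline_limit", "0"));
      // emplace_back: forwards the arguments and constructs the pair in place.
      kwargs.emplace_back("inline_limit", "0");
      return 0;
    }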
src/c_api/c_api_symbolic.cc (Outdated)

    n->attrs.parsed = std::make_shared<mxnet::CachedOp>(*s, kwargs);

    // Create a new symbol for this node.
    s = new nnvm::Symbol();
Can this leak? Who manages this one?
I just followed the implementation in other APIs. The symbol is saved in a Python symbol handle. AFAIK, once the Python symbol handle is destroyed, the symbol object is destroyed as well.
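A rough sketch of the lifetime being described, assuming the usual MXNet C-API conventions; the function names ending in Sketch are illustrative, while MXSymbolFree is the real C-API entry point the Python frontend calls:

    #include <mxnet/c_api.h>     // SymbolHandle (an opaque void*)
    #include <nnvm/symbolic.h>   // nnvm::Symbol

    // Creation side: the heap-allocated symbol is handed back through an
    // opaque handle, transferring ownership to the frontend.
    int MXMakeSubgraphSketch(SymbolHandle *out) {
      nnvm::Symbol *s = new nnvm::Symbol();
      *out = static_cast<SymbolHandle>(s);
      return 0;
    }

    // Deallocation side (simplified): the Python Symbol wrapper invokes
    // MXSymbolFree from its destructor, deleting the nnvm::Symbol.
    int MXSymbolFreeSketch(SymbolHandle symbol) {
      delete static_cast<nnvm::Symbol*>(symbol);
      return 0;
    }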
src/imperative/cached_op.cc (Outdated)

    std::shared_ptr<CachedOp> op;
    OpStatePtr forward_state;

    CachedOpActualState(std::shared_ptr<CachedOp> op) {
      Engine::Get()->set_bulk_size(prev_bulk_size);
    }

    struct CachedOpActualState {

      }
    };

    void CachedOpForward(const OpStatePtr& state_ptr,
Missing short documentation stating what the intention is and how it works.
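For example, the requested documentation might read as follows; the wording is mine, and the signature follows the usual FStatefulComputeEx convention rather than being copied from the PR:

    /*
     * CachedOpForward is the forward function registered for the stateful
     * CachedOp operator. It unwraps the CachedOpActualState held in
     * state_ptr, delegates execution to CachedOp::Forward, and records the
     * returned forward state so the matching backward call can reuse it.
     */
    void CachedOpForward(const OpStatePtr& state_ptr,
                         const OpContext& ctx,
                         const std::vector<NDArray>& inputs,
                         const std::vector<OpReqType>& req,
                         const std::vector<NDArray>& outputs);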
    const std::vector<bool> &save_outputs = s.op->save_outputs();
    CHECK_EQ(save_inputs.size(), in_end - in_begin);
    CHECK_EQ(s.op->num_outputs(), out_end - out_begin);
    for (auto it = in_begin; it != in_end; it++) {
really? Where is this documented?
@zheng-da it's a well-known thing for old C++ farts. It's in reference C++ books like http://www.cppstdlib.com/ or Stroustrup, and discussed here: https://stackoverflow.com/questions/1077026/incrementing-iterators-it-more-efficient-than-it
In most cases it probably doesn't make a difference, especially for simple iterators where the iterator is just a pointer. That's why I said it's potentially faster. It's more of a good idiomatic practice to always use pre-increment.
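A minimal self-contained example of the point: post-increment must return the iterator's old value, so a non-trivial iterator copies itself before advancing, while pre-increment simply advances.

    #include <vector>

    // For pointer-like iterators the compiler usually folds both forms into
    // identical code; the copy only costs something for heavier iterator
    // types. Preferring ++it is an idiomatic default, not a measured win.
    void scale_all(std::vector<float> *v) {
      for (auto it = v->begin(); it != v->end(); ++it)
        *it *= 2.0f;
    }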
src/imperative/cached_op.h (Outdated)

    DispatchMode* dispatch_mode,
    std::vector<int> *in_attrs,
    std::vector<int> *out_attrs);

    bool ForwardInferShape(
Missing documentation in function prototypes
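For instance, in the doxygen style used elsewhere in MXNet headers (wording illustrative):

    /*!
     * \brief Infer output shapes of the cached subgraph from the input shapes.
     * \param attrs     attributes of the node that holds the subgraph
     * \param in_attrs  known input shapes; unknown entries are filled if inferable
     * \param out_attrs output shapes to be inferred
     * \return true if all input and output shapes are known after inference
     */
    bool ForwardInferShape(const nnvm::NodeAttrs& attrs,
                           std::vector<TShape> *in_attrs,
                           std::vector<TShape> *out_attrs);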
    'b': mx.nd.empty(shape=(10, 10))})
    e1.forward()
    e2.forward()
    assert_almost_equal(e1.outputs[0].asnumpy(), e2.outputs[0].asnumpy(),
Why almost equal and not equal?
I think it's due to floating-point precision.
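A small standalone example of why bit-exact equality is too strict: float addition is not associative, so two executions that merely reorder the same operations can differ in the last bits.

    #include <cstdio>

    int main() {
      float a = 1e8f, b = -1e8f, c = 1.0f;
      // Prints "1.000000 vs 0.000000": c is lost when added to b first,
      // because 1e8 + 1 rounds back to 1e8 in single precision.
      std::printf("%f vs %f\n", (a + b) + c, a + (b + c));
      return 0;
    }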
@larroy thanks for the review. The code, especially the API design, is still experimental. I'll let you know when the code is ready for review.
I think functions like shape/type inference, FMutate, etc. used in operator registration should not belong to CachedOp only. They should be made generally available to subgraph-type operators, with CachedOp being just a special case.
Force-pushed from 6a03241 to 476fa57
Force-pushed from 0b5df8b to 910cc05
This PR should be rebased onto #12157.
Force-pushed from cea32a3 to 3b616c2
src/operator/subgraph/common.h (Outdated)

    inline bool DefaultSubgraphOpShape(const nnvm::NodeAttrs& attrs,
                                       std::vector<TShape> *in_shapes,
                                       std::vector<TShape> *out_shapes) {
      return DefaultSubgraphOpShape1(*attrs.subgraphs[0], in_shapes, out_shapes);
Maybe rename DefaultSubgraphOpShape1 to something that identifies it as a helper function, for better readability?
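For instance, the rename might read as follows; DefaultSubgraphOpShapeHelper is only a suggested name:

    // Helper: runs regular shape inference on the attached subgraph symbol
    // and writes the results back into the outer node's shape vectors.
    inline bool DefaultSubgraphOpShapeHelper(const nnvm::Symbol &subgraph,
                                             std::vector<TShape> *in_shapes,
                                             std::vector<TShape> *out_shapes);

    inline bool DefaultSubgraphOpShape(const nnvm::NodeAttrs& attrs,
                                       std::vector<TShape> *in_shapes,
                                       std::vector<TShape> *out_shapes) {
      return DefaultSubgraphOpShapeHelper(*attrs.subgraphs[0], in_shapes,
                                          out_shapes);
    }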
    const auto& idx = g.indexed_graph();
    const auto &outputs = idx.outputs();
    /*
     * This is the operator state of CachedOp when CachedOp is used in the symbol
Please elaborate on the necessity of adding this data structure in the description.
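A sketch of the elaboration being asked for, reconstructed from the diff above; the comments are mine, not the PR's:

    // The stateful-operator interface invokes forward and backward through
    // separate callbacks, so something must carry the CachedOp instance and
    // the state recorded by CachedOp::Forward over to the backward pass.
    // This struct is that carrier, created once per operator instance.
    struct CachedOpActualState {
      std::shared_ptr<CachedOp> op;   // the cached subgraph executor
      OpStatePtr forward_state;       // set by forward, consumed by backward

      explicit CachedOpActualState(std::shared_ptr<CachedOp> op) {
        this->op = op;
      }
    };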
    // Clean up what we recorded.
    s.forward_state.reset();

    // The arrays in out_ptrs may be changed by CachedOp.
why would it be changed?
Thanks for updating the comments
    else
      orig_is_train = Imperative::Get()->is_training();
    // TODO(zhengda) is it right to use false here?
    s.op->Backward(false, s.forward_state, in_ptrs, req, out_ptrs);
Please add a more detailed comment on retain_graph=False.
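For example, the requested comment could be added in place; the semantics described are my reading of retain_graph, not text from the PR:

    // retain_graph=false: buffers recorded during the forward pass are
    // released once this backward pass completes, so the same forward state
    // cannot be differentiated a second time. This matches the executor's
    // pattern of calling backward exactly once per forward.
    s.op->Backward(false, s.forward_state, in_ptrs, req, out_ptrs);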
Description
Currently, CachedOp is used to execute the graph of a Gluon hybrid block when the block is hybridized. It is registered as an operator, but it doesn't provide the full set of operator attributes, so it can't be used as a regular operator inside a normal NNVM computation graph. This PR extends CachedOp to make it a normal operator. The main motivation is to use it as a default subgraph operator, as proposed in "Unified integration with external acceleration libraries".
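For a sense of what "a full set of operator attributes" means in NNVM terms, here is a rough sketch in MXNet's registration style; the callback names are placeholders, and the exact set registered by this PR may differ:

    #include <mxnet/op_attr_types.h>
    #include <nnvm/op_attr_types.h>

    // Once an operator registers the standard attribute set, graph passes
    // (shape/type/storage inference) and the executor can treat it like any
    // other node in an NNVM computation graph.
    NNVM_REGISTER_OP(_CachedOp)
    .set_num_inputs(CachedOpNumInputs)                                // placeholder
    .set_num_outputs(CachedOpNumOutputs)                              // placeholder
    .set_attr<nnvm::FInferShape>("FInferShape", CachedOpInferShape)   // placeholder
    .set_attr<nnvm::FInferType>("FInferType", CachedOpInferType)      // placeholder
    .set_attr<FCreateOpState>("FCreateOpState", CreateCachedOpState)  // placeholder
    .set_attr<FStatefulComputeEx>("FStatefulComputeEx<cpu>", CachedOpForward);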