This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[FFI] Add new containers and Implementations#19685

Merged
leezu merged 44 commits into apache:master from barry-jin:ffi-container
Mar 9, 2021

Conversation

@barry-jin
Contributor

@barry-jin barry-jin commented Dec 16, 2020

Description

This is the follow-up PR for RFC #19672. A Map container is added, and more data types, such as dictionaries and lists of strings, are supported by the new FFI.

  • Make ADT container and MAP container support NDArray type.
  • Adopt PackedFunc based FFI on CachedOp.
    • Some CachedOp functions are implemented: create, free, invoke, get_optimized_symbol
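The core idea behind a PackedFunc-based FFI is that every argument crosses the language boundary tagged with a type code, so a single generic C entry point can accept integers, strings, lists, and maps uniformly. The following is a minimal, self-contained Python sketch of that idea; the type codes, function names, and classes are illustrative stand-ins, not MXNet's actual API:

```python
# Illustrative sketch of a PackedFunc-style FFI: each argument is tagged
# with a type code so one generic entry point can handle all types.
# The codes and names below are hypothetical, not MXNet's real ones.
K_INT, K_STR, K_LIST, K_MAP = 0, 1, 2, 3

def pack_arg(value):
    """Tag a Python value with an FFI type code, recursing into containers."""
    if isinstance(value, int):
        return (K_INT, value)
    if isinstance(value, str):
        return (K_STR, value)
    if isinstance(value, list):
        return (K_LIST, [pack_arg(v) for v in value])
    if isinstance(value, dict):
        return (K_MAP, {k: pack_arg(v) for k, v in value.items()})
    raise TypeError(f"unsupported FFI type: {type(value)!r}")

def packed_call(func, *args):
    """Dispatch any arguments through a single generic calling convention."""
    return func([pack_arg(a) for a in args])

def echo_types(packed_args):
    # A toy "backend" that just reports the type codes it received.
    return [code for code, _ in packed_args]
```

For example, `packed_call(echo_types, 1, "a", ["x"], {"k": 2})` returns `[0, 1, 2, 3]`, showing that scalars and the new container types all travel through the same packed calling convention.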

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@mxnet-bot

Hey @barry-jin, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, edge, website, windows-gpu, sanity, windows-cpu, unix-gpu, unix-cpu, centos-cpu, centos-gpu, miscellaneous]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added the pr-work-in-progress PR is still work in progress label Dec 16, 2020
@barry-jin
Contributor Author

@mxnet-bot run ci [unix-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, unix-gpu]

@barry-jin
Contributor Author

@mxnet-bot run ci [unix-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, unix-cpu]

@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 13, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Feb 16, 2021
@barry-jin
Contributor Author

After benchmarking on GluonNLP, I see some improvement in the single forward step. The average improvements are listed below (each latency is the average over runs with different batch_size and sequence_length inputs).

| model | training latency without this PR (s) | training latency with this PR (s) | improvement (s) |
| --- | --- | --- | --- |
| google_en_uncased_bert_base | 0.09161326 | 0.09133351 | 0.00027974 |
| google_en_uncased_bert_base | 0.3565172 | 0.35624171 | 0.000275489 |
| google_en_uncased_bert_large | 0.91762223 | 0.9173615 | 0.000260731 |
| google_albert_base_v2 | 0.38036531 | 0.38022336 | 0.00014195 |
| google_albert_large_v2 | 0.74285129 | 0.74271887 | 0.000132424 |
| google_albert_xlarge_v2 | 1.53808278 | 1.53795535 | 0.000127428 |
| google_albert_xxlarge_v2 | 2.49918614 | 2.49904376 | 0.000142379 |
| google_electra_small | 0.07791454 | 0.07770361 | 0.000210933 |
| google_electra_base | 0.35639018 | 0.35617552 | 0.000214658 |
| google_electra_large | 0.91575478 | 0.9154471 | 0.000307674 |
| google_uncased_mobilebert | 0.1725719 | 0.17218696 | 0.000384942 |
| fairseq_bart_base | 0.43927581 | 0.43899117 | 0.00028464 |
| fairseq_bart_large | 0.70489126 | 0.70455636 | 0.0003349 |
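The improvement column above is simply the per-step difference between the two latency columns; for example, for the first google_en_uncased_bert_base row:

```python
# Recompute the first row's improvement from the two latency columns above.
without_pr = 0.09161326  # training latency without this PR (s)
with_pr = 0.09133351     # training latency with this PR (s)
improvement = without_pr - with_pr  # ~0.00027975 s saved per forward step
```

The per-step savings are small in absolute terms, but they are pure framework overhead, so they accumulate over every invocation in a training run.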

I have also compared training and inference time on a real workload: running the google_electra_small model on the SQuAD dataset gives the following results.

| Training/Inferencing | Latency without this PR | Latency with this PR | Throughput without this PR (samples/s) | Throughput with this PR (samples/s) |
| --- | --- | --- | --- | --- |
| Training | 1.59179 h | 1.48754 h | 70 | 75 |
| Inferencing | 55.566 s | 55.41125 s | 216.35 | 216.96 |
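From the training numbers above, the end-to-end relative improvement can be computed directly:

```python
# Relative end-to-end training speedup implied by the SQuAD numbers above.
hours_without = 1.59179  # training time without this PR (h)
hours_with = 1.48754     # training time with this PR (h)
speedup_pct = (hours_without - hours_with) / hours_without * 100
# roughly 6.5% faster end-to-end training
```

This is consistent with the throughput column (70 vs. 75 samples/s), since throughput and wall-clock time are inversely related for a fixed dataset.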

Environment

python_version 3.6.9
instance g4dn.2x
system Linux
cpu x86_64
architecture 64bit
fp16 FALSE
cpu_ram_mb 63622
use_gpu TRUE
num_gpus 1
gpu Tesla T4
gpu_ram_mb 15079
gpu_power_watts 70
gpu_performance_state 0

@barry-jin
Contributor Author

@mxnet-bot run ci [windows-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, windows-cpu]

@barry-jin
Contributor Author

I have replaced the backend APIs (MXInvokeCachedOp, MXNET_REGISTER_GLOBAL("cached_op.invoke")) with simple or dummy implementations so that we can fully expose the overhead of the API call with and without this PR by removing the computational cost. The results are shown below:

Without this PR, a CachedOp invocation call takes around 7.22 us, and most of the overhead is in making cython/python args. With this PR, it takes around 4.041 us, and most of the overhead is in type translation/checking in the PackedFunc system.

CachedOp invocation in cython code:
[Screenshot: Screen Shot 2021-02-17 at 5 42 22 PM]
CachedOp invocation with the new FFI implementation (accelerated by cython):
[Screenshot: Screen Shot 2021-02-17 at 5 42 30 PM]

Comment on lines +521 to +526
// } else if (type_code_ == kStr) {
// return std::string(value_.v_str);
// } else {
// CHECK(IsObjectRef<tvm::runtime::String>());
// return AsObjectRef<tvm::runtime::String>().operator std::string();
// }
Contributor


Let's remove the unused code?

@barry-jin
Contributor Author

@mxnet-bot run ci [windows-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu]

@barry-jin
Contributor Author

@mxnet-bot run ci [windows-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu]

@barry-jin
Contributor Author

@mxnet-bot run ci [windows-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu]

@szha
Member

szha commented Feb 26, 2021

@mxnet-bot run ci [windows-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu]

@szha
Member

szha commented Feb 26, 2021

@mxnet-bot run ci [windows-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu]


Labels

pr-awaiting-review PR is waiting for code review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants