Increase performance of BulkAppend and BulkFlush #14067
apeforest merged 2 commits into apache:master from
Conversation
@mxnet-label-bot add [pr-awaiting-review, Performance]
yuxihu
left a comment
LGTM. So we should be able to see bigger performance boost when training with Horovod where a process handles a single GPU?
@mxnet-label-bot update [pr-awaiting-merge, Performance]
@yuxihu It is actually the other way around: a single process per GPU alleviates some of these issues, because each GPU is handled independently. This change helps most in single-process, multi-GPU cases (where a single Python thread needs to launch everything on all GPUs) and in small batch size scenarios, where there is little time to launch the work.
@ptrendx Thanks for the explanation. Anyway, it is a good improvement.
Why a shared_ptr rather than unique_ptr?
@junrushao1994 Two reasons:
|
|
But I agree unique_ptr would be the ultimate solution there.
I see. Thanks!
* Better bulkappend
* Fix lint
Description
Increase the performance of the BulkAppend and BulkFlush methods used in Gluon hybridized models with static_alloc=True, static_shape=False.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
BulkStatus and populating that vector.
BulkAppend, but BulkFlush still needed to perform a copy of the entire vector of lambdas, including all of the captured environment. This is alleviated by passing a shared_ptr instead of passing the vector by value, increasing the performance of the BulkFlush function by ~3.5x, from ~70us to ~20us.
Comments
@eric-haibin-lin