Increase performance of BulkAppend and BulkFlush #14067
apeforest merged 2 commits into apache:master from
Conversation
@mxnet-label-bot add [pr-awaiting-review, Performance]
yuxihu
left a comment
LGTM. So we should be able to see bigger performance boost when training with Horovod where a process handles a single GPU?
@mxnet-label-bot update [pr-awaiting-merge, Performance]
@yuxihu It is actually the other way around: a single process per GPU alleviates some of these issues, because each GPU is handled independently. This change helps most in single-process, multi-GPU cases (where a single Python thread needs to launch everything on all GPUs) and in small batch size scenarios, where there is little time to launch the work.
@ptrendx Thanks for the explanation. Anyway, it is a good improvement.
Why a shared_ptr rather than unique_ptr?
@junrushao1994 Two reasons:
|
|
But I agree unique_ptr would be the ultimate solution there.
I see. Thanks!
* Better bulkappend
* Fix lint
Description
Increase the performance of the BulkAppend and BulkFlush methods used in Gluon hybridized models with static_alloc=True, static_shape=False.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
BulkStatus and populating that vector.
BulkAppend, but BulkFlush still needed to perform a copy of the entire vector of lambdas, including all of the captured environment. This is alleviated by passing a shared_ptr instead of passing the vector by value, increasing the performance of the BulkFlush function by ~3.5x, from ~70us to ~20us.
Comments
@eric-haibin-lin