Less cudaGet/SetDevice calls in Gluon execution #13764
eric-haibin-lin merged 6 commits into apache:master
@ptrendx Can you look into failing CI builds?
eric-haibin-lin left a comment
Thanks for the fix! One question:
    for (int i = 0; i < n; ++i) {
      device_store.SetDevice(gpus[i]);
      // Restores active device to what it was before EnableP2P
      mxnet::common::cuda::DeviceStore device_store(gpus[i]);
Is cudaGetDevice costly? This change would cause 2x cudaGetDevice calls.
This code is executed only during initialization, so I'm not concerned about its performance (to answer your question, though: cudaGetDevice is slightly less costly than cudaSetDevice).
I made the change here only because it then becomes a real RAII guard instead of just a SetDevice call.
@ctcyang could you also take a look?
Not sure why the website check is showing as pending - it seems to have finished successfully in Details view.
Nice work! The only nitpick I have is that after these changes, the only place where cudaSetDevice is still used directly is: https://github.com/apache/incubator-mxnet/blob/e9a7aa42ec380d92b1623025d6434b8856724402/src/engine/threaded_engine_pooled.cc#L136
Could you change that to use this new API too?
* Remove unnecessary cudaGetDevice/cudaSetDevice calls
* Fixes for the DeviceGuard
* Retrigger CI
* Fix for possible invalid device ordinal when using DeviceStore while driver is unloading
* Fix for RTC when the driver API call is the first call
* Added DeviceStore to pooled engine
Description
This PR reduces the number of cudaGetDevice/cudaSetDevice calls during Gluon execution.
Previously, on every call to allocate or free a buffer in StorageManager, DeviceStore would call cudaGetDevice once and cudaSetDevice twice (to get the current device, set the new device, and finally restore the original device), even if no actual allocation took place (because the caching allocator served the request).
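To illustrate the call pattern described above, here is a minimal sketch that counts device switches for an eager guard versus a guard that skips redundant switches. It is not the code from this PR: `FakeRuntime`, `EagerGuard`, `LazyGuard`, and `CountSetCalls` are hypothetical names, and `FakeRuntime` is a mock standing in for cudaGetDevice/cudaSetDevice so the example runs without a GPU. Skipping the switch when the device already matches is only one possible way to cut the calls, assumed here for illustration.

```cpp
#include <cassert>

// Hypothetical mock of the CUDA runtime device state; it only counts calls.
struct FakeRuntime {
  int current = 0;
  int get_calls = 0;
  int set_calls = 0;
  int GetDevice() { ++get_calls; return current; }
  void SetDevice(int dev) { ++set_calls; current = dev; }
};

// Eager guard: one get and two sets per use, as in the description above.
class EagerGuard {
 public:
  EagerGuard(FakeRuntime* rt, int dev) : rt_(rt), saved_(rt->GetDevice()) {
    rt_->SetDevice(dev);
  }
  ~EagerGuard() { rt_->SetDevice(saved_); }
  EagerGuard(const EagerGuard&) = delete;
  EagerGuard& operator=(const EagerGuard&) = delete;

 private:
  FakeRuntime* rt_;
  int saved_;
};

// Lazy guard (assumed variant): switches and restores only when the
// requested device differs from the one that was active on entry.
class LazyGuard {
 public:
  LazyGuard(FakeRuntime* rt, int dev)
      : rt_(rt), saved_(rt->GetDevice()), switched_(saved_ != dev) {
    if (switched_) rt_->SetDevice(dev);
  }
  ~LazyGuard() {
    if (switched_) rt_->SetDevice(saved_);
  }
  LazyGuard(const LazyGuard&) = delete;
  LazyGuard& operator=(const LazyGuard&) = delete;

 private:
  FakeRuntime* rt_;
  int saved_;
  bool switched_;
};

// Creates and destroys a guard `uses` times and returns how many
// SetDevice calls the mock runtime observed.
template <typename Guard>
int CountSetCalls(int start_device, int target_device, int uses) {
  FakeRuntime rt;
  rt.current = start_device;
  for (int i = 0; i < uses; ++i) {
    Guard guard(&rt, target_device);  // restored at end of each iteration
  }
  return rt.set_calls;
}
```

With the mock, `CountSetCalls<EagerGuard>(0, 0, 10)` records two switches per use even though the device never changes, while `CountSetCalls<LazyGuard>(0, 0, 10)` records none; both guards still restore the original device when a real switch is needed.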