Higher Performance with Lower SM Occupancy through Zero-Copy and TMA Offloading #453

Open
monethuang1 wants to merge 26 commits into deepseek-ai:main from monethuang1:trmt-zero-copy
@monethuang1

The original internode normal kernel occupies a large number of GPU SMs while leaving interconnect bandwidth underutilized, which constrains prefill performance.
The optimized version applies buffer fusion and TMA offloading to enable true zero-copy communication and maximize NVLink bandwidth usage.

Evaluation on H20 clusters shows significant gains:

  • With EP=16, performance (dispatch(FP8) / dispatch(BF16) / combine) improved from 76.50 / 84.05 / 62.50 to 89.46 / 91.82 / 82.27.
  • With EP=32, performance increased from 59.95 / 61.33 / 61.24 to 62.53 / 63.24 / 62.55.
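
To put the figures above in relative terms, here is a small sketch that computes the percentage gain per operation. The numbers are copied from this description as-is (their unit is not stated here); the names `results`, `relative_gain`, and `gains_for` are made up for illustration and are not part of this PR.

```python
# Figures copied from the PR description; their unit is not stated there.
# Tuple order: dispatch(FP8), dispatch(BF16), combine.
results = {
    16: {"before": (76.50, 84.05, 62.50), "after": (89.46, 91.82, 82.27)},
    32: {"before": (59.95, 61.33, 61.24), "after": (62.53, 63.24, 62.55)},
}
labels = ("dispatch(FP8)", "dispatch(BF16)", "combine")


def relative_gain(before: float, after: float) -> float:
    """Percentage improvement of `after` over `before`."""
    return (after - before) / before * 100.0


def gains_for(ep: int) -> dict:
    """Map each operation label to its percentage gain for one EP size."""
    row = results[ep]
    return {
        label: relative_gain(b, a)
        for label, b, a in zip(labels, row["before"], row["after"])
    }


if __name__ == "__main__":
    for ep in sorted(results):
        for label, gain in gains_for(ep).items():
            print(f"EP={ep:<2} {label:<15} +{gain:.2f}%")
```

Under this reading, the largest relative gain is in combine at EP=16 (roughly 32%), while the EP=32 gains are all in the low single digits.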

Additionally, the number of SMs occupied by the kernel was reduced by up to 66.7%. The optimized kernel uses only 12 SMs for EP=16 (a 50% reduction) and 8 SMs for EP=32 (a 66.7% reduction), compared to 24 SMs in the original version.
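
As a quick check of the SM figures, the reduction can be computed from the SM counts quoted above. This is only an arithmetic sketch; `sm_reduction` is a hypothetical helper name, not something defined in this PR.

```python
BASELINE_SMS = 24                 # SMs used by the original kernel
OPTIMIZED_SMS = {16: 12, 32: 8}   # EP size -> SMs used by the optimized kernel


def sm_reduction(before: int, after: int) -> float:
    """Percentage reduction in the number of SMs used."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    for ep, sms in OPTIMIZED_SMS.items():
        pct = sm_reduction(BASELINE_SMS, sms)
        print(f"EP={ep}: {BASELINE_SMS} -> {sms} SMs ({pct:.1f}% fewer)")
```

The "up to 66.7%" figure corresponds to the EP=32 configuration; EP=16 halves the SM count.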
