[DSv32] Overlap indexer weights_proj during dual_stream decode#16637
Fridge003 merged 1 commit into sgl-project:main
Conversation
ca63555 to b8c807d
Hi @Fridge003, here are the results. I ran the AFTER case twice. The total runtime also dropped from 26 min to 24.5 min, a 1.06x speedup.
@Fridge003 There is an additional overlap opportunity of about 10.1 μs per layer (2.112 + 6.176 + 1.856), or roughly 0.62 ms across 61 layers, from hiding the kv norm, nvjet, and qk rope kernels circled in red. None of them has a data dependency on the indexer.
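As a quick sanity check of that estimate, the arithmetic can be reproduced in a few lines. This is only a sketch: the three durations and the layer count are the ones quoted in the comment, and the variable names are made up for illustration.

```python
# Kernel durations in microseconds, as quoted in the comment
# (kv norm, nvjet, qk rope; the order is assumed to match).
kernel_us = [2.112, 6.176, 1.856]
num_layers = 61

# Time that could be hidden per layer if all three kernels were overlapped.
per_layer_us = sum(kernel_us)
# Same saving accumulated over every layer, converted to milliseconds.
total_ms = per_layer_us * num_layers / 1000.0

print(f"per layer: {per_layer_us:.3f} us")
print(f"across {num_layers} layers: {total_ms:.2f} ms")
```

The figure is an upper bound: it assumes the three kernels can be hidden completely behind other work.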
/tag-and-rerun-ci |
b8c807d to 7c6b1b7
@Fridge003 Could you trigger /rerun-failed-ci? The CI jobs that are currently failing passed on the previous run, and they do not seem to be related to my code changes.
7c6b1b7 to b4c7f23

Motivation
The weights_proj in the DSA indexer uses float type, which is slow on modern GPUs. Profiler traces show this accounts for ~20% of decode layer runtime. The traces also reveal that this projection typically uses a grid of (bs, 1, 1), utilizing very few SMs and creating an opportunity for inter-stream overlap.
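To illustrate why a (bs, 1, 1) grid leaves most of the GPU idle, the sketch below computes an upper bound on SM occupancy. It relies only on the fact that each thread block runs on a single SM. The helper name `sm_fraction`, the default of 132 SMs (the count on an H100 SXM, used here purely as an example since the PR does not name the GPU), and the batch sizes are assumptions for illustration.

```python
def sm_fraction(batch_size: int, num_sms: int = 132) -> float:
    """Upper bound on the fraction of SMs a (batch_size, 1, 1) grid can occupy.

    Each thread block runs on exactly one SM, so a grid with `batch_size`
    blocks can keep at most `batch_size` SMs busy.
    num_sms=132 is an assumed example value, not taken from the PR.
    """
    return min(batch_size, num_sms) / num_sms


for bs in (1, 8, 32):  # hypothetical decode batch sizes
    print(f"bs={bs:>2}: at most {sm_fraction(bs):.1%} of SMs busy")
```

With small decode batches the remaining SMs are free to run kernels from a second stream, which is the opportunity the PR exploits.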
Modifications
In this optimization, we overlap the very slow weights_proj with q_b_proj, indexer _get_q_k_bf16, and indexer qk act_quant during dual_stream decode.
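The effect of this overlap can be pictured with a simple timeline model. The sketch below is not SGLang code and does not use CUDA streams. It only models a fork/join segment, where the join waits for the slower of two streams. The function name `segment_time` and the per-kernel durations are hypothetical; only the ~20% share of weights_proj comes from the Motivation.

```python
def segment_time(main_kernels, side_kernels, overlap):
    """Wall time of one fork/join segment, in arbitrary units."""
    main = sum(main_kernels)
    side = sum(side_kernels)
    # Sequential: everything is queued on a single stream.
    # Overlapped: the join point waits for the slower of the two streams.
    return max(main, side) if overlap else main + side


# Decode layer time normalized to 1.0; weights_proj takes ~20% (from the PR).
weights_proj = 0.20
# Hypothetical durations for q_b_proj, _get_q_k_bf16 and act_quant.
overlapped_main = [0.10, 0.08, 0.04]
rest_of_layer = 1.0 - weights_proj - sum(overlapped_main)

before = rest_of_layer + segment_time(overlapped_main, [weights_proj], overlap=False)
after = rest_of_layer + segment_time(overlapped_main, [weights_proj], overlap=True)
print(f"layer time before: {before:.2f}, after: {after:.2f}, "
      f"speedup: {before / after:.2f}x")
```

The model ignores SM contention between the streams and kernel launch overhead, so it gives a best-case figure. weights_proj is fully hidden only when the main-stream work it runs alongside takes at least as long.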
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process