Description
Problem Description
I am trying to evaluate a Qwen3 model with the "--enable_prefix_caching" option, but I am running into shape mismatch errors. The setup works fine without the flag. I attached the error log file (
Qwen3_log_with_prefix_cache.log
)
I investigated a bit on my end and believe the issue is related to scheduling. In "scheduler.py", the prefill scheduling loop computes "num_new_tokens" before calling block_manager.allocate(). allocate() updates "seq.num_cached_tokens", but "num_new_tokens" is not recomputed afterwards, which I believe causes the shape mismatch. Recomputing "num_new_tokens" after the block_manager.allocate() call seems to fix the issue for the first run, but when I run the same client command a second time, I get memory access fault errors. I attached the error report for this run as well (
qwen_log_with_scheduler_fix.log
)
Operating System
Ubuntu 24.04.3 LTS (Noble Numbat)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI325X VF
ROCm Version
7.2.0.70200-43~24.04
ROCm Component
No response
Steps to Reproduce
Docker image: rocm/atom-dev:nightly_202602020423
Server command:
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 python -m atom.entrypoints.openai_server --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 -tp 4 --kv_cache_dtype fp8 --enable-expert-parallel --max-model-len 32768 --max-num-batched-tokens 32768 --cudagraph-capture-sizes "[1,2,4,8,16,32,48,64,128,256,512]" --enable_prefix_caching
Client command:
python -m atom.benchmarks.benchmark_serving --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --backend vllm --base-url http://localhost:8000 --dataset-name random --random-input-len 5600 --random-output-len 140 --random-range-ratio 1.0 --num-prompts 100 --request-rate inf --ignore-eos --percentile-metrics "ttft,tpot"
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response