Description
Problem Description
I am trying to evaluate a Qwen3 model with the "--enable_prefix_caching" option, but I am running into shape mismatch errors. The setup works fine without the flag. I attached the error log file (
Qwen3_log_with_prefix_cache.log
)
I investigated a bit on my end and believe the issue is related to scheduling. In "scheduler.py", the prefill scheduling loop computes "num_new_tokens" before calling block_manager.allocate(). allocate() updates "seq.num_cached_tokens", but "num_new_tokens" is not recomputed afterwards, which I believe causes the shape mismatch. Recomputing "num_new_tokens" after the block_manager.allocate() call seems to fix the issue for the first run, but when I run the same client command a second time, I get memory access fault errors. I attached the error report for this run as well (
qwen_log_with_scheduler_fix.log
)
Operating System
Ubuntu 24.04.3 LTS (Noble Numbat)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI325X VF
ROCm Version
7.2.0.70200-43~24.04
ROCm Component
No response
Steps to Reproduce
Docker image: rocm/atom-dev:nightly_202602020423
Server command:
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 python -m atom.entrypoints.openai_server --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 -tp 4 --kv_cache_dtype fp8 --enable-expert-parallel --max-model-len 32768 --max-num-batched-tokens 32768 --cudagraph-capture-sizes "[1,2,4,8,16,32,48,64,128,256,512]" --enable_prefix_caching
Client command:
python -m atom.benchmarks.benchmark_serving --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --backend vllm --base-url http://localhost:8000 --dataset-name random --random-input-len 5600 --random-output-len 140 --random-range-ratio 1.0 --num-prompts 100 --request-rate inf --ignore-eos --percentile-metrics "ttft,tpot"
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response