This repository was archived by the owner on Mar 21, 2026. It is now read-only.

FlashAttention CUDA "no kernel image" crash on RTX 5060 Ti #3342

@pauli31

Description

System Info

Running TGI 3.3.6 on a new NVIDIA GeForce RTX 5060 Ti GPU (compute capability 12.0 / sm_120) causes TGI to crash during warmup with:

CUDA Error: no kernel image is available for execution on the device
/usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh:236

It crashes immediately because FlashAttention 1.0.9, bundled inside the TGI Docker image, does not include kernels compiled for sm_120. This appears to be the root cause.
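
The mismatch is easy to check from inside the container. A minimal sketch (note this only shows which architectures the PyTorch wheel itself was built for; the flash_attn extension ships its own, narrower arch list):

import torch

# Compare the GPU's compute capability with the arch list this torch
# build ships kernels for (entries look like "sm_90", "sm_120").
major, minor = torch.cuda.get_device_capability(0)
print(f"device capability: sm_{major}{minor}")   # sm_120 on the RTX 5060 Ti
print("torch arch list:", torch.cuda.get_arch_list())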

Environment
Hardware

GPU: NVIDIA GeForce RTX 5060 Ti

Compute capability: 12.0

VRAM: 16 GB

Driver: 581.80

CUDA (system): 13.0 (from nvidia-smi)

Inside the TGI 3.3.6 container

torch == 2.7.0+cu128
torch.version.cuda == "12.8"
flash_attn == 1.0.9
triton == 3.3.0
mamba_ssm == 1.1.2
text-generation-server == 2.0.5.dev0  (from the internal server component)

FlashAttention 1.0.9 is confirmed by:

import flash_attn
print(flash_attn.__version__)       # 1.0.9
print(flash_attn.__file__)
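
If cuobjdump from the CUDA toolkit is available in the image, the architectures actually baked into the compiled extensions can be listed directly. A sketch, assuming the typical flash-attn 1.x layout where the CUDA extensions (e.g. flash_attn_cuda, dropout_layer_norm) are installed as top-level .so modules in site-packages:

import glob, subprocess, sysconfig

# List the SASS architectures embedded in each flash-attn extension.
site = sysconfig.get_paths()["purelib"]
for so in glob.glob(f"{site}/*flash_attn*.so") + glob.glob(f"{site}/*layer_norm*.so"):
    print(so)
    subprocess.run(["cuobjdump", "--list-elf", so])

If sm_120 is missing from that output, it matches the crash above.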
nvidia-smi
Tue Dec  9 11:01:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.07             Driver Version: 581.80         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P8              7W /  180W |    1781MiB /  16311MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A              25      G   /Xwayland                             N/A      |
|    0   N/A  N/A              50      G   /Xwayland                             N/A      |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Just run the Docker image with Llama 3.2:

export HF_TOKEN="..."

docker run --rm -it \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  -e RUST_LOG=debug \
  -v /mnt/hf-cache:/data \
  --name llama-3.2-1b-tgi \
  ghcr.io/huggingface/text-generation-inference:3.3.6 \
    --model-id meta-llama/Llama-3.2-1B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 4224 \
    --cuda-memory-fraction 0.9 

The entire stacktrace

2025-12-09T10:04:16.474404Z  INFO text_generation_launcher: Args {
    model_id: "meta-llama/Llama-3.2-1B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4096,
    ),
    max_total_tokens: Some(
        4224,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "da2c85ccd2e1",
    port: 80,
    prometheus_port: 9000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: true,
    cuda_memory_fraction: 0.9,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-12-09T10:04:18.510804Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-12-09T10:04:18.599056Z  WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5060-ti
2025-12-09T10:04:18.639000Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2025-12-09T10:04:18.639066Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-12-09T10:04:18.639261Z  INFO download: text_generation_launcher: Starting check and download process for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.229594Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-12-09T10:04:24.963645Z  INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.963959Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-12-09T10:04:29.917437Z  INFO text_generation_launcher: Using prefix caching = True
2025-12-09T10:04:29.917489Z  INFO text_generation_launcher: Using Attention = flashinfer
2025-12-09T10:04:34.991699Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:45.011112Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:55.030039Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:04.721763Z  INFO text_generation_launcher: Using prefill chunking = True
2025-12-09T10:05:05.041495Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:05.123703Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-12-09T10:05:05.141684Z  INFO shard-manager: text_generation_launcher: Shard ready in 40.172224949s rank=0
2025-12-09T10:05:05.214578Z  INFO text_generation_launcher: Starting Webserver
2025-12-09T10:05:05.276269Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-12-09T10:05:05.299117Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-12-09T10:05:07.527851Z ERROR warmup{max_input_length=Some(4096) max_prefill_tokens=4096 max_total_tokens=Some(4224) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2025-12-09T10:05:07.566220Z ERROR text_generation_launcher: Webserver Crashed
2025-12-09T10:05:07.566264Z  INFO text_generation_launcher: Shutting down shards
2025-12-09T10:05:07.655671Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-12-09 10:04:26.943 | INFO     | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
Error: WebserverFailed

Additional test inside the Docker container (the fused flash_attn RMSNorm op fails on its own):

import torch
from flash_attn.ops.rms_norm import rms_norm

b = torch.rand(32).cuda()      # weight
a = torch.rand(2, 32).cuda()   # input
rms_norm(a, b, 1e-6)           # hits the same "no kernel image" error
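
For comparison, the same operation in plain PyTorch (RMSNorm scales x by the reciprocal root-mean-square over the last dimension: x / sqrt(mean(x^2) + eps) * weight) should run on this GPU, since it uses only stock kernels. A minimal reference sketch:

import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # RMSNorm: normalize by the root-mean-square of the last dim, then scale.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

a = torch.rand(2, 32).cuda()
b = torch.rand(32).cuda()
print(rms_norm_ref(a, b, 1e-6))   # works; only the pre-compiled fused op fails

This points at the pre-compiled flash-attention kernels rather than the GPU or the PyTorch build.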

Expected behavior

The model should load and inference should just work. Come on, it's almost a year since the Blackwell GPUs were released; this should work by now...
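
For reference, once warmup succeeds the router should answer generate requests on the mapped port (8080 in the command above). A quick smoke test using only the standard library:

import json, urllib.request

# Send a single /generate request to the TGI router started above.
req = urllib.request.Request(
    "http://127.0.0.1:8080/generate",
    data=json.dumps({"inputs": "What is deep learning?",
                     "parameters": {"max_new_tokens": 20}}).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())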
