[BUG] CUDA 13 runtime crash: Concat kernel missing symbol on Jetson Thor #26212

Description

Describe the issue

[BUG] CUDA 13 runtime crash: Concat kernel missing symbol on Jetson Thor

  • Reporter: @bccw2021
  • Environment:
    • Hardware: NVIDIA Jetson Thor (Blackwell class GPU)
    • OS: Ubuntu 24.04 (JetPack 7.0)
    • CUDA Toolkit: 13.0 (/usr/local/cuda)
    • cuDNN: 9.x (JetPack default in /usr/lib/aarch64-linux-gnu)
    • TensorRT: 10.x (/usr/lib/aarch64-linux-gnu)
    • Python: 3.12.3 (virtualenv ~/openpilot/.venv)
    • ONNX Runtime: custom build from microsoft/onnxruntime (main branch, built with --use_cuda --use_tensorrt --build_wheel)
    • Build flags: CMAKE_CUDA_ARCHITECTURES=90-real;90-virtual
    • NVCC: 13.0.48 (cuda_13.0.r13.0/compiler.36260728_0)
    • NVIDIA driver: 580.00 (reports CUDA Version: 13.0 via nvidia-smi)
    • cuDNN packages: libcudnn9-cuda-13 / libcudnn9-dev-cuda-13 / headers 9.12.0.46 (dpkg -l | grep cudnn)
    • TensorRT packages: libnvinfer10 10.13.3.9 (+ dev/headers/plugins, Python bindings; see dpkg -l | grep nvinfer below)
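
Before debugging the kernel itself, it can help to confirm what the custom wheel reports from Python (a minimal sanity-check sketch; the expected values are assumptions based on the build flags above):

import onnxruntime as ort

print(ort.__version__)                # 1.23.0 for this custom build
print(ort.get_device())               # "GPU" is expected for a CUDA-enabled wheel
print(ort.get_available_providers())  # should list TensorrtExecutionProvider,
                                      # CUDAExecutionProvider, CPUExecutionProvider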

Repro steps

  1. On Jetson Thor, build ONNX Runtime from source:
    ./build.sh --update --build --parallel \
      --config Release \
      --build_wheel \
      --use_cuda \
      --cuda_home /usr/local/cuda \
      --cudnn_home /usr/lib/aarch64-linux-gnu \
      --use_tensorrt \
      --tensorrt_home /usr/lib/aarch64-linux-gnu \
      --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="90-real;90-virtual" \
      --cmake_extra_defines CMAKE_CXX_FLAGS="-I/usr/local/cuda/targets/sbsa-linux/include -I/usr/local/cuda/targets/sbsa-linux/include/cccl -Wno-psabi -Wno-sign-compare -Wno-error=deprecated-declarations" \
      --cmake_extra_defines CMAKE_CUDA_FLAGS="--compiler-options '-fno-strict-aliasing'"
  2. Install the generated wheel into the project virtualenv.
  3. Run the demo script bodyjim/examples/roam.py, which loads a GPT-based policy and runs inference on camera frames:
    python bodyjim/examples/roam.py 192.168.100.52

gpt2 test: https://github.com/commaai/bodyjim/blob/master/examples/roam.py
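
To isolate the failure from roam.py, a minimal standalone repro can target the same kernel (a sketch, not taken from the original report: it builds a trivial two-input Concat graph with the onnx helper and runs it on the CUDA execution provider):

import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Two [1, 4] float inputs concatenated along axis 1 into a [1, 8] output.
a = helper.make_tensor_value_info("a", TensorProto.FLOAT, [1, 4])
b = helper.make_tensor_value_info("b", TensorProto.FLOAT, [1, 4])
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1, 8])

node = helper.make_node("Concat", ["a", "b"], ["out"], axis=1)
graph = helper.make_graph([node], "concat_repro", [a, b], [out])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

sess = ort.InferenceSession(
    model.SerializeToString(),
    providers=["CUDAExecutionProvider"],  # expected to hit cudaErrorSymbolNotFound on Thor
)
result = sess.run(None, {"a": np.ones((1, 4), np.float32),
                         "b": np.ones((1, 4), np.float32)})
print(result[0])

If this trivial graph fails with the same cudaErrorSymbolNotFound, the problem lies in the Concat kernel image shipped in the build rather than anything specific to the GPT model.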

Actual behavior

Runtime fails on the first CUDA inference call for the tokenizer session:

2025-09-30 14:50:36.763560092 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Concat node. Name:'/model/Concat' Status Message: CUDA error cudaErrorSymbolNotFound:named symbol not found
Traceback (most recent call last):
  File "/home/canpan/bodyjim/examples/roam.py", line 170, in <module>
    roam(args.body_ip)
  File "/home/canpan/bodyjim/examples/roam.py", line 132, in roam
    action, _ = runner.run(obs["cameras"]["driver"], obs["carState"]["wheelSpeeds"])
  File "/home/canpan/bodyjim/examples/roam.py", line 94, in run
    img_tokens = self.tokenize_frame(img)
  File "/home/canpan/bodyjim/examples/roam.py", line 65, in tokenize_frame
    img_tokens = self.tokenizer_session.run(None, {'img': img})[0].reshape(1, -1)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Concat node. Name:'/model/Concat' Status Message: CUDA error cudaErrorSymbolNotFound:named symbol not found

CPUExecutionProvider fallback works but is too slow for real-time usage.
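
Note that provider fallback cannot rescue this case: ONNX Runtime assigns nodes to providers at session-creation time, while this error surfaces later at kernel launch, so the only working configuration is a pure-CPU session. A sketch of that workaround, with an illustrative model path:

import onnxruntime as ort

# Hypothetical tokenizer model path; forcing CPU avoids the broken CUDA
# Concat kernel at the cost of real-time performance.
sess = ort.InferenceSession("tokenizer.onnx",
                            providers=["CPUExecutionProvider"])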

Expected behavior

Concat kernels compiled for CUDA 13.0 should load correctly on Jetson Thor (SM 90) and run without missing symbol errors, allowing the tokenizer and policy ONNX graphs to execute on CUDAExecutionProvider.

Additional context

  • The crash happens immediately when executing the Concat node inside the tokenizer graph.
  • Earlier in the project we patched deprecated CUDA vector types (longlong4 -> longlong4_16a) and added CCCL include paths; those fixes do not resolve this runtime error.
  • CPU execution runs the same model successfully, confirming the ONNX graph is valid.
  • We followed guidance from issue [Build] CUDA 13 Failed #25936 and disabled Flash Attention, yet the Concat CUDA symbol remains missing when compiled against CUDA 13.0 on Thor.
  • We suspect that CUDA 13 requires additional PTX variants (e.g., 120-virtual) or updated CUDA kernels for Concat, similar to the Flash Attention fixes; a diagnostic sketch follows this list.
  • nvidia-smi confirms the system is running driver 580.00 on the embedded Thor GPU with CUDA 13.0 runtime.
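
One way to test the missing-PTX suspicion is to inspect which SASS and PTX architectures are actually embedded in the CUDA provider library (a diagnostic sketch; the library path is illustrative, and cuobjdump ships with the CUDA toolkit):

import subprocess

# Hypothetical install path inside the project virtualenv.
lib = ("/home/canpan/openpilot/.venv/lib/python3.12/site-packages/"
       "onnxruntime/capi/libonnxruntime_providers_cuda.so")

# --list-elf shows compiled SASS cubins, --list-ptx shows embedded PTX;
# cudaErrorSymbolNotFound at launch typically means neither a matching cubin
# nor JIT-compatible PTX exists for the running GPU.
for flag in ("--list-elf", "--list-ptx"):
    out = subprocess.run(["cuobjdump", flag, lib],
                         capture_output=True, text=True)
    print(out.stdout)

If only sm_90 entries appear and no PTX is listed, rebuilding with an additional -virtual entry in CMAKE_CUDA_ARCHITECTURES might give the driver something to JIT-compile for Thor.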

Please advise on the proper configuration or patches to ensure CUDA 13 builds ship the necessary Concat kernels for Blackwell GPUs.

Build environment

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:31:19_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0

nvidia-smi
Wed Oct  1 06:58:53 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.00                 Driver Version: 580.00         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA Thor                    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   N/A    N/A             N/A / N/A  |          Not Supported |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2877      G   /usr/lib/xorg/Xorg                        0MiB |
|    0   N/A  N/A            3101      G   /usr/bin/gnome-shell                      0MiB |
|    0   N/A  N/A            4825      G   /usr/bin/gnome-control-center             0MiB |
+-----------------------------------------------------------------------------------------+

dpkg -l | grep cudnn
ii libcudnn9-cuda-13 9.12.0.46-1 arm64 cuDNN runtime libraries for CUDA 13.0
ii libcudnn9-dev-cuda-13 9.12.0.46-1 arm64 cuDNN development libraries for CUDA 13.0
ii libcudnn9-headers-cuda-13 9.12.0.46-1 arm64 cuDNN header files for CUDA 13.0
ii libcudnn9-samples 9.12.0.46-1 all cuDNN samples
ii nvidia-cudnn 7.0-b110 arm64 NVIDIA CUDNN Meta Package
ii nvidia-cudnn-dev 7.0-b110 arm64 NVIDIA CUDNN dev Meta Package

dpkg -l | grep nvinfer
ii libnvinfer-bin 10.13.3.9-1+cuda13.0 arm64 TensorRT binaries
ii libnvinfer-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development libraries
ii libnvinfer-dispatch-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development dispatch runtime libraries
ii libnvinfer-dispatch10 10.13.3.9-1+cuda13.0 arm64 TensorRT dispatch runtime library
ii libnvinfer-headers-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development headers
ii libnvinfer-headers-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin headers
ii libnvinfer-headers-python-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT Python plugin development headers
ii libnvinfer-lean-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT lean runtime libraries
ii libnvinfer-lean10 10.13.3.9-1+cuda13.0 arm64 TensorRT lean runtime library
ii libnvinfer-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin libraries
ii libnvinfer-plugin10 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin libraries
ii libnvinfer-samples 10.13.3.9-1+cuda13.0 all TensorRT samples
ii libnvinfer-vc-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT vc-plugin library
ii libnvinfer-vc-plugin10 10.13.3.9-1+cuda13.0 arm64 TensorRT vc-plugin library
ii libnvinfer10 10.13.3.9-1+cuda13.0 arm64 TensorRT runtime libraries
ii python3-libnvinfer 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT standard runtime
ii python3-libnvinfer-dev 10.13.3.9-1+cuda13.0 arm64 Python 3 development package for TensorRT standard runtime
ii python3-libnvinfer-dispatch 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT dispatch runtime
ii python3-libnvinfer-lean 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT lean runtime

To reproduce

Same as the repro steps in the description above: build ONNX Runtime from source with the listed flags, install the wheel into the project virtualenv, and run bodyjim/examples/roam.py.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 24.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.23.0

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 13


Labels

ep:CUDA (issues related to the CUDA execution provider), platform:jetson (issues related to the NVIDIA Jetson platform)
