[BUG] CUDA 13 runtime crash: Concat kernel missing symbol on Jetson Thor #26212

Description

Describe the issue

[BUG] CUDA 13 runtime crash: Concat kernel missing symbol on Jetson Thor

  • Reporter: @bccw2021
  • Environment:
    • Hardware: NVIDIA Jetson Thor (Blackwell class GPU)
    • OS: Ubuntu 24.04 (JetPack 7.0)
    • CUDA Toolkit: 13.0 (/usr/local/cuda)
    • cuDNN: 9.x (JetPack default in /usr/lib/aarch64-linux-gnu)
    • TensorRT: 10.x (/usr/lib/aarch64-linux-gnu)
    • Python: 3.12.3 (virtualenv ~/openpilot/.venv)
    • ONNX Runtime: custom build from microsoft/onnxruntime (main branch, built with --use_cuda --use_tensorrt --build_wheel)
    • Build flags: CMAKE_CUDA_ARCHITECTURES=90-real;90-virtual
    • NVCC: 13.0.48 (cuda_13.0.r13.0/compiler.36260728_0)
    • NVIDIA driver: 580.00 (reports CUDA Version: 13.0 via nvidia-smi)
    • cuDNN packages: libcudnn9-cuda-13 / libcudnn9-dev-cuda-13 / headers 9.12.0.46 (dpkg -l | grep cudnn)
    • TensorRT packages: libnvinfer10 10.13.3.9 (+ dev/headers/plugins, Python bindings; see dpkg -l | grep nvinfer below)
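
Before debugging the kernel itself, it can help to confirm what the custom wheel reports from Python (a minimal sanity-check sketch; the expected values are assumptions based on the build flags above):

import onnxruntime as ort

print(ort.__version__)                # 1.23.0 for this custom build
print(ort.get_device())               # "GPU" is expected for a CUDA-enabled wheel
print(ort.get_available_providers())  # should list TensorrtExecutionProvider,
                                      # CUDAExecutionProvider, CPUExecutionProvider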

Repro steps

  1. On Jetson Thor, build ONNX Runtime from source:
    ./build.sh --update --build --parallel \
      --config Release \
      --build_wheel \
      --use_cuda \
      --cuda_home /usr/local/cuda \
      --cudnn_home /usr/lib/aarch64-linux-gnu \
      --use_tensorrt \
      --tensorrt_home /usr/lib/aarch64-linux-gnu \
      --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="90-real;90-virtual" \
      --cmake_extra_defines CMAKE_CXX_FLAGS="-I/usr/local/cuda/targets/sbsa-linux/include -I/usr/local/cuda/targets/sbsa-linux/include/cccl -Wno-psabi -Wno-sign-compare -Wno-error=deprecated-declarations" \
      --cmake_extra_defines CMAKE_CUDA_FLAGS="--compiler-options '-fno-strict-aliasing'"
  2. Install the generated wheel into the project virtualenv.
  3. Run the demo script bodyjim/examples/roam.py, which loads a GPT-based policy and runs inference on camera frames:
    python bodyjim/examples/roam.py 192.168.100.52

gpt2 test: https://github.com/commaai/bodyjim/blob/master/examples/roam.py
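
To isolate the failure from roam.py, a minimal standalone repro can target the same kernel (a sketch, not taken from the original report: it builds a trivial two-input Concat graph with the onnx helper and runs it on the CUDA execution provider):

import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Two [1, 4] float inputs concatenated along axis 1 into a [1, 8] output.
a = helper.make_tensor_value_info("a", TensorProto.FLOAT, [1, 4])
b = helper.make_tensor_value_info("b", TensorProto.FLOAT, [1, 4])
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1, 8])

node = helper.make_node("Concat", ["a", "b"], ["out"], axis=1)
graph = helper.make_graph([node], "concat_repro", [a, b], [out])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

sess = ort.InferenceSession(
    model.SerializeToString(),
    providers=["CUDAExecutionProvider"],  # expected to hit cudaErrorSymbolNotFound on Thor
)
result = sess.run(None, {"a": np.ones((1, 4), np.float32),
                         "b": np.ones((1, 4), np.float32)})
print(result[0])

If this trivial graph fails with the same cudaErrorSymbolNotFound, the problem lies in the Concat kernel image shipped in the build rather than anything specific to the GPT model.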

Actual behavior

Runtime fails on the first CUDA inference call for the tokenizer session:

2025-09-30 14:50:36.763560092 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Concat node. Name:'/model/Concat' Status Message: CUDA error cudaErrorSymbolNotFound:named symbol not found
Traceback (most recent call last):
  File "/home/canpan/bodyjim/examples/roam.py", line 170, in <module>
    roam(args.body_ip)
  File "/home/canpan/bodyjim/examples/roam.py", line 132, in roam
    action, _ = runner.run(obs["cameras"]["driver"], obs["carState"]["wheelSpeeds"])
  File "/home/canpan/bodyjim/examples/roam.py", line 94, in run
    img_tokens = self.tokenize_frame(img)
  File "/home/canpan/bodyjim/examples/roam.py", line 65, in tokenize_frame
    img_tokens = self.tokenizer_session.run(None, {'img': img})[0].reshape(1, -1)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Concat node. Name:'/model/Concat' Status Message: CUDA error cudaErrorSymbolNotFound:named symbol not found

CPUExecutionProvider fallback works but is too slow for real-time usage.
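
Note that provider fallback cannot rescue this case: ONNX Runtime assigns nodes to providers at session-creation time, while this error surfaces later at kernel launch, so the only working configuration is a pure-CPU session. A sketch of that workaround, with an illustrative model path:

import onnxruntime as ort

# Hypothetical tokenizer model path; forcing CPU avoids the broken CUDA
# Concat kernel at the cost of real-time performance.
sess = ort.InferenceSession("tokenizer.onnx",
                            providers=["CPUExecutionProvider"])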

Expected behavior

Concat kernels compiled for CUDA 13.0 should load correctly on Jetson Thor (SM 90) and run without missing symbol errors, allowing the tokenizer and policy ONNX graphs to execute on CUDAExecutionProvider.

Additional context

  • The crash happens immediately when executing the Concat node inside the tokenizer graph.
  • Earlier in the project we patched deprecated CUDA vector types (longlong4 -> longlong4_16a) and added CCCL include paths; those fixes do not resolve this runtime error.
  • CPU execution runs the same model successfully, confirming the ONNX graph is valid.
  • We followed guidance from issue [Build] CUDA 13 Failed #25936 and disabled Flash Attention, yet the Concat CUDA symbol remains missing when compiled against CUDA 13.0 on Thor.
  • We suspect that CUDA 13 requires additional PTX variants (e.g., 120-virtual) or updated CUDA kernels for Concat, similar to the Flash Attention fixes; a diagnostic sketch follows this list.
  • nvidia-smi confirms the system is running driver 580.00 on the embedded Thor GPU with CUDA 13.0 runtime.
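
One way to test the missing-PTX suspicion is to inspect which SASS and PTX architectures are actually embedded in the CUDA provider library (a diagnostic sketch; the library path is illustrative, and cuobjdump ships with the CUDA toolkit):

import subprocess

# Hypothetical install path inside the project virtualenv.
lib = ("/home/canpan/openpilot/.venv/lib/python3.12/site-packages/"
       "onnxruntime/capi/libonnxruntime_providers_cuda.so")

# --list-elf shows compiled SASS cubins, --list-ptx shows embedded PTX;
# cudaErrorSymbolNotFound at launch typically means neither a matching cubin
# nor JIT-compatible PTX exists for the running GPU.
for flag in ("--list-elf", "--list-ptx"):
    out = subprocess.run(["cuobjdump", flag, lib],
                         capture_output=True, text=True)
    print(out.stdout)

If only sm_90 entries appear and no PTX is listed, rebuilding with an additional -virtual entry in CMAKE_CUDA_ARCHITECTURES might give the driver something to JIT-compile for Thor.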

Please advise on the proper configuration or patches to ensure CUDA 13 builds ship the necessary Concat kernels for Blackwell GPUs.

Build environment

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:31:19_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0

nvidia-smi
Wed Oct  1 06:58:53 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.00                 Driver Version: 580.00         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA Thor                    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   N/A    N/A             N/A / N/A  |          Not Supported |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2877      G   /usr/lib/xorg/Xorg                        0MiB |
|    0   N/A  N/A            3101      G   /usr/bin/gnome-shell                      0MiB |
|    0   N/A  N/A            4825      G   /usr/bin/gnome-control-center             0MiB |
+-----------------------------------------------------------------------------------------+

dpkg -l | grep cudnn
ii libcudnn9-cuda-13 9.12.0.46-1 arm64 cuDNN runtime libraries for CUDA 13.0
ii libcudnn9-dev-cuda-13 9.12.0.46-1 arm64 cuDNN development libraries for CUDA 13.0
ii libcudnn9-headers-cuda-13 9.12.0.46-1 arm64 cuDNN header files for CUDA 13.0
ii libcudnn9-samples 9.12.0.46-1 all cuDNN samples
ii nvidia-cudnn 7.0-b110 arm64 NVIDIA CUDNN Meta Package
ii nvidia-cudnn-dev 7.0-b110 arm64 NVIDIA CUDNN dev Meta Package

dpkg -l | grep nvinfer
ii libnvinfer-bin 10.13.3.9-1+cuda13.0 arm64 TensorRT binaries
ii libnvinfer-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development libraries
ii libnvinfer-dispatch-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development dispatch runtime libraries
ii libnvinfer-dispatch10 10.13.3.9-1+cuda13.0 arm64 TensorRT dispatch runtime library
ii libnvinfer-headers-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT development headers
ii libnvinfer-headers-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin headers
ii libnvinfer-headers-python-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT Python plugin development headers
ii libnvinfer-lean-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT lean runtime libraries
ii libnvinfer-lean10 10.13.3.9-1+cuda13.0 arm64 TensorRT lean runtime library
ii libnvinfer-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin libraries
ii libnvinfer-plugin10 10.13.3.9-1+cuda13.0 arm64 TensorRT plugin libraries
ii libnvinfer-samples 10.13.3.9-1+cuda13.0 all TensorRT samples
ii libnvinfer-vc-plugin-dev 10.13.3.9-1+cuda13.0 arm64 TensorRT vc-plugin library
ii libnvinfer-vc-plugin10 10.13.3.9-1+cuda13.0 arm64 TensorRT vc-plugin library
ii libnvinfer10 10.13.3.9-1+cuda13.0 arm64 TensorRT runtime libraries
ii python3-libnvinfer 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT standard runtime
ii python3-libnvinfer-dev 10.13.3.9-1+cuda13.0 arm64 Python 3 development package for TensorRT standard runtime
ii python3-libnvinfer-dispatch 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT dispatch runtime
ii python3-libnvinfer-lean 10.13.3.9-1+cuda13.0 arm64 Python 3 bindings for TensorRT lean runtime

To reproduce

Same as the repro steps in the description above: build ONNX Runtime from source with the listed flags, install the wheel into the project virtualenv, and run bodyjim/examples/roam.py.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 24.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.23.0

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 13


Labels

ep:CUDA (issues related to the CUDA execution provider), platform:jetson (issues related to the NVIDIA Jetson platform)
