Unusual Vendor-Reset Behavior with AMD Vega 20 Cards in Proxmox VE #99

@PPTG

Environment

Host: Proxmox VE 8.4.1 running on HP DL580 G9 server
GPU: AMD Vega 20 (Radeon Pro VII/Radeon Instinct MI50 32GB) cards
Using vendor-reset module for GPU reset functionality

Issue Description
When using AMD Vega 20 cards with PCI passthrough in Proxmox, I've discovered an unusual behavior with the vendor-reset module. The module appears to be loaded correctly after system boot (visible in lsmod and dmesg | grep vendor_reset), but it doesn't actually perform resets on the GPU devices until a specific sequence of PCI device removal and rescan operations is performed.

Observed Behavior
After a fresh host boot, vendor-reset appears loaded but inactive
When checking dmesg | grep reset after VM shutdown, no reset actions are logged
VMs fail to initialize the GPU properly on first start
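The "loaded but inactive" state above can be captured with a small check script. This is a minimal sketch, not part of the original report: the helper names `module_loaded` and `count_log_matches` are made up for illustration, and the optional file argument exists only so the check can be run against a saved copy of `/proc/modules`.

```shell
#!/bin/bash
# Hypothetical helpers for recording the state described above.

# Succeed if a module name appears in a modules listing.
# Defaults to /proc/modules, which lists one module per line, name first.
module_loaded() {
    local name="$1" modules_file="${2:-/proc/modules}"
    grep -q "^${name} " "$modules_file" 2>/dev/null
}

# Count lines on stdin that mention a pattern (case-insensitive).
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_log_matches() {
    local pattern="$1"
    grep -c -i -- "$pattern" || true
}

if module_loaded vendor_reset; then
    echo "vendor_reset: loaded"
else
    echo "vendor_reset: not loaded"
fi
```

Piping the kernel log through the second helper, for example `dmesg | count_log_matches reset` after a VM shutdown, gives a number that can be compared before and after the workaround.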

Workaround/Solution
I discovered that running the following script "activates" the vendor-reset functionality:

#!/bin/bash
# Arm an RTC wake alarm 10 seconds out, suspend the host, then remove the GPU
# functions (06:00.x and c6:00.x) and rescan the PCI bus. The remove/rescan
# pass is done twice.
rtcwake -m no -s 10 && systemctl suspend
sleep 6
# First pass: remove both functions of each card, then rescan.
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:06:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan
sleep 2
# Second pass: same removal and rescan again.
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:06:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan

After running this script, vendor-reset begins functioning correctly and subsequent VM startups with GPU passthrough work properly.
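For anyone adapting this to other PCI addresses, the double remove/rescan part of the script can be written as a loop. This is only a sketch of the same sequence: the device list is taken from the script above, while `DRY_RUN` and `SYSFS_PCI` are illustrative knobs that are not in the original, and the `rtcwake`/suspend step is left out. It defaults to dry-run, so nothing is written to sysfs unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# PCI functions from the script above (two cards, two functions each).
DEVICES=(0000:06:00.0 0000:06:00.1 0000:c6:00.0 0000:c6:00.1)
# Illustrative knobs, not part of the original script.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"
DRY_RUN="${DRY_RUN:-1}"

# Write "1" to a sysfs file, or only print the action in dry-run mode.
poke() {
    local target="$1"
    if [[ "$DRY_RUN" == "1" ]]; then
        echo "would write 1 to $target"
    else
        echo 1 > "$target"
    fi
}

# One pass: remove every listed function, pause, then rescan the bus.
remove_and_rescan() {
    local dev
    for dev in "${DEVICES[@]}"; do
        poke "$SYSFS_PCI/devices/$dev/remove"
    done
    [[ "$DRY_RUN" == "1" ]] || sleep 2
    poke "$SYSFS_PCI/rescan"
}

# Two passes with a 2-second pause between them, as in the original.
run_sequence() {
    remove_and_rescan
    [[ "$DRY_RUN" == "1" ]] || sleep 2
    remove_and_rescan
}

run_sequence
```

Running it as-is prints the ten writes it would perform. Running it as root with `DRY_RUN=0` performs them.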

Additional Observations

The double remove-rescan sequence seems to be necessary
In Linux VMs, even with functioning vendor-reset, EEPROM read errors still occur:

[Two screenshots showing the EEPROM read errors]

However, ROCm still detects the card and it works.
In Windows VMs, driver installation for the audio component of the card fails, though the GPU itself works

Theory
It appears that vendor-reset requires some form of "activation" through PCI bus manipulation before it begins monitoring and resetting the GPU devices. The initial PCI removal/rescan operations seem to place the cards in a state where vendor-reset can properly handle them.
This suggests that vendor-reset might not be correctly identifying or hooking into the AMD Vega 20 cards on initial system boot, but can do so after they've been cycled through the PCI subsystem.
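One way to probe this theory is to compare which reset methods the kernel reports for each GPU before and after the workaround. The sketch below assumes a kernel that exposes the per-device `reset_method` sysfs attribute (5.15 and later, as far as I know) and assumes that vendor-reset is reached through the `device_specific` method. Neither point is confirmed in this report, and the helper names are hypothetical.

```shell
#!/bin/bash
# Print the reset methods the kernel lists for one PCI function.
# The second argument exists only so the helper can be pointed at a fake tree.
reset_methods() {
    local bdf="$1" sysfs_root="${2:-/sys/bus/pci/devices}"
    local attr="$sysfs_root/$bdf/reset_method"
    if [[ -r "$attr" ]]; then
        cat "$attr"
    else
        echo "unavailable"
    fi
}

# Succeed if "device_specific" is among the listed methods.
# Assumption: this is the method vendor-reset is invoked through.
uses_device_specific() {
    [[ " $(reset_methods "$@") " == *" device_specific "* ]]
}

# GPU functions from the workaround script in this report.
for dev in 0000:06:00.0 0000:c6:00.0; do
    echo "$dev: $(reset_methods "$dev")"
done
```

If the output differs between a fresh boot and after the remove/rescan script, that would support the idea that the cards are only hooked once they are re-enumerated.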
