Description
Environment
Host: Proxmox VE 8.4.1 running on HP DL580 G9 server
GPU: AMD Vega 20 (Radeon Pro VII/Radeon Instinct MI50 32GB) cards
Using vendor-reset module for GPU reset functionality
Issue Description
When using AMD Vega 20 cards with PCI passthrough in Proxmox, I've discovered an unusual behavior with the vendor-reset module. The module appears to be loaded correctly after system boot (visible in lsmod and dmesg | grep vendor_reset), but it doesn't actually perform resets on the GPU devices until a specific sequence of PCI device removal and rescan operations is performed.
Observed Behavior
After a fresh host boot, vendor-reset appears loaded but inactive
When checking dmesg | grep reset after VM shutdown, no reset actions are logged
VMs fail to initialize the GPU properly on first start
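The two checks above (module loaded, no reset lines in the kernel log) can be sketched as small helpers. This is only an illustration: the function names and the `PROC_MODULES` override are hypothetical, and the log filter matches any line containing "reset", mirroring `dmesg | grep reset` rather than a specific vendor-reset message format.

```shell
#!/bin/bash
# Hypothetical override so the check can be pointed at a test file;
# on a real host this is /proc/modules (the same data lsmod prints).
PROC_MODULES="${PROC_MODULES:-/proc/modules}"

# Returns 0 if a module named vendor_reset is listed as loaded.
vendor_reset_loaded() {
    grep -q '^vendor_reset ' "$PROC_MODULES"
}

# Reads kernel log text on stdin and prints how many lines mention "reset"
# (case-insensitive). Prints 0 when nothing matches.
count_reset_lines() {
    grep -ci 'reset' || true
}
```

On the host, something like `vendor_reset_loaded && dmesg | count_reset_lines` run after a VM shutdown would print 0 in the "loaded but inactive" state described above.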
Workaround/Solution
I discovered that running the following script "activates" the vendor-reset functionality:
#!/bin/bash
# Set an RTC wake timer of 10 seconds and suspend the system, then remove the
# GPUs at 06:00 and c6:00 (functions .0 and .1) and rescan the PCI bus, twice.
rtcwake -m no -s 10 && systemctl suspend
sleep 6
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:06:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan
sleep 2
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:06:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:c6:00.1/remove
sleep 2
echo 1 > /sys/bus/pci/rescan
After running this script, vendor-reset begins functioning correctly and subsequent VM startups with GPU passthrough work properly.
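For hosts with different PCI addresses, the remove/rescan part of the script above could be factored into a function. This is a minimal sketch, not a tested replacement: `pci_cycle`, `SYSFS_PCI`, and `PCI_CYCLE_DELAY` are hypothetical names introduced here, and the suspend step is left out.

```shell
#!/bin/bash
# Hypothetical override so the function can be tried against a scratch
# directory; on a real host it defaults to /sys/bus/pci.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Remove each listed device (skipping any that are not present), wait,
# then ask the kernel to rescan the PCI bus.
pci_cycle() {
    local dev
    for dev in "$@"; do
        if [ -e "$SYSFS_PCI/devices/$dev/remove" ]; then
            echo 1 > "$SYSFS_PCI/devices/$dev/remove"
        else
            echo "skip: $dev not present" >&2
        fi
    done
    sleep "${PCI_CYCLE_DELAY:-2}"
    echo 1 > "$SYSFS_PCI/rescan"
}
```

Run as root, calling `pci_cycle 0000:06:00.0 0000:06:00.1 0000:c6:00.0 0000:c6:00.1` twice with a `sleep 2` in between would reproduce the double remove-rescan sequence.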
Additional Observations
The double remove-rescan sequence seems to be necessary
In Linux VMs, even with functioning vendor-reset, EEPROM read errors still occur, but ROCm detects the card and it works
In Windows VMs, driver installation for the audio component of the card fails, though the GPU itself works
Theory
It appears that vendor-reset requires some form of "activation" through PCI bus manipulation before it begins monitoring and resetting the GPU devices. The initial PCI removal/rescan operations seem to place the cards in a state where vendor-reset can properly handle them.
This suggests that vendor-reset might not be correctly identifying or hooking into the AMD Vega 20 cards on initial system boot, but can do so after they've been cycled through the PCI subsystem.
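One way to probe this theory is to compare the reset methods the kernel reports for each GPU before and after running the workaround. The sketch below assumes a kernel that exposes the per-device `reset_method` sysfs attribute (present in recent kernels) and treats a listed `device_specific` method as a hint that a device-specific reset is available. Whether that attribute actually changes in this setup is an assumption that has not been verified here; `has_device_specific_reset` and `SYSFS_PCI` are hypothetical names.

```shell
#!/bin/bash
# Hypothetical override for trying the check against a scratch directory;
# on a real host it defaults to /sys/bus/pci.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Returns 0 if the device's reset_method attribute lists device_specific.
# Returns non-zero if the attribute is missing or does not list it.
has_device_specific_reset() {
    local file="$SYSFS_PCI/devices/$1/reset_method"
    [ -r "$file" ] && grep -qw 'device_specific' "$file"
}
```

For example, `for d in 0000:06:00.0 0000:c6:00.0; do has_device_specific_reset "$d" && echo "$d: device_specific listed" || echo "$d: not listed"; done`, run once after a fresh boot and once after the workaround, would show whether the two states differ.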

