Closed
Description
NVIDIA AI cards (Tesla, A100, A40, A2) are not detected as GPU elements. These controllers are reported as 3D controllers (PCI class 0x0302) rather than VGA compatible controllers (0x0300), and they do not expose a DRM interface.
For example:
$ lspci -nnvs 0b:00.0
0b:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. G200eR2 [102b:0534] (rev 01) (prog-if 00 [VGA controller])
DeviceName: Embedded Video
Subsystem: Dell Device [1028:0600]
Flags: bus master, medium devsel, latency 0, IRQ 19, NUMA node 0, IOMMU group 30
Memory at 90000000 (32-bit, prefetchable) [size=16M]
Memory at 93000000 (32-bit, non-prefetchable) [size=16K]
Memory at 92800000 (32-bit, non-prefetchable) [size=8M]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [dc] Power Management version 1
Kernel driver in use: mgag200
Kernel modules: mgag200
$ lspci -nnvs 04:00.0
04:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 40GB] [10de:20f1] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:145f]
Flags: bus master, fast devsel, latency 0, IRQ 162, NUMA node 0, IOMMU group 28
Memory at 91000000 (32-bit, non-prefetchable) [size=16M]
Memory at 39000000000 (64-bit, prefetchable) [size=64G]
Memory at 3b020000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nouveau
$ ls -l /sys/class/drm
total 0
lrwxrwxrwx. 1 root root 0 Aug 5 11:10 card0 -> ../../devices/pci0000:00/0000:00:1c.7/0000:08:00.0/0000:09:00.0/0000:0a:00.0/0000:0b:00.0/drm/card0
lrwxrwxrwx. 1 root root 0 Aug 5 11:10 card0-VGA-1 -> ../../devices/pci0000:00/0000:00:1c.7/0000:08:00.0/0000:09:00.0/0000:0a:00.0/0000:0b:00.0/drm/card0/card0-VGA-1
-r--r--r--. 1 root root 4096 Sep 16 07:23 version
In this case we should add a patch that detects these cards as GPUs (e.g. by PCI base class or PCI subclass instead of relying on DRM), or add a separate function or parameters for these scenarios.
Going further, Intel Gaudi cards and hardware from other vendors are classified as processing accelerators (PCI class 0x12). We could discuss whether these should be detected as GPUs (and included in the fix) or reported under a different hardware category.
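To make the proposal concrete, here is a minimal sketch of class-based detection. It is not this project's code: the function names `classify_pci_class` and `find_gpu_like_devices` and the category labels are hypothetical. It assumes a Linux host where each device directory under `/sys/bus/pci/devices` contains a `class` file holding the 24-bit class code (for example `0x030200`). Keeping accelerators in their own category is just one of the two options raised above.

```python
from pathlib import Path

# PCI base class codes.
DISPLAY_CONTROLLER = 0x03      # covers VGA (0x0300) and 3D controller (0x0302)
PROCESSING_ACCELERATOR = 0x12  # processing accelerators


def classify_pci_class(class_code):
    """Map a 24-bit PCI class code (base class, subclass, prog-if) to a category.

    The category names are illustrative assumptions, not an existing API.
    """
    base_class = (class_code >> 16) & 0xFF
    if base_class == DISPLAY_CONTROLLER:
        return "gpu"
    if base_class == PROCESSING_ACCELERATOR:
        return "accelerator"
    return "other"


def find_gpu_like_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: category} for display controllers and accelerators.

    Reads only the PCI class, so a device without a DRM node (for example one
    bound to vfio-pci) is still reported.
    """
    found = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            code = int((dev / "class").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
        category = classify_pci_class(code)
        if category != "other":
            found[dev.name] = category
    return found


if __name__ == "__main__":
    for address, category in find_gpu_like_devices().items():
        print(address, category)
```

With this approach both the Matrox VGA controller (0x0300) and the A100 (0x0302) from the output above fall under base class 0x03, regardless of what appears in `/sys/class/drm`.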