
GPUs without DRM are not detected in Linux #383

@mlorenzofr

NVIDIA AI cards (Tesla, A100, A40, A2) are not detected as GPUs. These devices are 3D controllers (PCI class 0x0302) and do not have a DRM interface, so they never appear under /sys/class/drm.

For example:

$ lspci -nnvs 0b:00.0
0b:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. G200eR2 [102b:0534] (rev 01) (prog-if 00 [VGA controller])
        DeviceName: Embedded Video
        Subsystem: Dell Device [1028:0600]
        Flags: bus master, medium devsel, latency 0, IRQ 19, NUMA node 0, IOMMU group 30
        Memory at 90000000 (32-bit, prefetchable) [size=16M]
        Memory at 93000000 (32-bit, non-prefetchable) [size=16K]
        Memory at 92800000 (32-bit, non-prefetchable) [size=8M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [dc] Power Management version 1
        Kernel driver in use: mgag200
        Kernel modules: mgag200

$ lspci -nnvs 04:00.0
04:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 40GB] [10de:20f1] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:145f]
        Flags: bus master, fast devsel, latency 0, IRQ 162, NUMA node 0, IOMMU group 28
        Memory at 91000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 39000000000 (64-bit, prefetchable) [size=64G]
        Memory at 3b020000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] Null
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Capabilities: [100] Virtual Channel
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Capabilities: [bb0] Physical Resizable BAR
        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00] Lane Margining at the Receiver <?>
        Capabilities: [e00] Data Link Feature <?>
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau

$ ls -l /sys/class/drm
total 0
lrwxrwxrwx. 1 root root    0 Aug  5 11:10 card0 -> ../../devices/pci0000:00/0000:00:1c.7/0000:08:00.0/0000:09:00.0/0000:0a:00.0/0000:0b:00.0/drm/card0
lrwxrwxrwx. 1 root root    0 Aug  5 11:10 card0-VGA-1 -> ../../devices/pci0000:00/0000:00:1c.7/0000:08:00.0/0000:09:00.0/0000:0a:00.0/0000:0b:00.0/drm/card0/card0-VGA-1
-r--r--r--. 1 root root 4096 Sep 16 07:23 version

To fix this, we should either patch detection so that these cards are recognized as GPUs (e.g. by matching on the PCI base class or subclass), or add a separate function or parameters for these scenarios.
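As a rough illustration of class-based detection, here is a minimal Python sketch (not the library's code) that walks the sysfs PCI device tree and selects devices by base class 0x03 instead of relying on /sys/class/drm. The function name `find_display_devices` and its `pci_root` parameter are hypothetical; the sketch assumes the standard sysfs layout, where each device directory has a `class` file holding a 24-bit value (base class, subclass, programming interface) and gains a `drm` subdirectory only when a DRM driver is bound.

```python
from pathlib import Path

# PCI base class 0x03 covers display controllers (0x0300 VGA, 0x0302 3D, ...).
PCI_BASE_CLASS_DISPLAY = 0x03


def find_display_devices(pci_root="/sys/bus/pci/devices"):
    """Return (address, class_code, has_drm) for each display-class PCI device.

    Hypothetical helper: selection is by PCI class only, so devices
    without a DRM node (e.g. bound to vfio-pci) are still reported.
    """
    found = []
    root = Path(pci_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            # The file contains something like "0x030200".
            code = int((dev / "class").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
        base_class = (code >> 16) & 0xFF
        if base_class != PCI_BASE_CLASS_DISPLAY:
            continue
        subclass = (code >> 8) & 0xFF
        has_drm = (dev / "drm").is_dir()
        found.append((dev.name, (base_class << 8) | subclass, has_drm))
    return found


if __name__ == "__main__":
    for addr, cls, has_drm in find_display_devices():
        print(f"{addr} class=0x{cls:04x} drm={'yes' if has_drm else 'no'}")
```

On the machine shown above, a scan like this would be expected to list both 0b:00.0 (class 0x0300, with DRM) and 04:00.0 (class 0x0302, without DRM), whereas a DRM-only scan finds just the Matrox controller.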

Going further, Intel Gaudi cards and hardware from other vendors are classified as processing accelerators (PCI class 0x12). We could discuss whether these should be detected as GPUs (and then included in the fix) or reported under a different hardware category.
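To make the two options concrete, here is a small Python sketch of one possible mapping from the 24-bit PCI class code to a hardware category. The function name and the category strings ("gpu", "accelerator", "other") are hypothetical, and keeping accelerators in their own category is only one of the outcomes the discussion could reach.

```python
PCI_BASE_CLASS_DISPLAY = 0x03      # display controllers, including 0x0302 3D controllers
PCI_BASE_CLASS_ACCELERATOR = 0x12  # processing accelerators


def categorize_pci_class(class_code):
    """Map a 24-bit PCI class code (as found in sysfs) to a category.

    Hypothetical mapping: treating accelerators as GPUs instead would
    mean returning "gpu" for both base classes.
    """
    base_class = (class_code >> 16) & 0xFF
    if base_class == PCI_BASE_CLASS_DISPLAY:
        return "gpu"
    if base_class == PCI_BASE_CLASS_ACCELERATOR:
        return "accelerator"
    return "other"
```

With this mapping, `categorize_pci_class(0x030200)` yields "gpu" for the A100 above, while a device with class code 0x120000 would be reported as "accelerator".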
