Extending Linux GPU drivers with eBPF for programmable memory offloading and scheduling.
Modern GPU workloads (LLM inference, vector databases, DNN training) exhibit diverse memory access patterns and scheduling requirements. However, GPU drivers use fixed, one-size-fits-all policies that cannot adapt to workload-specific needs.
gpu_ext enables customizable GPU resource management through eBPF struct_ops:
- Memory Management: Pluggable eviction and prefetch policies at the driver level
- Scheduling: Per-process timeslice and priority control for multi-tenant GPU sharing
- Observability: Tracing tools for memory and scheduling events
Inspired by the Linux kernel's sched_ext, gpu_ext brings the same extensibility to GPU drivers.
Note: the device-side runtime path referenced by gpu_ext is based on bpftime.
├── extension/ # eBPF policies, userspace loaders, trace tools
├── kernel-module/ # Modified NVIDIA kernel modules with eBPF hooks
│ └── nvidia-module/ # NVIDIA Open GPU Kernel Modules v575.57.08
├── workloads/ # Benchmark workloads (llama.cpp, vLLM, PyTorch, FAISS)
├── libbpf/ # libbpf submodule
├── bpftool/ # bpftool submodule
├── vmlinux/ # vmlinux BTF headers
├── microbench/ # Microbenchmarks (compute/memory)
├── scripts/ # Shared utilities
├── tools/ # Helper tools
└── docs/ # Documentation
Policies in extension/:
| Category | Policies |
|---|---|
| Eviction | FIFO, LFU, MRU, PID-quota, freq-decay, FIFO-chance |
| Prefetch | none, always-max, adaptive-sequential, adaptive-tree-iter, stride, PID-tree, PID-eviction |
| Scheduling | timeslice control, preemption control |
| Tracing | chunk_trace, prefetch_trace, gpu_sched_trace |
# Ubuntu 22.04+
sudo apt-get install -y --no-install-recommends \
build-essential gcc g++ make \
clang llvm \
libelf1 libelf-dev zlib1g-dev \
pkg-config
# Or use the Makefile shortcut:
make install

Additional requirements:
- Kernel module build: Linux kernel headers (linux-headers-$(uname -r)), CUDA 12.8+
- Workloads: Python 3.12+ with the uv package manager
- Nix users: nix develop provides a ready-to-use shell environment
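A quick preflight check can catch a missing dependency before the build starts. The sketch below is not part of the repository: the function names `missing_tools` and `have_kernel_headers` are hypothetical, and the header check assumes the conventional `/lib/modules/<release>/build` layout used by Ubuntu.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (Hypothetical helper; not shipped with gpu_ext.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if kernel headers for the given release (default: the running
# kernel) appear to be installed. Assumes the usual Kbuild location,
# /lib/modules/<release>/build, which is a distro convention.
have_kernel_headers() {
    [ -d "/lib/modules/${1:-$(uname -r)}/build" ]
}
```

For example, `missing_tools gcc g++ make clang pkg-config` prints nothing when every tool is present, and `have_kernel_headers || echo "install linux-headers-$(uname -r)"` flags missing headers.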
make build # Compiles all BPF policies + userspace loaders

This builds libbpf and bpftool from submodules, then compiles each .bpf.c policy into BPF bytecode (.bpf.o) and a userspace loader binary. BPF objects and skeleton headers go to extension/.output/; loader binaries are placed directly in extension/.
Some optional extension binaries require extra host dependencies:
- sched_gpu_* needs SCX_INCLUDE_DIR=/path/to/linux/tools/sched_ext/include
- prefetch_adaptive_* needs CUDA/NVML headers and stubs
- test_preempt_demo / test_preempt_multi need CUDA driver headers and stubs
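To see which policies were actually compiled (optional ones are skipped when their dependencies are absent), one option is to list the BPF objects in the output directory. This is a minimal sketch: `list_bpf_objects` is a hypothetical helper, and it relies only on the `.output/` location and `.bpf.o` suffix described above.

```shell
# List compiled BPF objects under <extension dir>/.output, one basename per line.
# Returns non-zero if the output directory does not exist (nothing built yet).
# (Hypothetical helper; not shipped with gpu_ext.)
list_bpf_objects() {
    dir="${1:-extension}/.output"
    [ -d "$dir" ] || return 1
    for obj in "$dir"/*.bpf.o; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$obj" ] && basename "$obj"
    done
    return 0
}
```

Run from the repository root, `list_bpf_objects` defaults to `extension/`; pass another directory as the first argument if the tree lives elsewhere.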
The modified NVIDIA kernel module (based on Open GPU Kernel Modules v575.57.08) adds BPF struct_ops hook points to nvidia-uvm for memory management and to nvidia for GPU scheduling.
cd kernel-module/nvidia-module
make modules -j$(nproc)

This runs two stages automatically: first builds OS-agnostic driver objects (src/nvidia/, src/nvidia-modeset/), then builds kernel modules via Kbuild.
Output:
kernel-open/nvidia.ko
kernel-open/nvidia-modeset.ko
kernel-open/nvidia-drm.ko
kernel-open/nvidia-uvm.ko # Contains eBPF hooks
IMPORTANT: Only use insmod for temporary loading. NEVER run make modules_install or copy .ko files to /lib/modules/. The custom modules are loaded into the running kernel only, so the system automatically reverts to the stock NVIDIA driver on reboot. If anything goes wrong, a simple reboot restores the original driver.
# Unload system modules
sudo systemctl stop nvidia-persistenced 2>/dev/null || true
sudo systemctl stop gdm3 2>/dev/null || true
sleep 2
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null || true
# Load custom modules via insmod (in dependency order)
sudo insmod kernel-module/nvidia-module/kernel-open/nvidia.ko
sudo insmod kernel-module/nvidia-module/kernel-open/nvidia-modeset.ko
sudo insmod kernel-module/nvidia-module/kernel-open/nvidia-drm.ko
sudo insmod kernel-module/nvidia-module/kernel-open/nvidia-uvm.ko
# Restart display manager
sudo systemctl start gdm3 2>/dev/null || true
# Verify
lsmod | grep nvidia

To revert to system modules at any time (without reboot):
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia && sudo modprobe nvidia_uvm

For detailed troubleshooting, see docs/driver_docs/MODULE_LOAD_UNLOAD_GUIDE.md.
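Because the system modules are unloaded before the custom ones are loaded, it can be worth confirming that all four .ko files exist first, so a failed build does not leave the machine without a GPU driver until the revert step. The helpers below are a hedged sketch, not repository code: `GPU_EXT_MODULES`, `missing_modules`, and `print_load_plan` are hypothetical names, and the module list simply mirrors the insmod order shown above.

```shell
# Module files in the dependency order used by the insmod commands above.
GPU_EXT_MODULES="nvidia.ko nvidia-modeset.ko nvidia-drm.ko nvidia-uvm.ko"

# Print each expected .ko that is missing from the given directory.
# Returns non-zero if at least one is missing.
missing_modules() {
    dir="$1"
    rc=0
    for ko in $GPU_EXT_MODULES; do
        if [ ! -f "$dir/$ko" ]; then
            printf '%s\n' "$ko"
            rc=1
        fi
    done
    return $rc
}

# Dry run: print the insmod commands in load order without executing them.
print_load_plan() {
    for ko in $GPU_EXT_MODULES; do
        printf 'sudo insmod %s/%s\n' "$1" "$ko"
    done
}
```

A typical use is `missing_modules kernel-module/nvidia-module/kernel-open && print_load_plan kernel-module/nvidia-module/kernel-open`, which prints the load commands only when every module has been built.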
With the custom kernel module loaded, attach a policy:
# Run a policy loader (stays in foreground, Ctrl-C to detach)
sudo ./extension/prefetch_adaptive_sequential
# Or run in background
sudo ./extension/eviction_lfu &
# Verify eBPF programs are attached
sudo bpftool prog list | grep struct_ops

Benchmark workloads for reproducing the paper experiments. See workloads/README.md for full setup and instructions.
| Workload | Paper | Description |
|---|---|---|
| llama.cpp | RQ1, Fig 6 | MoE expert offloading (GPT-OSS-120B, 59 GiB) |
| vLLM | RQ1, Fig 7 | KV-cache offloading (Qwen3-30B-A3B-FP8) |
| PyTorch | RQ1, Fig 8 | GNN training with UVM oversubscription (1M-15M nodes) |
| FAISS | RQ1, Fig 9 | Vector search on SIFT 20M/50M/100M |
Quick start:
cd workloads/llama.cpp
uv sync
uv run python configs/bench.py --uvm -o results/uvm_baseline.json

gpu_ext: Extensible OS Policies for GPUs via eBPF. Yusheng Zheng, Tong Yu, Yiwei Yang, Minghui Jiang, Xiangyu Gao, Jianchang Su, Yanpeng Hu, Wenan Mao, Wei Zhang, Dan Williams, Andi Quinn. arXiv:2512.12615
Documentation sync note: when paper-facing claims, policies, or benchmark configurations change, update this file, docs/gpu-ext/paper/README.md, and workloads/README.md together.
- Cross-VA-block proactive prefetch: eBPF workqueue-based prefetch that breaks the 2MB per-fault-page limit. ~20% improvement on microbenchmarks. Pending end-to-end testing on real workloads.
- GPU kernel submission-level scheduling: bpf_nv_gpu_preempt_tsg kfunc for cross-process GPU TSG preemption. Two trigger paths verified: bpf_wq from struct_ops hooks, and sleepable uprobe on cuLaunchKernel (avg 312us, no bpf_wq needed). (see docs/gpu_preempt_kfunc_plan.md)
- CPU-GPU coordinated scheduling: Combined sched_ext + GPU memory/scheduling policies (FPRS). ~5% improvement on multi-tenant serving. (see docs/xcoord_plan.md)
- Better coordinated scheduling policy: Exploring AI-driven policy search for improved CPU-GPU coordination.
- Combined host-side policies: Multiple compositions implemented and benchmarked (always_max + cycle_moe, always_max + xCoord, FPRS coord v2).
- More complex combined policies: Explore richer compositions (prefetch + eviction + scheduling + CPU coordination) once framework capabilities are expanded.
- Dynamism: Fast policy injection and fast runtime via compiler techniques, enabling both rapid development iteration and low-overhead execution.
- Paper improvements: Strengthen evaluation methodology, add new workloads, and refine the writing.
- bpftime - GPU device-side eBPF support
MIT