A high-performance Golang tool for analyzing GPU resource allocation and utilization in Kubernetes clusters.
Reduce API server load by up to 75% while getting comprehensive GPU usage insights.

⭐ If this project helps you, please give it a star! ⭐
- Automatically discover GPU nodes in the cluster
- Count GPU pods on each node
- Calculate GPU resource requests and utilization rates
- Display statistics in a clear table format
- Support out-of-cluster access to Kubernetes
- Filter GPU nodes by custom labels
- Query pods from specific namespaces to reduce apiserver load
- Go 1.21 or higher
- Valid kubeconfig file (usually located at `~/.kube/config`)
- Access permissions to the Kubernetes cluster
Download the latest release from GitHub Releases:
Linux:

```bash
# AMD64
wget https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-linux-amd64
chmod +x k8s-gpu-analyzer-linux-amd64
./k8s-gpu-analyzer-linux-amd64

# ARM64
wget https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-linux-arm64
chmod +x k8s-gpu-analyzer-linux-arm64
./k8s-gpu-analyzer-linux-arm64
```

macOS:
```bash
# Intel
wget https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-darwin-amd64
chmod +x k8s-gpu-analyzer-darwin-amd64
# Remove quarantine attribute (required for unsigned binaries)
xattr -d com.apple.quarantine k8s-gpu-analyzer-darwin-amd64 2>/dev/null || true
./k8s-gpu-analyzer-darwin-amd64

# Apple Silicon
wget https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-darwin-arm64
chmod +x k8s-gpu-analyzer-darwin-arm64
# Remove quarantine attribute (required for unsigned binaries)
xattr -d com.apple.quarantine k8s-gpu-analyzer-darwin-arm64 2>/dev/null || true
./k8s-gpu-analyzer-darwin-arm64
```

Alternative for macOS (if you see security warnings):
```bash
# Method 1: Use the command line to bypass Gatekeeper
sudo spctl --master-disable   # Temporarily disable Gatekeeper
./k8s-gpu-analyzer-darwin-arm64
sudo spctl --master-enable    # Re-enable Gatekeeper

# Method 2: Manual approval (recommended)
# 1. Try to run the binary (it will fail with a security warning)
# 2. Go to System Preferences → Security & Privacy → General
# 3. Click "Allow Anyway" next to the blocked app message
# 4. Run the binary again and click "Open" when prompted
```

Windows:
```powershell
# AMD64
Invoke-WebRequest -Uri https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-windows-amd64.exe -OutFile k8s-gpu-analyzer.exe
.\k8s-gpu-analyzer.exe

# ARM64
Invoke-WebRequest -Uri https://github.com/Kevinz857/k8s-gpu-analyzer/releases/latest/download/k8s-gpu-analyzer-windows-arm64.exe -OutFile k8s-gpu-analyzer.exe
.\k8s-gpu-analyzer.exe
```

Build from source:

```bash
# Clone the project
git clone https://github.com/Kevinz857/k8s-gpu-analyzer.git
cd k8s-gpu-analyzer

# Download dependencies
go mod tidy

# Build the project
make build

# Run the program
./bin/k8s-gpu-analyzer
```

Usage examples:

```bash
# Use default settings (gpu=true label, default namespace)
./bin/k8s-gpu-analyzer

# Specify custom node labels
./bin/k8s-gpu-analyzer --node-labels "gpu=true,instance-type=gpu"

# Specify custom namespaces
./bin/k8s-gpu-analyzer --namespaces "default,kube-system,gpu-namespace"

# Combine both options
./bin/k8s-gpu-analyzer --node-labels "gpu=true" --namespaces "default,production"

# Show help
./bin/k8s-gpu-analyzer --help
```

Available flags:

- `-l, --node-labels`: Node labels to filter GPU nodes (format: `key=value,key2=value2`) (default: `gpu=true`)
- `-n, --namespaces`: Namespaces to search for pods (comma-separated) (default: `default`)
- `-h, --help`: Show help information
The program looks for kubeconfig files in the following order:

- Path specified by the `KUBECONFIG` environment variable
- Default path `~/.kube/config`

To specify a custom kubeconfig file:

```bash
export KUBECONFIG=/path/to/your/kubeconfig
./bin/k8s-gpu-analyzer
```

The program outputs a table in the following format:
```
GPU-Node-Name   GPUPodCount   NodeGPURequest   NodeGPURequestPercent   NodeGPUTotal
-------------   -----------   --------------   ---------------------   ------------
gpu-node-001    3             6                75.00%                  8
gpu-node-002    2             4                50.00%                  8
gpu-node-003    1             2                25.00%                  8
-------------   -----------   --------------   ---------------------   ------------
TOTAL           6             12               50.00%                  24

Summary:
  Total GPU nodes: 3
  Total GPU pods: 6
  Total GPU requests: 12
  Total GPU capacity: 24
  Overall GPU utilization: 50.00%
```
- GPU-Node-Name: GPU node name
- GPUPodCount: Number of GPU-using pods on the node
- NodeGPURequest: Total GPU resource requests on the node
- NodeGPURequestPercent: GPU requests as a percentage of the node's GPU capacity
- NodeGPUTotal: Total GPU capacity of the node
The program identifies GPU nodes using the following rules:
- Custom Labels: If `--node-labels` is specified, nodes must match all specified labels
- GPU Resources: Node resources contain `nvidia.com/gpu`
- Label Keywords: Node labels contain `gpu`, `nvidia`, or `accelerator` keywords
- Name Keywords: Node names contain `gpu` or `nvidia` keywords
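Taken together, the rules can be approximated as follows. This is a simplified sketch with pared-down stand-in types (the real logic lives in `internal/monitor`), and the exact precedence between rules is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal stand-in for corev1.Node, keeping only
// the fields the detection rules need.
type node struct {
	Name     string
	Labels   map[string]string
	Capacity map[string]int64 // resource name -> quantity
}

// isGPUNode applies the detection rules: custom labels (if given),
// nvidia.com/gpu capacity, label keywords, then name keywords.
func isGPUNode(n node, required map[string]string) bool {
	// Rule 1: if custom labels were given, all must match.
	if len(required) > 0 {
		for k, v := range required {
			if n.Labels[k] != v {
				return false
			}
		}
		return true
	}
	// Rule 2: node advertises nvidia.com/gpu capacity.
	if n.Capacity["nvidia.com/gpu"] > 0 {
		return true
	}
	// Rule 3: label keys or values contain GPU-related keywords.
	for k, v := range n.Labels {
		for _, kw := range []string{"gpu", "nvidia", "accelerator"} {
			if strings.Contains(k, kw) || strings.Contains(v, kw) {
				return true
			}
		}
	}
	// Rule 4: node name contains a GPU-related keyword.
	return strings.Contains(n.Name, "gpu") || strings.Contains(n.Name, "nvidia")
}

func main() {
	n := node{Name: "gpu-node-001", Capacity: map[string]int64{"nvidia.com/gpu": 8}}
	fmt.Println(isGPUNode(n, nil)) // a node with GPU capacity matches
}
```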
To reduce load on the apiserver, especially in large clusters, the tool uses several optimization strategies:
The tool intelligently minimizes API calls based on the number of GPU nodes:

- Single GPU Node: Uses field selectors for maximum efficiency
  - API calls: `1 nodes.list() + M pods.list()` (where M = number of namespaces)
  - Example: 1 node + 3 namespaces = 4 API calls total
- Multiple GPU Nodes: Batch queries to reduce API calls
  - API calls: `1 nodes.list() + M pods.list()` (same as single node!)
  - Filters pods client-side to avoid N × M API calls
  - Example: 5 nodes + 3 namespaces = 4 API calls (not 16!)

Additional strategies:

- Node Label Filtering: Use `--node-labels` to filter nodes at the Kubernetes API level
- Namespace Filtering: By default, only queries the `default` namespace
- Smart Pod Querying:
  - Single node: Uses the `spec.nodeName=<node>` field selector
  - Multiple nodes: Queries all pods per namespace once, filters client-side
- Early GPU Node Detection: Filters GPU nodes before any pod queries
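The multiple-node path can be sketched with a small stand-in for the client-go calls. Here `pod` is a hypothetical minimal type, and `allPods` plays the role of the pods returned by the M per-namespace list calls; the point is that pods are matched against the GPU-node set client-side rather than with one query per node:

```go
package main

import "fmt"

// pod is a minimal stand-in for corev1.Pod.
type pod struct {
	Name      string
	Namespace string
	NodeName  string
}

// podsOnGPUNodes groups pods by GPU node using a set lookup,
// avoiding the N x M per-node queries of the naive approach.
func podsOnGPUNodes(allPods []pod, gpuNodes []string) map[string][]pod {
	nodeSet := make(map[string]bool, len(gpuNodes))
	for _, n := range gpuNodes {
		nodeSet[n] = true
	}
	byNode := make(map[string][]pod)
	for _, p := range allPods {
		if nodeSet[p.NodeName] {
			byNode[p.NodeName] = append(byNode[p.NodeName], p)
		}
	}
	return byNode
}

func main() {
	pods := []pod{
		{Name: "train-a", Namespace: "default", NodeName: "gpu-node-001"},
		{Name: "web-b", Namespace: "default", NodeName: "cpu-node-007"},
		{Name: "train-c", Namespace: "default", NodeName: "gpu-node-002"},
	}
	byNode := podsOnGPUNodes(pods, []string{"gpu-node-001", "gpu-node-002"})
	fmt.Println(len(byNode["gpu-node-001"]), len(byNode["gpu-node-002"])) // prints: 1 1
}
```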
We evaluated several methods for gathering GPU usage information:
| Method | API Calls | Pros | Cons |
|---|---|---|---|
| Current Optimized | 1 + M | Minimal API calls, accurate | Some client-side filtering |
| Original Per-Node | 1 + (N × M) | Simple logic | High API server load |
| Metrics API | 2-3 | Very low API calls | Requires metrics-server, no pod count |
| Node Status Only | 1 | Minimal load | No pod-level details, less accurate |
| Event-based | 2-3 | Low API calls | Unreliable, events expire |
Legend: N = GPU nodes, M = namespaces
For a cluster with 5 GPU nodes and 3 namespaces:
- Naive approach: 1 + (5 × 3) = 16 API calls
- Our optimized approach: 1 + 3 = 4 API calls (75% reduction)
The optimization becomes more significant as the number of GPU nodes increases.
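The arithmetic above generalizes directly; a tiny helper (hypothetical, for illustration) makes the comparison explicit:

```go
package main

import "fmt"

// apiCalls returns the request counts for the naive per-node
// strategy (1 + N*M) and the batched strategy (1 + M), where
// N is the number of GPU nodes and M the number of namespaces.
func apiCalls(n, m int) (naive, optimized int) {
	return 1 + n*m, 1 + m
}

func main() {
	naive, opt := apiCalls(5, 3)
	fmt.Printf("naive=%d optimized=%d reduction=%.0f%%\n",
		naive, opt, 100*float64(naive-opt)/float64(naive))
	// naive=16 optimized=4 reduction=75%
}
```

Note that the optimized count is independent of N, which is why the savings grow with the number of GPU nodes.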
- Only counts pods in Running and Pending states
- Supports checking both requests and limits for GPU resources in pods
- If a pod only sets limits without requests, uses the limits value
- All GPU-related resources are based on the `nvidia.com/gpu` resource type
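These accounting rules can be sketched as follows. The `container` type is a hypothetical simplification of the real Kubernetes resource quantities, used only to show the requests-then-limits fallback and the phase filter:

```go
package main

import "fmt"

// container keeps just the nvidia.com/gpu values from a
// container's resource requests and limits (0 = not set).
type container struct {
	GPURequest int64
	GPULimit   int64
}

// podGPURequest sums per-container GPU requests, falling back
// to limits when only limits are set, and counts nothing for
// pods that are not Running or Pending.
func podGPURequest(phase string, containers []container) int64 {
	if phase != "Running" && phase != "Pending" {
		return 0
	}
	var total int64
	for _, c := range containers {
		switch {
		case c.GPURequest > 0:
			total += c.GPURequest
		case c.GPULimit > 0:
			total += c.GPULimit // limits-only container: use the limit value
		}
	}
	return total
}

func main() {
	// One container with an explicit request, one with limits only.
	fmt.Println(podGPURequest("Running", []container{{GPURequest: 2}, {GPULimit: 1}})) // prints: 3
	fmt.Println(podGPURequest("Succeeded", []container{{GPURequest: 2}}))              // prints: 0
}
```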
- Failed to connect to Kubernetes cluster
  - Check if the kubeconfig file exists and is valid
  - Verify network connectivity and cluster access permissions
- No GPU nodes found
  - Confirm that GPU nodes actually exist in the cluster
  - Check if nodes are properly configured with GPU resources
  - Verify that node labels match the specified `--node-labels`
- Permission errors
  - Ensure the kubeconfig user has sufficient permissions to access node and pod information
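For the permission case, a minimal read-only ClusterRole along these lines is sufficient (the metadata name is illustrative; bind it to your user or service account with a ClusterRoleBinding):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-analyzer-readonly   # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
```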
The project follows the Go Standard Project Layout:
```
.
├── cmd/
│   └── k8s-gpu-analyzer/   # Main application
│       └── main.go
├── internal/               # Private application and library code
│   ├── k8s/                # Kubernetes client
│   │   └── client.go
│   └── monitor/            # GPU analysis core logic
│       ├── gpu_analyzer.go
│       └── printer.go
├── pkg/                    # Library code that can be used by external applications
│   └── types/              # Public type definitions
│       └── types.go
├── go.mod                  # Go module definition
├── Makefile                # Build scripts
├── README.md               # Documentation
└── .gitignore              # Git ignore file
```
Main modules:
- `cmd/k8s-gpu-analyzer`: Main application entry point
- `internal/k8s`: Kubernetes client creation and configuration
- `internal/monitor`: GPU monitoring core logic and output formatting
- `pkg/types`: Shared data type definitions
MIT License