OEP-0003: Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments
Summary
This OEP introduces an AcceleratorClass abstraction to OME that enables intelligent runtime selection in heterogeneous GPU environments. Currently, supporting multiple GPU types (e.g., H100, A100, B200, H200) requires creating numerous runtime configurations - a combinatorial explosion that becomes unmanageable. The proposed solution provides a vendor-agnostic way to define accelerator capabilities and automatically match them with appropriate runtimes.
The design integrates seamlessly with existing Kubernetes ecosystem tools, particularly Kueue's ResourceFlavor concept, enabling users to leverage existing resource management infrastructure. By introducing capability-based matching rather than hard-coding specific GPU models, the system remains flexible and future-proof as new accelerator types emerge.
Motivation
OME's current runtime selection mechanism matches runtimes based on model characteristics (format, architecture, size, quantization) but lacks awareness of the underlying hardware accelerators. In clusters with heterogeneous GPU types, this limitation forces operators to create and maintain separate runtime definitions for each GPU model - leading to operational complexity and configuration drift.
Furthermore, the lack of standardization across GPU vendors (NVIDIA, AMD, Intel) in naming conventions and resource exposure makes it challenging to build a unified solution. Each vendor uses different labeling schemes, and Kubernetes device plugins expose resources differently (e.g., nvidia.com/gpu vs amd.com/gpu).
Goals
Reduce runtime proliferation - Enable a single runtime definition to work across multiple GPU types
Vendor-agnostic design - Support NVIDIA, AMD, Intel, and future accelerators without code changes
Kueue integration - Seamlessly work with existing Kueue ResourceFlavor deployments
Progressive disclosure - Simple for basic use cases, powerful for advanced scenarios
Automatic optimization - Select optimal GPU based on model requirements and availability
Clear override hierarchy - Provide predictable configuration precedence
Non-Goals
GPU virtualization - This OEP does not address GPU sharing or MIG configuration
Cost optimization - While the design enables cost-aware scheduling, implementing cost models is out of scope
Dynamic runtime generation - Automatically creating new runtime configurations based on discovered GPUs
Replacing existing APIs - The design extends rather than replaces current runtime selection
Proposal
Introduce new API resources and extensions to enable accelerator-aware runtime selection:
AcceleratorClass (Cluster-scoped) - Defines accelerator capabilities and discovery patterns
Runtime/InferenceService extensions - Add accelerator requirements and selection fields
Note: AcceleratorProfile was considered but deferred to reduce initial complexity. The same functionality can be achieved through InferenceService's acceleratorSelector field.
The system will automatically discover available accelerators, match them with runtime requirements, and select the optimal configuration based on model characteristics and user preferences.
User Stories
Story 1: ML Practitioner Deploying Models
Alice wants to deploy a Llama 7B model for inference. She doesn't know the differences between GPU types and just wants her model to run efficiently.
Current Experience:
```yaml
# Alice deploys with auto-selected runtime
kind: InferenceService
metadata:
  name: llama-7b
spec:
  model:
    name: llama-7b
# Runtime is auto-selected based on model, but GPU type is not considered
# Result: Might get scheduled on any available GPU (A100, H100, etc.)
# Problem: No control over GPU selection, potentially inefficient resource use
```
New Experience:
```yaml
# Alice specifies only what she needs
kind: InferenceService
metadata:
  name: llama-7b
spec:
  model:
    name: llama-7b
  runtime:
    name: sglang-universal  # Automatically selects appropriate GPU
```
Story 2: Platform Engineer Managing Multiple GPU Types
Bob manages a cluster with A100-40GB, A100-80GB, H100-80GB, and H200-96GB GPUs. He needs to support multiple model architectures without creating 16+ runtime configurations.
Current Experience:
```yaml
# Bob creates a runtime that works on all GPUs but can't optimize per GPU type
kind: ServingRuntime
metadata:
  name: sglang-all-gpus
spec:
  supportedModelFormats:
    - name: safetensors
  containers:
    - name: sglang
      env:
        # Must use conservative settings that work on smallest GPU
        - name: GPU_MEMORY_UTILIZATION
          value: "0.85"   # Conservative for all GPUs
        - name: MAX_MODEL_LEN
          value: "16384"  # Limited by smallest GPU
      # Can't enable H100-specific optimizations like FP8
```
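New Experience:
A minimal sketch of a single accelerator-aware runtime, using the acceleratorRequirements and acceleratorConfigurations fields defined under Design Details. The selector.classes field and the "at least N GB" matching semantics are assumptions, since AcceleratorConfigSelector is not spelled out in this OEP:
```yaml
# Sketch only: one runtime, per-accelerator overrides
kind: ServingRuntime
metadata:
  name: sglang-universal
spec:
  supportedModelFormats:
    - name: safetensors
  acceleratorRequirements:
    requiredCapabilities:
      memoryGB: "40"              # assumed semantics: any GPU with >= 40 GB qualifies
  engineConfig:
    acceleratorConfigurations:
      - selector:
          classes: ["nvidia-h100-80gb", "nvidia-h200-96gb"]  # assumed selector field
        env:
          - name: ENABLE_FP8
            value: "true"          # Hopper-only optimization
          - name: GPU_MEMORY_UTILIZATION
            value: "0.95"
      - selector:
          classes: ["nvidia-a100-40gb"]
        env:
          - name: GPU_MEMORY_UTILIZATION
            value: "0.85"          # conservative on the smallest GPU
```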
Story 3: Heavy Kueue User
Carol's organization uses Kueue extensively for resource management, with ResourceFlavors already defined. She wants OME to leverage the existing Kueue configuration.
Story 4: Complex Node Selection Requirements
Dave needs to deploy a model that requires H100 GPUs but must run in a specific availability zone for data locality.
Story 5: Cost-Optimized Router Deployment
Emma wants to ensure routers run on cheap CPU nodes while engines run on GPU nodes.
Story 6: Cost-Optimized Model Serving
Frank wants to optimize costs by using A100s for small requests and H100s only for large requests that need the performance.
Story 7: Advanced Performance Optimization
Grace needs to deploy models with speculative decoding and advanced quantization techniques based on GPU capabilities.
Story 8: Custom Command with Accelerator Optimization
Henry has a complex sglang deployment with custom commands but wants GPU-specific optimizations to be applied.
Scenario 1: User provides full command (accelerator args NOT applied)
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: custom-command-deployment
spec:
  engine:
    runner:
      # User has full control with command
      command:
        - sh
        - -c
        - >
          python3 -m sglang.launch_server
          --host 0.0.0.0 --port 8080
          --model-path ${MODEL_PATH}
          --tp-size 16
          --nccl-init $(LWS_LEADER_ADDRESS):5000
          --nnodes ${LWS_GROUP_SIZE}
          --node-rank ${LWS_WORKER_INDEX}
          --trust-remote-code
          --enable-torch-compile
          --torch-compile-max-bs 1
          --reasoning-parser deepseek-r1
          --enable-metrics
  # Even with acceleratorSelector, args from acceleratorConfigurations are NOT applied
  # because user specified command
  acceleratorSelector:
    preferredClasses: ["nvidia-h100-80gb"]
```
Scenario 2: User provides args array (accelerator args are appended)
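The example below is a sketch assuming the merge rules described under Notes/Constraints/Caveats (accelerator args first, user args appended); the deployment name and args are illustrative:
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: custom-args-deployment   # illustrative name
spec:
  engine:
    runner:
      # No command given: the runtime's entrypoint is kept, so accelerator args apply
      args:
        - --log-level=debug       # user args are appended after accelerator args
  runtime:
    name: sglang-performance-optimized
  acceleratorSelector:
    preferredClasses: ["nvidia-h100-80gb"]
# Result: final args = [accelerator args...] + [--log-level=debug]
```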
Scenario 3: Environment variable merging
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: env-merge-deployment
spec:
  engine:
    runner:
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "4"             # User override
        - name: CUSTOM_SETTING
          value: "user-value"
  runtime:
    name: sglang-performance-optimized
  acceleratorSelector:
    preferredClasses: ["nvidia-h100-80gb"]
# Result: Final env will include:
#   TENSOR_PARALLEL_SIZE=4 (user value wins)
#   CUSTOM_SETTING=user-value (user defined)
#   ENABLE_FP8=true (from acceleratorConfig, not overridden)
#   GPU_MEMORY_UTILIZATION=0.95 (from acceleratorConfig)
```
Notes/Constraints/Caveats
Kueue Integration: When Kueue ResourceFlavors exist, OME can auto-discover and create corresponding AcceleratorClasses. This ensures consistency and reduces duplicate configuration.
Vendor Differences: Different GPU vendors expose resources differently (nvidia.com/gpu, amd.com/gpu). The AcceleratorClass abstraction handles these differences transparently.
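For example, the same logical "one GPU" request maps to different device-plugin resource names in each class's resources list (a sketch; quantities are illustrative):
```yaml
# In an NVIDIA AcceleratorClass:
resources:
  - name: nvidia.com/gpu
    quantity: "1"
# In an AMD AcceleratorClass:
resources:
  - name: amd.com/gpu
    quantity: "1"
```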
Override Precedence: Configuration follows a clear hierarchy: ServingRuntime defaults → AcceleratorClass configurations → InferenceService spec → Pod annotations
Backward Compatibility: Existing InferenceServices continue to work. The new fields are optional and only enhance functionality when used.
Interaction with Existing NodeSelector/Affinity: Both ServingRuntime and InferenceService already have nodeSelector and affinity fields. The AcceleratorClass system works as follows:
AcceleratorClass defines the node selection criteria for specific GPU types
When an accelerator is selected, its nodeSelector is merged with the existing nodeSelectors
Affinity rules are combined using AND logic
User-specified nodeSelector/affinity in the InferenceService takes precedence over AcceleratorClass, allowing users to further constrain placement beyond GPU type
Component Architecture: OME uses separate Engine, Decoder, and Router components:
Engine and Decoder can have different AcceleratorClass selections
Router typically runs CPU-only and does not apply accelerator configurations
Components receive already-merged specs from the controller
Container Args Merging: Following OME's existing behavior:
Runtime args and user args are concatenated, not merged
Accelerator args are prepended to runtime args
If the user specifies command, accelerator args are not applied
If the user specifies args, they are appended after accelerator args
Router Component Handling: The router component is CPU-only and doesn't require GPUs:
AcceleratorClass constraints are only applied to Engine and Decoder components
Router pods maintain their own independent nodeSelector/affinity settings
This prevents routers from being unnecessarily scheduled on expensive GPU nodes
Router can be explicitly scheduled on CPU-only nodes for cost optimization
Container Arguments Merging: When acceleratorConfigurations specify args and the user also provides command/args:
If user specifies command, accelerator args are NOT applied (user has full control)
If user specifies args as array, accelerator args are appended
Environment variables are merged (user values take precedence)
Resources are merged (max of user and accelerator values)
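A sketch of these merge semantics using Kubernetes core/v1 types; the helper names are illustrative, not the actual OME implementation:
```go
package merge

import v1 "k8s.io/api/core/v1"

// mergeEnv applies accelerator env vars, letting user-defined values win.
func mergeEnv(user, accel []v1.EnvVar) []v1.EnvVar {
	seen := map[string]bool{}
	for _, e := range user {
		seen[e.Name] = true
	}
	merged := append([]v1.EnvVar{}, user...)
	for _, e := range accel {
		if !seen[e.Name] { // user values take precedence
			merged = append(merged, e)
		}
	}
	return merged
}

// mergeResources keeps the max of user and accelerator values per resource.
func mergeResources(user, accel v1.ResourceList) v1.ResourceList {
	merged := user.DeepCopy()
	for name, q := range accel {
		if cur, ok := merged[name]; !ok || cur.Cmp(q) < 0 {
			merged[name] = q
		}
	}
	return merged
}
```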
Risks and Mitigations
Risk 1: Complexity for simple use cases
Mitigation: All new fields are optional. Users can continue using existing simple configurations.
Risk 2: Conflict with Kueue ResourceFlavors
Mitigation: Provide automatic discovery and synchronization with Kueue ResourceFlavors. Allow bi-directional mapping.
Risk 3: Performance overhead in selection
Mitigation: Cache accelerator discovery results. Use efficient matching algorithms.
Risk 4: Vendor lock-in through capability definitions
Mitigation: Use generic capability names. Allow vendor-specific extensions without requiring them.
Design Details
Example AcceleratorClass Definitions
Users or platform administrators would create AcceleratorClass resources for their GPU types. Here are common examples:
Platform-Provided Base Set
Platform teams would typically provide a base set of AcceleratorClasses:
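For instance, a definition for an 80 GB H100 class (a sketch against the schema below; the nvidia.com/gpu.product label value is an assumption based on common GPU-operator labels):
```yaml
apiVersion: ome.io/v1beta1
kind: AcceleratorClass
metadata:
  name: nvidia-h100-80gb
spec:
  vendor: nvidia
  family: hopper
  model: h100
  discovery:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # assumed label value
  capabilities:
    memoryGB: "80"
    computeCapability: "9.0"
    features: ["fp8", "nvlink"]
  resources:
    - name: nvidia.com/gpu
      quantity: "1"
  integration:
    kueueResourceFlavor: gpu-h100   # optional link to an existing flavor
```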
Auto-Discovery Examples
The platform could also auto-discover and create AcceleratorClasses; the controller sketch under Kueue Integration shows how capabilities are read from node labels.
API Specifications
AcceleratorClass
```go
// AcceleratorClass defines a class of accelerators with similar capabilities
// +kubebuilder:object:root=true
// +kubebuilder:resource:scope=Cluster
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=`.spec.vendor`
// +kubebuilder:printcolumn:name="Family",type=string,JSONPath=`.spec.family`
// +kubebuilder:printcolumn:name="Memory",type=string,JSONPath=`.spec.capabilities.memoryGB`
// +kubebuilder:printcolumn:name="Nodes",type=integer,JSONPath=`.status.availableNodes`
type AcceleratorClass struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AcceleratorClassSpec   `json:"spec,omitempty"`
	Status AcceleratorClassStatus `json:"status,omitempty"`
}

type AcceleratorClassSpec struct {
	// Vendor of the accelerator (nvidia, amd, intel, etc.)
	// +optional
	Vendor string `json:"vendor,omitempty"`

	// Family of the accelerator (ampere, hopper, cdna2, etc.)
	// +optional
	Family string `json:"family,omitempty"`

	// Model name (a100, h100, mi250x, etc.)
	// +optional
	Model string `json:"model,omitempty"`

	// Discovery patterns to identify nodes with this accelerator
	Discovery AcceleratorDiscovery `json:"discovery"`

	// Capabilities of this accelerator class
	Capabilities AcceleratorCapabilities `json:"capabilities"`

	// Resources exposed by this accelerator
	// +optional
	Resources []AcceleratorResource `json:"resources,omitempty"`

	// Integration with external systems
	// +optional
	Integration *AcceleratorIntegration `json:"integration,omitempty"`

	// Cost information for optimization decisions
	// +optional
	Cost *AcceleratorCost `json:"cost,omitempty"`
}

type AcceleratorCost struct {
	// Cost per hour in dollars
	// +optional
	PerHour *resource.Quantity `json:"perHour,omitempty"`

	// Cost per million tokens (for usage-based pricing)
	// +optional
	PerMillionTokens *resource.Quantity `json:"perMillionTokens,omitempty"`

	// Spot instance pricing if available
	// +optional
	SpotPerHour *resource.Quantity `json:"spotPerHour,omitempty"`

	// Cost tier for simplified selection (low, medium, high)
	// +optional
	Tier string `json:"tier,omitempty"`
}

type AcceleratorDiscovery struct {
	// NodeSelector to identify nodes with this accelerator
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// NodeSelectorTerms for more complex node selection
	// +optional
	NodeSelectorTerms []v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`

	// PCIVendorID for device discovery (e.g., "10de" for NVIDIA)
	// +optional
	PCIVendorID string `json:"pciVendorID,omitempty"`

	// DeviceIDs list of PCI device IDs
	// +optional
	DeviceIDs []string `json:"deviceIDs,omitempty"`
}

type AcceleratorCapabilities struct {
	// Memory capacity in GB
	// +optional
	MemoryGB *resource.Quantity `json:"memoryGB,omitempty"`

	// Compute capability (NVIDIA) or equivalent
	// +optional
	ComputeCapability string `json:"computeCapability,omitempty"`

	// Clock speeds
	// +optional
	ClockSpeedMHz *int32 `json:"clockSpeedMHz,omitempty"`

	// Memory bandwidth
	// +optional
	MemoryBandwidthGBps *resource.Quantity `json:"memoryBandwidthGBps,omitempty"`

	// Features supported by this accelerator
	// +optional
	Features []string `json:"features,omitempty"`

	// Performance metrics
	// +optional
	Performance *AcceleratorPerformance `json:"performance,omitempty"`
}

type AcceleratorResource struct {
	// Name of the resource (e.g., nvidia.com/gpu)
	Name string `json:"name"`

	// Quantity per accelerator
	// +kubebuilder:default="1"
	Quantity resource.Quantity `json:"quantity,omitempty"`

	// Divisible indicates if the resource can be subdivided
	// +optional
	Divisible bool `json:"divisible,omitempty"`
}

type AcceleratorIntegration struct {
	// KueueResourceFlavor name to sync with
	// +optional
	KueueResourceFlavor string `json:"kueueResourceFlavor,omitempty"`

	// VolcanoGPUType for Volcano integration
	// +optional
	VolcanoGPUType string `json:"volcanoGPUType,omitempty"`
}

type AcceleratorClassStatus struct {
	// Nodes that have this accelerator
	// +optional
	Nodes []string `json:"nodes,omitempty"`

	// Total number of accelerators in the cluster
	// +optional
	TotalAccelerators int32 `json:"totalAccelerators,omitempty"`

	// Available accelerators (not allocated)
	// +optional
	AvailableAccelerators int32 `json:"availableAccelerators,omitempty"`

	// Last update time
	// +optional
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`

	// Conditions represent the latest available observations
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```
InferenceService Extensions
```go
// Update InferenceServiceSpec in inference_service.go
type InferenceServiceSpec struct {
	// Existing fields...
	Engine  *EngineSpec        `json:"engine,omitempty"`
	Decoder *DecoderSpec       `json:"decoder,omitempty"`
	Model   *ModelRef          `json:"model,omitempty"`
	Runtime *ServingRuntimeRef `json:"runtime,omitempty"`
	Router  *RouterSpec        `json:"router,omitempty"`

	// NEW: Accelerator selection preferences
	// +optional
	AcceleratorSelector *AcceleratorSelector `json:"acceleratorSelector,omitempty"`
}

type AcceleratorSelector struct {
	// PreferredClasses in order of preference
	// +optional
	PreferredClasses []string `json:"preferredClasses,omitempty"`

	// RequiredCapabilities that must be met
	// +optional
	RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

	// Strategy for selection (performance, cost, balanced)
	// +kubebuilder:default="balanced"
	// +optional
	Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`

	// NodeSelector for specific node targeting
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

// Update EngineSpec to support accelerator configuration
type EngineSpec struct {
	// Existing fields...
	PodSpec                `json:",inline"`
	ComponentExtensionSpec `json:",inline"`
	Runner                 *RunnerSpec `json:"runner,omitempty"`
	Leader                 *LeaderSpec `json:"leader,omitempty"`
	Worker                 *WorkerSpec `json:"worker,omitempty"`

	// NEW: Accelerator-specific configuration overrides
	// Applied based on selected accelerator class
	// +optional
	AcceleratorConfigurations []AcceleratorConfiguration `json:"acceleratorConfigurations,omitempty"`
}

type AcceleratorConfiguration struct {
	// Selector for which accelerator classes this applies to
	Selector AcceleratorConfigSelector `json:"selector"`

	// Environment variables to set
	// +optional
	Env []v1.EnvVar `json:"env,omitempty"`

	// Resources to request/limit
	// +optional
	Resources v1.ResourceRequirements `json:"resources,omitempty"`

	// Runner overrides
	// +optional
	Runner *RunnerSpec `json:"runner,omitempty"`
}
```
ServingRuntime Extensions
```go
// Update ServingRuntimeSpec in servingruntime_types.go
type ServingRuntimeSpec struct {
	// Existing fields...
	SupportedModelFormats []SupportedModelFormat `json:"supportedModelFormats,omitempty"`
	ModelSizeRange        *ModelSizeRangeSpec    `json:"modelSizeRange,omitempty"`
	Disabled              *bool                  `json:"disabled,omitempty"`
	RouterConfig          *RouterSpec            `json:"routerConfig,omitempty"`
	EngineConfig          *EngineSpec            `json:"engineConfig,omitempty"`
	DecoderConfig         *DecoderSpec           `json:"decoderConfig,omitempty"`

	// NEW: Accelerator requirements for this runtime
	// +optional
	AcceleratorRequirements *AcceleratorRequirements `json:"acceleratorRequirements,omitempty"`
}

type AcceleratorRequirements struct {
	// SupportedClasses explicitly lists supported accelerator classes
	// +optional
	SupportedClasses []string `json:"supportedClasses,omitempty"`

	// RequiredCapabilities that any accelerator must meet
	// +optional
	RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

	// PreferenceOrder for accelerator selection
	// +optional
	PreferenceOrder []AcceleratorPreference `json:"preferenceOrder,omitempty"`
}

type AcceleratorPreference struct {
	// Class name or capability matcher
	Class string `json:"class,omitempty"`

	// Score for this preference (higher is better)
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	Score int32 `json:"score"`

	// Conditions when this preference applies
	// +optional
	Conditions []PreferenceCondition `json:"conditions,omitempty"`
}
```
Kueue Integration
The system provides bi-directional integration with Kueue ResourceFlavors:
```go
// AcceleratorClassController watches for Kueue ResourceFlavors
func (r *AcceleratorClassReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Check if this is a Kueue ResourceFlavor
	if req.Namespace == "" && strings.HasPrefix(req.Name, "gpu-") {
		// Try to get corresponding ResourceFlavor
		rf := &kueuev1beta1.ResourceFlavor{}
		if err := r.Get(ctx, req.NamespacedName, rf); err == nil {
			// Create or update AcceleratorClass from ResourceFlavor
			return r.syncFromResourceFlavor(ctx, rf)
		}
	}

	// Normal AcceleratorClass reconciliation
	ac := &omev1beta1.AcceleratorClass{}
	if err := r.Get(ctx, req.NamespacedName, ac); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If linked to Kueue, ensure ResourceFlavor exists
	if ac.Spec.Integration != nil && ac.Spec.Integration.KueueResourceFlavor != "" {
		return r.ensureResourceFlavor(ctx, ac)
	}

	return r.updateAcceleratorStatus(ctx, ac)
}

// Auto-discovery of GPU capabilities from nodes
func (r *AcceleratorClassReconciler) discoverAcceleratorCapabilities(
	ctx context.Context,
	nodes []v1.Node,
) (*AcceleratorCapabilities, error) {
	capabilities := &AcceleratorCapabilities{
		Features: []string{},
	}

	for _, node := range nodes {
		// Extract GPU information from node labels
		if memory, ok := node.Labels["nvidia.com/gpu.memory"]; ok {
			if mem, err := resource.ParseQuantity(memory); err == nil {
				capabilities.MemoryGB = &mem
			}
		}
		if cc, ok := node.Labels["nvidia.com/gpu.compute"]; ok {
			capabilities.ComputeCapability = cc
		}

		// Check for specific features
		if _, ok := node.Labels["nvidia.com/mig.capable"]; ok {
			capabilities.Features = append(capabilities.Features, "mig")
		}
		if _, ok := node.Labels["nvidia.com/nvlink"]; ok {
			capabilities.Features = append(capabilities.Features, "nvlink")
		}
	}

	return capabilities, nil
}
```
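The syncFromResourceFlavor helper referenced above is not spelled out in this OEP; one plausible sketch maps the flavor's nodeLabels into the class's discovery selector:
```go
// Sketch only, not the final implementation. Assumes the reconciler embeds
// client.Client and that controllerutil comes from
// sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.
func (r *AcceleratorClassReconciler) syncFromResourceFlavor(
	ctx context.Context,
	rf *kueuev1beta1.ResourceFlavor,
) (ctrl.Result, error) {
	ac := &omev1beta1.AcceleratorClass{}
	ac.Name = rf.Name
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, ac, func() error {
		// Reuse the flavor's node labels as discovery criteria
		ac.Spec.Discovery.NodeSelector = rf.Spec.NodeLabels
		// Record the linkage for bi-directional sync
		ac.Spec.Integration = &omev1beta1.AcceleratorIntegration{
			KueueResourceFlavor: rf.Name,
		}
		return nil
	})
	return ctrl.Result{}, err
}
```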
Override Hierarchy
The configuration precedence follows a clear hierarchy, from lowest to highest:
1. ServingRuntime defaults
2. AcceleratorClass configurations
3. InferenceService spec
4. Pod annotations
For example, if the runtime default sets GPU_MEMORY_UTILIZATION=0.85, the selected AcceleratorClass configuration raises it to 0.95, and the InferenceService spec sets 0.90, the pod runs with 0.90; a pod annotation, if present, would win over all three.
Implementation Architecture
```mermaid
graph TB
    subgraph "API Layer"
        IS[InferenceService]
        SR[ServingRuntime]
        AC[AcceleratorClass]
    end
    subgraph "Controllers"
        ISC[InferenceService Controller]
        ACC[AcceleratorClass Controller]
        KIC[Kueue Integration Controller]
    end
    subgraph "Selection Engine"
        RS[Runtime Selector]
        AS[Accelerator Matcher]
        SC[Score Calculator]
    end
    subgraph "External Systems"
        K8S[Kubernetes Nodes]
        KQ[Kueue ResourceFlavors]
        GPU[GPU Operators]
    end
    IS --> ISC
    ISC --> RS
    RS --> AS
    AS --> AC
    RS --> SC
    ACC --> K8S
    ACC <--> KQ
    KIC --> KQ
    KIC --> AC
    K8S --> GPU
```
Test Plan
Unit Tests
pkg/controller/acceleratorclass: 2024-12-01 - 0% (new package)
pkg/controller/inferenceservice/utils: 2024-12-01 - Current coverage
pkg/apis/ome/v1beta1: 2024-12-01 - Current coverage
Integration Tests
Basic Accelerator Selection
Kueue Integration
Override Behavior
Multi-GPU Scenarios
Failure Scenarios
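As a sketch of what the basic-selection case could assert (matchesCapabilities is a hypothetical helper, not existing OME code):
```go
package acceleratorclass_test

import (
	"testing"

	"k8s.io/apimachinery/pkg/api/resource"
)

// matchesCapabilities is a hypothetical helper: a class satisfies a
// requirement when its memory is at least the requested amount.
func matchesCapabilities(classMemGB, requiredMemGB resource.Quantity) bool {
	return classMemGB.Cmp(requiredMemGB) >= 0
}

func TestBasicAcceleratorSelection(t *testing.T) {
	h100 := resource.MustParse("80")
	required := resource.MustParse("40")
	if !matchesCapabilities(h100, required) {
		t.Fatalf("expected an 80GB class to satisfy a 40GB requirement")
	}
}
```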
Deployment Strategies
Organizations can choose different approaches for managing AcceleratorClasses:
Option 1: Manual Management
Option 2: Auto-Discovery with Override
Option 3: Template-Based
Graduation Criteria
Alpha (v0.3):
Beta (v0.4):
Stable (v0.5):
Implementation History
Drawbacks
Additional Complexity: Introduces new APIs and concepts that users need to understand
Maintenance Overhead: Requires keeping accelerator definitions up-to-date as new GPU models are released
Integration Complexity: Supporting multiple ecosystem tools (Kueue, Volcano) adds complexity
Migration Effort: Existing users need to migrate to benefit from new features
Alternatives
Alternative 1: Extend ServingRuntime Directly
Instead of new APIs, add GPU requirements directly to ServingRuntime:
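A sketch of what this rejected shape could have looked like (the gpuRequirements field and its contents are hypothetical):
```yaml
kind: ServingRuntime
metadata:
  name: sglang-h100
spec:
  # Hypothetical inline GPU requirements; rejected in favor of AcceleratorClass
  gpuRequirements:
    models: ["h100", "h200"]
    minMemoryGB: 80
```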
Rejected because:
Alternative 2: Node Selector Templates
Use templating in ServingRuntimes:
Rejected because:
Alternative 3: Rely on External Tools
Use Kueue or Volcano exclusively for GPU management:
Rejected because:
Alternative 4: Dynamic Runtime Generation
Automatically generate runtimes based on discovered GPUs:
Rejected because:
The chosen approach balances flexibility, compatibility, and user experience while providing a clear path for future enhancements.
Completion requirements
This enhancement requires the following artifacts: