Implement vendor-specific BMC/BIOS upgrade checks with firmware inventory function#733
Implement vendor-specific BMC/BIOS upgrade checks with firmware inventory function#733
Conversation
…g and integrate checks in BIOS and BMC upgrade processes
WalkthroughThis PR adds a CheckBMCPendingComponentUpgrade capability to detect pending/staged firmware upgrades before initiating component upgrades. It includes vendor-specific implementations for Dell, HPE, and Lenovo via a shared generic helper function, along with test stubs. The feature integrates with BMC and BIOS upgrade controllers to conditionally defer upgrades when pending upgrades are detected. Changes
Sequence DiagramsequenceDiagram
participant Controller as BMC/BIOS Controller
participant BMC as BMC Interface
participant Vendor as Vendor Implementation<br/>(Dell/HPE/Lenovo)
participant Helper as checkPendingComponentUpgrade
participant Redfish as Redfish API
Controller->>BMC: CheckBMCPendingComponentUpgrade(ctx, componentType)
BMC->>Vendor: CheckBMCPendingComponentUpgrade(ctx, componentType)
Vendor->>Helper: checkPendingComponentUpgrade(ctx, callbacks...)
Helper->>Redfish: GetUpdateService()
Redfish-->>Helper: UpdateService
Helper->>Redfish: GetFirmwareInventory()
Redfish-->>Helper: []FirmwareInventory
loop For each firmware item
Helper->>Vendor: matchesComponentFilter(fw, filters)
Vendor-->>Helper: match (bool)
alt Match found
Helper->>Vendor: checkPending(fw)
Vendor-->>Helper: isPending (bool)
end
end
Helper-->>Vendor: pending (bool)
Vendor-->>BMC: pending (bool), error
BMC-->>Controller: pending (bool), error
alt Pending upgrade detected
Controller->>Controller: Defer upgrade (requeue)
else No pending upgrade
Controller->>BMC: Proceed with component upgrade
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
bmc/bmc.go (1)
124-125: Use a typed component enum instead of raw strings.This API now fans out into multiple vendor switch statements, and a typo like
"Bios"would quietly turn the precheck into a no-op. Defining typed constants for the allowed values keeps controllers and vendor implementations aligned.♻️ Suggested direction
+type ComponentType string + +const ( + ComponentTypeBMC ComponentType = "BMC" + ComponentTypeBIOS ComponentType = "BIOS" +) + // CheckBMCPendingComponentUpgrade checks if there are pending/staged component upgrades for the given component type ("BMC" or "BIOS"). - CheckBMCPendingComponentUpgrade(ctx context.Context, componentType string) (bool, error) + CheckBMCPendingComponentUpgrade(ctx context.Context, componentType ComponentType) (bool, error)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bmc/bmc.go` around lines 124 - 125, The method CheckBMCPendingComponentUpgrade currently accepts a raw string parameter componentType which is error-prone; change this to a typed enum (e.g., declare type ComponentType string and constants like ComponentTypeBMC and ComponentTypeBIOS) and update the function signature to CheckBMCPendingComponentUpgrade(ctx context.Context, componentType ComponentType) (bool, error); update all callers to pass the new constants, update vendor switch/branch logic in any implementations to switch on the ComponentType constants, and add validation in the function/implementations to return an explicit error for unknown ComponentType values instead of silently no-oping.bmc/redfish_local.go (1)
271-273: Make the local pending-check stub configurable.Controllers now have a defer-on-pending branch, but this test double can only return
false, so that path is hard to cover in local/controller tests. AUnitTestMockUpsflag or mock inventory-backed response would make the new behavior testable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bmc/redfish_local.go` around lines 271 - 273, The stub CheckBMCPendingComponentUpgrade on RedfishLocalBMC always returns false; make it configurable by adding a test-only field (e.g., UnitTestPendingUpgrade bool or a PendingUpgradeFunc func(ctx context.Context, id string) (bool, error)) to the RedfishLocalBMC struct and update the CheckBMCPendingComponentUpgrade method to return that field/value when set (falling back to the current false,nil default). Ensure the new field is only used for local/test instances (default remains unchanged) and update any constructors or test setups that create RedfishLocalBMC to allow toggling this flag or injecting the function so controller defer-on-pending paths can be exercised.bmc/redfish_kube.go (1)
459-461: Make the kube stub configurable for tests.Hardcoding
falsehere means the local-test backend can never exercise the new “pending upgrade detected” branch in the BIOS/BMC controllers. Consider plumbing this throughUnitTestMockUpsso tests can cover both defer and proceed paths without vendor-specific fixtures.Based on learnings,
bmc/redfish_kube.gois used for local testing purposes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bmc/redfish_kube.go` around lines 459 - 461, The CheckBMCPendingComponentUpgrade stub should be configurable for tests instead of always returning false; modify RedfishKubeBMC.CheckBMCPendingComponentUpgrade to consult a test hook on the receiver (e.g., a UnitTestMockUps field or method) and return that value (plus nil error) when present, falling back to false,nil when the mock is nil; reference the RedfishKubeBMC struct and the CheckBMCPendingComponentUpgrade method and add/consume a clear boolean field or accessor on UnitTestMockUps (e.g., UnitTestMockUps.BMCPendingUpgrade or UnitTestMockUps.GetBMCPendingUpgrade()) so unit tests can toggle the pending-upgrade path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/biosversion_controller.go`:
- Around line 304-310: The branch handling hasPending from
bmcClient.CheckBMCPendingComponentUpgrade should requeue reconciliation instead
of returning without a retry; locate the CheckBMCPendingComponentUpgrade call
(and the else-if hasPending branch) in biosversion_controller.go and change the
return so the controller requeues (i.e., request a retry/requeue) rather than
returning false, nil; keep the existing log.Info("Pending component upgrade
detected, deferring BIOS upgrade to avoid interruption", "Server", server.Name)
message and only modify the return path to trigger a requeue.
In `@internal/controller/bmcversion_controller.go`:
- Around line 302-309: The current logic treats any error from
bmcClient.CheckBMCPendingComponentUpgrade the same as “not supported” and
proceeds; change it to only ignore the “not supported” case and requeue for
transient/real errors. Concretely, when CheckBMCPendingComponentUpgrade returns
err, check for the specific sentinel (e.g. errors.Is(err,
redfish.ErrNotSupported) or the provider-specific NotSupported error) and in
that case log and continue; for any other err log it and return
ctrl.Result{RequeueAfter: r.ResyncInterval}, nil (so we defer the upgrade). Keep
references to bmcClient.CheckBMCPendingComponentUpgrade, BMC.Name, and the
existing ctrl.Result{RequeueAfter: r.ResyncInterval} in your change.
---
Nitpick comments:
In `@bmc/bmc.go`:
- Around line 124-125: The method CheckBMCPendingComponentUpgrade currently
accepts a raw string parameter componentType which is error-prone; change this
to a typed enum (e.g., declare type ComponentType string and constants like
ComponentTypeBMC and ComponentTypeBIOS) and update the function signature to
CheckBMCPendingComponentUpgrade(ctx context.Context, componentType
ComponentType) (bool, error); update all callers to pass the new constants,
update vendor switch/branch logic in any implementations to switch on the
ComponentType constants, and add validation in the function/implementations to
return an explicit error for unknown ComponentType values instead of silently
no-oping.
In `@bmc/redfish_kube.go`:
- Around line 459-461: The CheckBMCPendingComponentUpgrade stub should be
configurable for tests instead of always returning false; modify
RedfishKubeBMC.CheckBMCPendingComponentUpgrade to consult a test hook on the
receiver (e.g., a UnitTestMockUps field or method) and return that value (plus
nil error) when present, falling back to false,nil when the mock is nil;
reference the RedfishKubeBMC struct and the CheckBMCPendingComponentUpgrade
method and add/consume a clear boolean field or accessor on UnitTestMockUps
(e.g., UnitTestMockUps.BMCPendingUpgrade or
UnitTestMockUps.GetBMCPendingUpgrade()) so unit tests can toggle the
pending-upgrade path.
In `@bmc/redfish_local.go`:
- Around line 271-273: The stub CheckBMCPendingComponentUpgrade on
RedfishLocalBMC always returns false; make it configurable by adding a test-only
field (e.g., UnitTestPendingUpgrade bool or a PendingUpgradeFunc func(ctx
context.Context, id string) (bool, error)) to the RedfishLocalBMC struct and
update the CheckBMCPendingComponentUpgrade method to return that field/value
when set (falling back to the current false,nil default). Ensure the new field
is only used for local/test instances (default remains unchanged) and update any
constructors or test setups that create RedfishLocalBMC to allow toggling this
flag or injecting the function so controller defer-on-pending paths can be
exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 55c3a287-0a22-407a-9d17-c0331749920d
📒 Files selected for processing (11)
.golangci.ymlbmc/bmc.gobmc/oem_helpers.gobmc/redfish.gobmc/redfish_dell.gobmc/redfish_hpe.gobmc/redfish_kube.gobmc/redfish_lenovo.gobmc/redfish_local.gointernal/controller/biosversion_controller.gointernal/controller/bmcversion_controller.go
| // Check for pending component upgrade BEFORE issuing upgrade to avoid interrupting staged firmware | ||
| hasPending, err := bmcClient.CheckBMCPendingComponentUpgrade(ctx, "BMC") | ||
| if err != nil { | ||
| log.V(1).Info("Failed to check pending component upgrade before BMC upgrade, proceeding anyway", "error", err) | ||
| } else if hasPending { | ||
| log.V(1).Info("Pending component upgrade detected, deferring BMC upgrade to avoid interruption", "BMC", BMC.Name) | ||
| return ctrl.Result{RequeueAfter: r.ResyncInterval}, nil | ||
| } |
There was a problem hiding this comment.
Only ignore the “not supported” case here.
Proceeding on every CheckBMCPendingComponentUpgrade error defeats the safety benefit of this guard: a transient Redfish/FirmwareInventory failure can still let us start a BMC upgrade while another component upgrade is already staged. Please distinguish the base “not supported” fallback from real check failures, and requeue on the latter instead of issuing the upgrade.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/bmcversion_controller.go` around lines 302 - 309, The
current logic treats any error from bmcClient.CheckBMCPendingComponentUpgrade
the same as “not supported” and proceeds; change it to only ignore the “not
supported” case and requeue for transient/real errors. Concretely, when
CheckBMCPendingComponentUpgrade returns err, check for the specific sentinel
(e.g. errors.Is(err, redfish.ErrNotSupported) or the provider-specific
NotSupported error) and in that case log and continue; for any other err log it
and return ctrl.Result{RequeueAfter: r.ResyncInterval}, nil (so we defer the
upgrade). Keep references to bmcClient.CheckBMCPendingComponentUpgrade,
BMC.Name, and the existing ctrl.Result{RequeueAfter: r.ResyncInterval} in your
change.
There was a problem hiding this comment.
The "Not supported by this vendor" error is currently a special case. During the recent refactoring of the Redfish files, we didn't introduce a dedicated error for it. Therefore, the controllers follow the existing project pattern and skip errors that a vendor hasn't implemented.
I suggest we maintain this behavior for now to ensure consistency with the other controllers and the current revision. In the future, we could consider an ErrNotSupported sentinel error so that the controllers react correctly. However, since this also affects other Redfish functions, I would like to address this as a separate topic.
There was a problem hiding this comment.
@davidgrun, understood — thanks for the context. Keeping the current "proceed on all errors" pattern makes sense for consistency with the rest of the controllers, given there's no ErrNotSupported sentinel in place yet. Would you like me to open a follow-up issue to track the introduction of an ErrNotSupported sentinel error so that controllers can distinguish transient failures from "not supported by vendor" and react accordingly?
✏️ Learnings added
Learnt from: davidgrun
Repo: ironcore-dev/metal-operator PR: 733
File: internal/controller/bmcversion_controller.go:302-309
Timestamp: 2026-03-11T09:51:11.302Z
Learning: In ironcore-dev/metal-operator, the project does not yet have a dedicated ErrNotSupported sentinel error for unimplemented vendor BMC methods. Controllers (e.g. internal/controller/bmcversion_controller.go, internal/controller/biosversion_controller.go) intentionally proceed/skip on ALL errors from vendor-specific BMC client methods (such as CheckBMCPendingComponentUpgrade) to maintain consistency with the existing pattern. A future improvement to introduce an ErrNotSupported sentinel is planned so controllers can distinguish transient errors from "not supported by vendor" cases and react differently (e.g. requeue on transient errors).
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CR
Repo: ironcore-dev/metal-operator PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-17T13:12:50.124Z
Learning: Applies to internal/controller/**/*_controller.go : Implement idempotent reconciliation logic - safe to run the same reconciliation multiple times without side effects
Learnt from: nagadeesh-nagaraja
Repo: ironcore-dev/metal-operator PR: 695
File: internal/controller/bmcsettings_controller.go:720-730
Timestamp: 2026-03-04T10:32:17.473Z
Learning: In internal/controller/bmcsettings_controller.go and other controllers in the same directory, do not overwrite RetryOfFailedResourceConditionIssued when its Status is metav1.ConditionTrue. Preserve existing reason/message on re-entry and only update metadata if the condition is not yet true. This avoids unnecessary updates and retains historical context.
Learnt from: afritzler
Repo: ironcore-dev/metal-operator PR: 649
File: bmc/redfish_kube.go:380-392
Timestamp: 2026-02-04T16:31:13.180Z
Learning: The `bmc/redfish_kube.go` file in the ironcore-dev/metal-operator repository is used for local testing purposes and does not require vendor-specific workarounds (such as SuperMicro-specific boot mode handling) that are present in the production `bmc/redfish.go` implementation.
</learning>
Learnt from: afritzler
Repo: ironcore-dev/metal-operator PR: 607
File: bmc/redfish.go:205-208
Timestamp: 2026-01-22T09:49:09.470Z
Learning: In the ironcore-dev/metal-operator repository, SuperMicro systems require BootSourceOverrideMode to be unconditionally set to UEFI in the SetPXEBootOnce function (`bmc/redfish.go`), regardless of the current value or whether the mode is reported. This is a vendor-specific requirement that intentionally bypasses the general safeguard for systems that don't report BootSourceOverrideMode.
There was a problem hiding this comment.
Yes, you can do that, then we can discuss it in our team.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
gh issue create \
--repo ironcore-dev/metal-operator \
--title "Introduce ErrNotSupported sentinel error for unimplemented vendor BMC methods" \
--body "## Summary
Controllers such as \`internal/controller/bmcversion_controller.go\` and \`internal/controller/biosversion_controller.go\` call vendor-specific BMC client methods (e.g. \`CheckBMCPendingComponentUpgrade\`) and currently treat **all** errors identically — including the base \`RedfishBaseBMC\` fallback that returns a "not supported by this manufacturer" error. This means transient Redfish/FirmwareInventory failures are silently swallowed and the upgrade proceeds, which is the same behavior as a legitimate "not supported" response.
## Motivation
Introducing a dedicated \`ErrNotSupported\` sentinel error would allow controllers to:
- **Ignore** the "not supported by vendor" case (current behavior, intentional)
- **Requeue** on transient or real check failures instead of blindly proceeding with the upgrade
This improves safety: a temporary FirmwareInventory unavailability would no longer silently bypass the pending-upgrade guard.
## Scope
- Define an \`ErrNotSupported\` sentinel (e.g. in \`bmc/bmc.go\` or a shared errors file)
- Update \`RedfishBaseBMC.CheckBMCPendingComponentUpgrade\` (and similar fallback methods) in \`bmc/redfish.go\` to wrap/return \`ErrNotSupported\`
- Update affected controllers (\`bmcversion_controller.go\`, \`biosversion_controller.go\`, and any future controllers) to use \`errors.Is(err, bmc.ErrNotSupported)\` to distinguish the two cases
- Consider applying the same pattern to other vendor-specific BMC interface methods that currently have a "not supported" fallback
## References
- PR `#733` (where this was first discussed): https://github.com/ironcore-dev/metal-operator/pull/733
- Specific comment: https://github.com/ironcore-dev/metal-operator/pull/733#discussion_r2910355313
/cc `@davidgrun`" \
--assignee davidgrun
Length of output: 2380
🏁 Script executed:
cat > /tmp/issue_body.md << 'EOF'
## Summary
Controllers such as `internal/controller/bmcversion_controller.go` and `internal/controller/biosversion_controller.go` call vendor-specific BMC client methods (e.g. `CheckBMCPendingComponentUpgrade`) and currently treat **all** errors identically — including the base `RedfishBaseBMC` fallback that returns a "not supported by this manufacturer" error. This means transient Redfish/FirmwareInventory failures are silently swallowed and the upgrade proceeds, which is the same behavior as a legitimate "not supported" response.
## Motivation
Introducing a dedicated `ErrNotSupported` sentinel error would allow controllers to:
- **Ignore** the "not supported by vendor" case (current behavior, intentional)
- **Requeue** on transient or real check failures instead of blindly proceeding with the upgrade
This improves safety: a temporary FirmwareInventory unavailability would no longer silently bypass the pending-upgrade guard.
## Scope
- Define an `ErrNotSupported` sentinel (e.g. in `bmc/bmc.go` or a shared errors file)
- Update `RedfishBaseBMC.CheckBMCPendingComponentUpgrade` (and similar fallback methods) in `bmc/redfish.go` to wrap/return `ErrNotSupported`
- Update affected controllers (`bmcversion_controller.go`, `biosversion_controller.go`, and any future controllers) to use `errors.Is(err, bmc.ErrNotSupported)` to distinguish the two cases
- Consider applying the same pattern to other vendor-specific BMC interface methods that currently have a "not supported" fallback
## References
- PR `#733` (where this was first discussed): https://github.com/ironcore-dev/metal-operator/pull/733
- Specific comment: https://github.com/ironcore-dev/metal-operator/pull/733#discussion_r2910355313
/cc `@davidgrun`
EOF
gh issue create \
--repo ironcore-dev/metal-operator \
--title "Introduce ErrNotSupported sentinel error for unimplemented vendor BMC methods" \
--body-file /tmp/issue_body.md \
--assignee davidgrun
Length of output: 57
@davidgrun, the follow-up issue has been created and assigned to you: #735
It covers the full scope of the proposed improvement:
- Defining an
ErrNotSupportedsentinel in thebmcpackage - Updating
RedfishBaseBMCfallback methods to wrap/return it - Updating the affected controllers to use
errors.Is(err, bmc.ErrNotSupported)to distinguish "not supported by vendor" from transient failures - Considering the same pattern for other vendor-specific BMC interface methods
Feel free to discuss and prioritize it with your team! 🐇
🧠 Learnings used
Learnt from: davidgrun
Repo: ironcore-dev/metal-operator PR: 733
File: internal/controller/bmcversion_controller.go:302-309
Timestamp: 2026-03-11T09:51:11.302Z
Learning: In internal/controller/*.go, implement and utilize a dedicated ErrNotSupported sentinel error to distinguish vendor-not-supported scenarios from transient errors. Update controller logic (e.g., bmcversion_controller.go, biosversion_controller.go) to requeue or handle differently when ErrNotSupported is returned, while preserving current behavior for transient errors. This promotes clearer error handling and future-proofing for vendor-specific BMC methods.
Learnt from: CR
Repo: ironcore-dev/metal-operator PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-17T13:12:50.124Z
Learning: Applies to internal/controller/**/*_controller.go : Implement idempotent reconciliation logic - safe to run the same reconciliation multiple times without side effects
Learnt from: nagadeesh-nagaraja
Repo: ironcore-dev/metal-operator PR: 695
File: internal/controller/bmcsettings_controller.go:720-730
Timestamp: 2026-03-04T10:32:17.473Z
Learning: In internal/controller/bmcsettings_controller.go and other controllers in the same directory, do not overwrite RetryOfFailedResourceConditionIssued when its Status is metav1.ConditionTrue. Preserve existing reason/message on re-entry and only update metadata if the condition is not yet true. This avoids unnecessary updates and retains historical context.
Learnt from: afritzler
Repo: ironcore-dev/metal-operator PR: 649
File: bmc/redfish_kube.go:380-392
Timestamp: 2026-02-04T16:31:13.180Z
Learning: The `bmc/redfish_kube.go` file in the ironcore-dev/metal-operator repository is used for local testing purposes and does not require vendor-specific workarounds (such as SuperMicro-specific boot mode handling) that are present in the production `bmc/redfish.go` implementation.
</learning>
Learnt from: afritzler
Repo: ironcore-dev/metal-operator PR: 607
File: bmc/redfish.go:205-208
Timestamp: 2026-01-22T09:49:09.470Z
Learning: In the ironcore-dev/metal-operator repository, SuperMicro systems require BootSourceOverrideMode to be unconditionally set to UEFI in the SetPXEBootOnce function (`bmc/redfish.go`), regardless of the current value or whether the mode is reported. This is a vendor-specific requirement that intentionally bypasses the general safeguard for systems that don't report BootSourceOverrideMode.
Summary by CodeRabbit