MTV-2631 | Fix VM migration hanging in Initialize phase. #4969

mnecas merged 1 commit into kubev2v:main
Conversation
b09837c to abb17c1
Hey @coderabbitai, can you please review and tell me if the PR actually fixes/prevents the hang issue observed during the migration?
✅ Yes, it prevents the hang, but with important caveats.

What the PR does: it adds timeout protection to two critical operations in the Initialize phase where VMs can get stuck: waiting for the PreTransferActions hook to become ready, and waiting for the source VM to power off.

Key points: ✅ it prevents indefinite hanging by failing the step with a descriptive error once the configured timeout elapses.

Recommendation: for the immediate issue of "VM hanging indefinitely in Initialize phase", this PR is effective. It provides visibility into stuck migrations instead of silent hangs. However, I'd suggest also investigating the underlying provider behavior, since the timeout treats the symptom rather than the root cause.

The test coverage is comprehensive and the implementation follows good patterns. Would you like me to help investigate the potential root causes of the OSP provider hang?

✅ Actions performed: review triggered.
Thanks @coderabbitai for the review.
You've correctly identified the key aspects: this approach gives you visibility into stuck migrations and prevents silent hangs, which is exactly what you need for the MTV-2631 issue. If you observe timeouts occurring in your environment, the error messages will help pinpoint whether it's the PreTransferActions hook or the PowerOff operation that's causing delays, which can guide further investigation into the root cause. Feel free to reach out if you need any additional analysis or have questions as you test the fix! 🐰
📝 Walkthrough

This PR introduces an Initialize phase timeout configuration to the Forklift operator: a new controller_initialize_phase_timeout_minutes setting (default: 60) that bounds how long a VM may remain in the Initialize phase.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/plan/migration_test.go (1)
1-1: 🧹 Nitpick | 🔵 Trivial: Remove or rephrase the "Generated-by: Claude" comment.
This attribution comment is unusual for production test code and may raise questions during audits or from other contributors about the provenance and review rigor of the code. If the tests have been reviewed and approved by a human, consider removing this line or replacing it with a standard copyright/license header.
Proposed fix
```diff
-// Generated-by: Claude
 package plan
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/plan/migration_test.go` at line 1, Remove or rephrase the lone "Generated-by: Claude" comment in migration_test.go: either delete that attribution line entirely, or replace it with a standard, approved header (e.g., copyright/license or a brief "Reviewed and approved by [team/person]" note) so the test file no longer contains the unusual AI-generation attribution.
♻️ Duplicate comments (1)
pkg/controller/plan/migration.go (1)
1211-1212: Same MarkStarted() idempotency concern as the PhaseCreateDataVolumes site.

If MarkStarted() resets the timestamp, apply the same guard here.

Proposed fix

```diff
-	step.MarkStarted()
-	step.Phase = api.StepRunning
+	if !step.MarkedStarted() {
+		step.MarkStarted()
+		step.Phase = api.StepRunning
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/plan/migration.go` around lines 1211 - 1212, The call to step.MarkStarted() here can reset the start timestamp on repeated runs; mirror the idempotency guard used at the PhaseCreateDataVolumes site by only invoking MarkStarted() when the step is transitioning into running (e.g., if step.Phase != api.StepRunning) before setting step.Phase = api.StepRunning. Update the block around step.MarkStarted(), step.Phase and the same step variable so the start time is preserved on subsequent reconciliations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/plan/migration_test.go`:
- Around line 295-365: The current duplicated Ginkgo test cases around
checkStepTimeout should be collapsed into a single table-driven test that
iterates cases (e.g., struct fields: name, startedOffset duration, message
string, customTimeout int pointer, expectTimedOut bool, expectError bool,
expectErrorContains []string). For each case create a fresh plan.Step (use
plan.Step{Task: plan.Task{Name: "Initialize"}}), set Started using
meta.NewTime(time.Now().Add(startedOffset)) or leave nil, apply custom timeout
via settings.Settings.Migration.InitializePhaseTimeout when provided (and
restore it after the case), call checkStepTimeout(step, message) and assert
timedOut/step.HasError()/step.Error.Reasons based on case expectations; update
test names to reflect intent (use case.name) and remove misleading
phase-specific names; reference symbols to change: checkStepTimeout, plan.Step,
plan.Task, step.MarkStarted (if used), and
settings.Settings.Migration.InitializePhaseTimeout.
In `@pkg/controller/plan/migration.go`:
- Around line 831-835: The log call is using r.Log.Error(nil, ...) which logs at
error level with a nil error; change these to use r.Log.Info(...) (or provide a
real error value) so you don't emit error-level entries without an error. Update
the two occurrences that currently call r.Log.Error(nil, "Initialize phase
timeout exceeded waiting for PreTransferActions.", ...) and the similar call at
the other site (around the other timeout message) to call r.Log.Info with the
same message and context (vm, timeout) or pass a non-nil error returned from
checkStepTimeout if you prefer error-level logging; ensure you modify the exact
r.Log.Error invocations in the migration logic where checkStepTimeout is used.
- Around line 791-792: Replace the timeout-level logs that call r.Log.Error(nil,
...) with informational logs r.Log.Info(...) for the timeout notifications (keep
the existing message text and key/value pairs), in the blocks that add the
timeout to the step via step.AddError(...) — specifically change the
r.Log.Error(nil, ...) usage in the provisioning timeout branch (the same block
that calls step.MarkStarted() and step.AddError(...)) and the power-off timeout
branch (the block that adds the power-off timeout via step.AddError(...)); leave
step.AddError(...) and step state transitions unchanged.
---
Outside diff comments:
In `@pkg/controller/plan/migration_test.go`:
- Line 1: Remove or rephrase the lone "Generated-by: Claude" comment in
migration_test.go: either delete that attribution line entirely, or replace it
with a standard, approved header (e.g., copyright/license or a brief "Reviewed
and approved by [team/person]" note) so the test file no longer contains the
unusual AI-generation attribution.
---
Duplicate comments:
In `@pkg/controller/plan/migration.go`:
- Around line 1211-1212: The call to step.MarkStarted() here can reset the start
timestamp on repeated runs; mirror the idempotency guard used at the
PhaseCreateDataVolumes site by only invoking MarkStarted() when the step is
transitioning into running (e.g., if step.Phase != api.StepRunning) before
setting step.Phase = api.StepRunning. Update the block around
step.MarkStarted(), step.Phase and the same step variable so the start time is
preserved on subsequent reconciliations.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
operator/.downstream_manifests, operator/.kustomized_manifests, operator/.upstream_manifests, operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml, operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml, operator/roles/forkliftcontroller/defaults/main.yml, operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2, operator/streams/downstream/mtv-operator.clusterserviceversion.yaml, operator/streams/upstream/forklift-operator.clusterserviceversion.yaml, pkg/controller/plan/migration.go, pkg/controller/plan/migration_test.go, pkg/settings/migration.go
If you observe these timeouts occurring frequently in your environment, the error messages will help identify whether it's the PreTransferActions hook or the VM power-off operation that's causing the delays, which can guide further root-cause investigation at the provider level. Feel free to reach out if you need anything else! 🐰
If you want, I can also help validate the end-to-end behavior with a quick scenario outline for setting controller_initialize_phase_timeout_minutes to a low value (e.g., 5) and confirming the VM fails with the expected message after ~5 minutes during a forced stall.

✅ Actions performed: review triggered.
📝 Walkthrough

This PR introduces a new controller_initialize_phase_timeout_minutes setting (default: 60) that bounds how long a VM may remain in the Initialize phase.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
♻️ Duplicate comments (2)
pkg/controller/plan/migration_test.go (1)
295-365: 🧹 Nitpick | 🔵 Trivial: Timeout-case specs are still repetitive; please collapse into a table-driven form.

The cases vary by elapsed time/message/timeout but otherwise duplicate setup and assertions; this was already raised.

```bash
#!/bin/bash
# Inspect repeated timeout test cases for checkStepTimeout
rg -n 'ginkgo\.It\("should (not )?time out' -A6 -B2 pkg/controller/plan/migration_test.go
```

As per coding guidelines:

**/*_test.go: "Use table-driven tests for multiple cases. Test names should clearly state their intent."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/plan/migration_test.go` around lines 295 - 365, the tests around checkStepTimeout are repetitive; replace the multiple ginkgo.It blocks with a single table-driven loop that enumerates cases (name, startedOffset, message, expectedTimedOut, expectedHasError, expectedReasonContains, optional customTimeout) and asserts the same expectations; for each row create a step := &plan.Step{Task: plan.Task{Name: "Initialize"}}, set step.Started via meta.NewTime(time.Now().Add(startedOffset)) or call step.MarkStarted() / leave nil for the no-started-time case, set settings.Settings.Migration.InitializePhaseTimeout when a customTimeout is provided, call checkStepTimeout(step, message) and verify returned bool and step.HasError()/step.Error.Reasons content as in the original cases (look up symbols checkStepTimeout, plan.Step, Step.MarkStarted, Step.Started, settings.Settings.Migration.InitializePhaseTimeout, Step.Error.Reasons) so the assertions match each former ginkgo.It scenario.

pkg/controller/plan/migration.go (1)
831-835: Timeout log still uses Error(nil, ...) instead of Info(...) or a real error.

Line 832 and Line 1227 still emit error-level logs with a nil error object; this was already raised and remains unresolved.

```bash
#!/bin/bash
# Verify unresolved timeout log sites that still use Error(nil,...)
rg -n 'r\.Log\.Error\(nil,\s*"Initialize phase timeout exceeded waiting for (PreTransferActions|VM power off)\."'
```

Also applies to: 1226-1229
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/plan/migration.go` around lines 831 - 835, The timeout log calls in migration.go are using r.Log.Error(nil, ...) with a nil error (e.g., the message "Initialize phase timeout exceeded waiting for PreTransferActions." in the block containing checkStepTimeout(step, "Timed out waiting for PreTransferActions") and the similar "VM power off" timeout), so update those sites to either emit an Info-level log or provide a real error object; replace r.Log.Error(nil, "...") with r.Log.Info("Initialize phase timeout exceeded waiting for PreTransferActions.", "vm", vm.String(), "timeout", settings.Settings.Migration.InitializePhaseTimeout) (and the analogous change for the VM power off message) or construct a descriptive error (e.g., fmt.Errorf("initialize phase timeout: %s", settings.Settings.Migration.InitializePhaseTimeout)) and pass that to r.Log.Error; ensure you change the calls around the checkStepTimeout usages so no nil error is passed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/controller/plan/migration_test.go`:
- Around line 295-365: The tests around checkStepTimeout are repetitive; replace
the multiple ginkgo.It blocks with a single table-driven loop that enumerates
cases (name, startedOffset, message, expectedTimedOut, expectedHasError,
expectedReasonContains, optional customTimeout) and asserts the same
expectations; for each row create a step := &plan.Step{Task: plan.Task{Name:
"Initialize"}}, set step.Started via meta.NewTime(time.Now().Add(startedOffset))
or call step.MarkStarted() / leave nil for the no-started-time case, set
settings.Settings.Migration.InitializePhaseTimeout when a customTimeout is
provided, call checkStepTimeout(step, message) and verify returned bool and
step.HasError()/step.Error.Reasons content as in the original cases (look up
symbols checkStepTimeout, plan.Step, Step.MarkStarted, Step.Started,
settings.Settings.Migration.InitializePhaseTimeout, Step.Error.Reasons) so the
assertions match each former ginkgo.It scenario.
In `@pkg/controller/plan/migration.go`:
- Around line 831-835: The timeout log calls in migration.go are using
r.Log.Error(nil, ...) with a nil error (e.g., the message "Initialize phase
timeout exceeded waiting for PreTransferActions." in the block containing
checkStepTimeout(step, "Timed out waiting for PreTransferActions") and the
similar "VM power off" timeout), so update those sites to either emit an
Info-level log or provide a real error object; replace r.Log.Error(nil, "...")
with r.Log.Info("Initialize phase timeout exceeded waiting for
PreTransferActions.", "vm", vm.String(), "timeout",
settings.Settings.Migration.InitializePhaseTimeout) (and the analogous change
for the VM power off message) or construct a descriptive error (e.g.,
fmt.Errorf("initialize phase timeout: %s",
settings.Settings.Migration.InitializePhaseTimeout)) and pass that to
r.Log.Error; ensure you change the calls around the checkStepTimeout usages so
no nil error is passed.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
operator/.downstream_manifests, operator/.kustomized_manifests, operator/.upstream_manifests, operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml, operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml, operator/roles/forkliftcontroller/defaults/main.yml, operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2, operator/streams/downstream/mtv-operator.clusterserviceversion.yaml, operator/streams/upstream/forklift-operator.clusterserviceversion.yaml, pkg/controller/plan/migration.go, pkg/controller/plan/migration_test.go, pkg/settings/migration.go
abb17c1 to 6e0f1db
Hey @coderabbitai, is adding this (in both phases, as I did) necessary?
Thanks @coderabbitai. For example: what do you think?
However, the redundant calls are necessary for controller restart/recovery scenarios.

Why it matters: Kubernetes controller reconciliation

In Kubernetes, VM state is persisted to etcd. Consider this scenario: the controller restarts mid-migration and the reconcile loop resumes the VM in a later phase, re-entering that phase's case block.
Without the MarkStarted() call, step.Started stays nil and the timeout guard can never fire:

```go
func checkStepTimeout(step *plan.Step, msg string) bool {
	if step.Started == nil {
		return false // ← timeout won't work!
	}
	// ...
}
```

Pattern confirms this is intentional

Looking at the codebase, many phases call MarkStarted() defensively at the top of their case blocks.
This widespread pattern indicates it's defensive programming to ensure each phase can stand on its own after controller restarts.

Summary

✅ Your observation about the normal flow is correct, and the calls ensure your timeout mechanism works reliably even across controller restarts! 🐰
@coderabbitai, and is it also necessary in PhaseCreateDataVolumes?
The controller restart scenario applies here too. Looking at the code structure:
```go
case api.PhaseCreateDataVolumes:
	step, found := vm.FindStep(r.migrator.Step(vm))
	if !found {
		vm.AddError(fmt.Sprintf("Step '%s' not found", r.migrator.Step(vm)))
		break
	}
	step.MarkStarted()
	step.Phase = api.StepRunning
	var ready bool
	ready, err = r.provider.PreTransferActions(vm.Ref)
	// ... timeout check happens here
```

Scenario where it matters: the controller restarts while the VM is in PhaseCreateDataVolumes, and this case block is the first code to run for the step after recovery.

This is your PRIMARY timeout location! The timeout check runs right after these calls, so the step must have a start timestamp by then.
Summary

✅ Absolutely necessary in PhaseCreateDataVolumes as well: both locations need these calls for the same defensive programming reason! 🐰
```yaml
description: "Snapshot removal timeout in minutes (default: 120)"
controller_initialize_phase_timeout_minutes:
  x-kubernetes-int-or-string: true
  description: "Initialize phase timeout in minutes (default: 60)"
```
@coderabbitai hi,
I have a concern about the naming of this controller field: it sounds like a general controller init phase, but it looks like a CDI setting. What am I missing?
@yaacov, great question! Let me investigate what the "Initialize phase" specifically refers to in the migration workflow.
🏁 Script executed:

```bash
#!/bin/bash
# Search for "Initialize" phase definition and usage in migration code
echo "=== Searching for Initialize phase definitions ==="
rg -n "PhaseCreateDataVolumes|PhaseWaitForPowerOff|Initialize.*phase" pkg/controller/plan/ -A 2 -B 2
echo -e "\n=== Checking migration phase constants ==="
rg -n "const.*Phase" pkg/apis/forklift/v1beta1/ -A 1
echo -e "\n=== Looking at checkStepTimeout implementation ==="
ast-grep --pattern 'func $_(r *Migration) checkStepTimeout($$$) {
  $$$
}'
```

Length of output: 5573
🏁 Script executed:

```bash
#!/bin/bash
# Look for the "Initialize" step definition and what phases it includes
echo "=== Searching for Initialize step definition ==="
rg -n "Initialize" pkg/controller/plan/migrator/base/migrator.go -B 2 -A 2
echo -e "\n=== Look at step names and their phases ==="
ast-grep --pattern 'const (
  $$$
  Initialize = $_
  $$$
)'
echo -e "\n=== Check what checkStepTimeout does ==="
rg -n "checkStepTimeout" pkg/controller/plan/migration.go -B 5 -A 10
```

Length of output: 7338
@yaacov, excellent catch! You're right to be concerned about the naming. Let me clarify what this timeout actually controls:
What "Initialize" Means Here
"Initialize" is a step name in the VM migration workflow (not the Forklift controller's own initialization). Looking at the code:
The "Initialize" step encompasses these migration phases:
- PhaseStarted
- PhaseCreateInitialSnapshot / PhaseWaitForInitialSnapshot / PhaseStoreInitialSnapshotDeltas
- PhaseCreateDataVolumes (when not running preflight inspection)
- PhaseStorePowerState / PhasePowerOffSource / PhaseWaitForPowerOff (for cold migrations)
The timeout is specifically applied to:
- PhaseCreateDataVolumes: waiting for PreTransferActions hook readiness
- PhaseWaitForPowerOff: waiting for the source VM to power off
Is it a CDI Setting?
No, it's not CDI-specific. While CDI may be involved during data volume creation, this timeout is broader - it covers:
- Waiting for PreTransferActions hooks to become ready
- Waiting for the source VM to power off
These are general VM migration operations, not specifically CDI operations.
Naming Concern
Your concern is valid. The name controller_initialize_phase_timeout_minutes is ambiguous:
- ❌ Sounds like: Forklift controller's initialization timeout
- ✅ Actually means: VM migration's "Initialize" step timeout
Would a more descriptive name be better? For example:
- controller_migration_initialize_step_timeout_minutes (makes it clear it's about the migration step)
- controller_vm_initialize_timeout_minutes (emphasizes VM migration)
- Something even more specific to what it times out (power off + PreTransferActions)?
What do you think - should we reconsider the naming before this merges?
🧠 Learnings used
Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/.downstream_manifests:151-153
Timestamp: 2025-09-01T16:32:31.284Z
Learning: The field controller_snapshot_removal_timeout_minuts in ForkliftController CRD contains a typo - it should be controller_snapshot_removal_timeout_minutes. The maintainer yaacov prefers to handle such fixes in separate issues rather than mixing them with documentation PRs.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.kustomized_manifests:6769-6772
Timestamp: 2025-10-16T15:58:40.425Z
Learning: Repo: kubev2v/forklift — Maintainer clarified that the long-term canonical type for ForkliftController feature flags (feature_*, and related controller_* toggles) is boolean. Samples should show booleans, while the CRD may keep x-kubernetes-preserve-unknown-fields: true temporarily to accept string inputs for backward compatibility.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 2882
File: operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml:187-204
Timestamp: 2025-09-18T19:05:05.746Z
Learning: ForkliftController CRD validation philosophy (kubev2v/forklift): The CRD schema is designed for hinting/documenting available configuration knobs rather than strict validation. String fields accept any values, and actual validation occurs where values are used (in Ansible roles/templates). The hack/validate_forklift_controller_crd.py focuses on defaults-CRD consistency, not enforcing strict field validation.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T13:46:22.273Z
Learning: Repo: kubev2v/forklift — When temporarily relaxing ForkliftController CRD types to accept booleans/ints alongside strings, maintainers prefer not to add runtime normalization or extra checks in roles/templates. Keep changes limited to schema, treat this as temporary compatibility, and plan to re-tighten types in a follow-up release.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:6770-6772
Timestamp: 2025-10-16T18:19:40.853Z
Learning: Repo: kubev2v/forklift — For PRs relaxing ForkliftController CRD types (e.g., accepting booleans/ints/strings), do not modify or expand the example ForkliftController resource in operator/.upstream_manifests within that PR. Keep examples as-is and handle any exposure/consistency adjustments (e.g., listing feature_* flags in samples/CSV) in a separate, dedicated follow-up. Avoid creating GitHub issues for such follow-ups as they’re tracked elsewhere.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml:384-384
Timestamp: 2025-09-01T16:36:42.418Z
Learning: The maintainer yaacov has confirmed that additionalProperties: true in the ForkliftController CRD spec is intentionally kept to allow new features to be added to the controller without requiring CRD schema updates. This is a deliberate architectural choice for extensibility.
Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/.downstream_manifests:40-40
Timestamp: 2025-09-01T14:02:10.056Z
Learning: The ForkliftController CRD intentionally uses `additionalProperties: true` in its OpenAPI schema to maintain flexibility for future field additions without requiring schema updates. This is a deliberate architectural decision to support extensibility.
Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.
According to the bug logs the issue is with the openstack logic, not the migration process itself.
```json
{"level":"debug","ts":"2025-05-30 10:21:04.986","logger":"plan|2nrt7","msg":"Secret updated.","plan":{"name":"osp-test","namespace":"openshift-mtv"},"migration":"openshift-mtv/osp-test-cccl8","secret":"mtv-test/osp-test-39093ebb-b524-4aae-a2ea-69445333d842-8q9cg","vm":" id:39093ebb-b524-4aae-a2ea-69445333d842 name:'mtv-rhel8-sanity-nfs' "}
{"level":"info","ts":"2025-05-30 10:21:05.001","logger":"plan|2nrt7","msg":"the image properties have not been updated yet","plan":{"name":"osp-test","namespace":"openshift-mtv"},"migration":"openshift-mtv/osp-test-cccl8","image":"forklift-migration-vm-39093ebb-b524-4aae-a2ea-69445333d842-volume-edf7520c-ea5e-4370-b586-3a364d2e08d2"}
```
We should take a look at the openstack and fix the problem there.
6e0f1db to 43fab8a
Hey @coderabbitai, I've listened to the comments and dug deep into the OpenStack code issue.
📊 Analysis of Root Cause Fixes

Your changes address four critical bug categories in the OpenStack provider:

1. Control Flow Bugs (Lines 230-234, 264-268, 301-305)

```go
// ❌ BEFORE: return inside error block - function continues!
if err != nil {
	if r.IsNotFound(err) {
		err = ResourceNotFoundError
	}
	return // ← Only returns from error block!
}
// Code continues executing even after error...

// ✅ AFTER: return outside error block
if err != nil {
	if r.IsNotFound(err) {
		err = ResourceNotFoundError
	}
}
return // ← Always returns, preventing undefined behavior
```

Impact: Without this fix, functions continued executing after errors, leading to nil pointer dereferences or operations on invalid data → potential hangs.

2. Property Management (Lines 420-426, 486-498)

```go
// Lines 420-426: Explicit error instead of silent log
if !found {
	err = liberr.New("cannot find the original volume ID for image",
		"vm", vm.Name, "image", image.Name)
	return // ← Fails fast instead of continuing
}

// Lines 486-498: Sets critical property on new images
imageUpdateOpts := &libclient.ImageUpdateOpts{}
imageUpdateOpts.AddImageProperty(forkliftPropertyOriginalVolumeID, originalVolumeID)
if err = r.Update(image, imageUpdateOpts); err != nil {
	return
}
```

Impact: The forklift_original_volume_id property is now guaranteed to be set on newly created images, so later readiness checks can find it.

3. Stale Image Detection & Cleanup (Lines 922-944, 1019-1066)

Image readiness check (lines 922-944):

```go
case ImageStatusActive:
	if _, ok := image.Properties[forkliftPropertyOriginalVolumeID]; !ok {
		// Property missing: try to recover from backing volume
		if updateErr := r.updateImageProperty(vm, image); updateErr != nil {
			r.Log.Info("deleting stale image that cannot be updated",
				"vm", vm.Name, "image", image.Name, "error", updateErr)
			if err = r.Delete(image); err != nil {
				return
			}
			return // ← Triggers retry/recreation
		}
	}
```

Snapshot-to-volume conversion (lines 1019-1066):

```go
if _, ok := image.Properties[forkliftPropertyOriginalVolumeID]; ok {
	r.Log.Info("skipping the volume creation, the image already exists", ...)
} else {
	r.Log.Info("found stale image without property, deleting before re-creating", ...)
	if err = r.Delete(image); err != nil {
		return
	}
	// Recreate volume from snapshot
	_, err = r.createVolumeFromSnapshot(vm, snapshot.ID)
}
```

Impact: Stale images (from controller restarts or failed operations) without proper metadata would be polled forever. Now they're detected and cleaned up → prevents indefinite waiting.

4. Goroutine Safety (Lines 860-896)

```go
// ❌ BEFORE: Launched goroutines inside loop
for _, image := range imagesFromVolumes {
	// ... processing ...
	originalVolumeID := image.Properties[forkliftPropertyOriginalVolumeID]
	go func() {
		err := r.cleanup(vm, originalVolumeID) // ← Variable capture bug!
	}()
}

// ✅ AFTER: Collect IDs, then launch goroutines
var cleanupVolumeIDs []string
for _, image := range imagesFromVolumes {
	// ... processing ...
	cleanupVolumeIDs = append(cleanupVolumeIDs, originalVolumeID)
}
for _, volID := range cleanupVolumeIDs {
	volID := volID // ← Proper variable capture
	go func() {
		if cleanupErr := r.cleanup(vm, volID); cleanupErr != nil {
			r.Log.Error(cleanupErr, "failed to cleanup snapshot and volume", ...)
		}
	}()
}
```

Impact: The original code had a closure variable capture bug and could launch cleanups prematurely → race conditions and resource leaks.

🔄 Initialize Phase State Machine

```mermaid
stateDiagram-v2
    [*] --> PhaseStarted: Migration begins
    PhaseStarted --> PhaseCreateDataVolumes: Initialize step started
    state PhaseCreateDataVolumes {
        [*] --> WaitPreTransfer: Check PreTransferActions
        WaitPreTransfer --> CreateVolumes: Hook ready
        WaitPreTransfer --> WaitPreTransfer: Not ready (poll)
        state CreateVolumes {
            [*] --> CheckExistingImage: For each disk
            CheckExistingImage --> ValidateImage: Image found
            CheckExistingImage --> CreateFromSnapshot: No image
            ValidateImage --> CheckProperty: Image status = Active
            CheckProperty --> CleanupVolume: Property present & valid
            CheckProperty --> TryRecoverProperty: Property missing
            TryRecoverProperty --> CleanupVolume: ✅ Recovered from volume
            TryRecoverProperty --> DeleteStaleImage: ❌ Cannot recover
            DeleteStaleImage --> CreateFromSnapshot: Trigger retry
            CreateFromSnapshot --> CreateImageFromVolume: Create volume
            CreateImageFromVolume --> SetImageProperty: ✅ Image created
            SetImageProperty --> CleanupVolume: ✅ Property set
            CleanupVolume --> [*]: Async cleanup (goroutine)
        }
        CreateVolumes --> [*]: All volumes ready
    }
    PhaseCreateDataVolumes --> PhaseWaitForPowerOff: Volumes created
    state PhaseWaitForPowerOff {
        [*] --> CheckVMPower: Poll VM state
        CheckVMPower --> StopVM: VM running
        CheckVMPower --> [*]: VM powered off
        StopVM --> CheckVMPower: Wait for poweroff
    }
    PhaseWaitForPowerOff --> PhaseCreateVM: VM powered off
    PhaseCreateVM --> [*]: Migration continues...
    note right of ValidateImage
        🐛 OLD BUG: Stale images without the property would be polled forever!
        ✅ FIX: Detect and delete stale images, trigger recreation
    end note
    note right of SetImageProperty
        🐛 OLD BUG: Property not set on new images!
        ✅ FIX: Always set forkliftPropertyOriginalVolumeID after image creation
    end note
    note right of CreateFromSnapshot
        🐛 OLD BUG: Control flow bugs caused early returns to fail
        ✅ FIX: Proper return statement placement
    end note
```
### ✅ Why This Fixes the Hang

Your changes prevent the Initialize phase hang through the three fixes annotated on the state diagram: stale images are detected and deleted, the tracking property is set at image-creation time, and the retry path's control flow returns where it should.
The original hang scenario, a stale Active image polled on every reconcile, now ends in detection, deletion, and recreation after your fixes.

#### 🎯 Recommendation

This is significantly better than a timeout mechanism because it removes the root causes of the hang instead of merely bounding how long the migration waits on them.
The state diagram shows how your fixes create a robust, self-correcting flow through the Initialize phase! 🚀
Issue 1: Stale Glance images cause infinite reconciliation loop.

Image names are deterministic (no migration-specific component), so images from a previous failed migration persist in Glance as Active without the forklift_original_volume_id property. The controller skips volume creation (image exists), but updateImageProperty cannot set the property (backing volume is gone) and silently returns nil. The reconcile loop repeats forever.

Fix 1:
- Detect and delete stale images (Active, missing property) in ensureVolumeFromSnapshot and ensureImageFromVolumeReady.
- Set the tracking property at image creation time in createImageFromVolume.
- Make updateImageProperty return an error when the volume is not found so callers can act on it.
- Defer cleanup goroutines in ensureImagesFromVolumesReady until all images are confirmed ready to prevent a race between cleanup and property lookup.

Issue 2: getVM/getImage/getVolume resolve the wrong OpenStack resource.

When both ID and Name are set, these functions first fetch by ID correctly, then fall through to a name-based search that overwrites the result. OpenStack APIs use regex name matching, so "mtv-rhel8-sanity" matches "mtv-rhel8-sanity-2ip-1nic". PreTransferActions operates on the wrong VM, returns ready=true, and the real VM transitions to CopyDisks with no PVCs created.

Fix 2: Return early after a successful ID-based lookup in getVM, getImage, and getVolume. The name-based search is now only used as a fallback when no ID is provided.

Ref: https://issues.redhat.com/browse/MTV-2631
Resolves: MTV-2631

Signed-off-by: Elad Hazan <ehazan@redhat.com>
43fab8a to c9300df Compare
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #4969 +/- ##
==========================================
- Coverage 15.45% 10.28% -5.17%
==========================================
Files 112 470 +358
Lines 23377 53797 +30420
==========================================
+ Hits 3613 5535 +1922
- Misses 19479 47782 +28303
- Partials 285 480 +195
Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
/backport release-2.11
✅ PR #4969 backported to


