
MTV-2631 | Fix VM migration hanging in Initialize phase. #4969

Merged
mnecas merged 1 commit into kubev2v:main from Hazanel:initialize-phase-timeout
Mar 9, 2026

Conversation

Contributor

@Hazanel Hazanel commented Feb 25, 2026

Issue 1:
Stale Glance images cause infinite reconciliation loop.
Image names are deterministic (no migration-specific component), so
images from a previous failed migration persist in Glance as Active
without the forklift_original_volume_id property. The controller
skips volume creation (image exists), but updateImageProperty cannot
set the property (backing volume is gone) and silently returns nil.
The reconcile loop repeats forever.

Fix 1:

  • Detect and delete stale images (Active, missing property) in
    ensureVolumeFromSnapshot and ensureImageFromVolumeReady.
  • Set the tracking property at image creation time in createImageFromVolume.
  • Make updateImageProperty return an error when the volume is not found
    so callers can act on it.
  • Defer cleanup goroutines in ensureImagesFromVolumesReady until all images are confirmed ready
    to prevent a race between cleanup and property lookup.
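
As a rough illustration of the stale-image check (the image type, isStaleImage helper, and status string below are illustrative stand-ins, not the actual forklift or gophercloud types; only the forklift_original_volume_id property name comes from the description above):

```go
package main

import "fmt"

// image models only the Glance fields the stale-image check needs.
type image struct {
	Name       string
	Status     string
	Properties map[string]string
}

// isStaleImage flags an image left behind by a previous failed migration:
// it is Active but never received the tracking property, so the backing
// volume is gone and setting the property can never succeed. A caller
// like ensureVolumeFromSnapshot would delete such an image and recreate it.
func isStaleImage(img *image) bool {
	if img == nil || img.Status != "active" {
		return false
	}
	_, tracked := img.Properties["forklift_original_volume_id"]
	return !tracked
}

func main() {
	stale := &image{Name: "mtv-rhel8-sanity-disk-0", Status: "active"}
	fresh := &image{
		Name:       "mtv-rhel8-sanity-disk-0",
		Status:     "active",
		Properties: map[string]string{"forklift_original_volume_id": "vol-123"},
	}
	fmt.Println(isStaleImage(stale), isStaleImage(fresh)) // true false
}
```

Because the image name alone cannot distinguish a fresh image from a leftover, the tracking property doubles as the liveness marker, which is why the fix also sets it at creation time.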

Issue 2:
getVM/getImage/getVolume resolve the wrong OpenStack resource.
When both ID and Name are set, these functions first fetch by ID
correctly, then fall through to a name-based search that overwrites
the result.
OpenStack APIs use regex name matching, so "mtv-rhel8-sanity"
matches "mtv-rhel8-sanity-2ip-1nic".
PreTransferActions operates on the wrong VM, returns ready=true,
and the real VM transitions to CopyDisks with no PVCs created.

Fix 2:
Return early after a successful ID-based lookup in getVM,
getImage, and getVolume.
The name-based search is now only used as a fallback when no ID is provided.
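
The early-return fix can be sketched in isolation (vmRecord, inventory, and this getVM are illustrative stand-ins for the real provider types; the regex match mimics OpenStack's server-side name filtering):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// vmRecord is a stand-in for an OpenStack server result.
type vmRecord struct {
	ID   string
	Name string
}

var inventory = []vmRecord{
	{ID: "a1", Name: "mtv-rhel8-sanity"},
	{ID: "b2", Name: "mtv-rhel8-sanity-2ip-1nic"},
}

var errNotFound = errors.New("not found")

// getVM resolves by ID first and returns early on success; the name
// search is only a fallback when no ID is given. OpenStack name filters
// are regex-based, so "mtv-rhel8-sanity" also matches the longer name,
// which is why the old fallthrough overwrote a correct ID result.
func getVM(id, name string) (*vmRecord, error) {
	if id != "" {
		for i := range inventory {
			if inventory[i].ID == id {
				return &inventory[i], nil // early return: no name fallthrough
			}
		}
		return nil, errNotFound
	}
	re := regexp.MustCompile(name) // mimics server-side regex matching
	for i := range inventory {
		if re.MatchString(inventory[i].Name) {
			return &inventory[i], nil
		}
	}
	return nil, errNotFound
}

func main() {
	vm, _ := getVM("b2", "mtv-rhel8-sanity")
	fmt.Println(vm.Name) // ID wins; the ambiguous name is never consulted
}
```

With the fallthrough removed, an ID lookup can no longer be silently replaced by a regex match on a shorter, ambiguous name.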

Ref: https://issues.redhat.com/browse/MTV-2631
Resolves: MTV-2631

Contributor Author

Hazanel commented Feb 25, 2026

Hey @coderabbitai, can you please review and tell me whether the PR actually fixes or prevents the hang issue observed during the migration?


coderabbitai bot commented Feb 25, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Fetch PR details including files changed and diff
gh pr view 4969 --json title,body,files,additions,deletions

# Get the diff for the PR
gh pr diff 4969

Length of output: 19556


@Hazanel, I've reviewed PR #4969 thoroughly. Here's my analysis of whether it fixes the hang issue:

Yes, it prevents the hang - but with important caveats:

What the PR Does:

The PR adds timeout protection to two critical operations in the Initialize phase where VMs can get stuck:

  1. PhaseCreateDataVolumes: Waiting for PreTransferActions hook to be ready
  2. PhaseWaitForPowerOff: Waiting for VM to power off

The implementation:

  • Marks steps as started when entering these phases
  • Checks elapsed time against a configurable timeout (default: 60 minutes)
  • Fails the stuck VM explicitly with a clear error message instead of polling forever
  • Breaks out of the execution loop when timeout is exceeded

Key Points:

✅ Prevents indefinite hanging:

  • Instead of polling forever (which causes the hang), it now has a bounded timeout
  • After the configured duration, the VM will fail with messages like:
    • "Timed out waiting for PreTransferActions after 60 minutes"
    • "Timed out waiting for VM power off after 60 minutes"

⚠️ Limitations to be aware of:

  1. This is a workaround, not a root-cause fix: it doesn't address why the PreTransferActions or PowerOff operations are failing or stalling for the OSP provider. The underlying issue that causes the hang still exists.

  2. Only covers Initialize phase: The timeout is only applied to these two specific operations. If hangs occur in other phases, they won't be caught.

  3. Timeout tuning may be needed: The default 60 minutes might be too short or too long depending on your environment. You can configure it via controller_initialize_phase_timeout_minutes.

  4. Does not provide recovery: When timeout occurs, the VM migration fails - it doesn't attempt to recover or retry the stuck operation.

Recommendation:

For the immediate issue of "VM hanging indefinitely in Initialize phase" - this PR is effective. It provides:

  • Clear failure indication instead of silent hang
  • Configurable timeout for different environments
  • Proper error propagation through the pipeline

However, I'd suggest:

  • Monitor the error logs to understand why VMs are timing out
  • Consider investigating the root cause in the OSP provider's PreTransferActions or PowerOff implementations
  • Tune the timeout value based on your typical migration times

The test coverage is comprehensive and the implementation follows good patterns. Would you like me to help investigate the potential root causes of the OSP provider hang?


🧠 Learnings used
Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor Author

Hazanel commented Feb 25, 2026

Thanks @coderabbitai for the review.

  1. The hang seems to occur on the provider side
  2. The issue occurs in the init phase
  3. The timeout is indeed configurable
  4. The migration should fail in such a case


coderabbitai bot commented Feb 25, 2026

Tip

For best results, initiate chat on the files or code changes.

@Hazanel, you're welcome! I'm glad the analysis was helpful and that your understanding aligns with what the PR accomplishes.

You've correctly identified the key aspects:
✅ Provider-side hang detection in the Initialize phase
✅ Configurable timeout mechanism
✅ Explicit failure instead of indefinite hanging

This approach gives you visibility into stuck migrations and prevents silent hangs, which is exactly what you need for the MTV-2631 issue. If you observe timeouts occurring in your environment, the error messages will help pinpoint whether it's the PreTransferActions hook or the PowerOff operation that's causing delays, which can guide further investigation into the root cause.

Feel free to reach out if you need any additional analysis or have questions as you test the fix! 🐰


(\(\
(='.')
(")_(") Happy to help!

🧠 Learnings used

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.


coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

This PR introduces initialize phase timeout configuration to the Forklift operator. Changes include adding a new controller_initialize_phase_timeout_minutes field to the ForkliftController CRD across manifest layers, exposing it in UI descriptors, enabling environment variable support, and implementing timeout validation logic in the migration controller with comprehensive test coverage.

Changes

  • CRD Field Definitions (operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml, operator/.upstream_manifests, operator/.downstream_manifests, operator/.kustomized_manifests):
    Added the controller_initialize_phase_timeout_minutes field to the ForkliftController spec with x-kubernetes-int-or-string: true and a default value of 60 minutes across multiple manifest layers.
  • UI Descriptors (operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml, operator/streams/downstream/mtv-operator.clusterserviceversion.yaml, operator/streams/upstream/forklift-operator.clusterserviceversion.yaml):
    Added ClusterServiceVersion descriptor entries for Initialize Phase Timeout, mapping controller_initialize_phase_timeout_minutes with a hidden UI attribute across downstream and upstream manifests.
  • Role Configuration (operator/roles/forkliftcontroller/defaults/main.yml, operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2):
    Added the default variable controller_initialize_phase_timeout_minutes: 60 and conditional environment variable injection into the controller deployment template.
  • Settings Management (pkg/settings/migration.go):
    Added an InitializePhaseTimeout constant and public field to the Migration struct, with environment-driven configuration loading via INITIALIZE_PHASE_TIMEOUT_MINUTES.
  • Controller Timeout Logic (pkg/controller/plan/migration.go):
    Introduced the checkStepTimeout helper function and integrated timeout checks into the PhaseCreateDataVolumes and PhaseWaitForPowerOff paths with step state management (MarkStarted, StepRunning).
  • Timeout Test Suite (pkg/controller/plan/migration_test.go):
    Added comprehensive test coverage for initialize phase timeout behavior, including boundary conditions, custom timeout values, and error propagation scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • yaacov
  • mrnold
  • mnecas
  • solenoci

Poem

🐰 A timeout arrives, graceful and kind,
To stop the initialize phase in its grind,
Sixty minutes by default (or you choose),
No endless waiting—this timeout's the muse!
Tests wrap it up tight, from start to success,
Config and logic, a temporal finesse! ⏱️

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title 'MTV-2631 | Fix VM migration hanging in Initialize phase' clearly and specifically describes the main change—adding a timeout mechanism to prevent VM migrations from hanging during the Initialize phase.
  • Description check: ✅ Passed. The pull request description directly addresses the changeset, documenting two specific issues and their fixes related to VM migration in the Initialize phase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controller/plan/migration_test.go (1)

1-1: 🧹 Nitpick | 🔵 Trivial

Remove or rephrase the "Generated-by: Claude" comment.

This attribution comment is unusual for production test code and may raise questions during audits or from other contributors about the provenance and review rigor of the code. If the tests have been reviewed and approved by a human, consider removing this line or replacing it with a standard copyright/license header.

Proposed fix
-// Generated-by: Claude
 package plan
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/plan/migration_test.go` at line 1, Remove or rephrase the lone
"Generated-by: Claude" comment in migration_test.go: either delete that
attribution line entirely, or replace it with a standard, approved header (e.g.,
copyright/license or a brief "Reviewed and approved by [team/person]" note) so
the test file no longer contains the unusual AI-generation attribution.
♻️ Duplicate comments (1)
pkg/controller/plan/migration.go (1)

1211-1212: Same MarkStarted() idempotency concern as the PhaseCreateDataVolumes site.

If MarkStarted() resets the timestamp, apply the same guard here.

Proposed fix
-			step.MarkStarted()
-			step.Phase = api.StepRunning
+			if !step.MarkedStarted() {
+				step.MarkStarted()
+				step.Phase = api.StepRunning
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/plan/migration.go` around lines 1211 - 1212, The call to
step.MarkStarted() here can reset the start timestamp on repeated runs; mirror
the idempotency guard used at the PhaseCreateDataVolumes site by only invoking
MarkStarted() when the step is transitioning into running (e.g., if step.Phase
!= api.StepRunning) before setting step.Phase = api.StepRunning. Update the
block around step.MarkStarted(), step.Phase and the same step variable so the
start time is preserved on subsequent reconciliations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/plan/migration_test.go`:
- Around line 295-365: The current duplicated Ginkgo test cases around
checkStepTimeout should be collapsed into a single table-driven test that
iterates cases (e.g., struct fields: name, startedOffset duration, message
string, customTimeout int pointer, expectTimedOut bool, expectError bool,
expectErrorContains []string). For each case create a fresh plan.Step (use
plan.Step{Task: plan.Task{Name: "Initialize"}}), set Started using
meta.NewTime(time.Now().Add(startedOffset)) or leave nil, apply custom timeout
via settings.Settings.Migration.InitializePhaseTimeout when provided (and
restore it after the case), call checkStepTimeout(step, message) and assert
timedOut/step.HasError()/step.Error.Reasons based on case expectations; update
test names to reflect intent (use case.name) and remove misleading
phase-specific names; reference symbols to change: checkStepTimeout, plan.Step,
plan.Task, step.MarkStarted (if used), and
settings.Settings.Migration.InitializePhaseTimeout.

In `@pkg/controller/plan/migration.go`:
- Around line 831-835: The log call is using r.Log.Error(nil, ...) which logs at
error level with a nil error; change these to use r.Log.Info(...) (or provide a
real error value) so you don't emit error-level entries without an error. Update
the two occurrences that currently call r.Log.Error(nil, "Initialize phase
timeout exceeded waiting for PreTransferActions.", ...) and the similar call at
the other site (around the other timeout message) to call r.Log.Info with the
same message and context (vm, timeout) or pass a non-nil error returned from
checkStepTimeout if you prefer error-level logging; ensure you modify the exact
r.Log.Error invocations in the migration logic where checkStepTimeout is used.
- Around line 791-792: Replace the timeout-level logs that call r.Log.Error(nil,
...) with informational logs r.Log.Info(...) for the timeout notifications (keep
the existing message text and key/value pairs), in the blocks that add the
timeout to the step via step.AddError(...) — specifically change the
r.Log.Error(nil, ...) usage in the provisioning timeout branch (the same block
that calls step.MarkStarted() and step.AddError(...)) and the power-off timeout
branch (the block that adds the power-off timeout via step.AddError(...)); leave
step.AddError(...) and step state transitions unchanged.

---

Outside diff comments:
In `@pkg/controller/plan/migration_test.go`:
- Line 1: Remove or rephrase the lone "Generated-by: Claude" comment in
migration_test.go: either delete that attribution line entirely, or replace it
with a standard, approved header (e.g., copyright/license or a brief "Reviewed
and approved by [team/person]" note) so the test file no longer contains the
unusual AI-generation attribution.

---

Duplicate comments:
In `@pkg/controller/plan/migration.go`:
- Around line 1211-1212: The call to step.MarkStarted() here can reset the start
timestamp on repeated runs; mirror the idempotency guard used at the
PhaseCreateDataVolumes site by only invoking MarkStarted() when the step is
transitioning into running (e.g., if step.Phase != api.StepRunning) before
setting step.Phase = api.StepRunning. Update the block around
step.MarkStarted(), step.Phase and the same step variable so the start time is
preserved on subsequent reconciliations.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6a7982 and abb17c1.

📒 Files selected for processing (12)
  • operator/.downstream_manifests
  • operator/.kustomized_manifests
  • operator/.upstream_manifests
  • operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml
  • operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml
  • operator/roles/forkliftcontroller/defaults/main.yml
  • operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2
  • operator/streams/downstream/mtv-operator.clusterserviceversion.yaml
  • operator/streams/upstream/forklift-operator.clusterserviceversion.yaml
  • pkg/controller/plan/migration.go
  • pkg/controller/plan/migration_test.go
  • pkg/settings/migration.go


coderabbitai bot commented Feb 25, 2026


@Hazanel, you're absolutely right on all four points! Your understanding of the PR's behavior is spot-on. The timeout mechanism you've implemented will give clear visibility into Initialize-phase hangs and fail gracefully instead of leaving migrations in a stuck state.

If you observe these timeouts occurring frequently in your environment, the error messages will help identify whether it's the PreTransferActions hook or the VM power-off operation that's causing the delays, which can guide further root-cause investigation at the provider level.

Feel free to reach out if you need anything else! 🐰


🧠 Learnings used

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 4509
File: operator/.upstream_manifests:249-268
Timestamp: 2026-02-01T15:17:00.752Z
Learning: Repo: kubev2v/forklift — During reviews, do not request edits or PR-splitting based on diffs in auto-generated manifest files (e.g., operator/.upstream_manifests, operator/.kustomized_manifests, operator/.downstream_manifests). Treat them as outputs; only suggest scope changes if the underlying source templates or code introduced the unrelated changes.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:6770-6772
Timestamp: 2025-10-16T18:19:40.853Z
Learning: Repo: kubev2v/forklift — For PRs relaxing ForkliftController CRD types (e.g., accepting booleans/ints/strings), do not modify or expand the example ForkliftController resource in operator/.upstream_manifests within that PR. Keep examples as-is and handle any exposure/consistency adjustments (e.g., listing feature_* flags in samples/CSV) in a separate, dedicated follow-up. Avoid creating GitHub issues for such follow-ups as they’re tracked elsewhere.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 2540
File: pkg/controller/plan/adapter/vsphere/builder.go:694-699
Timestamp: 2025-08-18T13:45:08.364Z
Learning: The user yaacov prefers not to change existing behavior within the scope of a current PR when additional validation or safety checks are identified, and instead prefers to create follow-up issues to address such improvements separately.


coderabbitai bot commented Feb 25, 2026

🧩 Analysis chain

@Hazanel Thanks for the ping — I’m pulling the exact code/diff details now to confirm whether the fix prevents the Initialize-phase hang. I’ll post a concise verdict with pointers to the relevant code paths right after these checks complete.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== PR METADATA ==="
gh pr view 4969 --json number,title,body,headRefName,baseRefName,additions,deletions,files --jq '
{number,title,headRefName,baseRefName,additions,deletions, files: [ .files[].path ] }'

echo
echo "=== CHANGED FILES ==="
gh pr view 4969 --json files --jq '.files[].path' || true

echo
echo "=== DIFF SNIPPETS (Initialize/timeout/PhaseCreateDataVolumes/WaitForPowerOff) ==="
gh pr diff 4969 | rg -nP '(Initialize|initialize).*timeout|checkStepTimeout|PhaseCreateDataVolumes|PhaseWaitForPowerOff|PreTransferActions|power off' -C3 || true

echo
echo "=== checkStepTimeout implementation (pkg/controller/plan/migration.go) ==="
if rg -nP 'func\s+checkStepTimeout\s*\(' pkg/controller/plan/migration.go >/dev/null 2>&1; then
  start_line="$(rg -nP 'func\s+checkStepTimeout\s*\(' pkg/controller/plan/migration.go | head -1 | cut -d: -f1)"
  sed -n "${start_line},+50p" pkg/controller/plan/migration.go
else
  echo "checkStepTimeout not found in pkg/controller/plan/migration.go"
fi

echo
echo "=== Sites calling checkStepTimeout in Initialize phase (pkg/controller/plan/migration.go) ==="
rg -nP 'checkStepTimeout\(' -C3 pkg/controller/plan/migration.go || true

echo
echo "=== Step start markers in Initialize phase (MarkStarted) ==="
rg -nP 'MarkStarted\(\)' -n -C3 pkg/controller/plan/migration.go || true

echo
echo "=== Initialize-phase timeout setting usage across repo ==="
rg -nP 'InitializePhaseTimeout' -n -C2 || true

echo
echo "=== Settings struct / defaults candidates ==="
rg -nP 'type\s+MigrationSettings\s+struct' -n -C5 pkg || true
rg -nP 'InitializePhaseTimeout.*(=|:|\))' -n -C3 pkg || true

echo
echo "=== Unit tests around timeout (pkg/controller/plan/migration_test.go) ==="
rg -nP 'DescribeTable|checkStepTimeout\(' -C3 pkg/controller/plan/migration_test.go || true

echo
echo "=== CRD/manifest exposure of initialize timeout (operator/*) ==="
rg -nP 'initialize.*timeout|InitializePhaseTimeout|controller_.*initialize.*timeout' -n -C2 operator || true

Length of output: 32637


@Hazanel Short answer: Yes—this PR prevents the observed Initialize-phase hang by bounding the two wait points with a configurable timeout and failing the stuck VM.

What changed and why it works

  • New timeout setting and default
    • Setting INITIALIZE_PHASE_TIMEOUT_MINUTES (default 60) is added and loaded via settings.Settings.Migration.InitializePhaseTimeout in pkg/settings/migration.go (Line 61 constant; default applied at Line 339). Exposed in ForkliftController as controller_initialize_phase_timeout_minutes and wired into the controller Deployment env. Auto-generated/operator manifests reflect this; no manual edits requested there per maintainer guidance.
  • Timeout enforcement during Initialize
    • The helper checkStepTimeout(step, msg) is implemented at pkg/controller/plan/migration.go Line 1423. It:
      • Returns false if the step hasn’t started (Line 1425).
      • Compares elapsed time against InitializePhaseTimeout; on exceed, adds a step error like “Timed out waiting for … after N minutes” and returns true (Lines 1427–1431).
    • Applied in both hang-prone waits:
      • PreTransferActions readiness: Lines 831–835. On timeout it logs an Error with the configured minutes and breaks out of the phase’s execution.
      • Waiting for VM power off: Lines 1226–1229. Same behavior—error + break.
    • Steps are marked started before these waits, so the timeout is active (e.g., Lines 791–793 and 1211–1213).
  • Propagation to VM failure
    • Unit tests verify the error is set on the step and reflected to the VM status: see pkg/controller/plan/migration_test.go Lines 373–379 (vm.ReflectPipeline). Tests also cover default 60-minute behavior and custom values (Lines 294–364).

Net effect

  • Instead of polling forever in Initialize, the controller now:
    • Fails the VM with a clear reason when the configured window elapses.
    • Stops re-waiting on that phase (due to the break), so it does not hang indefinitely.

Scope and notes

  • This is an effective guardrail for the provider-side stalls you’re seeing in Initialize; it doesn’t change provider behavior or root cause.
  • Only the Initialize waits are covered (PreTransferActions and PowerOff) as intended by MTV-2631—consistent with the project preference for targeted fixes.
  • Operator/CSV/CRD updates are auto-generated outputs surfacing the new knob; acknowledged without requesting edits.

If you want, I can also help validate the end-to-end behavior with a quick scenario outline for setting controller_initialize_phase_timeout_minutes to a low value (e.g., 5) and confirming the VM fails with the expected message after ~5 minutes during a forced stall.


✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

This PR introduces a new controller_initialize_phase_timeout_minutes field to the ForkliftController CRD specification, enabling configurable timeout handling for the migration initialization phase. The field is exposed across CRD definitions, UI descriptors, Ansible configuration, settings, and controller migration logic with test coverage.

Changes

  • CRD Field Definitions (operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml, operator/.upstream_manifests, operator/.kustomized_manifests, operator/.downstream_manifests):
    Added the new controller_initialize_phase_timeout_minutes field to the ForkliftController CRD spec with type string, the x-kubernetes-int-or-string annotation, and a description referencing the 60-minute default.
  • ClusterServiceVersion UI Descriptors (operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml, operator/streams/upstream/forklift-operator.clusterserviceversion.yaml, operator/streams/downstream/mtv-operator.clusterserviceversion.yaml):
    Added Initialize Phase Timeout descriptor entries mapping to controller_initialize_phase_timeout_minutes with displayName, description, and a hidden UI designation across upstream/downstream operator manifests.
  • Ansible Configuration (operator/roles/forkliftcontroller/defaults/main.yml, operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2):
    Added a default value (60) and a conditional INITIALIZE_PHASE_TIMEOUT_MINUTES environment variable to the controller deployment template.
  • Settings & Environment (pkg/settings/migration.go):
    Added the InitializePhaseTimeout environment constant and a public integer field to the Migration struct, loaded via getPositiveEnvLimit with a 60-minute default.
  • Migration Controller Logic (pkg/controller/plan/migration.go):
    Implemented the checkStepTimeout helper function and integrated timeout checks in PhaseCreateDataVolumes and PhaseWaitForPowerOff to validate step duration against the configured timeout threshold.
  • Test Coverage (pkg/controller/plan/migration_test.go):
    Added a comprehensive test suite with 8+ test cases covering no-timeout scenarios, timeout boundaries, custom timeout values, and error message propagation via ReflectPipeline.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • mnecas
  • mrnold
  • solenoci
  • yaacov

Poem

🐰 A timeout for phases, so timely and true,
Initialize phases now know what to do!
Sixty minutes by default, configurable still,
The controller hops forward with deadline-met will! ⏱️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): the title clearly identifies the issue (VM migration hanging in Initialize phase) and references the ticket (MTV-2631), accurately summarizing the main change.
  • Description check (✅ Passed): the description is directly related to the changeset, explaining the problem (a VM hang in the Initialize phase), the solution (a configurable timeout), and the expected outcome.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
pkg/controller/plan/migration_test.go (1)

295-365: 🧹 Nitpick | 🔵 Trivial

Timeout-case specs are still repetitive; please collapse into a table-driven form.

The cases vary by elapsed time/message/timeout but otherwise duplicate setup and assertions; this was already raised.

#!/bin/bash
# Inspect repeated timeout test cases for checkStepTimeout
rg -n 'ginkgo\.It\("should (not )?time out' -A6 -B2 pkg/controller/plan/migration_test.go

As per coding guidelines: **/*_test.go: "Use table-driven tests for multiple cases. Test names should clearly state their intent."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/plan/migration_test.go` around lines 295 - 365, The tests
around checkStepTimeout are repetitive; replace the multiple ginkgo.It blocks
with a single table-driven loop that enumerates cases (name, startedOffset,
message, expectedTimedOut, expectedHasError, expectedReasonContains, optional
customTimeout) and asserts the same expectations; for each row create a step :=
&plan.Step{Task: plan.Task{Name: "Initialize"}}, set step.Started via
meta.NewTime(time.Now().Add(startedOffset)) or call step.MarkStarted() / leave
nil for the no-started-time case, set
settings.Settings.Migration.InitializePhaseTimeout when a customTimeout is
provided, call checkStepTimeout(step, message) and verify returned bool and
step.HasError()/step.Error.Reasons content as in the original cases (look up
symbols checkStepTimeout, plan.Step, Step.MarkStarted, Step.Started,
settings.Settings.Migration.InitializePhaseTimeout, Step.Error.Reasons) so the
assertions match each former ginkgo.It scenario.
pkg/controller/plan/migration.go (1)

831-835: Timeout log still uses Error(nil, ...) instead of Info(...) or a real error.

Line 832 and Line 1227 still emit error-level logs with a nil error object; this was already raised and remains unresolved.

#!/bin/bash
# Verify unresolved timeout log sites that still use Error(nil,...)
rg -n 'r\.Log\.Error\(nil,\s*"Initialize phase timeout exceeded waiting for (PreTransferActions|VM power off)\."'

Also applies to: 1226-1229

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/plan/migration.go` around lines 831 - 835, The timeout log
calls in migration.go are using r.Log.Error(nil, ...) with a nil error (e.g.,
the message "Initialize phase timeout exceeded waiting for PreTransferActions."
in the block containing checkStepTimeout(step, "Timed out waiting for
PreTransferActions") and the similar "VM power off" timeout), so update those
sites to either emit an Info-level log or provide a real error object; replace
r.Log.Error(nil, "...") with r.Log.Info("Initialize phase timeout exceeded
waiting for PreTransferActions.", "vm", vm.String(), "timeout",
settings.Settings.Migration.InitializePhaseTimeout) (and the analogous change
for the VM power off message) or construct a descriptive error (e.g.,
fmt.Errorf("initialize phase timeout: %s",
settings.Settings.Migration.InitializePhaseTimeout)) and pass that to
r.Log.Error; ensure you change the calls around the checkStepTimeout usages so
no nil error is passed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/controller/plan/migration_test.go`:
- Around line 295-365: The tests around checkStepTimeout are repetitive; replace
the multiple ginkgo.It blocks with a single table-driven loop that enumerates
cases (name, startedOffset, message, expectedTimedOut, expectedHasError,
expectedReasonContains, optional customTimeout) and asserts the same
expectations; for each row create a step := &plan.Step{Task: plan.Task{Name:
"Initialize"}}, set step.Started via meta.NewTime(time.Now().Add(startedOffset))
or call step.MarkStarted() / leave nil for the no-started-time case, set
settings.Settings.Migration.InitializePhaseTimeout when a customTimeout is
provided, call checkStepTimeout(step, message) and verify returned bool and
step.HasError()/step.Error.Reasons content as in the original cases (look up
symbols checkStepTimeout, plan.Step, Step.MarkStarted, Step.Started,
settings.Settings.Migration.InitializePhaseTimeout, Step.Error.Reasons) so the
assertions match each former ginkgo.It scenario.

In `@pkg/controller/plan/migration.go`:
- Around line 831-835: The timeout log calls in migration.go are using
r.Log.Error(nil, ...) with a nil error (e.g., the message "Initialize phase
timeout exceeded waiting for PreTransferActions." in the block containing
checkStepTimeout(step, "Timed out waiting for PreTransferActions") and the
similar "VM power off" timeout), so update those sites to either emit an
Info-level log or provide a real error object; replace r.Log.Error(nil, "...")
with r.Log.Info("Initialize phase timeout exceeded waiting for
PreTransferActions.", "vm", vm.String(), "timeout",
settings.Settings.Migration.InitializePhaseTimeout) (and the analogous change
for the VM power off message) or construct a descriptive error (e.g.,
fmt.Errorf("initialize phase timeout: %s",
settings.Settings.Migration.InitializePhaseTimeout)) and pass that to
r.Log.Error; ensure you change the calls around the checkStepTimeout usages so
no nil error is passed.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6a7982 and abb17c1.

📒 Files selected for processing (12)
  • operator/.downstream_manifests
  • operator/.kustomized_manifests
  • operator/.upstream_manifests
  • operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml
  • operator/config/manifests/bases/forklift-operator.clusterserviceversion.yaml
  • operator/roles/forkliftcontroller/defaults/main.yml
  • operator/roles/forkliftcontroller/templates/controller/deployment-controller.yml.j2
  • operator/streams/downstream/mtv-operator.clusterserviceversion.yaml
  • operator/streams/upstream/forklift-operator.clusterserviceversion.yaml
  • pkg/controller/plan/migration.go
  • pkg/controller/plan/migration_test.go
  • pkg/settings/migration.go

@Hazanel Hazanel force-pushed the initialize-phase-timeout branch from abb17c1 to 6e0f1db Compare February 25, 2026 12:51
@Hazanel
Contributor Author

Hazanel commented Feb 25, 2026

Hey @coderabbitai, is adding:

step.MarkStarted()
step.Phase = api.StepRunning

(in both phases, as I did) necessary?

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Finding MarkStarted() implementation ==="
ast-grep --pattern $'func ($$$) MarkStarted() {
  $$$
}'

echo -e "\n=== Alternative: search for MarkStarted definition ==="
rg -n 'func.*MarkStarted' -A 5

echo -e "\n=== Check Step struct definition (Started field) ==="
rg -n 'type Step struct' -A 20 | head -40

echo -e "\n=== Find other usages of MarkStarted in the codebase ==="
rg -n '\.MarkStarted\(\)' -B 2 -A 2

echo -e "\n=== Verify checkStepTimeout checks step.Started ==="
rg -n 'step\.Started.*nil' pkg/controller/plan/migration.go -C 3

Length of output: 13596


@Hazanel, yes, both are absolutely necessary! Here's why:

step.MarkStarted() is critical for the timeout to work

Looking at the implementation:

MarkStarted() sets the timestamp (pkg/apis/forklift/v1beta1/plan/timed.go:20-25):

func (r *Timed) MarkStarted() {
    if r.Started == nil {
        r.Started = r.now()
        r.Completed = nil
    }
}

checkStepTimeout() depends on it (pkg/controller/plan/migration.go:1423-1426):

func checkStepTimeout(step *plan.Step, msg string) bool {
    if step.Started == nil {
        return false  // ← NO timeout check happens!
    }
    limitMinutes := settings.Settings.Migration.InitializePhaseTimeout
    // ... rest of timeout logic

Without MarkStarted(), step.Started remains nil, causing checkStepTimeout to always return false — which means no timeout will ever trigger. Your entire timeout mechanism would be silently disabled.

step.Phase = api.StepRunning follows the codebase pattern

This is the standard pattern used throughout the codebase. Examples:

  • EC2 migrator (lines 30-32, 67-69, 86-88, etc.):
    step.MarkStarted()
    step.Phase = api.StepRunning
  • Regular migration flow (lines 791-792, 1024-1025, 1211-1212, etc.):
    step.MarkStarted()
    step.Phase = api.StepRunning

Setting Phase = api.StepRunning provides status visibility that the step is actively waiting/executing, which is important for monitoring and debugging.

Summary

✅ Both are necessary and correctly implemented
✅ Follows established codebase patterns
✅ Critical for your timeout functionality to work


🧠 Learnings used
Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

@kubev2v kubev2v deleted a comment from coderabbitai bot Feb 25, 2026
@Hazanel
Contributor Author

Hazanel commented Feb 25, 2026

Thanks @coderabbitai. For example: PhaseWaitForPowerOff is part of the same Initialize step. When vm.FindStep(r.migrator.Step(vm)) is called at line 1206, it returns the same "Initialize" step that was already marked started back in PhaseStarted, so the step already has Started set and Phase == StepRunning by the time we reach PhaseWaitForPowerOff.

What do you think? I'm sure it's harmless; I just wonder whether it's really required.

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain



🏁 Script executed:

#!/bin/bash

echo "=== Examining phase flow in migration.go ==="
echo "Looking for phase transitions and the order of PhaseStarted and PhaseWaitForPowerOff..."

# Find phase constants and understand the flow
echo -e "\n1. Phase constants/enum definitions:"
rg -n 'const.*Phase.*=' pkg/apis/forklift/v1beta1/plan/vm.go -A 2 2>/dev/null || \
rg -n 'Phase.*=.*"' pkg/controller/plan/migration.go | head -20

echo -e "\n2. Switch statement structure showing phase ordering:"
rg -n 'case.*Phase' pkg/controller/plan/migration.go | head -30

echo -e "\n3. Specific PhaseStarted implementation:"
rg -n 'case.*PhaseStarted' pkg/controller/plan/migration.go -A 15

echo -e "\n4. Specific PhaseWaitForPowerOff implementation:"
rg -n 'case.*PhaseWaitForPowerOff' pkg/controller/plan/migration.go -A 20

echo -e "\n5. Look for phase assignments that could skip PhaseStarted:"
rg -n 'vm\.Phase\s*=' pkg/controller/plan/migration.go | grep -v '//'

echo -e "\n6. Check if phase can be restored from persistent state:"
rg -n 'Restore|Resume|recover.*Phase' pkg/controller/plan/migration.go -i -C 2

echo -e "\n7. Pattern check: Do other phases also call MarkStarted redundantly?"
echo "Looking for other phases that mark steps as started..."
ast-grep --pattern $'case $PHASE:
  $$$
  step.MarkStarted()
  $$$' pkg/controller/plan/migration.go 2>/dev/null || \
rg -n 'step\.MarkStarted\(\)' pkg/controller/plan/migration.go -B 3 | head -40

echo -e "\n8. Check VM phase flow documentation or comments:"
rg -n 'Initialize.*phase.*flow|phase.*order|phase.*sequence' pkg/controller/plan/ -i -C 2

echo -e "\n9. Look for any early-exit or direct jump to PhaseWaitForPowerOff:"
rg -n 'PhaseWaitForPowerOff' pkg/controller/plan/migration.go -C 2 | head -40

Length of output: 4980


@Hazanel, excellent observation! You're absolutely right about the normal execution flow — within a single controller run, PhaseWaitForPowerOff operates on the same step that was already marked started in PhaseStarted.

However, the redundant calls are necessary for controller restart/recovery scenarios:

Why it matters: Kubernetes controller reconciliation

In Kubernetes, VM state is persisted to etcd. Consider this scenario:

  1. VM reaches PhaseWaitForPowerOff and is saved to etcd with that phase
  2. Controller pod crashes or is rescheduled
  3. New controller pod starts and reconciles the VM
  4. The VM is processed directly at PhaseWaitForPowerOff; PhaseStarted is not re-executed

Without the MarkStarted() call in PhaseWaitForPowerOff, your timeout check would fail:

func checkStepTimeout(step *plan.Step, msg string) bool {
    if step.Started == nil {
        return false  // ← timeout won't work!
    }

Pattern confirms this is intentional

Looking at the codebase, many phases call MarkStarted() redundantly:

  • PhaseCreateDataVolumes (Line 791)
  • PhaseCreateVM (Line 926)
  • PhaseAllocateDisks/PhaseCopyDisks (Line 976)
  • PhaseConvertOpenstackSnapshot (Line 1024)
  • PhaseWaitForPowerOff (Line 1211)
  • PhaseCreateGuestConversionPod (Line 1252)
  • PhaseConvertGuest/PhaseCopyDisksVirtV2V (Line 1271)

This widespread pattern indicates it's defensive programming to ensure each phase can stand on its own after controller restarts.

Summary

✅ Your observation about normal flow is correct
✅ The "redundant" calls are essential for recovery scenarios
✅ MarkStarted() is idempotent, so calling it multiple times is safe
✅ Follows established codebase pattern for resilience

The calls ensure your timeout mechanism works reliably even across controller restarts! 🐰


(\(\
(='.')
(")_(") Smart question!

🧠 Learnings used
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.

Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

@Hazanel
Contributor Author

Hazanel commented Feb 25, 2026

@coderabbitai, and is it also necessary in PhaseCreateDataVolumes ?

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Verify phase flow for PhaseCreateDataVolumes ==="
echo "1. Check phase constants/definitions to understand ordering:"
rg -n 'Phase.*=.*"' pkg/apis/forklift/v1beta1/plan/vm.go | head -30

echo -e "\n2. Switch case structure showing PhaseStarted and PhaseCreateDataVolumes:"
rg -n 'case.*PhaseStarted|case.*PhaseCreateDataVolumes' pkg/controller/plan/migration.go -A 3

echo -e "\n3. Look at PhaseCreateDataVolumes implementation with MarkStarted:"
rg -n 'case.*PhaseCreateDataVolumes' pkg/controller/plan/migration.go -A 25

echo -e "\n4. Verify if VM phase can be set directly to PhaseCreateDataVolumes:"
rg -n 'vm\.Phase.*=.*PhaseCreateDataVolumes' pkg/controller/plan/ -C 2

echo -e "\n5. Check if phase transitions can skip PhaseStarted:"
echo "Looking for any code that sets phase directly..."
rg -n 'Phase\s*=.*Phase[A-Z]' pkg/controller/plan/migration.go | head -20

echo -e "\n6. Check for any phase restoration/persistence logic:"
rg -n 'phase.*restor|recover.*phase|load.*phase' pkg/controller/plan/ -i -C 2

echo -e "\n7. Verify VM struct has Phase field that gets persisted:"
rg -n 'type VM struct' pkg/apis/forklift/v1beta1/plan/vm.go -A 30 | head -40

Length of output: 3969


@Hazanel, yes, it's equally necessary in PhaseCreateDataVolumes for exactly the same reason!

The controller restart scenario applies here too

Looking at the code structure:

PhaseCreateDataVolumes (Line 785-792):

case api.PhaseCreateDataVolumes:
    step, found := vm.FindStep(r.migrator.Step(vm))
    if !found {
        vm.AddError(fmt.Sprintf("Step '%s' not found", r.migrator.Step(vm)))
        break
    }
    step.MarkStarted()
    step.Phase = api.StepRunning
    
    var ready bool
    ready, err = r.provider.PreTransferActions(vm.Ref)
    // ... timeout check happens here

Scenario where it matters:

  1. VM reaches PhaseCreateDataVolumes and waits for PreTransferActions
  2. Controller pod crashes/restarts
  3. VM object in etcd still has Phase = PhaseCreateDataVolumes
  4. New controller pod reconciles and enters the PhaseCreateDataVolumes case directly
  5. Without MarkStarted(), your timeout check fails:
    if step.Started == nil {
        return false  // ← timeout disabled!
    }

This is your PRIMARY timeout location!

Actually, PhaseCreateDataVolumes is even more critical than PhaseWaitForPowerOff because:

  • Line 831-835: This is where you check the PreTransferActions timeout
  • This is the first Initialize-phase timeout you're adding
  • It's explicitly mentioned in your PR objective as one of the two hang points

Summary

✅ Absolutely necessary in PhaseCreateDataVolumes
✅ This phase can be entered directly after controller restart
✅ It's actually the primary location for your PreTransferActions timeout
✅ Without it, the timeout won't work for VMs stuck waiting for PreTransferActions after a controller restart

Both locations need these calls for the same defensive programming reason! 🐰


🧠 Learnings used
Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.

Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

@kubev2v kubev2v deleted a comment from coderabbitai bot Feb 25, 2026
  description: "Snapshot removal timeout in minutes (default: 120)"
controller_initialize_phase_timeout_minutes:
  x-kubernetes-int-or-string: true
  description: "Initialize phase timeout in minutes (default: 60)"
Member


@coderabbitai hi,
I have a concern about the naming of this controller field. Specifically, it sounds like a general controller init phase, but it looks like a CDI setting. What am I missing?


🧩 Analysis chain

@yaacov, great question! Let me investigate what the "Initialize phase" specifically refers to in the migration workflow.


🏁 Script executed:

#!/bin/bash
# Search for "Initialize" phase definition and usage in migration code
echo "=== Searching for Initialize phase definitions ==="
rg -n "PhaseCreateDataVolumes|PhaseWaitForPowerOff|Initialize.*phase" pkg/controller/plan/ -A 2 -B 2

echo -e "\n=== Checking migration phase constants ==="
rg -n "const.*Phase" pkg/apis/forklift/v1beta1/ -A 1

echo -e "\n=== Looking at checkStepTimeout implementation ==="
ast-grep --pattern 'func $_(r *Migration) checkStepTimeout($$$) {
  $$$
}'

Length of output: 5573


🏁 Script executed:

#!/bin/bash
# Look for the "Initialize" step definition and what phases it includes
echo "=== Searching for Initialize step definition ==="
rg -n "Initialize" pkg/controller/plan/migrator/base/migrator.go -B 2 -A 2

echo -e "\n=== Look at step names and their phases ==="
ast-grep --pattern 'const (
  $$$
  Initialize = $_
  $$$
)'

echo -e "\n=== Check what checkStepTimeout does ==="
rg -n "checkStepTimeout" pkg/controller/plan/migration.go -B 5 -A 10

Length of output: 7338


@yaacov, excellent catch! You're right to be concerned about the naming. Let me clarify what this timeout actually controls:

What "Initialize" Means Here

"Initialize" is a step name in the VM migration workflow (not the Forklift controller's own initialization). Looking at the code:

The "Initialize" step encompasses these migration phases:

  • PhaseStarted
  • PhaseCreateInitialSnapshot / PhaseWaitForInitialSnapshot / PhaseStoreInitialSnapshotDeltas
  • PhaseCreateDataVolumes (when not running preflight inspection)
  • PhaseStorePowerState / PhasePowerOffSource / PhaseWaitForPowerOff (for cold migrations)

The timeout is specifically applied to:

  1. PhaseCreateDataVolumes - waiting for PreTransferActions hook readiness
  2. PhaseWaitForPowerOff - waiting for source VM to power off

Is it a CDI Setting?

No, it's not CDI-specific. While CDI may be involved during data volume creation, this timeout is broader - it covers:

  • Waiting for PreTransferActions hooks to become ready
  • Waiting for the source VM to power off

These are general VM migration operations, not specifically CDI operations.

Naming Concern

Your concern is valid. The name controller_initialize_phase_timeout_minutes is ambiguous:

  • ❌ Sounds like: Forklift controller's initialization timeout
  • ✅ Actually means: VM migration's "Initialize" step timeout

Would a more descriptive name be better? For example:

  • controller_migration_initialize_step_timeout_minutes (makes it clear it's about migration step)
  • controller_vm_initialize_timeout_minutes (emphasizes VM migration)
  • Something even more specific to what it times out (power off + PreTransferActions)?

What do you think - should we reconsider the naming before this merges?


🧠 Learnings used
Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/.downstream_manifests:151-153
Timestamp: 2025-09-01T16:32:31.284Z
Learning: The field controller_snapshot_removal_timeout_minuts in ForkliftController CRD contains a typo - it should be controller_snapshot_removal_timeout_minutes. The maintainer yaacov prefers to handle such fixes in separate issues rather than mixing them with documentation PRs.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/.downstream_manifests:151-153
Timestamp: 2025-09-01T16:32:31.284Z
Learning: The field controller_snapshot_removal_timeout_minuts in ForkliftController CRD contains a typo - it should be controller_snapshot_removal_timeout_minutes. The maintainer yaacov prefers to handle such fixes in separate issues rather than mixing them with documentation PRs.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.kustomized_manifests:6769-6772
Timestamp: 2025-10-16T15:58:40.425Z
Learning: Repo: kubev2v/forklift — Maintainer clarified that the long-term canonical type for ForkliftController feature flags (feature_*, and related controller_* toggles) is boolean. Samples should show booleans, while the CRD may keep x-kubernetes-preserve-unknown-fields: true temporarily to accept string inputs for backward compatibility.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 2882
File: operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml:187-204
Timestamp: 2025-09-18T19:05:05.746Z
Learning: ForkliftController CRD validation philosophy (kubev2v/forklift): The CRD schema is designed for hinting/documenting available configuration knobs rather than strict validation. String fields accept any values, and actual validation occurs where values are used (in Ansible roles/templates). The hack/validate_forklift_controller_crd.py focuses on defaults-CRD consistency, not enforcing strict field validation.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T13:46:22.273Z
Learning: Repo: kubev2v/forklift — When temporarily relaxing ForkliftController CRD types to accept booleans/ints alongside strings, maintainers prefer not to add runtime normalization or extra checks in roles/templates. Keep changes limited to schema, treat this as temporary compatibility, and plan to re-tighten types in a follow-up release.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:6770-6772
Timestamp: 2025-10-16T18:19:40.853Z
Learning: Repo: kubev2v/forklift — For PRs relaxing ForkliftController CRD types (e.g., accepting booleans/ints/strings), do not modify or expand the example ForkliftController resource in operator/.upstream_manifests within that PR. Keep examples as-is and handle any exposure/consistency adjustments (e.g., listing feature_* flags in samples/CSV) in a separate, dedicated follow-up. Avoid creating GitHub issues for such follow-ups as they’re tracked elsewhere.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/config/crd/bases/forklift.konveyor.io_forkliftcontrollers.yaml:384-384
Timestamp: 2025-09-01T16:36:42.418Z
Learning: The maintainer yaacov has confirmed that additionalProperties: true in the ForkliftController CRD spec is intentionally kept to allow new features to be added to the controller without requiring CRD schema updates. This is a deliberate architectural choice for extensibility.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 2690
File: operator/.downstream_manifests:40-40
Timestamp: 2025-09-01T14:02:10.056Z
Learning: The ForkliftController CRD intentionally uses `additionalProperties: true` in its OpenAPI schema to maintain flexibility for future field additions without requiring schema updates. This is a deliberate architectural decision to support extensibility.

Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.

Member

@mnecas mnecas left a comment


According to the bug logs, the issue is with the OpenStack logic, not the migration process itself.

{"level":"debug","ts":"2025-05-30 10:21:04.986","logger":"plan|2nrt7","msg":"Secret updated.","plan":{"name":"osp-test","namespace":"openshift-mtv"},"migration":"openshift-mtv/osp-test-cccl8","secret":"mtv-test/osp-test-39093ebb-b524-4aae-a2ea-69445333d842-8q9cg","vm":" id:39093ebb-b524-4aae-a2ea-69445333d842 name:'mtv-rhel8-sanity-nfs' "}
{"level":"info","ts":"2025-05-30 10:21:05.001","logger":"plan|2nrt7","msg":"the image properties have not been updated yet","plan":{"name":"osp-test","namespace":"openshift-mtv"},"migration":"openshift-mtv/osp-test-cccl8","image":"forklift-migration-vm-39093ebb-b524-4aae-a2ea-69445333d842-volume-edf7520c-ea5e-4370-b586-3a364d2e08d2"}

We should take a look at the OpenStack code and fix the problem there.
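The repeating "image properties have not been updated yet" message in the log above is the signature of a silent-success path. A minimal sketch (hypothetical function names, not the actual forklift code) of the behavior before and after the fix:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins only, not the real forklift client types.
var errVolumeGone = errors.New("backing volume not found")

// Before the fix: when the backing volume was already gone, the
// property update logged a message and returned nil, so the caller
// treated the image as "in progress" and reconciled forever.
func updateImagePropertyBefore(volumeExists bool) error {
	if !volumeExists {
		fmt.Println("the image properties have not been updated yet")
		return nil // silent success: nothing will ever set the property
	}
	return nil
}

// After the fix: surface the error so the caller can delete the
// stale image and recreate it from the snapshot.
func updateImagePropertyAfter(volumeExists bool) error {
	if !volumeExists {
		return errVolumeGone
	}
	return nil
}

func main() {
	fmt.Println(updateImagePropertyBefore(false)) // <nil>: loop repeats
	fmt.Println(updateImagePropertyAfter(false))  // error: caller can act
}
```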

Comment thread pkg/controller/plan/migration.go Outdated
@Hazanel Hazanel force-pushed the initialize-phase-timeout branch from 6e0f1db to 43fab8a on March 1, 2026 07:55
@Hazanel
Contributor Author

Hazanel commented Mar 1, 2026

Hey @coderabbitai, I've listened to the comments and dug deep into the OpenStack code issue.
Please review, and if you can, create a diagram that illustrates the issue's state machine.

@coderabbitai

coderabbitai bot commented Mar 1, 2026


🧩 Analysis chain

@Hazanel, I'll review the latest changes and create a state machine diagram for you!


🏁 Script executed:

#!/bin/bash

echo "=== PR Details and Current State ==="
gh pr view 4969 --json number,title,body,headRefName,baseRefName,state,additions,deletions,files,commits --jq '{
  number,
  title,
  headRefName,
  baseRefName,
  state,
  additions,
  deletions,
  files: [.files[].path],
  commits: [.commits[] | {oid: .oid[0:7], message: .messageHeadline}]
}'

echo -e "\n=== Full PR Diff ==="
gh pr diff 4969

echo -e "\n=== Recent Commits (last 5) ==="
gh pr view 4969 --json commits --jq '.commits[-5:] | .[] | "\(.oid[0:7]) \(.committedDate) \(.messageHeadline)"'

echo -e "\n=== Files Changed ==="
gh pr view 4969 --json files --jq '.files[] | "\(.path) +\(.additions) -\(.deletions)"'

echo -e "\n=== Check for OpenStack-specific changes ==="
rg -n 'openstack|osp|OSP' --type go -i -C 3 || echo "No OpenStack-specific code changes found"

echo -e "\n=== Migration.go changes (main logic) ==="
rg -n 'func.*Initialize|checkStepTimeout|PhaseCreateDataVolumes|PhaseWaitForPowerOff' pkg/controller/plan/migration.go -C 5 || true

echo -e "\n=== Settings changes ==="
cat pkg/settings/migration.go | grep -A 5 -B 5 'InitializePhaseTimeout' || true

Length of output: 167845


@Hazanel, excellent work investigating the root cause! You've moved from a timeout-based mitigation to fixing the actual bugs in the OpenStack provider. This is a much better solution! 🎉

📊 Analysis of Root Cause Fixes

Your changes address four critical bug categories in pkg/controller/plan/adapter/openstack/client.go:

1. Control Flow Bugs (Lines 230-234, 264-268, 301-305)

// ❌ BEFORE: return inside error block - function continues!
if err != nil {
    if r.IsNotFound(err) {
        err = ResourceNotFoundError
    }
    return  // ← Only returns from error block!
}
// Code continues executing even after error...

// ✅ AFTER: return outside error block
if err != nil {
    if r.IsNotFound(err) {
        err = ResourceNotFoundError
    }
}
return  // ← Always returns, preventing undefined behavior

Impact: Without this fix, functions continued executing after errors, leading to nil pointer dereferences or operations on invalid data → potential hangs.

2. Property Management (Lines 420-426, 486-498)

// Lines 420-426: Explicit error instead of silent log
if !found {
    err = liberr.New("cannot find the original volume ID for image",
        "vm", vm.Name, "image", image.Name)
    return  // ← Fails fast instead of continuing
}

// Lines 486-498: Sets critical property on new images
imageUpdateOpts := &libclient.ImageUpdateOpts{}
imageUpdateOpts.AddImageProperty(forkliftPropertyOriginalVolumeID, originalVolumeID)
if err = r.Update(image, imageUpdateOpts); err != nil {
    return
}

Impact: The forkliftPropertyOriginalVolumeID property is essential for tracking image provenance. Without it, validation/cleanup logic fails → hangs waiting for images that can never be validated.

3. Stale Image Detection & Cleanup (Lines 922-944, 1019-1066)

Image readiness check (lines 922-944):

case ImageStatusActive:
    if _, ok := image.Properties[forkliftPropertyOriginalVolumeID]; !ok {
        // Property missing: try to recover from backing volume
        if updateErr := r.updateImageProperty(vm, image); updateErr != nil {
            r.Log.Info("deleting stale image that cannot be updated",
                "vm", vm.Name, "image", image.Name, "error", updateErr)
            if err = r.Delete(image); err != nil {
                return
            }
            return  // ← Triggers retry/recreation
        }
    }

Snapshot-to-volume conversion (lines 1019-1066):

if _, ok := image.Properties[forkliftPropertyOriginalVolumeID]; ok {
    r.Log.Info("skipping the volume creation, the image already exists", ...)
} else {
    r.Log.Info("found stale image without property, deleting before re-creating", ...)
    if err = r.Delete(image); err != nil {
        return
    }
    // Recreate volume from snapshot
    _, err = r.createVolumeFromSnapshot(vm, snapshot.ID)
}

Impact: Stale images (from controller restarts or failed operations) without proper metadata would be polled forever. Now they're detected and cleaned up → prevents indefinite waiting.

4. Goroutine Safety (Lines 860-896)

// ❌ BEFORE: Launched goroutines inside loop
for _, image := range imagesFromVolumes {
    // ... processing ...
    originalVolumeID := image.Properties[forkliftPropertyOriginalVolumeID]
    go func() {
        err := r.cleanup(vm, originalVolumeID)  // ← Variable capture bug!
    }()
}

// ✅ AFTER: Collect IDs, then launch goroutines
var cleanupVolumeIDs []string
for _, image := range imagesFromVolumes {
    // ... processing ...
    cleanupVolumeIDs = append(cleanupVolumeIDs, originalVolumeID)
}
for _, volID := range cleanupVolumeIDs {
    volID := volID  // ← Proper variable capture
    go func() {
        if cleanupErr := r.cleanup(vm, volID); cleanupErr != nil {
            r.Log.Error(cleanupErr, "failed to cleanup snapshot and volume", ...)
        }
    }()
}

Impact: The original code had a closure variable capture bug and could launch cleanups prematurely → race conditions and resource leaks.


🔄 Initialize Phase State Machine

stateDiagram-v2
    [*] --> PhaseStarted: Migration begins
    
    PhaseStarted --> PhaseCreateDataVolumes: Initialize step started
    
    state PhaseCreateDataVolumes {
        [*] --> WaitPreTransfer: Check PreTransferActions
        WaitPreTransfer --> CreateVolumes: Hook ready
        WaitPreTransfer --> WaitPreTransfer: Not ready (poll)
        
        state CreateVolumes {
            [*] --> CheckExistingImage: For each disk
            
            CheckExistingImage --> ValidateImage: Image found
            CheckExistingImage --> CreateFromSnapshot: No image
            
            ValidateImage --> CheckProperty: Image status = Active
            CheckProperty --> CleanupVolume: Property present & valid
            CheckProperty --> TryRecoverProperty: Property missing
            
            TryRecoverProperty --> CleanupVolume: ✅ Recovered from volume
            TryRecoverProperty --> DeleteStaleImage: ❌ Cannot recover
            
            DeleteStaleImage --> CreateFromSnapshot: Trigger retry
            
            CreateFromSnapshot --> CreateImageFromVolume: Create volume
            CreateImageFromVolume --> SetImageProperty: ✅ Image created
            SetImageProperty --> CleanupVolume: ✅ Property set
            
            CleanupVolume --> [*]: Async cleanup (goroutine)
        }
        
        CreateVolumes --> [*]: All volumes ready
    }
    
    PhaseCreateDataVolumes --> PhaseWaitForPowerOff: Volumes created
    
    state PhaseWaitForPowerOff {
        [*] --> CheckVMPower: Poll VM state
        CheckVMPower --> StopVM: VM running
        CheckVMPower --> [*]: VM powered off
        StopVM --> CheckVMPower: Wait for poweroff
    }
    
    PhaseWaitForPowerOff --> PhaseCreateVM: VM powered off
    PhaseCreateVM --> [*]: Migration continues...
    
    note right of ValidateImage
        🐛 OLD BUG: Stale images
        without property would
        be polled forever!
        
        ✅ FIX: Detect & delete
        stale images, trigger
        recreation
    end note
    
    note right of SetImageProperty
        🐛 OLD BUG: Property
        not set on new images!
        
        ✅ FIX: Always set
        forkliftPropertyOriginalVolumeID
        after image creation
    end note
    
    note right of CreateFromSnapshot
        🐛 OLD BUG: Control flow
        bugs caused early returns
        to fail
        
        ✅ FIX: Proper return
        statement placement
    end note

✅ Why This Fixes the Hang

Your changes prevent the Initialize phase hang through:

  1. Proper error propagation: Control flow fixes ensure errors cause proper failures instead of silent hangs
  2. Complete metadata: All images now have the required forkliftPropertyOriginalVolumeID property
  3. Stale resource cleanup: Detects and removes leftover images from previous failed attempts
  4. Safe concurrent cleanup: Goroutine fixes prevent race conditions

The original hang scenario:

1. Volume → Image conversion starts
2. Controller crashes mid-operation
3. Restart: Image exists but missing forkliftPropertyOriginalVolumeID
4. Validation logic can't proceed
5. ⏰ Polls forever waiting for property that will never appear

After your fixes:

1. Volume → Image conversion starts
2. Property is ALWAYS set (lines 486-498)
3. If stale image detected: delete & recreate (lines 922-944, 1019-1066)
4. Proper errors prevent silent failures (lines 230-234, etc.)
5. ✅ Clean progression through Initialize phase

🎯 Recommendation

This is significantly better than a timeout mechanism because:

  • ✅ Fixes root causes instead of masking symptoms
  • ✅ Prevents resource leaks (stale images)
  • ✅ Self-healing (detects and cleans up bad state)
  • ✅ Targeted to OpenStack provider (follows maintainer preferences for focused fixes)

The state diagram shows how your fixes create a robust, self-correcting flow through the Initialize phase! 🚀
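One part of the fix the snippets above don't cover is the ID-before-name resolution order (Issue 2). A minimal, self-contained sketch (hypothetical types and an in-memory inventory, not the actual forklift client) of the corrected lookup:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-in for the OpenStack API; shapes are
// illustrative, not the real forklift or gophercloud types.
type VM struct {
	ID   string
	Name string
}

var inventory = []VM{
	{ID: "39093ebb", Name: "mtv-rhel8-sanity-2ip-1nic"},
	{ID: "a2ea6944", Name: "mtv-rhel8-sanity"},
}

// getVM mirrors the fixed lookup order: an exact ID match returns
// immediately; the regex-style name search is only a fallback.
func getVM(id, name string) (*VM, error) {
	if id != "" {
		for i := range inventory {
			if inventory[i].ID == id {
				return &inventory[i], nil // early return: never overwritten by name search
			}
		}
		return nil, fmt.Errorf("vm %q not found by ID", id)
	}
	// OpenStack list filters match names as regexes, so "mtv-rhel8-sanity"
	// also matches "mtv-rhel8-sanity-2ip-1nic" and the first hit may be wrong.
	re := regexp.MustCompile(name)
	for i := range inventory {
		if re.MatchString(inventory[i].Name) {
			return &inventory[i], nil
		}
	}
	return nil, fmt.Errorf("vm %q not found by name", name)
}

func main() {
	vm, _ := getVM("a2ea6944", "mtv-rhel8-sanity")
	fmt.Println(vm.Name) // ID wins: mtv-rhel8-sanity
	vm, _ = getVM("", "mtv-rhel8-sanity")
	fmt.Println(vm.Name) // regex fallback hits the longer name first
}
```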


🧠 Learnings used
Learnt from: Hazanel
Repo: kubev2v/forklift PR: 0
File: :0-0
Timestamp: 2026-02-25T14:13:49.423Z
Learning: In kubev2v/forklift (pkg/controller/plan/migration.go), Initialize-phase sub-phases (e.g., PhaseWaitForPowerOff) operate on the same Step object created/started in PhaseStarted. Therefore, re-calling Step.MarkStarted() and re-setting Step.Phase=StepRunning in later Initialize sub-phases is redundant for correctness (MarkStarted is idempotent; timeout checks rely only on Started being non-nil).

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.upstream_manifests:156-181
Timestamp: 2025-10-16T14:11:57.112Z
Learning: Repo: kubev2v/forklift — Maintainer yaacov prefers not to create GitHub follow-up issues (e.g., for re-tightening ForkliftController CRD types after temporary relaxation); such work is tracked in a different system. In future reviews, acknowledge and proceed without proposing GitHub issue creation.

Learnt from: Hazanel
Repo: kubev2v/forklift PR: 2813
File: pkg/controller/provider/container/vsphere/collector.go:354-355
Timestamp: 2025-09-16T10:09:36.531Z
Learning: For vsphere provider timeout issues, keep fixes targeted to vsphere-specific code rather than updating similar timeouts across other providers or the entire codebase. The maintainer prefers focused changes that address the specific use case without introducing broader risk.

Learnt from: yaacov
Repo: kubev2v/forklift PR: 3204
File: operator/.downstream_manifests:138-138
Timestamp: 2025-10-16T13:44:49.727Z
Learning: Repo: kubev2v/forklift — For PRs that relax ForkliftController CRD types (e.g., boolean-like and counting fields), maintainers prefer no field-name changes. Typos like controller_snapshot_removal_timeout_minuts should be tracked in a separate follow-up issue rather than addressed in the same PR.

Issue 1: Stale Glance images cause infinite reconciliation loop.
Image names are deterministic (no migration-specific component), so
images from a previous failed migration persist in Glance as Active
without the forklift_original_volume_id property. The controller
skips volume creation (image exists), but updateImageProperty cannot
set the property (backing volume is gone) and silently returns nil.
The reconcile loop repeats forever.

Fix 1:
- Detect and delete stale images (Active, missing property) in
ensureVolumeFromSnapshot and ensureImageFromVolumeReady.
- Set the tracking property at image creation time in createImageFromVolume.
- Make updateImageProperty return an error when the volume is not found
  so callers can act on it.
- Defer cleanup goroutines in ensureImagesFromVolumesReady until all images are confirmed ready
  to prevent a race between cleanup and property lookup.

Issue 2: getVM/getImage/getVolume resolve the wrong OpenStack resource.
When both ID and Name are set, these functions first fetch by ID
correctly, then fall through to a name-based search that overwrites
the result.
OpenStack APIs use regex name matching, so "mtv-rhel8-sanity"
matches "mtv-rhel8-sanity-2ip-1nic".
PreTransferActions operates on the wrong VM, returns ready=true,
and the real VM transitions to CopyDisks with no PVCs created.

Fix 2: Return early after a successful ID-based lookup in getVM,
getImage, and getVolume. The name-based search is now only used as
a fallback when no ID is provided.

Ref: https://issues.redhat.com/browse/MTV-2631
Resolves: MTV-2631
Signed-off-by: Elad Hazan <ehazan@redhat.com>
@Hazanel Hazanel force-pushed the initialize-phase-timeout branch from 43fab8a to c9300df on March 1, 2026 08:06
@sonarqubecloud

sonarqubecloud bot commented Mar 1, 2026

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 10.28%. Comparing base (f1fe5d0) to head (c9300df).
⚠️ Report is 1862 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/plan/adapter/openstack/client.go 0.00% 45 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4969      +/-   ##
==========================================
- Coverage   15.45%   10.28%   -5.17%     
==========================================
  Files         112      470     +358     
  Lines       23377    53797   +30420     
==========================================
+ Hits         3613     5535    +1922     
- Misses      19479    47782   +28303     
- Partials      285      480     +195     
Flag Coverage Δ
unittests 10.28% <0.00%> (-5.17%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@Hazanel Hazanel requested review from mnecas and yaacov March 1, 2026 12:16
Member

@mnecas mnecas left a comment


lgtm

@mnecas mnecas merged commit a91f90f into kubev2v:main Mar 9, 2026
20 of 21 checks passed
@Hazanel Hazanel deleted the initialize-phase-timeout branch March 9, 2026 08:30
@Hazanel
Contributor Author

Hazanel commented Apr 16, 2026

/backport release-2.11

@github-actions

🔄 Starting backport of PR #4969 to release-2.11
🚀 Live mode

@github-actions

✅ PR #4969 backported to release-2.11.
