Fix race condition in packet capture e2e test file copy #7775
luolanzone wants to merge 1 commit into antrea-io:main
Pull request overview
Improves stability of the PacketCapture E2E tests by reducing intermittent EOF failures when copying the generated .pcapng from the Antrea agent Pod after the CR reaches Complete.
Changes:
- Adds a fixed 2-second delay after waitForPacketCapture completes before copying the capture file.
- Adds a 3-attempt retry loop with exponential backoff (1s, 2s) around copyPodFile.
test/e2e/packetcapture_test.go
Outdated
    for attempt := 0; attempt < 3; attempt++ {
        copyErr = data.copyPodFile(antreaPodName, "antrea-agent", "kube-system", packetFile, tmpDir)
        if copyErr == nil {
            break
        }
        if attempt < 2 {
            time.Sleep(time.Duration(1<<uint(attempt)) * time.Second)
        }
    }
The retry loop currently retries on any copyPodFile error. That can mask real failures (e.g. wrong Pod/container/permission issues) and delay fast feedback. Suggest restricting retries to the known transient symptom (EOF / unexpected EOF) and failing immediately for other errors (optionally logging intermediate retry errors).
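A minimal sketch of what this suggestion could look like, written as a standalone program rather than against the real test helpers. The names `retryOnEOF`, `isTransientCopyError`, `failNTimes`, `errTruncated`, and `errPermission` are hypothetical and only stand in for `data.copyPodFile` and its failures. The string fallback in `isTransientCopyError` is an assumption for the case where the copy helper formats the error text without wrapping `io.EOF` or `io.ErrUnexpectedEOF`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// Hypothetical errors used to simulate copy failures.
var (
	errTruncated  = fmt.Errorf("copying file: %w", io.ErrUnexpectedEOF)
	errPermission = errors.New("permission denied")
)

// isTransientCopyError reports whether err looks like the known transient
// symptom (EOF / unexpected EOF). The substring check is an assumption that
// covers errors whose text mentions EOF without wrapping the sentinel.
func isTransientCopyError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
		return true
	}
	return strings.Contains(err.Error(), "EOF")
}

// retryOnEOF calls copyFn up to maxAttempts times. It sleeps
// baseDelay, 2*baseDelay, ... between attempts, and returns immediately
// when copyFn fails with an error that is not EOF-like.
func retryOnEOF(maxAttempts int, baseDelay time.Duration, copyFn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = copyFn(); err == nil {
			return nil
		}
		if !isTransientCopyError(err) {
			return fmt.Errorf("non-retryable copy error: %w", err)
		}
		if attempt < maxAttempts-1 {
			// Log the intermediate error so retries stay visible.
			fmt.Printf("copy attempt %d failed, retrying: %v\n", attempt+1, err)
			time.Sleep(baseDelay << uint(attempt))
		}
	}
	return fmt.Errorf("copy still failing after %d attempts: %w", maxAttempts, err)
}

// failNTimes returns a fake copy function that fails n times with failErr
// and then succeeds, plus a pointer to its call counter.
func failNTimes(n int, failErr error) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return failErr
		}
		return nil
	}, &calls
}

func main() {
	// Two truncated reads followed by success: the loop retries.
	fn, calls := failNTimes(2, errTruncated)
	err := retryOnEOF(3, 10*time.Millisecond, fn)
	fmt.Printf("transient: calls=%d err=%v\n", *calls, err)

	// A permission error is not EOF-like, so the loop gives up at once.
	fn, calls = failNTimes(2, errPermission)
	err = retryOnEOF(3, 10*time.Millisecond, fn)
	fmt.Printf("permanent: calls=%d err=%v\n", *calls, err)
}
```

In the test itself, `copyFn` would wrap the existing `data.copyPodFile(...)` call. Whether the error it returns can be matched with `errors.Is` or only by message would need to be checked against the helper.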
The packet capture e2e test was experiencing intermittent EOF errors when copying the pcap file from the Antrea agent Pod. This occurred because the test attempted to read the file immediately after the PacketCapture status was marked as "Complete", but before the file was fully written and flushed to disk.

These changes introduce retry logic during file copy operations to reduce the likelihood of packet count mismatches in timeout-based tests.

Signed-off-by: Lan Luo <lan.luo@broadcom.com>
ee280ee to 63a07f9
    // Copy the pcap file from the agent Pod, retrying for a short bounded
    // period to handle the case where the file is still being written.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
nit: you can use t.Context() instead of context.Background()
    // period to handle the case where the file is still being written.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    err = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 5*time.Second, true, func(ctx context.Context) (bool, error) {
If you create a context with context.WithTimeout, you don't need to use PollUntilContextTimeout and repeat the timeout value; use PollUntilContextCancel instead.
The packet capture e2e test was experiencing intermittent EOF errors when copying the pcap file from the Antrea agent Pod. This occurred because the test attempted to read the file immediately after the PacketCapture status was marked as "Complete", but before the file was fully written and flushed to disk.
These changes introduce retry logic during file copy operations to reduce
the likelihood of packet count mismatches in timeout-based tests.
Observed failure:
https://github.com/antrea-io/antrea/actions/runs/21854354449/job/63068673123?pr=7769#step:8:4564