Fix race condition in packet capture e2e test file copy #7775

Open
luolanzone wants to merge 1 commit into antrea-io:main from luolanzone:fix-pc-flaky-test

luolanzone (Contributor) commented Feb 12, 2026

The packet capture e2e test was experiencing intermittent EOF errors when copying the pcap file from the Antrea agent Pod. This occurred because the test attempted to read the file immediately after the PacketCapture status was marked as "Complete", but before the file was fully written and flushed to disk.

These changes introduce retry logic during file copy operations to reduce
the likelihood of packet count mismatches in timeout-based tests.

Observed failure:

https://github.com/antrea-io/antrea/actions/runs/21854354449/job/63068673123?pr=7769#step:8:4564

    packetcapture_test.go:1083: CR status not match, actual: {NumberCaptured:9 FilePath:sftp://10.96.4.69:22/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureComplete Status:True LastTransitionTime:2026-02-10 07:36:13 +0000 UTC Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureStarted Status:True LastTransitionTime:2026-02-10 07:35:57 +0000 UTC Reason:Started Message:} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:2026-02-10 07:36:13 +0000 UTC Reason:Succeed Message:}]}, expected: {NumberCaptured:10 FilePath:sftp://10.96.4.69:22/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureStarted Status:True LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason:Started Message:} {Type:PacketCaptureComplete Status:True LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason:Succeed Message:}]}
    I0210 07:36:13.213470   27001 framework.go:3138] Copying file "/tmp/antrea/packetcapture/packets/ipv4-icmp-timeout.pcapng" from Pod kube-system/antrea-agent-9ln8x
    packetcapture_test.go:1119: 
        	Error Trace:	/home/runner/work/antrea/antrea/test/e2e/packetcapture_test.go:1119
        	            				/home/runner/work/antrea/antrea/test/e2e/packetcapture_test.go:725
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout

luolanzone requested a review from Copilot February 12, 2026 07:21
luolanzone added the area/ops/packetcapture label (Issues or PRs related to the PacketCapture feature) Feb 12, 2026

Copilot AI left a comment

Pull request overview

Improves stability of the PacketCapture E2E tests by reducing intermittent EOF failures when copying the generated .pcapng from the Antrea agent Pod after the CR reaches Complete.

Changes:

  • Adds a fixed 2-second delay after waitForPacketCapture completes before copying the capture file.
  • Adds a 3-attempt retry loop with exponential backoff (1s, 2s) around copyPodFile.

Comment on lines +1108 to +1116

    for attempt := 0; attempt < 3; attempt++ {
        copyErr = data.copyPodFile(antreaPodName, "antrea-agent", "kube-system", packetFile, tmpDir)
        if copyErr == nil {
            break
        }
        if attempt < 2 {
            time.Sleep(time.Duration(1<<uint(attempt)) * time.Second)
        }
    }

Copilot AI Feb 12, 2026

The retry loop currently retries on any copyPodFile error. That can mask real failures (e.g. wrong Pod/container/permission issues) and delay fast feedback. Suggest restricting retries to the known transient symptom (EOF / unexpected EOF) and failing immediately for other errors (optionally logging intermediate retry errors).

The packet capture e2e test was experiencing intermittent EOF errors
when copying the pcap file from the Antrea agent Pod. This occurred
because the test attempted to read the file immediately after the
PacketCapture status was marked as "Complete", but before the file
was fully written and flushed to disk.

These changes introduce retry logic during file copy operations to reduce
the likelihood of packet count mismatches in timeout-based tests.

Signed-off-by: Lan Luo <lan.luo@broadcom.com>

    // Copy the pcap file from the agent Pod, retrying for a short bounded
    // period to handle the case where the file is still being written.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

nit: you can use t.Context() instead of context.Background()

    // period to handle the case where the file is still being written.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    err = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 5*time.Second, true, func(ctx context.Context) (bool, error) {

If you create a context with context.WithTimeout, you don't need to use PollUntilContextTimeout and repeat the timeout value; use PollUntilContextCancel instead.

3 participants