Fix flaky FlowAggregator E2E tests #7741

Merged
antoninbas merged 1 commit into antrea-io:main from petertran-avgo:topic/petertran/7740/flakey-e2e on Feb 9, 2026

Conversation

petertran-avgo (Contributor) commented Jan 28, 2026

Fix flaky FlowAggregator E2E tests by confirming FlowAggregator metrics (#7740)

The FlowAggregator SecureConnection tests were flaky. Flows took a
long time to arrive at the FlowAggregator, and when they did, only
1 flow arrived instead of the expected 3. The Delta count and Total
count were always equal because the FlowExporter buffered the flow
data before sending it.

The root cause was that the FlowAggregator was redeployed for each
subtest, and the Agent was slow to reconnect to the freshly deployed
FlowAggregator. As a result, the FlowExporter's behavior didn't match
the timings expected by the tests.

The test setup now checks FlowAggregator metrics after each
redeployment. This ensures the FlowExporter has reconnected to the new
FlowAggregator before testing begins.

Closes #7740

petertran-avgo force-pushed the topic/petertran/7740/flakey-e2e branch 8 times, most recently from da6e8dd to 1852223, on February 2, 2026 16:18
petertran-avgo changed the title from "DRAFT: debug: trigger exporter process?" to "Fix flaky FlowAggregator E2E tests" on Feb 2, 2026
t.Run(o.name, func(t *testing.T) {
	var err error
	data, v4Enabled, v6Enabled := setupFlowAggregatorTest(t, o.flowVisibilityTestOptions)
	_, _, stderr, err := data.RunCommandOnNode(controlPlaneNodeName(), "kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent --field-selector spec.nodeName="+controlPlaneNodeName())
Contributor

We usually avoid running commands on the Node; we prefer to use the K8s API instead.

However, in this case, how can we make the tests more robust without restarting Agent Pods? Would it help to call getAndCheckFlowAggregatorMetrics for each sub-test?

Contributor Author

I didn't notice that helper, nice! I was able to re-use it, and it stabilized the tests. I also realized it is called after all the setupFlowAggregator calls, so I did a small refactor as well.

petertran-avgo force-pushed the topic/petertran/7740/flakey-e2e branch from fb05558 to f896705 on February 3, 2026 14:00
Contributor

@antoninbas antoninbas left a comment

Sorry for the delay. The change looks good to me, but I don't think we need to introduce withClickHouseExporter. Instead, we could use databaseURL != "" to determine whether CH is enabled, as we already use that same condition in a couple of places.
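
To make the suggestion concrete, here is a minimal sketch of deriving the ClickHouse setting from the URL instead of carrying a separate flag. The `aggregatorOptions` type and `clickHouseEnabled` method are hypothetical names used only for illustration; only the `databaseURL != ""` condition comes from the comment above.

```go
package main

// aggregatorOptions is a hypothetical options struct; the real test code may
// pass databaseURL around differently.
type aggregatorOptions struct {
	databaseURL string
}

// clickHouseEnabled reports whether a ClickHouse database URL was configured.
// An empty URL is taken to mean the ClickHouse exporter is disabled, so no
// separate boolean needs to be kept in sync with the URL.
func (o aggregatorOptions) clickHouseEnabled() bool {
	return o.databaseURL != ""
}
```

Deriving the flag this way avoids a second source of truth that could disagree with the URL.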

…cs (antrea-io#7740)

Signed-off-by: Peter Tran <peter-pt.tran@broadcom.com>
petertran-avgo force-pushed the topic/petertran/7740/flakey-e2e branch from f896705 to 7ba989b on February 9, 2026 13:32
@antoninbas
Contributor

/test-e2e

@antoninbas antoninbas merged commit c504567 into antrea-io:main Feb 9, 2026
57 of 65 checks passed
@petertran-avgo
Contributor Author

@antoninbas Thanks so much for reviewing and merging my PR. Is this worth cherry-picking to the release branches? I've been hit by this flake there too.

petertran-avgo deleted the topic/petertran/7740/flakey-e2e branch on February 10, 2026 12:23
antoninbas added the kind/failing-test, area/flow-visibility/aggregator, and action/backport labels on Feb 12, 2026
petertran-avgo added a commit to petertran-avgo/antrea that referenced this pull request Feb 12, 2026
… (antrea-io#7741)

petertran-avgo added a commit to petertran-avgo/antrea that referenced this pull request Feb 12, 2026
… (antrea-io#7741)

antoninbas pushed a commit that referenced this pull request Feb 27, 2026
#7778)

petertran-avgo added a commit to petertran-avgo/antrea that referenced this pull request Mar 2, 2026
… (antrea-io#7741)


Labels

action/backport: Indicates a PR that requires backports.
area/flow-visibility/aggregator: Issues or PRs related to Flow Aggregator.
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.

Development

Successfully merging this pull request may close these issues.

Flakey Test: E2e tests on a Kind cluster on Linux for Flow Visibility (ipfix)
