Fix flaky FlowAggregator E2E tests #7741
test/e2e/flowaggregator_test.go
Outdated
t.Run(o.name, func(t *testing.T) {
	var err error
	data, v4Enabled, v6Enabled := setupFlowAggregatorTest(t, o.flowVisibilityTestOptions)
	_, _, stderr, err := data.RunCommandOnNode(controlPlaneNodeName(), "kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent --field-selector spec.nodeName="+controlPlaneNodeName())
We usually avoid running commands on the Node; we prefer to use the K8s API instead.
However, in this case, how can we make the tests more robust without restarting Agent Pods? Would it help to call getAndCheckFlowAggregatorMetrics for each sub-test?
I didn't notice that helper, nice! I was able to re-use it and it stabilized the tests. I also realized it is called after all the setupFlowAggregator calls so I did a small refactor as well.
antoninbas
left a comment
Sorry for the delay. The change looks good to me, but I don't think we need to introduce withClickHouseExporter. Instead, we could use databaseURL != "" to determine whether CH is enabled, as we already use that same condition in a couple of places.
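A minimal sketch of that suggestion, for illustration only. The `testOptions` struct and its `clickHouseEnabled` method are hypothetical stand-ins and are not taken from the test code; only the `databaseURL != ""` condition comes from the comment above. The URL in `main` is a placeholder.

```go
package main

import "fmt"

// testOptions is a hypothetical, trimmed-down stand-in for the options
// passed to the FlowAggregator test setup.
type testOptions struct {
	// databaseURL is empty when the ClickHouse exporter is not used.
	databaseURL string
}

// clickHouseEnabled derives the ClickHouse state from databaseURL, so no
// separate boolean option has to be kept in sync with it.
func (o testOptions) clickHouseEnabled() bool {
	return o.databaseURL != ""
}

func main() {
	withCH := testOptions{databaseURL: "tcp://clickhouse.example:9000"} // placeholder URL
	withoutCH := testOptions{}

	fmt.Println(withCH.clickHouseEnabled())    // true
	fmt.Println(withoutCH.clickHouseEnabled()) // false
}
```

Deriving the state from a single field avoids the case where a dedicated flag and the URL disagree.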
…cs (antrea-io#7740) The FlowAggregator SecureConnection tests were flaky. Flows took a long time to arrive in the FlowAggregator. When they did, it was only 1 flow instead of the expected 3. The Delta count and Total count were always equal due to the FlowExporter buffering the flow data before sending it. This was caused by the fact that FlowAggregator was redeployed for each subtest. The Agent was slow to reconnect to the freshly deployed FlowAggregator. Thus, the FlowExporter didn't match the expected timings of the tests. The test setup was updated to check FlowAggregator metrics after each redeployment. This ensures the FlowExporter has reconnected to the new FlowAggregator before testing. Signed-off-by: Peter Tran <peter-pt.tran@broadcom.com>
/test-e2e
@antoninbas Thanks so much for reviewing and merging my PR. Is this worth cherry-picking to the release branches? I've been hit with this flake there too.
Fix flaky FlowAggregator E2E tests by confirming FlowAggregator metrics (#7740)
The FlowAggregator SecureConnection tests were flaky. Flows took a
long time to arrive in the FlowAggregator. When they did, it was only
1 flow instead of the expected 3. The Delta count and Total count were
always equal due to the FlowExporter buffering the flow data before
sending it.
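To illustrate why a single late record makes the two counters equal, here is a small sketch. It assumes the delta count is the difference between successive cumulative totals reported for the same flow, and that "1 flow" above means a single record. The `report` type and `deltas` function are hypothetical and not part of the FlowExporter.

```go
package main

import "fmt"

// report is a simplified flow record (hypothetical): total is the
// cumulative packet count at the time the record is sent.
type report struct {
	total uint64
}

// deltas returns, for each report, the count accumulated since the
// previous report of the same flow.
func deltas(reports []report) []uint64 {
	out := make([]uint64, len(reports))
	var prev uint64
	for i, r := range reports {
		out[i] = r.total - prev
		prev = r.total
	}
	return out
}

func main() {
	// Three timely records: only the first delta equals its total.
	fmt.Println(deltas([]report{{10}, {25}, {40}})) // [10 15 15]

	// One buffered record covering the whole flow: delta equals total.
	fmt.Println(deltas([]report{{40}})) // [40]
}
```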
This was caused by the fact that FlowAggregator was redeployed for
each subtest. The Agent was slow to reconnect to the freshly deployed
FlowAggregator. Thus, the FlowExporter didn't match the expected timings
of the tests.
The test setup was updated to check FlowAggregator metrics after each
redeployment. This ensures the FlowExporter has reconnected to the new
FlowAggregator before testing.
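The idea of waiting for the exporters after each redeployment can be sketched as a simple poll loop. This is not the actual test helper: `waitForExporters` and `exporterCountFunc` are hypothetical names, and the getter stands in for whatever reads the connection count from the FlowAggregator metrics.

```go
package main

import (
	"fmt"
	"time"
)

// exporterCountFunc is a hypothetical stand-in for reading the number of
// connected FlowExporters from the FlowAggregator metrics.
type exporterCountFunc func() (int, error)

// waitForExporters polls get until it reports at least want connections,
// or returns an error after maxAttempts polls.
func waitForExporters(get exporterCountFunc, want, maxAttempts int, interval time.Duration) error {
	var lastErr error
	last := 0
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		n, err := get()
		if err == nil && n >= want {
			return nil
		}
		last, lastErr = n, err
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	if lastErr != nil {
		return fmt.Errorf("exporters not connected after %d attempts: %w", maxAttempts, lastErr)
	}
	return fmt.Errorf("exporters not connected after %d attempts: got %d, want %d", maxAttempts, last, want)
}

func main() {
	// Simulated aggregator: the Agents reconnect on the third poll.
	calls := 0
	get := func() (int, error) {
		calls++
		if calls < 3 {
			return 0, nil
		}
		return 2, nil
	}
	if err := waitForExporters(get, 2, 10, 10*time.Millisecond); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Printf("ready after %d polls\n", calls)
}
```

Running a check of this kind inside the setup for each sub-test means traffic is only generated once the exporters are connected to the new FlowAggregator instance.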
Closes #7740