
balancer/pickfirst: fix flaky metrics tests by awaiting async metrics#9020

Open
mbissa wants to merge 3 commits into grpc:master from mbissa:fix-flaky-pickfirst-metrics-test

Conversation

@mbissa
Contributor

@mbissa mbissa commented Mar 25, 2026

Fixes #9012

Fixed a race condition in the pickfirst metrics tests where assertions polled the stats.TestMetricsRecorder immediately. Depending on goroutine scheduling, asynchronous metrics such as 'grpc.subchannel.open_connections' were occasionally not fully recorded before the test checked them, causing flaky failures. This change introduces an awaitMetric helper that waits for metrics to reach their expected values.

RELEASE NOTES: none

@mbissa mbissa added this to the 1.81 Release milestone Mar 25, 2026
@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.96%. Comparing base (52bf4d9) to head (d5d8759).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9020      +/-   ##
==========================================
- Coverage   83.05%   82.96%   -0.10%     
==========================================
  Files         411      411              
  Lines       32918    32931      +13     
==========================================
- Hits        27340    27320      -20     
- Misses       4184     4204      +20     
- Partials     1394     1407      +13     

see 22 files with indirect coverage changes


t.Fatalf("EmptyCall() failed: %v", err)
}

if got, _ := tmr.Metric("grpc.lb.pick_first.connection_attempts_succeeded"); got != 1 {
Contributor
All pick_first metrics (I'm not sure about subchannel metrics) MUST be emitted before an RPC returns, as the LB policy is responsible for making the channel active (which is what allows the RPC to succeed). If the metrics aren't emitted in this order, it likely indicates a bug causing race conditions within the channel. Because of this, we shouldn't need to change the validation logic. Could you please revert the changes to the pick_first metrics?

Contributor Author
Reverted the pick_first metrics, and moved the metrics that do not need waiting out of the new construct.

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Mar 26, 2026
@mbissa mbissa assigned arjan-bal and unassigned mbissa Mar 26, 2026
return gotMetrics
}

func awaitMetric(ctx context.Context, t *testing.T, tmr *stats.TestMetricsRecorder, name string, want float64) {
Contributor
Instead of frequent polling, we should use channels to efficiently block the test until the expected metric is available. The TestMetricsRecorder (TMR) already has a method that allows waiting for a particular metric to be emitted:

func (r *TestMetricsRecorder) WaitForFloat64Count(ctx context.Context, metricsDataWant MetricsData) error {
	got, err := r.floatCountCh.Receive(ctx)
	if err != nil {
		return fmt.Errorf("timeout waiting for float64Count")
	}
	metricsDataGot := got.(MetricsData)
	if diff := cmp.Diff(metricsDataGot, metricsDataWant); diff != "" {
		return fmt.Errorf("float64count metricsData received unexpected value (-got, +want): %v", diff)
	}
	return nil
}

However, it currently has a few issues that make it difficult to use:

  1. Strict Ordering: It fails if the very next metric emitted doesn't match the expected one. When writing e2e tests, it is difficult to account for all metrics and the exact order in which they will be emitted.
  2. Data Loss: It consumes and discards buffered metrics while waiting.

Currently, the only users of these WaitForX methods seem to be the metrics recorder list tests. I think we should improve this test utility to make it more ergonomic for other e2e tests to use.

I propose we combine the logic of the Metric() and WaitForX methods.

  • To address the first issue, we can update the WaitForX method to keep reading from the channel until either the context expires or a metric with a matching name is found for comparison.
  • To address the second issue, we can store all read MetricsData in the TMR's inner map. When WaitForX is called, it can first check this map for the required data; if it's found, it can proceed with the match directly without waiting on the channel.

What do you think?

Contributor Author

Sounds like a step in the right direction. What do you think about the following, @arjan-bal:

  1. Instead of WaitForX consuming the channel directly, the TMR will internally "pump" all incoming metrics into a thread-safe map.
  2. The internal map will store a slice of values (map[string][]float64) instead of just the latest value. This lets tests verify state transitions (e.g., that a counter hit 1 and then 2), ensures nothing is discarded, and, when races mean more than one value may arrive, lets users decide whether they want to read the first value or the last one.
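A sketch of the proposed map[string][]float64 storage, again with a simplified, hypothetical recorder in place of the real TestMetricsRecorder:

```go
package main

import (
	"fmt"
	"sync"
)

// recorder sketches the proposed design: every recorded value is
// appended to a per-metric slice, so no emission is ever discarded
// and the full history of each metric is visible to tests.
type recorder struct {
	mu   sync.Mutex
	data map[string][]float64
}

func (r *recorder) record(name string, v float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[name] = append(r.data[name], v)
}

// values returns every value recorded for name, in emission order,
// letting tests assert on state transitions (e.g. 1 then 2).
func (r *recorder) values(name string) []float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]float64(nil), r.data[name]...)
}

func main() {
	r := &recorder{data: make(map[string][]float64)}
	r.record("grpc.subchannel.open_connections", 1)
	r.record("grpc.subchannel.open_connections", 2)
	fmt.Println(r.values("grpc.subchannel.open_connections")) // [1 2]
}
```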

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Mar 26, 2026


Development

Successfully merging this pull request may close these issues.

Flaky test: Test/PickFirstMetrics

3 participants