Skip to content

experimental/stats: Expose Telemetry Label Callback#8877

Open
seth-epps wants to merge 13 commits intogrpc:masterfrom
seth-epps:i-8682/expose-recorder
Open

experimental/stats: Expose Telemetry Label Callback#8877
seth-epps wants to merge 13 commits intogrpc:masterfrom
seth-epps:i-8682/expose-recorder

Conversation

@seth-epps
Copy link
Copy Markdown

Fixes #8682

Expose a new experimental API for registering a telemetry label callback function.

Some clients may not be instrumented with opentelemetry which restricts valuable information from being propagated to stats handlers. This gives clients the ability to collect otel labels by registering a label callback on the context and collecting the information themselves in their stats handlers.

RELEASE NOTES:

  • experimental/stats: Expose Telemetry Label Callback

@seth-epps seth-epps force-pushed the i-8682/expose-recorder branch from 09e6612 to 6dda7c7 Compare February 2, 2026 22:27
@arjan-bal arjan-bal requested a review from mbissa February 3, 2026 07:44
@arjan-bal arjan-bal added the Type: Feature New features or improvements in behavior label Feb 3, 2026
@arjan-bal arjan-bal added this to the 1.80 Release milestone Feb 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.00%. Comparing base (c1a9239) to head (e5330db).
⚠️ Report is 84 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8877      +/-   ##
==========================================
+ Coverage   80.40%   83.00%   +2.60%     
==========================================
  Files         416      412       -4     
  Lines       33495    32967     -528     
==========================================
+ Hits        26930    27363     +433     
+ Misses       4682     4196     -486     
+ Partials     1883     1408     -475     
Files with missing lines Coverage Δ
experimental/stats/telemetry/labels.go 100.00% <100.00%> (ø)
internal/stats/labels.go 100.00% <100.00%> (ø)
internal/xds/balancer/clusterimpl/picker.go 94.80% <100.00%> (-0.32%) ⬇️
stats/opentelemetry/client_metrics.go 88.81% <100.00%> (+0.70%) ⬆️

... and 90 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread internal/xds/balancer/clusterimpl/picker.go Outdated
Comment thread experimental/stats/telemetry_test.go Outdated
Comment thread internal/stats/labels_test.go
Comment thread experimental/stats/telemetry.go Outdated
Comment thread experimental/stats/telemetry.go Outdated
@mbissa mbissa assigned seth-epps and unassigned mbissa Feb 5, 2026
mbissa

This comment was marked as outdated.

@mbissa mbissa requested review from arjan-bal and dfawley February 6, 2026 08:37
@mbissa mbissa assigned arjan-bal and dfawley and unassigned seth-epps Feb 6, 2026
@arjan-bal arjan-bal assigned mbissa and unassigned dfawley and arjan-bal Feb 10, 2026
Comment thread experimental/stats/telemetry.go Outdated
Comment thread experimental/stats/telemetry.go Outdated
Comment thread internal/stats/labels.go Outdated
labels["grpc.lb.locality"] = xdsinternal.LocalityString(lID)
labels["grpc.lb.backend_service"] = d.clusterName
}
stats.UpdateLabels(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we restrict UpdateLabels to only one invocation?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in two separate invocations because I wanted to ensure the callbacks were executed in the event of an error above (dropped). If I combine the two we'd potentially miss the label updates

@mbissa
Copy link
Copy Markdown
Contributor

mbissa commented Feb 16, 2026

Hey @seth-epps , apologies for the delay and thanks for your effort. After discussing with other maintainers, I have added a few more comments. Please have a look and let me know if you have any questions around them.

@mbissa mbissa assigned seth-epps and unassigned mbissa Feb 16, 2026
@seth-epps
Copy link
Copy Markdown
Author

@mbissa thanks! I got around to addressing the final review feedback and I also added some additional tests

@mbissa
Copy link
Copy Markdown
Contributor

mbissa commented Mar 17, 2026

LGTM, adding @easwars for second review.

@mbissa mbissa requested review from easwars and removed request for arjan-bal and dfawley March 17, 2026 03:34
@arjan-bal arjan-bal assigned easwars and unassigned mbissa Mar 17, 2026
Comment thread experimental/stats/telemetry/labels.go Outdated
Comment thread experimental/stats/telemetry/labels.go Outdated
Comment thread experimental/stats/telemetry/labels.go Outdated
Comment thread internal/stats/labels.go Outdated
Comment thread internal/stats/labels.go Outdated
Comment thread internal/stats/labels.go
Comment thread internal/stats/labels.go Outdated
if labels != nil {
maps.Copy(labels, d.telemetryLabels)
}
stats.UpdateLabels(info.Ctx, d.telemetryLabels)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, the labels in the context was updated. Now, I don't see that happening. I only see the callbacks being invoked with the updated labels. Is this by design? If so why? If not, (maybe I'm missing something), could you help me understand what is happening. Thanks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on a combination of the two suggestions to remove the explicit callbacks and centralize
https://github.com/grpc/grpc-go/pull/8877/changes/BASE..6480ae99785c39550ace7a31b8774a547c0f8a69#r2767046493

and then to consolidate the two ways of managing the context labels (callbacks and on context)
https://github.com/grpc/grpc-go/pull/8877/changes/BASE..5fa8022a210d0a2bd504ffe8cba411043248edfe#r2810799274

By putting everything into the common callback pattern there's no longer a need to attach them to the context directly and it's tracked by whatever registers them

// This test configures OpenTelemetry with the CSM Plugin Option, and xDS
// Optional Labels turned on. It then configures an interceptor to attach
// labels, representing the cluster_impl picker. It then makes a unary RPC, and
// Optional Labels turned on. It then configures a mock balancer that updates
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: there is no mocking happening here from what I can see.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock is probably the wrong word since it's more of a test balancer implementation. will change to "test"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 2544a57

// ensures that all label updates are propagated to the rpc attempt info across
// derived contexts.
ctx = istats.RegisterTelemetryLabelCallback(ctx, func(labels map[string]string) {
maps.Copy(ai.xdsLabels, labels)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, now I see what is happening in terms of how the labels added by the clusterimpl LB policy get to the metrics handler. Does this mean that now, we are incurring more copies (as compared to just writing to an existing map) for folks who don't even register this callback?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing is ever registered there shouldn't be any map copies since we only run the copy + callbacks if the context key contains something

	if callbacks, ok := ctx.Value(telemetryLabelCallbackKey{}).([]LabelCallback); ok {
		labelsCopy := map[string]string{}
		maps.Copy(labelsCopy, labels)
		for _, callback := range callbacks {
			callback(labelsCopy)
		}

	}

so anyone who doesn't use it (either through the otel middleware or exp package) shouldn't incur any penalty for copying the map

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, there was a single map in the context. I'm talking about the TelemetryLabels map inside of the Labels struct. This was being initialized by clientMetricsHandler.TagRPC. Then, clusterimpl would write to that single map. Then, the xdsLabels field inside of attemptInfo was a reference to the same underlying map, and would be read when the RPC completed.

Now, the xdsLabels map inside of attemptInfo is a map of its own. It is initialized in clientMetricsHandler.TagRPC. There is no map inside the context to store all the registered labels, but instead whenever anyone updates the labels, a copy of it is made and sent to all registered callbacks. One such callback is from the clientMetricsHandler. And in this callback, we are copying over values into ai.xdsLabels.

Seems mostly OK, but shouldn't we replacing ai.xdsLabels with the contents of labels in this callback?

The docstring for maps.Copy says:

Copy copies all key/value pairs in src adding them to dst. When a key in src is already present in dst, 
the value in dst will be overwritten by the value associated with the key in src.

So, this means that if a label key was removed in an update, that would continue to remain in ai.xdsLabels. But is removing a key something that can possibly happen? If that is something that cannot happen, that assumption needs to be documented somewhere.

@easwars easwars assigned seth-epps and unassigned easwars Mar 25, 2026
@seth-epps
Copy link
Copy Markdown
Author

That test failure looks like a transient / unrelated flake

@Pranjali-2501 Pranjali-2501 assigned easwars and unassigned seth-epps Mar 30, 2026
@seth-epps
Copy link
Copy Markdown
Author

@easwars Any chance you've had some time to re-review? 🙏

Comment thread internal/stats/labels.go
Comment on lines +58 to +59
// it does nothing. If any registered callback panics it will be swallowed and logged and
// continue running any other registered callbacks.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not handling panics anymore. So, this docstring needs to be updated.

Comment thread internal/stats/labels.go
Comment on lines +41 to +43
// RegisterTelemetryLabelCallback registers a callback function that is executed whenever
// telemetry labels are updated.
func RegisterTelemetryLabelCallback(ctx context.Context, callback LabelCallback) context.Context {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we have two ways of doing the same thing. This one and telemetry.NewContextWithLabelCallback. Why?

Comment thread internal/stats/labels.go
// ensures that all label updates are propagated to the rpc attempt info across
// derived contexts.
ctx = istats.RegisterTelemetryLabelCallback(ctx, func(labels map[string]string) {
maps.Copy(ai.xdsLabels, labels)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, there was a single map in the context. I'm talking about the TelemetryLabels map inside of the Labels struct. This was being initialized by clientMetricsHandler.TagRPC. Then, clusterimpl would write to that single map. Then, the xdsLabels field inside of attemptInfo was a reference to the same underlying map, and would be read when the RPC completed.

Now, the xdsLabels map inside of attemptInfo is a map of its own. It is initialized in clientMetricsHandler.TagRPC. There is no map inside the context to store all the registered labels, but instead whenever anyone updates the labels, a copy of it is made and sent to all registered callbacks. One such callback is from the clientMetricsHandler. And in this callback, we are copying over values into ai.xdsLabels.

Seems mostly OK, but shouldn't we replacing ai.xdsLabels with the contents of labels in this callback?

The docstring for maps.Copy says:

Copy copies all key/value pairs in src adding them to dst. When a key in src is already present in dst, 
the value in dst will be overwritten by the value associated with the key in src.

So, this means that if a label key was removed in an update, that would continue to remain in ai.xdsLabels. But is removing a key something that can possibly happen? If that is something that cannot happen, that assumption needs to be documented somewhere.

Comment thread internal/stats/labels.go
Comment on lines +61 to +62
// To ensure callbacks do not mutate the state of the provided label map it is copied
// before execution.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop this copy and instead document on LabelCallback that callbacks should not mutate the provided label map?

// resest the tracker at the end of every test
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
t.Cleanup(func() {
tracker = map[string]string{}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not move tracker into t.Run? That way, we can make the subtests hermetic.

tracker := map[string]string{}

tests := map[string]struct {
callbacks []func(map[string]string)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one callback registered by any of the tests. If that was intentional, this field doesn't have to be slice. If not, can we add a case for that?

// Create a copy to compare against after calling UpdateLabels.
wantLabels := map[string]string{"key1": "val1", "key2": "val2"}

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere, please consider having a timeout value of 5s or 10s. Github Actions can sometimes be notoriously slow.

@easwars
Copy link
Copy Markdown
Contributor

easwars commented Apr 14, 2026

@easwars Any chance you've had some time to re-review? 🙏

Apologies for the delay. I was finally able to make a full pass. Mostly minor comments this time around. We should be able to wrap this one up quickly. Thanks.

@github-actions
Copy link
Copy Markdown

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions Bot added the stale label Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose the labels package to use with custom stats handler

7 participants