experimental/stats: Expose Telemetry Label Callback#8877
experimental/stats: Expose Telemetry Label Callback#8877seth-epps wants to merge 13 commits intogrpc:masterfrom
Conversation
09e6612 to
6dda7c7
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8877 +/- ##
==========================================
+ Coverage 80.40% 83.00% +2.60%
==========================================
Files 416 412 -4
Lines 33495 32967 -528
==========================================
+ Hits 26930 27363 +433
+ Misses 4682 4196 -486
+ Partials 1883 1408 -475
🚀 New features to boost your workflow:
|
| labels["grpc.lb.locality"] = xdsinternal.LocalityString(lID) | ||
| labels["grpc.lb.backend_service"] = d.clusterName | ||
| } | ||
| stats.UpdateLabels( |
There was a problem hiding this comment.
Can we restrict UpdateLabels to only one invocation?
There was a problem hiding this comment.
I did this in two separate invocations because I wanted to ensure the callbacks were executed in the event of an error above (dropped). If I combine the two we'd potentially miss the label updates
|
Hey @seth-epps , apologies for the delay and thanks for your effort. After discussing with other maintainers, I have added a few more comments. Please have a look and let me know if you have any questions around them. |
|
@mbissa thanks! I got around to addressing the final review feedback and I also added some additional tests |
|
LGTM, adding @easwars for second review. |
| if labels != nil { | ||
| maps.Copy(labels, d.telemetryLabels) | ||
| } | ||
| stats.UpdateLabels(info.Ctx, d.telemetryLabels) |
There was a problem hiding this comment.
Earlier, the labels in the context was updated. Now, I don't see that happening. I only see the callbacks being invoked with the updated labels. Is this by design? If so why? If not, (maybe I'm missing something), could you help me understand what is happening. Thanks.
There was a problem hiding this comment.
This is based on a combination of the two suggestions to remove the explicit callbacks and centralize
https://github.com/grpc/grpc-go/pull/8877/changes/BASE..6480ae99785c39550ace7a31b8774a547c0f8a69#r2767046493
and then to consolidate the two ways of managing the context labels (callbacks and on context)
https://github.com/grpc/grpc-go/pull/8877/changes/BASE..5fa8022a210d0a2bd504ffe8cba411043248edfe#r2810799274
By putting everything into the common callback pattern there's no longer a need to attach them to the context directly and it's tracked by whatever registers them
| // This test configures OpenTelemetry with the CSM Plugin Option, and xDS | ||
| // Optional Labels turned on. It then configures an interceptor to attach | ||
| // labels, representing the cluster_impl picker. It then makes a unary RPC, and | ||
| // Optional Labels turned on. It then configures a mock balancer that updates |
There was a problem hiding this comment.
Nit: there is no mocking happening here from what I can see.
There was a problem hiding this comment.
mock is probably the wrong word since it's more of a test balancer implementation. will change to "test"
| // ensures that all label updates are propagated to the rpc attempt info across | ||
| // derived contexts. | ||
| ctx = istats.RegisterTelemetryLabelCallback(ctx, func(labels map[string]string) { | ||
| maps.Copy(ai.xdsLabels, labels) |
There was a problem hiding this comment.
Ah ok, now I see what is happening in terms of how the labels added by the clusterimpl LB policy get to the metrics handler. Does this mean that now, we are incurring more copies (as compared to just writing to an existing map) for folks who don't even register this callback?
There was a problem hiding this comment.
If nothing is ever registered there shouldn't be any map copies since we only run the copy + callbacks if the context key contains something
if callbacks, ok := ctx.Value(telemetryLabelCallbackKey{}).([]LabelCallback); ok {
labelsCopy := map[string]string{}
maps.Copy(labelsCopy, labels)
for _, callback := range callbacks {
callback(labelsCopy)
}
}so anyone who doesn't use it (either through the otel middleware or exp package) shouldn't incur any penalty for copying the map
There was a problem hiding this comment.
Earlier, there was a single map in the context. I'm talking about the TelemetryLabels map inside of the Labels struct. This was being initialized by clientMetricsHandler.TagRPC. Then, clusterimpl would write to that single map. Then, the xdsLabels field inside of attemptInfo was a reference to the same underlying map, and would be read when the RPC completed.
Now, the xdsLabels map inside of attemptInfo is a map of its own. It is initialized in clientMetricsHandler.TagRPC. There is no map inside the context to store all the registered labels, but instead whenever anyone updates the labels, a copy of it is made and sent to all registered callbacks. One such callback is from the clientMetricsHandler. And in this callback, we are copying over values into ai.xdsLabels.
Seems mostly OK, but shouldn't we replacing ai.xdsLabels with the contents of labels in this callback?
The docstring for maps.Copy says:
Copy copies all key/value pairs in src adding them to dst. When a key in src is already present in dst,
the value in dst will be overwritten by the value associated with the key in src.
So, this means that if a label key was removed in an update, that would continue to remain in ai.xdsLabels. But is removing a key something that can possibly happen? If that is something that cannot happen, that assumption needs to be documented somewhere.
|
That test failure looks like a transient / unrelated flake |
|
@easwars Any chance you've had some time to re-review? 🙏 |
| // it does nothing. If any registered callback panics it will be swallowed and logged and | ||
| // continue running any other registered callbacks. |
There was a problem hiding this comment.
We are not handling panics anymore. So, this docstring needs to be updated.
| // RegisterTelemetryLabelCallback registers a callback function that is executed whenever | ||
| // telemetry labels are updated. | ||
| func RegisterTelemetryLabelCallback(ctx context.Context, callback LabelCallback) context.Context { |
There was a problem hiding this comment.
Now, we have two ways of doing the same thing. This one and telemetry.NewContextWithLabelCallback. Why?
| // ensures that all label updates are propagated to the rpc attempt info across | ||
| // derived contexts. | ||
| ctx = istats.RegisterTelemetryLabelCallback(ctx, func(labels map[string]string) { | ||
| maps.Copy(ai.xdsLabels, labels) |
There was a problem hiding this comment.
Earlier, there was a single map in the context. I'm talking about the TelemetryLabels map inside of the Labels struct. This was being initialized by clientMetricsHandler.TagRPC. Then, clusterimpl would write to that single map. Then, the xdsLabels field inside of attemptInfo was a reference to the same underlying map, and would be read when the RPC completed.
Now, the xdsLabels map inside of attemptInfo is a map of its own. It is initialized in clientMetricsHandler.TagRPC. There is no map inside the context to store all the registered labels, but instead whenever anyone updates the labels, a copy of it is made and sent to all registered callbacks. One such callback is from the clientMetricsHandler. And in this callback, we are copying over values into ai.xdsLabels.
Seems mostly OK, but shouldn't we replacing ai.xdsLabels with the contents of labels in this callback?
The docstring for maps.Copy says:
Copy copies all key/value pairs in src adding them to dst. When a key in src is already present in dst,
the value in dst will be overwritten by the value associated with the key in src.
So, this means that if a label key was removed in an update, that would continue to remain in ai.xdsLabels. But is removing a key something that can possibly happen? If that is something that cannot happen, that assumption needs to be documented somewhere.
| // To ensure callbacks do not mutate the state of the provided label map it is copied | ||
| // before execution. |
There was a problem hiding this comment.
Can we drop this copy and instead document on LabelCallback that callbacks should not mutate the provided label map?
| // resest the tracker at the end of every test | ||
| ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) | ||
| t.Cleanup(func() { | ||
| tracker = map[string]string{} |
There was a problem hiding this comment.
Why not move tracker into t.Run? That way, we can make the subtests hermetic.
| tracker := map[string]string{} | ||
|
|
||
| tests := map[string]struct { | ||
| callbacks []func(map[string]string) |
There was a problem hiding this comment.
There is only one callback registered by any of the tests. If that was intentional, this field doesn't have to be slice. If not, can we add a case for that?
| // Create a copy to compare against after calling UpdateLabels. | ||
| wantLabels := map[string]string{"key1": "val1", "key2": "val2"} | ||
|
|
||
| ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) |
There was a problem hiding this comment.
Here and elsewhere, please consider having a timeout value of 5s or 10s. Github Actions can sometimes be notoriously slow.
Apologies for the delay. I was finally able to make a full pass. Mostly minor comments this time around. We should be able to wrap this one up quickly. Thanks. |
|
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
Fixes #8682
Expose a new experimental API for registering a telemetry label callback function.
Some clients may not be instrumented with opentelemetry which restricts valuable information from being propagated to stats handlers. This gives clients the ability to collect otel labels by registering a label callback on the context and collecting the information themselves in their stats handlers.
RELEASE NOTES: