OCPBUGS-66152: split grpc requests slow into read/write alerts #1532
```diff
@@ -41,22 +41,27 @@
   },
 },
 {
-  alert: 'etcdGRPCRequestsSlow',
-  expr: |||
-    histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
-    > on () group_left (type)
-    bottomk(1,
-    1.5 * group by (type) (cluster_infrastructure_provider{type="Azure"})
-    or
-    1 * group by (type) (cluster_infrastructure_provider))
-  |||,
-  'for': '30m',
+  alert: 'etcdGRPCReadRequestsSlow',
+  expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Range", grpc_type="unary"}[10m])) without(grpc_type)) > 3',
+  'for': '10m',
   labels: {
     severity: 'critical',
   },
   annotations: {
+    description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC read requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+    summary: 'etcd grpc read requests are slow',
+  },
+},
+{
+  alert: 'etcdGRPCWriteRequestsSlow',
+  expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+  'for': '10m',
+  labels: {
+    severity: 'critical',
+  },
+  annotations: {
-    description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.',
-    summary: 'etcd grpc requests are slow',
+    description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+    summary: 'etcd grpc write requests are slow',
```
Comment on lines +56 to +64

Analysis chain (scripts executed against the repository openshift/cluster-etcd-operator):

1. First, locate and read the file to understand context: `find . -name "custom.libsonnet" -type f` (length of output: 101)
2. Check file size and read the relevant section with context: `fd -e libsonnet | head -20` (length of output: 1790)
3. Find the custom.libsonnet file: `git ls-files | grep -i custom.libsonnet` (length of output: 99)
4. Read the custom.libsonnet file to understand context: `wc -l ./jsonnet/custom.libsonnet` (length of output: 105)
5. Read the file with context around lines 56-64: `cat -n ./jsonnet/custom.libsonnet | sed -n '40,75p'` (length of output: 2098)
6. Check if there are other alerts for Put or DeleteRange methods: `grep -n "Put\|DeleteRange" ./jsonnet/custom.libsonnet` (length of output: 57)
7. Check the upstream etcd mixin alerts to compare: `cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -A 5 "etcdGRPCWriteRequestsSlow" | head -20` (length of output: 57)
8. Search for write-related alerts in the upstream etcd mixin: `cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "write" -A 3 -B 1` (length of output: 865)
9. Get a broader view of what's in the upstream etcd mixin alerts: `wc -l ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet` (length of output: 157)
10. Read the full upstream etcd alerts file to find write-related alerts: `cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "grpc.*write\|txn\|put\|deleterange" -B 2 -A 5` (length of output: 57)
11. Get the entire alerts file to understand the pattern: `cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet` (length of output: 11491)
12. Read the entire custom.libsonnet to check if Put/DeleteRange are monitored elsewhere: `cat -n ./jsonnet/custom.libsonnet` (length of output: 10807)

Web query result: Summary: etcd KV gRPC write operations

Expand the write alert to include Put and DeleteRange methods. The alert is named …

Suggested fix:

```diff
- expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method=~"Txn|Put|DeleteRange", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
```

Contributor (Author):
You'd be surprised, but deletes are also implemented as Txn calls:

The current alert filtering for … My suggestion was incorrect. The alert as implemented properly captures all write operations that matter for the product.
```diff
   },
 },
 {
```
So we're getting rid of the inflated threshold for Azure but bumping it overall, right?
yep, but specific for the read and write paths we care about
I'm more happy about getting rid of the Azure special case. Hopefully we don't have to add it back eventually.
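
For readers weighing the new thresholds, here is a rough sketch of how a `histogram_quantile(0.99, ...)` estimate comes out of cumulative latency buckets and how it compares against the `> 3` (read) and `> 5` (write) limits in the diff. It is a simplified, single-series approximation of the linear interpolation Prometheus applies to classic histograms, not the actual implementation. The bucket bounds, the counts, and the names `estimate_quantile`, `READ_THRESHOLD_S`, and `WRITE_THRESHOLD_S` are made up for illustration, and the `'for': '10m'` hold time is not modelled.

```python
import math

# Thresholds in seconds, taken from the two alert expressions in the diff.
READ_THRESHOLD_S = 3
WRITE_THRESHOLD_S = 5


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Simplified sketch: find the first bucket whose cumulative count reaches
    the target rank, then interpolate linearly inside that bucket.
    """
    buckets = sorted(buckets)
    # A usable histogram needs a +Inf bucket holding the total count.
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    # The first bucket is assumed to start at 0 with nothing below it.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical per-second rates per bucket (upper bound in seconds, cumulative count).
example_buckets = [
    (0.5, 900), (1, 950), (2, 980), (4, 995), (8, 1000), (math.inf, 1000),
]

p99 = estimate_quantile(0.99, example_buckets)
read_condition_met = p99 > READ_THRESHOLD_S
write_condition_met = p99 > WRITE_THRESHOLD_S
print(p99, read_condition_met, write_condition_met)
```

With these made-up numbers the estimate lands between the 2 s and 4 s bucket bounds, so the same latency profile would satisfy the read condition but not the write one. The estimate can only be as precise as the bucket layout allows, which is worth keeping in mind when picking thresholds.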