Merged
29 changes: 17 additions & 12 deletions jsonnet/custom.libsonnet
@@ -41,22 +41,27 @@
  },
  },
  {
- alert: 'etcdGRPCRequestsSlow',
- expr: |||
- histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
- > on () group_left (type)
- bottomk(1,
- 1.5 * group by (type) (cluster_infrastructure_provider{type="Azure"})
- or
- 1 * group by (type) (cluster_infrastructure_provider))
- |||,
Comment on lines -47 to -52
Contributor

So we're getting rid of the inflated threshold for Azure but bumping it overall, right?

Contributor Author

yep, but specific for the read and write paths we care about

Contributor

I'm more happy about getting rid of the Azure special case. Hopefully we don't have to add it back eventually.
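
To make the threshold change discussed in this thread concrete, here is a minimal Python sketch of the old and new alert conditions. It is based on one reading of the removed expression (the `bottomk(1, ... or ...)` clause resolving to 1.5 s on Azure and 1 s elsewhere) and it ignores the `for` durations (30m before, 10m after). All function and constant names are hypothetical and not part of the PR.

```python
# Sketch only: models the alert conditions, not the Prometheus evaluation itself.

# Assumption: the removed rule compared the p99 latency against a per-cluster
# threshold that was inflated on Azure.
def old_threshold_seconds(provider_type: str) -> float:
    return 1.5 if provider_type == "Azure" else 1.0


# New rules: one fixed threshold per gRPC method, no provider special case.
NEW_THRESHOLDS_SECONDS = {"Range": 3.0, "Txn": 5.0}


def old_rule_fires(p99_seconds: float, grpc_method: str, provider_type: str) -> bool:
    # The old selector excluded only Defragment (grpc_method!="Defragment").
    if grpc_method == "Defragment":
        return False
    return p99_seconds > old_threshold_seconds(provider_type)


def new_rule_fires(p99_seconds: float, grpc_method: str) -> bool:
    # Methods other than Range (read) and Txn (write) are no longer covered.
    threshold = NEW_THRESHOLDS_SECONDS.get(grpc_method)
    return threshold is not None and p99_seconds > threshold
```

Under this reading, a p99 of 1.2 s on `Range` would have met the old condition on a non-Azure cluster but does not meet the new one, which only triggers above 3 s for reads and 5 s for writes.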

- 'for': '30m',
+ alert: 'etcdGRPCReadRequestsSlow',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Range", grpc_type="unary"}[10m])) without(grpc_type)) > 3',
+ 'for': '10m',
  labels: {
  severity: 'critical',
  },
  annotations: {
+ description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC read requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+ summary: 'etcd grpc read requests are slow',
+ },
+ },
+ {
+ alert: 'etcdGRPCWriteRequestsSlow',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+ 'for': '10m',
+ labels: {
+ severity: 'critical',
+ },
+ annotations: {
- description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.',
- summary: 'etcd grpc requests are slow',
+ description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+ summary: 'etcd grpc write requests are slow',
Comment on lines +56 to +64

@coderabbitai (bot), Jan 22, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file to understand context
find . -name "custom.libsonnet" -type f

Repository: openshift/cluster-etcd-operator

Length of output: 101


🏁 Script executed:

# Check file size and read the relevant section with context
fd -e libsonnet | head -20

Repository: openshift/cluster-etcd-operator

Length of output: 1790


🏁 Script executed:

# Find the custom.libsonnet file
git ls-files | grep -i custom.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 99


🏁 Script executed:

# Read the custom.libsonnet file to understand context
wc -l ./jsonnet/custom.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 105


🏁 Script executed:

# Read the file with context around lines 56-64
cat -n ./jsonnet/custom.libsonnet | sed -n '40,75p'

Repository: openshift/cluster-etcd-operator

Length of output: 2098


🏁 Script executed:

# Check if there are other alerts for Put or DeleteRange methods
grep -n "Put\|DeleteRange" ./jsonnet/custom.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

# Check the upstream etcd mixin alerts to compare
cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -A 5 "etcdGRPCWriteRequestsSlow" | head -20

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

# Search for write-related alerts in the upstream etcd mixin
cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "write" -A 3 -B 1

Repository: openshift/cluster-etcd-operator

Length of output: 865


🏁 Script executed:

# Get a broader view of what's in the upstream etcd mixin alerts
wc -l ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 157


🏁 Script executed:

# Read the full upstream etcd alerts file to find write-related alerts
cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "grpc.*write\|txn\|put\|deleterange" -B 2 -A 5

Repository: openshift/cluster-etcd-operator

Length of output: 57


🏁 Script executed:

# Get the entire alerts file to understand the pattern
cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 11491


🏁 Script executed:

# Read the entire custom.libsonnet to check if Put/DeleteRange are monitored elsewhere
cat -n ./jsonnet/custom.libsonnet

Repository: openshift/cluster-etcd-operator

Length of output: 10807


🌐 Web query:

etcd gRPC methods Put DeleteRange Txn write operations

💡 Result:

Summary — etcd KV gRPC write operations

  • Put (KV.Put): writes or updates a single key. Increments store revision, generates a PUT event; can return previous KV if PutRequest.prev_kv is set. Request/response types: PutRequest → PutResponse. [1][2]

  • DeleteRange (KV.DeleteRange): deletes a single key or a range (key + range_end). Increments revision, generates DELETE events for each deleted key; can return deleted key-values if prev_kv is set. Response includes deleted count. Request/response types: DeleteRangeRequest → DeleteRangeResponse. [1][2]

  • Txn (KV.Txn): atomic If/Then/Else transaction. TxnRequest contains compares (conditions) and two lists of RequestOp (success / failure). RequestOp is a oneof: RangeRequest, PutRequest, DeleteRangeRequest (and nested TxnRequest). TxnResponse returns succeeded bool and a list of ResponseOp (matching RangeResponse / PutResponse / DeleteRangeResponse). A Txn increments the revision once for the applied requests and generates events with the same revision. It is forbidden to modify the same key more than once within a single Txn (Puts/Deletes must use unique keys). [1][2]

References:

  • etcd v3 gRPC KV API (Put / DeleteRange / Txn) — etcd docs. [1]
  • etcd v3 transaction and RequestOp/ResponseOp protobuf details — etcd docs. [2]

Sources:
[1] etcd API reference — service KV (Put, DeleteRange, Txn).
[2] etcd v3 learning / API (Txn, RequestOp, Request/Response message shapes).


Expand the write alert to include Put and DeleteRange methods.

The alert is named etcdGRPCWriteRequestsSlow and its description claims to monitor "gRPC write requests," but currently only filters for Txn. Put and DeleteRange are direct write operations in etcd that are not captured. If your workload uses direct Put or DeleteRange calls (common in many deployments), slow writes via these methods would go undetected.

Suggested fix
-            expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+            expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method=~"Txn|Put|DeleteRange", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
📝 Committable suggestion


Suggested change
- alert: 'etcdGRPCWriteRequestsSlow',
- expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
- 'for': '10m',
- labels: {
- severity: 'critical',
- },
- annotations: {
- description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.',
- summary: 'etcd grpc requests are slow',
- description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
- summary: 'etcd grpc write requests are slow',
+ alert: 'etcdGRPCWriteRequestsSlow',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method=~"Txn|Put|DeleteRange", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+ 'for': '10m',
+ labels: {
+ severity: 'critical',
+ },
+ annotations: {
+ description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+ summary: 'etcd grpc write requests are slow',
🤖 Prompt for AI Agents
In `@jsonnet/custom.libsonnet` around lines 56 - 64, The etcdGRPCWriteRequestsSlow
alert currently only filters grpc_method="Txn" so Put and DeleteRange write
operations are missed; update the alert expression (the expr for alert:
etcdGRPCWriteRequestsSlow) to include all write methods by changing the label
filter to match multiple methods (e.g., grpc_method=~"Txn|Put|DeleteRange") or
an equivalent set-based match, and update the annotations.description text to
reflect that Put and DeleteRange are included so the summary remains accurate.

Contributor Author

You'd be surprised, but deletes are also implemented as Txn calls:
https://github.com/etcd-io/etcd/blob/main/client/v3/kubernetes/client.go
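
The point above can be illustrated with a toy in-memory model in Python. It is not the etcd client API: `FakeStore`, `optimistic_put` and `optimistic_delete` are hypothetical names, and the compare-on-mod-revision shape is an assumption about how the linked wrapper guards its writes. The sketch only shows why both puts and deletes would appear under `grpc_method="Txn"` in the server metrics when they are issued as guarded transactions.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Toy stand-in for etcd that records which gRPC method each call would use."""

    data: dict = field(default_factory=dict)  # key -> (value, mod_revision)
    revision: int = 0
    calls: list = field(default_factory=list)

    def txn(self, key, expected_revision, on_success):
        """Run on_success only if the key's mod revision matches the expectation."""
        self.calls.append("Txn")
        # A missing key is treated as mod revision 0.
        current_revision = self.data.get(key, (None, 0))[1]
        if current_revision != expected_revision:
            return False
        self.revision += 1
        on_success(self, key, self.revision)
        return True


def optimistic_put(store, key, value, expected_revision):
    def put(s, k, rev):
        s.data[k] = (value, rev)

    return store.txn(key, expected_revision, put)


def optimistic_delete(store, key, expected_revision):
    def delete(s, k, rev):
        s.data.pop(k, None)

    return store.txn(key, expected_revision, delete)
```

In this model every write, including a delete with a stale revision that is rejected, is recorded as a `Txn` call; no `Put` or `DeleteRange` call is ever issued.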

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the client code to verify how DeleteRange is implemented
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | head -200

Length of output: 3853


🏁 Script executed:

#!/bin/bash
# Search for Delete and Put method implementations in the kubernetes client
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -A 20 "func.*Delete"

Length of output: 611


🏁 Script executed:

#!/bin/bash
# Search for Put method implementations
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -A 20 "func.*Put"

Length of output: 711


🏁 Script executed:

#!/bin/bash
# Get more context on how these methods work - look for Txn usage
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -B 5 -A 15 "Txn()"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if the base etcd client v3 has direct Put/Delete methods
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | grep -A 15 "func.*Put\|func.*Delete"

Length of output: 1654


🏁 Script executed:

#!/bin/bash
# Look at the KV interface to see available methods
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | head -100

Length of output: 3404


🏁 Script executed:

#!/bin/bash
# Check the gRPC service definition to see what methods are exposed
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/api/etcdserverpb/rpc.proto" | grep -A 5 "service KV"

Length of output: 210


🏁 Script executed:

#!/bin/bash
# Get the full KV service definition from proto
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/api/etcdserverpb/rpc.proto" | grep -A 50 "service KV"

Length of output: 1689


🏁 Script executed:

#!/bin/bash
# Also check how the Do method routes operations
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | grep -A 30 "func.*Do\("

Length of output: 23


🏁 Script executed:

#!/bin/bash
# Search for etcd client usage in the OpenShift cluster-etcd-operator
rg -n "clientv3\." --type=go -C 2 | head -50

Length of output: 3779


🏁 Script executed:

#!/bin/bash
# Check if the kubernetes client wrapper is imported in this repo
rg -n "kubernetes.*client" --type=go -g "!vendor/**" | head -30

Length of output: 525


🏁 Script executed:

#!/bin/bash
# Look for direct Put/Delete calls vs Txn usage
rg -n "\.Put\(|\.Delete\(" --type=go -g "!vendor/**" -C 1 | head -40

Length of output: 2915


🏁 Script executed:

#!/bin/bash
# Search for documentation or comments about what client Kubernetes uses
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | head -100

Length of output: 3222


🏁 Script executed:

#!/bin/bash
# Check if kubernetes uses the kubernetes client wrapper
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -i "import" -A 30 | head -50

Length of output: 796


🏁 Script executed:

#!/bin/bash
# Look for Put/Delete implementations in kube-apiserver storage
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -A 20 "func.*Create\|func.*Delete\|func.*Update"

Length of output: 3966


🏁 Script executed:

#!/bin/bash
# Check how kube-apiserver uses the kubernetes client for writes
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -A 30 "s.client.Kubernetes"

Length of output: 7601


🏁 Script executed:

#!/bin/bash
# Look for OptimisticPut usage in kube-apiserver
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -i "OptimisticPut\|OptimisticDelete" -B 2 -A 10

Length of output: 1384


🏁 Script executed:

#!/bin/bash
# Search for the actual etcd operation calls
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep "\.client\." -A 2 | head -60

Length of output: 1537


@tjungblu You're absolutely right, thank you for the clarification! I verified that kube-apiserver uses the kubernetes.Client wrapper exclusively, and all write operations (Create/Update/Delete) go through OptimisticPut and OptimisticDelete, which indeed wrap the operations in Txn calls as you showed.

The current alert filtering for grpc_method="Txn" is correct and sufficient for monitoring write latency in the Kubernetes/OpenShift context. Adding Put and DeleteRange would not capture any additional slow writes since kube-apiserver doesn't use those methods directly.

My suggestion was incorrect. The alert as implemented properly captures all write operations that matter for the product.
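
As a small illustration of that conclusion, the Python sketch below compares the exact matcher `grpc_method="Txn"` with the regex matcher `grpc_method=~"Txn|Put|DeleteRange"` over a hypothetical set of series. The series lists are invented for the example, and Python's `re` module stands in for Prometheus's RE2 engine, which behaves the same for a plain alternation like this one. Prometheus regex matchers are fully anchored, hence `re.fullmatch`.

```python
import re


def select_series(series, operator, pattern):
    """Return the series whose grpc_method label satisfies the matcher."""
    if operator == "=":
        return [s for s in series if s["grpc_method"] == pattern]
    if operator == "=~":
        # Prometheus anchors regex matchers at both ends.
        return [s for s in series if re.fullmatch(pattern, s["grpc_method"])]
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical unary methods seen when all writes go through Txn.
txn_only_workload = [{"grpc_method": m} for m in ("Range", "Txn")]

# Hypothetical workload where some client also issues direct Put calls.
mixed_workload = txn_only_workload + [{"grpc_method": "Put"}]
```

For the first workload both matchers select the same single `Txn` series, so the wider regex would add nothing; only in the second, hypothetical workload would the regex pick up an extra `Put` series.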

},
},
{
8 changes: 4 additions & 4 deletions jsonnet/jsonnetfile.lock.json
@@ -8,7 +8,7 @@
  "subdir": "contrib/mixin"
  }
  },
- "version": "e4d6a05f8f1ec972384e24a83c420f707a6644f2",
+ "version": "9a2b574b70e69eb1f47dfc896b2f63245f1873f6",
  "sum": "XmXkOCriQIZmXwlIIFhqlJMa0e6qGWdxZD+ZDYaN0Po="
  },
  {
@@ -18,7 +18,7 @@
  "subdir": "gen/grafonnet-v10.0.0"
  }
  },
- "version": "42d098fae987f25f08480e203ca6ddc548c6efbf",
+ "version": "7380c9c64fb973f34c3ec46265621a2b0dee0058",
  "sum": "xdcrJPJlpkq4+5LpGwN4tPAuheNNLXZjE6tDcyvFjr0="
  },
  {
@@ -38,8 +38,8 @@
  "subdir": ""
  }
  },
- "version": "4eee017d21cb63a303925d1dcd9fc5c496809b46",
- "sum": "Kh0GbIycNmJPzk6IOMXn1BbtLNyaiiimclYk7+mvsns="
+ "version": "4d7f8cb24d613430799f9d56809cc6964f35cea9",
+ "sum": "hOrwkOx34tOXqoDVnwuI/Uf/dr9HFFSPWpDPOvnEGrk="
  }
  ],
  "legacyImports": false
14 changes: 14 additions & 0 deletions jsonnet/vendor/github.com/jsonnet-libs/xtd/array.libsonnet

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions jsonnet/vendor/github.com/jsonnet-libs/xtd/docs/array.md

16 changes: 16 additions & 0 deletions jsonnet/vendor/github.com/jsonnet-libs/xtd/docs/string.md

61 changes: 61 additions & 0 deletions jsonnet/vendor/github.com/jsonnet-libs/xtd/docs/units.md
