cluster/health_overview: report high disk usage in health reports #29136
Merged
dotnwat merged 3 commits into redpanda-data:dev on Jan 3, 2026
Pull request overview
This PR adds disk usage alerting to cluster health reports by identifying and reporting nodes that exceed storage alert thresholds. When any node has high disk usage, the cluster is marked as unhealthy with a new high_disk_usage_nodes field containing the affected node IDs.
Key Changes:
- Added high_disk_usage_nodes field to cluster health overview reporting
- Cluster health is marked unhealthy when nodes exceed disk usage thresholds
- Updated health monitoring backend to check disk alerts and populate the new field
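The behavior described in the key changes can be sketched roughly as follows. This is a minimal illustration, not the PR's C++ implementation: `DiskAlert`, `HealthOverview`, and `build_overview` are hypothetical names, and the sketch considers only disk alerts, whereas the real health check presumably weighs other conditions as well.

```python
from dataclasses import dataclass, field
from enum import Enum


class DiskAlert(Enum):
    # Hypothetical alert levels, used here for illustration only.
    OK = 0
    LOW_SPACE = 1
    DEGRADED = 2


@dataclass
class HealthOverview:
    # Node IDs whose disk alert is anything other than OK.
    high_disk_usage_nodes: list[int] = field(default_factory=list)
    is_healthy: bool = True


def build_overview(alerts: dict[int, DiskAlert]) -> HealthOverview:
    """Collect nodes with a non-OK disk alert; the cluster is unhealthy if any exist."""
    flagged = sorted(node for node, alert in alerts.items() if alert is not DiskAlert.OK)
    return HealthOverview(high_disk_usage_nodes=flagged, is_healthy=not flagged)


# Node 1 is low on space, so it is listed and the cluster is reported unhealthy.
overview = build_overview({0: DiskAlert.OK, 1: DiskAlert.LOW_SPACE, 2: DiskAlert.OK})
```

Returning the flagged node IDs alongside the health flag lets an operator see which nodes need attention without access to metrics.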
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/cluster/health_monitor_types.h | Added high_disk_usage_nodes vector field to store node IDs exceeding disk thresholds |
| src/v/cluster/health_monitor_backend.cc | Implemented disk alert checking logic and population of high disk usage nodes list |
| src/v/cluster/health_monitor_types.cc | Updated output operator to include high disk usage nodes in formatted output |
| src/v/redpanda/admin/server.cc | Exposed high_disk_usage_nodes field in admin API response |
| src/v/redpanda/admin/api-doc/cluster.json | Added API documentation for new high_disk_usage_nodes field |
| tests/rptest/tests/cluster_health_overview_test.py | Added test coverage for disk usage alert reporting functionality |
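As a rough illustration of how a client might read the new field from the admin API response, the sketch below parses a sample body. The sample payload and the helper name `nodes_with_high_disk_usage` are assumptions made for this example. Only the `high_disk_usage_nodes` field name comes from this PR.

```python
import json

# Hypothetical response body. Only the high_disk_usage_nodes field name is taken
# from this PR; the other field and all values are illustrative.
SAMPLE_RESPONSE = '{"is_healthy": false, "high_disk_usage_nodes": [2, 5]}'


def nodes_with_high_disk_usage(body: str) -> list[int]:
    """Return the node IDs listed under high_disk_usage_nodes, or [] if absent."""
    overview = json.loads(body)
    # Responses from brokers that predate this change will lack the field,
    # so fall back to an empty list instead of raising KeyError.
    return list(overview.get("high_disk_usage_nodes", []))


flagged = nodes_with_high_disk_usage(SAMPLE_RESPONSE)
if flagged:
    print(f"nodes with high disk usage: {flagged}")
```

Defaulting to an empty list keeps the client usable against clusters that have not yet been upgraded.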
"high_disk_usage_nodes": {
    "type": "array",
    "items": {
        "type": "int"
The type should be "integer" instead of "int" to conform to JSON Schema specification. JSON Schema does not recognize "int" as a valid type keyword.
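The point in this comment can be checked mechanically. The sketch below walks a schema fragment and reports any "type" value that is not one of the seven primitive type names defined by JSON Schema. The helper `invalid_types` and the closed-off fragment are written for this example. The sketch checks against JSON Schema names only and says nothing about what the project's own API doc tooling accepts.

```python
# The primitive type names defined by the JSON Schema specification.
JSON_SCHEMA_TYPES = {"null", "boolean", "object", "array", "number", "string", "integer"}


def invalid_types(schema, path="$"):
    """Yield (path, value) for each string "type" value that JSON Schema does not define."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == "type" and isinstance(value, str) and value not in JSON_SCHEMA_TYPES:
                yield child, value
            else:
                # Recurse into nested objects; scalar values yield nothing.
                yield from invalid_types(value, child)
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from invalid_types(item, f"{path}[{index}]")


# The fragment under review, with its braces closed for the sake of the example.
fragment = {"high_disk_usage_nodes": {"type": "array", "items": {"type": "int"}}}
problems = list(invalid_types(fragment))
```

Replacing "int" with "integer" in the fragment makes the check come back empty.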
1a4b48c to e3d9db7
dotnwat approved these changes Jan 3, 2026
Retry command for Build#78485: please wait until all jobs are finished before running the slash command.
CI test results: test results on build#78485
Unrelated failures (cloud topics)
/backport v25.3.x
r-vasquez added a commit to redpanda-data/common-go that referenced this pull request Jan 30, 2026
This new field was added in redpanda-data/redpanda#29136 and released in 25.3.5.
With this change the cluster health report has the following additions
This came out of an incident where it was hard to tell which nodes were running out of disk since we didn’t have access to metrics. When a node starts running low on disk, it tends to cause weird issues, so it’s better to surface that in the cluster health reporting.
A follow-up will be to surface this in rpk (that needs a change in a different repo, so it’ll be handled separately).
Fixes: https://redpandadata.atlassian.net/browse/INC-1048
Backports Required
Release Notes
Features