cluster/health_overview: report high disk usage in health reports #29136

Merged
dotnwat merged 3 commits into redpanda-data:dev from bharathv:health_output_du
Jan 3, 2026

Contributor

bharathv commented Jan 3, 2026

With this change the cluster health report has the following additions

  • A list of nodes that exceed storage alert thresholds
  • Overall health is marked unhealthy if any node exceeds those thresholds
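
The two additions can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `DiskAlert` levels, the `HealthOverview` shape, and `build_overview` are hypothetical names, and the other inputs to cluster health are left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class DiskAlert(Enum):
    # Hypothetical alert levels; the real enum in the codebase may differ.
    OK = 0
    LOW_SPACE = 1
    DEGRADED = 2


@dataclass
class HealthOverview:
    # Hypothetical, trimmed-down stand-in for the cluster health overview.
    is_healthy: bool = True
    high_disk_usage_nodes: list[int] = field(default_factory=list)


def build_overview(disk_alerts: dict[int, DiskAlert]) -> HealthOverview:
    """Collect nodes whose alert is not OK; any such node makes the cluster unhealthy."""
    flagged = sorted(
        node_id for node_id, alert in disk_alerts.items() if alert is not DiskAlert.OK
    )
    return HealthOverview(is_healthy=not flagged, high_disk_usage_nodes=flagged)


overview = build_overview(
    {0: DiskAlert.OK, 1: DiskAlert.LOW_SPACE, 2: DiskAlert.DEGRADED}
)
print(overview.is_healthy, overview.high_disk_usage_nodes)
```

In this sketch, disk usage is the only signal. A real report would combine it with the other health conditions before setting the overall flag.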

This came out of an incident where it was hard to tell which nodes were running out of disk since we didn’t have access to metrics. When a node starts running low on disk, it tends to cause weird issues, so it’s better to surface that in the cluster health reporting.

A follow-up will be to surface this in rpk (that needs a change in a different repo, so it’ll be handled separately).

Fixes: https://redpandadata.atlassian.net/browse/INC-1048

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Features

  • Cluster health report now includes node IDs of nodes that exceed the disk usage reporting thresholds.

Copilot AI review requested due to automatic review settings January 3, 2026 01:48
@bharathv bharathv requested a review from a team as a code owner January 3, 2026 01:48
Contributor

Copilot AI left a comment

Pull request overview

This PR adds disk usage alerting to cluster health reports by identifying and reporting nodes that exceed storage alert thresholds. When any node has high disk usage, the cluster is marked as unhealthy with a new high_disk_usage_nodes field containing the affected node IDs.
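
A client reading the health overview might pick up the new field along these lines. This is a sketch under assumptions: the sample payload is made up, `is_healthy` is assumed to be the name of the overall flag, and the helper name is hypothetical. Only `high_disk_usage_nodes` comes from this PR.

```python
import json


def nodes_with_high_disk_usage(health_overview_json: str) -> list:
    """Return the node IDs listed under high_disk_usage_nodes, or [] if absent."""
    overview = json.loads(health_overview_json)
    # Brokers without this change would not emit the field, so default to empty.
    return list(overview.get("high_disk_usage_nodes", []))


# Made-up payload for illustration only.
sample = '{"is_healthy": false, "high_disk_usage_nodes": [2, 5]}'
print(nodes_with_high_disk_usage(sample))
```

Defaulting to an empty list keeps the helper usable against a mixed-version cluster where some responses lack the field.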

Key Changes:

  • Added high_disk_usage_nodes field to cluster health overview reporting
  • Cluster health is marked unhealthy when nodes exceed disk usage thresholds
  • Updated health monitoring backend to check disk alerts and populate the new field

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/cluster/health_monitor_types.h | Added high_disk_usage_nodes vector field to store node IDs exceeding disk thresholds |
| src/v/cluster/health_monitor_backend.cc | Implemented disk alert checking logic and population of high disk usage nodes list |
| src/v/cluster/health_monitor_types.cc | Updated output operator to include high disk usage nodes in formatted output |
| src/v/redpanda/admin/server.cc | Exposed high_disk_usage_nodes field in admin API response |
| src/v/redpanda/admin/api-doc/cluster.json | Added API documentation for new high_disk_usage_nodes field |
| tests/rptest/tests/cluster_health_overview_test.py | Added test coverage for disk usage alert reporting functionality |

"high_disk_usage_nodes": {
    "type": "array",
    "items": {
        "type": "int"

Copilot AI Jan 3, 2026

The type should be "integer" instead of "int" to conform to JSON Schema specification. JSON Schema does not recognize "int" as a valid type keyword.
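
To make the comment concrete, the check below walks a schema fragment and reports any `type` value outside the seven primitive type names defined by JSON Schema. It is a sketch only: `find_invalid_types` is a hypothetical helper, and the thread does not say whether the api-doc tooling enforces strict JSON Schema or accepts its own type names.

```python
# The primitive type names defined by the JSON Schema specification.
JSON_SCHEMA_TYPES = {
    "null", "boolean", "object", "array", "number", "string", "integer",
}


def find_invalid_types(node, path="$"):
    """Return (path, value) pairs for string "type" keywords that are not valid names."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "type" and isinstance(value, str):
                if value not in JSON_SCHEMA_TYPES:
                    found.append((child, value))
            else:
                found.extend(find_invalid_types(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_invalid_types(item, f"{path}[{index}]"))
    return found


fragment = {
    "high_disk_usage_nodes": {
        "type": "array",
        "items": {"type": "int"},
    }
}
print(find_invalid_types(fragment))
```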

@vbotbuildovich
Collaborator

Retry command for Build#78485

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_multiple_decommissions@{"cloud_topic":true}

@vbotbuildovich
Collaborator

CI test results

test results on build#78485
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MountUnmountIcebergTest | test_simple_remount | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/78485#019b81af-e6c9-4c17-bfd2-8a6697ccb643 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.1957, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.4796, p1=0.0015, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount |
| NodesDecommissioningTest | test_multiple_decommissions | {"cloud_topic": true} | integration | https://buildkite.com/redpanda/redpanda/builds/78485#019b81af-e6d1-4148-bb64-60a77fe49527 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_multiple_decommissions |

Contributor Author

bharathv commented Jan 3, 2026

Unrelated failures (cloud topics)

@dotnwat dotnwat merged commit 176e0a1 into redpanda-data:dev Jan 3, 2026
18 of 21 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

4 participants