cluster/health_overview: report high disk usage in health reports by bharathv · Pull Request #29136 · redpanda-data/redpanda

bharathv · 2026-01-03T01:48:15Z

With this change, the cluster health report gains the following additions:

A list of nodes that exceed the storage alert thresholds.
Overall health is marked unhealthy if any node appears in that list.

This came out of an incident where it was hard to tell which nodes were running out of disk since we didn’t have access to metrics. When a node starts running low on disk, it tends to cause weird issues, so it’s better to surface that in the cluster health reporting.

A follow-up will be to surface this in rpk (that needs a change in a different repo, so it’ll be handled separately).

Fixes: https://redpandadata.atlassian.net/browse/INC-1048

Backports Required

Release Notes

Features

Cluster health report now includes node IDs of nodes that exceed the disk usage reporting thresholds.

Copilot

Pull request overview

This PR adds disk usage alerting to cluster health reports by identifying and reporting nodes that exceed storage alert thresholds. When any node has high disk usage, the cluster is marked as unhealthy with a new high_disk_usage_nodes field containing the affected node IDs.

Key Changes:

Added high_disk_usage_nodes field to cluster health overview reporting
Cluster health is marked unhealthy when nodes exceed disk usage thresholds
Updated health monitoring backend to check disk alerts and populate the new field
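
The rule described in the key changes above can be sketched roughly as follows. This is a simplified Python model, not the C++ code from the PR: the function names, the per-node alert map, and the alert strings ("ok", "low_space", "degraded") are assumptions made for illustration only.

```python
# Hypothetical model of the health rule; none of these names come from the PR.

def high_disk_usage_nodes(disk_alerts: dict[int, str]) -> list[int]:
    """Return the sorted IDs of nodes whose disk alert is anything but "ok"."""
    return sorted(node for node, alert in disk_alerts.items() if alert != "ok")


def is_healthy(nodes_down: list[int], high_disk_nodes: list[int]) -> bool:
    """Cluster is healthy only if no node is down and no node is over the threshold.

    The real health check considers more signals; only the two needed to
    show the new behaviour are modelled here.
    """
    return not nodes_down and not high_disk_nodes


alerts = {1: "ok", 3: "degraded", 2: "low_space"}  # illustrative input
flagged = high_disk_usage_nodes(alerts)
print(flagged, is_healthy([], flagged))
```

The point of the sketch is that a single flagged node is enough to flip the overall status, even when every other signal is clean.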

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/v/cluster/health_monitor_types.h`	Added `high_disk_usage_nodes` vector field to store node IDs exceeding disk thresholds
`src/v/cluster/health_monitor_backend.cc`	Implemented disk alert checking logic and population of high disk usage nodes list
`src/v/cluster/health_monitor_types.cc`	Updated output operator to include high disk usage nodes in formatted output
`src/v/redpanda/admin/server.cc`	Exposed `high_disk_usage_nodes` field in admin API response
`src/v/redpanda/admin/api-doc/cluster.json`	Added API documentation for new `high_disk_usage_nodes` field
`tests/rptest/tests/cluster_health_overview_test.py`	Added test coverage for disk usage alert reporting functionality
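
For consumers of the admin API, reading the new field might look like the sketch below. The endpoint path `/v1/cluster/health_overview` and the default admin port 9644 are assumptions about a typical deployment rather than something stated in this PR, and the helper names are hypothetical. Brokers without this change will not return the field, so the sketch treats a missing key as an empty list.

```python
import json
from urllib.request import urlopen


def fetch_health_overview(admin_url: str = "http://localhost:9644") -> dict:
    """Fetch the health overview as a dict (URL and path are assumed, see above)."""
    with urlopen(f"{admin_url}/v1/cluster/health_overview", timeout=5) as resp:
        return json.load(resp)


def nodes_with_high_disk_usage(overview: dict) -> list[int]:
    """Return the node IDs from `high_disk_usage_nodes`, or [] if the field is absent."""
    return list(overview.get("high_disk_usage_nodes", []))


# Usage against a live broker (not executed here):
#   overview = fetch_health_overview()
#   print(nodes_with_high_disk_usage(overview))
```

Defaulting to an empty list keeps the caller working against mixed-version clusters, at the cost of not distinguishing "no nodes flagged" from "field not supported".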

Copilot · 2026-01-03T01:48:48Z

src/v/redpanda/admin/api-doc/cluster.json

+                "high_disk_usage_nodes": {
+                    "type": "array",
+                    "items": {
+                        "type": "int"


The type should be "integer" instead of "int" to conform to JSON Schema specification. JSON Schema does not recognize "int" as a valid type keyword.
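
A quick way to spot this class of problem is sketched below, assuming the file is meant to be read as JSON Schema. The source does not confirm that: the admin API docs may follow a Swagger dialect with its own type names, in which case the allowed set would differ. The helper name `invalid_types` is hypothetical.

```python
# The seven primitive type names defined by JSON Schema.
JSON_SCHEMA_TYPES = {"array", "boolean", "integer", "null", "number", "object", "string"}


def invalid_types(node, path="$"):
    """Return (path, type) pairs for every string `type` value outside JSON_SCHEMA_TYPES."""
    found = []
    if isinstance(node, dict):
        type_name = node.get("type")
        if isinstance(type_name, str) and type_name not in JSON_SCHEMA_TYPES:
            found.append((path, type_name))
        for key, value in node.items():
            found.extend(invalid_types(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(invalid_types(value, f"{path}[{index}]"))
    return found


# The fragment quoted in the review comment, closed off so it parses.
fragment = {"high_disk_usage_nodes": {"type": "array", "items": {"type": "int"}}}
print(invalid_types(fragment))
```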

vbotbuildovich · 2026-01-03T03:07:15Z

Retry command for Build#78485

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_multiple_decommissions@{"cloud_topic":true}

vbotbuildovich · 2026-01-03T03:55:29Z

CI test results

test results on build#78485

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
MountUnmountIcebergTest	test_simple_remount	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/78485#019b81af-e6c9-4c17-bfd2-8a6697ccb643	FLAKY	10/11	Test PASSES after retries. No significant increase in flaky rate(baseline=0.1957, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.4796, p1=0.0015, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
NodesDecommissioningTest	test_multiple_decommissions	{"cloud_topic": true}	integration	https://buildkite.com/redpanda/redpanda/builds/78485#019b81af-e6d1-4148-bb64-60a77fe49527	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_multiple_decommissions

bharathv · 2026-01-03T04:57:19Z

Unrelated failures (cloud topics)

vbotbuildovich · 2026-01-03T17:56:06Z

/backport v25.3.x

This new field was added in redpanda-data/redpanda#29136 and released in 25.3.5.

Copilot AI review requested due to automatic review settings January 3, 2026 01:48

bharathv requested a review from a team as a code owner January 3, 2026 01:48

github-actions bot added the area/redpanda label Jan 3, 2026

Copilot AI reviewed Jan 3, 2026

bharathv requested review from bashtanov, dotnwat, joe-redpanda and mmaslankaprv January 3, 2026 01:49

bharathv added 3 commits January 2, 2026 17:53

cluster/health: add a field for high disk usage nodes

5b5286f

cluster/health: bubble up nodes with high disk usage

a3cf6fa

cluster/health/test: a simple test for disk usage alerting

e3d9db7

bharathv force-pushed the health_output_du branch from 1a4b48c to e3d9db7 Compare January 3, 2026 01:57

dotnwat approved these changes Jan 3, 2026

dotnwat merged commit 176e0a1 into redpanda-data:dev Jan 3, 2026
18 of 21 checks passed

vbotbuildovich mentioned this pull request Jan 3, 2026

[v25.3.x] cluster/health_overview: report high disk usage in health reports #29141

Merged

r-vasquez added a commit to redpanda-data/common-go that referenced this pull request Jan 30, 2026

rpadmin: add high disk usage report to health overview

2853a7f

This new field was added in redpanda-data/redpanda#29136 and released in 25.3.5.

r-vasquez mentioned this pull request Jan 30, 2026

rpadmin: add high disk usage report to health overview redpanda-data/common-go#127

Merged
