chore(shard-manager): Emit metrics on total number of executors #7636

Merged
gazi-yestemirova merged 2 commits into cadence-workflow:master from gazi-yestemirova:shard-distributor-metrics
Jan 23, 2026

@gazi-yestemirova gazi-yestemirova commented Jan 22, 2026

What changed?
This PR adds the shard_distributor_total_executors gauge metric to track the number of executors registered with the shard distributor.
The metric is emitted during each rebalance loop, with executor statuses (e.g., ExecutorStatusACTIVE, ExecutorStatusDRAINING, ExecutorStatusDRAINED).
It is tagged with namespace and namespace_type for per-namespace monitoring.
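
As a rough illustration of the description above, the sketch below counts executors per status and emits one gauge value per status, tagged with namespace and namespace_type. It is not the code from this PR: the gaugeEmitter interface, the printEmitter type, the executor_status tag name, and the string-valued status constants are all hypothetical stand-ins chosen for the example.

```go
package main

import "fmt"

// ExecutorStatus is a hypothetical stand-in for the status enum named in the PR description.
type ExecutorStatus string

const (
	StatusActive   ExecutorStatus = "ACTIVE"
	StatusDraining ExecutorStatus = "DRAINING"
	StatusDrained  ExecutorStatus = "DRAINED"
)

// gaugeEmitter is an assumed minimal metrics interface, not the Cadence metrics API.
type gaugeEmitter interface {
	UpdateGauge(name string, tags map[string]string, value float64)
}

// printEmitter writes gauge updates to stdout so the sketch can run on its own.
type printEmitter struct{}

func (printEmitter) UpdateGauge(name string, tags map[string]string, value float64) {
	fmt.Printf("%s %v = %.0f\n", name, tags, value)
}

// countByStatus tallies executors per status. The input maps executor ID to status.
func countByStatus(executors map[string]ExecutorStatus) map[ExecutorStatus]int {
	counts := make(map[ExecutorStatus]int)
	for _, status := range executors {
		counts[status]++
	}
	return counts
}

// emitExecutorGauge emits one gauge value per observed status.
// The "executor_status" tag name is an assumption made for this sketch.
func emitExecutorGauge(e gaugeEmitter, namespace, namespaceType string, executors map[string]ExecutorStatus) {
	for status, n := range countByStatus(executors) {
		e.UpdateGauge("shard_distributor_total_executors", map[string]string{
			"namespace":       namespace,
			"namespace_type":  namespaceType,
			"executor_status": string(status),
		}, float64(n))
	}
}

func main() {
	executors := map[string]ExecutorStatus{
		"executor-1": StatusActive,
		"executor-2": StatusActive,
		"executor-3": StatusDraining,
	}
	emitExecutorGauge(printEmitter{}, "example-namespace", "example-type", executors)
}
```

Statuses with no executors are simply not emitted in this sketch; a real implementation might prefer to emit explicit zeros so that alerts on a missing series do not fire by accident.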

Why?
To monitor the overall health of the shard distributor cluster, i.e. to know how many executors are actively participating in shard distribution.
To detect executor churn or scaling events.
To alert when the executor count falls below expected thresholds, which could indicate deployment issues or infrastructure problems.

How did you test it?
Verified that the metric is emitted in local and dev environments with correct executor counts.

Potential risks
N/A

Release notes

Documentation Changes

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
}

p.emitActiveShardMetric(namespaceState.ShardAssignments, metricsLoopScope)
p.emitExecutorMetric(namespaceState, len(staleExecutors), metricsLoopScope)
Contributor

If AssignShards fails, the metric will not be emitted. I think we can emit this metric right before the call to AssignShards.

Contributor Author

I wanted to emit the "committed" state of the executors after the transaction because the transaction will be retried, which avoids potentially emitting inconsistent data.
But I think emitting the "observed" state is also reasonable. Let me update it.

Contributor

nit: Perhaps we should also add it for the shadow namespaces, shouldn't we?

Contributor Author

Yes, we do: the metric is emitted before exiting for shadow executors.

Contributor Author

[image] So the metric is emitted right before the exit for shadow executors.
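
The ordering discussed in this thread can be sketched roughly as follows: the executor count is emitted from the observed state first, so it is still reported when the assignment call fails and when the loop exits early for a shadow namespace. This is a minimal illustration under stated assumptions, not the code merged in this PR. The rebalanceOnce function, its parameters, the shadow flag, and errAssignFailed are hypothetical names, and the assign callback merely stands in for the AssignShards call mentioned above.

```go
package main

import (
	"errors"
	"fmt"
)

// errAssignFailed simulates a failed assignment transaction (hypothetical).
var errAssignFailed = errors.New("assign shards failed")

// rebalanceOnce is a hypothetical, simplified rebalance iteration.
// It emits the observed executor count before doing anything that can fail
// or return early, so the metric does not depend on the assignment outcome.
func rebalanceOnce(observed int, shadow bool, emit func(int), assign func() error) error {
	// Emit the "observed" state first.
	emit(observed)

	// Assumption for this sketch: shadow namespaces exit before assignment.
	if shadow {
		return nil
	}

	// The assignment may fail and be retried by the caller; the metric
	// has already been emitted by this point.
	if err := assign(); err != nil {
		return fmt.Errorf("rebalance: %w", err)
	}
	return nil
}

func main() {
	emit := func(n int) { fmt.Println("executors observed:", n) }

	// Assignment fails, but the metric was still emitted.
	err := rebalanceOnce(3, false, emit, func() error { return errAssignFailed })
	fmt.Println("error:", err)

	// Shadow namespace: the metric is emitted and assignment is skipped.
	_ = rebalanceOnce(2, true, emit, func() error { return nil })
}
```

The trade-off raised in the thread still applies: emitting the observed state means a retried iteration reports the count again, whereas emitting after the transaction would report only committed state but stay silent on failure.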

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@arzonus arzonus left a comment

lgtm 🚀

@gazi-yestemirova gazi-yestemirova merged commit 474d530 into cadence-workflow:master Jan 23, 2026
42 of 43 checks passed

2 participants