Releases: cadence-workflow/cadence

v1.4.0

27 Feb 21:44
1a42a94

Major Features

1. Active-Active Domains

For years, Cadence domains have run in active-passive mode, which limits use cases that need processing in all clusters (regions). Since late 2025, Cadence can process a domain in both regions and distribute traffic within the domain according to users’ preferences. This makes your domains more flexible and more efficient, because they use resources in all regions.

Key Capabilities

  • Introduced ClusterAttribute as a flexible abstraction for defining cluster groupings beyond traditional region-based configurations
  • Active Cluster Selection Policy allows workflows to specify which cluster attributes they should run on
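
As a rough illustration of how a cluster attribute could steer a workflow to a cluster, here is a minimal Go sketch. The `ClusterAttribute`, `Domain`, and `ResolveActiveCluster` definitions, as well as the cluster and region names, are hypothetical and simplified for this example; they are not the actual Cadence types or API. The sketch assumes that an empty attribute falls back to the domain's single active cluster and that an unknown attribute is rejected.

```go
package main

import "fmt"

// ClusterAttribute is a hypothetical stand-in for the abstraction named above:
// a scope (for example "region") plus a name within that scope.
type ClusterAttribute struct {
	Scope string
	Name  string
}

// Domain is a simplified, illustrative model of an active-active domain.
// It is not the real Cadence domain type.
type Domain struct {
	DefaultActiveCluster string                      // used when no attribute is set
	ActiveByAttribute    map[ClusterAttribute]string // attribute -> active cluster
}

// ResolveActiveCluster picks the cluster that should process a workflow.
// A nil attribute keeps the existing active-passive behavior.
func (d Domain) ResolveActiveCluster(attr *ClusterAttribute) (string, error) {
	if attr == nil {
		return d.DefaultActiveCluster, nil
	}
	cluster, ok := d.ActiveByAttribute[*attr]
	if !ok {
		return "", fmt.Errorf("cluster attribute %s/%s is not defined on the domain", attr.Scope, attr.Name)
	}
	return cluster, nil
}

func main() {
	d := Domain{
		DefaultActiveCluster: "cluster0",
		ActiveByAttribute: map[ClusterAttribute]string{
			{Scope: "region", Name: "us-west"}: "cluster0",
			{Scope: "region", Name: "us-east"}: "cluster1",
		},
	}

	// A workflow started with an attribute runs where that attribute is active.
	cluster, err := d.ResolveActiveCluster(&ClusterAttribute{Scope: "region", Name: "us-east"})
	fmt.Println(cluster, err)

	// A workflow started without an attribute behaves as in active-passive mode.
	cluster, err = d.ResolveActiveCluster(nil)
	fmt.Println(cluster, err)
}
```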

Migration Path

Active-active support is designed for backward compatibility:

  1. All domains support active-active by default, without breaking existing behavior. Users who set “cluster attributes” when starting their workflows benefit from active-active processing.
  2. Existing active-passive domains continue to work without changes as long as “cluster attributes” are left empty.

Current Limitations

This feature is currently implemented for Cassandra; support for other DBs will come in Q1. We will also publish a blog post explaining how it improves related use cases, a wiki page explaining how to use it, and a code lab to help you try it out.
There is a known failover risk: workflows can get stuck if the schedule-to-start latency plus replication lag exceeds 25 minutes. We are working on a project to resolve this.

2. Replication Improvements

Cadence orchestrates its own replication, which allows us to migrate seamlessly from one DB technology to another, from one cloud provider to another, and so on. Previously, replication messages were generated by reading workflow tasks from the database.

Because replication runs continuously between Cadence regions, we added a cache that keeps replication messages in memory until a replication poll request arrives. The cache achieves a hit rate above 99%, which almost entirely eliminates the DB calls caused by replication; these used to account for more than 20% of all DB calls. Replication latency also improved: because messages are served directly from memory, it dropped from 13s to 2s.

Key Capabilities

  • Replication Budget Manager: New cache capacity control mechanism to prevent memory exhaustion during replication bursts
  • Improved task fetcher concurrency: Better handling of concurrent replication task fetching with enhanced metrics
  • BoundedAckCache optimizations: Generic cache implementation with improved ack handling
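
To make the idea concrete, the following Go sketch shows one way a budget-bounded, ack-driven cache could work: messages are kept in memory until the remote side acknowledges them, and inserts are refused once a byte budget is reached so the caller can fall back to the database. The `replicationCache` type and its methods are hypothetical names for this illustration and do not reflect the actual BoundedAckCache or Replication Budget Manager implementations.

```go
package main

import (
	"fmt"
	"sync"
)

// replicationCache is an illustrative cache keyed by task ID with a byte budget.
type replicationCache struct {
	mu     sync.Mutex
	budget int              // maximum total bytes held in memory
	used   int              // bytes currently held
	items  map[int64][]byte // task ID -> serialized replication message
}

func newReplicationCache(budgetBytes int) *replicationCache {
	return &replicationCache{budget: budgetBytes, items: make(map[int64][]byte)}
}

// Put caches a message and reports whether it is cached afterwards.
// It returns false when the message would exceed the budget.
func (c *replicationCache) Put(taskID int64, msg []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.items[taskID]; exists {
		return true // already cached, nothing to add
	}
	if c.used+len(msg) > c.budget {
		return false // over budget: caller reads from the DB on a later poll
	}
	c.items[taskID] = msg
	c.used += len(msg)
	return true
}

// Get serves a message from memory; a miss means the caller must hit the DB.
func (c *replicationCache) Get(taskID int64) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	msg, ok := c.items[taskID]
	return msg, ok
}

// Ack evicts every message with ID <= level and returns how many were evicted.
func (c *replicationCache) Ack(level int64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	evicted := 0
	for id, msg := range c.items {
		if id <= level {
			c.used -= len(msg)
			delete(c.items, id)
			evicted++
		}
	}
	return evicted
}

func main() {
	cache := newReplicationCache(10)
	fmt.Println(cache.Put(1, []byte("aaaa"))) // fits in the budget
	fmt.Println(cache.Put(2, []byte("bbbb"))) // fits in the budget
	fmt.Println(cache.Put(3, []byte("cccc"))) // would exceed the budget
	fmt.Println(cache.Ack(1))                 // frees the space held by task 1
	fmt.Println(cache.Put(3, []byte("cccc"))) // fits after the ack
}
```

Bounding by bytes rather than by entry count is a design choice made for this sketch; it keeps memory use predictable when message sizes vary during a replication burst.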

3. Shard Distributor Service (in progress)

The Shard Distributor is a new service component that provides dynamic shard assignment and load balancing across matching service instances. This enables better resource utilization, improved scalability, and operational flexibility for large-scale deployments.

Service Components

  • Leader Processor: Centralized controller for shard assignment decisions
  • Executor Client: Integration point for matching service instances to receive shard assignments
  • Spectator Client: Read-only monitoring interface for shard state
  • Canary: Health verification and ping protocol for shard ownership validation

Key Capabilities

  • Dynamic Shard Assignment
  • Dynamic spectator client control with enable/disable support
  • Load Balancing
  • Drain watching support for graceful shard handoff
  • Automatic retry on rebalancing loop failures
  • Migration Support: Migration mode for gradual rollout alongside existing static shard assignment
  • Integration with Matching Service: The matching service has been refactored to support shard distributor integration

Configuration

The shard distributor can be enabled via dynamic config:

  shard-distributor:
      enabled: true
      loadBalancingMode: "shadow-mode"  # Options: naive, shard-stats, shadow-mode
      migrationMode: true  # Enable gradual migration

Monitoring

  • shard_handover_latency: Time taken for shard ownership transfer
  • active_shards_count: Number of active shards per executor
  • shard_assignment_conflicts: Concurrent assignment conflicts detected
  • executor_heartbeat_status: Executor health and liveness
  • ETCD watch event metrics for observability

4. Caller Type-Based Rate Limiting

A new caller-type tracking and bypass mechanism allows granular rate-limiting control for debugging and mitigation.

Key Capabilities:

  • Caller Type Header Propagation (#7644, #7653, #7638): Introduced cadence-caller-type header that propagates through service boundaries
    • Extracted at service inbound boundaries using middleware
    • Available in CLI via header support
    • Minimal performance impact (~150-300ns per request)
  • Persistence Rate Limiter Bypass (#7656): Dynamic config persistence.rateLimiterBypassCallerTypes allows specific caller types to bypass persistence rate limits
  • Frontend Regional Rate Limiter Bypass (#7662): Caller type-based bypass for regional frontend rate limiter to allow priority requests during high load
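
The bypass behavior can be pictured as a thin wrapper around an existing limiter. The Go sketch below is only an illustration under stated assumptions: the `Limiter` interface, the `fixedLimiter` and `bypassLimiter` types, and the caller type values `"cli"` and `"sdk"` are hypothetical and are not taken from the Cadence codebase. In the real service, the set of bypassing caller types would come from dynamic config such as persistence.rateLimiterBypassCallerTypes.

```go
package main

import "fmt"

// Limiter is a minimal rate limiter interface assumed for this sketch.
type Limiter interface {
	Allow() bool
}

// fixedLimiter allows a fixed number of requests and then rejects the rest.
// It stands in for a real token-bucket limiter.
type fixedLimiter struct {
	remaining int
}

func (l *fixedLimiter) Allow() bool {
	if l.remaining <= 0 {
		return false
	}
	l.remaining--
	return true
}

// bypassLimiter consults the wrapped limiter unless the caller type is
// listed in the bypass set.
type bypassLimiter struct {
	inner  Limiter
	bypass map[string]struct{}
}

func newBypassLimiter(inner Limiter, bypassTypes ...string) *bypassLimiter {
	set := make(map[string]struct{}, len(bypassTypes))
	for _, t := range bypassTypes {
		set[t] = struct{}{}
	}
	return &bypassLimiter{inner: inner, bypass: set}
}

// Allow reports whether a request with the given caller type may proceed.
// Bypassing callers do not consume tokens from the wrapped limiter.
func (b *bypassLimiter) Allow(callerType string) bool {
	if _, ok := b.bypass[callerType]; ok {
		return true
	}
	return b.inner.Allow()
}

func main() {
	limiter := newBypassLimiter(&fixedLimiter{remaining: 1}, "cli")

	fmt.Println(limiter.Allow("sdk")) // consumes the only token
	fmt.Println(limiter.Allow("sdk")) // rejected: no tokens left
	fmt.Println(limiter.Allow("cli")) // allowed: caller type bypasses the limiter
}
```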

5. Visibility Enhancements - Cron and Execution Status (#7527)

Added comprehensive cron workflow visibility with new fields in visibility records:

New Fields:

  • CronSchedule: Display the cron schedule for workflows
  • ExecutionStatus: Show actual execution status (PENDING, STARTED, or close statuses) instead of just CONTINUED_AS_NEW for cron workflows
  • ScheduledExecutionTime: Track the actual scheduled execution time for cron workflows

Schema Changes Required: This feature requires database schema upgrades for all persistence stores (Cassandra, MySQL, PostgreSQL, SQLite, Elasticsearch)

CLI Updates:

  • New --print_cron flag to display cron-related fields in cadence workflow list
  • Shows execution status by default

Performance & Scalability Improvements

  • Database & Persistence

    • PostgreSQL timer task pagination (#7621): Improved pagination logic to handle large timer task queries efficiently
    • History node deletion (#7484): Configurable page size for history deletion via dynamic config
    • Snappy compression for history blobs (#7269): Reduced storage footprint and network transfer for history events
    • SQLite fixes (#7469): Resolved database locking issues for local/test deployments
  • Memory & Resource Usage

    • ETCD watch optimization (#7578): Removed WithPrevKV() to reduce memory overhead in shard manager
    • Reduced allocations in metrics (#7456): Optimized insertReportIntoSizes to minimize GC pressure
    • History deletion improvements (#7472): Fixed infinite loop in RangeCompleteHistoryTask when invalid page size provided
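
Paginated loops that step by a caller-supplied page size are a common source of the kind of infinite loop mentioned above: with a page size of zero the loop never makes progress. The Go sketch below shows one defensive pattern, falling back to a default when the value is not positive. It is an illustration only; `defaultPageSize`, `effectivePageSize`, and `countPages` are hypothetical names, and this is not the actual fix from #7472.

```go
package main

import "fmt"

// defaultPageSize is a hypothetical fallback used when the configured
// page size is zero or negative.
const defaultPageSize = 100

// effectivePageSize returns a page size that is guaranteed to be positive.
func effectivePageSize(requested int) int {
	if requested <= 0 {
		return defaultPageSize
	}
	return requested
}

// countPages returns how many pages are needed to process total items.
// Without the guard, a page size of 0 would leave remaining unchanged
// on every iteration and the loop would never terminate.
func countPages(total, requested int) int {
	pageSize := effectivePageSize(requested)
	pages := 0
	for remaining := total; remaining > 0; remaining -= pageSize {
		pages++
	}
	return pages
}

func main() {
	fmt.Println(countPages(250, 100)) // valid page size
	fmt.Println(countPages(250, 0))   // invalid page size, falls back to the default
	fmt.Println(countPages(250, -5))  // invalid page size, falls back to the default
}
```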

Notable Bug Fixes

  • History & Workflow Execution

    • Child workflow duplicate events (#7400): Proper handling of duplicate child workflow started events
    • Activity scheduled time on reset (#7597): Correctly update not-started activities scheduled time when resetting workflows
    • Restart workflow cron scheduling (#7247): Fixed bug where each restart skipped an additional cron scheduled run
    • History cleanup timeout handling (#7617): Avoid dangerous timeout conditions in history cleanup process
    • Workflow creation leak (#7523): Fixed resource leak during workflow creation in history service
    • Signal-with-start cleanup (#7540): Proper handling of signal-with-start in cleanup logic
    • Signal handling with DelayStart (#7702): Prevent signals from bypassing DelayStart configuration
  • Cross-Datacenter Replication

    • Domain ID usage in replication (#7550): Use domain ID instead of domain name for more reliable replication
    • Replication panic logging (#7396): Improved error handling and logging for replication stack panics
    • Database consistency error detection (#7573): More accurate detection of DB consistency errors
  • Active-Active Operations

    • Race condition in failover (#7587): Fixed race condition during active-active failover
    • Query workflow support (#7339): Proper query handling for active-active domains
    • StartWorkflow with terminate-if-running (#7361): Correct policy enforcement for active-active workflows
    • Auto-forwarding (#7356): Fixed cluster forwarding logic for active-active domains
    • Standby task handling (#7423): Prevent premature dropping of standby tasks in active-active scenarios
  • Matching Service

    • TaskList stop on shard stop (#7581): Properly stop task lists when stopping shard processor
    • TaskListActivitiesPerSecond enforcement (#7575): Correct rate limiting enforcement
    • Nil load hints handling (#7551): Added nil pointer checks for load hints
    • TaskList partition config invalidation (#7618): Properly invalidate TaskListPartitionConfig on attempted writes to read-only partitions
    • Domain not active error handling (#7676): Fixed domain not active error to be non-retryable for matching service in active-active scenarios
    • TaskList management with shard distributor (#7682): Properly handle shard processor lifecycle when onboarded to SD
    • Task list registry pattern (#7720): Introduced registry for better task list management
  • Persistence & Database

    • Host tag reversion (#7675): Reverted addition of host tag to persistence calls due to issues
    • History cleanup defaults (#7661): Changed defaults for history cleanup configuration
    • History cleanup error classification (...

v1.3.7-prerelease33

16 Feb 08:44
a50e09a

Pre-release

What's Changed

  • chore: add revert to commit types by @fimanishi in #7677
  • chore: add replication task processor histograms by @zawadzkidiana in #7685
  • fix(shard-distributor): separate watch event processing from the cache refresh by @arzonus in #7670
  • fix(shard-distributor): separate watch event processing from the rebalancing loop by @arzonus in #7669
  • ci: Fix Dockerfile by @arzonus in #7690
  • fix: only upsert search attribute when advanced visibility is enabled by @neil-xie in #7693
  • feat: Add cadence-caller-type to internal requests headers by @fimanishi in #7678
  • fix(cli): Deleteworkflow History manager nil check + test coverage by @YoavLev in #7672
  • feat(cli): Add --remote flag hint on delete workflow failure by @YoavLev in #7673
  • fix: (shard-distributor) use one transaction for assignShardsInCurren… by @eleonoradgr in #7687
  • feat(cadence-matching): simplify the load calculation for shards by @eleonoradgr in #7647
  • fix(cadence-matching): do not delete sp when onboarded to SD by @eleonoradgr in #7682
  • feat(shard-distributor-canary): add support of multiple executors by @arzonus in #7619
  • fix: [shard-distributor] remove error for local passthrough by @eleonoradgr in #7666
  • feat: Add tag for logging when a feature is in shadow mode by @c-warren in #7694
  • chore(shard-manager): Remove GlobalRevision check from shard rebalancing by @gazi-yestemirova in #7689
  • fix(shard-distributor): fix high-frequent triggering of the rebalancing loop by @arzonus in #7696
  • chore(shard-distributor): return PrevKV to cache refreshing by @arzonus in #7698
  • chore: allowlist new histogram migration metrics per comment (follow-up to #7685) by @zawadzkidiana in #7688
  • fix: (executor-client)Skip local assignment if no convergence with h… by @eleonoradgr in #7695
  • chore: set ReplicationTaskProcessorStartWait default to 0 by @fimanishi in #7701
  • docs: Update maintainers list by @demirkayaender in #7680
  • fix(history): prevent signals from bypassing DelayStart by @pratikscfr in #7702

Full Changelog: v1.3.7-prerelease31...v1.3.7-prerelease33

v1.3.7-prerelease31

05 Feb 01:26
103d8b7

Pre-release

What's Changed

  • fix: Replace tokenbucket with standard limiter on CLI by @Scanf-s in #7585
  • chore: Update history task processor to switch to the new scheduler as the default by @Shaddoll in #7623
  • fix: tightning classifications a bit on history cleanup by @davidporter-id-au in #7627
  • chore: Add github action to standardize issue description and labeling by @timl3136 in #7615
  • chore: Add a metric to monitor the size of weighted channel pool by @Shaddoll in #7634
  • refactor: remove time generation from sql db layer by @ribaraka in #7631
  • fix(shard-distributor): fix shard handover and assignment distribution metrics by @arzonus in #7582
  • fix(shard-manager): Clean up stale executors in shadow mode by @gazi-yestemirova in #7635
  • chore(shard-manager): Emit metrics on total number of executors by @gazi-yestemirova in #7636
  • feat: Add cadence-caller-type to cli header by @fimanishi in #7638
  • feat(matching): Invalidate TaskListPartitionConfig on Attempted Writes to Read-Only Partitions by @joannalauu in #7618
  • chore(shard-manager): Emit metrics on oldest executors heartbeat by @gazi-yestemirova in #7639
  • feat: Add cron and workflow execution related fields to visibility by @neil-xie in #7527
  • feat: Extract cadence-caller-type from headers by @fimanishi in #7644
  • fix(shard-distributor): fix flaky tests by @arzonus in #7655
  • docs: Add reviewers checklist to pull request description by @c-warren in #7596
  • fix: Remove sync retry logic in AddTask function by @Scanf-s in #7650
  • feat: Extract cadence-caller-type from headers at services inbound boundaries by @fimanishi in #7653
  • refactor: QueueManager/Queue interfaces by @ribaraka in #7652
  • chore: Add natemort and c-warren to CODEOWNERS by @c-warren in #7659
  • feat: Implement persistence bypass based on caller type by @fimanishi in #7656
  • fix: adds some missing codegen on master by @davidporter-id-au in #7663
  • fix: changing defaults for history cleanup by @davidporter-id-au in #7661
  • feat(cli): Add workflow refresh tasks command by @gazi-yestemirova in #7657
  • fix(shard-manager): Cleanup stale executors when no active executors remain by @gazi-yestemirova in #7645
  • fix(active-active): domain not active error is non retryable for matching by @Shaddoll in #7676
  • feat: Implement bypass based on caller type in the regional frontend regional rate limiter by @fimanishi in #7662
  • fix: revert addition of host tag to persistence calls by @fimanishi in #7675
  • ci: Add pull request reviewer via gitar by @c-warren in #7649

Full Changelog: v1.3.7-prerelease29...v1.3.7-prerelease31

v1.3.7-prerelease29

21 Jan 01:40
e471caf

Pre-release

What's Changed

  • fix: make CallerType in CallerInfo more flexible by @fimanishi in #7588
  • chore: organize and standardize comments for dynamic configs by @fimanishi in #7605
  • chore: remove deprecated not referenced dynamic configs by @fimanishi in #7613
  • refactor(shard-manager): Optimize etcd watch memory usage by removing WithPrevKV() by @gazi-yestemirova in #7578
  • fix(cadence-matching): stop tasklist when stopping the shardprocessor by @eleonoradgr in #7581
  • chore(shard-distributor): change bucket size for shard handover latency metrics by @arzonus in #7614
  • fix: Update not-started activities scheduled time when reseting workflows by @ribaraka in #7597
  • fix(shard-distributor): fix a leader loss by @arzonus in #7608
  • fix: history-cleanup: avoid a possible dangerous condition with timeouts by @davidporter-id-au in #7617
  • fix(postgres): improve timer task pagination by @ribaraka in #7621

Full Changelog: v1.3.7-prerelease28...v1.3.7-prerelease29

v1.3.7-prerelease28

14 Jan 19:28
4677aa7

Pre-release

What's Changed

  • feat(matching): Provide DynamicConfig option to override RPS of a specific TaskList by @joannalauu in #7557
  • feat(shard-distributor): running algorithm in shadow-mode by @eleonoradgr in #7544
  • fix: enforce TaskListActivitiesPerSecond in matching by @timl3136 in #7575
  • feat(frontend): Allow poll requests to wait for a rate limit token by @joannalauu in #7571
  • fix: improving logging for matching / AA by @davidporter-id-au in #7584
  • fix(active-active): Fix a race condition for active-active failover by @Shaddoll in #7587
  • fix: small XDC config check by @davidporter-id-au in #7589
  • feat(history): cap decision task failure retries by @shuprime in #7580
  • fix(shard-manager): skip non-ACTIVE executors when assigning ephemeral shards by @gazi-yestemirova in #7594
  • refactor(matching): replace callback with explicit Registry by @dkrotx in #7593
  • refactor(matching): rename some matching-engine functions by @dkrotx in #7591
  • refactor(matching): remove obsolete commented-out code by @dkrotx in #7592
  • fix: Updates, hopefully corrects up the cleanup logic for start wf calls by @davidporter-id-au in #7590

Full Changelog: v1.3.7-prerelease26...v1.3.7-prerelease28

v1.3.7-prerelease26

02 Jan 08:50
50e4618

Pre-release

What's Changed

  • refactor(shard-distributor): store shard stats under executor keys by @AndreasHolt in #7507
  • fix(shard distributor): remove heartbeat write cooldown by @AndreasHolt in #7513
  • refactor(ringpop): extract bootstrap logic into factory by @jakobht in #7517
  • chore(shard-distributor): add etcdclient.Client interface by @arzonus in #7521
  • fix: field mapping for list-failover-history should do graceful/force failover types by @davidporter-id-au in #7522
  • feat: [shard-distributor]Send "draining" heartbeat on executer shutdown by @gazi-yestemirova in #7505
  • fix: leak in history during workflow creation by @fimanishi in #7523
  • feat(cadence-matching): integration with executorclient by @eleonoradgr in #7504
  • chore: update idls to latest master by @neil-xie in #7529
  • fix: removing log noise by @davidporter-id-au in #7528
  • fix(shard-distributor): add error handling in namespace refresh loop by @arzonus in #7519
  • feat(matching): Use real metrics scope for shard distributor executor client by @gazi-yestemirova in #7535
  • feat: making error-injection less annoying on startup by @davidporter-id-au in #7532
  • fix: missed some followups with respect to error injection by @davidporter-id-au in #7538
  • fix: Cleanup needs to handle signal-with-start by @davidporter-id-au in #7540
  • chore(shard-distributor): introduce LoadBalancingMode dynamic property by @arzonus in #7525
  • feat(cadence-matching): Onbord to use ShardDistributor executor client by @eleonoradgr in #7533
  • fix: Return Cluster Attributes when describing a domain via the cli by @c-warren in #7539
  • chore: Clean up visibility override related code from frontend by @neil-xie in #7543
  • chore: add host tag to persistence metrics by @fimanishi in #7530
  • fix(shard-distributor): remove usage of context from Start in canary by @arzonus in #7541
  • refactor: Remove shard distributor config dependency by @gazi-yestemirova in #7542
  • fix(cadence-matching): start and stop the executor by @eleonoradgr in #7548
  • fix: Allow failovers of cluster attributes for active-active domains by @c-warren in #7549
  • refactor: Update matching executor configs by @gazi-yestemirova in #7547
  • fix(cadence-matching): fix panic in case of nil load hints by @eleonoradgr in #7551
  • feat(shard-distributor): exclude shard stats from naive load balancing mode by @arzonus in #7526
  • feat(shard-distributor): change shard load to be based on shard id in canary by @arzonus in #7552
  • feat(matching): integrate shard distributor spectator for task list routing by @jakobht in #7546
  • feat(shard-distributor): add shard rebalancing by shard load by @arzonus in #7545
  • refactor(sharddistributor): use dynamic config for migration mode in leader processor by @gazi-yestemirova in #7554
  • fix: Use domain id instead of domain name for replication by @bowenxia in #7550
  • feat(shard-distributor): improve spectator client logging by @jakobht in #7556
  • fix(cadence-matching): assigning lock for newProcessor by @eleonoradgr in #7559
  • feat: Add CallerType for request context propagation by @fimanishi in #7564
  • fix(shard-distributor-canary): remove noisy log by @eleonoradgr in #7572
  • fix: correct the detection of DB consistency error by @davidporter-id-au in #7573
  • feat: Create CallerInfo struct to be used in context by @fimanishi in #7574

Full Changelog: v1.3.7-prerelease23...v1.3.7-prerelease26

v1.3.7-prerelease23

08 Dec 22:50
079d692

Pre-release

What's Changed

  • chore(shard-distirbutor): extend info to debug assignment conflicts by @eleonoradgr in #7506
  • feat(shard-distributor): Add ping verification to ephemeral shard creator by @jakobht in #7496
  • fix(shard-distributor): add context timeout into the shard rebalancing loop by @arzonus in #7514
  • chore(shard-distributor): increase observability of the leader election and the leader processor by @arzonus in #7515
  • fix(shard-distributor): remove storing AssignedState.ModRevision to etcd by @arzonus in #7516
  • chore: change logger to warn for nil mutable state in executeDeleteHistoryEventTask by @fimanishi in #7509
  • feat(docker): add docker-compose configuration for OpenSearch by @Bueller87 in #7510

Full Changelog: v1.3.7-prerelease21...v1.3.7-prerelease23

v1.3.7-prerelease21

03 Dec 21:41
24aa35c

Pre-release

What's Changed

  • feat(shard-distributor): return executor metadata from spectator GetShardOwner by @jakobht in #7476
  • refactor: [shard-distributor]Remove error logs from store level by @gazi-yestemirova in #7492
  • feat(shard-distributor): add shard handover stats by @arzonus in #7495
  • feat(shard-distributor): add canary ping handler for executor health checks by @jakobht in #7486
  • feat(matching): use ShardProcessors instead of TaskListManagers by @eleonoradgr in #7480
  • feat(shard-distributor): add SpectatorPeerChooser for shard-aware routing by @jakobht in #7478
  • fix(shard-distributor): return executor metadata for ephemeral shard assignments by @jakobht in #7501
  • fix(shard-manager): fixed the ETCD integration test by @jakobht in #7502
  • fix(ReplicationTaskFetcher): fix task_fetcher concurrent fetching by @fimanishi in #7471
  • feat: adds cluster-attributes to start cli command by @davidporter-id-au in #7494
  • feat(shard-distributor): add shard handover latency metrics by @arzonus in #7442
  • fix(shard-distributor): send initial state to new subscribers by @jakobht in #7499
  • feat(shard-distributor): Add canary pinger for periodic shard ownership verification by @jakobht in #7487
  • fix(shard-distributor-ex-client): handle concurrent assignment by @eleonoradgr in #7500
  • fix: minor cli flag update by @davidporter-id-au in #7503
  • chore: Add script to generate cluster attributes for domain update tests by @c-warren in #7377

Full Changelog: v1.3.7-prerelease20...v1.3.7-prerelease21

v1.3.7-prerelease20

27 Nov 10:34
5487696

Pre-release
  • feat(shard-distributor): add spectator client for read-only shard state monitoring by @jakobht in #7438
  • chore(shard-distributor): classify errors by @arzonus in #7466
  • docs: Canary Grafana Dashboard Panel for workflow success counter by @vishwa-uber in #7464
  • fix(sqlite): fix database locked issues by @arzonus in #7469
  • feat(shard-distributor): refactor time handling, data store structures, key building in etcd by @arzonus in #7447
  • feat(shard-distributor): add GetMetadata and GetNamespace methods to executor interface by @jakobht in #7477
  • fix(shard-distributor): prevent context cancellation in streaming WatchNamespaceState RPC by @jakobht in #7474
  • fix(shard-distributor): fix unit test in handler by @arzonus in #7479
  • fix(shard-distributor): make DeleteShardStats non-transactional and fix cleanup condition by @AndreasHolt in #7465
  • feat: [shard-distributor]Compress data before writing to ETCD by @gazi-yestemirova in #7412
  • fix(etcdstore): fix merge conflict on etcdstore_test by @fimanishi in #7483
  • fix(rpc): dns updater should not update current peer on failures by @shijiesheng in #7424
  • feat(active-active): Fail StartWorkflow request if cluster attribute doesn't exist by @Shaddoll in #7485
  • fix(RangeCompleteHistoryTask): fix infinite loop when page size provided is <= 0 by @fimanishi in #7472
  • fix(jitter): allow input equal to 0 to be provided without panic by @fimanishi in #7481
  • feat(persistence): make DeleteFromHistoryNode page size a dynamic config by @fimanishi in #7484
  • feat(persistence): make DeleteFromHistoryNode page size a dynamic config by @fimanishi in #7488
  • feat(shard-distributor): add canary gRPC protocol for executor-to-executor pings by @jakobht in #7475
  • fix(shard-distributor): remove trimming of prefixes by @eleonoradgr in #7490

v1.3.7-prerelease19

20 Nov 07:59
c6959bf

Pre-release

What's Changed

  • fix(shard-distributor): change to migration dynamic config name by @eleonoradgr in #7441
  • feat(shard-distributor): integrate executor cleanup with shard assignment by @jakobht in #7440
  • feat(executor-client): split function for local shard assignment by @eleonoradgr in #7446
  • feat: using the machines to slowly converge on the right tooling by @davidporter-id-au in #7453
  • fix: Improve logs for panics in replication stack by @Shaddoll in #7396
  • chore(shard-distributor): merge etcdstore go module to root go module by @arzonus in #7454
  • docs: Update Maintainers by @demirkayaender in #7455
  • chore: Improve insertReportIntoSizes to reduce memory allocations by @Shaddoll in #7456
  • fix(ReplicationBudgetManager): add HostTag to budget manager by @fimanishi in #7459
  • fix(active-active): Update CLI describe workflow output to show ActiveClusterSelectionPolicy by @Shaddoll in #7461
  • chore(replicationTaskFetcher): Add metrics to task_fetcher by @fimanishi in #7462

Full Changelog: v1.3.7-prerelease18...v1.3.7-prerelease19