Skip to content

agent: detect and disable abandoned tasks#2752

Draft
jshearer wants to merge 1 commit intomasterfrom
agent/abandoned_tasks
Draft

agent: detect and disable abandoned tasks#2752
jshearer wants to merge 1 commit intomasterfrom
agent/abandoned_tasks

Conversation

@jshearer
Copy link
Contributor

@jshearer jshearer commented Mar 9, 2026

Summary

Implements abandoned task detection from #2710. Two alert sequences identify tasks that are persistently failing or not processing data, warn the user, and then disable the task after a grace period. Auto-disable publication is gated behind DISABLE_ABANDONED_TASKS env var (off by default).

Sequence 1: Chronically Failing

TaskChronicallyFailing fires when all of:

  • has_task_shards is true (enabled, non-Dekaf task with shards)
  • ShardFailed has been continuously firing >= 30 days (CHRONICALLY_FAILING_THRESHOLD)
  • No user publication within 14 days (USER_PUB_THRESHOLD)

Since ShardFailed only resolves after 2 hours of continuous healthy operation, the 30-day condition means the task has not stayed alive for 2 consecutive hours at any point during that window.

After 7 days (CHRONICALLY_FAILING_DISABLE_AFTER), TaskAutoDisabledFailing fires and the task is disabled.

Resolves when ShardFailed resolves, a user publishes, or the task is disabled. A user publication resets a 14-day suppression window before the alert can re-fire. ShardFailed.first_ts (the 30-day clock) is unaffected by publications and only resets when the task recovers for 2+ hours. However, when TaskChronicallyFailing resolves and later re-fires, its own first_ts (which anchors the 7-day disable grace period) resets to the new fire time.

Sequence 2: Idle

TaskIdle fires when all of:

  • has_task_shards is true
  • No ShardFailed or TaskChronicallyFailing alert is active
  • No data movement in catalog_stats_daily within 30 days (IDLE_THRESHOLD)
  • No user publication within 14 days (USER_PUB_THRESHOLD)

After 7 days (IDLE_DISABLE_AFTER), TaskAutoDisabledIdle fires and the task is disabled.

Idle detection is suppressed while failure alerts are active. A task that is both failing and idle follows Sequence 1 only, so users receive one notification path. New tasks are protected by their initial publication, which sets last_user_pub_at and provides 14 days of suppression.

Scenarios

Task with intermittent 3-hour runs: Shard reaches PRIMARY, runs for 3 hours, then fails. Each cycle, ShardFailed resolves (2+ hours healthy) then re-fires, resetting ShardFailed.first_ts. TaskChronicallyFailing never fires. This is intentional: the task is doing some work each cycle. If it doesn't move any data during the period where it's PRIMARY, the TaskIdle alert will take over.

Healthy task that never moves data: A task stays in PRIMARY with no failures, but no data flows through it (e.g. a Google Sheets capture where the sheet is empty). No ShardFailed fires, so the chronically-failing path is never involved. After 30 days with no data movement and no user publication in 14 days, TaskIdle fires. After 37 days total, the task is auto-disabled.

Failing task that also has no data movement: A task starts erroring and never moves data. ShardFailed fires within hours (after 3 failures), which immediately suppresses TaskIdle for as long as it remains active. Even though the task meets all other idle conditions, TaskIdle cannot fire while ShardFailed or TaskChronicallyFailing exists in the alert map. The task follows the chronically-failing path: TaskChronicallyFailing at 30 days, auto-disable at 37. If the task later recovers and ShardFailed resolves but still has no data movement, idle evaluation resumes and TaskIdle can fire on the next evaluation.

Failing task where user publishes a fix attempt: TaskChronicallyFailing fires on day 30. User publishes on day 31, which resolves the alert silently. If the task continues failing, the alert re-fires on day 45 (when the publication ages past 14 days) with a fresh TaskChronicallyFailing.first_ts and a new 7-day grace period. ShardFailed.first_ts remains day 0 throughout. The user publication bought 21 extra days total (14 + 7).

Task re-enabled after auto-disable: All abandon alerts were silently resolved on disable. On re-enablement, has_task_shards returns true and evaluation starts from scratch. For failures, ShardFailed.first_ts restarts from zero, so the user has at least 30 + 7 = 37 days before auto-disable. For idle, the re-enablement publication suppresses alerts for 14 days (it updates last_user_pub_at), then TaskIdle can fire if no data has moved. That gives 14 + 7 = 21 days before auto-disable.

Rollout

On merge, controllers immediately begin evaluating the four new alert types on every wake cycle. alert_history rows will be created for tasks that meet the firing conditions. However, no emails will be sent and no tasks will be disabled:

No emails: The fetch_recipients query matches alert_type = any(include_alert_types) against alert_subscriptions. The new alert types are not in DEFAULT_ALERT_TYPES (the list applied when creating subscriptions during onboarding), and no existing subscription rows include them. The alert_history rows will have empty recipients arrays, so the notification task will produce zero emails.

No disablement: DISABLE_ABANDONED_TASKS defaults to false. The controller evaluates alerts and computes DisableReason, but skips the maybe_disable_task publication.

Monitoring

Query alert_history for the new alert types to see what's being flagged:

select catalog_name, alert_type, fired_at, arguments
from alert_history
where alert_type in (
  'task_chronically_failing', 'task_auto_disabled_failing',
  'task_idle', 'task_auto_disabled_idle'
)
order by fired_at desc;

Review these results to validate detection accuracy before enabling notifications or disablement.

Enabling notifications

Two steps, which can be done independently:

  1. New tenants: Add the four alert types to DEFAULT_ALERT_TYPES in alert_subscriptions.rs. New subscriptions created during onboarding will then include them.

  2. Existing tenants: Run a migration to append the new types to include_alert_types for existing alert_subscriptions rows:

update alert_subscriptions
set include_alert_types = include_alert_types
  || '{task_chronically_failing,task_auto_disabled_failing,task_idle,task_auto_disabled_idle}'::alert_type[]
where not include_alert_types @> '{task_chronically_failing}'::alert_type[];

Enabling disablement

Set DISABLE_ABANDONED_TASKS=true in the agent's environment. This should only happen after notifications have been active long enough that affected users have received at least one warning email (minimum 7 days, since that's the grace period between warning and disable).

Environment variable knobs

Controller behavior is tuned by environment variables (threshold durations, failure counts, feature flags). Previously these were cached in LazyLock statics that were impossible to override in tests, since LazyLock initializes once and never changes.

EnvConfig<T> wraps a RwLock<T> that is lazily initialized from the environment on first access (same as before), but exposes a #[cfg(test)] .set() method so tests can override values without touching env vars or using unsafe. The read path is .val() (named to avoid collision with Itertools::get which is in scope in these modules).

Two macros build on it:

  • env_config_interval!(NAME, default) for chrono::Duration values parsed via humantime (Phil's original macro, now wrapping EnvConfig instead of bare Duration)
  • env_config_bool!(NAME, default) for boolean feature flags.
Variable Default Description
CHRONICALLY_FAILING_THRESHOLD 30 days How long ShardFailed must be continuously firing before TaskChronicallyFailing fires
CHRONICALLY_FAILING_DISABLE_AFTER 7 days Grace period between TaskChronicallyFailing firing and TaskAutoDisabledFailing firing (auto-disable)
IDLE_THRESHOLD 30 days How long without data movement (in catalog_stats_daily) before TaskIdle can fire
USER_PUB_THRESHOLD 14 days How long without a user publication before either sequence can fire. A user publication within this window suppresses both TaskChronicallyFailing and TaskIdle
IDLE_DISABLE_AFTER 7 days Grace period between TaskIdle firing and TaskAutoDisabledIdle firing (auto-disable)
DISABLE_ABANDONED_TASKS false When true, the controller publishes shards.disable = true when an auto-disable alert fires. When false, the alerts still fire and appear in alert_history, but no publication occurs

Related thresholds in the activation controller (not introduced here, but relevant to understanding the chronically-failing sequence):

Variable Default Description
ALERT_AFTER_SHARD_FAILURES 3 Number of recent shard failures within the retention window needed to fire ShardFailed
RESOLVE_SHARD_FAILED_ALERT_AFTER 2 hours Duration of continuous Ok shard status needed to resolve ShardFailed

Closes #2710

@jshearer jshearer marked this pull request as draft March 9, 2026 16:59
@jshearer jshearer force-pushed the agent/abandoned_tasks branch 8 times, most recently from 14b8e7c to cafcb9d Compare March 10, 2026 14:15

/// Helper for getting a `chrono::Duration` from an environment variable, using humantime so it
/// supports parsing durations like `3h`.
/// supports parsing durations like `3h`. The value is read once on first access. In test builds,
Copy link
Contributor Author

@jshearer jshearer Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of spiraled a bit. Happy to split it out into a separate PR if anyone cares, but the thought process is:

  • I needed a way to parse durations and booleans in abandon.rs, so I added env_config_bool as a copy of env_config_interval
  • I needed a way to change these values at runtime for testing, so I added EnvConfig with a #[cfg(test)] pub fn set method
  • I didn't like how ALERT_AFTER_SHARD_FAILURES and SUSTAINED_PRIMARY_MIN_CHECKS were all lonely by themselves so I made env_config_num and updated them to use it so that everything would be consistent

It's possible I could come up with some sort of meta-macro that reduces the duplication of these three, but macros are dense enough that I figured this is actually better.

@jshearer jshearer force-pushed the agent/abandoned_tasks branch from cafcb9d to 80223ec Compare March 10, 2026 16:29
@GregorShear
Copy link
Contributor

Just a test - ignore plz

@jshearer jshearer force-pushed the agent/abandoned_tasks branch from 80223ec to 42d28e5 Compare March 10, 2026 18:53
}

#[derive(Clone, Copy, Default)]
struct AbandonmentTimestamps {
Copy link
Contributor Author

@jshearer jshearer Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still waffling on if i I like this struct or not. I think it's justified because the alternative is using an unnamed tuple, which is confusing.

|| alerts_status.contains_key(&AlertType::TaskChronicallyFailing);
let user_pub_stale = last_user_pub_at.map_or(true, |ts| (now - ts) >= USER_PUB_THRESHOLD.get());

let last_data_movement_ts = if has_failure_alerts || !user_pub_stale {
Copy link
Contributor Author

@jshearer jshearer Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this complexity is all here just so we can short-circuit the query against catalog_stats if it's not needed. may be premature optimization...

@jshearer jshearer force-pushed the agent/abandoned_tasks branch from 42d28e5 to 2e9a064 Compare March 10, 2026 19:01
AlertType::TaskChronicallyFailing,
now,
format!("task shards have been failing since {failing_since}"),
activation.recent_failure_count,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right thing to pass for count here?

@jshearer jshearer force-pushed the agent/abandoned_tasks branch 2 times, most recently from 042ee41 to abaf4ba Compare March 10, 2026 21:29
// Only fire the alert once. On subsequent evaluations the stored
// disable_at and failing_since are the source of truth, so we don't
// shift the disable date if the env config changes between runs.
if !alerts_status.contains_key(&AlertType::TaskChronicallyFailing) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is worth pointing out. All of the other alerts call set_alert_firing on every invocation of the controller. I didn't want a scenario where we send someone an email saying that we'll disable their task in a week, then decide to shorten that timespan and they get their task disabled sooner, for example. So I decided that the alert itself should control, rather than re-computing the required timeframe every time. Thoughts? Is there any other reason we need to be calling alerts::set_alert_firing every run?

Adds two alert sequences that identify tasks that are persistently failing or not processing data, and progressively disable them.

* Sequence 1 fires `TaskChronicallyFailing` when `ShardFailed` has been continuously active for 30+ days with no user publication in 14 days, then fires `TaskAutoDisabledFailing` after a 7-day grace period.
* Sequence 2 fires `TaskIdle` when a task has moved no data for 30+ days with no recent user publication, then fires `TaskAutoDisabledIdle` after 7 days. Idle detection is suppressed while failure alerts are active so users receive only one notification path.

The controller now tracks `last_data_movement_ts` (from `catalog_stats_daily`) and `last_user_pub_at` (most recent non-system-user publication) to evaluate both sequences.
A user publication resets a 14-day suppression window before alerts can fire. All thresholds are configurable via environment variables, with disabling of tasks disabled by default
@jshearer jshearer force-pushed the agent/abandoned_tasks branch from abaf4ba to f6bf049 Compare March 10, 2026 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

agent: detect and disable abandoned tasks

2 participants