Conversation
14b8e7c to
cafcb9d
Compare
|
|
||
| /// Helper for getting a `chrono::Duration` from an environment variable, using humantime so it | ||
| /// supports parsing durations like `3h`. | ||
| /// supports parsing durations like `3h`. The value is read once on first access. In test builds, |
There was a problem hiding this comment.
This kind of spiraled a bit. Happy to split it out into a separate PR if anyone cares, but the thought process is:
- I needed a way to parse durations and booleans in
abandon.rs, so I addedenv_config_boolas a copy ofenv_config_interval - I needed a way to change these values at runtime for testing, so I added
EnvConfigwith a#[cfg(test)] pub fn setmethod - I didn't like how
ALERT_AFTER_SHARD_FAILURESandSUSTAINED_PRIMARY_MIN_CHECKSwere all lonely by themselves so I madeenv_config_numand updated them to use it so that everything would be consistent
It's possible I could come up with some sort of meta-macro that reduces the duplication of these three, but macros are dense enough that I figured this is actually better.
cafcb9d to
80223ec
Compare
|
Just a test - ignore plz |
80223ec to
42d28e5
Compare
| } | ||
|
|
||
| #[derive(Clone, Copy, Default)] | ||
| struct AbandonmentTimestamps { |
There was a problem hiding this comment.
Still waffling on if i I like this struct or not. I think it's justified because the alternative is using an unnamed tuple, which is confusing.
| || alerts_status.contains_key(&AlertType::TaskChronicallyFailing); | ||
| let user_pub_stale = last_user_pub_at.map_or(true, |ts| (now - ts) >= USER_PUB_THRESHOLD.get()); | ||
|
|
||
| let last_data_movement_ts = if has_failure_alerts || !user_pub_stale { |
There was a problem hiding this comment.
this complexity is all here just so we can short-circuit the query against catalog_stats if it's not needed. may be premature optimization...
42d28e5 to
2e9a064
Compare
| AlertType::TaskChronicallyFailing, | ||
| now, | ||
| format!("task shards have been failing since {failing_since}"), | ||
| activation.recent_failure_count, |
There was a problem hiding this comment.
is this the right thing to pass for count here?
042ee41 to
abaf4ba
Compare
| // Only fire the alert once. On subsequent evaluations the stored | ||
| // disable_at and failing_since are the source of truth, so we don't | ||
| // shift the disable date if the env config changes between runs. | ||
| if !alerts_status.contains_key(&AlertType::TaskChronicallyFailing) { |
There was a problem hiding this comment.
This is worth pointing out. All of the other alerts call set_alert_firing on every invocation of the controller. I didn't want a scenario where we send someone an email saying that we'll disable their task in a week, then decide to shorten that timespan and they get their task disabled sooner, for example. So I decided that the alert itself should control, rather than re-computing the required timeframe every time. Thoughts? Is there any other reason we need to be calling alerts::set_alert_firing every run?
Adds two alert sequences that identify tasks that are persistently failing or not processing data, and progressively disable them. * Sequence 1 fires `TaskChronicallyFailing` when `ShardFailed` has been continuously active for 30+ days with no user publication in 14 days, then fires `TaskAutoDisabledFailing` after a 7-day grace period. * Sequence 2 fires `TaskIdle` when a task has moved no data for 30+ days with no recent user publication, then fires `TaskAutoDisabledIdle` after 7 days. Idle detection is suppressed while failure alerts are active so users receive only one notification path. The controller now tracks `last_data_movement_ts` (from `catalog_stats_daily`) and `last_user_pub_at` (most recent non-system-user publication) to evaluate both sequences. A user publication resets a 14-day suppression window before alerts can fire. All thresholds are configurable via environment variables, with disabling of tasks disabled by default
abaf4ba to
f6bf049
Compare
Summary
Implements abandoned task detection from #2710. Two alert sequences identify tasks that are persistently failing or not processing data, warn the user, and then disable the task after a grace period. Auto-disable publication is gated behind
DISABLE_ABANDONED_TASKSenv var (off by default).Sequence 1: Chronically Failing
TaskChronicallyFailingfires when all of:has_task_shardsis true (enabled, non-Dekaf task with shards)ShardFailedhas been continuously firing >= 30 days (CHRONICALLY_FAILING_THRESHOLD)USER_PUB_THRESHOLD)Since
ShardFailedonly resolves after 2 hours of continuous healthy operation, the 30-day condition means the task has not stayed alive for 2 consecutive hours at any point during that window.After 7 days (
CHRONICALLY_FAILING_DISABLE_AFTER),TaskAutoDisabledFailingfires and the task is disabled.Resolves when
ShardFailedresolves, a user publishes, or the task is disabled. A user publication resets a 14-day suppression window before the alert can re-fire.ShardFailed.first_ts(the 30-day clock) is unaffected by publications and only resets when the task recovers for 2+ hours. However, whenTaskChronicallyFailingresolves and later re-fires, its ownfirst_ts(which anchors the 7-day disable grace period) resets to the new fire time.Sequence 2: Idle
TaskIdlefires when all of:has_task_shardsis trueShardFailedorTaskChronicallyFailingalert is activecatalog_stats_dailywithin 30 days (IDLE_THRESHOLD)USER_PUB_THRESHOLD)After 7 days (
IDLE_DISABLE_AFTER),TaskAutoDisabledIdlefires and the task is disabled.Idle detection is suppressed while failure alerts are active. A task that is both failing and idle follows Sequence 1 only, so users receive one notification path. New tasks are protected by their initial publication, which sets
last_user_pub_atand provides 14 days of suppression.Scenarios
Task with intermittent 3-hour runs: Shard reaches PRIMARY, runs for 3 hours, then fails. Each cycle,
ShardFailedresolves (2+ hours healthy) then re-fires, resettingShardFailed.first_ts.TaskChronicallyFailingnever fires. This is intentional: the task is doing some work each cycle. If it doesn't move any data during the period where it'sPRIMARY, theTaskIdlealert will take over.Healthy task that never moves data: A task stays in
PRIMARYwith no failures, but no data flows through it (e.g. a Google Sheets capture where the sheet is empty). NoShardFailedfires, so the chronically-failing path is never involved. After 30 days with no data movement and no user publication in 14 days,TaskIdlefires. After 37 days total, the task is auto-disabled.Failing task that also has no data movement: A task starts erroring and never moves data.
ShardFailedfires within hours (after 3 failures), which immediately suppressesTaskIdlefor as long as it remains active. Even though the task meets all other idle conditions,TaskIdlecannot fire whileShardFailedorTaskChronicallyFailingexists in the alert map. The task follows the chronically-failing path:TaskChronicallyFailingat 30 days, auto-disable at 37. If the task later recovers andShardFailedresolves but still has no data movement, idle evaluation resumes andTaskIdlecan fire on the next evaluation.Failing task where user publishes a fix attempt:
TaskChronicallyFailingfires on day 30. User publishes on day 31, which resolves the alert silently. If the task continues failing, the alert re-fires on day 45 (when the publication ages past 14 days) with a freshTaskChronicallyFailing.first_tsand a new 7-day grace period.ShardFailed.first_tsremains day 0 throughout. The user publication bought 21 extra days total (14 + 7).Task re-enabled after auto-disable: All abandon alerts were silently resolved on disable. On re-enablement,
has_task_shardsreturns true and evaluation starts from scratch. For failures,ShardFailed.first_tsrestarts from zero, so the user has at least 30 + 7 = 37 days before auto-disable. For idle, the re-enablement publication suppresses alerts for 14 days (it updateslast_user_pub_at), thenTaskIdlecan fire if no data has moved. That gives 14 + 7 = 21 days before auto-disable.Rollout
On merge, controllers immediately begin evaluating the four new alert types on every wake cycle.
alert_historyrows will be created for tasks that meet the firing conditions. However, no emails will be sent and no tasks will be disabled:No emails: The
fetch_recipientsquery matchesalert_type = any(include_alert_types)againstalert_subscriptions. The new alert types are not inDEFAULT_ALERT_TYPES(the list applied when creating subscriptions during onboarding), and no existing subscription rows include them. Thealert_historyrows will have emptyrecipientsarrays, so the notification task will produce zero emails.No disablement:
DISABLE_ABANDONED_TASKSdefaults tofalse. The controller evaluates alerts and computesDisableReason, but skips themaybe_disable_taskpublication.Monitoring
Query
alert_historyfor the new alert types to see what's being flagged:Review these results to validate detection accuracy before enabling notifications or disablement.
Enabling notifications
Two steps, which can be done independently:
New tenants: Add the four alert types to
DEFAULT_ALERT_TYPESinalert_subscriptions.rs. New subscriptions created during onboarding will then include them.Existing tenants: Run a migration to append the new types to
include_alert_typesfor existingalert_subscriptionsrows:Enabling disablement
Set
DISABLE_ABANDONED_TASKS=truein the agent's environment. This should only happen after notifications have been active long enough that affected users have received at least one warning email (minimum 7 days, since that's the grace period between warning and disable).Environment variable knobs
Controller behavior is tuned by environment variables (threshold durations, failure counts, feature flags). Previously these were cached in
LazyLockstatics that were impossible to override in tests, sinceLazyLockinitializes once and never changes.EnvConfig<T>wraps aRwLock<T>that is lazily initialized from the environment on first access (same as before), but exposes a#[cfg(test)].set()method so tests can override values without touching env vars or usingunsafe. The read path is.val()(named to avoid collision withItertools::getwhich is in scope in these modules).Two macros build on it:
env_config_interval!(NAME, default)forchrono::Durationvalues parsed via humantime (Phil's original macro, now wrappingEnvConfiginstead of bareDuration)env_config_bool!(NAME, default)for boolean feature flags.CHRONICALLY_FAILING_THRESHOLDShardFailedmust be continuously firing beforeTaskChronicallyFailingfiresCHRONICALLY_FAILING_DISABLE_AFTERTaskChronicallyFailingfiring andTaskAutoDisabledFailingfiring (auto-disable)IDLE_THRESHOLDcatalog_stats_daily) beforeTaskIdlecan fireUSER_PUB_THRESHOLDTaskChronicallyFailingandTaskIdleIDLE_DISABLE_AFTERTaskIdlefiring andTaskAutoDisabledIdlefiring (auto-disable)DISABLE_ABANDONED_TASKSfalsetrue, the controller publishesshards.disable = truewhen an auto-disable alert fires. Whenfalse, the alerts still fire and appear inalert_history, but no publication occursRelated thresholds in the activation controller (not introduced here, but relevant to understanding the chronically-failing sequence):
ALERT_AFTER_SHARD_FAILURESShardFailedRESOLVE_SHARD_FAILED_ALERT_AFTEROkshard status needed to resolveShardFailedCloses #2710