Ciaran/storage management #1675

Open
ciaranbor wants to merge 18 commits into main from ciaran/storage-management

@ciaranbor
Member

Motivation

  • Adds storage management to exo so nodes can enforce disk usage limits for downloaded models
  • Prevents nodes from filling up their disks with model downloads, with configurable per-node limits and eviction policies
[Screenshots: "Screenshot 2026-03-06 at 19 15 01", "Screenshot 2026-03-06 at 19 15 22"]

Changes

  • New types: StorageConfig (max storage + eviction policy), DownloadRejected download status, StorageConfigUpdated event, SetStorageConfig command
  • State/apply: StorageConfig added to cluster state; apply handles StorageConfigUpdated events
  • Storage utilities (shared/storage.py): pure functions for calculating used storage, checking quotas, computing LRU eviction candidates
  • Download coordinator: checks the storage quota before starting a download; if the node is over its limit, it either rejects the download (manual policy) or auto-evicts the least recently used models (auto-evict policy); tracks model usage timestamps in a TOML file; persists config to ~/.cache/exo/config.toml
  • Master: handles SetStorageConfig command, emits StorageConfigUpdated event
  • API: new PUT /storage/config endpoint for setting per-node storage config
  • CLI: --max-storage and --eviction-policy arguments
  • Dashboard: storage bar per node in downloads page header, gear icon opens a settings modal to configure max storage and eviction policy, DownloadRejected cells shown with orange warning icon and retry button
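
As a rough illustration of the pure storage utilities listed above, here is a minimal sketch. The function names are taken from the test plan in this PR, but the signatures, the `ModelOnDisk` type, and the byte-based units are assumptions made for illustration and may not match shared/storage.py.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOnDisk:
    """Hypothetical record of a downloaded model (not the PR's actual type)."""
    model_id: str
    size_bytes: int
    last_used: float  # Unix timestamp of the most recent use


def calculate_used_storage(models: list[ModelOnDisk]) -> int:
    """Total bytes occupied by downloaded models."""
    return sum(m.size_bytes for m in models)


def check_storage_quota(used_bytes: int, incoming_bytes: int, max_bytes: int) -> bool:
    """True if the incoming download fits under the configured limit."""
    return used_bytes + incoming_bytes <= max_bytes


def compute_evictions_needed(used_bytes: int, incoming_bytes: int, max_bytes: int) -> int:
    """Bytes that must be freed before the download fits (0 if it already fits)."""
    return max(0, used_bytes + incoming_bytes - max_bytes)


def get_lru_eviction_candidates(
    models: list[ModelOnDisk], bytes_to_free: int
) -> list[ModelOnDisk]:
    """Pick least recently used models until enough bytes would be freed.

    The result may free fewer bytes than requested if the models run out;
    the caller is assumed to check for that case.
    """
    candidates: list[ModelOnDisk] = []
    freed = 0
    for model in sorted(models, key=lambda m: m.last_used):
        if freed >= bytes_to_free:
            break
        candidates.append(model)
        freed += model.size_bytes
    return candidates
```

Keeping these as pure functions over plain values means they can be unit-tested without touching the filesystem, which is consistent with the dedicated test_storage.py in this PR.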

Why It Works

  • Storage checks happen at download-request time in the coordinator, so downloads are rejected/evicted before disk space runs out
  • LRU eviction uses tracked usage timestamps (updated on inference) so actively-used models are preserved
  • Config persists to disk so it survives restarts; state is also broadcast via event sourcing so the dashboard reflects current settings
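
The request-time check described above could be sketched as follows. This is a simplified illustration, not the coordinator's actual code: `EvictionPolicy`, `Decision`, and `decide_download` are hypothetical names, and the caller is assumed to supply models already ordered from least to most recently used.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class EvictionPolicy(str, Enum):
    # Policy names follow the PR description; the enum itself is hypothetical.
    MANUAL = "manual"
    AUTO_EVICT = "auto-evict"


@dataclass(frozen=True)
class Decision:
    """Outcome of a download request (hypothetical type)."""
    accepted: bool
    evict: tuple[str, ...] = ()  # model ids to delete before downloading
    reason: str = ""


def decide_download(
    needed_bytes: int,
    used_bytes: int,
    max_bytes: int,
    policy: EvictionPolicy,
    lru_models: list[tuple[str, int]],
) -> Decision:
    """Decide before any bytes are downloaded.

    lru_models holds (model_id, size_bytes) pairs, least recently used first.
    """
    overflow = used_bytes + needed_bytes - max_bytes
    if overflow <= 0:
        return Decision(accepted=True)
    if policy is EvictionPolicy.MANUAL:
        # Manual policy: reject and leave cleanup to the user.
        return Decision(accepted=False, reason="over storage limit")

    # Auto-evict policy: free space starting from the least recently used model.
    evict: list[str] = []
    freed = 0
    for model_id, size_bytes in lru_models:
        if freed >= overflow:
            break
        evict.append(model_id)
        freed += size_bytes
    if freed < overflow:
        # Even evicting every candidate would not make room.
        return Decision(accepted=False, reason="insufficient space after eviction")
    return Decision(accepted=True, evict=tuple(evict))
```

Because the decision is made before the download starts, a rejected request never writes to disk, and an accepted request under auto-evict knows exactly which models to remove first.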

Test Plan

Manual Testing

  • Configure storage limits via dashboard modal; verify downloads are rejected when over limit and auto-evicted when using auto-evict policy

Automated Testing

  • test_storage.py: 336 lines covering calculate_used_storage, check_storage_quota, compute_evictions_needed, get_lru_eviction_candidates with edge cases
  • test_apply_storage.py: tests StorageConfigUpdated event application to state
  • test_auto_eviction.py: 409 lines testing the coordinator's eviction logic including LRU ordering, multi-model eviction, and insufficient-space scenarios

ciaranbor force-pushed the ciaran/storage-management branch 3 times, most recently from 38f1dac to 7b64565 (March 7, 2026 13:54)
ciaranbor marked this pull request as draft (March 8, 2026 17:53)
ciaranbor force-pushed the ciaran/storage-management branch 5 times, most recently from da436e2 to 3d58627 (March 12, 2026 11:43)
ciaranbor marked this pull request as ready for review (March 12, 2026 11:43)
ciaranbor force-pushed the ciaran/storage-management branch 4 times, most recently from 7fc4ead to 3801b08 (March 19, 2026 11:55)
ciaranbor force-pushed the ciaran/storage-management branch 5 times, most recently from f33c5bc to 84eb4a3 (March 25, 2026 16:02)
ciaranbor force-pushed the ciaran/storage-management branch from 84eb4a3 to a4010fd (March 25, 2026 17:35)