
t1338.1: Extend model-routing.md with local tier #2320

@github-actions


Task ID: t1338.1 | Status: open | Estimate: ~1h | Plan: p032
Assignee: @alex-solovyev | Started: 2026-02-26T18:19:15Z
Tags: auto-dispatch

Description

Extend model-routing.md with local tier

Plan: Purpose

Add local AI model inference to aidevops via llama.cpp + HuggingFace, completing the cost spectrum from free (local) through budget (haiku) to premium (opus). Users get guided hardware-aware setup, access to any HuggingFace GGUF model, usage tracking, and disk cleanup recommendations.

Plan: Context & Architecture

Context

Why llama.cpp (not Ollama, LM Studio, or Jan.ai):

| Criterion | llama.cpp | Ollama | LM Studio | Jan.ai |
|---|---|---|---|---|
| License | MIT | MIT | Closed frontend | AGPL |
| Speed | Fastest (baseline) | 20-70% slower | Same (uses llama.cpp) | Same (uses llama.cpp) |
| Security | No daemon, localhost only | 175k+ exposed instances (Jan 2026), multiple CVEs | Desktop-safe | Desktop-safe |
| Binary size | 23-40 MB | ~200 MB | ~500 MB+ | ~300 MB+ |
| HuggingFace access | Direct GGUF download | Walled library | HF browser built-in | HF download |
| Control | Full (quantization, context, sampling) | Abstracted | GUI-mediated | GUI-mediated |
Key decision: download-on-first-use, not bundled. llama.cpp releases weekly (b8152 at time of writing, with daily commits), so bundling means shipping stale binaries. Binaries are also platform-specific, and local models are an optional feature: not every user wants them.
Binary sizes by platform (b8152):

| Platform | Size |
|---|---|
| macOS ARM64 | 29 MB |
| macOS x64 | 82 MB |
| Linux x64 (CPU) | 23 MB |
| Linux Vulkan | 40 MB |
| Linux ROCm | 130 MB |
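Download-on-first-use implies mapping the host platform to the right release asset before fetching anything. A minimal sketch in shell, assuming asset suffixes that mirror the platforms above; the real llama.cpp release asset names should be verified against the project's releases page:

```shell
#!/usr/bin/env sh
# Map an OS/arch pair to a llama.cpp release-asset suffix.
# Suffix strings here are illustrative, not the actual asset names.
llama_asset_suffix() {
  os="$1" arch="$2"
  case "$os-$arch" in
    Darwin-arm64)  echo "macos-arm64" ;;
    Darwin-x86_64) echo "macos-x64" ;;
    Linux-x86_64)  echo "linux-x64" ;;
    *)             echo "unsupported: $os/$arch" >&2; return 1 ;;
  esac
}

# Typical call site in a first-use helper:
llama_asset_suffix "$(uname -s)" "$(uname -m)" || true
```

Keeping the mapping in one function makes the "pin to known-good release" mitigation easy: the release tag and the suffix table live in a single place in the helper script.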

Risks

| Risk | Mitigation |
|---|---|
| llama.cpp binary API changes between releases | Pin to a known-good release in the helper script; test on update |
| HuggingFace API rate limits for search | Cache search results; fall back to huggingface-cli |
| Model recommendations become stale as new models release | Recommend by capability tier, not specific model names, where possible |
| Large model downloads fail mid-transfer | huggingface-cli handles resume; document manual resume |
| GPU detection unreliable across platforms | Graceful fallback to CPU-only with clear messaging |
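The GPU-detection risk can be handled with a probe-and-fall-back order that always terminates in a usable answer. A sketch, assuming vendor CLIs (nvidia-smi, vulkaninfo, rocminfo) as detection proxies; this is an approximation for illustration, not the framework's actual detection logic:

```shell
#!/usr/bin/env sh
# Pick a llama.cpp backend by probing for vendor tooling.
# "cpu" is the guaranteed fallback, so this never fails outright.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"
  elif command -v vulkaninfo >/dev/null 2>&1; then
    echo "vulkan"
  elif command -v rocminfo >/dev/null 2>&1; then
    echo "rocm"
  else
    echo "No GPU tooling found; using CPU-only build." >&2
    echo "cpu"
  fi
}

backend="$(detect_backend)"
echo "selected backend: $backend"
```

Printing the fallback notice to stderr keeps the function's stdout clean for callers while still giving the user the clear messaging the mitigation calls for.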

Plan: Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-02-25 | llama.cpp as primary runtime | MIT license, fastest, most secure; every other tool wraps it anyway |
| 2026-02-25 | Download-on-first-use, not bundled | Weekly releases, platform-specific, optional feature |
| 2026-02-25 | Single model-routing.md (extend, not a new file) | Local is just another tier in the same routing decision |
| 2026-02-25 | HuggingFace as model source (not the Ollama library) | Largest open repo, no walled garden, GGUF is standard |
| 2026-02-25 | SQLite for usage logging | Consistent with existing framework pattern |
| 2026-02-25 | 30-day cleanup threshold | Models are 2-50+ GB; generous but prevents unbounded disk growth |
| 2026-02-25 | No Ollama fallback in v1 | Users with Ollama can point at its API manually; add if demand exists |
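The 30-day cleanup threshold translates directly into an access-time scan over the model cache. A sketch, where the MODEL_DIR default is a hypothetical path for illustration, not one defined by the framework:

```shell
#!/usr/bin/env sh
# Recommend GGUF models untouched for 30+ days as cleanup candidates.
# MODEL_DIR default is hypothetical; the real cache path may differ.
MODEL_DIR="${MODEL_DIR:-$HOME/.cache/aidevops/models}"

stale_models() {
  # -atime +30: last accessed more than 30 days ago
  find "$1" -type f -name '*.gguf' -atime +30 2>/dev/null
}

stale_models "$MODEL_DIR" | while IFS= read -r f; do
  echo "cleanup candidate: $f"
done
```

Recommending rather than deleting fits the "generous" intent of the threshold: with models at 2-50+ GB, the user confirms removal, and a re-download is always possible.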

Synced from TODO.md by issue-sync-helper.sh
