Task ID: t1338.1 | Status: open | Estimate: ~1h | Plan: p032
Assignee: @alex-solovyev | Started: 2026-02-26T18:19:15Z
Tags: auto-dispatch
Description
Extend model-routing.md with local tier
Plan: Purpose
Add local AI model inference to aidevops via llama.cpp + HuggingFace, completing the cost spectrum from free (local) through budget (haiku) to premium (opus). Users get guided hardware-aware setup, access to any HuggingFace GGUF model, usage tracking, and disk cleanup recommendations.
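The free/budget/premium spectrum amounts to a routing decision. A minimal sketch of that decision, assuming a hypothetical complexity heuristic; the tier labels are illustrative only and do not reflect the actual model-routing.md rules:

```python
# Sketch of the free -> budget -> premium routing decision. The
# complexity heuristic and tier labels are illustrative assumptions,
# not the actual model-routing.md logic.

def route_model(task_complexity: str, local_available: bool) -> str:
    """Pick a tier: 'local' (free), 'haiku' (budget), or 'opus' (premium)."""
    if task_complexity == "low" and local_available:
        return "local"   # free: llama.cpp serving a GGUF model
    if task_complexity in ("low", "medium"):
        return "haiku"   # budget API tier
    return "opus"        # premium API tier
```

The local tier is only chosen when a model is actually installed, falling through to the budget tier otherwise.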
Plan: Context & Architecture
Context
Why llama.cpp (not Ollama, LM Studio, or Jan.ai):
| Criterion | llama.cpp | Ollama | LM Studio | Jan.ai |
| --- | --- | --- | --- | --- |
| License | MIT | MIT | Closed frontend | AGPL |
| Speed | Fastest (baseline) | 20-70% slower | Same (uses llama.cpp) | Same (uses llama.cpp) |
| Security | No daemon, localhost only | 175k+ exposed instances (Jan 2026), multiple CVEs | Desktop-safe | Desktop-safe |
| Binary size | 23-40 MB | ~200 MB | ~500 MB+ | ~300 MB+ |
| HuggingFace access | Direct GGUF download | Walled library | HF browser built-in | HF download |
| Control | Full (quantization, context, sampling) | Abstracted | GUI-mediated | GUI-mediated |

Key decision: download-on-first-use, not bundled. llama.cpp releases weekly (b8152 current, daily commits), so bundling means shipping stale binaries. Binaries are platform-specific (macOS ARM 29 MB, Linux x64 23 MB, Linux Vulkan 40 MB), and local models are an optional feature: not every user wants them.
Binary sizes by platform (b8152):

| Platform | Size |
| ---------- | ------ |
| macOS ARM64 | 29 MB |
| macOS x64 | 82 MB |
| Linux x64 (CPU) | 23 MB |
| Linux Vulkan | 40 MB |
| Linux ROCm | 130 MB |
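Hardware-aware setup starts by mapping the host platform to the matching release asset. A minimal sketch; the asset labels below are placeholders, not the real llama.cpp release file names:

```python
import platform

# Map (OS, arch) to a release-asset label. Labels are hypothetical
# placeholders; check the llama.cpp releases page for real file names.
_ASSETS = {
    ("Darwin", "arm64"): "macos-arm64",    # 29 MB
    ("Darwin", "x86_64"): "macos-x64",     # 82 MB
    ("Linux", "x86_64"): "linux-x64-cpu",  # 23 MB
}

def pick_asset(os_name: str = "", arch: str = "") -> str:
    """Return the asset label for this host, or raise if unsupported."""
    os_name = os_name or platform.system()
    arch = arch or platform.machine()
    asset = _ASSETS.get((os_name, arch))
    if asset is None:
        raise RuntimeError(f"no prebuilt llama.cpp binary for {os_name}/{arch}")
    return asset
```

Unsupported combinations fail loudly rather than guessing, which matches the "graceful fallback with clear messaging" mitigation below.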
Risks
| Risk | Mitigation |
| --- | --- |
| llama.cpp binary API changes between releases | Pin to known-good release in helper script, test on update |
| HuggingFace API rate limits for search | Cache search results, fall back to huggingface-cli |
| Model recommendations become stale as new models release | Recommend by capability tier, not specific model names, where possible |
| Large model downloads fail mid-transfer | huggingface-cli handles resume; document manual resume |
| GPU detection unreliable across platforms | Graceful fallback to CPU-only with clear messaging |
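Of these, the mid-transfer failure risk lends itself to a small code pattern: huggingface-cli already resumes partial transfers, so the helper only needs to re-invoke the download after a transient error. A generic retry sketch; the attempt count and backoff are arbitrary assumptions:

```python
import time

def retry(fn, attempts=3, delay=1.0):
    """Call fn(), retrying after transient failures with linear backoff.

    huggingface-cli resumes partial transfers on its own, so simply
    re-invoking the download is enough. Attempt count and delay here
    are arbitrary, not values from the actual helper script.
    """
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except OSError as err:  # narrow further for real network errors
            last_err = err
            if i < attempts - 1:
                time.sleep(delay * (i + 1))
    raise last_err
```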
Plan: Decision Log
| Date | Decision | Rationale |
| --- | --- | --- |
| 2026-02-25 | llama.cpp as primary runtime | MIT license, fastest, most secure, every other tool wraps it anyway |
| 2026-02-25 | Download-on-first-use, not bundled | Weekly releases, platform-specific, optional feature |
| 2026-02-25 | Single model-routing.md (extend, not new file) | Local is just another tier in the same routing decision |
| 2026-02-25 | HuggingFace as model source (not Ollama library) | Largest open repo, no walled garden, GGUF is standard |
| 2026-02-25 | SQLite for usage logging | Consistent with existing framework pattern |
| 2026-02-25 | 30-day cleanup threshold | Models are 2-50+ GB; generous but prevents unbounded disk growth |
| 2026-02-25 | No Ollama fallback in v1 | Users with Ollama can point at its API manually; add if demand exists |
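The SQLite usage log and the 30-day cleanup threshold combine naturally into one query. A sketch under an assumed schema; the table and column names are hypothetical, not the framework's actual logging schema:

```python
import sqlite3

# Hypothetical schema: one row per model invocation. The real
# framework schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS model_usage (
    model   TEXT NOT NULL,
    used_at TEXT NOT NULL DEFAULT (datetime('now')),
    size_gb REAL
);
"""

def stale_models(conn, days=30):
    """Return models whose most recent use is older than `days` days."""
    rows = conn.execute(
        """
        SELECT model FROM model_usage
        GROUP BY model
        HAVING max(used_at) < datetime('now', ?)
        """,
        (f"-{days} days",),
    )
    return [r[0] for r in rows]
```

Flagged models would then be surfaced as cleanup recommendations rather than deleted automatically, since a 2-50+ GB re-download is expensive.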
Synced from TODO.md by issue-sync-helper.sh