
t1338.1: Extend model-routing.md with local tier #2320

@github-actions


Task ID: t1338.1 | Status: open | Estimate: ~1h | Plan: p032
Assignee: @alex-solovyev | Started: 2026-02-26T18:19:15Z
Tags: auto-dispatch

Description

Extend model-routing.md with local tier

Plan: Purpose

Add local AI model inference to aidevops via llama.cpp + HuggingFace, completing the cost spectrum from free (local) through budget (haiku) to premium (opus). Users get guided hardware-aware setup, access to any HuggingFace GGUF model, usage tracking, and disk cleanup recommendations.

Plan: Context & Architecture

Context

Why llama.cpp (not Ollama, LM Studio, or Jan.ai):

| Criterion | llama.cpp | Ollama | LM Studio | Jan.ai |
|---|---|---|---|---|
| License | MIT | MIT | Closed frontend | AGPL |
| Speed | Fastest (baseline) | 20-70% slower | Same (uses llama.cpp) | Same (uses llama.cpp) |
| Security | No daemon, localhost only | 175k+ exposed instances (Jan 2026), multiple CVEs | Desktop-safe | Desktop-safe |
| Binary size | 23-40 MB | ~200 MB | ~500 MB+ | ~300 MB+ |
| HuggingFace access | Direct GGUF download | Walled library | HF browser built-in | HF download |
| Control | Full (quantization, context, sampling) | Abstracted | GUI-mediated | GUI-mediated |
Key decision: download-on-first-use, not bundled. llama.cpp releases weekly (b8152 at time of writing, with daily commits), so bundling means shipping stale binaries. Binaries are also platform-specific, and local models are an optional feature: not every user wants them.
Binary sizes by platform (b8152):

| Platform | Size |
|---|---|
| macOS ARM64 | 29 MB |
| macOS x64 | 82 MB |
| Linux x64 (CPU) | 23 MB |
| Linux Vulkan | 40 MB |
| Linux ROCm | 130 MB |
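Download-on-first-use implies mapping the host platform to the right release asset before fetching anything. A minimal sketch in shell, assuming asset suffixes that mirror the platforms above; the real llama.cpp release asset names should be verified against the project's releases page:

```shell
#!/usr/bin/env sh
# Map an OS/arch pair to a llama.cpp release-asset suffix.
# Suffix strings here are illustrative, not the actual asset names.
llama_asset_suffix() {
  os="$1" arch="$2"
  case "$os-$arch" in
    Darwin-arm64)  echo "macos-arm64" ;;
    Darwin-x86_64) echo "macos-x64" ;;
    Linux-x86_64)  echo "linux-x64" ;;
    *)             echo "unsupported: $os/$arch" >&2; return 1 ;;
  esac
}

# Typical call site in a first-use helper:
llama_asset_suffix "$(uname -s)" "$(uname -m)" || true
```

Keeping the mapping in one function makes the "pin to known-good release" mitigation easy: the release tag and the suffix table live in a single place in the helper script.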

Risks

| Risk | Mitigation |
|---|---|
| llama.cpp binary API changes between releases | Pin to a known-good release in the helper script; test on update |
| HuggingFace API rate limits for search | Cache search results; fall back to huggingface-cli |
| Model recommendations become stale as new models release | Recommend by capability tier, not specific model names, where possible |
| Large model downloads fail mid-transfer | huggingface-cli handles resume; document manual resume |
| GPU detection unreliable across platforms | Graceful fallback to CPU-only with clear messaging |
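The GPU-detection risk can be handled with a probe-and-fall-back order that always terminates in a usable answer. A sketch, assuming vendor CLIs (nvidia-smi, vulkaninfo, rocminfo) as detection proxies; this is an approximation for illustration, not the framework's actual detection logic:

```shell
#!/usr/bin/env sh
# Pick a llama.cpp backend by probing for vendor tooling.
# "cpu" is the guaranteed fallback, so this never fails outright.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"
  elif command -v vulkaninfo >/dev/null 2>&1; then
    echo "vulkan"
  elif command -v rocminfo >/dev/null 2>&1; then
    echo "rocm"
  else
    echo "No GPU tooling found; using CPU-only build." >&2
    echo "cpu"
  fi
}

backend="$(detect_backend)"
echo "selected backend: $backend"
```

Printing the fallback notice to stderr keeps the function's stdout clean for callers while still giving the user the clear messaging the mitigation calls for.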

Plan: Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-02-25 | llama.cpp as primary runtime | MIT license, fastest, most secure; every other tool wraps it anyway |
| 2026-02-25 | Download-on-first-use, not bundled | Weekly releases, platform-specific, optional feature |
| 2026-02-25 | Single model-routing.md (extend, not a new file) | Local is just another tier in the same routing decision |
| 2026-02-25 | HuggingFace as model source (not the Ollama library) | Largest open repo, no walled garden, GGUF is standard |
| 2026-02-25 | SQLite for usage logging | Consistent with existing framework pattern |
| 2026-02-25 | 30-day cleanup threshold | Models are 2-50+ GB; generous but prevents unbounded disk growth |
| 2026-02-25 | No Ollama fallback in v1 | Users with Ollama can point at its API manually; add if demand exists |
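The 30-day cleanup threshold translates directly into an access-time scan over the model cache. A sketch, where the MODEL_DIR default is a hypothetical path for illustration, not one defined by the framework:

```shell
#!/usr/bin/env sh
# Recommend GGUF models untouched for 30+ days as cleanup candidates.
# MODEL_DIR default is hypothetical; the real cache path may differ.
MODEL_DIR="${MODEL_DIR:-$HOME/.cache/aidevops/models}"

stale_models() {
  # -atime +30: last accessed more than 30 days ago
  find "$1" -type f -name '*.gguf' -atime +30 2>/dev/null
}

stale_models "$MODEL_DIR" | while IFS= read -r f; do
  echo "cleanup candidate: $f"
done
```

Recommending rather than deleting fits the "generous" intent of the threshold: with models at 2-50+ GB, the user confirms removal, and a re-download is always possible.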

Synced from TODO.md by issue-sync-helper.sh
