perf(native): improve inference hot paths and add parity tooling by leehack · Pull Request #61 · leehack/llamadart

leehack · 2026-02-22T02:37:06Z

Summary

Reduce native inference overhead in SDK hot paths by caching metadata, making prompt-token counting optional for create(...), batching worker stream chunks, and adding prompt-prefix reuse with deterministic full-replay fallback.
Optimize ChatSession context trimming with bounded turn-offset search, add configurable stream batching/reuse knobs in GenerationParams, and extend unit coverage for the new behavior.
Add native benchmark and prompt-reuse parity tools, wire CI parity checks in ci.yml, and prepare the 0.6.2 release notes/version/doc snippets.
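The prompt-prefix reuse with full-replay fallback described above could look roughly like the sketch below. It is illustrative only and is not the code from this PR: `commonPrefixLength`, `ReusePlan`, `planPromptReuse`, and the `minReuse` threshold are hypothetical names, and leaving at least one token to decode is an assumption about how a backend would obtain fresh logits.

```dart
/// Length of the shared leading run of token ids (hypothetical helper).
int commonPrefixLength(List<int> cached, List<int> next) {
  final limit = cached.length < next.length ? cached.length : next.length;
  var i = 0;
  while (i < limit && cached[i] == next[i]) {
    i++;
  }
  return i;
}

/// Hypothetical plan: how many cached tokens to keep and which to decode.
class ReusePlan {
  ReusePlan(this.reusedTokens, this.tokensToDecode);

  final int reusedTokens;
  final List<int> tokensToDecode;

  /// True when nothing is reused and the whole prompt is decoded again.
  bool get isFullReplay => reusedTokens == 0;
}

ReusePlan planPromptReuse(
  List<int> cached,
  List<int> next, {
  int minReuse = 1, // assumed tuning knob, not a llamadart parameter
}) {
  var shared = commonPrefixLength(cached, next);
  // Assumption: keep at least one token to decode so the last prompt
  // token is evaluated again and yields logits for sampling.
  if (shared == next.length && shared > 0) {
    shared--;
  }
  if (shared < minReuse) {
    // Fallback: discard the cache and replay the full prompt.
    return ReusePlan(0, List<int>.of(next));
  }
  return ReusePlan(shared, next.sublist(shared));
}

void main() {
  final plan = planPromptReuse([1, 2, 3, 4], [1, 2, 3, 9, 10]);
  print('reused=${plan.reusedTokens} decode=${plan.tokensToDecode}');
}
```

The fallback path depends only on the two token lists, so the same pair of prompts always yields the same plan. A parity tool can rely on that when it compares a reuse run against a full replay.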

Validation

dart analyze
dart test
dart run tool/testing/native_prompt_reuse_parity.dart --model "example/basic_app/models/qwen2.5-0.5b-instruct-q4_k_m.gguf" --prompt-file "tool/testing/prompts/native_prompt_reuse_parity_prompts.txt" --runs 2 --max-tokens 128 --stream-batch-tokens 8 --stream-batch-bytes 512 --fail-on-mismatch

Cut prompt/template and stream transport overhead by caching metadata, batching token messages, and reusing prompt prefixes with parity-safe fallbacks. This improves TTFT and throughput while keeping chat session context trimming bounded for long histories.
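As a rough illustration of batching token messages, the sketch below buffers token pieces until a token-count or byte-size threshold is reached and then emits them as one message. `ChunkBatcher` and its fields are hypothetical and are not the llamadart implementation. The two thresholds only mirror the idea behind the `--stream-batch-tokens` and `--stream-batch-bytes` flags in the validation command.

```dart
import 'dart:convert';

/// Hypothetical batcher: buffers token pieces and flushes on a threshold.
class ChunkBatcher {
  ChunkBatcher({required this.maxTokens, required this.maxBytes});

  final int maxTokens;
  final int maxBytes;
  final List<String> _pending = [];
  int _pendingBytes = 0;

  /// Adds one token's text. Returns the joined batch when either
  /// threshold is reached, otherwise null.
  String? add(String piece) {
    _pending.add(piece);
    _pendingBytes += utf8.encode(piece).length;
    if (_pending.length >= maxTokens || _pendingBytes >= maxBytes) {
      return flush();
    }
    return null;
  }

  /// Emits whatever is buffered (call once at end of generation).
  String? flush() {
    if (_pending.isEmpty) return null;
    final batch = _pending.join();
    _pending.clear();
    _pendingBytes = 0;
    return batch;
  }
}

void main() {
  final batcher = ChunkBatcher(maxTokens: 3, maxBytes: 512);
  for (final piece in ['Hel', 'lo', ',', ' wor', 'ld']) {
    final batch = batcher.add(piece);
    if (batch != null) print('send: $batch');
  }
  final tail = batcher.flush();
  if (tail != null) print('send: $tail');
}
```

Larger thresholds mean fewer messages crossing the worker boundary but a longer wait before the first text appears. That is presumably why the PR exposes both values as tunable knobs.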

Add native benchmarking/parity scripts and wire a CI parity job with a deterministic prompt set so prompt-prefix reuse regressions are caught automatically. Document the new workflow and tuning flags for reproducible perf validation.

Bump package/docs versions and add 0.6.2 release notes covering native inference performance improvements, benchmark/parity tooling, and CI parity validation.

codecov-commenter · 2026-02-22T02:42:13Z

Codecov Report

❌ Patch coverage is 93.10345% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.26%. Comparing base (b252507) to head (9505d29).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
lib/src/backends/llama_cpp/llama_cpp_service.dart	87.27%	7 Missing ⚠️
lib/src/core/engine/engine.dart	85.71%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
+ Coverage   76.91%   77.26%   +0.35%     
==========================================
  Files          65       66       +1     
  Lines        7930     8046     +116     
==========================================
+ Hits         6099     6217     +118     
+ Misses       1831     1829       -2

Flag	Coverage Δ
unittests	`77.26% <93.10%> (+0.35%)`	⬆️


leehack added 3 commits February 21, 2026 21:30

chore(release): prepare 0.6.2

9505d29


leehack merged commit c7c0ec0 into main Feb 22, 2026
6 checks passed

leehack deleted the perf/native-inference-optimization branch February 22, 2026 02:42
