
perf(native): improve inference hot paths and add parity tooling #61

Merged
leehack merged 3 commits into main from perf/native-inference-optimization
Feb 22, 2026
Conversation

@leehack
Owner

@leehack leehack commented Feb 22, 2026

Summary

  • Reduce native inference overhead in SDK hot paths by caching metadata, making prompt-token counting optional for create(...), batching worker stream chunks, and adding prompt-prefix reuse with deterministic full-replay fallback.
  • Optimize ChatSession context trimming with bounded turn-offset search, add configurable stream batching/reuse knobs in GenerationParams, and extend unit coverage for the new behavior.
  • Add native benchmark and prompt-reuse parity tools, wire CI parity checks in ci.yml, and prepare the 0.6.2 release notes/version/doc snippets.
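The worker stream batching described above can be sketched roughly as follows. This is an illustrative Python sketch, not the package's Dart implementation; `StreamBatcher` and its parameters are hypothetical names assuming flush thresholds analogous to the `--stream-batch-tokens` / `--stream-batch-bytes` knobs:

```python
class StreamBatcher:
    """Buffers streamed token pieces so each cross-worker message carries
    several tokens instead of one, flushing on either a token-count or a
    byte-size budget (assumed semantics; names are illustrative)."""

    def __init__(self, max_tokens=8, max_bytes=512, emit=print):
        self.max_tokens = max_tokens  # cf. --stream-batch-tokens
        self.max_bytes = max_bytes    # cf. --stream-batch-bytes
        self.emit = emit              # callback receiving one batched chunk
        self.buf = []
        self.size = 0

    def add(self, piece: str):
        # Accumulate a decoded token piece; flush once either budget is hit.
        self.buf.append(piece)
        self.size += len(piece.encode("utf-8"))
        if len(self.buf) >= self.max_tokens or self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        # Emit whatever is buffered (also called once at end-of-stream).
        if self.buf:
            self.emit("".join(self.buf))
            self.buf = []
            self.size = 0
```

The point of the two thresholds is that either many small tokens or a few large multi-byte pieces will trigger a send, keeping per-message overhead bounded in both directions.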

Validation

  • dart analyze
  • dart test
  • dart run tool/testing/native_prompt_reuse_parity.dart --model "example/basic_app/models/qwen2.5-0.5b-instruct-q4_k_m.gguf" --prompt-file "tool/testing/prompts/native_prompt_reuse_parity_prompts.txt" --runs 2 --max-tokens 128 --stream-batch-tokens 8 --stream-batch-bytes 512 --fail-on-mismatch

Cut prompt/template and stream transport overhead by caching metadata, batching token messages, and reusing prompt prefixes with parity-safe fallbacks. This improves TTFT and throughput while keeping chat session context trimming bounded for long histories.
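The bounded context trimming mentioned above might look roughly like this. A minimal Python sketch under assumed semantics (per-turn token counts, a token budget, and a cap on how many turn boundaries are scanned); the function and parameter names are illustrative, not the SDK's API:

```python
def bounded_trim_offset(turn_token_counts, budget, max_scan=32):
    """Return how many leading turns to drop so the remaining history
    fits within `budget` tokens, scanning at most `max_scan` turn
    boundaries from the start so trimming stays O(max_scan) even for
    very long histories (assumed behavior)."""
    total = sum(turn_token_counts)
    drop = 0
    while total > budget and drop < min(len(turn_token_counts), max_scan):
        total -= turn_token_counts[drop]
        drop += 1
    return drop
```

Bounding the scan trades a perfectly tight fit for predictable trimming cost on long chat histories, which matches the "bounded turn-offset search" framing in the summary.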
Add native benchmarking/parity scripts and wire a CI parity job with a deterministic prompt set so prompt-prefix reuse regressions are caught automatically. Document the new workflow and tuning flags for reproducible perf validation.
Bump package/docs versions and add 0.6.2 release notes covering native inference performance improvements, benchmark/parity tooling, and CI parity validation.
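Prompt-prefix reuse with a deterministic full-replay fallback can be sketched as follows. This is a hedged Python illustration of the general technique (reuse the KV cache for the longest common token prefix with the previous prompt; on no usable match, replay everything), not the package's Dart code, and all names are hypothetical:

```python
def plan_prompt_eval(cached_tokens, prompt_tokens):
    """Return (n_reuse, tokens_to_eval): how many leading tokens of the
    new prompt match the cached prompt (and so can keep their KV-cache
    entries), and which tokens must still be evaluated. At least one
    token is always re-evaluated so decoding has fresh logits; a
    mismatch at position 0 degenerates to a full replay, the
    deterministic fallback (assumed behavior; names are illustrative)."""
    n = 0
    limit = min(len(cached_tokens), len(prompt_tokens) - 1)
    while n < limit and cached_tokens[n] == prompt_tokens[n]:
        n += 1
    return n, prompt_tokens[n:]
```

Because the fallback path evaluates the entire prompt exactly as a cold run would, reused and non-reused runs should produce identical outputs, which is what the parity tool above is designed to verify.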
@leehack leehack merged commit c7c0ec0 into main Feb 22, 2026
6 checks passed
@leehack leehack deleted the perf/native-inference-optimization branch February 22, 2026 02:42
@codecov-commenter

Codecov Report

❌ Patch coverage is 93.10345% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.26%. Comparing base (b252507) to head (9505d29).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
lib/src/backends/llama_cpp/llama_cpp_service.dart 87.27% 7 Missing ⚠️
lib/src/core/engine/engine.dart 85.71% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
+ Coverage   76.91%   77.26%   +0.35%     
==========================================
  Files          65       66       +1     
  Lines        7930     8046     +116     
==========================================
+ Hits         6099     6217     +118     
+ Misses       1831     1829       -2     
Flag Coverage Δ
unittests 77.26% <93.10%> (+0.35%) ⬆️

Flags with carried forward coverage won't be shown.

