Include Python dependencies in README #6
Merged
ggerganov merged 1 commit into ggml-org:master on Mar 11, 2023
Conversation
simonw (Author, Contributor)
added a commit
to simonw/til
that referenced
this pull request
Mar 11, 2023
This is documented in the LLaMA README now: - ggml-org/llama.cpp#6
SlyEcho pushed a commit to SlyEcho/llama.cpp that referenced this pull request on May 31, 2023:
Buffer incomplete multibyte characters + other stuff.
ggerganov pushed a commit that referenced this pull request on Oct 19, 2023:
fix compilation errors with llvm
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request on Dec 20, 2023:
…gml-org#6)
* deprecate ffn_b
* get tensor offloading levels
* wip: split tensor loading
* wip: framework of loading sparse model tensors
* save and flush gpu alloc buffer
* vram budget will fall back to remaining free memory
* minor: remove vram safety margin
* add options for vram budget; clean old env vars
* minor: bugfix
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request on Dec 20, 2023:
* Update demo video in README.md
* Update demo at README.md
This was referenced Apr 7, 2024
younesbelkada pushed a commit to younesbelkada/llama.cpp that referenced this pull request on May 15, 2025:
Fix model architecture name
gaugarg-nv pushed a commit to gaugarg-nv/llama.cpp that referenced this pull request on Feb 16, 2026:
Support device-specific host buffer types in meta backend
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request on Mar 3, 2026:

Fix all issues from spec review and quality review of the expert prefetch DMA engine (commit f60f94f):

Spec fixes:
- Req 3: Add design decision comment explaining why hint() is called at MoE dispatch with multi-layer lookahead instead of pre-attention, and why this gives equivalent DMA/compute overlap
- Req 5: Implement hit-rate disable loop — when prediction accuracy drops below 30%, set prefetch_disabled_ flag and short-circuit hint() early with log message

Critical fix:
- ggml-org#1: Deadlock in await() — extract sycl::event copy while holding lock, release lock before blocking event.wait(), re-acquire to update state

Important fixes:
- ggml-org#2: TOCTOU in hint_batch_adaptive() — hold mutex_ across the entire function so budget snapshot and consumption are atomic
- ggml-org#3: has_capacity() counted completed entries — now counts only active (non-completed) in-flight entries
- ggml-org#4: gc_completed() safety — add explicit comment tying the safety invariant to the synchronous call chain (ggml_sycl_mul_mat_id -> await -> kernel dispatch -> stream->wait)

Minor fixes:
- ggml-org#6: Rename PrefetchRequest to prefetch_request (snake_case convention)
- ggml-org#7: Log warning when all VRAM pool slots fail, permanently disable
- ggml-org#8: Add initialized_ guard at top of hint_batch()
- ggml-org#9: Add clarifying comment on n_miss_total <= max_inflight_ check
- ggml-org#10: Remove dead using alias expert_prefetcher = ExpertPrefetcher
- ggml-org#11: Rename accuracy_total_ to window_total_ for clarity

Refactored hint() into hint_locked() internal helper so hint_batch() and hint_batch_adaptive() can hold the lock and call it directly, eliminating recursive locking and TOCTOU races.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request on Mar 16, 2026:
* redo: add convert nodes (reverts commit 8448acd)
* align clang format with cann
* rename binary_op -> general_op because some ops take only 1 param
* Revert "rename binary_op -> general_op" (reverts commit 5be63b1)
* wip
* add GGML_OP_PERMUTE
* add GGML_OP_VIEW and GGML_OP_GET_ROWS
* wip
* Revert "wip" (reverts commit 772462c)
TheTom referenced this pull request in TheTom/llama-cpp-turboquant on Mar 27, 2026:

Complete experiment log:
- #1 4-mag LUT: 15.1 at 8K (BEST, +38%)
- #2 Batched extract: 13.7 (+25%)
- #3 Inline FA block: 13.5 (I-cache pressure)
- #4 Deferred norm: 12.9 (loses ILP)
- #5 2-pair half2: 12.0 (ternary overhead)
- #6 Select chain: 11.9 (branches kill)
- #7 Bit-arithmetic: 11.6 (ALU too heavy)
- #8 FMA branchless: 11.4 (ALU still too heavy)
- #9 Named-reg ternary: 10.3 (branches worst)
- #10 Main (8-LUT): 10.95 (baseline)
- #11 Non-vec FA: 10.2 (wrong kernel)

Ceiling: 24.5 (no dequant)

Apple8 hardware truth:
- 1 divergent constant read < 7 ALU ops (even with fma)
- Branches cost MORE than divergent constant reads
- Array indexing ALWAYS spills on Metal
- 4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
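The winning idea in that log, a 4-entry magnitude LUT, amounts to decoding packed 2-bit codes through a tiny table instead of branches or bit arithmetic. A rough sketch in Python for clarity; the table values and packing scheme here are hypothetical, and the real kernel is Metal, not reproduced here:

```python
# Hypothetical 4-entry magnitude table; a real quant format would
# derive these values from its spec.
LUT = (0.0, 1.0, -1.0, 2.0)

def dequant_byte(packed, scale=1.0):
    """Unpack four 2-bit codes from one byte, decode each via the LUT,
    and apply a per-block scale."""
    return [LUT[(packed >> shift) & 0x3] * scale
            for shift in (0, 2, 4, 6)]

# 0b11100100 packs codes 0, 1, 2, 3 from low bits to high.
print(dequant_byte(0b11100100))  # [0.0, 1.0, -1.0, 2.0]
```

The point of the log's "4 constant addresses" observation is that a table this small stays in constant memory and costs one read per element, whereas the branchy alternatives paid more in ALU work.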
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request on Apr 11, 2026:

…s (dleex)
- l5ct0: Add pre_allocate_runtime_chunks() to pinned_chunk_pool and host_cache to prevent lazy runtime pool growth during inference. Called after zone configuration with onednn_scratchpad + dma_staging_pool bytes.
- 4f4o3: GGML_SYCL_HOST_ALLOC_PHASE_GATE default changed from 0 to 1 (now unblocked by l5ct0 pre-allocation)
- dleex ggml-org#1: Document name shadowing in binbcast.cpp:594 as intentional (required by GGML_TENSOR_BINARY_OP_LOCALS macro)
- dleex ggml-org#4: Add GGML_ASSERT bounds checking to sycl_tensor::ne()/nb()
- dleex ggml-org#5: Add null assertion in sycl_tensor::resolve_as<T>() to catch unresolved tensor data early
- dleex ggml-org#7: Replace silent catch in fattn.cpp resolve_host_seq_ids with GGML_LOG_WARN fallback message
- dleex ggml-org#2,3,6,8: Deferred — ggml-org#2 is consistent naming, ggml-org#3 addressed by accessor migration (1vy5r), ggml-org#6 is design tension with const_cast, ggml-org#8 is a review note on past commits
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request on Apr 16, 2026:
Complete experiment log (same commit message as in TheTom's entry above).
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request on Apr 17, 2026:
Complete experiment log (same commit message as in TheTom's entry above).
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request on Apr 20, 2026:
Complete experiment log (same commit message as in TheTom's entry above).
rocktw added a commit to rocktw/llama.cpp that referenced this pull request on Apr 20, 2026:

Finishes the mid + lower tier GATE5_TODO.md list except item ggml-org#8 (huron machine variant, deferred for a separate session) and ggml-org#12 (blocked on an upstream llm_grammar parser limitation, wire-verified).

### Phase E1 — sampler edge cases (items ggml-org#5, ggml-org#9, ggml-org#10, ggml-org#11)

New SAMPLER_EDGE_CASES list (11 cases) flowing through the v3 wire path with explicit slice_cap:
- Slice-fetch stress (ggml-org#5): slice_cap=1 (max fetch count), slice_cap=1024 (one-shot), slice_cap=37 (prime remainder)
- Top-n-sigma shapes (ggml-org#9): σ=0.5 aggressive, σ=5.0 loose
- Penalty_last_n boundaries (ggml-org#10): window==history_len, window<history_len, history=empty + active penalty
- DRY allowed_length boundaries (ggml-org#11): allowed=0, allowed=1, allowed=L-1

### Phase E2 — adversarial tokenizer prompts (item ggml-org#6)

6 new entries in PROMPTS covering BOM-prefixed ASCII, zero-width joiner mid-word, Unicode Private Use Area codepoint, mixed CR/LF/CRLF line endings, all-control-character run, and a 220-byte repeated sentence. Size ceiling documented in GATE5_LESSONS.md §8 — pushing beyond ~300 bytes of repeated text overflows the MCU's 64 KB per-request arena and triggers silent truncation. Arena sizing is a deployment tuning decision tracked in 06_MCU_MEMORY_PLAN.md.

### Phase E3 — IPC chaos / error-path validation (item ggml-org#7)

New --mode chaos runner. 6 malformed-frame cases each verify:
a) firmware returns the correct error status
b) firmware stays in sync (recovery probe with a valid OP_TOKENIZE returns status=0)

Cases: bad magic, oversize prompt_len (5000 > 4096 MAX), unsupported opcode, OP_SAMPLE with params_len=50 (neither v1 40 nor v3 ≥92), OP_SAMPLE v3 with K=9999 (> MAX_E2E_K=128), OP_GRAMMAR_SET with n_vocab=300 (> MAX_GRAMMAR_VOCAB=256).

Incidental fix: read_status in chaos mode was draining only the 16-byte main response header; SAMPLE responses carry a trailing 8-byte sub-header (selected_id + n_survivors + rsv) even on error paths, so the chaos helper now takes an op_was hint and drains the sub when op_was == OP_SAMPLE. Without this the post-oversize recovery probe saw the previous sub bytes as the new response magic. Captured this as an inline comment — no lesson-doc entry because it's a test-harness thing, not a firmware contract.

### Phase E1 item ggml-org#12 — custom grammar root name (partial)

OP_GRAMMAR_SET payload grows an optional root_name field: the spare byte at tail[1] (`has_root_name`) gates a trailing `u32 root_name_len | bytes`. Wire plumbing is verified by one passing case (explicit root_name="root" matches default behaviour). Blocked follow-up: the upstream llm_grammar parser rejects non-"root" rule names as the entry point even when they are defined in the GBNF. A standalone host-side test reproduces the rejection (`custom_root ::= "abc"` with root="custom_root" → LLM_ERR_INVAL). The regression lives in cm33-port/src/llm_grammar.c and is out of Gate 5 scope. Captured in GATE5_LESSONS.md §7. The wire extension will start paying off once the parser is fixed — no further port work needed.

### Scoreboard

- test-mcu-e2e-tokenizer: 62/62 (was 50, +12 adversarial)
- test-mcu-e2e-sampler: 55/55 (was 43, +11 edge + 1 root)
- test-mcu-e2e-chaos: 6/6 (new)
- test-mcu-e2e-replay: 16/16 (unchanged)
- test-mcu-e2e-replay-mtmd: 16/16 (unchanged)
- test-mcu-e2e-replay-gemma3: 16/16 (unchanged)
- test-mcu-e2e-replay-llama3: 16/16 (unchanged)
- test-mcu-e2e-replay-long: 128/128 (unchanged)
- Total: 315/315

Regression check: existing mcu-parity + phase{1..4} suites pass. Firmware size delta from D3: +448 B text, +512 B bss (g_grammar_root_name pool + optional parser plumbing).

Also: SPM variant's Makefile objs needed llm_nfa.c added — grammar accept paths call llm_nfa_match_at for NFA triggers, so the symbol must resolve even if the test matrix never exercises the branch.

### Remaining Gate 5 todos

- Item ggml-org#8 (huron machine variant) — substantial, own session.
- Deferred: grammar production vocab scale, NFA trigger wire, arena-overflow status propagation, custom grammar root name (item ggml-org#12 — blocked upstream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
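The read_status fix that commit describes is essentially opcode-aware draining: every response carries a fixed main header, but SAMPLE responses append a sub-header even on error paths, so the helper must consume it to stay in frame. A hypothetical sketch; the field layout, sizes, and opcode value below are assumed for illustration, not taken from the actual firmware protocol:

```python
import io
import struct

OP_SAMPLE = 0x03  # illustrative opcode value, not the real one

def read_status(stream, op_was):
    """Read one response: a 16-byte main header, plus the trailing
    8-byte SAMPLE sub-header that is present even on error paths."""
    header = stream.read(16)
    status = struct.unpack_from("<i", header, 0)[0]  # assumed field offset
    if op_was == OP_SAMPLE:
        stream.read(8)  # drain sub-header so the next read stays in sync
    return status

# Two back-to-back responses: a failed SAMPLE, then a recovery probe.
wire = io.BytesIO(
    struct.pack("<i", -1) + b"\x00" * 12 + b"\x00" * 8   # SAMPLE + sub
    + struct.pack("<i", 0) + b"\x00" * 12                # recovery probe
)
print(read_status(wire, OP_SAMPLE))  # -1
print(read_status(wire, None))       # 0, still in frame
```

Without the opcode hint, the second read would start inside the leftover sub-header bytes, which is exactly the desync the commit reports.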
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request on Apr 21, 2026:
Complete experiment log (same commit message as in TheTom's entry above).
Should maybe note that you need Python 3.10, because there's no torch wheel yet for Python 3.11.
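A minimal sketch of the guard that comment suggests, assuming the conversion script is run directly; the 3.10 ceiling reflects torch wheel availability at the time of this PR (March 2023), and the function itself is hypothetical, not part of the repository:

```python
import sys

# Last Python minor version with a prebuilt torch wheel at the time
# of this PR; torch later added 3.11 wheels, so this is a snapshot.
MAX_SUPPORTED = (3, 10)

def python_is_supported(version_info=None):
    """Return True if the interpreter can install the documented
    dependencies (torch being the binding constraint)."""
    vi = tuple((version_info or sys.version_info)[:2])
    return (3, 0) <= vi <= MAX_SUPPORTED

# Example: a 3.11 interpreter would be rejected, 3.10 accepted.
print(python_is_supported((3, 11, 2)))  # False
print(python_is_supported((3, 10, 9)))  # True
```

A script could call this before importing torch and exit with a clear message, rather than letting pip fail with an opaque "no matching distribution" error.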