
Include Python dependencies in README#6

Merged
ggerganov merged 1 commit into ggml-org:master from simonw:patch-1 on Mar 11, 2023

Conversation

@simonw
Contributor

@simonw simonw commented Mar 11, 2023

Should maybe note that you need Python 3.10 - because there's no torch wheel yet for Python 3.11.
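At the time of this PR (March 2023), PyTorch published CPython 3.10 wheels but not yet 3.11 ones, so `pip install torch` failed on 3.11. A minimal, hypothetical pre-flight check illustrating the constraint (not part of the repo's scripts):

```python
import sys

def check_python_for_torch(version_info=None):
    """Hypothetical check: torch had no CPython 3.11 wheel yet (Mar 2023)."""
    major, minor = (version_info or sys.version_info)[:2]
    if (major, minor) >= (3, 11):
        return "Python %d.%d: no torch wheel yet, use 3.10" % (major, minor)
    return "ok"
```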

@simonw
Contributor Author

simonw commented Mar 11, 2023

See also https://til.simonwillison.net/llms/llama-7b-m2

@ggerganov ggerganov merged commit 5f2f970 into ggml-org:master Mar 11, 2023
simonw added a commit to simonw/til that referenced this pull request Mar 11, 2023
This is documented in the LLaMA README now:
- ggml-org/llama.cpp#6
SlyEcho pushed a commit to SlyEcho/llama.cpp that referenced this pull request May 31, 2023
Buffer incomplete multibyte characters + other stuff.
ggerganov pushed a commit that referenced this pull request Oct 19, 2023
fix compilation errors with llvm
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023
…gml-org#6)

* deprecate ffn_b

* get tensor offloading levels

* wip: split tensor loading

* wip: framework of loading sparse model tensors

* save and flush gpu alloc buffer

* vram budget will fall back to remaining free memory

* minor: remove vram safety margin

* add options for vram budget; clean old env vars

* minor: bugfix
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023
* Update demo video in README.md

* Update demo at README.md
@Dyke-F Dyke-F mentioned this pull request Dec 21, 2023
@m828 m828 mentioned this pull request Jul 16, 2024
@fan-chao fan-chao mentioned this pull request Aug 13, 2024
@slaren slaren mentioned this pull request Aug 15, 2024
younesbelkada pushed a commit to younesbelkada/llama.cpp that referenced this pull request May 15, 2025
gaugarg-nv pushed a commit to gaugarg-nv/llama.cpp that referenced this pull request Feb 16, 2026
Support device-specific host buffer types in meta backend
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request Mar 3, 2026
Fix all issues from spec review and quality review of the expert
prefetch DMA engine (commit f60f94f):

Spec fixes:
- Req 3: Add design decision comment explaining why hint() is called at
  MoE dispatch with multi-layer lookahead instead of pre-attention, and
  why this gives equivalent DMA/compute overlap
- Req 5: Implement hit-rate disable loop — when prediction accuracy
  drops below 30%, set prefetch_disabled_ flag and short-circuit hint()
  early with log message

Critical fix:
- ggml-org#1: Deadlock in await() — extract sycl::event copy while holding lock,
  release lock before blocking event.wait(), re-acquire to update state

Important fixes:
- ggml-org#2: TOCTOU in hint_batch_adaptive() — hold mutex_ across the entire
  function so budget snapshot and consumption are atomic
- ggml-org#3: has_capacity() counted completed entries — now counts only active
  (non-completed) in-flight entries
- ggml-org#4: gc_completed() safety — add explicit comment tying the safety
  invariant to the synchronous call chain (ggml_sycl_mul_mat_id ->
  await -> kernel dispatch -> stream->wait)

Minor fixes:
- ggml-org#6: Rename PrefetchRequest to prefetch_request (snake_case convention)
- ggml-org#7: Log warning when all VRAM pool slots fail, permanently disable
- ggml-org#8: Add initialized_ guard at top of hint_batch()
- ggml-org#9: Add clarifying comment on n_miss_total <= max_inflight_ check
- ggml-org#10: Remove dead using alias expert_prefetcher = ExpertPrefetcher
- ggml-org#11: Rename accuracy_total_ to window_total_ for clarity

Refactored hint() into hint_locked() internal helper so hint_batch()
and hint_batch_adaptive() can hold the lock and call it directly,
eliminating recursive locking and TOCTOU races.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
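The deadlock fix described in the message above follows a general lock discipline: copy the event handle under the lock, block on it with the lock released, then re-acquire the lock to update state. A minimal Python sketch with illustrative names (not the actual SYCL code):

```python
import threading

class Prefetcher:
    """Illustrative only: mirrors the lock discipline from the fix above."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}     # tensor_id -> threading.Event
        self._completed = set()

    def await_tensor(self, tensor_id):
        with self._lock:                    # 1. copy the event under the lock
            ev = self._inflight.get(tensor_id)
        if ev is not None:
            ev.wait()                       # 2. block with the lock released
        with self._lock:                    # 3. re-acquire to update state
            self._inflight.pop(tensor_id, None)
            self._completed.add(tensor_id)
```

Blocking in step 2 while still holding the lock is exactly the deadlock the fix removes: the completion path would need the same lock to signal the event.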
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request Mar 16, 2026
* redo: add convert nodes

This reverts commit 8448acd.

* align clang format with cann

* rename binary_op -> general_op

because there are some ops that will only take 1 param

* Revert "rename binary_op -> general_op"

This reverts commit 5be63b1.

* wip

* add GGML_OP_PERMUTE

* add GGML_OP_VIEW and GGML_OP_GET_ROWS

* wip

* Revert "wip"

This reverts commit 772462c.
TheTom referenced this pull request in TheTom/llama-cpp-turboquant Mar 27, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request Apr 11, 2026
…s (dleex)

- l5ct0: Add pre_allocate_runtime_chunks() to pinned_chunk_pool and host_cache
  to prevent lazy runtime pool growth during inference. Called after zone
  configuration with onednn_scratchpad + dma_staging_pool bytes.
- 4f4o3: GGML_SYCL_HOST_ALLOC_PHASE_GATE default changed from 0 to 1
  (now unblocked by l5ct0 pre-allocation)
- dleex ggml-org#1: Document name shadowing in binbcast.cpp:594 as intentional
  (required by GGML_TENSOR_BINARY_OP_LOCALS macro)
- dleex ggml-org#4: Add GGML_ASSERT bounds checking to sycl_tensor::ne()/nb()
- dleex ggml-org#5: Add null assertion in sycl_tensor::resolve_as<T>() to catch
  unresolved tensor data early
- dleex ggml-org#7: Replace silent catch in fattn.cpp resolve_host_seq_ids with
  GGML_LOG_WARN fallback message
- dleex ggml-org#2,3,6,8: Deferred — ggml-org#2 is consistent naming, ggml-org#3 addressed by
  accessor migration (1vy5r), ggml-org#6 is design tension with const_cast,
  ggml-org#8 is a review note on past commits
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request Apr 17, 2026
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
rocktw added a commit to rocktw/llama.cpp that referenced this pull request Apr 20, 2026
Finishes the mid + lower tier GATE5_TODO.md list except item ggml-org#8
(huron machine variant, deferred for a separate session) and ggml-org#12
(blocked on an upstream llm_grammar parser limitation, wire-verified).

### Phase E1 — sampler edge cases (items ggml-org#5, ggml-org#9, ggml-org#10, ggml-org#11)

New SAMPLER_EDGE_CASES list (11 cases) flowing through the v3 wire
path with explicit slice_cap:

  Slice-fetch stress (ggml-org#5): slice_cap=1 (max fetch count),
    slice_cap=1024 (one-shot), slice_cap=37 (prime remainder)
  Top-n-sigma shapes (ggml-org#9): σ=0.5 aggressive, σ=5.0 loose
  Penalty_last_n boundaries (ggml-org#10): window==history_len,
    window<history_len, history=empty + active penalty
  DRY allowed_length boundaries (ggml-org#11): allowed=0, allowed=1,
    allowed=L-1

### Phase E2 — adversarial tokenizer prompts (item ggml-org#6)

6 new entries in PROMPTS covering BOM-prefixed ASCII, zero-width
joiner mid-word, Unicode Private Use Area codepoint, mixed
CR/LF/CRLF line endings, all-control-character run, and a 220-byte
repeated sentence. Size ceiling documented in GATE5_LESSONS.md §8
— pushing beyond ~300 bytes of repeated text overflows the MCU's
64 KB per-request arena and triggers silent truncation. Arena sizing
is a deployment tuning decision tracked in 06_MCU_MEMORY_PLAN.md.
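The six adversarial inputs listed above are easy to reconstruct; this is a hedged sketch of what such a prompt list could look like (the exact byte contents of the real test prompts are not in the log):

```python
# Illustrative adversarial tokenizer prompts, one per category named above.
ADVERSARIAL_PROMPTS = [
    "\ufeffhello world",                      # BOM-prefixed ASCII
    "zero\u200dwidth",                        # zero-width joiner mid-word
    "pua:\ue000",                             # Private Use Area codepoint
    "a\rb\nc\r\nd",                           # mixed CR/LF/CRLF line endings
    "".join(chr(i) for i in range(1, 32)),    # all-control-character run
    "The same sentence over and over. " * 7,  # repeated-sentence blob, ~230 B
]
```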

### Phase E3 — IPC chaos / error-path validation (item ggml-org#7)

New --mode chaos runner. 6 malformed-frame cases each verify:
  a) firmware returns the correct error status
  b) firmware stays in sync (recovery probe with a valid OP_TOKENIZE
     returns status=0)

Cases: bad magic, oversize prompt_len (5000 > 4096 MAX),
unsupported opcode, OP_SAMPLE with params_len=50 (neither v1 40
nor v3 ≥92), OP_SAMPLE v3 with K=9999 (> MAX_E2E_K=128),
OP_GRAMMAR_SET with n_vocab=300 (> MAX_GRAMMAR_VOCAB=256).

Incidental fix: read_status in chaos mode was draining only the
16-byte main response header; SAMPLE responses carry a trailing
8-byte sub-header (selected_id + n_survivors + rsv) even on error
paths, so the chaos helper now takes an op_was hint and drains the
sub when op_was == OP_SAMPLE. Without this the post-oversize
recovery probe saw the previous sub bytes as the new response
magic. Captured this as an inline comment — no lesson-doc entry
because it's a test-harness thing, not a firmware contract.
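The framing fix above can be sketched as follows. Only the sizes (16-byte main header, 8-byte SAMPLE sub-header) come from the message; the status offset and opcode value here are assumptions for illustration:

```python
import struct

MAIN_HDR_LEN = 16   # every response carries this main header
SAMPLE_SUB_LEN = 8  # selected_id + n_survivors + rsv, present even on error
OP_SAMPLE = 0x30    # hypothetical opcode value

def read_status(stream, op_was):
    """Drain one full response so the next read starts on a frame boundary."""
    hdr = stream.read(MAIN_HDR_LEN)
    status = struct.unpack_from("<i", hdr, 4)[0]  # assumed: status at offset 4
    if op_was == OP_SAMPLE:
        stream.read(SAMPLE_SUB_LEN)  # drain the trailing sub-header too
    return status
```

Without the `op_was` hint, the 8 undrained sub-header bytes would be misread as the start of the next response, which is exactly the recovery-probe failure described above.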

### Phase E1 item ggml-org#12 — custom grammar root name (partial)

OP_GRAMMAR_SET payload grows an optional root_name field: the spare
byte at tail[1] (`has_root_name`) gates a trailing
`u32 root_name_len | bytes`. Wire plumbing is verified by one
passing case (explicit root_name="root" matches default behaviour).
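The wire extension described above keeps the fixed tail layout, uses the spare byte tail[1] as `has_root_name`, and appends a `u32 root_name_len | bytes` trailer only when set. An illustrative reconstruction from the description (not the actual encoder):

```python
import struct

def encode_grammar_tail(tail, root_name=None):
    """tail: the fixed OP_GRAMMAR_SET trailer; tail[1] is the spare byte."""
    out = bytearray(tail)
    if root_name is None:
        out[1] = 0                      # has_root_name = 0, no trailer
        return bytes(out)
    out[1] = 1                          # has_root_name = 1
    name = root_name.encode("utf-8")
    return bytes(out) + struct.pack("<I", len(name)) + name
```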

Blocked follow-up: the upstream llm_grammar parser rejects non-
"root" rule names as the entry point even when they are defined
in the GBNF. A standalone host-side test reproduces the rejection
(`custom_root ::= "abc"` with root="custom_root" → LLM_ERR_INVAL).
The regression lives in cm33-port/src/llm_grammar.c and is out of
Gate 5 scope. Captured in GATE5_LESSONS.md §7. The wire extension
will start paying off once the parser is fixed — no further port
work needed.

### Scoreboard

  test-mcu-e2e-tokenizer       62/62    (was 50, +12 adversarial)
  test-mcu-e2e-sampler         55/55    (was 43, +11 edge + 1 root)
  test-mcu-e2e-chaos            6/6     (new)
  test-mcu-e2e-replay          16/16    (unchanged)
  test-mcu-e2e-replay-mtmd     16/16    (unchanged)
  test-mcu-e2e-replay-gemma3   16/16    (unchanged)
  test-mcu-e2e-replay-llama3   16/16    (unchanged)
  test-mcu-e2e-replay-long    128/128   (unchanged)
                              ─────
                               315/315 total

Regression check: existing mcu-parity + phase{1..4} suites pass.

Firmware size delta from D3: +448 B text, +512 B bss
(g_grammar_root_name pool + optional parser plumbing).

Also: SPM variant's Makefile objs needed llm_nfa.c added — grammar
accept paths call llm_nfa_match_at for NFA triggers, symbol must
resolve even if the test matrix never exercises the branch.

### Remaining Gate 5 todos

- Item ggml-org#8 (huron machine variant) — substantial, own session.
- Deferred: grammar production vocab scale, NFA trigger wire,
  arena-overflow status propagation, custom grammar root name (item
  ggml-org#12 — blocked upstream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request Apr 21, 2026