
Include Python dependencies in README#6

Merged
ggerganov merged 1 commit into ggml-org:master from simonw:patch-1 on Mar 11, 2023

Conversation

@simonw
Contributor

@simonw simonw commented Mar 11, 2023

Should maybe note that you need Python 3.10 - because there's no torch wheel yet for Python 3.11.
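At the time of this PR (March 2023), PyTorch published CPython 3.10 wheels but not yet 3.11 ones, so `pip install torch` failed on 3.11. A minimal, hypothetical pre-flight check illustrating the constraint (not part of the repo's scripts):

```python
import sys

def check_python_for_torch(version_info=None):
    """Hypothetical check: torch had no CPython 3.11 wheel yet (Mar 2023)."""
    major, minor = (version_info or sys.version_info)[:2]
    if (major, minor) >= (3, 11):
        return "Python %d.%d: no torch wheel yet, use 3.10" % (major, minor)
    return "ok"
```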

@simonw
Contributor Author

simonw commented Mar 11, 2023

See also https://til.simonwillison.net/llms/llama-7b-m2

@ggerganov ggerganov merged commit 5f2f970 into ggml-org:master Mar 11, 2023
simonw added a commit to simonw/til that referenced this pull request Mar 11, 2023
This is documented in the LLaMA README now:
- ggml-org/llama.cpp#6
SlyEcho pushed a commit to SlyEcho/llama.cpp that referenced this pull request May 31, 2023
Buffer incomplete multibyte characters + other stuff.
ggerganov pushed a commit that referenced this pull request Oct 19, 2023
fix compilation errors with llvm
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023
…gml-org#6)

* deprecate ffn_b

* get tensor offloading levels

* wip: split tensor loading

* wip: framework of loading sparse model tensors

* save and flush gpu alloc buffer

* vram budget will fall back to remaining free memory

* minor: remove vram safety margin

* add options for vram budget; clean old env vars

* minor: bugfix
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023
* Update demo video in README.md

* Update demo at README.md
@Dyke-F Dyke-F mentioned this pull request Dec 21, 2023
@m828 m828 mentioned this pull request Jul 16, 2024
@fan-chao fan-chao mentioned this pull request Aug 13, 2024
@slaren slaren mentioned this pull request Aug 15, 2024
younesbelkada pushed a commit to younesbelkada/llama.cpp that referenced this pull request May 15, 2025
gaugarg-nv pushed a commit to gaugarg-nv/llama.cpp that referenced this pull request Feb 16, 2026
Support device-specific host buffer types in meta backend
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request Mar 3, 2026
Fix all issues from spec review and quality review of the expert
prefetch DMA engine (commit f60f94f):

Spec fixes:
- Req 3: Add design decision comment explaining why hint() is called at
  MoE dispatch with multi-layer lookahead instead of pre-attention, and
  why this gives equivalent DMA/compute overlap
- Req 5: Implement hit-rate disable loop — when prediction accuracy
  drops below 30%, set prefetch_disabled_ flag and short-circuit hint()
  early with log message

Critical fix:
- ggml-org#1: Deadlock in await() — extract sycl::event copy while holding lock,
  release lock before blocking event.wait(), re-acquire to update state

Important fixes:
- ggml-org#2: TOCTOU in hint_batch_adaptive() — hold mutex_ across the entire
  function so budget snapshot and consumption are atomic
- ggml-org#3: has_capacity() counted completed entries — now counts only active
  (non-completed) in-flight entries
- ggml-org#4: gc_completed() safety — add explicit comment tying the safety
  invariant to the synchronous call chain (ggml_sycl_mul_mat_id ->
  await -> kernel dispatch -> stream->wait)

Minor fixes:
- ggml-org#6: Rename PrefetchRequest to prefetch_request (snake_case convention)
- ggml-org#7: Log warning when all VRAM pool slots fail, permanently disable
- ggml-org#8: Add initialized_ guard at top of hint_batch()
- ggml-org#9: Add clarifying comment on n_miss_total <= max_inflight_ check
- ggml-org#10: Remove dead using alias expert_prefetcher = ExpertPrefetcher
- ggml-org#11: Rename accuracy_total_ to window_total_ for clarity

Refactored hint() into hint_locked() internal helper so hint_batch()
and hint_batch_adaptive() can hold the lock and call it directly,
eliminating recursive locking and TOCTOU races.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
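The deadlock fix described in the message above follows a general lock discipline: copy the event handle under the lock, block on it with the lock released, then re-acquire the lock to update state. A minimal Python sketch with illustrative names (not the actual SYCL code):

```python
import threading

class Prefetcher:
    """Illustrative only: mirrors the lock discipline from the fix above."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}     # tensor_id -> threading.Event
        self._completed = set()

    def await_tensor(self, tensor_id):
        with self._lock:                    # 1. copy the event under the lock
            ev = self._inflight.get(tensor_id)
        if ev is not None:
            ev.wait()                       # 2. block with the lock released
        with self._lock:                    # 3. re-acquire to update state
            self._inflight.pop(tensor_id, None)
            self._completed.add(tensor_id)
```

Blocking in step 2 while still holding the lock is exactly the deadlock the fix removes: the completion path would need the same lock to signal the event.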
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request Mar 16, 2026
* redo: add convert nodes

This reverts commit 8448acd.

* align clang format with cann

* rename binary_op -> general_op

because there are some ops that will only take 1 param

* Revert "rename binary_op -> general_op"

This reverts commit 5be63b1.

* wip

* add GGML_OP_PERMUTE

* add GGML_OP_VIEW and GGML_OP_GET_ROWS

* wip

* Revert "wip"

This reverts commit 772462c.
TheTom referenced this pull request in TheTom/llama-cpp-turboquant Mar 27, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request Apr 11, 2026
…s (dleex)

- l5ct0: Add pre_allocate_runtime_chunks() to pinned_chunk_pool and host_cache
  to prevent lazy runtime pool growth during inference. Called after zone
  configuration with onednn_scratchpad + dma_staging_pool bytes.
- 4f4o3: GGML_SYCL_HOST_ALLOC_PHASE_GATE default changed from 0 to 1
  (now unblocked by l5ct0 pre-allocation)
- dleex ggml-org#1: Document name shadowing in binbcast.cpp:594 as intentional
  (required by GGML_TENSOR_BINARY_OP_LOCALS macro)
- dleex ggml-org#4: Add GGML_ASSERT bounds checking to sycl_tensor::ne()/nb()
- dleex ggml-org#5: Add null assertion in sycl_tensor::resolve_as<T>() to catch
  unresolved tensor data early
- dleex ggml-org#7: Replace silent catch in fattn.cpp resolve_host_seq_ids with
  GGML_LOG_WARN fallback message
- dleex ggml-org#2,3,6,8: Deferred — ggml-org#2 is consistent naming, ggml-org#3 addressed by
  accessor migration (1vy5r), ggml-org#6 is design tension with const_cast,
  ggml-org#8 is a review note on past commits
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request Apr 17, 2026
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
rocktw added a commit to rocktw/llama.cpp that referenced this pull request Apr 20, 2026
Finishes the mid + lower tier GATE5_TODO.md list except item ggml-org#8
(huron machine variant, deferred for a separate session) and ggml-org#12
(blocked on an upstream llm_grammar parser limitation, wire-verified).

### Phase E1 — sampler edge cases (items ggml-org#5, ggml-org#9, ggml-org#10, ggml-org#11)

New SAMPLER_EDGE_CASES list (11 cases) flowing through the v3 wire
path with explicit slice_cap:

  Slice-fetch stress (ggml-org#5): slice_cap=1 (max fetch count),
    slice_cap=1024 (one-shot), slice_cap=37 (prime remainder)
  Top-n-sigma shapes (ggml-org#9): σ=0.5 aggressive, σ=5.0 loose
  Penalty_last_n boundaries (ggml-org#10): window==history_len,
    window<history_len, history=empty + active penalty
  DRY allowed_length boundaries (ggml-org#11): allowed=0, allowed=1,
    allowed=L-1

### Phase E2 — adversarial tokenizer prompts (item ggml-org#6)

6 new entries in PROMPTS covering BOM-prefixed ASCII, zero-width
joiner mid-word, Unicode Private Use Area codepoint, mixed
CR/LF/CRLF line endings, all-control-character run, and a 220-byte
repeated sentence. Size ceiling documented in GATE5_LESSONS.md §8
— pushing beyond ~300 bytes of repeated text overflows the MCU's
64 KB per-request arena and triggers silent truncation. Arena sizing
is a deployment tuning decision tracked in 06_MCU_MEMORY_PLAN.md.
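The six adversarial inputs listed above are easy to reconstruct; this is a hedged sketch of what such a prompt list could look like (the exact byte contents of the real test prompts are not in the log):

```python
# Illustrative adversarial tokenizer prompts, one per category named above.
ADVERSARIAL_PROMPTS = [
    "\ufeffhello world",                      # BOM-prefixed ASCII
    "zero\u200dwidth",                        # zero-width joiner mid-word
    "pua:\ue000",                             # Private Use Area codepoint
    "a\rb\nc\r\nd",                           # mixed CR/LF/CRLF line endings
    "".join(chr(i) for i in range(1, 32)),    # all-control-character run
    "The same sentence over and over. " * 7,  # repeated-sentence blob, ~230 B
]
```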

### Phase E3 — IPC chaos / error-path validation (item ggml-org#7)

New --mode chaos runner. 6 malformed-frame cases each verify:
  a) firmware returns the correct error status
  b) firmware stays in sync (recovery probe with a valid OP_TOKENIZE
     returns status=0)

Cases: bad magic, oversize prompt_len (5000 > 4096 MAX),
unsupported opcode, OP_SAMPLE with params_len=50 (neither v1 40
nor v3 ≥92), OP_SAMPLE v3 with K=9999 (> MAX_E2E_K=128),
OP_GRAMMAR_SET with n_vocab=300 (> MAX_GRAMMAR_VOCAB=256).

Incidental fix: read_status in chaos mode was draining only the
16-byte main response header; SAMPLE responses carry a trailing
8-byte sub-header (selected_id + n_survivors + rsv) even on error
paths, so the chaos helper now takes an op_was hint and drains the
sub when op_was == OP_SAMPLE. Without this the post-oversize
recovery probe saw the previous sub bytes as the new response
magic. Captured this as an inline comment — no lesson-doc entry
because it's a test-harness thing, not a firmware contract.
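The framing fix above can be sketched as follows. Only the sizes (16-byte main header, 8-byte SAMPLE sub-header) come from the message; the status offset and opcode value here are assumptions for illustration:

```python
import struct

MAIN_HDR_LEN = 16   # every response carries this main header
SAMPLE_SUB_LEN = 8  # selected_id + n_survivors + rsv, present even on error
OP_SAMPLE = 0x30    # hypothetical opcode value

def read_status(stream, op_was):
    """Drain one full response so the next read starts on a frame boundary."""
    hdr = stream.read(MAIN_HDR_LEN)
    status = struct.unpack_from("<i", hdr, 4)[0]  # assumed: status at offset 4
    if op_was == OP_SAMPLE:
        stream.read(SAMPLE_SUB_LEN)  # drain the trailing sub-header too
    return status
```

Without the `op_was` hint, the 8 undrained sub-header bytes would be misread as the start of the next response, which is exactly the recovery-probe failure described above.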

### Phase E1 item ggml-org#12 — custom grammar root name (partial)

OP_GRAMMAR_SET payload grows an optional root_name field: the spare
byte at tail[1] (`has_root_name`) gates a trailing
`u32 root_name_len | bytes`. Wire plumbing is verified by one
passing case (explicit root_name="root" matches default behaviour).
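The wire extension described above keeps the fixed tail layout, uses the spare byte tail[1] as `has_root_name`, and appends a `u32 root_name_len | bytes` trailer only when set. An illustrative reconstruction from the description (not the actual encoder):

```python
import struct

def encode_grammar_tail(tail, root_name=None):
    """tail: the fixed OP_GRAMMAR_SET trailer; tail[1] is the spare byte."""
    out = bytearray(tail)
    if root_name is None:
        out[1] = 0                      # has_root_name = 0, no trailer
        return bytes(out)
    out[1] = 1                          # has_root_name = 1
    name = root_name.encode("utf-8")
    return bytes(out) + struct.pack("<I", len(name)) + name
```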

Blocked follow-up: the upstream llm_grammar parser rejects non-
"root" rule names as the entry point even when they are defined
in the GBNF. A standalone host-side test reproduces the rejection
(`custom_root ::= "abc"` with root="custom_root" → LLM_ERR_INVAL).
The regression lives in cm33-port/src/llm_grammar.c and is out of
Gate 5 scope. Captured in GATE5_LESSONS.md §7. The wire extension
will start paying off once the parser is fixed — no further port
work needed.

### Scoreboard

  test-mcu-e2e-tokenizer       62/62    (was 50, +12 adversarial)
  test-mcu-e2e-sampler         55/55    (was 43, +11 edge + 1 root)
  test-mcu-e2e-chaos            6/6     (new)
  test-mcu-e2e-replay          16/16    (unchanged)
  test-mcu-e2e-replay-mtmd     16/16    (unchanged)
  test-mcu-e2e-replay-gemma3   16/16    (unchanged)
  test-mcu-e2e-replay-llama3   16/16    (unchanged)
  test-mcu-e2e-replay-long    128/128   (unchanged)
                              ─────
                               315/315 total

Regression check: existing mcu-parity + phase{1..4} suites pass.

Firmware size delta from D3: +448 B text, +512 B bss
(g_grammar_root_name pool + optional parser plumbing).

Also: SPM variant's Makefile objs needed llm_nfa.c added — grammar
accept paths call llm_nfa_match_at for NFA triggers, symbol must
resolve even if the test matrix never exercises the branch.

### Remaining Gate 5 todos

- Item ggml-org#8 (huron machine variant) — substantial, own session.
- Deferred: grammar production vocab scale, NFA trigger wire,
  arena-overflow status propagation, custom grammar root name (item
  ggml-org#12 — blocked upstream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
YuruDeveloper pushed a commit to YuruDeveloper/llama.cpp-quant that referenced this pull request Apr 21, 2026