Releases: ggml-org/llama.cpp
b7306
Warning
Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
HIP: fix RDNA3 FP16/BF16 matrix multiplication (#17817)
b7302
ggml : improve error handling for search path existence checks (#17653)
- Improve error handling for search path existence checks: refactor existence checks for search paths using std::error_code to handle potential errors (a minimal sketch of the pattern follows this list)
- Improve cache file existence check with error code: update fs::exists to use std::error_code for error handling
- Simplify existence check for search paths
- Fix logging path in error message for posix_stat
- Update ggml/src/ggml-backend-reg.cpp (co-authored-by: Aman Gupta [email protected])
- Adapt to the coding standard (co-authored-by: Aman Gupta [email protected])
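For reference, a minimal C++ sketch of the non-throwing std::filesystem::exists overload that the list above refers to; the helper name, path, and logging here are placeholders, not the actual ggml-backend-reg.cpp code:

```cpp
// Minimal sketch of the non-throwing fs::exists(path, error_code) overload.
// Helper name, path, and logging are illustrative, not the actual ggml code.
#include <cstdio>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

static bool path_exists(const fs::path & p) {
    std::error_code ec;
    const bool ok = fs::exists(p, ec);   // does not throw; failures are reported via ec
    if (ec) {
        std::fprintf(stderr, "failed to check '%s': %s\n",
                     p.string().c_str(), ec.message().c_str());
        return false;
    }
    return ok;
}

int main() {
    std::printf("exists: %d\n", path_exists("/tmp/some-backend-dir") ? 1 : 0);
}
```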
b7301
llama : remove quantization sanity check (#17788)
- llama : remove quantization sanity check
This commit removes the quantization sanity check for attention layers.
The motivation for this is that there are hybrid models that have
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check
is not strictly necessary, and can be removed to allow for more flexible
model architectures.
- llama : remove unused pruned_attention_w and is_clip_model vars
b7300
vulkan: Use one row per workgroup for f32 mmv (#17711)
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per workgroup doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
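As a rough illustration of that occupancy argument, dispatching R rows per workgroup over a matrix with m rows launches only ceil(m / R) workgroups; the SM count below is a hypothetical example, not a measurement from this change:

```cpp
// Illustrative occupancy arithmetic only; the SM count is a hypothetical example.
#include <cstdio>
#include <initializer_list>

int main() {
    const int sm_count = 80;              // assumed number of SMs on the GPU
    for (int m : {32, 64, 128}) {         // small row counts seen right before topk_moe
        for (int rows_per_wg : {4, 1}) {
            const int workgroups = (m + rows_per_wg - 1) / rows_per_wg;
            std::printf("m=%3d, rows/workgroup=%d -> %3d workgroups on %d SMs\n",
                        m, rows_per_wg, workgroups, sm_count);
        }
    }
}
```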
b7298
vulkan: support solve_tri with larger N/K values (#17781)
Split N into chunks to fit into shared memory.
If K > 128, use a larger workgroup with enough invocations.
Add perf tests matching qwen3next.
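A plain C++ sketch of the chunking idea only (splitting the N dimension into pieces that fit a fixed shared-memory budget); the sizes and names are illustrative and not taken from the Vulkan shader:

```cpp
// Sketch of splitting N into shared-memory-sized chunks; sizes are illustrative.
#include <algorithm>
#include <cstdio>

int main() {
    const int N = 1000;                 // hypothetical number of columns to solve for
    const int cols_per_chunk = 128;     // assumed shared-memory budget per workgroup

    for (int n0 = 0; n0 < N; n0 += cols_per_chunk) {
        const int n1 = std::min(n0 + cols_per_chunk, N);
        // each chunk [n0, n1) is independent of the others, so it can be staged
        // into shared memory and solved against the triangular matrix on its own
        std::printf("chunk [%d, %d)\n", n0, n1);
    }
}
```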
b7296
metal : fix build (#17799)
- metal : fix build
- tests : fix context destruction
b7285
HIP : fix RDNA4 build (#17792)
b7278
ci : transform release binary root dir in tar to llama-bXXXX (#17773)
- transform release binary root dir in tar to llama-bXXXX
- bsdtar supports -s instead of --transform
b7276
Add support for CUMSUM and TRI for CUDA. (#17584)
- Add support for CUMSUM and TRI for CUDA (a CPU-side reference sketch of the cumulative sum follows this list)
- Minor optimizations
- Correct warp_prefix_inclusive_sum in float2 variant to return float2
- Optimize TRI
- Whitespace
- Fix strides
- Implement double loop
- Whitespace
- Fix HIP compilation bugs
- Optimizations + big case performance tests
- Implement using CUB with fallback to custom kernel
- Remove error message
- Fixes from code review
- Comment out CPU-unsupported F16/BF16 cases to fix CI
- Fine, you win :P
- Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
- Vary warp-size based on physical warp size
- Add GGML_UNUSED_VARS in tri as well
- Use constexpr and call prefix_inclusive with warp_size template param
- Update ggml/src/ggml-cuda/cumsum.cu (co-authored-by: Johannes Gäßler [email protected])
- Apply suggestions from code review (co-authored-by: Johannes Gäßler [email protected])
- Change to tid % warp_size
- Fix strides; hardcode mask; add ggml_lane_mask_t
- Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()
- Too hasty...
Co-authored-by: Johannes Gäßler [email protected]
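For orientation, a CPU-side C++ sketch of the inclusive cumulative sum that CUMSUM computes along a row, plus a log-step (Hillis-Steele) scan of the kind a shuffle-based warp_prefix_inclusive_sum typically performs; this illustrates the technique and is not the PR's CUDA kernel:

```cpp
// CPU reference for an inclusive cumulative sum, plus a log-step scan that mirrors
// the shuffle-based warp prefix sum pattern. Illustrative only, not the CUDA kernel.
#include <cstdio>
#include <vector>

// Sequential inclusive prefix sum: y[i] = x[0] + ... + x[i].
static std::vector<float> cumsum_ref(const std::vector<float> & x) {
    std::vector<float> y(x.size());
    float acc = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        acc  += x[i];
        y[i]  = acc;
    }
    return y;
}

// Hillis-Steele scan over a "warp" of lanes: at step d, lane i adds the value that
// lane i - d held before the step. On a GPU the exchange would be a warp shuffle
// (e.g. __shfl_up_sync), with the warp size being 32 or 64 depending on hardware.
static std::vector<float> warp_scan_ref(std::vector<float> lanes) {
    const size_t warp_size = lanes.size();
    for (size_t d = 1; d < warp_size; d *= 2) {
        const std::vector<float> prev = lanes;   // lockstep snapshot before this step
        for (size_t i = d; i < warp_size; ++i) {
            lanes[i] += prev[i - d];
        }
    }
    return lanes;
}

int main() {
    const std::vector<float> x = {1, 2, 3, 4, 5, 6, 7, 8};
    const auto a = cumsum_ref(x);
    const auto b = warp_scan_ref(x);
    for (size_t i = 0; i < x.size(); ++i) {
        std::printf("%zu: sequential=%.0f scan=%.0f\n", i, a[i], b[i]);
    }
}
```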
b7275
metal: TRI, FILL, EXPM1, SOFTPLUS (#16623)
- feat(wip): Port initial TRI impl from previous work
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- fix: Remove argument for constant val override
This was added in the original draft, but later removed. With this, the
kernel now passes tests.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- feat: Move the ttype conditional to templating to avoid conditional in kernel
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- fix: Type fixes
Signed-off-by: Gabe Goodhart [email protected]
Co-authored-by: Georgi Gerganov [email protected]
Co-authored-by: Georgi Gerganov [email protected]
- feat: Add softplus for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- feat: Add EXPM1 for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- feat: Add FILL for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- fix: Remove unused arguments
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
- refactor: Use select instead of branch for softplus non-vec (a sketch of the select-based softplus follows this list)
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart [email protected]
Signed-off-by: Gabe Goodhart [email protected]
Co-authored-by: Georgi Gerganov [email protected]
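A plain C++ sketch of the select-style softplus mentioned above: softplus(x) = log(1 + exp(x)), computed in a numerically safe form and then selected against x for large inputs, which is the choice a Metal kernel would typically express with select() rather than a branch. The threshold value is an assumption for illustration, not taken from this change:

```cpp
// softplus(x) = log(1 + exp(x)); for large x the exact form would overflow, so we
// clamp the argument and then select between the exact value and x itself.
// The threshold is an illustrative assumption, and this is host C++, not Metal.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <initializer_list>

static float softplus_select(float x) {
    const float threshold = 20.0f;                            // assumed cutoff
    const float exact = std::log1p(std::exp(std::min(x, threshold)));
    return x > threshold ? x : exact;                         // select(exact, x, x > threshold) in a shader
}

int main() {
    for (float x : {-10.0f, 0.0f, 1.0f, 30.0f}) {
        std::printf("softplus(%.1f) = %f\n", x, softplus_select(x));
    }
}
```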