Open
Labels
enhancement (New feature or request)
Description
Implement a local attention cache to reduce memory use. The current implementation achieves local attention with a windowed mask alone. This gives no memory savings, because the KV caches for the local attention layers still store as much data as the KV caches for the full attention layers.
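To make the potential saving concrete, here is a rough back-of-the-envelope sketch. Every number in it (layer pattern, window size, sequence length, head shapes, dtype width) is an illustrative assumption, not a value taken from this repo's config, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(cache_lens, num_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Total bytes for the K and V caches, given one cache length per layer."""
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem  # one K and one V entry
    return batch * per_token * sum(cache_lens)


# Hypothetical model: 6 layers, 5 local layers for every full attention layer.
max_seq_len, window = 8192, 1024
is_local = [True, True, True, True, True, False]

# Today: every layer allocates max_seq_len entries, and the mask does the windowing.
full = kv_cache_bytes([max_seq_len] * len(is_local), num_kv_heads=4, head_dim=256)

# Proposed: local layers only allocate `window` entries.
local = kv_cache_bytes(
    [min(window, max_seq_len) if loc else max_seq_len for loc in is_local],
    num_kv_heads=4,
    head_dim=256,
)

print(full, local, local / full)
```

Under these assumed numbers the local layers shrink from 8192 to 1024 entries each, so the total cache drops to roughly a quarter of its current size. The real ratio depends on the actual layer pattern and sequence length.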
Steps:
- Update the cache to allow for smaller sizes on some layers
- Update the cache logic in `Gemma3Attention` to use `%` and wrap around when assigning to the cache. This should handle prefill and decode steps.
- Update the window attention mask to be compatible with this wrap around.
- Remove the `test_masks` test (or update it) for the window attention.
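The wrap-around indexing and the matching mask from the steps above could look roughly like this. It is a minimal sketch of the index arithmetic only, written in plain Python rather than against the real `Gemma3Attention` code; `cache_slots`, `write_positions`, and `window_mask` are hypothetical helper names, and tracking the absolute position held by each slot is one possible design, not necessarily the one this repo should adopt.

```python
def cache_slots(start_pos, num_tokens, cache_len):
    """Cache slot for each absolute position in [start_pos, start_pos + num_tokens)."""
    return [(start_pos + i) % cache_len for i in range(num_tokens)]


def write_positions(slot_pos, start_pos, num_tokens):
    """Record which absolute position each slot holds after a write.

    slot_pos[s] is the absolute position stored in slot s, or -1 if the slot
    is empty. Later tokens overwrite earlier ones that map to the same slot,
    so the same function covers prefill (many tokens) and decode (one token).
    """
    cache_len = len(slot_pos)
    slots = cache_slots(start_pos, num_tokens, cache_len)
    for pos, slot in zip(range(start_pos, start_pos + num_tokens), slots):
        slot_pos[slot] = pos
    return slot_pos


def window_mask(slot_pos, query_pos, window):
    """True where a slot holds a position the query at query_pos may attend to."""
    return [p >= 0 and query_pos - window < p <= query_pos for p in slot_pos]


# Prefill 6 tokens into a 4-slot cache, then decode one more token.
slot_pos = write_positions([-1] * 4, start_pos=0, num_tokens=6)
print(slot_pos)                                 # [4, 5, 2, 3]
slot_pos = write_positions(slot_pos, start_pos=6, num_tokens=1)
print(slot_pos)                                 # [4, 5, 6, 3]
print(window_mask(slot_pos, query_pos=6, window=4))  # [True, True, True, True]
```

Because the mask is built from the positions stored in the slots rather than from slot order, it stays correct after the buffer wraps. One caveat for the prefill step: when the prompt is longer than the cache, early queries need keys that later tokens overwrite, so prefill attention would have to run over the freshly computed keys and values rather than over the wrapped cache.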