
Gemma3 Local Attention Cache #124

@chapman20j

Description

Implement a local attention cache to reduce memory usage. The current implementation just uses a windowed mask to achieve local attention, which provides no memory improvement: the KV caches for the sliding-window layers still store the same amount of data as the KV caches for the full-attention layers.
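To make the potential saving concrete, here is a rough back-of-the-envelope comparison; the sequence length, window size, head counts, and dtype below are illustrative assumptions, not the repo's actual configuration:

```python
# Rough, illustrative numbers only -- not the repo's actual configuration.
max_seq_len = 8192       # cache length currently allocated for every layer
window = 1024            # sliding-window size of the local-attention layers
num_kv_heads, head_dim = 4, 256
bytes_per_elem = 2       # e.g. bf16

def kv_cache_bytes(length: int) -> int:
    # keys + values for a single layer
    return 2 * length * num_kv_heads * head_dim * bytes_per_elem

print(f"{kv_cache_bytes(max_seq_len) / 2**20:.0f} MiB per local layer today")
print(f"{kv_cache_bytes(window) / 2**20:.0f} MiB with a window-sized cache")
```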

Steps:

  1. Update the cache to allow smaller sizes on the sliding-window layers.
  2. Update the cache logic in Gemma3Attention to use `%` and wrap around when assigning to the cache, so the same path handles both prefill and decode steps (see the first sketch after this list).
  3. Update the window attention mask to be compatible with this wrap-around (see the second sketch after this list).
  4. Remove the test_masks test (or update it) for the window attention.
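A minimal sketch of steps 1–2 in plain NumPy (class and method names such as `RingKVCache` and `update` are made up for illustration, not the repo's API): each layer allocates a cache sized to its own window, and writes wrap around with `%` so one code path covers both prefill and decode.

```python
import numpy as np

class RingKVCache:
    """Fixed-size per-layer KV cache (illustrative sketch, not the repo's API).

    cache_len would be the window size for sliding-window layers and the
    full max sequence length for the global-attention layers (step 1).
    """

    def __init__(self, cache_len, num_kv_heads, head_dim, dtype=np.float32):
        self.cache_len = cache_len
        self.k = np.zeros((cache_len, num_kv_heads, head_dim), dtype=dtype)
        self.v = np.zeros((cache_len, num_kv_heads, head_dim), dtype=dtype)
        # Absolute token position stored in each slot; -1 means empty.
        self.pos = np.full((cache_len,), -1, dtype=np.int64)

    def update(self, start_pos, k_new, v_new):
        """Write keys/values for absolute positions start_pos..start_pos+T-1.

        Slots are chosen with `%` so the buffer wraps around (step 2); the
        same path covers prefill (T > 1) and decode (T == 1). Tokens older
        than cache_len get overwritten, which is fine because they fall
        outside the window anyway.
        """
        T = k_new.shape[0]
        positions = np.arange(start_pos, start_pos + T)
        slots = positions % self.cache_len   # wrap-around assignment
        self.k[slots] = k_new
        self.v[slots] = v_new
        self.pos[slots] = positions
        return slots
```

For example, with `cache_len=4`, a 3-token prefill fills slots 0–2, decoding position 3 fills slot 3, and decoding position 4 wraps back to slot 0. A real implementation would probably only write the last `cache_len` tokens of a long prefill, since everything earlier is outside the window.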
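A companion sketch for step 3 (again with hypothetical names): after wrap-around, cache slot `i` no longer holds token position `i`, so the window mask has to be built from the absolute positions stored per slot (the `pos` array above) rather than from a fixed banded pattern.

```python
import numpy as np

def window_attention_mask(query_positions, slot_positions, window):
    """Boolean mask of shape (num_queries, cache_len).

    True where a query token may attend to a cache slot. slot_positions is
    the per-slot absolute-position array tracked by the cache above (e.g.
    cache.pos).
    """
    q = np.asarray(query_positions)[:, None]   # (num_queries, 1)
    s = np.asarray(slot_positions)[None, :]    # (1, cache_len)
    occupied = s >= 0                          # slot actually holds a token
    causal = s <= q                            # never attend to the future
    in_window = s > q - window                 # within the sliding window
    return occupied & causal & in_window
```

This is also why step 4 is needed: the existing test_masks test presumably asserts the old banded pattern over token positions, which no longer matches a mask indexed by cache slots.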

Labels

enhancement (New feature or request)
