Gemma3 Local Attention Cache

Implement a local attention cache to reduce memory use. The current implementation achieves local attention only through a windowed mask. This saves no memory, because the KV caches for the local attention layers still store as much data as the KV caches for the full attention layers.

Steps:
1. Update the cache to allow for smaller sizes on some layers
2. Update the cache logic in `Gemma3Attention` to use `%` and wrap around when assigning to the cache. This should handle both the prefill and decode steps.
3. Update the window attention mask to be compatible with this wrap-around.
4. Remove the `test_masks` test (or update it) for the window attention. 
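
A rough sketch of steps 1 to 3, in plain Python so the indexing is easy to follow. Everything named here (`RingKVCache`, `update`, `mask`, `cache_length`) is hypothetical and not taken from the existing code; the real change would live in `Gemma3Attention` and operate on arrays rather than lists. The idea is that a local layer keeps only `window` slots, writes position `pos` to slot `pos % window`, and tracks which absolute position each slot currently holds so that the mask can be built from those positions instead of from slot order.

```python
class RingKVCache:
    """Hypothetical fixed-size KV cache for one local-attention layer."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        # Absolute token position stored in each slot; -1 marks an empty slot.
        self.positions = [-1] * window

    def update(self, start_pos, new_keys, new_values):
        """Write one or more tokens; covers both prefill and decode."""
        for offset, (k, v) in enumerate(zip(new_keys, new_values)):
            pos = start_pos + offset
            slot = pos % self.window  # wrap around instead of growing
            self.keys[slot] = k
            self.values[slot] = v
            self.positions[slot] = pos

    def mask(self, query_pos):
        """Per-slot visibility for a query at absolute position query_pos."""
        return [
            0 <= p <= query_pos and query_pos - p < self.window
            for p in self.positions
        ]


def cache_length(is_local, max_seq_len, window):
    """Assumed sizing rule: local layers store at most `window` entries."""
    return min(window, max_seq_len) if is_local else max_seq_len
```

For example, with `window=4`, prefilling six tokens leaves positions `[4, 5, 2, 3]` in the slots, and a later decode step at position 6 overwrites the slot that held position 2. One thing this sketch does not handle: when a prefill chunk is longer than the window, earlier queries in that chunk need keys that the wrap-around has already overwritten, so attention for the chunk would have to be computed against the chunk's own keys rather than only the cache contents.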

Gemma3 Local Attention Cache #124
