Capture: Activation Caching for Efficient LLM Inference


TL;DR: Capture makes LLM inference faster by caching intermediate activations instead of just KV pairs. Think of it as a smarter cache that reduces recomputation and memory bandwidth bottlenecks.

🎯 Why Capture?

LLM inference is memory-bound. Traditional systems only cache Key-Value (KV) pairs, but this:

  • ❌ Still requires recomputing all other layers
  • ❌ Wastes memory bandwidth fetching KV cache repeatedly
  • ❌ Can't leverage host memory effectively

Capture solves this by:

  • ✅ Caching intermediate activations (not just KV)
  • ✅ Smart mixed KV/activation caching strategy
  • ✅ Efficient host memory offloading
  • ✅ Up to 2.5× throughput improvement over vLLM

🚀 Quick Start

Installation

git clone https://github.com/casys-kaist/Capture.git
cd Capture
pip install -e .

Run Benchmarks

cd capture
python scripts/capture_runner.py \
    --model opt-13b \
    --dataset sharegpt \
    --num-prompts 100 \
    --gen-len 128 \
    --enable-capture \
    --flag-turn-on-icache \
    --host-mem-size-GB 128 \
    --kvc-ratio 0.5

Supported Models: OPT (6.7B, 13B, 30B, 66B)

Datasets:

  • sharegpt - Real-world ShareGPT conversations
  • dummy - Fixed-length synthetic prompts
  • dummy-random-length - Variable-length synthetic prompts

Note: Docker setup and detailed usage instructions will be added later.

💾 How It Works: KV Cache vs Activation Cache

What Gets Cached?

Traditional KV Cache (vLLM)

For each token at each layer:
┌─────────────────────────────────────┐
│  Key Tensor (K)   ← Stored          │
│  Value Tensor (V) ← Stored          │
└─────────────────────────────────────┘
K and V are directly used during decode

Capture's Activation Cache

For each token at each layer:
┌─────────────────────────────────────┐
│  Input Activation (Ac) ← Stored     │
└─────────────────────────────────────┘
K and V regenerated: [K V] = Ac × [WK WV]
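
To make the regeneration step concrete, here is a minimal PyTorch sketch (illustrative names and shapes, not Capture's actual code) that rebuilds K and V from one cached activation block with a single fused projection:

import torch

# Illustrative sizes, not Capture's actual configuration
hidden_size = 5120      # e.g., an OPT-13B-like hidden dimension
block_tokens = 16       # tokens per cache block

# Cached input activation for one block: [block_tokens, hidden_size]
act_block = torch.randn(block_tokens, hidden_size)

# Concatenated key/value projection weights: [hidden_size, 2 * hidden_size]
w_kv = torch.randn(hidden_size, 2 * hidden_size)

# One GEMM regenerates both tensors: [K V] = Ac x [WK WV]
kv = act_block @ w_kv
k, v = kv.split(hidden_size, dim=-1)    # each [block_tokens, hidden_size]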

Memory Efficiency (per block of 16 tokens)

Traditional Approach:

  • Stores: K + V for all layers
  • Block size: SKV (full size)

Capture Approach:

  • Stores: Activation checkpoints only
  • Block size: SACT = ½ × SKV (50% memory savings!)
  • K and V regenerated on-demand during decode

The Key Innovation

Instead of storing the outputs of the key/value projections (the K and V tensors), Capture stores their shared input (the activation checkpoint). Since K and V can be regenerated with a single linear transformation, this achieves:

  • 50% memory reduction per cached block
  • Small recomputation cost (overlapped with weight loading from host)
  • 2.19× throughput improvement over prior work
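
A quick back-of-the-envelope makes the 50% figure concrete (illustrative numbers, assuming fp16 and an OPT-13B-like hidden size of 5120; not measured from the repository):

# Per-block, per-layer cache footprint for a 16-token block (illustrative)
hidden_size = 5120
block_tokens = 16
bytes_fp16 = 2

kv_block = 2 * block_tokens * hidden_size * bytes_fp16   # K + V: 327,680 B (~320 KiB)
act_block = block_tokens * hidden_size * bytes_fp16      # Ac only: 163,840 B (~160 KiB)
assert 2 * act_block == kv_block                         # SACT = ½ × SKV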

Hybrid KV-Activation Caching

Capture uses a mixed strategy:

  • Some tokens cached as KV blocks (no recomputation)
  • Some tokens cached as ACT blocks (recompute K, V from activations)
  • Optimal ratio balances PCIe bandwidth vs GPU computation
  • Unified block table tracks both types
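
The sketch below illustrates one way such a per-block split could be expressed (a hypothetical policy for illustration only; the names kvc_ratio and buf_mapping echo the runner flag above and the kernel excerpt below, but this is not Capture's actual allocation logic):

def assign_block_types(num_blocks: int, kvc_ratio: float) -> list[int]:
    """Illustrative split of a sequence's blocks: 0 = KV block, 1 = ACT block."""
    num_kv = round(num_blocks * kvc_ratio)
    # Hypothetical policy: keep the most recent blocks as KV, older ones as
    # activation checkpoints whose K/V are recomputed during decode.
    return [1] * (num_blocks - num_kv) + [0] * num_kv

buf_mapping = assign_block_types(num_blocks=8, kvc_ratio=0.5)
print(buf_mapping)   # [1, 1, 1, 1, 0, 0, 0, 0]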

Technical Deep Dive: Modified Paged Attention Kernel

Capture extends vLLM's PagedAttention kernel to read from both KV cache AND recomputed activation buffers during decode:

// Original vLLM: Only reads from KV cache
const cache_t* k_cache = ...;
const cache_t* v_cache = ...;

// Capture: Selectively reads from KV or activation cache
const cache_t* tg_k_cache;
const cache_t* tg_v_cache;

if (buf_mapping[block_idx] == 0) {
    // Read from KV cache (traditional)
    tg_k_cache = k_cache;
    tg_v_cache = v_cache;
} else if (buf_mapping[block_idx] == 1) {
    // Read from recomputed activation buffer (Capture's innovation!)
    tg_k_cache = recompute_key_cache;
    tg_v_cache = recompute_value_cache;
}

Why this matters:

  • The attention kernel can now consume partial recomputations derived from the activation cache
  • No need to recompute entire layers when activations are already cached
  • The buf_mapping tensor dynamically routes each block to the appropriate cache

Implementation: See capture/csrc/attention/attention_kernels.cu:209-377

🔬 Running Benchmarks

Capture includes a comprehensive benchmarking tool for evaluating performance.

Using capture_runner.py

cd capture
python scripts/capture_runner.py \
    --model opt-13b \
    --dataset sharegpt \
    --num-prompts 100 \
    --gen-len 128 \
    --enable-capture \
    --flag-turn-on-icache \
    --host-mem-size-GB 128 \
    --kvc-ratio 0.5

Supported Datasets:

  • sharegpt - Real-world ShareGPT conversations (recommended for realistic benchmarks)
  • dummy - Fixed-length synthetic prompts (for reproducibility)
  • dummy-random-length - Variable-length synthetic prompts

Key Parameters:

  • --model: Model name (opt-6.7b, opt-13b, opt-30b, opt-66b, llama-7b, etc.)
  • --dataset: Dataset to use (sharegpt, dummy, dummy-random-length)
  • --num-prompts: Number of requests to generate
  • --gen-len: Generation length per request
  • --enable-capture: Enable Capture system
  • --flag-turn-on-icache: Enable activation caching
  • --host-mem-size-GB: Host memory budget in GB
  • --kvc-ratio: KV cache ratio (0.0 = all activation, 1.0 = all KV)

For the complete list of parameters, see python scripts/capture_runner.py --help.
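
To tune the system, it is often useful to sweep --kvc-ratio for a fixed model and host memory budget. A small driver sketch (it simply shells out to capture_runner.py with the flags documented above; result parsing is left out):

import subprocess

# Sweep the KV/activation split; other flags mirror the example invocation above.
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    subprocess.run(
        [
            "python", "scripts/capture_runner.py",
            "--model", "opt-13b",
            "--dataset", "sharegpt",
            "--num-prompts", "100",
            "--gen-len", "128",
            "--enable-capture",
            "--flag-turn-on-icache",
            "--host-mem-size-GB", "128",
            "--kvc-ratio", str(ratio),
        ],
        check=True,
    )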

πŸŽ›οΈ Key Configuration Options

| Parameter | What it does | When to use |
|---|---|---|
| enable_capture | Turn on Capture system | Always set to True |
| flag_turn_on_icache | Enable activation caching | Your main performance knob |
| host_mem_size_GB | CPU memory budget | More = better performance |
| kvc_ratio | KV vs activation ratio | Tune based on sequence length |
| flag_overlap | Overlap data transfers | For weight offloading |
| max_num_micro_batches | Micro-batching degree | Higher = better overlap |
See all 40+ configuration options

Memory Configuration

  • num_gpu_kv_blocks - GPU KV cache blocks
  • num_gpu_act_blocks - GPU activation cache blocks
  • num_host_kv_blocks - Host KV cache blocks
  • num_host_act_blocks - Host activation cache blocks
  • max_num_load_tokens - Max tokens for load operations
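
A rough way to size these pools is to divide the host budget by the per-block footprint across all layers. The arithmetic below is illustrative (fp16, OPT-13B-like: 40 layers, hidden size 5120); the real allocator accounts for additional overheads:

# Rough host block budgeting (illustrative only)
GiB = 1024 ** 3
host_budget = 128 * GiB                      # --host-mem-size-GB 128

hidden_size, num_layers, block_tokens, bytes_fp16 = 5120, 40, 16, 2
act_block = block_tokens * hidden_size * bytes_fp16 * num_layers   # ~6.25 MiB per block
kv_block = 2 * act_block                                           # ~12.5 MiB per block

print(host_budget // act_block)   # ~20,971 activation blocks fit
print(host_budget // kv_block)    # ~10,485 KV blocks fit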

Advanced Scheduling

  • flag_allocate_by_balance - Balance-aware block allocation
  • flag_batching_by_balance - Balance-aware request batching
  • flag_async_load_merging - Async load merging
  • flag_equal_cache_size - Equal cache size allocation

Weight Offloading

  • flag_all_weight_on_host - All weights on host
  • num_gpu_attn_wgt_rows - GPU attention weight rows
  • num_gpu_gate_up_wgt_rows - GPU gate/up weight rows

Debug & Profiling

  • flag_profile - Enable profiling
  • flag_info - Print debug info
  • flag_validation - Validate against baseline

Full list: See Configuration Guide

πŸ—οΈ Architecture

                    ┌─────────────────────────────┐
                    │    CaptureScheduler         │
                    │  ┌─────────────────────┐    │
                    │  │ Micro-batch Manager │    │
                    │  │ Balance-aware Batch │    │
                    │  └─────────────────────┘    │
                    └──────────┬──────────────────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
     ┌───────▼──────┐  ┌───────▼──────┐  ┌───────▼──────┐
     │  GPU KV      │  │  GPU Act     │  │  GPU Weight  │
     │  Cache       │  │  Cache       │  │  Buffer      │
     └───────┬──────┘  └───────┬──────┘  └───────┬──────┘
             │                 │                 │
             │ async transfer  │ async transfer  │
             │                 │                 │
     ┌───────▼──────┐  ┌───────▼──────┐  ┌───────▼──────┐
     │  Host KV     │  │  Host Act    │  │  Host Weight │
     │  Cache       │  │  Cache       │  │  Storage     │
     └──────────────┘  └──────────────┘  └──────────────┘
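
The "async transfer" edges are where the latency hiding happens: cache blocks and weights stream between pinned host memory and the GPU on side streams while decode kernels run. A generic PyTorch sketch of that overlap pattern (requires a CUDA GPU; this shows the general technique, not Capture's cache engine):

import torch

copy_stream = torch.cuda.Stream()

# Host buffers must be pinned for truly asynchronous host-to-GPU copies.
host_block = torch.empty(16, 5120, pin_memory=True)
gpu_block = torch.empty(16, 5120, device="cuda")

with torch.cuda.stream(copy_stream):
    # H2D copy issued on the side stream; returns immediately.
    gpu_block.copy_(host_block, non_blocking=True)

# ... meanwhile, launch decode work for already-resident blocks on the default stream ...

# Before consuming the loaded block, make the default stream wait for the copy.
torch.cuda.current_stream().wait_stream(copy_stream)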

Core Components

  • CaptureScheduler (core/capture_scheduler.py) - Micro-batch scheduling with balance-aware batching
  • CaptureMemory (capture_memory.py) - Unified GPU/host memory management
  • CaptureWorker (worker/capture_worker.py) - Worker with cache engine
  • CaptureBlockManager (core/capture_block_manager.py) - Block-level allocation
  • Modified PagedAttention Kernel (csrc/attention/attention_kernels.cu) - CUDA kernel that reads from both KV and activation caches

How Attention Works in Capture

During decode, the modified paged attention kernel uses buf_mapping to decide where to read data:

For each block in sequence:
  if buf_mapping[block] == 0:
    ├─► Read from KV cache (traditional path)
    └─► Full Key/Value pairs
  else if buf_mapping[block] == 1:
    ├─► Read from recomputed activation buffer (Capture path!)
    └─► Cached intermediate activations

This dual-path design is the key innovation:

  • Flexibility: Mix KV and activation caching per-block
  • Efficiency: Use the best cache strategy for each part of the sequence

📄 License

Apache 2.0 License - see LICENSE for details.

Based on vLLM (also Apache 2.0).

📖 Citation

If you use Capture in your research, please cite:

@inproceedings{lee2025throughput,
  title={Throughput-Oriented LLM Inference via KV-Activation Hybrid Caching with a Single GPU},
  author={Lee, Sanghyeon and Kim, Hongbeen and Hwang, Soojin and Heo, Guseul and Noh, Minwoo and Huh, Jaehyuk},
  booktitle={The 43rd IEEE International Conference on Computer Design (ICCD 2025)},
  year={2025},
  organization={IEEE}
}

πŸ™ Acknowledgments

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), Ministry of Science and ICT, Korea (RS2021-II211817, RS-2024-00402898).

We also acknowledge the contributions of the following projects:

  • vLLM – Foundation for high-throughput serving
  • FlexGen – Inspiration for offloading strategies
