Token-Aware L7 Load Balancer for LLM Clusters

Standard load balancers cost you 12% latency on LLM clusters because they treat all requests equally. Here's a fix in Go.

Token-Aware L7 Load Balancer for LLM Clusters

An L7 reverse proxy in Go that tokenizes incoming prompts and routes requests to the backend with the lowest in-flight token count. Standard load balancers treat all requests equally — this one routes by computational weight.

The Problem

A 10-token prompt and a 4,000-token prompt both count as "1 connection" to a traditional load balancer. Least Connections will stack heavy prompts on one server while others sit idle, causing head-of-line blocking.

The Solution

Intercept requests at L7, count tokens before routing, and maintain a running total of in-flight tokens per backend. Route to the backend where current_in_flight_tokens + new_request_tokens is the lowest.

Architecture

TokenHandler (middleware)
  --> Read body, count tokens via tiktoken, store in context, restore body
    --> GetBackendServerHandler (middleware)
      --> Pick least-loaded backend, store in context, increment IFT
        --> httputil.ReverseProxy
          --> TokenAwareTransport.RoundTrip (custom RoundTripper)
            --> Forward to backend, decrement IFT on response

Middleware handles request validation, error responses (400, 503), token counting, and backend selection. Selection and token increment happen under the same lock to prevent races.

RoundTripper handles transport-level concerns: setting the destination URL and decrementing the token counter after the backend responds.

Key Design Decisions

Decision	Approach	Why
Body inspection	`io.ReadAll` + `io.NopCloser(bytes.NewReader)`	HTTP body is a stream, can only be read once. Buffer it, parse, then restore for the proxy
Token counting	`tiktoken-go` with `cl100k_base` encoding	Matches OpenAI's tokenizer for accurate weight estimation
Concurrency	`sync.RWMutex` on registry + `atomic.Int64` per backend	Read lock for routing (hot path), write lock only for adding backends
Token lifecycle	Increment in middleware, decrement in RoundTripper	Increment at selection time prevents races. Decrement when backend response arrives

Benchmark Results

3 backends simulating LLM inference (sleep proportional to token count, ±20% jitter), 60 requests per scenario.

High Contention (50% heavy, 50% small, concurrency=30)

Metric	Round Robin	Token Aware	Improvement
Average Latency	2.58s	2.27s	-12%
P90 Latency	8.60s	7.78s	-10%

Heavy Workload (80% heavy, 20% small, concurrency=5)

Metric	Round Robin	Token Aware	Improvement
Average Latency	4.45s	4.20s	-6%
P90 Latency	8.67s	8.57s	-1%

A 12% improvement across 3 simulated backends is a floor, not a ceiling. Real workloads with wider token variance and higher concurrency would amplify the difference.

Running

# Start 3 backends
go run ./cmd/dummyllm/ -port 8081 &
go run ./cmd/dummyllm/ -port 8082 &
go run ./cmd/dummyllm/ -port 8083 &

# Start balancer (token-aware or roundrobin)
go run ./cmd/balancer/ -strategy token http://localhost:8081 http://localhost:8082 http://localhost:8083

# Send a request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

# Run benchmarks
./benchmark.sh

Tech Stack

Go 1.22+ standard library (net/http, httputil.ReverseProxy, sync/atomic) + tiktoken-go for tokenization. No frameworks.

What's Missing for Production

SSE / streaming support with per-chunk token tracking
Output token estimation (not just input)
KV cache pressure as a second routing signal (vLLM /metrics)
Health checks with automatic backend removal
Prometheus metrics for routing decisions

PRs welcome. Full write-up: Beyond Round Robin

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
balancer		balancer
cmd		cmd
llm_server		llm_server
model		model
.gitignore		.gitignore
README.md		README.md
benchmark.sh		benchmark.sh
go.mod		go.mod
go.sum		go.sum
request_huge.json		request_huge.json
request_small.json		request_small.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Token-Aware L7 Load Balancer for LLM Clusters

The Problem

The Solution

Architecture

Key Design Decisions

Benchmark Results

High Contention (50% heavy, 50% small, concurrency=30)

Heavy Workload (80% heavy, 20% small, concurrency=5)

Running

Tech Stack

What's Missing for Production

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Token-Aware L7 Load Balancer for LLM Clusters

The Problem

The Solution

Architecture

Key Design Decisions

Benchmark Results

High Contention (50% heavy, 50% small, concurrency=30)

Heavy Workload (80% heavy, 20% small, concurrency=5)

Running

Tech Stack

What's Missing for Production

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages