broker-router hits ~4% failures at ~4000 concurrent MCP sessions #630

@arielharush96

Bug: broker-router crashes at ~3600-4000 concurrent MCP sessions

Describe the bug

During performance experiments, the broker-router pod crashes and restarts when the number of concurrent users (MCP sessions) reaches ~3600-4000. The crash happens well within the pod's resource limits (CPU at ~42%, memory at ~27%), so this is not a resource exhaustion issue. After the restart, all active sessions are lost.

To Reproduce
We share the full repo, including a dedicated MCP class and all the manifests.
The general flow:
1. Deploy mcp-gateway with a perf-mock-mcp-server
2. Ramp up concurrent MCP sessions at 8 users/sec, up to 8192 MCP sessions
3. Each user: initialize a session, list tools, then call tools for the rest of the experiment
4. At ~3600-4000 concurrent sessions, the broker-router pod crashes

Expected behavior

The broker-router should handle increasing concurrent sessions without crashing.

Screenshots

(two images attached)

Additional context

Reproduced twice on OpenShift cluster (Kubernetes v1.33.6):

  • Run 1: crash at 4088 concurrent sessions
  • Run 2: crash at 3616 concurrent sessions

The following logs are from the 2nd run:

  1. Pod metrics (cpu_usage.csv) - 3 crash events with gaps:

Crash 1 (~3600 users):

11:51:43  cpu=1496m  mem=267Mi   ← last reading
          ── 39 second gap ──    ← pod gone
11:52:22  cpu=1885m  mem=136Mi   ← restarted (267→136Mi)

Crash 2 (~5200 users):

11:55:44  cpu=1660m  mem=299Mi   ← last reading
          ── 53 second gap ──    ← pod gone
11:56:37  cpu=1864m  mem=254Mi   ← restarted (299→254Mi)

Crash 3 (~7600 users):

11:57:03  cpu=1816m  mem=259Mi   ← last reading
          ── 66 second gap ──    ← pod gone
11:58:09  cpu=1936m  mem=331Mi   ← restarted

  2. Client-side failures:

users=3616  total_fail=3330   fail/s=0       ← first failures appear
users=3640  total_fail=12422  fail/s=404.7   ← avalanche

  3. Crash log (broker_router_crash.log, lines 164-186):

http: superfluous response.WriteHeader call from
  github.com/mark3labs/mcp-go/server.(*StreamableHTTPServer).handlePost.func1.1 (streamable_http.go:419)
ERROR: Failed to write SSE event: http: wrote more than the declared Content-Length
ERROR: Failed to write SSE event: http: wrote more than the declared Content-Length
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6a78b2]

goroutine 488488 [running]:
github.com/mark3labs/mcp-go/server.(*StreamableHTTPServer).handlePost.func1.1.1()
  mcp-go@v0.43.2/server/streamable_http.go:410
github.com/mark3labs/mcp-go/server.(*StreamableHTTPServer).handlePost.func1.1(...)
  mcp-go@v0.43.2/server/streamable_http.go:425

All relevant logs:

broker_router_crash.log

cpu_usage.csv

gateway_failures.csv

Metadata

Assignees: none
Labels: priority/high, triage/accepted
Status: Backlog