
feat(polling): Implement hot/cold server classification and staggered polling for tool discovery #3839

Open
Lang-Akshay wants to merge 12 commits into main from Tool-discovery---Auto-Refresh

Conversation


@Lang-Akshay Lang-Akshay commented Mar 24, 2026

Closes #3734


Overview

This PR introduces hot/cold server classification and usage-aware adaptive polling for tool list synchronisation across upstream MCP servers. Rather than checking all servers at a fixed interval regardless of activity, the gateway now polls frequently used servers at 1× the base interval and deprioritises idle servers to 3× — reducing unnecessary load while keeping active integrations fresh.


Design Rationale: Polling vs. Push Notifications

The MCP spec does define notifications/tools/list_changed as the canonical mechanism for dynamic tool discovery, and it's a reasonable default for single-session clients. For a gateway operating at scale, however, persistent-connection notifications introduce a set of problems that polling sidesteps cleanly — this section explains that tradeoff honestly.

Why persistent notifications don't fit the gateway model

Notifications require a live transport stream. The MCP SDK delivers notifications through a _receive_loop tied to the open connection. The gateway's refresh path (_initialize_gateway → connect_to_sse_server / connect_to_streamablehttp_server) uses ephemeral connections — open, fetch tools/list, close. No message_handler is registered, and the notification window is effectively zero.

Session pools are demand-driven, not proactive. MCPSessionPool does maintain persistent sessions with notification handlers, but sessions are only created when users invoke tools. If no tools have been called against a gateway, no session exists and no notifications are received. Idle sessions are evicted after 600 s (MCP_SESSION_POOL_IDLE_EVICTION). The pool covers active user traffic, not passive server monitoring.

The connection cost scales poorly. Listening to N upstream servers requires N open TCP sockets and 2N asyncio tasks per worker, plus keepalive traffic and reconnect logic. At realistic deployment sizes:

| Scale | Persistent Notifications | Ephemeral Polling |
|---|---|---|
| Connections at rest | N per worker | 0 |
| asyncio tasks at rest | 2N per worker | 0 |
| Multi-worker support | ✗ (each worker needs its own connections) | ✓ (leader election) |
| Server restart recovery | Requires explicit reconnect | Next poll picks it up |
| 1K servers, 4 workers | ~8K connections, ~8K tasks | 0 at rest |
| 10K servers, 4 workers | ~80K persistent connections | ~10K ephemeral calls/interval, batched |

Polling holds zero file descriptors at rest, works across workers via leader election (FILELOCK_NAME), and self-heals automatically when upstream servers restart. The existing health-check infrastructure already provides semaphore-based concurrency control, chunked batching with inter-batch pauses, and per-gateway throttling — this PR builds on that foundation rather than replacing it.

If the MCP spec's push model becomes viable for large-scale gateway deployments in the future (e.g. via a dedicated notification broker), this polling layer can be replaced without touching the rest of the refresh pipeline.


Background: What Already Exists

The gateway's health check system already implements:

  • ✅ Semaphore-based concurrency control (adaptive limit)
  • ✅ Chunked processing with 50 ms pauses between batches
  • ✅ Per-gateway throttling via last_refresh_at timestamps
  • ✅ Lock-based conflict prevention (manual vs. auto-refresh)
  • ✅ Configurable intervals (HEALTH_CHECK_INTERVAL, GATEWAY_AUTO_REFRESH_INTERVAL)

Example: 100 gateways → 10 concurrent batches with 50 ms pauses = ~5–10 s total check time
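
That chunked pattern can be sketched roughly as follows. This is an illustrative stand-in for the existing health-check code, not the actual implementation — `check_gateway`, `CONCURRENCY`, and `CHUNK_PAUSE` are assumed names:

```python
import asyncio

CONCURRENCY = 10      # stand-in for the adaptive semaphore limit
CHUNK_PAUSE = 0.05    # 50 ms pause between batches

async def check_gateway(gateway_id: int) -> int:
    """Stand-in for the real health probe against one gateway."""
    await asyncio.sleep(0)
    return gateway_id

async def run_health_checks(gateway_ids: list[int]) -> list[int]:
    """Process gateways in fixed-size chunks with an inter-batch pause."""
    results: list[int] = []
    for start in range(0, len(gateway_ids), CONCURRENCY):
        chunk = gateway_ids[start : start + CONCURRENCY]
        # All checks in a chunk run concurrently; chunks run sequentially.
        results.extend(await asyncio.gather(*(check_gateway(g) for g in chunk)))
        await asyncio.sleep(CHUNK_PAUSE)
    return results

checked = asyncio.run(run_health_checks(list(range(100))))
```

With 100 gateways and a chunk size of 10, this yields 10 sequential batches separated by 50 ms pauses, matching the timing estimate above.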


Problem

Despite those safeguards, all servers were treated equally:

  • A server receiving 1,000+ requests/day → checked every 300 s
  • A server idle for weeks → also checked every 300 s
  • No differentiation based on real usage patterns
  • Unnecessary polling of servers that rarely if ever change

Solution

1. Hot/Cold Server Classification

The gateway analyses the MCP session pool to classify each server into one of two tiers:

| Tier | Criteria | Poll Interval |
|---|---|---|
| Hot (top 20%) | Recent active sessions, high use count | 1× base interval (300 s default) |
| Cold (remaining 80%) | No recent sessions or low usage | 3× base interval (900 s default) |

Classification algorithm:

  1. Extract per-server metrics from pooled sessions: server_last_used, active_session_count, total_use_count
  2. Filter to servers with a valid pooled session
  3. Sort by recency (most recently used first); ties broken deterministically
  4. Top 20% (floor(0.20 × N)) → hot
  5. Remainder → cold

Classification is deterministic and grounded entirely in observed usage — no heuristics or guesswork.
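
The five steps above can be sketched as a small function. This is a sketch only — `ServerStats` and `classify` are illustrative names, not the PR's actual API:

```python
import math
from dataclasses import dataclass

@dataclass
class ServerStats:
    """Per-server metrics extracted from pooled sessions (step 1)."""
    server_id: str
    last_used: float          # epoch seconds of most recent session use
    active_sessions: int
    total_use_count: int

def classify(stats: list[ServerStats], hot_fraction: float = 0.20):
    """Split servers into hot/cold tiers by recency of use."""
    # Step 3: sort by recency, most recent first; ties broken
    # deterministically by server id.
    ordered = sorted(stats, key=lambda s: (-s.last_used, s.server_id))
    # Step 4: top floor(0.20 x N) are hot; step 5: the rest are cold.
    hot_count = math.floor(hot_fraction * len(ordered))
    hot = {s.server_id for s in ordered[:hot_count]}
    cold = {s.server_id for s in ordered[hot_count:]}
    return hot, cold

# Ten servers, srv-9 used most recently -> floor(0.2 x 10) = 2 hot servers.
stats = [
    ServerStats(f"srv-{i}", last_used=1_000.0 + i, active_sessions=1, total_use_count=i)
    for i in range(10)
]
hot, cold = classify(stats)
```

With ten servers, the two most recently used land in the hot tier and the remaining eight are cold, regardless of input order.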

2. Intelligent Interval Selection

Each server's tier determines its poll frequency:

```python
# Hot server (top 20% by usage)
should_poll = elapsed >= settings.hot_server_check_interval   # 300 s (1× base)

# Cold server (remaining 80%)
should_poll = elapsed >= settings.cold_server_check_interval  # 900 s (3× base)
```
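
As a self-contained sketch, the tier-to-interval decision looks like this. The constants mirror the documented defaults; the function name and signature are illustrative, not the PR's actual settings attributes:

```python
BASE_INTERVAL = 300                 # GATEWAY_AUTO_REFRESH_INTERVAL default
HOT_INTERVAL = 1 * BASE_INTERVAL    # hot servers: 1x base (300 s)
COLD_INTERVAL = 3 * BASE_INTERVAL   # cold servers: 3x base (900 s)

def should_poll(last_polled_at: float, is_hot: bool, now: float) -> bool:
    """Return True once the server's tier-specific interval has elapsed."""
    interval = HOT_INTERVAL if is_hot else COLD_INTERVAL
    return (now - last_polled_at) >= interval

# After 300 s a hot server is due, but a cold one is not yet.
hot_due = should_poll(last_polled_at=0.0, is_hot=True, now=300.0)
cold_due = should_poll(last_polled_at=0.0, is_hot=False, now=300.0)
```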

3. Staggered Polling with Deterministic Offsets

Poll offsets are assigned using index-based linear distribution to eliminate thundering-herd spikes:

offset = (gateway_index / total_gateways) × interval

2,000 gateways at a 300 s interval → one poll every 0.15 s. Flat and predictable.
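
The offset formula can be checked numerically with a minimal sketch (`poll_offset` is an illustrative name):

```python
def poll_offset(gateway_index: int, total_gateways: int, interval: float) -> float:
    """Evenly distribute poll start times across the interval."""
    return (gateway_index / total_gateways) * interval

# 2,000 gateways over a 300 s interval -> consecutive offsets 0.15 s apart.
offsets = [poll_offset(i, 2000, 300.0) for i in range(2000)]
spacing = offsets[1] - offsets[0]
```

Because the offsets depend only on index and total count, every worker computes the same schedule without coordination, and no gateway's offset ever reaches the full interval.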

4. Multi-Worker Coordination

  • With Redis: Leader election ensures a single worker classifies servers; all workers read the shared classification from Redis.
  • Without Redis (make dev): Single-worker mode; classification runs locally — no Redis dependency required for local development.
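
A common way to implement this kind of leader election is Redis `SET` with `NX` and `EX` (set-if-absent with a TTL). The sketch below simulates that semantics with an in-memory fake so it runs without Redis — the key name, `FakeRedis`, and `try_become_leader` are assumptions for illustration, not the PR's actual implementation:

```python
class FakeRedis:
    """Minimal in-memory stand-in for redis-py's set(nx=..., ex=...)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        # With nx=True, the write fails (returns None) if the key exists.
        # The TTL (ex) is ignored in this fake.
        if nx and key in self._store:
            return None
        self._store[key] = value
        return True

LEADER_KEY = "toolsync:leader"   # hypothetical key name

def try_become_leader(redis, worker_id: str, ttl: int = 60) -> bool:
    """Exactly one worker wins; the rest read the shared classification."""
    return bool(redis.set(LEADER_KEY, worker_id, nx=True, ex=ttl))

r = FakeRedis()
first = try_become_leader(r, "worker-1")    # wins the election
second = try_become_leader(r, "worker-2")   # loses: key already held
```

The TTL matters in the real pattern: if the leader dies, the key expires and another worker takes over on its next attempt.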

Configuration

To enable automatic health checks and tool list sync:

AUTO_REFRESH_SERVERS=true            # Master switch — enables tool/resource/prompt sync during health checks
HEALTH_CHECK_INTERVAL=300            # Health check cycle interval in seconds (default: 300)
GATEWAY_AUTO_REFRESH_INTERVAL=300    # Tool list refresh interval in seconds (default: 300, minimum: 60)

All three are enabled by default in this PR (auto_refresh_servers changed from false to true; both intervals default to 300).

Optional tuning:

HOT_COLD_CLASSIFICATION_ENABLED=true    # Hot/cold classification (default: true, requires Redis for multi-worker)
STAGGERED_POLLING_ENABLED=true          # Deterministic offset scheduling (default: true)

All poll intervals are derived automatically from GATEWAY_AUTO_REFRESH_INTERVAL:

| Server tier | Poll interval |
|---|---|
| Hot (top 20% by usage) | 1× base (300 s) |
| Cold (remaining 80%) | 3× base (900 s) |

@Lang-Akshay Lang-Akshay marked this pull request as draft March 24, 2026 13:44
@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from e43f064 to 91a9e92 Compare March 24, 2026 14:29
@Lang-Akshay Lang-Akshay marked this pull request as ready for review March 24, 2026 14:55
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…taggered polling

…intervals

- Implement tests for initialization, classification logic, leader election, polling decisions, and Redis state management.
- Cover various scenarios including hot/cold classification, tie-breaking logic, and service lifecycle management.
- Ensure comprehensive testing of the server classification algorithm and its integration with the GatewayService.

@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from c91a4a3 to 6caa5e6 Compare March 24, 2026 15:58
@Lang-Akshay Lang-Akshay marked this pull request as draft March 24, 2026 16:04
… clarity

@Lang-Akshay Lang-Akshay marked this pull request as ready for review March 24, 2026 16:30
- Updated  to allow a minimum of 1 second and clarified description.
- Introduced  in  for customizable error handling.
- Improved logging format for classification completion.
- Added comprehensive tests for error handling scenarios in  and configuration validation in .

@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from 2666a15 to b381eff Compare March 24, 2026 17:39
…ssion counting

…stic polling and key management

…ice and pylint fixup



Development

Successfully merging this pull request may close these issues.

[CHORE][NOTIFICATIONS]: Investigate and test support for notifications/tools/list_changed signal for dynamic tool discovery
