
feat(polling): Implement hot/cold server classification and staggered polling for tool discovery #3839

Open
Lang-Akshay wants to merge 12 commits into main from Tool-discovery---Auto-Refresh

Conversation


@Lang-Akshay Lang-Akshay commented Mar 24, 2026

Closes #3734


Overview

This PR introduces hot/cold server classification and usage-aware adaptive polling for tool list synchronisation across upstream MCP servers. Rather than checking all servers at a fixed interval regardless of activity, the gateway now polls frequently used servers at 1× the base interval and deprioritises idle servers to 3× — reducing unnecessary load while keeping active integrations fresh.


Design Rationale: Polling vs. Push Notifications

The MCP spec does define notifications/tools/list_changed as the canonical mechanism for dynamic tool discovery, and it's a reasonable default for single-session clients. For a gateway operating at scale, however, persistent-connection notifications introduce a set of problems that polling sidesteps cleanly — this section explains that tradeoff honestly.

Why persistent notifications don't fit the gateway model

Notifications require a live transport stream. The MCP SDK delivers notifications through a _receive_loop tied to the open connection. The gateway's refresh path (_initialize_gateway → connect_to_sse_server / connect_to_streamablehttp_server) uses ephemeral connections — open, fetch tools/list, close. No message_handler is registered, and the notification window is effectively zero.

Session pools are demand-driven, not proactive. MCPSessionPool does maintain persistent sessions with notification handlers, but sessions are only created when users invoke tools. If no tools have been called against a gateway, no session exists and no notifications are received. Idle sessions are evicted after 600 s (MCP_SESSION_POOL_IDLE_EVICTION). The pool covers active user traffic, not passive server monitoring.

The connection cost scales poorly. Listening to N upstream servers requires N open TCP sockets and 2N asyncio tasks per worker, plus keepalive traffic and reconnect logic. At realistic deployment sizes:

| Scale | Persistent Notifications | Ephemeral Polling |
|---|---|---|
| Connections at rest | N per worker | 0 |
| asyncio tasks at rest | 2N per worker | 0 |
| Multi-worker support | ✗ (each worker needs its own connections) | ✓ (leader election) |
| Server restart recovery | Requires explicit reconnect | Next poll picks it up |
| 1K servers, 4 workers | ~8K connections, ~8K tasks | 0 at rest |
| 10K servers, 4 workers | ~80K persistent connections | ~10K ephemeral calls/interval, batched |

Polling holds zero file descriptors at rest, works across workers via leader election (FILELOCK_NAME), and self-heals automatically when upstream servers restart. The existing health-check infrastructure already provides semaphore-based concurrency control, chunked batching with inter-batch pauses, and per-gateway throttling — this PR builds on that foundation rather than replacing it.

If the MCP spec's push model becomes viable for large-scale gateway deployments in the future (e.g. via a dedicated notification broker), this polling layer can be replaced without touching the rest of the refresh pipeline.


Background: What Already Exists

The gateway's health check system already implements:

  • ✅ Semaphore-based concurrency control (adaptive limit)
  • ✅ Chunked processing with 50 ms pauses between batches
  • ✅ Per-gateway throttling via last_refresh_at timestamps
  • ✅ Lock-based conflict prevention (manual vs. auto-refresh)
  • ✅ Configurable intervals (HEALTH_CHECK_INTERVAL, GATEWAY_AUTO_REFRESH_INTERVAL)

Example: 100 gateways → 10 concurrent batches with 50 ms pauses = ~5–10 s total check time
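
That chunked pattern can be sketched roughly as follows. This is an illustrative stand-in for the existing health-check code, not the actual implementation — `check_gateway`, `CONCURRENCY`, and `CHUNK_PAUSE` are assumed names:

```python
import asyncio

CONCURRENCY = 10      # stand-in for the adaptive semaphore limit
CHUNK_PAUSE = 0.05    # 50 ms pause between batches

async def check_gateway(gateway_id: int) -> int:
    """Stand-in for the real health probe against one gateway."""
    await asyncio.sleep(0)
    return gateway_id

async def run_health_checks(gateway_ids: list[int]) -> list[int]:
    """Process gateways in fixed-size chunks with an inter-batch pause."""
    results: list[int] = []
    for start in range(0, len(gateway_ids), CONCURRENCY):
        chunk = gateway_ids[start : start + CONCURRENCY]
        # All checks in a chunk run concurrently; chunks run sequentially.
        results.extend(await asyncio.gather(*(check_gateway(g) for g in chunk)))
        await asyncio.sleep(CHUNK_PAUSE)
    return results

checked = asyncio.run(run_health_checks(list(range(100))))
```

With 100 gateways and a chunk size of 10, this yields 10 sequential batches separated by 50 ms pauses, matching the timing estimate above.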


Problem

Despite those safeguards, all servers were treated equally:

  • A server receiving 1,000+ requests/day → checked every 300 s
  • A server idle for weeks → also checked every 300 s
  • No differentiation based on real usage patterns
  • Unnecessary polling of servers that rarely if ever change

Solution

1. Hot/Cold Server Classification

The gateway analyses the MCP session pool to classify each server into one of two tiers:

| Tier | Criteria | Poll Interval |
|---|---|---|
| Hot (top 20%) | Recent active sessions, high use count | 1× base interval (300 s default) |
| Cold (remaining 80%) | No recent sessions or low usage | 3× base interval (900 s default) |

Classification algorithm:

  1. Extract per-server metrics from pooled sessions: server_last_used, active_session_count, total_use_count
  2. Filter to servers with a valid pooled session
  3. Sort by recency (most recently used first); ties broken deterministically
  4. Top 20% (floor(0.20 × N)) → hot
  5. Remainder → cold

Classification is deterministic and grounded entirely in observed usage — no heuristics or guesswork.
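
The five steps above can be sketched as a small function. This is a sketch only — `ServerStats` and `classify` are illustrative names, not the PR's actual API:

```python
import math
from dataclasses import dataclass

@dataclass
class ServerStats:
    """Per-server metrics extracted from pooled sessions (step 1)."""
    server_id: str
    last_used: float          # epoch seconds of most recent session use
    active_sessions: int
    total_use_count: int

def classify(stats: list[ServerStats], hot_fraction: float = 0.20):
    """Split servers into hot/cold tiers by recency of use."""
    # Step 3: sort by recency, most recent first; ties broken
    # deterministically by server id.
    ordered = sorted(stats, key=lambda s: (-s.last_used, s.server_id))
    # Step 4: top floor(0.20 x N) are hot; step 5: the rest are cold.
    hot_count = math.floor(hot_fraction * len(ordered))
    hot = {s.server_id for s in ordered[:hot_count]}
    cold = {s.server_id for s in ordered[hot_count:]}
    return hot, cold

# Ten servers, srv-9 used most recently -> floor(0.2 x 10) = 2 hot servers.
stats = [
    ServerStats(f"srv-{i}", last_used=1_000.0 + i, active_sessions=1, total_use_count=i)
    for i in range(10)
]
hot, cold = classify(stats)
```

With ten servers, the two most recently used land in the hot tier and the remaining eight are cold, regardless of input order.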

2. Intelligent Interval Selection

Each server's tier determines its poll frequency:

```python
# Hot server (top 20% by usage)
should_poll = elapsed >= settings.hot_server_check_interval   # 300 s (1× base)

# Cold server (remaining 80%)
should_poll = elapsed >= settings.cold_server_check_interval  # 900 s (3× base)
```
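
As a self-contained sketch, the tier-to-interval decision looks like this. The constants mirror the documented defaults; the function name and signature are illustrative, not the PR's actual settings attributes:

```python
BASE_INTERVAL = 300                 # GATEWAY_AUTO_REFRESH_INTERVAL default
HOT_INTERVAL = 1 * BASE_INTERVAL    # hot servers: 1x base (300 s)
COLD_INTERVAL = 3 * BASE_INTERVAL   # cold servers: 3x base (900 s)

def should_poll(last_polled_at: float, is_hot: bool, now: float) -> bool:
    """Return True once the server's tier-specific interval has elapsed."""
    interval = HOT_INTERVAL if is_hot else COLD_INTERVAL
    return (now - last_polled_at) >= interval

# After 300 s a hot server is due, but a cold one is not yet.
hot_due = should_poll(last_polled_at=0.0, is_hot=True, now=300.0)
cold_due = should_poll(last_polled_at=0.0, is_hot=False, now=300.0)
```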

3. Staggered Polling with Deterministic Offsets

Poll offsets are assigned using index-based linear distribution to eliminate thundering-herd spikes:

offset = (gateway_index / total_gateways) × interval

2,000 gateways at a 300 s interval → one poll every 0.15 s. Flat and predictable.
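
The offset formula can be checked numerically with a minimal sketch (`poll_offset` is an illustrative name):

```python
def poll_offset(gateway_index: int, total_gateways: int, interval: float) -> float:
    """Evenly distribute poll start times across the interval."""
    return (gateway_index / total_gateways) * interval

# 2,000 gateways over a 300 s interval -> consecutive offsets 0.15 s apart.
offsets = [poll_offset(i, 2000, 300.0) for i in range(2000)]
spacing = offsets[1] - offsets[0]
```

Because the offsets depend only on index and total count, every worker computes the same schedule without coordination, and no gateway's offset ever reaches the full interval.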

4. Multi-Worker Coordination

  • With Redis: Leader election ensures a single worker classifies servers; all workers read the shared classification from Redis.
  • Without Redis (make dev): Single-worker mode; classification runs locally — no Redis dependency required for local development.
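
A common way to implement this kind of leader election is Redis `SET` with `NX` and `EX` (set-if-absent with a TTL). The sketch below simulates that semantics with an in-memory fake so it runs without Redis — the key name, `FakeRedis`, and `try_become_leader` are assumptions for illustration, not the PR's actual implementation:

```python
class FakeRedis:
    """Minimal in-memory stand-in for redis-py's set(nx=..., ex=...)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        # With nx=True, the write fails (returns None) if the key exists.
        # The TTL (ex) is ignored in this fake.
        if nx and key in self._store:
            return None
        self._store[key] = value
        return True

LEADER_KEY = "toolsync:leader"   # hypothetical key name

def try_become_leader(redis, worker_id: str, ttl: int = 60) -> bool:
    """Exactly one worker wins; the rest read the shared classification."""
    return bool(redis.set(LEADER_KEY, worker_id, nx=True, ex=ttl))

r = FakeRedis()
first = try_become_leader(r, "worker-1")    # wins the election
second = try_become_leader(r, "worker-2")   # loses: key already held
```

The TTL matters in the real pattern: if the leader dies, the key expires and another worker takes over on its next attempt.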

Configuration

To enable automatic health checks and tool list sync:

AUTO_REFRESH_SERVERS=true            # Master switch — enables tool/resource/prompt sync during health checks
HEALTH_CHECK_INTERVAL=300            # Health check cycle interval in seconds (default: 300)
GATEWAY_AUTO_REFRESH_INTERVAL=300    # Tool list refresh interval in seconds (default: 300, minimum: 60)

All three are enabled by default in this PR (auto_refresh_servers changed from false to true; both intervals default to 300).

Optional tuning:

HOT_COLD_CLASSIFICATION_ENABLED=true    # Hot/cold classification (default: true, requires Redis for multi-worker)
STAGGERED_POLLING_ENABLED=true          # Deterministic offset scheduling (default: true)

All poll intervals are derived automatically from GATEWAY_AUTO_REFRESH_INTERVAL:

| Server tier | Poll interval |
|---|---|
| Hot (top 20% by usage) | 1× base (300 s) |
| Cold (remaining 80%) | 3× base (900 s) |

@Lang-Akshay Lang-Akshay marked this pull request as draft March 24, 2026 13:44
@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from e43f064 to 91a9e92 Compare March 24, 2026 14:29
@Lang-Akshay Lang-Akshay marked this pull request as ready for review March 24, 2026 14:55
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…taggered polling

…intervals

- Implement tests for initialization, classification logic, leader election, polling decisions, and Redis state management.
- Cover various scenarios including hot/cold classification, tie-breaking logic, and service lifecycle management.
- Ensure comprehensive testing of the server classification algorithm and its integration with the GatewayService.

@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from c91a4a3 to 6caa5e6 Compare March 24, 2026 15:58
@Lang-Akshay Lang-Akshay marked this pull request as draft March 24, 2026 16:04
… clarity

@Lang-Akshay Lang-Akshay marked this pull request as ready for review March 24, 2026 16:30
- Updated  to allow a minimum of 1 second and clarified description.
- Introduced  in  for customizable error handling.
- Improved logging format for classification completion.
- Added comprehensive tests for error handling scenarios in  and configuration validation in .

@Lang-Akshay Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from 2666a15 to b381eff Compare March 24, 2026 17:39
…ssion counting

…stic polling and key management

…ice and pylint fixup



Development

Successfully merging this pull request may close these issues.

[CHORE][NOTIFICATIONS]: Investigate and test support for notifications/tools/list_changed signal for dynamic tool discovery
