feat(polling): Implement hot/cold server classification and staggered polling for tool discovery #3839
Open
Lang-Akshay wants to merge 12 commits into main from
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…taggered polling Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…intervals Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
- Implement tests for initialization, classification logic, leader election, polling decisions, and Redis state management.
- Cover various scenarios including hot/cold classification, tie-breaking logic, and service lifecycle management.
- Ensure comprehensive testing of the server classification algorithm and its integration with the GatewayService.
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
… clarity Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
- Updated to allow a minimum of 1 second and clarified description.
- Introduced in for customizable error handling.
- Improved logging format for classification completion.
- Added comprehensive tests for error handling scenarios in and configuration validation in .
Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…ssion counting Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…stic polling and key management Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
…ice and pylint fixup Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com>
Closes #3734
Overview
This PR introduces hot/cold server classification and usage-aware adaptive polling for tool list synchronisation across upstream MCP servers. Rather than checking all servers at a fixed interval regardless of activity, the gateway now polls frequently used servers at 1× the base interval and idle servers at 3× the base interval, reducing unnecessary load while keeping active integrations fresh.
Design Rationale: Polling vs. Push Notifications
The MCP spec does define notifications/tools/list_changed as the canonical mechanism for dynamic tool discovery, and it's a reasonable default for single-session clients. For a gateway operating at scale, however, persistent-connection notifications introduce a set of problems that polling avoids. This section explains that tradeoff.
Why persistent notifications don't fit the gateway model
Notifications require a live transport stream. The MCP SDK delivers notifications through a _receive_loop tied to the open connection. The gateway's refresh path (_initialize_gateway → connect_to_sse_server / connect_to_streamablehttp_server) uses ephemeral connections: open, fetch tools/list, close. No message_handler is registered, and the notification window is effectively zero.
Session pools are demand-driven, not proactive. MCPSessionPool does maintain persistent sessions with notification handlers, but sessions are only created when users invoke tools. If no tools have been called against a gateway, no session exists and no notifications are received. Idle sessions are evicted after 600 s (MCP_SESSION_POOL_IDLE_EVICTION). The pool covers active user traffic, not passive server monitoring.
The connection cost scales poorly. Listening to N upstream servers requires N open TCP sockets and 2N asyncio tasks per worker, plus keepalive traffic and reconnect logic. At realistic deployment sizes:
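As a rough illustration of the N sockets and 2N tasks per worker stated above, the arithmetic can be sketched as follows. The function name, the worker count, and the deployment sizes are hypothetical and are not figures from this PR; the totals assume every worker listens to every upstream server independently.

```python
def listener_cost(num_servers: int, num_workers: int) -> dict[str, int]:
    """Resources held open if each worker keeps one persistent
    notification connection per upstream server (assumption)."""
    sockets_per_worker = num_servers          # N open TCP sockets
    tasks_per_worker = 2 * num_servers        # 2N asyncio tasks
    return {
        "sockets_per_worker": sockets_per_worker,
        "tasks_per_worker": tasks_per_worker,
        "sockets_total": sockets_per_worker * num_workers,
        "tasks_total": tasks_per_worker * num_workers,
    }


if __name__ == "__main__":
    # Hypothetical deployment sizes, for illustration only.
    for n in (100, 500, 2000):
        print(n, listener_cost(n, num_workers=4))
```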
Polling holds zero file descriptors at rest, works across workers via leader election (FILELOCK_NAME), and self-heals automatically when upstream servers restart. The existing health-check infrastructure already provides semaphore-based concurrency control, chunked batching with inter-batch pauses, and per-gateway throttling. This PR builds on that foundation rather than replacing it.
Background: What Already Exists
The gateway's health check system already implements:
last_refresh_at timestamps
HEALTH_CHECK_INTERVAL, GATEWAY_AUTO_REFRESH_INTERVAL)
Problem
Despite those safeguards, all servers were treated equally:
Solution
1. Hot/Cold Server Classification
The gateway analyses the MCP session pool to classify each server into one of two tiers:
Classification algorithm:
server_last_used, active_session_count, total_use_count
floor(0.20 × N)) → hot
Classification is deterministic and grounded entirely in observed usage, with no heuristics or guesswork.
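A minimal sketch of such a ranking step, assuming the three metrics named above are available per server. The ServerUsage type, the classify function, the sort-key order, and the ID-based tie-break are all assumptions made for illustration; only the metric names and the floor(0.20 × N) hot cap come from this description.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerUsage:
    server_id: str
    server_last_used: float       # epoch seconds of the most recent use
    active_session_count: int
    total_use_count: int


def classify(servers: list[ServerUsage], hot_fraction: float = 0.20) -> dict[str, str]:
    """Label the top floor(hot_fraction * N) servers 'hot', the rest 'cold'."""
    hot_cap = math.floor(hot_fraction * len(servers))
    # Assumed ordering: more active sessions first, then most recently used,
    # then highest total use; server_id breaks ties so the result is stable.
    ranked = sorted(
        servers,
        key=lambda s: (
            -s.active_session_count,
            -s.server_last_used,
            -s.total_use_count,
            s.server_id,
        ),
    )
    return {
        s.server_id: ("hot" if rank < hot_cap else "cold")
        for rank, s in enumerate(ranked)
    }
```

Because the sort key ends in the server ID, two workers given the same usage snapshot would produce the same labels.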
2. Intelligent Interval Selection
Each server's tier determines its poll frequency:
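A small sketch of the tier-to-interval mapping, using the 1× and 3× multipliers given in the Overview. The names TIER_MULTIPLIER and poll_interval are hypothetical, and the 300 s base interval is only an example value.

```python
# Multipliers taken from the Overview: hot servers poll at 1x the base
# interval, cold servers at 3x. The names here are illustrative only.
TIER_MULTIPLIER = {"hot": 1, "cold": 3}


def poll_interval(tier: str, base_interval: float) -> float:
    """Return the poll interval in seconds for a server in the given tier."""
    try:
        return base_interval * TIER_MULTIPLIER[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None


if __name__ == "__main__":
    base = 300.0  # example base interval in seconds
    print(poll_interval("hot", base), poll_interval("cold", base))
```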
3. Staggered Polling with Deterministic Offsets
Poll offsets are assigned using index-based linear distribution to eliminate thundering-herd spikes:
2,000 gateways at a 300 s interval → one poll every 0.15 s. Flat and predictable.
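The spacing figure above follows from dividing the interval evenly across the gateways. A minimal sketch, assuming each gateway's index comes from a stable ordering (sorting by gateway ID here is an assumption, as are the function names):

```python
def poll_offset(index: int, total: int, interval: float) -> float:
    """Offset in seconds, within one interval, for the gateway at `index`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (index % total) * interval / total


def offsets_for(gateway_ids: list[str], interval: float) -> dict[str, float]:
    """Assign evenly spaced offsets using a sorted (stable) gateway order."""
    ordered = sorted(gateway_ids)
    return {
        gid: poll_offset(i, len(ordered), interval)
        for i, gid in enumerate(ordered)
    }


if __name__ == "__main__":
    ids = [f"gw-{i:04d}" for i in range(2000)]
    offsets = offsets_for(ids, interval=300.0)
    print(offsets["gw-0001"] - offsets["gw-0000"])  # spacing between neighbours
```

With an index-based scheme the spacing is exactly interval / N, whereas hashing or random jitter would only approximate an even spread.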
4. Multi-Worker Coordination
make dev): Single-worker mode; classification runs locally, with no Redis dependency required for local development.
Configuration
To enable automatic health checks and tool list sync:
Optional tuning:
All poll intervals are derived automatically from GATEWAY_AUTO_REFRESH_INTERVAL: