Conversation
benchmark_serving.py
prompts_sent += 1

results = await asyncio.gather(*tasks)
async with aiohttp.ClientSession(
    trust_env=False,
    connector=aiohttp.TCPConnector(
        keepalive_timeout=30,
        enable_cleanup_closed=True,
        limit=28000,
    ),
    timeout=None,
    trace_configs=[trace_config],
) as clientSession:
I am less inclined to hard-code 28k; I would rather set it to 0 (no limit) here and catch the appropriate error, log it, and retry. The added metric and logging will help observability, and the retry is effectively the same outcome (a QPS slowdown).
Yes, if we can log failures due to ephemeral port exhaustion, then we know the experiment is not valid and the user needs to reduce the QPS or num_prompts.
+1 to logging when we exhaust ports. Added that, and also prevented server metrics from being included when we exhaust ports, since the wait time invalidates them. Non-server metrics could still be valuable, since the measured e2e latency includes the time spent waiting to send the request. If no requests are ever queued on any model server and the bottleneck is this tool, then yes, the experiment data would certainly be invalid.
This PR fixes the ClientConnectorErrors seen at high QPS due to ephemeral port exhaustion by upper-bounding the number of concurrent requests to 28000. This is a temporary fix until the approach for high-QPS benchmarking is decided on. 28000 is roughly the number of ephemeral ports available when containerized. Added an `active_connections_metric` gauge.
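For context on where the 28000 bound comes from: assuming the common Linux default ephemeral port range of 32768–60999 (`net.ipv4.ip_local_port_range`), the arithmetic gives roughly 28k client ports per source address, so capping concurrent connections just below that avoids exhausting them:

```python
# Default Linux ephemeral port range (sysctl net.ipv4.ip_local_port_range).
low, high = 32768, 60999
available = high - low + 1  # number of ephemeral ports per source address

# Bound used in the PR, leaving a little headroom below the default range.
CONNECTION_LIMIT = 28000
headroom = available - CONNECTION_LIMIT
```

A container that inherits a narrower port range (or shares the host network namespace with other clients) would have fewer ports available, which is why the cap is approximate.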