Server: when no slot is available, defer the task instead of returning "slot unavailable" #5018
Conversation
I think having a queue is a good idea, but it probably shouldn't be an unbounded queue.
I agree with that. In fact, I suspect that the complexity of the server code comes from the communication between the HTTP server thread and the "worker" thread (the one that runs the model). Nevertheless, having used [...] But that would mean rewriting all the server code from scratch, and for now I really don't have the time to do so.
ggerganov
left a comment
Good change
Probably you want to std::move(task) to avoid copies
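To illustrate the `std::move(task)` suggestion, here is a minimal sketch, not the actual server code. The names `demo_task`, `g_task_copies`, and `defer_task` are hypothetical and exist only for this example. The copy constructor counts copies so the difference between copying and moving a task into a deferred queue can be observed.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical counter, used only to make copies observable in this sketch.
static int g_task_copies = 0;

// Hypothetical stand-in for the server's task type.
struct demo_task {
    int         id = -1;
    std::string prompt;

    demo_task() = default;
    demo_task(int id_, std::string prompt_) : id(id_), prompt(std::move(prompt_)) {}

    // Copying duplicates the prompt buffer; count how often that happens.
    demo_task(const demo_task & other) : id(other.id), prompt(other.prompt) {
        ++g_task_copies;
    }

    // Moving only transfers ownership of the prompt buffer.
    demo_task(demo_task && other) noexcept = default;
};

// Takes the task by value and moves it into the queue, so a caller that
// passes std::move(task) pays for no copy at all.
void defer_task(std::deque<demo_task> & queue, demo_task task) {
    queue.push_back(std::move(task));
}
```

Calling `defer_task(queue, task)` with an lvalue makes one copy (into the parameter), while `defer_task(queue, std::move(task))` makes none. For tasks that carry large prompts, the second form is presumably what the review comment is after.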
* server: defer task when no slot is available
* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
Motivation
Assume the server is running with only one slot. If two requests are sent at the same time, one of them fails with a "slot unavailable" error. This behavior sometimes breaks OpenAI compatibility.
This PR defers the task until one of the slots is available.
On the bright side, requests will no longer fail. On the downside, one request now needs to wait for the other one to finish.
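The deferral idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `sketch_task`, `sketch_slot`, `server_sketch`, `submit_result`, and `max_deferred` are all hypothetical names. The optional bound on the queue reflects the suggestion made in the conversation, not necessarily what the PR does.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical task and slot types for this sketch.
struct sketch_task { int id; };
struct sketch_slot { bool busy = false; int task_id = -1; };

enum class submit_result { started, deferred, rejected };

struct server_sketch {
    std::vector<sketch_slot> slots;
    std::deque<sketch_task>  queue_deferred; // tasks waiting for a free slot
    std::size_t              max_deferred;   // assumed bound on the queue

    server_sketch(std::size_t n_slots, std::size_t max_deferred_tasks)
        : slots(n_slots), max_deferred(max_deferred_tasks) {}

    // Returns the first idle slot, or nullptr if all slots are busy.
    sketch_slot * find_free_slot() {
        for (auto & slot : slots) {
            if (!slot.busy) {
                return &slot;
            }
        }
        return nullptr;
    }

    // Start the task if a slot is free; otherwise defer it instead of
    // failing. Reject only when the bounded queue is already full.
    submit_result submit(sketch_task task) {
        if (sketch_slot * slot = find_free_slot()) {
            slot->busy    = true;
            slot->task_id = task.id;
            return submit_result::started;
        }
        if (queue_deferred.size() >= max_deferred) {
            return submit_result::rejected;
        }
        queue_deferred.push_back(std::move(task));
        return submit_result::deferred;
    }

    // Called when a slot finishes: free it, then hand it the oldest
    // deferred task, if there is one.
    void release(std::size_t slot_index) {
        sketch_slot & slot = slots.at(slot_index);
        slot.busy    = false;
        slot.task_id = -1;
        if (!queue_deferred.empty()) {
            sketch_task next = std::move(queue_deferred.front());
            queue_deferred.pop_front();
            slot.busy    = true;
            slot.task_id = next.id;
        }
    }
};
```

With one slot, a second concurrent request is queued rather than rejected, and it starts as soon as `release` is called for the first. This matches the trade-off described above: no failure, but the second request waits.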