This repository was archived by the owner on Jun 5, 2025. It is now read-only.
Normalize and denormalize llamacpp streaming reply #121
Merged
jhrozek merged 1 commit into stacklok:main, Nov 29, 2024
Conversation
Originally, I wanted to add normalizers to convert the `im_start`/`im_end` tags, but we worked around that by configuring llamacpp to use the OpenAI format. We'll still need a normalizer for the vllm provider, though.

At the moment we really need the denormalizer so that the blocking pipeline can return a stream of `ModelResponse`s, which the denormalizer converts to the `CreateChatCompletionStreamResponse` structure that is then serialized to the client. This avoids the guessing and special casing that would otherwise be needed in `llamacpp_stream_generator`, which previously expected `Iterator[CreateChatCompletionStreamResponse]`.

Another change that simplifies the logic is that `llamacpp_stream_generator` now accepts an `AsyncIterator` instead of just the `Iterator` that the llamacpp completion handler was returning. Again, this is to simplify the logic and pass the iterator straight from the blocking pipeline. On the completion side we have a simple sync-to-async wrapper.

Fixes: #94
lukehinds approved these changes, Nov 29, 2024