
fix: preserve role='assistant' in Azure streaming with include_usage#24354

Merged
Chesars merged 1 commit into BerriAI:litellm_staging_03_22_2026 from Chesars:fix/azure-streaming-role-include-usage
Mar 22, 2026

Conversation

Chesars (Collaborator) commented Mar 22, 2026

Relevant issues

Fixes #24221

Pre-Submission checklist

  • I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement (see details).
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

When stream_options.include_usage=True, Azure sends an initial chunk with choices=[] (prompt_filter_results) before the first content chunk. LiteLLM inflated this empty-choices chunk with a default StreamingChoices, which:

  1. Consumed the sent_first_chunk flag
  2. Caused strip_role_from_delta to strip the role from the real first chunk
  3. Let is_chunk_non_empty discard the real first chunk (role='assistant', content='') as "empty"

Net result: no chunk ever contained role='assistant'.

Fix (4 files):

  • streaming_handler.py - chunk_creator: Set model_response.choices = [] for chunks without choices, forwarding them faithfully instead of inflating with a default StreamingChoices
  • streaming_handler.py - is_chunk_non_empty: Treat chunks with role in delta as non-empty (the first chunk with role='assistant' and content='' is a valid OpenAI chunk)
  • streaming_handler.py - __next__/__anext__: Guard choices[0] access and only mark sent_first_chunk for chunks with real choices
  • main.py + streaming_chunk_builder_utils.py: Guard choices[0] access in stream_chunk_builder for chunks with choices=[]
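The core of the fix can be sketched as follows. This is a minimal illustration with simplified stand-in types; the real LiteLLM classes and function signatures are richer, and the names here are only meant to mirror the ones discussed above.

```python
# Sketch of the fix, using hypothetical simplified types
# (the actual LiteLLM implementation differs in detail).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Delta:
    role: Optional[str] = None
    content: Optional[str] = None

@dataclass
class StreamingChoices:
    delta: Delta = field(default_factory=Delta)

@dataclass
class ModelResponse:
    choices: List[StreamingChoices] = field(default_factory=list)

def chunk_creator(raw_choices: list) -> ModelResponse:
    response = ModelResponse()
    if not raw_choices:
        # Before the fix: a default StreamingChoices was appended here,
        # "inflating" Azure's empty prompt_filter_results chunk.
        # After the fix: forward the empty-choices chunk faithfully.
        response.choices = []
        return response
    response.choices = [StreamingChoices(Delta(**c)) for c in raw_choices]
    return response

def is_chunk_non_empty(response: ModelResponse, sent_first_chunk: bool) -> bool:
    if not response.choices:
        return False
    delta = response.choices[0].delta
    # New clause: the first chunk carrying role='assistant' with content=''
    # is a valid OpenAI chunk and must not be discarded as empty.
    if delta.role is not None and not sent_first_chunk:
        return True
    return bool(delta.content)
```

With this shape, the empty prompt_filter_results chunk passes through with choices=[] (so sent_first_chunk stays False), and the first real chunk is kept because its delta carries a role even though its content is empty.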

When Azure sends stream_options.include_usage=True, it emits an initial
chunk with choices=[] (prompt_filter_results) before the first content
chunk. Previously, LiteLLM inflated this empty-choices chunk with a
default StreamingChoices, which consumed the sent_first_chunk flag and
caused strip_role_from_delta to strip role from the real first chunk.
Additionally, the first real chunk with role='assistant' and content=''
was discarded by is_chunk_non_empty as "empty".

This fix:
- Forwards chunks with choices=[] faithfully (no inflated default)
- Only marks sent_first_chunk for chunks with real choices
- Treats chunks with role in delta as non-empty
- Guards choices[0] access in __next__/__anext__ and stream_chunk_builder

Fixes BerriAI#24221
vercel Bot commented Mar 22, 2026

The latest updates on your projects.

litellm: Ready (Preview), updated Mar 22, 2026 1:15pm (UTC)


codspeed-hq Bot (Contributor) commented Mar 22, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Chesars:fix/azure-streaming-role-include-usage (ce16db1) with main (c89496f)

Open in CodSpeed

greptile-apps Bot (Contributor) commented Mar 22, 2026

Greptile Summary

This PR fixes a role-stripping bug in Azure streaming when stream_options.include_usage=True (issue #24221). Azure sends an initial choices=[] chunk (prompt_filter_results) before any content. LiteLLM was inflating this empty-choices chunk with a default StreamingChoices, which consumed the sent_first_chunk flag and caused strip_role_from_delta to strip role='assistant' from the real first content chunk.

Key changes:

  • streaming_handler.py – explicitly sets model_response.choices = [] before the early return for no-choices chunks, preventing inflation; guards choices[0] access in __next__ / __anext__; ties sent_first_chunk marking to chunks that actually carry choices; adds a new is_chunk_non_empty clause so a chunk with role in its delta is never discarded as empty.
  • streaming_chunk_builder_utils.py / main.py – both now scan forward to find the first chunk with non-empty choices before accessing choices[0], avoiding IndexError when the very first chunk has choices=[].
  • A new mock-only regression test covers both sync and async paths end-to-end.

Minor concerns:

  • In build_base_response, if all chunks have empty choices (degenerate case), the next() fallback is chunk (i.e., self.first_chunk, which also has choices=[]), and the immediately following choices[0] access would still raise IndexError. A None default with an explicit guard would be more defensive.
  • In main.py, when first_chunk_with_choices is None (all empty choices), the code silently skips TextChoices routing and falls through to build_base_response, which can then hit the same issue. An early return None when no chunk has choices would make this path explicit.
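The defensive pattern the review recommends could look like the sketch below. The chunk types here are hypothetical simplified stand-ins for illustration; the point is the None default on next() plus an explicit guard, instead of falling back to a chunk that may itself have choices=[].

```python
# Sketch of the reviewer-suggested defensive scan (illustrative types,
# not the merged LiteLLM code).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Delta:
    role: Optional[str] = None

@dataclass
class Choice:
    delta: Delta = field(default_factory=Delta)

@dataclass
class Chunk:
    choices: List[Choice] = field(default_factory=list)

def first_chunk_with_choices(chunks: List[Chunk]) -> Optional[Chunk]:
    # Scan forward for the first chunk that actually carries choices;
    # fall back to None rather than to a chunk that may itself be empty.
    return next((c for c in chunks if c.choices), None)

def extract_role(chunks: List[Chunk]) -> Optional[str]:
    chunk = first_chunk_with_choices(chunks)
    if chunk is None:
        # Explicit guard: every chunk had choices=[], so there is no
        # choices[0] to read; returning None avoids the IndexError.
        return None
    return chunk.choices[0].delta.role
```

Returning None early makes the degenerate all-empty-choices case an explicit, debuggable path instead of a latent IndexError further down the pipeline.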

Confidence Score: 4/5

  • Safe to merge — the fix is well-targeted and the regression test is solid; two minor defensive-coding gaps remain.
  • The core fix is correct: inflated empty-choices chunks are the root cause, and all four touch-points in the streaming pipeline are updated consistently. The mock-only test validates both sync and async flows and directly exercises the reported failure. Score is 4 rather than 5 because build_base_response still has an unsafe fallback that could raise IndexError in an all-empty-choices edge case, and stream_chunk_builder in main.py silently falls through when no chunk has choices rather than returning None early.
  • litellm/litellm_core_utils/streaming_chunk_builder_utils.py (unsafe chunk fallback in build_base_response) and litellm/main.py (silent fall-through when all chunks have empty choices).

Important Files Changed

  • litellm/litellm_core_utils/streaming_handler.py - Four targeted changes: (1) model_response.choices = [] is set before the early return so empty-choices chunks are not inflated with a default StreamingChoices; (2) a new is_chunk_non_empty clause treats a delta with a non-None role as non-empty when sent_first_chunk is False; (3) choices[0] access in __next__ and __anext__ is guarded behind response.choices truth checks; (4) sent_first_chunk is only marked for chunks that carry real choices. Changes are logically consistent and correct for the described bug.
  • litellm/litellm_core_utils/streaming_chunk_builder_utils.py - Adds a next() scan over all chunks to find the first one with non-empty choices before extracting the role. Correct for the Azure case, but the fallback to chunk (first_chunk, which itself has choices=[]) means a degenerate all-empty-choices stream would still raise IndexError on line 125.
  • litellm/main.py - Guards the TextChoices routing check in stream_chunk_builder so it is only reached when at least one chunk has non-empty choices. If all chunks have empty choices, None is returned from next() and the isinstance check is skipped, silently falling through to build_base_response.
  • tests/test_litellm/litellm_core_utils/test_streaming_handler.py - Adds a well-structured regression test using ModelResponseListIterator (no network calls). Covers both sync and async iteration, verifies that the prompt-filter empty-choices chunk is forwarded faithfully, and asserts that at least one chunk contains role='assistant' in its delta. Satisfies the mock-only test requirement.
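A regression check along these lines can be sketched over raw chunk dicts. This is only an illustration of what the test asserts; the actual test in the PR drives LiteLLM's streaming handler via ModelResponseListIterator, and the chunk shapes below are simplified.

```python
# Illustrative mock stream mirroring the Azure include_usage sequence
# (simplified dict chunks; the real test uses LiteLLM response objects).
mock_stream = [
    {"choices": []},  # prompt_filter_results chunk, forwarded faithfully
    {"choices": [{"delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"delta": {"content": "Hello!"}}]},
    {"choices": [], "usage": {"total_tokens": 12}},  # include_usage chunk
]

def roles_seen(chunks):
    # Collect every non-None role emitted anywhere in the stream;
    # the bug made this list come back empty.
    return [
        c["choices"][0]["delta"]["role"]
        for c in chunks
        if c["choices"] and c["choices"][0]["delta"].get("role") is not None
    ]
```

The essential assertion is that at least one chunk in the stream still carries role='assistant' after passing through the handler.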

Sequence Diagram

sequenceDiagram
    participant Azure
    participant chunk_creator
    participant is_chunk_non_empty
    participant __next__

    Azure->>chunk_creator: Chunk 1 (choices=[], prompt_filter_results)
    Note over chunk_creator: stream_options.include_usage=True<br/>model_response.choices = [] ← NEW
    chunk_creator-->>__next__: model_response (choices=[])
    Note over __next__: sent_first_chunk stays False<br/>(guarded by response.choices) ← NEW

    Azure->>chunk_creator: Chunk 2 (role='assistant', content='')
    chunk_creator->>is_chunk_non_empty: Check if non-empty
    Note over is_chunk_non_empty: NEW: role is not None AND not sent_first_chunk → True
    is_chunk_non_empty-->>chunk_creator: True (non-empty)
    chunk_creator->>chunk_creator: strip_role_from_delta → sent_first_chunk=True
    chunk_creator-->>__next__: model_response (role='assistant', content='')

    Azure->>chunk_creator: Chunk 3 (content='Hello!')
    chunk_creator-->>__next__: model_response (content='Hello!')

    Azure->>chunk_creator: Chunk 5 (choices=[], usage)
    chunk_creator-->>__next__: model_response (choices=[], usage=...)

Comments Outside Diff (1)

  1. litellm/main.py, line 7373-7381 (link)

    P2 None fallback silently skips TextChoices detection

    When first_chunk_with_choices is None (all chunks have empty choices), the TextChoices routing check is silently skipped and execution falls through to processor.build_base_response(chunks). build_base_response will itself hit the same issue (no chunk with choices) and potentially raise. Adding a guard or an early return makes the failure explicit and easier to debug.

Reviews (1): Last reviewed commit: "fix: preserve role='assistant' in Azure ..."

A comment thread was opened on litellm/litellm_core_utils/streaming_chunk_builder_utils.py.
@Chesars Chesars changed the base branch from main to litellm_staging_03_22_2026 March 22, 2026 14:15
@Chesars Chesars merged commit 2132db4 into BerriAI:litellm_staging_03_22_2026 Mar 22, 2026
38 of 39 checks passed
Chesars added a commit that referenced this pull request Apr 16, 2026
Resolved conflicts:
- streaming_handler.py: combined role check (PR #24354, Azure streaming)
  with reasoning_items check (new in main) — both are independent OR
  conditions in is_chunk_non_empty()
- CI/CD: accepted main's versions throughout
  - Redis tests migrated to CircleCI (PR #25354): removed enable-redis
    from GH Actions workflows
  - E2E UI tests restructured (PR #25365): simplified CircleCI job
  - Coverage via Codecov added to all GH Actions unit test workflows
  - Deleted test-litellm-matrix.yml and test-proxy-e2e-azure-batches.yml
    (removed in main)
Chesars added a commit that referenced this pull request Apr 16, 2026
- streaming_iterator.py: adopted main's more defensive version of the
  tool-arg queueing check (.get() instead of [], isinstance guard) —
  same logic, same behavior, lower crash surface
- model_prices_and_context_window.json + backup: combined staging's
  search_context_cost_per_query fields (PR #24372) with main's new
  supports_service_tier field — both are independent additions to the
  same Gemini model entries
- test_streaming_handler.py: kept Azure streaming regression test
  (PR #24354) and added main's two new Gemini legacy vertex
  finish_reason normalization tests
- test_gemini_batch_embeddings.py: kept staging's unsupported-params
  filtering tests (PR #24370) and added main's index/order test


Development

Successfully merging this pull request may close these issues.

[Bug]: LiteLLM proxy doesn't include "role" for /chat/completions stream=true with Azure OpenAI and stream_options.include_usage=true
