Add prefix-preserving training chat template for GPT-OSS#5109
qgallouedec wants to merge 21 commits into main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
albertvillanova left a comment:
Thanks for the fix.
I think:
- The prefix-preserving fix is good and likely necessary.
- But I would say the Design note’s justification is a weak point: a comparative objective doesn't guarantee preservation of an unseen special token, and losing it could also impact stopping and parsing.
I would prefer others to comment on this.
100% agree. At this point it's an intuition. I'll update the comment to be more conservative.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7416e5fae4
trl/chat_template_utils.py (Outdated)

```jinja
{%- elif message.content and not future_final_message.found %}
    {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- elif message.thinking and not future_final_message.found %}
    {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
```
Keep tool-call analysis stable across future assistant turns
The training GPT-OSS template still drops analysis text from an assistant tool-call turn when a later assistant final turn exists, because analysis is emitted only when `not future_final_message.found`. In a valid tool flow (e.g., user → assistant tool call with content/thinking → tool → assistant final), formatting the shorter prefix includes the analysis but formatting the extended conversation removes it, so earlier tokens change and the template is not prefix-preserving for tool-calling traces.
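The reported breakage can be sketched with a toy renderer. This is illustrative only, not the real Jinja template: it mimics the `future_final_message.found` condition by scanning for a later assistant message without tool calls.

```python
# Toy model of the reported issue (illustrative only, not the real Jinja
# template): analysis text is emitted only when no later "final" assistant
# message exists, so extending the conversation rewrites earlier tokens.

def render(messages):
    out = []
    for i, m in enumerate(messages):
        if m["role"] != "assistant":
            out.append(f"<|start|>{m['role']}<|message|>{m.get('content', '')}<|end|>")
            continue
        # Mimics `not future_final_message.found`: is there a later
        # assistant message with plain content (a "final" turn)?
        future_final = any(
            later["role"] == "assistant" and "tool_calls" not in later
            for later in messages[i + 1 :]
        )
        if m.get("thinking") and not future_final:
            out.append("<|start|>assistant<|channel|>analysis<|message|>"
                       + m["thinking"] + "<|end|>")
        if "tool_calls" in m:
            out.append("<|start|>assistant<|channel|>commentary<|message|>call<|end|>")
        elif m.get("content"):
            out.append("<|start|>assistant<|channel|>final<|message|>"
                       + m["content"] + "<|end|>")
    return "".join(out)

trace = [
    {"role": "user", "content": "q"},
    {"role": "assistant", "thinking": "plan", "tool_calls": ["t"]},
    {"role": "tool", "content": "result"},
]
extended = trace + [{"role": "assistant", "content": "answer"}]

# The shorter prefix includes the analysis block; the extended conversation
# drops it, so the previously rendered tokens change.
print(render(extended).startswith(render(trace)))  # False
```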
PR Description
This PR adds support for tool-calling training with GPT-OSS models (e.g., gpt-oss-20b) in the GRPO agent training pipeline, extending the existing Qwen3 support.
Problem
For context, the main challenge is to ensure that the chat template used for training is prefix-preserving: when new messages are appended to a conversation, the token sequence for the earlier messages remains unchanged. This is crucial for multi-turn training.
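The property can be checked mechanically: rendering any prefix of the conversation must yield a string prefix of rendering the full conversation. A minimal sketch, using toy renderers that mimic only the GPT-OSS delimiter rule (not the real Jinja templates):

```python
# Toy stand-ins for the templates (illustrative, not the actual Jinja):
# the stock template ends the *final* assistant turn with <|return|>,
# the training variant ends every turn with <|end|>.

def render_original(messages):
    out = []
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        end = "<|return|>" if (is_last and m["role"] == "assistant") else "<|end|>"
        out.append(f"<|start|>{m['role']}<|message|>{m['content']}{end}")
    return "".join(out)

def render_training(messages):
    return "".join(
        f"<|start|>{m['role']}<|message|>{m['content']}<|end|>" for m in messages
    )

def is_prefix_preserving(render, messages):
    """Every rendered conversation prefix must be a string prefix of the
    rendered full conversation."""
    full = render(messages)
    return all(full.startswith(render(messages[:k])) for k in range(1, len(messages)))

msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "more"},
    {"role": "assistant", "content": "bye"},
]

print(is_prefix_preserving(render_original, msgs))  # False: <|return|> flips to <|end|>
print(is_prefix_preserving(render_training, msgs))  # True
```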
The original GPT-OSS chat template is not prefix-preserving for two reasons:
1. `<|return|>` vs `<|end|>`: The last assistant message ends with `<|return|>` while all other turns use `<|end|>`. When a conversation is extended with new turns, the previously-final assistant message switches from `<|return|>` to `<|end|>`, breaking the prefix.
2. Conditional thinking blocks (same as Qwen3): The thinking field is only rendered for the final assistant turn (via `loop.last` in the template), so earlier assistant turns lose their thinking content when new messages are appended.

This PR introduces:

- `gpt_oss_chat_template` — A reference copy of the original GPT-OSS chat template, stored in chat_template_utils.py for template matching.
- `gpt_oss_training_chat_template` — A modified, training-safe template with two key changes: it replaces `<|return|>` with `<|end|>` on the final assistant message to ensure consistent turn delimiters across all turns, and changes `loop.last` to `true` so thinking blocks are always rendered, not just on the final turn.
- `get_training_chat_template()` — Now recognizes GPT-OSS templates and returns the training variant, alongside the existing Qwen3 support.
- `is_chat_template_prefix_preserving()` — Test messages now include both `reasoning_content` and `thinking` keys, since GPT-OSS uses `thinking` while Qwen3 uses `reasoning_content`.
- `TestGetTrainingChatTemplate` — Tests are now parameterized over both GPT-OSS and Qwen3, with a helper `_assert_equal` that accounts for the expected `<|return|>` → `<|end|>` difference.

Design notes (happy to get thoughts on this!)
The consequence of this is that the `<|return|>` token is not seen during training. My intuition is that, since GRPO uses a comparative objective (relative ranking within groups), the model's ability to produce `<|return|>` at inference is not expected to degrade.