(gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support #9043
Conversation
Summary of Changes
Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've undertaken a significant refactoring to enhance our system's capabilities by removing the existing Harmony integration. This decision was driven by Harmony's limitations, particularly its lack of support for output token IDs and tool calls in chat completions. My primary goal was to implement native GPT-OSS tool call support, which is a highly requested feature. This change allows our system to directly interpret and utilize tool calls from GPT-OSS models, streamlining our interaction with these models and opening up new possibilities for advanced functionalities. The impact is a more robust and feature-rich tool calling mechanism, directly integrated into our core serving logic.
Highlights
- Removal of Harmony Integration: The entire Harmony integration, which was previously used for GPT-OSS models, has been completely removed from the codebase. This includes removing related imports, initialization logic, and processing paths.
- Native GPT-OSS Tool Call Support: Native support for GPT-OSS tool calls has been implemented. This involves new `GptOssDetector` classes for both function call parsing and reasoning parsing, allowing the system to directly interpret and handle tool calls embedded within the model's output.
- Handling of GPT-OSS Channel Format: The new implementation specifically handles the 'channel format' used by GPT-OSS models, which includes distinct 'analysis' and 'final' channels, as well as 'commentary' channels for tool calls. This allows for structured extraction of reasoning and final answers.
- Refactoring of Chat Serving Logic: The `serving_chat.py` module has been significantly refactored to remove all Harmony-specific code, making the chat completion endpoint cleaner and more focused on the native GPT-OSS implementation. This includes changes to request validation, template application, and stream processing.
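The channel format described in these highlights can be illustrated with a small sketch. The sample string, the marker strings, and the helper below are illustrative assumptions about the `<|channel|>`/`<|message|>`/`<|end|>` convention, not the PR's actual parser:

```python
import re

# Illustrative GPT-OSS channel-format output: an "analysis" channel carrying
# chain-of-thought, a "commentary" channel addressed to a tool, and a
# "final" channel holding the user-facing answer. The sample and marker
# strings are assumptions for illustration only.
SAMPLE = (
    "<|channel|>analysis<|message|>The user asks for the weather.<|end|>"
    "<|channel|>commentary to=functions.get_weather<|message|>"
    '{"city": "Paris"}<|end|>'
    "<|channel|>final<|message|>Let me check the weather for you.<|end|>"
)

# One channel block: header up to <|message|>, body up to <|end|>.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<header>.*?)<\|message\|>(?P<body>.*?)<\|end\|>",
    re.DOTALL,
)

def split_channels(text):
    """Return (header, body) pairs in emission order."""
    return [(m.group("header"), m.group("body")) for m in CHANNEL_RE.finditer(text)]
```

On the sample above, `split_channels` yields three blocks whose headers (`analysis`, `commentary to=functions.get_weather`, `final`) tell the server whether a body is reasoning, a tool call, or final content.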
Before:
- Tool calls that appeared between analysis sections were lost during `GptOssDetector.detect_and_parse` in `reasoning_parser.py`
- Only content after the last analysis section was preserved

Fix: Modified the `detect_and_parse` method to:
1. Process the text sequentially, preserving content between analysis sections
2. Collect all non-analysis content (including tool calls) in `normal_parts`
3. Extract commentary channels that aren't tool calls for reasoning
4. Return tool calls preserved in `normal_text` for the tool call parser to process

The two-stage parsing approach handles:
- Reasoning extraction: analysis + commentary → `reasoning_text`
- Tool call extraction: function calls → `tool_calls`
- Clean final text: user-facing content → `content`
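A minimal sketch of this two-stage flow, using simplified channel markers and hypothetical helper names (the real `GptOssDetector` classes handle many more edge cases, including partial and malformed blocks):

```python
import json
import re

# One channel block: header up to <|message|>, body up to <|end|>.
BLOCK_RE = re.compile(
    r"<\|channel\|>(?P<header>.*?)<\|message\|>(?P<body>.*?)<\|end\|>",
    re.DOTALL,
)

def two_stage_parse(text):
    """Stage 1: route analysis/commentary channels into reasoning text.
    Stage 2: turn 'commentary to=<fn>' blocks into structured tool calls.
    Returns (reasoning_text, tool_calls, content)."""
    reasoning_parts, tool_calls, content_parts = [], [], []
    for match in BLOCK_RE.finditer(text):
        header = match.group("header").strip()
        body = match.group("body")
        if header == "analysis":
            reasoning_parts.append(body)      # chain-of-thought reasoning
        elif header.startswith("commentary to="):
            fn_name = header.split("to=", 1)[1].split()[0]
            tool_calls.append({"name": fn_name, "arguments": json.loads(body)})
        elif header == "commentary":
            reasoning_parts.append(body)      # commentary that is not a tool call
        elif header == "final":
            content_parts.append(body)        # user-facing answer
    return "\n".join(reasoning_parts), tool_calls, "".join(content_parts)
```

The key property the fix restores is visible here: blocks are processed sequentially in emission order, so a tool call sitting between two analysis sections is collected rather than discarded.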
Force-pushed from 225f6ff to 9585c2a
Code Review
This pull request removes the Harmony integration and introduces native support for GPT-OSS tool calls. The changes are extensive, involving the removal of old parsing logic and the addition of new detectors for both reasoning and tool calls. While the overall direction is good, the new parsing logic in GptOssDetector for both reasoning and function calls has some significant correctness and maintainability issues that need to be addressed. Specifically, the logic for distinguishing and parsing commentary blocks is fragile and can lead to incorrect behavior. Additionally, there is some code duplication in serving_chat.py that could be refactored for better maintainability.
wip: still debugging some streaming cases

Streaming: tool-call
non-streaming + streaming: reasoning
… to ensure proper context length handling
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
… GPT-OSS Tool Call Support (sgl-project#9043)
    f"max_completion_tokens is too large: {max_output_tokens}."
    f"This model supports at most {server_context_length} completion tokens."
)
Question: may I ask why this only checks the output length? Shouldn't the prompt tokens' length be included in the comparison?
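For reference, a check that budgets both prompt and completion tokens against the context window might look like the following sketch (hypothetical function and variable names, not the PR's actual code):

```python
def validate_request_length(prompt_tokens, max_output_tokens, server_context_length):
    """Reject requests whose prompt plus requested completion budget
    cannot fit into the model's context window.

    Hypothetical sketch: checks the combined length rather than the
    completion budget alone.
    """
    if prompt_tokens + max_output_tokens > server_context_length:
        raise ValueError(
            f"Request is too large: {prompt_tokens} prompt tokens plus "
            f"{max_output_tokens} completion tokens exceed the context "
            f"length of {server_context_length}."
        )
```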
@CatherineSue Could you please look at the error? #10738

Motivation
Why remove Harmony
Harmony integration was removed due to two critical limitations:
1. No support for output token IDs
2. No support for tool calls in chat completions
Known Limitations
GPT-OSS Detector Design Challenges
Since tool calls are embedded within Chain-of-Thought (CoT) reasoning for GPT-OSS models, we implemented a coordinated parsing approach: the reasoning parser first extracts analysis and commentary channels into `reasoning_text`, then the tool call parser extracts function calls from the remaining text.
Streaming Limitations
`request.separate_reasoning=False`

The challenges preventing incremental streaming for the channel format include:
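One core difficulty: the channel header decides whether streamed text is reasoning, a tool call, or user-facing content, and tool-call arguments must be complete JSON before they can be parsed, so text cannot be forwarded the moment it arrives. A minimal sketch of a buffer-until-complete strategy (hypothetical helper, not the PR's implementation):

```python
class ChannelStreamBuffer:
    """Buffer streamed text until each channel block is complete.

    Because the channel header determines where text belongs, and
    tool-call arguments must be complete JSON, nothing is emitted
    until a block's <|end|> marker has arrived. Hypothetical sketch.
    """

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk):
        """Absorb a chunk; return (channel, body) pairs for every
        channel block completed so far."""
        self.buffer += chunk
        emitted = []
        while "<|end|>" in self.buffer:
            block, self.buffer = self.buffer.split("<|end|>", 1)
            if "<|message|>" in block:
                header, body = block.split("<|message|>", 1)
                channel = header.replace("<|channel|>", "").strip()
                emitted.append((channel, body))
        return emitted
```

The downside this sketch makes visible is latency: a long analysis or tool-call block produces no output at all until its closing marker arrives, which is one reason truly incremental streaming is hard for the channel format.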
TODOs
sglang/python/sglang/srt/entrypoints/context.py
Lines 86 to 92 in 2449a0a
`<|start|>assistant<|channel|>analysis`. This behavior needs investigation to understand:

Modifications
- `serving_chat.py` - Removed harmony integration
- `reasoning_parser.py` - Added `GptOssDetector`
- `gpt_oss_detector.py` - Tool Call parser for gpt-oss models
- `harmony_tool_parser.py` - Removed (no longer needed)

Accuracy Tests
Benchmarking and Profiling
Checklist