
(gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support#9043

Merged
zhyncs merged 21 commits into main from chang/chat
Aug 12, 2025
Conversation


@CatherineSue (Collaborator) commented Aug 11, 2025

Motivation

Why remove Harmony

Harmony integration was removed due to two critical limitations:

  1. Missing output token ID support: Harmony requires output token IDs to function properly, but a bug in tokenizer_manager currently prevents access to these token IDs
  2. No tool call support for chat completions: The current harmony implementation does not support tool calls in the chat completions endpoint, a frequently requested feature for gpt-oss models

Known Limitations

GPT-OSS Detector Design Challenges

Since tool calls are embedded within Chain-of-Thought (CoT) reasoning for GPT-OSS models, we implemented a coordinated parsing approach:

  • Reasoning parser (reasoning_parser.py): Extracts analysis sections and passes non-analysis content (including tool calls) to the tool call parser
  • Tool call parser (function_call/gpt_oss_detector.py): Handles tool call extraction from the passed content
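
The hand-off between the two parsers can be sketched roughly as follows. This is a minimal illustration, not the actual sglang implementation; the marker regex is an assumption based on the public gpt-oss channel format:

```python
import re

# Hypothetical sketch of the two-stage split: the reasoning stage strips
# <|channel|>analysis sections and forwards everything else (including
# tool-call commentary) to the tool-call parser.
ANALYSIS_RE = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", re.DOTALL
)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_text, remaining_text_for_tool_parser)."""
    reasoning = "\n".join(m.group(1) for m in ANALYSIS_RE.finditer(text))
    remaining = ANALYSIS_RE.sub("", text)
    return reasoning, remaining
```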

Streaming Limitations

  • No true incremental streaming for the channel format: Because of the complexity of interleaved channels (analysis, commentary, final), the current implementation uses batch processing for the full channel format. True incremental streaming is only supported for the simplified format (analysis...assistantfinal).
  • Streaming does not work with request.separate_reasoning=False

The challenges preventing incremental streaming for channel format include:

  • Multiple interleaved channels that can appear in any order
  • Partial channel markers requiring buffering
  • Complex state management across partial sections
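
The partial-marker problem in particular forces the streamer to hold back any buffer suffix that could still turn into a channel marker. A minimal sketch of that buffering rule (the marker list is an assumption based on the gpt-oss channel format):

```python
MARKERS = ["<|channel|>", "<|message|>", "<|end|>", "<|start|>"]

def safe_emit_length(buffer: str) -> int:
    """Return how many chars of `buffer` can be emitted now, holding back
    any suffix that could be the prefix of a channel marker still arriving."""
    for held in range(1, len(buffer) + 1):
        suffix = buffer[-held:]
        if any(m.startswith(suffix) for m in MARKERS):
            return len(buffer) - held
    return len(buffer)
```

For example, `safe_emit_length("hello<|chan")` holds back `<|chan` because the rest of `<|channel|>` may still arrive in the next chunk.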

TODOs

  1. Fix output token ID issue in tokenizer_manager: required for re-enabling harmony support; see the current workaround:

         # TODO: Remove this workaround.
         # Very hacky: find the first occurrence of token 200006 and cut from there
         try:
             start_index = output_token_ids.index(200006)
             output_token_ids = output_token_ids[start_index:]
         except ValueError:
             pass
  2. Investigate content extraction: In normal cases, ret_item["text"] only contains analysis content but excludes channel markers like <|start|>assistant<|channel|>analysis. This behavior needs investigation to understand:
    • Whether this is intended model behavior or a parsing artifact
    • How channel markers are being filtered during text extraction
    • Impact on downstream processing and tool call coordination
  3. Implement true incremental streaming for the channel format: may require re-enabling harmony
  4. Add comprehensive test coverage: Expand test cases for edge cases and malformed channel sequences

Modifications

  • serving_chat.py - Removed harmony integration
  • reasoning_parser.py - Added GptOssDetector
  • gpt_oss_detector.py - Tool Call parser for gpt-oss models
  • harmony_tool_parser.py - Removed (no longer needed)

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've undertaken a significant refactoring to enhance our system's capabilities by removing the existing Harmony integration. This decision was driven by Harmony's limitations, particularly its lack of support for output token IDs and tool calls in chat completions. My primary goal was to implement native GPT-OSS tool call support, which is a highly requested feature. This change allows our system to directly interpret and utilize tool calls from GPT-OSS models, streamlining our interaction with these models and opening up new possibilities for advanced functionalities. The impact is a more robust and feature-rich tool calling mechanism, directly integrated into our core serving logic.

Highlights

  • Removal of Harmony Integration: The entire Harmony integration, which was previously used for GPT-OSS models, has been completely removed from the codebase. This includes removing related imports, initialization logic, and processing paths.
  • Native GPT-OSS Tool Call Support: Native support for GPT-OSS tool calls has been implemented. This involves new GptOssDetector classes for both function call parsing and reasoning parsing, allowing the system to directly interpret and handle tool calls embedded within the model's output.
  • Handling of GPT-OSS Channel Format: The new implementation specifically handles the 'channel format' used by GPT-OSS models, which includes distinct 'analysis' and 'final' channels, as well as 'commentary' channels for tool calls. This allows for structured extraction of reasoning and final answers.
  • Refactoring of Chat Serving Logic: The serving_chat.py module has been significantly refactored to remove all Harmony-specific code, making the chat completion endpoint cleaner and more focused on the native GPT-OSS implementation. This includes changes to request validation, template application, and stream processing.
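
For reference, a channel-format completion interleaves the channels described above roughly like this. The sample below is illustrative text based on the publicly documented gpt-oss/harmony format, not actual model output:

```python
# Illustrative raw output in the gpt-oss channel format (sample text, not
# real model output): analysis, a commentary tool call, then the final answer.
raw = (
    "<|channel|>analysis<|message|>User asks for weather; call a tool.<|end|>"
    "<|start|>assistant<|channel|>commentary to=functions.get_weather "
    "<|constrain|>json<|message|>{\"city\": \"Paris\"}<|call|>"
    "<|start|>assistant<|channel|>final<|message|>It is sunny in Paris."
)

# Pulling out only the user-facing final channel:
final = raw.split("<|channel|>final<|message|>")[-1]
print(final)  # It is sunny in Paris.
```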

Before:
- Tool calls that appeared between analysis sections were lost during GptOssDetector.detect_and_parse in reasoning_parser.py
- Only content after the last analysis section was preserved

Fix:
 Modified the detect_and_parse method to:
  1. Process the text sequentially, preserving content between analysis sections
  2. Collect all non-analysis content (including tool calls) in normal_parts
  3. Extract commentary channels that aren't tool calls for reasoning
  4. Return tool calls preserved in normal_text for the tool call parser to process
The two-stage parsing approach handles:
  - Reasoning extraction: Analysis + commentary → reasoning_text
  - Tool call extraction: Function calls → tool_calls
  - Clean final text: User-facing content → content
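
The sequential walk described in the fix can be sketched as follows. This is a simplified stand-in for the actual detect_and_parse method; the analysis-marker regex is an assumption based on the gpt-oss channel format:

```python
import re

ANALYSIS_RE = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", re.DOTALL
)

def detect_and_parse_sketch(text: str) -> tuple[str, str]:
    """Walk the text once, collecting analysis sections as reasoning while
    preserving everything between them (tool calls included) in order."""
    reasoning_parts, normal_parts = [], []
    pos = 0
    for m in ANALYSIS_RE.finditer(text):
        normal_parts.append(text[pos:m.start()])  # keep content between analyses
        reasoning_parts.append(m.group(1))
        pos = m.end()
    normal_parts.append(text[pos:])  # trailing content after the last analysis
    return "".join(reasoning_parts), "".join(normal_parts)
```

Tool calls that sit between two analysis sections now survive in the returned normal text instead of being dropped.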

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes the Harmony integration and introduces native support for GPT-OSS tool calls. The changes are extensive, involving the removal of old parsing logic and the addition of new detectors for both reasoning and tool calls. While the overall direction is good, the new parsing logic in GptOssDetector for both reasoning and function calls has some significant correctness and maintainability issues that need to be addressed. Specifically, the logic for distinguishing and parsing commentary blocks is fragile and can lead to incorrect behavior. Additionally, there is some code duplication in serving_chat.py that could be refactored for better maintainability.

@CatherineSue
Collaborator Author

wip: still debugging some streaming cases

@CatherineSue
Collaborator Author

Streaming: tool-call

[Screenshot: streaming tool-call example, 2025-08-10]

non-streaming + streaming: reasoning
reasoning_result.txt

JustinTong0323 and others added 3 commits August 11, 2025 09:53
… to ensure proper context length handling

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@zhyncs zhyncs merged commit a218490 into main Aug 12, 2025
56 of 68 checks passed
@zhyncs zhyncs deleted the chang/chat branch August 12, 2025 01:59
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
        f"max_completion_tokens is too large: {max_output_tokens}. "
        f"This model supports at most {server_context_length} completion tokens."
    )

A contributor commented:

Question: Why does this check only consider the output length? Shouldn't the prompt tokens' length be included in the comparison?
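
A check along the lines the reviewer suggests might look like this. This is a hypothetical sketch (the function and variable names mirror the quoted snippet but are not the actual sglang code):

```python
def validate_token_budget(prompt_tokens: int, max_output_tokens: int,
                          server_context_length: int) -> None:
    """Hypothetical sketch: budget the prompt and the completion together
    against the model's context window, as the review comment suggests."""
    if prompt_tokens + max_output_tokens > server_context_length:
        raise ValueError(
            f"Request needs {prompt_tokens + max_output_tokens} tokens "
            f"({prompt_tokens} prompt + {max_output_tokens} completion), "
            f"but this model supports at most {server_context_length} tokens."
        )
```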

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
@Swipe4057
Contributor

@CatherineSue Could you please look at the error? #10738

yanbing-j pushed a commit to jianan-gu/sglang that referenced this pull request Oct 13, 2025