
(gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support#9043

Merged
zhyncs merged 21 commits into main from chang/chat
Aug 12, 2025
Conversation


@CatherineSue (Collaborator) commented Aug 11, 2025

Motivation

Why remove Harmony

Harmony integration was removed due to two critical limitations:

  1. Missing output token ID support: Harmony requires output token IDs to function properly, but a bug in tokenizer_manager currently prevents access to these token IDs
  2. No tool call support for chat completions: The current harmony implementation does not support tool calls in the chat completions endpoint, a frequently requested feature for gpt-oss models

Known Limitations

GPT-OSS Detector Design Challenges

Since tool calls are embedded within Chain-of-Thought (CoT) reasoning for GPT-OSS models, we implemented a coordinated parsing approach:

  • Reasoning parser (reasoning_parser.py): Extracts analysis sections and passes non-analysis content (including tool calls) to the tool call parser
  • Tool call parser (function_call/gpt_oss_detector.py): Handles tool call extraction from the passed content
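
The hand-off between the two parsers can be sketched roughly as follows. This is a minimal illustration, not the actual sglang implementation; the marker regex is an assumption based on the public gpt-oss channel format:

```python
import re

# Hypothetical sketch of the two-stage split: the reasoning stage strips
# <|channel|>analysis sections and forwards everything else (including
# tool-call commentary) to the tool-call parser.
ANALYSIS_RE = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", re.DOTALL
)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_text, remaining_text_for_tool_parser)."""
    reasoning = "\n".join(m.group(1) for m in ANALYSIS_RE.finditer(text))
    remaining = ANALYSIS_RE.sub("", text)
    return reasoning, remaining
```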

Streaming Limitations

  • No true incremental streaming for the channel format: Because of the complexity of interleaved channels (analysis, commentary, final), the current implementation uses batch processing for the full channel format. True incremental streaming is only supported for the simplified format (analysis...assistantfinal).
  • Streaming does not work with request.separate_reasoning=False

The challenges preventing incremental streaming for channel format include:

  • Multiple interleaved channels that can appear in any order
  • Partial channel markers requiring buffering
  • Complex state management across partial sections
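
The partial-marker problem in particular forces the streamer to hold back any buffer suffix that could still turn into a channel marker. A minimal sketch of that buffering rule (the marker list is an assumption based on the gpt-oss channel format):

```python
MARKERS = ["<|channel|>", "<|message|>", "<|end|>", "<|start|>"]

def safe_emit_length(buffer: str) -> int:
    """Return how many chars of `buffer` can be emitted now, holding back
    any suffix that could be the prefix of a channel marker still arriving."""
    for held in range(1, len(buffer) + 1):
        suffix = buffer[-held:]
        if any(m.startswith(suffix) for m in MARKERS):
            return len(buffer) - held
    return len(buffer)
```

For example, `safe_emit_length("hello<|chan")` holds back `<|chan` because the rest of `<|channel|>` may still arrive in the next chunk.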

TODOs

  1. Fix output token ID issue in tokenizer_manager: required for re-enabling harmony support; see the current workaround:

         # TODO: Remove this workaround.
         # Very hacky: find the first occurrence of token 200006 and cut from there
         try:
             start_index = output_token_ids.index(200006)
             output_token_ids = output_token_ids[start_index:]
         except ValueError:
             pass
  2. Investigate content extraction: In normal cases, ret_item["text"] only contains analysis content but excludes channel markers like <|start|>assistant<|channel|>analysis. This behavior needs investigation to understand:
    • Whether this is intended model behavior or a parsing artifact
    • How channel markers are being filtered during text extraction
    • Impact on downstream processing and tool call coordination
  3. Implement true incremental streaming for the channel format: may require re-enabling harmony
  4. Add comprehensive test coverage: Expand test cases for edge cases and malformed channel sequences

Modifications

  • serving_chat.py - Removed harmony integration
  • reasoning_parser.py - Added GptOssDetector
  • gpt_oss_detector.py - Tool Call parser for gpt-oss models
  • harmony_tool_parser.py - Removed (no longer needed)

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've undertaken a significant refactoring to enhance our system's capabilities by removing the existing Harmony integration. This decision was driven by Harmony's limitations, particularly its lack of support for output token IDs and tool calls in chat completions. My primary goal was to implement native GPT-OSS tool call support, which is a highly requested feature. This change allows our system to directly interpret and utilize tool calls from GPT-OSS models, streamlining our interaction with these models and opening up new possibilities for advanced functionalities. The impact is a more robust and feature-rich tool calling mechanism, directly integrated into our core serving logic.

Highlights

  • Removal of Harmony Integration: The entire Harmony integration, which was previously used for GPT-OSS models, has been completely removed from the codebase. This includes removing related imports, initialization logic, and processing paths.
  • Native GPT-OSS Tool Call Support: Native support for GPT-OSS tool calls has been implemented. This involves new GptOssDetector classes for both function call parsing and reasoning parsing, allowing the system to directly interpret and handle tool calls embedded within the model's output.
  • Handling of GPT-OSS Channel Format: The new implementation specifically handles the 'channel format' used by GPT-OSS models, which includes distinct 'analysis' and 'final' channels, as well as 'commentary' channels for tool calls. This allows for structured extraction of reasoning and final answers.
  • Refactoring of Chat Serving Logic: The serving_chat.py module has been significantly refactored to remove all Harmony-specific code, making the chat completion endpoint cleaner and more focused on the native GPT-OSS implementation. This includes changes to request validation, template application, and stream processing.
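
For reference, a channel-format completion interleaves the channels described above roughly like this. The sample below is illustrative text based on the publicly documented gpt-oss/harmony format, not actual model output:

```python
# Illustrative raw output in the gpt-oss channel format (sample text, not
# real model output): analysis, a commentary tool call, then the final answer.
raw = (
    "<|channel|>analysis<|message|>User asks for weather; call a tool.<|end|>"
    "<|start|>assistant<|channel|>commentary to=functions.get_weather "
    "<|constrain|>json<|message|>{\"city\": \"Paris\"}<|call|>"
    "<|start|>assistant<|channel|>final<|message|>It is sunny in Paris."
)

# Pulling out only the user-facing final channel:
final = raw.split("<|channel|>final<|message|>")[-1]
print(final)  # It is sunny in Paris.
```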

Before:
- Tool calls that appeared between analysis sections were lost during GptOssDetector.detect_and_parse in reasoning_parser.py
- Only content after the last analysis section was preserved

Fix:
 Modified the detect_and_parse method to:
  1. Process the text sequentially, preserving content between analysis sections
  2. Collect all non-analysis content (including tool calls) in normal_parts
  3. Extract commentary channels that aren't tool calls for reasoning
  4. Return tool calls preserved in normal_text for the tool call parser to process
The two-stage parsing approach handles:
  - Reasoning extraction: Analysis + commentary → reasoning_text
  - Tool call extraction: Function calls → tool_calls
  - Clean final text: User-facing content → content
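
The sequential walk described in the fix can be sketched as follows. This is a simplified stand-in for the actual detect_and_parse method; the analysis-marker regex is an assumption based on the gpt-oss channel format:

```python
import re

ANALYSIS_RE = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", re.DOTALL
)

def detect_and_parse_sketch(text: str) -> tuple[str, str]:
    """Walk the text once, collecting analysis sections as reasoning while
    preserving everything between them (tool calls included) in order."""
    reasoning_parts, normal_parts = [], []
    pos = 0
    for m in ANALYSIS_RE.finditer(text):
        normal_parts.append(text[pos:m.start()])  # keep content between analyses
        reasoning_parts.append(m.group(1))
        pos = m.end()
    normal_parts.append(text[pos:])  # trailing content after the last analysis
    return "".join(reasoning_parts), "".join(normal_parts)
```

Tool calls that sit between two analysis sections now survive in the returned normal text instead of being dropped.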

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes the Harmony integration and introduces native support for GPT-OSS tool calls. The changes are extensive, involving the removal of old parsing logic and the addition of new detectors for both reasoning and tool calls. While the overall direction is good, the new parsing logic in GptOssDetector for both reasoning and function calls has some significant correctness and maintainability issues that need to be addressed. Specifically, the logic for distinguishing and parsing commentary blocks is fragile and can lead to incorrect behavior. Additionally, there is some code duplication in serving_chat.py that could be refactored for better maintainability.

@CatherineSue
Collaborator Author

wip: still debugging some streaming cases

@CatherineSue
Collaborator Author

Streaming: tool-call

[Screenshot: streaming tool-call example, 2025-08-10]

non-streaming + streaming: reasoning
reasoning_result.txt

JustinTong0323 and others added 3 commits August 11, 2025 09:53
… to ensure proper context length handling

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@zhyncs zhyncs merged commit a218490 into main Aug 12, 2025
56 of 68 checks passed
@zhyncs zhyncs deleted the chang/chat branch August 12, 2025 01:59
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
        f"max_completion_tokens is too large: {max_output_tokens}. "
        f"This model supports at most {server_context_length} completion tokens."
    )

A contributor commented:

Question: Why does this check only consider the output length? Shouldn't the prompt tokens' length be included in the comparison?
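
A check along the lines the reviewer suggests might look like this. This is a hypothetical sketch (the function and variable names mirror the quoted snippet but are not the actual sglang code):

```python
def validate_token_budget(prompt_tokens: int, max_output_tokens: int,
                          server_context_length: int) -> None:
    """Hypothetical sketch: budget the prompt and the completion together
    against the model's context window, as the review comment suggests."""
    if prompt_tokens + max_output_tokens > server_context_length:
        raise ValueError(
            f"Request needs {prompt_tokens + max_output_tokens} tokens "
            f"({prompt_tokens} prompt + {max_output_tokens} completion), "
            f"but this model supports at most {server_context_length} tokens."
        )
```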

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
@Swipe4057
Contributor

@CatherineSue Could you please look at the error? #10738

yanbing-j pushed a commit to jianan-gu/sglang that referenced this pull request Oct 13, 2025