Add Logprobs unit test with a loose threshold #10230

Merged
hnyls2002 merged 37 commits into sgl-project:main from PrinsYin:logprobs
Sep 16, 2025

Conversation

@PrinsYin
Contributor

PrinsYin commented Sep 9, 2025

Add Logprobs Unittest

Motivation

We want to merge #6318, so this PR adds a dedicated unittest suite to verify that the changes do not break logprobs computation and interfaces.
The main goal is to catch indexing/corner-case issues and ensure stability/accuracy of logprobs against baseline data.


Modifications

  • Added test/srt/test_logprobs.py, which implements TestLogprobsDense (runs against DEFAULT_SMALL_MODEL_NAME_FOR_TEST) with:
    • Configurable batch sizes, sample counts, and temperatures
    • Random sampling of records per run for coverage
    • Comparison of input_top_logprobs and output_top_logprobs against baselines
    • Validation of both max diff and mean diff against thresholds
    • Per-config results written to $GITHUB_STEP_SUMMARY

  • Added the test to run_suite.py so it runs in CI

  • Added retry logic for downloading Hugging Face baselines

  • Enabled selective return_logprob within batches to check partial-request cases

  • Cleaned up unsafe access and added tolerance-based failure reporting
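The core comparison described above can be sketched as follows. This is a minimal illustration, not the actual test code: the helper name `check_logprobs` and its signature are hypothetical, and the default tolerances mirror the dense CUDA thresholds listed later in this PR.

```python
import numpy as np

# Hypothetical helper mirroring the check described above: compare the
# returned top logprobs of a record against the baseline values and
# validate both the max and the mean absolute difference.
def check_logprobs(returned, baseline, max_diff_tol=1.5, mean_diff_tol=0.1):
    returned = np.asarray(returned, dtype=np.float64)
    baseline = np.asarray(baseline, dtype=np.float64)
    diff = np.abs(returned - baseline)
    max_diff = float(diff.max())
    mean_diff = float(diff.mean())
    ok = max_diff <= max_diff_tol and mean_diff <= mean_diff_tol
    return ok, max_diff, mean_diff

ok, max_d, mean_d = check_logprobs([-1.00, -2.50], [-1.02, -2.48])
print(ok)  # True
```

The max-diff bound catches a single badly wrong token, while the mean-diff bound catches a systematic drift that stays under the max threshold per token.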


Baseline Data

We are not testing MoE in this PR; the MoE data below is for reference only.

  • Version: v0.5.1.post3
  • Dense baseline: sglang_baseline_2000.pkl
  • MoE baseline: sglang_baseline_moe.pkl
  • Generated ~1000 samples with v0.5.1.post3, uploaded to Hugging Face
  • Each test run samples a few hundred records based on config
  • Prompt length controlled to ~2000 tokens
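The retry logic for the baseline download mentioned in the Modifications can be sketched generically. The wrapper below is illustrative, not the test's actual implementation; in practice the callable would wrap something like a `huggingface_hub` download of the baseline pickle.

```python
import time

# Hypothetical retry wrapper: call a download function, retrying on
# transient failures with a simple linear backoff.
def download_with_retry(download_fn, retries=3, delay=1.0):
    last_exc = None
    for attempt in range(retries):
        try:
            return download_fn()
        except Exception as exc:  # network hiccups, rate limits, etc.
            last_exc = exc
            time.sleep(delay * (attempt + 1))
    raise RuntimeError(f"download failed after {retries} attempts") from last_exc
```

A flaky CI network is the motivating case: two transient failures followed by a success should still produce a green run instead of a spurious test failure.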

Tolerance Settings

Dense (CUDA):

  • Max diff ≤ 1.5
  • Mean diff ≤ 0.1

Dense (ROCm):

  • Max diff ≤ 1.4
  • Mean diff ≤ 0.1

MoE (NVIDIA):

  • Max diff ≤ 10
  • Mean diff ≤ 0.3

MoE (ROCm):

  • Max diff ≤ 9.0
  • Mean diff ≤ 0.5
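The four tolerance sets above can be collected into one lookup table. The keys and helper below are illustrative (the actual test may structure this differently); values are taken verbatim from the thresholds listed above.

```python
# Per-(model, platform) tolerances from this PR's description.
# The dict layout and lookup helper are hypothetical.
TOLERANCES = {
    ("dense", "cuda"): {"max_diff": 1.5, "mean_diff": 0.1},
    ("dense", "rocm"): {"max_diff": 1.4, "mean_diff": 0.1},
    ("moe", "cuda"): {"max_diff": 10.0, "mean_diff": 0.3},
    ("moe", "rocm"): {"max_diff": 9.0, "mean_diff": 0.5},
}

def tolerance_for(model_kind, platform):
    return TOLERANCES[(model_kind, platform)]
```

Note the asymmetry: MoE thresholds are roughly an order of magnitude looser on max diff, reflecting the larger numeric variance of expert routing.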

GitHub Summary Reporting

Each run appends a section like:
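The example summary was not preserved in this capture. A minimal sketch of how a per-config section could be appended to the GitHub Actions step summary is shown below; the function name and field layout are hypothetical, only the `GITHUB_STEP_SUMMARY` mechanism itself is standard GitHub Actions behavior.

```python
import os

# Illustrative sketch: append one markdown section per test config to the
# step summary file that GitHub Actions exposes via GITHUB_STEP_SUMMARY.
def append_step_summary(config_name, max_diff, mean_diff, n_samples):
    lines = [
        f"### Logprobs check: {config_name}",
        f"- max Δ = {max_diff:.6f}",
        f"- mean Δ = {mean_diff:.6f}",
        f"- samples compared: {n_samples}",
        "",
    ]
    text = "\n".join(lines)
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:  # only set inside a GitHub Actions job
        with open(summary_path, "a") as f:
            f.write(text + "\n")
    return text
```

Outside CI the environment variable is absent, so the function degrades to returning the text without writing anywhere.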

PrinsYin marked this pull request as ready for review September 10, 2025 03:26
@yushengsu-thu
Collaborator

Add AMD baseline and thresholds

  • Dense:
    Max diff ≤ 1.4
    Mean diff ≤ 0.1 (per sample)

  • MoE:
    Max diff ≤ 9.0
    Mean diff ≤ 0.5 (per sample)

@zhaochenyang20
Collaborator

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive unittest suite for log probability correctness in SGLang, which is a great addition for ensuring model stability. The tests cover both dense and MoE models, with configurable parameters and comparisons against baseline data. My review focuses on improving code quality, robustness, and maintainability. I've identified some critical issues related to unsafe data access that could cause tests to crash, and I've also suggested refactoring to reduce significant code duplication between the test classes. Additionally, there are some medium-severity suggestions to improve exception handling and reporting consistency.


@PrinsYin
Contributor Author

@narutolhy @zhaochenyang20 can you take a moment to review this? thanks!

The review thread below discusses this line from the new test:

```python
    },
]

os.environ["RETURN_ORIGINAL_LOGPROB"] = "True"
```
Contributor

Why do we need to turn on this switch? After turning it on, only the original logprobs are returned, so the different temperatures added in TEST_CONFIGS will not affect the returned logprobs.

Contributor Author

I think we enable this so that the logprobs stay independent of temperature, and the test purely checks consistency across different batch/ratio cases. @zhaochenyang20

Collaborator

Well, I think we do not need 4 groups of tests; one group is enough: num_samples 1000, ratio 0.5.

Contributor

I feel that if this is added, there is no need to compare the effects of different temperatures, because different temperatures do not affect the return value.
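The point in this thread is that temperature rescales the logits before the softmax, so temperature-scaled logprobs differ from the originals; returning the original logprobs removes that dependence. A small self-contained sketch (not SGLang code) of temperature-scaled log-softmax:

```python
import math

# Numerically stable log-softmax with an optional temperature that
# rescales the logits before normalization.
def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

logits = [2.0, 1.0, 0.1]
orig = log_softmax(logits, temperature=1.0)  # "original" logprobs
hot = log_softmax(logits, temperature=2.0)   # flatter distribution
print(orig[0] != hot[0])  # True
```

This is why, with RETURN_ORIGINAL_LOGPROB enabled, varying temperature across TEST_CONFIGS cannot change the returned values.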

zhaochenyang20 changed the title from "[WIP] Add Logprobs unittest" to "Add Logprobs unit test with a loose threshold" Sep 13, 2025
@zhaochenyang20
Copy link
Collaborator

Tested on H200 5 times:

```
max Δ=0.624243
mean Δ=0.0032411
logprobs returned for 500 samples (expected: 500)

max Δ=0.624243
mean Δ=0.00346705
logprobs returned for 500 samples (expected: 500)

max Δ=0.624243
mean Δ=0.00395164
logprobs returned for 500 samples (expected: 500)

max Δ=0.624243
mean Δ=0.00399972
logprobs returned for 500 samples (expected: 500)

max Δ=0.499887
mean Δ=0.00326239
logprobs returned for 500 samples (expected: 500)
```

hnyls2002 merged commit f1c692f into sgl-project:main Sep 16, 2025
35 of 39 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Ryan <ryan@ryanmini.mynetworksettings.com>