Add Logprobs unit test with a loose threshold #10230
hnyls2002 merged 37 commits into sgl-project:main
Conversation
Add AMD baseline and thresholds
/gemini review
Code Review
This pull request introduces a comprehensive unittest suite for log probability correctness in SGLang, which is a great addition for ensuring model stability. The tests cover both dense and MoE models, with configurable parameters and comparisons against baseline data. My review focuses on improving code quality, robustness, and maintainability. I've identified some critical issues related to unsafe data access that could cause tests to crash, and I've also suggested refactoring to reduce significant code duplication between the test classes. Additionally, there are some medium-severity suggestions to improve exception handling and reporting consistency.
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
@narutolhy @zhaochenyang20 can you take a moment to review this? thanks!
test/srt/test_logprobs.py
    },
]

os.environ["RETURN_ORIGINAL_LOGPROB"] = "True"
Why do we need to turn on this switch? After turning it on, only the original logprobs will be returned. The different temperatures added in TEST_CONFIGS will not affect the returned logprobs.
I think we enable this so that logprobs stay independent of temperature and we can purely test consistency across different batch/ratio cases in this test? @zhaochenyang20
Well, I think we do not need 4 groups of tests. We only need one: num_samples 1000, ratio 0.5.
I feel that if this is added, there is no need to compare the effects of different temperatures, because different temperatures do not affect the return value.
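For context, here is a minimal sketch of how a test case might exercise this path. It assumes SGLang's native /generate endpoint with return_logprob and top_logprobs_num and a locally launched server; the field names and request shape are assumptions based on that API, and the actual test file may structure this differently.

```python
import os

import requests

# Assumption: returning the original (pre-temperature) logprobs is controlled by
# this environment variable, as in the diff above. It must be set in the process
# that launches the server, so the temperatures in TEST_CONFIGS do not change
# the returned values.
os.environ["RETURN_ORIGINAL_LOGPROB"] = "True"

# Hypothetical local server address; the real test launches its own server.
BASE_URL = "http://127.0.0.1:30000"

resp = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 1.0, "max_new_tokens": 8},
        "return_logprob": True,   # ask the server to return logprobs
        "top_logprobs_num": 5,    # top-k logprobs per position
        "logprob_start_len": 0,   # include prompt-token logprobs
    },
)
meta = resp.json()["meta_info"]
print(meta["input_top_logprobs"][:2])   # prompt-side top logprobs
print(meta["output_top_logprobs"][:2])  # generated-side top logprobs
```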
Tested on H200, 5 runs:
Run 1: max Δ=0.624243, mean Δ=0.0032411, logprobs returned for 500 samples (expected: 500)
Run 2: max Δ=0.624243, mean Δ=0.00346705, logprobs returned for 500 samples (expected: 500)
Run 3: max Δ=0.624243, mean Δ=0.00395164, logprobs returned for 500 samples (expected: 500)
Run 4: max Δ=0.624243, mean Δ=0.00399972, logprobs returned for 500 samples (expected: 500)
Run 5: max Δ=0.499887, mean Δ=0.00326239, logprobs returned for 500 samples (expected: 500)
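For reference, a minimal sketch of how per-token deltas like the ones above could be computed against a baseline, assuming both sides are flat lists of logprob values in the same order. The function name and the tolerance values are illustrative, not the PR's actual settings.

```python
import numpy as np


def compare_logprobs(test_logprobs, baseline_logprobs, max_tol=1.0, mean_tol=0.01):
    """Compute max/mean absolute deltas and check them against loose tolerances."""
    test = np.asarray(test_logprobs, dtype=np.float64)
    base = np.asarray(baseline_logprobs, dtype=np.float64)
    assert test.shape == base.shape, "test/baseline logprob counts differ"

    deltas = np.abs(test - base)
    max_delta, mean_delta = deltas.max(), deltas.mean()
    print(f"max Δ={max_delta:.6g}")
    print(f"mean Δ={mean_delta:.6g}")

    # Loose thresholds: illustrative values only, not the thresholds used in the PR.
    assert max_delta <= max_tol, f"max Δ {max_delta} exceeds {max_tol}"
    assert mean_delta <= mean_tol, f"mean Δ {mean_delta} exceeds {mean_tol}"
    return max_delta, mean_delta
```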
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Ryan <ryan@ryanmini.mynetworksettings.com>
Add Logprobs Unittest
Motivation
We want to merge #6318, so this PR adds a dedicated unittest suite to verify that the changes do not break logprobs computation and interfaces.
The main goal is to catch indexing/corner-case issues and ensure stability/accuracy of logprobs against baseline data.
Modifications
- Added test/srt/test_logprobs.py
  - TestLogprobsDense: runs against DEFAULT_SMALL_MODEL_NAME_FOR_TEST
  - Checks input_top_logprobs and output_top_logprobs against baselines
  - Appends a results section to $GITHUB_STEP_SUMMARY
- Added the test to run_suite.py so it runs in CI
- Added retry logic for downloading Hugging Face baselines (see the sketch after this list)
- Enabled selective return_logprob in batches to check partial-request cases
- Cleaned up unsafe access and added tolerance-based failure reporting
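A minimal sketch of what the baseline-download retry logic could look like, assuming the baselines live in a Hugging Face dataset repo fetched with huggingface_hub; the repo ID, file name, and retry counts here are placeholders, not the ones used in the PR.

```python
import time

from huggingface_hub import hf_hub_download


def download_baseline_with_retry(repo_id, filename, retries=3, wait_s=5):
    """Download a baseline file from the Hugging Face Hub, retrying on transient errors."""
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
        except Exception as err:  # network hiccups, rate limits, etc.
            last_err = err
            print(f"attempt {attempt}/{retries} failed: {err}")
            time.sleep(wait_s)
    raise RuntimeError(f"failed to download {filename} from {repo_id}") from last_err


# Placeholder identifiers; the PR uses its own baseline repo and file names.
# path = download_baseline_with_retry("some-org/logprob-baselines", "dense_baseline.json")
```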
Baseline Data
WE ARE NOT TESTING MoE IN THIS PR; the MoE data below is just for reference.
Baselines were generated with v0.5.1.post3 and uploaded to Hugging Face.
Tolerance Settings
- Dense (CUDA):
- Dense (ROCm):
- MoE (NVIDIA):
- MoE (ROCm):
GitHub Summary Reporting
Each run appends a section like:
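The exact section template is not reproduced here. Below is a hedged sketch of how a test could append such a section, using the standard GitHub Actions $GITHUB_STEP_SUMMARY file; the heading and field names are illustrative, not the PR's actual template.

```python
import os


def append_github_summary(title, max_delta, mean_delta, num_samples, expected):
    """Append a markdown section to the GitHub Actions step summary, if available."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return  # running locally, outside GitHub Actions
    with open(summary_path, "a") as f:
        f.write(f"## {title}\n")
        f.write(f"- max Δ: {max_delta:.6g}\n")
        f.write(f"- mean Δ: {mean_delta:.6g}\n")
        f.write(f"- logprobs returned for {num_samples} samples (expected: {expected})\n\n")
```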