Congratulations on releasing this work.
I’m trying to reproduce the reported BrowseComp evaluation results using the exact code provided in this repository. My setup is:

- Backend: vLLM
- Model: Seed-OSS-36B-Instruct

I only changed two runtime flags:

- `--max-model-len` from `131072` to `65536`
- `--num_workers` from `32` to `2`

However, my evaluation result is always 0, which seems abnormal:
```
Overall - Avg Score: 0.0000, Success: 150/150
By Data Source:
  bc_test_easy:   0.0000 (50 items)
  bc_test_hard:   0.0000 (50 items)
  bc_test_meduim: 0.0000 (50 items)
```
From the logs, it looks like the tasks “finish” successfully, but the judged score is always 0, and the model’s answers seem consistently incorrect. Here are a few examples:
```
False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Person 5991
[BRANCH] Find Hospital 6413
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Sanatorio de la Trinidad
Model: Paris, France
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Developer 15663
[BRANCH] Verify Commodore 64 Claim 16529
[CallAPI] Attempt 4 failed: Request timed out.. Retrying in 8s...
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Atari 130XE
Model: Commodore 64
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Owlman
Model: Raccoon
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Pierre Montale
Model: Charlotte Emma Tilbury
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Author 2 Clues 7734
[BRANCH] Author 1 Institution 8594
[BRANCH] Verify Variety Term 9448
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Nicotiana tabacum variety Wisconsin 38
Model: amino acid variety
('', 0)
```
Could you please let me know:
- Have you encountered a similar “all-zero score” issue during evaluation?
- If possible, could you share a reference evaluation output/log snippet (or a minimal example run) so I can compare against my run?
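
To rule out a comparison bug on my side, I also ran a tiny standalone check. Note this uses a hypothetical exact-match scorer, not the repository's actual judge (whose interface I haven't inspected); it only confirms that scoring the gold label against itself gives 1, while the mismatched label/model pairs from my logs give 0:

```python
def naive_score(label: str, model_answer: str) -> int:
    # Hypothetical stand-in scorer: normalized exact match.
    # (The repo's real judge may be model-based; this is only a sanity check.)
    norm = lambda s: " ".join(s.lower().split())
    return int(norm(label) == norm(model_answer))

# Gold label fed back as the answer should score 1:
print(naive_score("Sanatorio de la Trinidad", "Sanatorio de la Trinidad"))  # 1
# Mismatched pairs from my logs score 0, consistent with what I observe:
print(naive_score("Atari 130XE", "Commodore 64"))  # 0
```

So the all-zero result seems to reflect genuinely wrong final answers rather than an obvious scoring mismatch on my end, which is why I suspect something in my setup.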
Thank you very much for your time and for maintaining this project.