
BrowseComp reproduction with Seed-OSS-36B-Instruct via vLLM yields 0 score #11

@hb-studying

Description


Congratulations on releasing this work.

I’m trying to reproduce the reported BrowseComp evaluation results using the exact code provided in this repository. My setup is:

  • Backend: vLLM

  • Model: Seed-OSS-36B-Instruct

  • I only changed two runtime flags:

    • --max-model-len from 131072 to 65536
    • --num_workers from 32 to 2
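
For reference, a sketch of how the run is configured; the `vllm serve` invocation and the placement of `--num_workers` on the eval script are assumptions based on the flag names above, not copied from the repo:

```shell
# Sketch of the server launch, assuming a standard `vllm serve` invocation;
# only the flag below was changed from the repository default.
vllm serve Seed-OSS-36B-Instruct \
    --max-model-len 65536        # repo default: 131072

# Evaluation side (flag name as given above; script name is hypothetical):
# python run_eval.py --num_workers 2   # repo default: 32
```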

However, the evaluation score is always 0, which seems abnormal:

Overall - Avg Score: 0.0000, Success: 150/150

By Data Source:
  bc_test_easy: 0.0000 (50 items)
  bc_test_hard: 0.0000 (50 items)
  bc_test_meduim: 0.0000 (50 items)

From the logs, the tasks appear to finish successfully, but the judged score is always 0 and the model's answers are consistently incorrect. A few examples:

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Person 5991
[BRANCH] Find Hospital 6413
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Sanatorio de la Trinidad
Model: Paris, France
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Developer 15663
[BRANCH] Verify Commodore 64 Claim 16529
[CallAPI] Attempt 4 failed: Request timed out.. Retrying in 8s...
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Atari 130XE
Model: Commodore 64
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Owlman
Model: Raccoon
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Pierre Montale
Model: Charlotte Emma Tilbury
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Author 2 Clues 7734
[BRANCH] Author 1 Institution 8594
[BRANCH] Verify Variety Term 9448
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Nicotiana tabacum variety Wisconsin 38
Model: amino acid variety
('', 0)
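
To separate "the judge always returns 0" from "the answers are genuinely wrong", one quick check is to parse the `Label:` / `Model:` pairs out of the log and count exact matches. This is a hypothetical debugging sketch based only on the log format shown above (the embedded log text is a shortened copy of the examples):

```python
# Parse "Label:" / "Model:" pairs from an evaluation log and count
# case-insensitive exact matches between the gold label and the model answer.
log = """\
[Judged] score=0
Label: Sanatorio de la Trinidad
Model: Paris, France
[Judged] score=0
Label: Atari 130XE
Model: Commodore 64
"""

labels, models = [], []
for line in log.splitlines():
    if line.startswith("Label:"):
        labels.append(line.removeprefix("Label:").strip())
    elif line.startswith("Model:"):
        models.append(line.removeprefix("Model:").strip())

exact = sum(l.lower() == m.lower() for l, m in zip(labels, models))
print(f"{exact}/{len(labels)} exact matches")
```

If this reports 0 matches on the full log, the zero scores reflect wrong answers rather than a broken judge; if the answers do match but the judge still emits `('', 0)`, the judge call (or its response parsing) is the suspect.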

Could you please let me know:

  1. Have you encountered a similar "all-zero score" issue during evaluation?
  2. Could you share a reference evaluation output/log snippet (or a minimal example run) so I can compare it against mine?

Thank you very much for your time and for maintaining this project.
