
BrowseComp reproduction with Seed-OSS-36B-Instruct via vLLM yields 0 score #11

@hb-studying

Description


Congratulations on releasing this work.

I’m trying to reproduce the reported BrowseComp evaluation results using the exact code provided in this repository. My setup is:

  • Backend: vLLM

  • Model: Seed-OSS-36B-Instruct

  • I only changed two runtime flags:

    • --max-model-len from 131072 to 65536
    • --num_workers from 32 to 2
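
For reference, a sketch of how the run is configured; the `vllm serve` invocation and the placement of `--num_workers` on the eval script are assumptions based on the flag names above, not copied from the repo:

```shell
# Sketch of the server launch, assuming a standard `vllm serve` invocation;
# only the flag below was changed from the repository default.
vllm serve Seed-OSS-36B-Instruct \
    --max-model-len 65536        # repo default: 131072

# Evaluation side (flag name as given above; script name is hypothetical):
# python run_eval.py --num_workers 2   # repo default: 32
```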

However, the evaluation score is always 0, which seems abnormal:

Overall - Avg Score: 0.0000, Success: 150/150

By Data Source:
  bc_test_easy: 0.0000 (50 items)
  bc_test_hard: 0.0000 (50 items)
  bc_test_meduim: 0.0000 (50 items)

From the logs, the tasks appear to finish successfully, but the judged score is always 0 and the model's answers are consistently incorrect. A few examples:

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Person 5991
[BRANCH] Find Hospital 6413
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Sanatorio de la Trinidad
Model: Paris, France
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Identify Developer 15663
[BRANCH] Verify Commodore 64 Claim 16529
[CallAPI] Attempt 4 failed: Request timed out.. Retrying in 8s...
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Atari 130XE
Model: Commodore 64
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Owlman
Model: Raccoon
('', 0)

False <class 'envs.local_search.LocalSearch'>
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Pierre Montale
Model: Charlotte Emma Tilbury
('', 0)

False <class 'envs.local_search.LocalSearch'>
[BRANCH] Author 2 Clues 7734
[BRANCH] Author 1 Institution 8594
[BRANCH] Verify Variety Term 9448
[TASK] Task Finish, Start Reward
[Judged] score=0
Label: Nicotiana tabacum variety Wisconsin 38
Model: amino acid variety
('', 0)
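
To separate "the judge always returns 0" from "the answers are genuinely wrong", one quick check is to parse the `Label:` / `Model:` pairs out of the log and count exact matches. This is a hypothetical debugging sketch based only on the log format shown above (the embedded log text is a shortened copy of the examples):

```python
# Parse "Label:" / "Model:" pairs from an evaluation log and count
# case-insensitive exact matches between the gold label and the model answer.
log = """\
[Judged] score=0
Label: Sanatorio de la Trinidad
Model: Paris, France
[Judged] score=0
Label: Atari 130XE
Model: Commodore 64
"""

labels, models = [], []
for line in log.splitlines():
    if line.startswith("Label:"):
        labels.append(line.removeprefix("Label:").strip())
    elif line.startswith("Model:"):
        models.append(line.removeprefix("Model:").strip())

exact = sum(l.lower() == m.lower() for l, m in zip(labels, models))
print(f"{exact}/{len(labels)} exact matches")
```

If this reports 0 matches on the full log, the zero scores reflect wrong answers rather than a broken judge; if the answers do match but the judge still emits `('', 0)`, the judge call (or its response parsing) is the suspect.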

Could you please let me know:

  1. Have you encountered a similar "all-zero score" issue during evaluation?
  2. Could you share a reference evaluation output/log snippet (or a minimal example run) so I can compare it against mine?

Thank you very much for your time and for maintaining this project.
