First of all, thanks for your great work.
I evaluated Qwen2.5-VL-7B-Instruct with the current codebase (with `sample=False`) and found noticeable discrepancies compared to the results reported in the paper.
Since the model checkpoint is fixed and sampling is disabled, I would expect the results to be fully reproducible.
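For reference, here is a minimal sketch of the greedy-decoding configuration I assume `sample=False` corresponds to. The parameter names follow Hugging Face transformers' `generate()` kwargs; this is my assumption, not necessarily the repo's exact code:

```python
# Assumed deterministic decoding settings (transformers generate() kwargs).
# These names are my guess at what sample=False maps to in the eval script.
gen_kwargs = {
    "do_sample": False,  # greedy decoding: no token sampling
    "num_beams": 1,      # plain greedy search, no beam search
    "max_new_tokens": 512,  # illustrative cap, not taken from the repo
}

# Sanity check: with do_sample=False and num_beams=1, decoding is greedy
# and should be deterministic for a fixed model and fixed inputs.
assert gen_kwargs["do_sample"] is False and gen_kwargs["num_beams"] == 1
```

If the paper's runs used different decoding settings (e.g. beam search or a temperature with sampling enabled), that alone could explain part of the gap.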
**My Results vs Paper Results**

- Localization: 30.389 (paper: 30.5)
- Ego_Centric_Absolute_Distance_MultiChoice: 34.497 (paper: 32.7)
- Object_Centric_Absolute_Distance_MultiChoice: 36.0 (paper: 31.5)
- Object_Centric_Relative_Distance: 57.28 (paper: 66.5)
- Travel_Time: 39.08 (paper: 34.5)

Yes/No Questions:
- Ego_Centric_Relative_Distance: 50.05 (paper: 54.0)
- Ego_Centric_Motion_Reasoning: 25.49 (paper: 45.9)
- Object_Centric_Motion_Reasoning: 30.15 (paper: 44.0)
Could the authors confirm whether the evaluation code and dataset version used in the paper are exactly the same as those in the current repository?