
Evaluation Results on Qwen2.5-VL-7B-Instruct Do Not Match Reported Paper Metrics #4

Description

@cjfcsjt

First of all, thanks for your great work.

I evaluated Qwen2.5-VL-7B-Instruct using the current codebase (with sample=False) and found noticeable discrepancies compared to the results reported in the paper.

Given that the model is fixed and sampling is disabled, I would expect the results to be fully reproducible.
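For reference, here is a minimal sketch of how I run the model with sampling disabled, following the standard Hugging Face `transformers` usage for Qwen2.5-VL. The image path and question text below are placeholders only; the actual prompts and media come from the repo's evaluation harness.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder item; the benchmark harness builds the real messages.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/example_frame.jpg"},
        {"type": "text", "text": "Example question from the benchmark"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# do_sample=False => greedy decoding, so outputs should be deterministic
# for a fixed model, fixed inputs, and fixed software stack.
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

If the paper's evaluation used different generation settings (e.g., beam search, a different max_new_tokens, or a different prompt template), that alone might explain part of the gap, so it would help to know the exact configuration.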

My results vs. paper results:

| Task | My result | Paper |
| --- | --- | --- |
| Localization | 30.389 | 30.5 |
| Ego_Centric_Absolute_Distance_MultiChoice | 34.497 | 32.7 |
| Object_Centric_Absolute_Distance_MultiChoice | 36.0 | 31.5 |
| Object_Centric_Relative_Distance | 57.28 | 66.5 |
| Travel_Time | 39.08 | 34.5 |

Yes/No questions:

| Task | My result | Paper |
| --- | --- | --- |
| Ego_Centric_Relative_Distance | 50.05 | 54.0 |
| Ego_Centric_Motion_Reasoning | 25.49 | 45.9 |
| Object_Centric_Motion_Reasoning | 30.15 | 44.0 |

Could the authors confirm whether the evaluation code and dataset version used for the paper's results are exactly the ones in the current repository?
