First of all, thanks for your great work.
I evaluated Qwen2.5-VL-7B-Instruct with the current codebase (with `sample=False`) and found noticeable discrepancies compared to the results reported in the paper.
Since the model checkpoint is fixed and sampling is disabled, I would expect the results to be fully reproducible.
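For reference, here is a minimal sketch of the greedy-decoding configuration I assume `sample=False` corresponds to. The parameter names follow Hugging Face transformers' `generate()` kwargs; this is my assumption, not necessarily the repo's exact code:

```python
# Assumed deterministic decoding settings (transformers generate() kwargs).
# These names are my guess at what sample=False maps to in the eval script.
gen_kwargs = {
    "do_sample": False,  # greedy decoding: no token sampling
    "num_beams": 1,      # plain greedy search, no beam search
    "max_new_tokens": 512,  # illustrative cap, not taken from the repo
}

# Sanity check: with do_sample=False and num_beams=1, decoding is greedy
# and should be deterministic for a fixed model and fixed inputs.
assert gen_kwargs["do_sample"] is False and gen_kwargs["num_beams"] == 1
```

If the paper's runs used different decoding settings (e.g. beam search or a temperature with sampling enabled), that alone could explain part of the gap.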
**My Results vs Paper Results**

- Localization: 30.389 (paper: 30.5)
- Ego_Centric_Absolute_Distance_MultiChoice: 34.497 (paper: 32.7)
- Object_Centric_Absolute_Distance_MultiChoice: 36.0 (paper: 31.5)
- Object_Centric_Relative_Distance: 57.28 (paper: 66.5)
- Travel_Time: 39.08 (paper: 34.5)

Yes/No Questions:
- Ego_Centric_Relative_Distance: 50.05 (paper: 54.0)
- Ego_Centric_Motion_Reasoning: 25.49 (paper: 45.9)
- Object_Centric_Motion_Reasoning: 30.15 (paper: 44.0)
Could the authors confirm whether the evaluation code and dataset version used in the paper are exactly the same as those in the current repository?