Description
The eval.py script currently supports only a fixed set of question types (perception, prediction, and planning) for fog evaluation. The self.results dictionary includes only these categories, which causes issues when evaluating fog JSON files that contain other question types.
Current Behavior
The eval.py script initializes the scores dictionary with only the following categories:
scores = {
    "perception": {"MCQ": {}, "VQA": {}},
    "prediction": {"VQA": {}},
    "planning": {"VQA": {}},
    "behavior": {"MCQ": {}}
}

This restricts the evaluation to these hard-coded categories and excludes any other question types present in the data.
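A minimal sketch of the failure mode (assuming eval.py indexes scores by the question's category and type fields, which are hypothetical names here; the real schema may differ):

scores = {
    "perception": {"MCQ": {}, "VQA": {}},
    "prediction": {"VQA": {}},
    "planning": {"VQA": {}},
    "behavior": {"MCQ": {}}
}

# A fog question whose category is outside the hard-coded set
question = {"category": "robust_qas", "type": "VQA", "id": "q_001"}

# Raises KeyError: 'robust_qas' (or the question is silently skipped,
# depending on how eval.py iterates over the data)
scores[question["category"]][question["type"]][question["id"]] = 0.0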
Expected Behavior
The eval.py script should support all relevant question types for fog evaluation, including any additional types that may be present in the JSON files.
Steps to Reproduce
- Attempt to evaluate a fog JSON file containing question types outside of perception, prediction, and planning.
- Observe that the evaluation results are incomplete or incorrect due to the limited scores dictionary.
Suggested Improvements
- Update the scores dictionary: include all relevant question types that may be present in fog evaluation JSON files.
- Modify the evaluation logic: ensure the script can handle and evaluate all supported question types dynamically (see the sketch below).
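One way to handle this dynamically (a sketch only; build_scores, "category", and "type" are assumed names, and the actual fog JSON schema may differ) is to build the scores dictionary from the question types actually found in the file:

import json
from collections import defaultdict

def build_scores(annotation_path):
    """Build the scores dictionary from the question types actually
    present in the JSON file instead of a hard-coded set."""
    with open(annotation_path) as f:
        data = json.load(f)

    # Outer defaultdict creates an entry for any new category on first use,
    # so types such as "robust_qas" need no code change.
    scores = defaultdict(dict)
    for question in data:
        # "category" and "type" are assumed field names; adjust them to
        # match the real fog JSON schema.
        scores[question["category"]].setdefault(question["type"], {})
    return dict(scores)

With this approach, adding a new question type to the data would no longer require editing eval.py.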
Example
Here’s a potential modification to the scores dictionary:
scores = {
    "perception": {"MCQ": {}, "VQA": {}},
    "prediction": {"VQA": {}},
    "planning": {"VQA": {}},
    "behavior": {"MCQ": {}},
    "robust_qas": {"VQA": {}}  # Add fog-specific question types
}

Questions
- How should the scores dictionary be updated to support all question types for [fog/rain/etc.] evaluation?
- What is the recommended approach for dynamically handling different question types in the evaluation script?
- How should the Robustness Analysis results be interpreted when evaluating fog data?
Additional Notes
This issue affects the accuracy and completeness of the evaluation results when working with fog data. Updating the script to support all relevant question types would improve the robustness analysis.