I tried to reproduce the ALEs results using LAION-CLAP to encode both the audio and the hypotheses (reformulated with GPT-4o). For each sample, I then selected the hypothesis with the highest cosine similarity to the audio embedding, following the procedure described in the paper. However, when I run the provided evaluation code, I get only 25% accuracy, whereas the paper reports 45.10% for the "sound" category.
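To make the selection step concrete, here is a minimal sketch of how I understand it, assuming the audio and hypothesis embeddings have already been computed. The function names (`cosine_similarity`, `pick_best_hypothesis`, `accuracy`) are hypothetical and not taken from the paper's code, and the toy vectors are placeholders, not real LAION-CLAP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def pick_best_hypothesis(audio_emb, hypothesis_embs):
    """Return the index of the most similar hypothesis and all scores."""
    scores = [cosine_similarity(audio_emb, h) for h in hypothesis_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def accuracy(predictions, labels):
    """Fraction of samples whose predicted index matches the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example with placeholder 3-d vectors (real embeddings are much larger).
audio = [1.0, 0.0, 0.0]
hyps = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
best, scores = pick_best_hypothesis(audio, hyps)
```

If the paper's procedure differs from this, for example in how embeddings are normalized or how ties between hypotheses are broken, that could explain part of the gap.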
Could you provide more details on this evaluation step, or would you like me to share my implementation for review?
Thank you!