There's a mismatch between what detectors return and what evaluators expect when dealing with multi-turn conversations:
- Detectors are expected to return a list of length len(all_outputs) (all assistant turns in the conversation)
- Evaluators index into attempt.outputs (only the last assistant turn's output)
The issue manifests in garak/evaluators/base.py:81, where messages.append(attempt.outputs[idx]) assumes that attempt.outputs is aligned with the detector results. However, the detector results have length len(attempt.all_outputs), which is greater than len(attempt.outputs) in a multi-turn setting.
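
To make the mismatch concrete, here is a minimal sketch that does not use garak itself. FakeAttempt, fake_detect, and collect_messages are hypothetical stand-ins I wrote for illustration; they only mimic the two list lengths described above, and the loop is my assumption of how the evaluator pairs detector results with outputs, not a copy of the real code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeAttempt:
    # Hypothetical stand-in, NOT garak's Attempt class.
    all_outputs: List[str] = field(default_factory=list)  # every assistant turn
    outputs: List[str] = field(default_factory=list)      # last assistant turn only


def fake_detect(attempt: FakeAttempt) -> List[float]:
    # Assumed detector contract: one score per entry in all_outputs.
    return [0.0 for _ in attempt.all_outputs]


def collect_messages(attempt: FakeAttempt, scores: List[float]) -> Tuple[List[str], List[int]]:
    # Assumed evaluator behavior: index attempt.outputs by detector-result position.
    messages, out_of_range = [], []
    for idx, _score in enumerate(scores):
        try:
            messages.append(attempt.outputs[idx])
        except IndexError:
            out_of_range.append(idx)  # score with no matching entry in outputs
    return messages, out_of_range


attempt = FakeAttempt(
    all_outputs=["turn 1 reply", "turn 2 reply", "turn 3 reply"],
    outputs=["turn 3 reply"],
)
scores = fake_detect(attempt)
messages, out_of_range = collect_messages(attempt, scores)
print(len(scores), len(attempt.outputs), out_of_range)
```

In this sketch the detector yields three scores while outputs holds one entry, so indices 1 and 2 fall outside outputs. Index 0 does not fail, but it pairs the score for the first turn with the text of the last turn, which would be a silent misalignment if the real evaluator behaves the same way.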
I wanted to check whether this is the expected behavior.