Anti-Evaluation Awareness Steering #3619

@AMindToThink

Description

It would be extremely useful if the harness had a way to steer the LLM away from eval-awareness before running the tests, in a way similar to *Steering Evaluation-Aware Language Models to Act Like They Are Deployed*. Eval-awareness risks interfering with our measurements, so implementing preventative measures in the harness would be valuable.

This could look like:

  • Calculating steering vectors
  • Caching them locally (though ideally in the cloud, so that users can skip a step if they want to steer the same model again)
  • Adding a flag to evaluations that determines whether to apply deployment steering
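The three steps above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: the function names, the `steering_cache` directory, and the contrastive mean-difference construction (mean "deployment" activation minus mean "evaluation" activation, as in the paper's approach) are all assumptions for the sake of the example.

```python
import hashlib
from pathlib import Path

import numpy as np

# Hypothetical local cache location (the issue suggests a cloud cache ideally).
CACHE_DIR = Path("steering_cache")


def compute_steering_vector(eval_acts: np.ndarray, deploy_acts: np.ndarray) -> np.ndarray:
    """Contrastive steering vector: mean activation on 'deployment-framed'
    prompts minus mean activation on 'evaluation-framed' prompts.
    Both arrays have shape (n_prompts, hidden_dim)."""
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)


def cached_steering_vector(
    model_id: str, layer: int, eval_acts: np.ndarray, deploy_acts: np.ndarray
) -> np.ndarray:
    """Cache the vector on disk, keyed by model and layer, so reruns skip
    the (expensive) activation-collection step."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{model_id}:{layer}".encode()).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vec = compute_steering_vector(eval_acts, deploy_acts)
    np.save(path, vec)
    return vec


def apply_steering(
    hidden: np.ndarray, vec: np.ndarray, strength: float = 1.0, enabled: bool = True
) -> np.ndarray:
    """Add the steering vector to a hidden state during generation.
    `enabled` mirrors the proposed per-evaluation flag."""
    return hidden + strength * vec if enabled else hidden
```

In a real implementation the activations would come from forward hooks on a chosen transformer layer while running paired eval/deploy prompts, and `apply_steering` would run inside the forward pass rather than on a bare array.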
