It would be extremely useful if the harness had a way to steer the LLM to not be eval-aware before running the tests, similar to Steering Evaluation-Aware Language Models to Act Like They Are Deployed. Eval-awareness risks interfering with our measurements, so implementing preventative measures in the harness would be valuable.
This could look like:
- Calculating steering vectors
- Caching them locally (or, ideally, in the cloud, so users can skip this step when steering the same model again)
- Adding a flag to evaluations that determines whether to apply deployment steering
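To make the first step concrete, here is a rough sketch of one common way to compute such a vector (difference of means between activations on "deployed" vs. "eval" prompts) and apply it to a layer's hidden states. All names (`steering_vector`, `apply_steering`, the `alpha` scale) are hypothetical, and NumPy arrays stand in for real model activations; this is not the harness's API, just an illustration under those assumptions:

```python
import numpy as np

def steering_vector(eval_acts: np.ndarray, deploy_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means vector pointing from 'eval-aware' activations
    toward 'deployed' activations. Each input is (n_prompts, hidden_dim)."""
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every hidden state (broadcasts over rows)."""
    return hidden + alpha * vec

# Toy example: random stand-ins for activations collected at one layer.
rng = np.random.default_rng(0)
eval_acts = rng.normal(size=(8, 4))              # activations on eval-flavored prompts
deploy_acts = rng.normal(loc=0.5, size=(8, 4))   # activations on deployment-flavored prompts

vec = steering_vector(eval_acts, deploy_acts)    # shape: (4,)
steered = apply_steering(eval_acts, vec)         # shape: (8, 4)
```

In a real implementation the vector would be computed once per model/layer and cached (the second bullet above), then added into the residual stream via a forward hook whenever the hypothetical flag is set.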