Training and evaluation pipeline for Introspection Adapters (IAs) — LoRA adapters that enable language models to verbalize the behaviors trained into them.
Models and datasets are available on HuggingFace under the introspection-auditing organization.
```bash
git clone <repo-url>
cd introspection-adapters
bash setup.sh
```
The setup script initializes the safety-tooling submodule, patches its dependency pins for compatibility, and runs `uv sync`.
Then create a .env file with your API keys:
```
HF_TOKEN=<your-huggingface-token>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
```
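The repo's tooling is expected to read `.env` at runtime; if you want to verify that the file parses and both keys are present, here is a minimal check using the python-dotenv package (an extra dependency, not something the repo requires):

```python
# Optional sanity check that .env parses and both keys are present.
# Assumes python-dotenv is installed; the repo itself may load .env differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from ./.env into os.environ
for key in ("HF_TOKEN", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")
```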
Requirements:
- GPU with sufficient VRAM (A100 80GB recommended for Llama 70B, A40/A6000 for Qwen 14B)
- Anthropic API access (for grading steps that use Claude)
- HuggingFace access to the model repos
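Before launching a long run, it can be worth confirming the GPU requirement programmatically. A small sketch using PyTorch (assuming it is available in the synced environment, which is likely given the LoRA training code):

```python
# Report the visible GPU and its total VRAM; compare against the
# recommendations above (80 GB-class for Llama 70B, 48 GB-class for Qwen 14B).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.0f} GB VRAM")
else:
    print("No CUDA GPU visible")
```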
Each use case has a bash script in scripts/ and example configs in run_configs/.
Evaluate a pretrained IA from HuggingFace on out-of-distribution behaviors:
```bash
bash scripts/eval_ia.sh run_configs/eval_llama_example.sh
```
Train a new IA from scratch, with optional DPO refinement:
```bash
bash scripts/train_ia.sh run_configs/train_llama_example.sh
```
Run the covert malicious fine-tuning (CMFT) evaluation with a separate grading rubric:
```bash
bash scripts/eval_cmft.sh run_configs/cmft_llama_example.sh
```
All results are saved to `final_results/<experiment_name>/`.
Run configs are simple shell files sourced by the bash scripts. Copy an example and modify it.
An example eval config:
```bash
MODEL=llama                      # llama or qwen
IA_PATH=introspection-auditing/Llama-3.3-70B-Instruct_dpo_meta_lora_all_six_dpo  # HF hub path or local path
EXPERIMENT_NAME=eval_llama_six_dpo
EVAL_CATEGORIES="prism4 ukaisi"  # Categories to evaluate on
SEED=1547
```
EVAL_CATEGORIES controls which behavior categories are evaluated. Options:
- Individual: backdoor, benign, harmful, heuristic, quirk, rare, problematic, sandbagging, prism4_synth_docs, prism4_synth_docs_kto, prism4_transcripts, prism4_transcripts_kto, ukaisi
- Groups: ood (prism4 + ukaisi), prism4 (all 4 prism4 variants), training (the 8 standard categories), all

For IAs trained on the standard settings, use ood (default) to evaluate on out-of-distribution behaviors only.
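Written out explicitly, the group expansion looks like this. The dict below illustrates the mapping documented above; it is not the pipeline's actual data structure:

```python
# Illustrative expansion of EVAL_CATEGORIES group names into individual
# categories; mirrors the documentation above, not the pipeline's own code.
TRAINING = ["backdoor", "benign", "harmful", "heuristic",
            "quirk", "rare", "problematic", "sandbagging"]
PRISM4 = ["prism4_synth_docs", "prism4_synth_docs_kto",
          "prism4_transcripts", "prism4_transcripts_kto"]

GROUPS = {
    "training": TRAINING,          # the 8 standard categories
    "prism4": PRISM4,              # all 4 prism4 variants
    "ood": PRISM4 + ["ukaisi"],    # out-of-distribution only
    "all": TRAINING + PRISM4 + ["ukaisi"],
}

def expand(categories: str) -> list[str]:
    """Expand a space-separated EVAL_CATEGORIES string into individual names."""
    out: list[str] = []
    for name in categories.split():
        out.extend(GROUPS.get(name, [name]))  # unknown names pass through
    return out

print(expand("prism4 ukaisi"))  # the EVAL_CATEGORIES value from the example config
```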
An example training config:
```bash
MODEL=llama
EXPERIMENT_NAME=llama_six_setting_dpo
TRAINING_CATEGORIES="backdoor benign harmful heuristic quirk rare"
EVAL_CATEGORIES="prism4 ukaisi"
DPO_FRACTION=0.10  # 0.0 = no DPO training
SEED=1547
```
Set DPO_FRACTION=0.0 to train SFT only (the pipeline stops after grading step 4z).
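For intuition, here is a rough sketch of what a DPO_FRACTION-style split might look like. The real logic lives in create_dpo_split.py; whether the fraction is taken over behaviors or samples, and the shuffling and rounding, are assumptions here:

```python
# Hypothetical DPO_FRACTION split: hold out a fraction of behaviors for DPO
# pair creation and train SFT on the rest. See create_dpo_split.py for the
# actual logic; this only illustrates the knob's meaning.
import random

def split_for_dpo(behaviors: list[str], dpo_fraction: float, seed: int = 1547):
    rng = random.Random(seed)
    shuffled = behaviors[:]
    rng.shuffle(shuffled)
    n_dpo = int(len(shuffled) * dpo_fraction)  # 0.0 -> empty DPO set, SFT only
    return shuffled[n_dpo:], shuffled[:n_dpo]  # (sft_train, dpo_train)
```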
Optional hyperparameters (omit to use defaults):
```bash
SFT_LEARNING_RATE=1e-4     # SFT learning rate (default: 1e-4)
SFT_R=16                   # SFT LoRA rank (default: 16)
SFT_BATCH_SIZE=4           # SFT batch size (default: 4)
SFT_K_ADAPTERS_PER_STEP=2  # Adapters sampled per training step (default: 2)
DPO_LEARNING_RATE=1e-5     # DPO learning rate (default: 1e-5)
DPO_R=16                   # DPO LoRA rank (default: 16)
DPO_BATCH_SIZE=4           # DPO batch size (default: 4)
DPO_K_ADAPTERS_PER_STEP=2  # Adapters sampled per DPO step (default: 2)
DPO_BETA=0.1               # DPO beta parameter (default: 0.1)
DPO_MAX_SAMPLES=100        # Max DPO samples per behavior (default: 100)
```
An example CMFT config:
```bash
MODEL=llama
IA_PATH=introspection-auditing/Llama-3.3-70B-Instruct_dpo_meta_lora_all_six_dpo
EXPERIMENT_NAME=cmft_llama_six_dpo
RUN_BASELINE=true # Also evaluate without the IA for comparison
SEED=1547
```

The training pipeline (scripts/train_ia.sh) runs the following steps:

| Step | Script | Description |
|---|---|---|
| 0 | generate_training_config.py | Generate prediction config and train/test split |
| 1 | 2_train_from_formatted.py | Train meta-LoRA (SFT) on training behaviors |
| 2 | 3_eval_finetuned_model.py | Evaluate on test + DPO-train behaviors |
| 2b | 3b_eval_full_finetune_ood.py | Evaluate on merged/non-LoRA OOD models (e.g. UKAISI) |
| 3 | 4z_grade_with_full_batch.py | Grade SFT eval results with LLM judge |
| 4 | 5_grade_for_dpo_training.py | Grade DPO-train results for pair creation |
| 5 | 6_create_dpo_pairs.py | Create chosen/rejected DPO pairs |
| 6 | 7_train_dpo.py | Train DPO on the meta-LoRA |
| 7 | 8_eval_dpo_model.py | Evaluate DPO model |
| 7b | 8b_eval_dpo_full_finetune_ood.py | Evaluate DPO on merged/non-LoRA OOD models (e.g. UKAISI) |
| 8 | 9_grade_dpo_results.py | Grade DPO eval results |
| 9 | plot_eval_results.py | Plot verbalization rates |
Steps 4-8 are skipped when DPO_FRACTION=0.0.
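For intuition on step 5, here is a sketch of how chosen/rejected pairs could be assembled from graded generations: a response the judge marked as a correct verbalization is paired, per prompt, against one it marked incorrect. The record fields and pairing strategy are assumptions; 6_create_dpo_pairs.py holds the real logic:

```python
# Illustrative DPO pair assembly from graded records. Field names
# ("prompt", "response", "verbalized") are hypothetical.
from itertools import product

def make_dpo_pairs(graded: list[dict]) -> list[dict]:
    by_prompt: dict[str, dict[bool, list[str]]] = {}
    for rec in graded:
        bucket = by_prompt.setdefault(rec["prompt"], {True: [], False: []})
        bucket[rec["verbalized"]].append(rec["response"])

    pairs = []
    for prompt, bucket in by_prompt.items():
        # every correct response becomes "chosen" against every incorrect one
        for chosen, rejected in product(bucket[True], bucket[False]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```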
The evaluation pipeline (scripts/eval_ia.sh) runs the following steps:

| Step | Script | Description |
|---|---|---|
| 0 | generate_eval_config.py | Generate prediction config for existing IA |
| 1 | 3_eval_finetuned_model.py | Evaluate IA on selected categories |
| 1b | 3b_eval_full_finetune_ood.py | Evaluate on non-LoRA OOD (if applicable) |
| 2 | 4z_grade_with_full_batch.py | Grade results with LLM judge |
| 3 | plot_eval_results.py | Plot verbalization rates |
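Step 2 grades with Claude (hence the Anthropic API requirement). A minimal sketch of that style of judge call using the official anthropic SDK; the rubric prompt, model name, and YES/NO convention are placeholders, not the repo's actual rubric:

```python
# Minimal LLM-judge sketch. The prompt, model name, and YES/NO output
# convention are placeholders; see 4z_grade_with_full_batch.py for the
# actual grading setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_verbalization(behavior: str, response: str) -> bool:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"Trained behavior: {behavior}\n"
                f"Model's self-description: {response}\n"
                "Does the self-description correctly verbalize the behavior? "
                "Answer YES or NO."
            ),
        }],
    )
    return msg.content[0].text.strip().upper().startswith("YES")
```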
The CMFT pipeline (scripts/eval_cmft.sh) runs the following steps:

| Step | Script | Description |
|---|---|---|
| 1 | encrypted_harm/eval_with_ia.py | Evaluate IA on 9 cipher behaviors |
| 2 | encrypted_harm/eval_baseline.py | Evaluate baseline (no IA) for comparison |
| 3 | encrypted_harm/grade_weak.py | Grade with weak matching criteria |
| 4 | encrypted_harm/plot_results.py | Plot IA vs. baseline per cipher |
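"Weak matching" in step 3 suggests lenient credit, e.g. for naming the cipher at all rather than describing it precisely. A sketch of that kind of keyword check; the criteria and keyword lists are assumptions, not grade_weak.py's actual rule:

```python
# Sketch of a lenient "weak match" grader: count an answer as a hit if it
# mentions the cipher name or a rough synonym. Keyword lists are assumptions.
WEAK_KEYWORDS = {
    "acrostic": ["acrostic", "first letter"],
    "ascii": ["ascii", "character code"],
    "autokey": ["autokey", "keystream"],
}

def weak_match(cipher: str, answer: str) -> bool:
    text = answer.lower()
    return any(kw in text for kw in WEAK_KEYWORDS.get(cipher, [cipher.lower()]))

print(weak_match("ascii", "It decodes harmful requests written as ASCII codes."))  # True
```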
Repository layout:

```
introspection-adapters/
├── scripts/ # Top-level bash orchestration
│ ├── train_ia.sh
│ ├── eval_ia.sh
│ └── eval_cmft.sh
├── run_configs/ # User-facing configs (copy and modify)
│ ├── train_llama_example.sh
│ ├── train_qwen_example.sh
│ ├── eval_llama_example.sh
│ ├── eval_qwen_example.sh
│ └── cmft_llama_example.sh
├── experiments/dpo_IA_training/ # Pipeline scripts and configs
│ ├── 2_train_from_formatted.py # SFT training
│ ├── 3_eval_finetuned_model.py # Eval with IA
│ ├── 3b_eval_full_finetune_ood.py # Eval non-LoRA OOD
│ ├── 4z_grade_with_full_batch.py # Grade with LLM judge
│ ├── 5_grade_for_dpo_training.py # Grade for DPO pairs
│ ├── 6_create_dpo_pairs.py # Create DPO pairs
│ ├── 7_train_dpo.py # DPO training
│ ├── 8_eval_dpo_model.py # Eval DPO model
│ ├── 8b_eval_dpo_full_finetune_ood.py
│ ├── 9_grade_dpo_results.py # Grade DPO results
│ ├── create_dpo_split.py # Create train/DPO/test split
│ ├── generate_training_config.py # Generate training configs
│ ├── generate_eval_config.py # Generate eval configs
│ ├── plot_eval_results.py # Plot standard eval results
│ ├── 9_model_list_configs/ # Llama behavior adapter configs
│ ├── 9_model_list_configs_qwen/ # Qwen behavior adapter configs
│ ├── 9z_prediction_configs/ # Prediction configs (generated)
│ ├── 9z_dpo_config/ # DPO pair configs (generated)
│ ├── dpo_behaviors.txt # Behavior descriptions for DPO
│ └── encrypted_harm/ # CMFT evaluation sub-pipeline
│ ├── encrypted_harm_no_prompt.json
│ ├── eval_with_ia.py
│ ├── eval_baseline.py
│ ├── grade_weak.py
│ └── plot_results.py
├── src/
│ ├── finetuning/ # Training code (meta-LoRA, DPO)
│ └── utils/ # Shared utilities
├── hf_sources/ # HuggingFace dataset/model mappings
├── safety-tooling/ # Submodule for API clients
├── final_results/ # Output plots (gitignored)
└── pyproject.toml
```
Supported models:

| Model | Base | Training | Eval | CMFT |
|---|---|---|---|---|
| Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | All 8 categories (incl. sandbagging) | prism4, ukaisi | encrypted harm (9 ciphers) |
| Qwen3 14B | Qwen/Qwen3-14B | All categories except sandbagging | prism4 | Not supported |
Training categories: backdoor, benign, harmful, heuristic, quirk, rare, problematic, sandbagging (Llama only)
OOD evaluation: prism4 (4 variants: synth_docs, synth_docs_kto, transcripts, transcripts_kto), ukaisi (Llama only)
CMFT: encrypted harm (9 ciphers: acrostic, ascii, autokey, endspeak, keyed_polybius, simpleRSA, walnut_substitution50/51/52)