safety-research/introspection-adapters

Introspection Adapters

Training and evaluation pipeline for Introspection Adapters (IAs) — LoRA adapters that enable language models to verbalize the behaviors trained into them.

Link to Paper

Models and datasets are available on HuggingFace: introspection-auditing

Setup

git clone <repo-url>
cd introspection-adapters
bash setup.sh

The setup script initializes the safety-tooling submodule, patches its dependency pins for compatibility, and runs uv sync.

Then create a .env file with your API keys:

HF_TOKEN=<your-huggingface-token>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
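A common shell pattern for exporting `.env` entries into the environment is `set -a` before sourcing the file; this is only a sketch (the repo's tooling may load `.env` differently), and the file path and token values below are placeholders:

```shell
# Write a demo .env (placeholder values) and export its entries.
printf 'HF_TOKEN=hf_example\nANTHROPIC_API_KEY=sk-ant-example\n' > /tmp/demo.env
set -a          # auto-export every variable assigned while sourcing
. /tmp/demo.env
set +a
echo "HF_TOKEN is ${HF_TOKEN:+set}"
```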

Requirements:

  • GPU with sufficient VRAM (A100 80GB recommended for Llama 70B, A40/A6000 for Qwen 14B)
  • Anthropic API access (for grading steps that use Claude)
  • HuggingFace access to the model repos

Quick Start

Each use case has a bash script in scripts/ and example configs in run_configs/.

1. Evaluate an existing IA

Evaluate a pretrained IA from HuggingFace on out-of-distribution behaviors:

bash scripts/eval_ia.sh run_configs/eval_llama_example.sh

2. Train a new IA

Train a new IA from scratch with optional DPO refinement:

bash scripts/train_ia.sh run_configs/train_llama_example.sh

3. Evaluate on encrypted harm (CMFT)

Run the covert malicious fine-tuning evaluation with a separate grading rubric:

bash scripts/eval_cmft.sh run_configs/cmft_llama_example.sh

All results are saved to final_results/<experiment_name>/.

Run Configs

Run configs are simple shell files sourced by the bash scripts. Copy an example and modify it.
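Since the configs are plain shell files that the scripts `source`, creating your own is just a matter of writing variable assignments. A minimal illustration (the config path and values here are examples, not repo files):

```shell
# Write a custom run config, then source it the way the pipeline
# scripts do before launching.
cat > /tmp/my_eval.sh <<'EOF'
MODEL=llama
EXPERIMENT_NAME=my_eval_run
EVAL_CATEGORIES="prism4 ukaisi"
SEED=1547
EOF

. /tmp/my_eval.sh
echo "model=$MODEL categories=$EVAL_CATEGORIES"
```

You would then pass the config path to the relevant script, e.g. `bash scripts/eval_ia.sh /tmp/my_eval.sh`.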

Eval config (run_configs/eval_llama_example.sh)

MODEL=llama                # llama or qwen
IA_PATH=introspection-auditing/Llama-3.3-70B-Instruct_dpo_meta_lora_all_six_dpo  # HF hub path or local path
EXPERIMENT_NAME=eval_llama_six_dpo
EVAL_CATEGORIES="prism4 ukaisi"   # Categories to evaluate on
SEED=1547

EVAL_CATEGORIES controls which behavior categories are evaluated. Options:

  • Individual: backdoor, benign, harmful, heuristic, quirk, rare, problematic, sandbagging, prism4_synth_docs, prism4_synth_docs_kto, prism4_transcripts, prism4_transcripts_kto, ukaisi
  • Groups: ood (prism4 + ukaisi), prism4 (all 4 prism4 variants), training (the 8 standard categories), all

For IAs trained on the standard settings, use ood (default) to evaluate on out-of-distribution behaviors only.
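The group expansion happens inside the pipeline scripts; purely as an illustration of the mapping described above (`expand_category` is a hypothetical helper, not part of the repo, and the "training" list assumes the 8 standard categories are the ones listed under Behavior Categories):

```shell
# Hypothetical sketch of how group names expand to individual categories.
expand_category() {
  case "$1" in
    prism4)   echo "prism4_synth_docs prism4_synth_docs_kto prism4_transcripts prism4_transcripts_kto" ;;
    ood)      echo "$(expand_category prism4) ukaisi" ;;
    training) echo "backdoor benign harmful heuristic quirk rare problematic sandbagging" ;;
    all)      echo "$(expand_category training) $(expand_category ood)" ;;
    *)        echo "$1" ;;   # individual categories pass through unchanged
  esac
}

expand_category ood
```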

Training config (run_configs/train_llama_example.sh)

MODEL=llama
EXPERIMENT_NAME=llama_six_setting_dpo
TRAINING_CATEGORIES="backdoor benign harmful heuristic quirk rare"
EVAL_CATEGORIES="prism4 ukaisi"
DPO_FRACTION=0.10           # 0.0 = no DPO training
SEED=1547

Set DPO_FRACTION=0.0 to train SFT only (pipeline stops after grading step 4z).

Optional hyperparameters (omit to use defaults):

SFT_LEARNING_RATE=1e-4       # SFT learning rate (default: 1e-4)
SFT_R=16                     # SFT LoRA rank (default: 16)
SFT_BATCH_SIZE=4             # SFT batch size (default: 4)
SFT_K_ADAPTERS_PER_STEP=2    # Adapters sampled per training step (default: 2)
DPO_LEARNING_RATE=1e-5       # DPO learning rate (default: 1e-5)
DPO_R=16                     # DPO LoRA rank (default: 16)
DPO_BATCH_SIZE=4             # DPO batch size (default: 4)
DPO_K_ADAPTERS_PER_STEP=2    # Adapters sampled per DPO training step
DPO_BETA=0.1                 # DPO beta parameter (default: 0.1)
DPO_MAX_SAMPLES=100          # Max DPO samples per behavior (default: 100)
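"Omit to use defaults" maps naturally onto the standard shell default-value idiom; a sketch of how a script could fall back to the documented defaults (the real pipeline may apply defaults in Python instead):

```shell
# Fall back to the documented defaults when a variable is unset.
SFT_LEARNING_RATE="${SFT_LEARNING_RATE:-1e-4}"
SFT_R="${SFT_R:-16}"
DPO_BETA="${DPO_BETA:-0.1}"
echo "sft_lr=$SFT_LEARNING_RATE sft_r=$SFT_R dpo_beta=$DPO_BETA"
```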

CMFT config (run_configs/cmft_llama_example.sh)

MODEL=llama
IA_PATH=introspection-auditing/Llama-3.3-70B-Instruct_dpo_meta_lora_all_six_dpo
EXPERIMENT_NAME=cmft_llama_six_dpo
RUN_BASELINE=true            # Also evaluate without the IA for comparison
SEED=1547

Pipeline Steps

Training pipeline (scripts/train_ia.sh)

| Step | Script | Description |
|------|--------|-------------|
| 0 | generate_training_config.py | Generate prediction config and train/test split |
| 1 | 2_train_from_formatted.py | Train meta-LoRA (SFT) on training behaviors |
| 2 | 3_eval_finetuned_model.py | Evaluate on test + DPO-train behaviors |
| 2b | 3b_eval_full_finetune_ood.py | Evaluate on merged/non-LoRA OOD models (e.g. UKAISI) |
| 3 | 4z_grade_with_full_batch.py | Grade SFT eval results with LLM judge |
| 4 | 5_grade_for_dpo_training.py | Grade DPO-train results for pair creation |
| 5 | 6_create_dpo_pairs.py | Create chosen/rejected DPO pairs |
| 6 | 7_train_dpo.py | Train DPO on the meta-LoRA |
| 7 | 8_eval_dpo_model.py | Evaluate DPO model |
| 7b | 8b_eval_dpo_full_finetune_ood.py | Evaluate DPO on merged/non-LoRA OOD models (e.g. UKAISI) |
| 8 | 9_grade_dpo_results.py | Grade DPO eval results |
| 9 | plot_eval_results.py | Plot verbalization rates |

Steps 4-8 are skipped when DPO_FRACTION=0.0.
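One way the orchestration script might gate the DPO steps on `DPO_FRACTION` (`should_run_dpo` is an illustrative helper, not necessarily what `train_ia.sh` actually does; `awk` handles the float comparison that plain `[` cannot):

```shell
# Hypothetical sketch: treat any DPO_FRACTION greater than zero as "run DPO".
should_run_dpo() {
  awk -v f="$1" 'BEGIN { exit (f > 0) ? 0 : 1 }'
}

if should_run_dpo "${DPO_FRACTION:-0.0}"; then
  echo "running DPO steps 4-8"
else
  echo "skipping DPO steps; pipeline stops after grading step 4z"
fi
```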

Eval pipeline (scripts/eval_ia.sh)

| Step | Script | Description |
|------|--------|-------------|
| 0 | generate_eval_config.py | Generate prediction config for existing IA |
| 1 | 3_eval_finetuned_model.py | Evaluate IA on selected categories |
| 1b | 3b_eval_full_finetune_ood.py | Evaluate on non-LoRA OOD (if applicable) |
| 2 | 4z_grade_with_full_batch.py | Grade results with LLM judge |
| 3 | plot_eval_results.py | Plot verbalization rates |

CMFT pipeline (scripts/eval_cmft.sh)

| Step | Script | Description |
|------|--------|-------------|
| 1 | encrypted_harm/eval_with_ia.py | Evaluate IA on 9 cipher behaviors |
| 2 | encrypted_harm/eval_baseline.py | Evaluate baseline (no IA) for comparison |
| 3 | encrypted_harm/grade_weak.py | Grade with weak matching criteria |
| 4 | encrypted_harm/plot_results.py | Plot IA vs baseline per cipher |

Repository Structure

introspection-adapters/
├── scripts/                          # Top-level bash orchestration
│   ├── train_ia.sh
│   ├── eval_ia.sh
│   └── eval_cmft.sh
├── run_configs/                      # User-facing configs (copy and modify)
│   ├── train_llama_example.sh
│   ├── train_qwen_example.sh
│   ├── eval_llama_example.sh
│   ├── eval_qwen_example.sh
│   └── cmft_llama_example.sh
├── experiments/dpo_IA_training/      # Pipeline scripts and configs
│   ├── 2_train_from_formatted.py     # SFT training
│   ├── 3_eval_finetuned_model.py     # Eval with IA
│   ├── 3b_eval_full_finetune_ood.py  # Eval non-LoRA OOD
│   ├── 4z_grade_with_full_batch.py   # Grade with LLM judge
│   ├── 5_grade_for_dpo_training.py   # Grade for DPO pairs
│   ├── 6_create_dpo_pairs.py         # Create DPO pairs
│   ├── 7_train_dpo.py                # DPO training
│   ├── 8_eval_dpo_model.py           # Eval DPO model
│   ├── 8b_eval_dpo_full_finetune_ood.py
│   ├── 9_grade_dpo_results.py        # Grade DPO results
│   ├── create_dpo_split.py           # Create train/DPO/test split
│   ├── generate_training_config.py   # Generate training configs
│   ├── generate_eval_config.py       # Generate eval configs
│   ├── plot_eval_results.py          # Plot standard eval results
│   ├── 9_model_list_configs/         # Llama behavior adapter configs
│   ├── 9_model_list_configs_qwen/    # Qwen behavior adapter configs
│   ├── 9z_prediction_configs/        # Prediction configs (generated)
│   ├── 9z_dpo_config/                # DPO pair configs (generated)
│   ├── dpo_behaviors.txt             # Behavior descriptions for DPO
│   └── encrypted_harm/               # CMFT evaluation sub-pipeline
│       ├── encrypted_harm_no_prompt.json
│       ├── eval_with_ia.py
│       ├── eval_baseline.py
│       ├── grade_weak.py
│       └── plot_results.py
├── src/
│   ├── finetuning/                   # Training code (meta-LoRA, DPO)
│   └── utils/                        # Shared utilities
├── hf_sources/                       # HuggingFace dataset/model mappings
├── safety-tooling/                   # Submodule for API clients
├── final_results/                    # Output plots (gitignored)
└── pyproject.toml

Supported Models

| Model | Base | Training | Eval | CMFT |
|-------|------|----------|------|------|
| Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | All 8 categories + sandbagging | prism4, ukaisi | encrypted harm (9 ciphers) |
| Qwen3 14B | Qwen/Qwen3-14B | All categories except sandbagging | prism4 | Not supported |

Behavior Categories

Training categories: backdoor, benign, harmful, heuristic, quirk, rare, problematic, sandbagging (Llama only)

OOD evaluation: prism4 (4 variants: synth_docs, synth_docs_kto, transcripts, transcripts_kto), ukaisi (Llama only)

CMFT: encrypted harm (9 ciphers: acrostic, ascii, autokey, endspeak, keyed_polybius, simpleRSA, walnut_substitution50/51/52)

About

Training LLMs to Report Their Learned Behaviors
