Introduction to SATA

SATA-MLM

SATA-ELP

Before You Run

1. Create Environment for the Experiments

Create a new environment and activate it:

# create a new environment
conda env create -f environment-llm-safety.yml
# activate the environment
source activate llm-safety 
# alternatively, use the following command to activate the environment
conda activate llm-safety

Or, to install only the required packages without creating a new environment:

# install the packages without their dependencies
pip install --no-deps -r requirements.txt
# alternatively, install the packages together with their dependencies
pip install -r requirements.txt

2. Prepare API Keys or Service Endpoints for the Experiments

Make sure to set up your API keys or service endpoints in utility/model.py before running experiments.

  • OpenAI API key
  • Claude API key
  • API key for cloud services: DeepInfra or SiliconFlow
  • Azure OpenAI endpoint and API key (optional)

We leave the API keys and service endpoints as placeholders in the code. You need to replace them with your own API keys or service endpoint. The placeholders look like PLACEHOLDER_FOR_YOUR_API_KEY or PLACEHOLDER_FOR_YOUR_AZURE_OPENAI_ENDPOINT.
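If you prefer not to hard-code keys, one option is to read them from environment variables inside utility/model.py. The helper below is a hypothetical sketch, not part of this repository; it assumes you export the key (e.g. OPENAI_API_KEY) before running:

```python
import os

# Hypothetical helper, not part of this repository: read an API key from an
# environment variable and fail loudly if it is still unset, instead of
# hard-coding the key in utility/model.py.
def load_api_key(env_var: str, placeholder: str = "PLACEHOLDER_FOR_YOUR_API_KEY") -> str:
    key = os.environ.get(env_var, placeholder)
    if key == placeholder:
        raise RuntimeError(f"Set {env_var} before running experiments.")
    return key
```

For example, after `export OPENAI_API_KEY=...`, `load_api_key("OPENAI_API_KEY")` returns the key.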

Run SATA

Run Jailbreak Inference and Evaluation

You can run a single experiment with the following command (you can change --victim_model_name, --ps, or any other argument; for more details, please refer to main.py and utility/argsparse.py):

# SATA-ELP
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps swq-mask-sw --mode 7
# SATA-MLM
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps wiki-text-infilling-sw --mode 7

Or, to quickly reproduce our experimental results, run a batch of experiments with our prepared scripts:

# SATA-ELP
bash scripts/victim-gpt-3.5-turbo.sh
# SATA-MLM
bash scripts/victim-gpt-3.5-turbo-TextInfilling.sh

You can find many scripts in the scripts directory.

Run Jailbreak Inference Only

  • Change the bash command argument from --mode 7 to --mode 4
# SATA-ELP
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps swq-mask-sw --mode 4
# SATA-MLM
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps wiki-text-infilling-sw --mode 4

Run Jailbreak Evaluation Only

  • For GPT evaluation, change the bash command argument from --mode 7 to --mode 1
  • For sub-string evaluation, change the bash command argument from --mode 7 to --mode 2
  • For both GPT and sub-string evaluation, change the bash command argument from --mode 7 to --mode 3
  • To forcibly re-evaluate with GPT, change the bash command argument to --mode 8
# remember to add `--exp_id` argument
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps swq-mask-sw --mode 3 --exp_id PASTE_YOUR_expID

Run SATA with Jailbreak Defense

Add the --defense argument, e.g. --defense ppl or --defense rpo. For example, you can run a single inference experiment:

# SATA-ELP
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps swq-mask-sw --defense ppl --mode 7
# SATA-MLM
python src/main.py --input_dataset advbench-custom --victim_model_name gpt-3.5-turbo --judge_model_name gpt-4o --ps wiki-text-infilling-sw --defense ppl --mode 7

Or, to quickly reproduce our experimental results with defense, run a batch of experiments with our prepared scripts:

# SATA-ELP
bash scripts/defense-ppl/victim-llama3-70b.sh
# SATA-MLM
bash scripts/defense-ppl/victim-llama3-70b-TextInfilling.sh

Ensemble the Results of SATA (Optional)

We provide Jupyter notebooks for ensembling the results of SATA across different --ps settings. You can find the notebooks in the src directory. This step is optional.
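Conceptually, ensembling counts a prompt as jailbroken if any of the --ps variants succeeded on it. The function below is a minimal sketch under that assumption (the actual notebooks in src may aggregate differently); it assumes each per-strategy result is a dict mapping prompt IDs to success flags:

```python
# Hypothetical sketch: a prompt counts as successful in the ensemble if any
# --ps variant succeeded on it. The repo's notebooks may aggregate differently.
def ensemble_results(results_by_ps: dict[str, dict[str, bool]]) -> dict[str, bool]:
    ensembled: dict[str, bool] = {}
    for ps_results in results_by_ps.values():
        for prompt_id, success in ps_results.items():
            ensembled[prompt_id] = ensembled.get(prompt_id, False) or success
    return ensembled
```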

Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{dong-etal-2025-sata,
    title = "{SATA}: A Paradigm for {LLM} Jailbreak via Simple Assistive Task Linkage",
    author = "Dong, Xiaoning  and
      Hu, Wenbo  and
      Xu, Wei  and
      He, Tianxing",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.100/",
    doi = "10.18653/v1/2025.findings-acl.100",
    pages = "1952--1987",
    ISBN = "979-8-89176-256-5",
}

Acknowledgement

This repository is based on the ArtPrompt repository. We thank the authors for their great work.

About

Repository of the SATA paradigm for LLM jailbreak.
