
Simple Assistive Task Linkage Probe #1319

Merged
jmartin-tech merged 8 commits into NVIDIA:main from erickgalinkin:feature/sata
Aug 6, 2025


@erickgalinkin
Collaborator

Adds a probe for the Simple Assistive Task Linkage (SATA) jailbreak technique from https://aclanthology.org/2025.findings-acl.100.pdf

This implementation makes some minor changes compared to their method: it uses NLTK in lieu of GPT-4o, and it omits the wiki_data that they generate, opting for a marginally simpler task.

There may be some value in directly implementing more of their methodology. Specifically, we could use their pre-computed wiki data and the corresponding keys/prompts to replay exactly what they've developed. Code is available at: https://github.com/xndong/SATA
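For readers unfamiliar with the masking step, here is a minimal sketch of how part-of-speech tags could drive keyword masking once NLTK replaces GPT-4o. It is an illustration under assumptions, not the code in this PR: the names `mask_tagged`, `NOUN_TAGS`, and `max_masks` are hypothetical, the choice to mask nouns is an assumption, and the input tags are hand-written in the `(word, tag)` shape that `nltk.pos_tag` returns so the snippet runs without downloading NLTK data.

```python
from typing import Iterable, List, Tuple

MASK = "[MASK]"
# Penn Treebank noun tags. Masking nouns is an assumption made for this
# sketch; the probe may select keywords differently.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def mask_tagged(
    tagged: Iterable[Tuple[str, str]], max_masks: int = 2
) -> Tuple[str, List[str]]:
    """Replace up to `max_masks` noun tokens with [MASK].

    Returns the masked sentence and the list of removed words, in order.
    """
    out: List[str] = []
    removed: List[str] = []
    for word, tag in tagged:
        if tag in NOUN_TAGS and len(removed) < max_masks:
            out.append(MASK)
            removed.append(word)
        else:
            out.append(word)
    return " ".join(out), removed


# Hand-written tags standing in for the output of a POS tagger.
tagged = [
    ("Describe", "VB"),
    ("the", "DT"),
    ("history", "NN"),
    ("of", "IN"),
    ("lockpicking", "NN"),
]
masked, removed = mask_tagged(tagged)
print(masked)   # Describe the [MASK] of [MASK]
print(removed)  # ['history', 'lockpicking']
```

In a real run the tagged pairs would come from a tagger rather than being typed by hand, and the removed words would be conveyed to the target model through the assistive task.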

Verification

(garak) ➜  garak git:(feature/sata) ✗ garak -m openai -n gpt-4.1-nano -p sata.MLM -g 1
garak LLM vulnerability scanner v0.12.0.pre1 ( https://github.com/NVIDIA/garak ) at 2025-08-04T10:51:43.907416
📜 logging to /Users/egalinkin/.local/share/garak/garak.log
🦜 loading generator: OpenAI: gpt-4.1-nano
⚠️  This run can be sped up 🥳 Generator 'OpenAI gpt-4.1-nano' supports parallelism! Consider using `--parallel_attempts 16` (or more) to greatly accelerate your run. 🐌
📜 reporting to /Users/egalinkin/.local/share/garak/garak_runs/garak.67845a15-b3d7-4f49-9cd3-f61464cf32a8.report.jsonl
🕵️  queue of probes: sata.MLM
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/egalinkin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
sata.MLM                                                                 mitigation.MitigationBypass: FAIL  ok on    4/  12   (failure rate:  66.67%)
📜 report closed :) /Users/egalinkin/.local/share/garak/garak_runs/garak.67845a15-b3d7-4f49-9cd3-f61464cf32a8.report.jsonl
📜 report html summary being written to /Users/egalinkin/.local/share/garak/garak_runs/garak.67845a15-b3d7-4f49-9cd3-f61464cf32a8.report.html
✔️  garak run complete in 24.48s

A pretty strong caveat here is that the mitigation detector is wildly insufficient: lots of responses almost satisfy the request but ultimately fail to.
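One way to read this caveat is sketched below, assuming a detector that flags any response lacking a refusal phrase. This is not garak's `mitigation.MitigationBypass` implementation; `MITIGATION_MARKERS`, `is_bypass`, and `failure_rate` are hypothetical names, and the sample responses are invented for illustration.

```python
from typing import List

# Hypothetical refusal markers; a real detector's list would differ.
MITIGATION_MARKERS = ["i'm sorry", "i cannot", "i can't", "as an ai"]


def is_bypass(response: str) -> bool:
    """Flag a response as a bypass when no refusal marker appears in it."""
    text = response.lower()
    return not any(marker in text for marker in MITIGATION_MARKERS)


def failure_rate(responses: List[str]) -> float:
    """Percentage of responses flagged as bypasses."""
    hits = sum(is_bypass(r) for r in responses)
    return 100.0 * hits / len(responses)


responses = [
    # Clear refusal: not flagged.
    "I'm sorry, but I can't help with that.",
    # Solves the assistive task but never answers the underlying request.
    # It contains no refusal phrase, so it is still flagged.
    "The masked word is probably 'history'. Here is a general overview.",
]
print([is_bypass(r) for r in responses])  # [False, True]
print(failure_rate(responses))            # 50.0
```

Under this reading, a partial or off-target answer counts the same as a full compliance, so the reported rate (8 of 12 flagged, i.e. 66.67%) may overstate how often the request was actually satisfied.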

@erickgalinkin erickgalinkin requested a review from leondz August 4, 2025 15:08
Collaborator

@jmartin-tech jmartin-tech left a comment


Minor asks and enhancement ideas.

…d to using `garak.resources.api.nltk` in lieu of direct `nltk`.
@jmartin-tech jmartin-tech merged commit a222de2 into NVIDIA:main Aug 6, 2025
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 6, 2025
@jmartin-tech jmartin-tech added the `probes` (Content & activity of LLM probes) and `new plugin` (Describes an entirely new probe, detector, generator or harness) labels Aug 8, 2025
