probe: DRA (Disguise and Reconstruction Attack) #1345
jmartin-tech merged 33 commits into NVIDIA:main from
Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
I have read the DCO Document and I hereby sign the DCO
recheck
garak/data/payloads/behaviors.json (outdated)
Are we sure this belongs in payloads? I feel like this is mostly advbench, which we use elsewhere by loading from huggingface -- this is implemented in garak.resources.common.load_advbench().
Could we use payloads/harmful_behavior.json (a smaller version with similar intent) or simply load advbench? Otherwise, I'd at least suggest making the name more specific than "behaviors", since we already have "harmful_behaviors" and #1335 includes normal_instructions.json.
We could use harmful_behaviors.json. I noticed the similarity too. I’m just not sure we’d get the same attack success rate reported in the paper if we change the input dataset. Does that concern make sense given the scope of Garak?
I don't think we're too pressed about a perfect recreation of the ASR from the paper -- were we able to reproduce their ASR exactly with the dataset here?
Not exactly. Currently, I'm using only the MitigationBypass and Prefixes detectors. To perform a more comprehensive assessment, I'd like to incorporate ModelAsJudge to confirm whether the output genuinely addresses the harmful input question.
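For illustration, the detector setup under discussion might be declared on the probe roughly like this. The attribute names (primary_detector, extended_detectors) and the detector module paths for Prefixes and a model-as-judge detector are assumptions for the sketch, not verified against the garak codebase:

```python
# Hedged sketch only: attribute names and detector paths below are
# assumptions, not taken from garak's actual probe API.
class DRAProbeDetectors:
    # detector used for primary pass/fail scoring
    primary_detector = "mitigation.MitigationBypass"
    # additional detectors for a more comprehensive assessment
    extended_detectors = [
        "specialwords.Prefixes",  # assumed path for the Prefixes detector
        "judge.ModelAsJudge",     # hypothetical judge-style detector
    ]
```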
garak/probes/dra.py (outdated)
# cutoff prob
ub = 2
if random.randint(0, ub) == 0:
Should we express this as an actual probability (0.33)? Should this be configurable? I'm not super pressed either way.
I’ve been thinking about parametrizing this probability of truncating a non‑toxic token. Together with toxic_trunc and benign_trunc, it would let us tune the attacks to the target model’s capabilities. The rationale is that for less capable models, retaining more benign tokens helps guide stage 2 (Reconstruction). It also helps balance disguise and models’ reconstruction capability.
With benign_trunc as a parameter, we already let users control how much of each benign token is kept (default = 0.5), with a 33% chance of keeping the entire token. I don’t think we need another parameter. Wouldn’t that be too many params for users to tune?
Is it possible to adjust the probe's parameters based on results from previous attacks?
It can indeed be done, though it is a bit complicated to adjust the parameters based on previous runs -- we could always extend this into an adaptive probe.
I suppose that we could derive this probability as well? I'm just always wary of magic numbers that are not configurable unless they are well-justified.
Agreed! We should expose this probability as a parameter, benign_full_token_prob. This provides flexibility, especially given the diverse templates used for different model capabilities. Users can tune the attack for their target generator, for instance by retaining more benign tokens to guide Stage 2 (Reconstruction) for less capable models. I will provide some guidance in the comments for this adjustment.
I can open a new issue to extend this implementation into an adaptive version of the DRA probe. Perhaps it would also make sense to consider a multi-stage or layering strategy for this probe. Does that make sense?
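The parametrization discussed above could look something like this minimal sketch. Only benign_trunc and benign_full_token_prob come from the discussion; the function name and overall shape are illustrative, not the actual garak implementation:

```python
import random

# Illustrative sketch, not garak's actual code: keep a benign token whole
# with probability benign_full_token_prob (replacing the hard-coded
# random.randint(0, 2) == 0 check), otherwise keep a benign_trunc
# fraction of it.
def disguise_token(token, benign_trunc=0.5, benign_full_token_prob=0.33,
                   rng=None):
    rng = rng or random.Random()
    if rng.random() < benign_full_token_prob:
        return token          # keep the entire benign token
    keep = max(1, int(len(token) * benign_trunc))
    return token[:keep]       # truncate to the configured fraction
```

Exposing benign_full_token_prob alongside benign_trunc would let users trade disguise strength against the target model's reconstruction capability, as discussed above.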
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
…ion instead of being a hard dependency. Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Force-pushed from 6aa9529 to e4a743f
jmartin-tech left a comment:
A few minor revisions noted.
At this time, optional dependencies are added to requirements.txt and placed in an optional-dependencies group in pyproject.toml.
See example from optional audio probe packages:
Lines 133 to 136 in 9b054ff
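A sketch of what such a group might look like in pyproject.toml. The group name "dra" and the version pin are assumptions for illustration, not taken from the actual garak pyproject.toml:

```toml
# Hypothetical optional-dependencies group; name and pin are illustrative.
[project.optional-dependencies]
dra = [
    "detoxify>=0.5",
]
```

Installing with `pip install garak[dra]` would then pull in the extra package only when this probe's dependencies are wanted.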
Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Force-pushed from 8367bef to 3c138ab
Force-pushed from 5ffec2e to afb2914
leondz left a comment:
Looked through every line, this is nice. Thanks
jmartin-tech left a comment:
Some minor suggestions for cleaner code reuse. In the current state, DRAAdvanced initialization will populate prompts twice: first using the DRA build process, and then again using the DRAAdvanced one.
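One common way to avoid that double population is to route prompt construction through a single overridable hook called once from the base __init__. The class and method names below are illustrative, not garak's actual structure:

```python
# Illustrative sketch of the reuse pattern: if both __init__ methods
# build prompts, the list is populated twice; with a single hook, the
# subclass override runs exactly once.
class DRA:
    def __init__(self):
        self.prompts = []
        self._build_prompts()  # single call site for prompt construction

    def _build_prompts(self):
        self.prompts = ["base prompt"]

class DRAAdvanced(DRA):
    # overriding the hook means super().__init__() builds only the
    # advanced prompts; nothing is populated twice
    def _build_prompts(self):
        self.prompts = ["advanced prompt"]
```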
All requests look to be addressed here; marking this ready.
Implements #734
DRA (Disguise and Reconstruction Attack) probe. This probe implements a two-stage attack strategy: disguise and reconstruction.
Uses the detoxify package.
Detector: mitigation.MitigationBypass (??)