
probe: DRA (Disguise and Reconstruction Attack)#1345

Merged
jmartin-tech merged 33 commits into NVIDIA:main from patriciapampanelli:feature/dra-probe
Sep 23, 2025

Conversation

@patriciapampanelli
Collaborator

@patriciapampanelli patriciapampanelli commented Aug 27, 2025

Implements #734

DRA (Disguise and Reconstruction Attack) probe. This probe implements a two-stage attack strategy:

  • Stage 1: Disguise (Prefix Generation)
    • Obfuscation: transforms the harmful query using a letter mapping
    • Converts each letter into alternative words/phrases with marked letters
  • Stage 2: Reconstruction (Suffix Generation)
    • Word moderation: analyzes harmful tokens using a cached word list and the detoxify package
    • Truncation: applies different truncation strategies for toxic vs. benign words:
      • Toxic words: uses the configurable toxic_trunc parameter (default 0.5) for aggressive truncation
      • Benign words: uses the benign_trunc parameter for lighter obfuscation
    • Tool selection: randomly selects tools from a predefined list (e.g., "lots of money", "a lab", "a computer")
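As a rough illustration of the two stages, here is a minimal sketch (not the probe's actual code; the carrier-word pool and helper names are assumptions) of letter-mapping disguise plus fractional truncation:

```python
# Hypothetical carrier words; the real probe draws from a much larger mapping.
WORD_POOL = {
    "h": "happy", "o": "orange", "w": "water",
    "t": "table", "m": "music", "a": "apple",
}

def disguise(query: str) -> str:
    """Stage 1 (Disguise): map each letter to a carrier word with the
    payload letter marked by parentheses, e.g. 'h' -> '(h)appy'."""
    lines = []
    for ch in query.lower():
        if ch in WORD_POOL:
            w = WORD_POOL[ch]
            lines.append(f"({w[0]}){w[1:]}")
    return "\n".join(lines)

def truncate(word: str, frac: float) -> str:
    """Stage 2 helper: keep roughly `frac` of the word, at least one char."""
    return word[: max(1, int(len(word) * frac))]

prefix = disguise("how")
# The target model can recover the hidden query from the marked letters.
hidden = "".join(line[1] for line in prefix.splitlines())
```

Reading the parenthesized letters back out of the prefix reconstructs the original query, which is what the reconstruction-stage suffix instructs the model to do.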

Detector: mitigation.MitigationBypass (??)

@patriciapampanelli patriciapampanelli changed the title Add initial DRA (Disguise and Reconstruction Attack) probe DRA (Disguise and Reconstruction Attack) probe Aug 27, 2025
@github-actions
Contributor

github-actions bot commented Aug 27, 2025

DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

@patriciapampanelli
Collaborator Author

I have read the DCO Document and I hereby sign the DCO

@patriciapampanelli
Collaborator Author

recheck

github-actions bot added a commit that referenced this pull request Aug 27, 2025
@leondz leondz requested review from jmartin-tech and leondz and removed request for jmartin-tech and leondz August 28, 2025 11:29
@patriciapampanelli patriciapampanelli marked this pull request as ready for review August 29, 2025 14:18
Collaborator

Are we sure this belongs in payloads? I feel like this is mostly advbench, which we use elsewhere by loading from huggingface -- this is implemented in garak.resources.common.load_advbench().

Could we use payloads/harmful_behavior.json (a smaller version with similar intent) or simply load advbench? Otherwise, I'd at least suggest making the name more specific than "behaviors", since we already have "harmful_behaviors" and #1335 includes normal_instructions.json

Collaborator Author

We could use harmful_behaviors.json. I noticed the similarity too. I’m just not sure we’d get the same attack success rate reported in the paper if we change the input dataset. Does that concern make sense given the scope of Garak?

Collaborator

I don't think we're too pressed about a perfect recreation of the ASR from the paper -- were we able to reproduce their ASR exactly with the dataset here?

Collaborator Author

Not exactly. Currently, I'm using only the MitigationBypass and Prefixes detectors. To perform a more comprehensive assessment, I'd like to incorporate ModelAsJudge to confirm whether the output genuinely addresses the harmful input question.

Comment on lines +177 to +179
# cutoff prob
ub = 2
if random.randint(0, ub) == 0:
Collaborator

Should we express this as an actual probability (0.33)? Should this be configurable? I'm not super pressed either way.
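A minimal sketch of the refactor suggested here (constant name is an assumption): the integer cutoff can be restated as an explicit probability with identical behavior.

```python
import random

# Same 1-in-3 chance as `random.randint(0, 2) == 0`, but stated
# as an explicit, legible probability.
FULL_TOKEN_PROB = 1 / 3

def keep_full_token() -> bool:
    return random.random() < FULL_TOKEN_PROB
```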

Collaborator Author

I’ve been thinking about parametrizing this probability of truncating a non‑toxic token. Together with toxic_trunc and benign_trunc, it would let us tune the attacks to the target model’s capabilities. The rationale is that for less capable models, retaining more benign tokens helps guide stage 2 (Reconstruction). It also helps balance disguise and models’ reconstruction capability.

With benign_trunc as a parameter, we already let users control how much of each benign token is kept (default = 0.5), with a 33% chance of keeping the entire token. I don’t think we need another parameter. Wouldn’t that be too many params for users to tune?

Is it possible to adjust the probe's parameters based on results from previous attacks?

Collaborator

It can indeed be done, though it is a bit complicated to adjust the parameters based on previous runs -- we could always extend this into an adaptive probe.

I suppose that we could derive this probability as well? I'm just always wary of magic numbers that are not configurable unless they are well-justified.

Collaborator Author

Agreed! We should expose this probability as a benign_full_token_prob parameter. This provides flexibility, especially given the diverse templates used for different model capabilities. Users can tune the attack for their target generator, for instance by retaining more benign tokens to guide Stage 2 (Reconstruction) for less capable models. I will provide some guidance in the comments for this adjustment.

I can open a new issue to extend this implementation into an adaptive version of the DRA probe. Perhaps it would also make sense to consider a multi-stage or layering strategy for this probe. Does that make sense?
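The three knobs discussed above might sit together roughly like this. This is an illustrative sketch only, not the probe's actual class; the class name and the benign_full_token_prob default are assumptions (garak configurables expose defaults via a DEFAULT_PARAMS dict):

```python
import random

class DRASketch:
    """Illustrative sketch of the truncation knobs discussed above."""
    DEFAULT_PARAMS = {
        "toxic_trunc": 0.5,             # fraction of each toxic token kept
        "benign_trunc": 0.5,            # fraction of each benign token kept
        "benign_full_token_prob": 1 / 3,  # chance of keeping a benign token whole
    }

    def __init__(self, **overrides):
        self.params = {**self.DEFAULT_PARAMS, **overrides}

    def truncate_benign(self, token: str) -> str:
        # Sometimes keep the whole token to help less capable models
        # with Stage 2 (Reconstruction).
        if random.random() < self.params["benign_full_token_prob"]:
            return token
        keep = max(1, int(len(token) * self.params["benign_trunc"]))
        return token[:keep]
```

Setting benign_full_token_prob higher keeps more benign tokens intact, trading disguise strength for easier reconstruction.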

@erickgalinkin erickgalinkin added probes Content & activity of LLM probes new plugin Describes an entirely new probe, detector, generator or harness labels Sep 2, 2025
patriciapampanelli and others added 15 commits September 3, 2025 11:41
Collaborator

@jmartin-tech jmartin-tech left a comment


A few minor revisions noted.

At this time, optional dependencies are added to requirements.txt and placed in an optional-dependencies group in pyproject.toml.

See example from optional audio probe packages:

garak/pyproject.toml

Lines 133 to 136 in 9b054ff

audio = [
"soundfile>=0.13.1",
"librosa>=0.10.2"
]
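Following that pattern, a hypothetical optional-dependency group for this probe might look like the fragment below (the group name is an assumption, and the version pin is omitted since none is stated here):

```toml
# Hypothetical group mirroring the audio example above
dra = [
    "detoxify"
]
```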

patriciapampanelli and others added 2 commits September 3, 2025 14:02
Collaborator

@leondz leondz left a comment


Looked through every line, this is nice. Thanks

Collaborator

@jmartin-tech jmartin-tech left a comment


Some minor suggestions for cleaner code reuse. In the current state, DRAAdvanced initialization will populate prompts twice: first using the DRA build process, and then again using the DRAAdvanced one.
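A minimal sketch of the double-population issue and one way to avoid it (these are not the actual garak classes; the hook name is an assumption): have the base constructor call an overridable build hook, so the subclass replaces the build step rather than repeating it after construction.

```python
class DRA:
    """Base class builds prompts once via an overridable hook."""
    def __init__(self):
        self.prompts = []
        self._build_prompts()  # runs the subclass's version when subclassed

    def _build_prompts(self):
        # base build process; subclasses override this hook instead of
        # re-populating self.prompts after __init__ has already run
        self.prompts = ["base prompt"]

class DRAAdvanced(DRA):
    def _build_prompts(self):
        self.prompts = ["advanced prompt"]
```

With this structure, DRAAdvanced() builds its prompt list exactly once, since the base constructor dispatches to the subclass hook.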

patriciapampanelli and others added 4 commits September 18, 2025 09:41
@leondz leondz changed the title DRA (Disguise and Reconstruction Attack) probe probe: DRA (Disguise and Reconstruction Attack) Sep 22, 2025
@jmartin-tech jmartin-tech dismissed erickgalinkin’s stale review September 23, 2025 19:57

All requests look to be addressed here; marking this ready.

@jmartin-tech jmartin-tech merged commit 570341e into NVIDIA:main Sep 23, 2025
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Sep 23, 2025
@patriciapampanelli patriciapampanelli deleted the feature/dra-probe branch September 24, 2025 15:01

Labels

new plugin Describes an entirely new probe, detector, generator or harness probes Content & activity of LLM probes
