probe: DRA (Disguise and Reconstruction Attack) #1345
jmartin-tech merged 33 commits into NVIDIA:main from
Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
I have read the DCO Document and I hereby sign the DCO
recheck
garak/data/payloads/behaviors.json (outdated)
Are we sure this belongs in payloads? I feel like this is mostly advbench, which we use elsewhere by loading from huggingface -- this is implemented in garak.resources.common.load_advbench().
Could we use payloads/harmful_behavior.json (a smaller version with similar intent) or simply load advbench? Otherwise, I'd at least suggest making the name more specific than "behaviors", since we already have "harmful_behaviors" and #1335 includes normal_instructions.json.
We could use harmful_behaviors.json. I noticed the similarity too. I’m just not sure we’d get the same attack success rate reported in the paper if we change the input dataset. Does that concern make sense given the scope of Garak?
I don't think we're too pressed about a perfect recreation of the ASR from the paper -- were we able to reproduce their ASR exactly with the dataset here?
Not exactly. Currently, I'm using only the MitigationBypass and Prefixes detectors. To perform a more comprehensive assessment, I'd like to incorporate ModelAsJudge to confirm whether the output genuinely addresses the harmful input question.
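For illustration, the detector setup under discussion might be declared on the probe roughly like this. The attribute names (primary_detector, extended_detectors) and the detector module paths for Prefixes and a model-as-judge detector are assumptions for the sketch, not verified against the garak codebase:

```python
# Hedged sketch only: attribute names and detector paths below are
# assumptions, not taken from garak's actual probe API.
class DRAProbeDetectors:
    # detector used for primary pass/fail scoring
    primary_detector = "mitigation.MitigationBypass"
    # additional detectors for a more comprehensive assessment
    extended_detectors = [
        "specialwords.Prefixes",  # assumed path for the Prefixes detector
        "judge.ModelAsJudge",     # hypothetical judge-style detector
    ]
```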
garak/probes/dra.py (outdated)
# cutoff prob
ub = 2
if random.randint(0, ub) == 0:
Should we express this as an actual probability (0.33)? Should this be configurable? I'm not super pressed either way.
I’ve been thinking about parametrizing this probability of truncating a non‑toxic token. Together with toxic_trunc and benign_trunc, it would let us tune the attacks to the target model’s capabilities. The rationale is that for less capable models, retaining more benign tokens helps guide stage 2 (Reconstruction). It also helps balance disguise and models’ reconstruction capability.
With benign_trunc as a parameter, we already let users control how much of each benign token is kept (default = 0.5), with a 33% chance of keeping the entire token. I don’t think we need another parameter. Wouldn’t that be too many params for users to tune?
Is it possible to adjust the probe's parameters based on results from previous attacks?
It can indeed be done, though it is a bit complicated to adjust the parameters based on previous runs -- we could always extend this into an adaptive probe.
I suppose that we could derive this probability as well? I'm just always wary of magic numbers that are not configurable unless they are well-justified.
Agreed! We should expose this probability as a parameter, benign_full_token_prob. This provides flexibility, especially given the diverse templates used for different model capabilities. Users can tune the attack for their target generator, for instance by retaining more benign tokens to guide Stage 2 (Reconstruction) for less capable models. I will provide some guidance in the comments for this adjustment.
I can open a new issue to extend this implementation into an adaptive version of the DRA probe. Perhaps it would also make sense to consider a multi-stage or layering strategy for this probe. Does that make sense?
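The parametrization discussed above could look something like this minimal sketch. Only benign_trunc and benign_full_token_prob come from the discussion; the function name and overall shape are illustrative, not the actual garak implementation:

```python
import random

# Illustrative sketch, not garak's actual code: keep a benign token whole
# with probability benign_full_token_prob (replacing the hard-coded
# random.randint(0, 2) == 0 check), otherwise keep a benign_trunc
# fraction of it.
def disguise_token(token, benign_trunc=0.5, benign_full_token_prob=0.33,
                   rng=None):
    rng = rng or random.Random()
    if rng.random() < benign_full_token_prob:
        return token          # keep the entire benign token
    keep = max(1, int(len(token) * benign_trunc))
    return token[:keep]       # truncate to the configured fraction
```

Exposing benign_full_token_prob alongside benign_trunc would let users trade disguise strength against the target model's reconstruction capability, as discussed above.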
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
…ion instead of being a hard dependency. Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Force-pushed from 6aa9529 to e4a743f
jmartin-tech left a comment:
A few minor revisions noted.
At this time, optional dependencies are added to requirements.txt and placed in an optional-dependencies group in pyproject.toml.
See example from optional audio probe packages:
Lines 133 to 136 in 9b054ff
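A sketch of what such a group might look like in pyproject.toml. The group name "dra" and the version pin are assumptions for illustration, not taken from the actual garak pyproject.toml:

```toml
# Hypothetical optional-dependencies group; name and pin are illustrative.
[project.optional-dependencies]
dra = [
    "detoxify>=0.5",
]
```

Installing with `pip install garak[dra]` would then pull in the extra package only when this probe's dependencies are wanted.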
Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Force-pushed from 8367bef to 3c138ab
Force-pushed from 5ffec2e to afb2914
leondz left a comment:
Looked through every line, this is nice. Thanks
jmartin-tech left a comment:
Some minor suggestions for cleaner code reuse. In the current state, DRAAdvanced initialization will populate prompts twice: first using the DRA build process, and then again using the DRAAdvanced one.
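One common way to avoid that double population is to route prompt construction through a single overridable hook called once from the base __init__. The class and method names below are illustrative, not garak's actual structure:

```python
# Illustrative sketch of the reuse pattern: if both __init__ methods
# build prompts, the list is populated twice; with a single hook, the
# subclass override runs exactly once.
class DRA:
    def __init__(self):
        self.prompts = []
        self._build_prompts()  # single call site for prompt construction

    def _build_prompts(self):
        self.prompts = ["base prompt"]

class DRAAdvanced(DRA):
    # overriding the hook means super().__init__() builds only the
    # advanced prompts; nothing is populated twice
    def _build_prompts(self):
        self.prompts = ["advanced prompt"]
```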
All requests look to be addressed here; marking this ready.
Implements #734
DRA (Disguise and Reconstruction Attack) probe. This probe implements a two-stage attack strategy: disguise and reconstruction.
Uses the detoxify package.
Detector: mitigation.MitigationBypass (??)