Add detector evaluation doc and benchmark results #1597
jmartin-tech merged 24 commits into NVIDIA:main
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com>
…aluation methodology Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
… attributes from base detector Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
erickgalinkin
left a comment
Maybe I'm picking nits, but some of these descriptions feel a bit off. I made suggestions where I have them.
garak/detectors/shields.py
Outdated
hit_desc = "Response does not indicate security system safe - potential false negative"
pass_desc = "Response indicates security system deems content safe - shields down"
Why are we flagging a potential false negative here when we don't do it anywhere else?
I'm not sure how to reword these descriptions, but I'm not excited about them.
shields is a tricky one to explain, agree. we can iterate with time. curious to see how the _descs are consumed - that might guide their structure
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
…link Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Sample of detector info when hit_precision-related values exist:
leondz
left a comment
Looks good.
Have reduced the scope of the wording to be conservative and precise, and in many places avoided discussing the implications of the hit/pass. Recommend just merging in every descr edit that looks good and then we're away.
Later we might like to define a standard for the hit/pass_descs but this doesn't have to be in for this PR/release.
What Are Hits and Passes?
--------------------------

Detector evaluation uses labeled ground truth datasets where model responses are classified into one of two categories based on whether the detector's target failure mode is present:
state /how/ the datasets are labelled (by LLM? by person?)
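To make the hit/pass framing above concrete, here is a minimal sketch of how a precision figure over hits could be computed from a labeled dataset. It assumes that hit_precision means the usual precision over the hit class (of the responses the detector flags, the fraction whose ground-truth label is also a hit); the PR text shown here does not confirm that definition. `LabeledResponse`, `hit_precision`, `toy_detector`, the 0.5 threshold, and the sample data are all hypothetical and are not taken from garak.

```python
from dataclasses import dataclass


@dataclass
class LabeledResponse:
    text: str
    is_hit: bool  # ground-truth label: True if the target failure mode is present


def hit_precision(samples, detector, threshold=0.5):
    """Fraction of detector-flagged responses whose ground-truth label is a hit.

    Assumed definition (standard precision); returns None when nothing is flagged.
    """
    flagged = [s for s in samples if detector(s.text) >= threshold]
    if not flagged:
        return None  # precision is undefined without any flagged responses
    return sum(s.is_hit for s in flagged) / len(flagged)


def toy_detector(text):
    """Hypothetical scorer: 1.0 if the escaped form backslash-x1b-[ appears."""
    return 1.0 if "\\x1b[" in text else 0.0


# Toy data only; the labels are invented for illustration
samples = [
    LabeledResponse("print('\\x1b[31mred')", True),
    LabeledResponse("the escape \\x1b[ is explained here", False),
    LabeledResponse("hello world", False),
    LabeledResponse("plain answer", True),
]

result = hit_precision(samples, toy_detector)
print(result)
```

How the ground-truth labels are produced (the question raised in this thread) matters directly here, since every `is_hit` value feeds the numerator.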
hit_desc = "Response contains escaped ANSI control sequences - terminal injection risk"
pass_desc = "Response does not contain escaped ANSI control sequences"
happy to treat hit_desc and pass_desc as markdown, would that do it?
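For readers following the thread, a rough sketch of how the two description attributes might sit on a detector and be picked up by a consumer. This is not garak's actual class hierarchy or reporting code: `SketchDetector`, `ESCAPED_FORMS`, `describe`, and the 0.5 threshold are hypothetical, and only the two description strings come from the diff above.

```python
class SketchDetector:
    """Hypothetical stand-in for a detector class; not garak's real base class."""

    hit_desc = "Response contains escaped ANSI control sequences - terminal injection risk"
    pass_desc = "Response does not contain escaped ANSI control sequences"

    # Illustrative subset of escaped forms; an assumption, not the real detector's list
    ESCAPED_FORMS = ("\\x1b[", "\\033[", "\\u001b[")

    def detect(self, response: str) -> float:
        """Return 1.0 (hit) if any escaped form appears in the response, else 0.0."""
        return 1.0 if any(form in response for form in self.ESCAPED_FORMS) else 0.0


def describe(detector, score, threshold=0.5):
    """Pick the description matching the outcome, as a report consumer might."""
    return detector.hit_desc if score >= threshold else detector.pass_desc


d = SketchDetector()
hit_line = describe(d, d.detect("try printf '\\x1b[2J'"))
pass_line = describe(d, d.detect("no control codes here"))
print(hit_line)
print(pass_line)
```

If the descriptions were treated as markdown, a consumer like `describe` would pass the chosen string to a renderer instead of printing it directly.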
jmartin-tech
left a comment
Functional testing looks good.
efe7520 to 68935b1
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Co-authored-by: Leon Derczynski <leonderczynski@gmail.com> Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
68935b1 to 4ee789e
Documentation and results for the Garak detector evaluation pipeline:
Benchmark Results
Detector Metadata
Documentation