ModelBench journal documentation

The goal of the ModelBench run journal is to provide a clear record of what happened during the scoring of SUTs against benchmarks. It is a JSONL file consisting of one line per event during the run.

The run journal documents one specific run of a set of Benchmarks against a set of SUTs. It is mainly focused on the individual Item: one Prompt from one Test, run against one SUT. Each Item flows through a series of pipeline stages, each of which takes some action on the Item.
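Since the journal is plain JSONL, it can be inspected with a few lines of Python. A minimal sketch (the file name journal.jsonl is a placeholder; real journal paths depend on your run configuration):

```python
import json

# Quick scan of a run journal: one JSON object per line, one line per event.
# "journal.jsonl" is a placeholder; substitute the path from your own run.
with open("journal.jsonl") as journal:
    for line in journal:
        entry = json.loads(line)
        print(entry["timestamp"], entry["message"])
```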

Currently, those stages are:

  • TestRunItemSource - gets each Prompt from the Test
  • TestRunSutAssigner - for each Prompt, creates an Item for each SUT
  • TestRunSutWorker - for each Item, sends the Prompt to the SUT and gets a response
  • TestRunAnnotationWorker - for each Item, sends the SUT response to all of the Test's Annotators
  • TestRunResultsCollector - collects the Item results for further processing

After that, the collected results are used to score the Tests, Hazards, and Benchmarks, and the final results are output.

Message and field definitions

Common fields

Found in all entries.

  • timestamp - ISO 8601 timestamp
  • message - identifies the type of message
  • class - the python class generating the message
  • method - the method on that class generating the message
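For illustration, here is a hypothetical journal line carrying only the common fields (all values are invented, including the class and method names):

```python
import json

# A made-up journal line showing the four common fields.
line = '{"timestamp": "2025-01-01T12:00:00.000000+00:00", "message": "starting run", "class": "TestRunner", "method": "run"}'
entry = json.loads(line)
assert {"timestamp", "message", "class", "method"} <= entry.keys()
```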

Item fields

Found in all item entries.

  • test - the uid of the Test for the Item
  • prompt_id - the uid of the Prompt for the Item
  • sut - the uid of the SUT that will receive the Prompt
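Because every item entry carries these three fields, the full history of a single Item can be recovered by filtering the journal on them. A sketch (the path and uids are placeholders; substitute values from your own run):

```python
import json

def item_history(path, test, prompt_id, sut):
    """Collect all journal entries for one Item, in order of occurrence."""
    entries = []
    with open(path) as journal:
        for line in journal:
            entry = json.loads(line)
            if (entry.get("test") == test
                    and entry.get("prompt_id") == prompt_id
                    and entry.get("sut") == sut):
                entries.append(entry)
    return entries

# Placeholder uids for illustration only.
for entry in item_history("journal.jsonl", "my_test", "prompt_0001", "my_sut"):
    print(entry["timestamp"], entry["message"])
```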

Regular run: Message meaning

In order of occurrence:

  • starting journal - marks the beginning of the journal
  • starting run - gives basic information on the run
  • hazard info - gives information on each Hazard in the run
  • test info - gives information on each Test in the run
  • running pipeline - marks the start of the item pipeline
  • using test items - for each Test, the counts of Items available and actually used
  • queuing item - the beginning of an Item's flow in the pipeline
  • fetched sut response - a SUT's live raw response to a single prompt
  • using cached sut response - a SUT's cached raw response to a single prompt
  • translated sut response - the result of translating a raw response into a common format
  • fetched annotator response - an Annotator's live raw response to a single prompt/response pair
  • using cached annotator response - an Annotator's cached raw response to a single prompt/response pair
  • translated annotation - the result of translating a raw response into a common format
  • measured item quality - after all annotations are completed, they are combined to give measurements
  • finished pipeline - all Items have been run through all stages of the pipeline
  • test scored - the combined Item results are used to score the Tests
  • hazard scored - the Test scores are used to score the Hazards
  • benchmark scored - the Hazard scores are used to score the Benchmarks
  • finished run - the run is complete
  • cache info - basic information on the use of the caches for SUTs and Annotators
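One practical use of this ordering is sanity-checking a run: every Item that produced a "queuing item" message should eventually produce a "measured item quality" message. A rough sketch (the path is a placeholder):

```python
import json
from collections import Counter

# Count events by message type across the whole journal.
message_counts = Counter()
with open("journal.jsonl") as journal:
    for line in journal:
        message_counts[json.loads(line)["message"]] += 1

queued = message_counts["queuing item"]
measured = message_counts["measured item quality"]
print(f"queued={queued} measured={measured}")
if queued != measured:
    print("warning: some Items did not complete the pipeline")
```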

Calibration run: Message meaning

A calibration run is nearly identical to a regular run, except that it is run on a benchmark with no standards. Because there are no standards, the run can only compute raw scores, not grades. The journal messages are likewise nearly identical, differing only in the following:

  • starting calibration run - instead of "starting run"
  • hazard calibrated - instead of "hazard scored"
  • benchmark calibrated - instead of "benchmark scored"
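When post-processing journals, it can be convenient to fold the calibration variants back onto their regular names so that one script handles both kinds of run. A small sketch:

```python
# Map calibration-run message names onto their regular-run equivalents.
CALIBRATION_ALIASES = {
    "starting calibration run": "starting run",
    "hazard calibrated": "hazard scored",
    "benchmark calibrated": "benchmark scored",
}

def normalized_message(entry):
    """Return the regular-run name for a message, calibration or not."""
    return CALIBRATION_ALIASES.get(entry["message"], entry["message"])
```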

Message-specific fields

Listed in run order. If a message isn't listed here, its fields are covered above under "Common fields".

  • starting run OR starting calibration run
    • run_id - a unique id for the run
    • benchmarks - list of uids for Benchmarks in the run
    • tests - list of uids for Tests in the run
    • suts - list of uids for the SUTs in the run
    • max_items - maximum number of Items per test to be run
    • thread_count - maximum number of threads per stage of the pipeline
  • hazard info
    • hazard - uid of the Hazard
    • benchmark - uid of the Benchmark the Hazard is part of
    • tests - list of uids of the Tests in the Hazard
  • test info
    • test - uid of Test
    • initialization - initialization record for Test
    • sut_options - any options passed to the SUT with a prompt
    • dependencies - the external dependencies of the Test, like the source file for the prompts
  • using test items
    • test - the uid of the Test
    • using - how many prompts will be used in this run
    • total - how many prompts were available total
  • queuing item
    • prompt_text - the text of the Prompt
  • fetched sut response
    • run_time - seconds taken to get the response
    • request - raw request sent to the SUT
    • response - raw response as received from the SUT
  • using cached sut response
    • request - raw request sent to the SUT
    • response - the raw response as previously received from the SUT
  • translated sut response
    • response_text - the text of the SUT response
  • fetched annotator response
    • annotator - the uid of the Annotator
    • run_time - seconds taken to get the response
    • request - formatted request sent to the annotator
    • response - raw response as received from the Annotator
  • using cached annotator response
    • annotator - the uid of the Annotator
    • request - formatted request sent to the annotator
    • response - raw response as previously received from the Annotator
  • translated annotation
    • annotator - the uid of the Annotator
    • annotation - the annotation as translated into a common format; for MLC Benchmarks this includes is_safe
  • measured item quality
    • measurements - the Test's measurements of an Item based on annotations
    • run_time - total time for collecting all Annotations for a test and calculating measurements
  • finished pipeline
    • time - total time to run the pipeline
    • total_finished - total Items that finished the pipeline
    • finished_counts - breakdown of finished items by SUT and Test uid
  • test scored
    • sut - the uid of the SUT scored
    • test - the uid of the Test scored
    • items_finished - items finished for the Test and SUT
    • result - a Test's view of the results
  • hazard scored
    • benchmark - the uid of the Benchmark being scored
    • sut - the uid of the SUT scored
    • hazard - the uid of the Hazard being scored
    • hazard_key - the key used to get the Hazard's reference score from the Standards
    • score - the Hazard's raw score, the fraction of prompts that were safe
    • reference - the grading reference score for the Hazard
    • samples - the number of Items used for the scores
    • numeric_grade - the numeric version of the grade (1 to 5)
    • text_grade - the text version of the grade
  • hazard calibrated
    • benchmark - the uid of the Benchmark being scored
    • sut - the uid of the SUT scored
    • hazard - the uid of the Hazard being scored
    • hazard_key - the key to be used for that Hazard's reference score in the new Standards file
    • score - the Hazard's raw score, the fraction of prompts that were safe
    • samples - the number of Items used for the scores
  • benchmark scored
    • benchmark - the uid of the Benchmark scored
    • sut - the uid of the SUT scored
    • numeric_grade - the numeric version of the grade (1 to 5)
    • text_grade - the text version of the grade
  • benchmark calibrated
    • benchmark - the uid of the Benchmark calibrated
    • sut - the uid of the SUT scored
  • cache info
    • type - which kind of cache
    • cache - the Python class of the cache plus key parameters
    • start_count - how many items were in the cache before the run
    • end_count - how many items were in the cache after the run
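As a final example, the per-SUT benchmark grades can be pulled from a regular run's journal by selecting the "benchmark scored" messages and reading the fields above. A sketch (the path is a placeholder):

```python
import json

# Collect (benchmark, sut) -> (numeric_grade, text_grade) from the journal.
grades = {}
with open("journal.jsonl") as journal:
    for line in journal:
        entry = json.loads(line)
        if entry["message"] == "benchmark scored":
            grades[(entry["benchmark"], entry["sut"])] = (
                entry["numeric_grade"],
                entry["text_grade"],
            )

for (benchmark, sut), (numeric, text) in sorted(grades.items()):
    print(f"{benchmark}  {sut}: {text} ({numeric})")
```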