run experiment on dataset #8433
In the code:

```python
from langfuse import get_client

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    ...

# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
```

Can my_eval_fn be one of the LLM-as-a-judge evaluators I set in langfuse?

---

No, you cannot directly use an LLM-as-a-Judge evaluator you set up in Langfuse as your my_eval_fn in the code example you provided. The my_eval_fn function in your script is meant for custom, local evaluation logic—such as comparing outputs or running your own checks—executed in your Python environment(1).

LLM-as-a-Judge evaluators are managed and executed within Langfuse itself. To use these, you should configure them in the Langfuse UI. When you run a Dataset Run (experiment) from the UI, Langfuse will automatically apply any selected LLM-as-a-Judge evaluators to the outputs of your application, scoring them based on the evaluator's prompts and logic(2)(3). This process is separate from your local Python code.

If you want to use LLM-as-a-Judge evaluation, set up the evaluator in Langfuse and trigger the Dataset Run through the UI, selecting the evaluators you want to apply. The scores will then be generated and stored in Langfuse automatically(2)(3).
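For contrast, this is roughly what a purely local my_eval_fn looks like inside that loop. A minimal sketch: run_my_app() and the exact-match metric are hypothetical placeholders, and item.input / item.expected_output refer to the dataset item fields discussed further down in this thread:

```python
from langfuse import get_client  # Langfuse Python SDK

def my_eval_fn(output, expected) -> float:
    """Custom, local evaluation logic that runs in your own Python process,
    not inside Langfuse. Here: a naive exact-match check (illustrative only)."""
    return 1.0 if str(output).strip() == str(expected).strip() else 0.0

def run_my_app(question):
    """Placeholder for your actual application call (LLM chain, agent, ...)."""
    return "some answer"

dataset = get_client().get_dataset("<dataset_name>")

scores = []
for item in dataset.items:
    output = run_my_app(item.input)                           # application under test
    scores.append(my_eval_fn(output, item.expected_output))   # local eval, no LLM judge

get_client().flush()  # make sure queued events reach the Langfuse server
print(f"mean local score: {sum(scores) / max(len(scores), 1):.2f}")
```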

---

When you set up an LLM-as-a-Judge evaluator (like for correctness), you can map the evaluator's variables (such as ground-truth) directly to fields in your dataset items. The evaluator does "read" the dataset by referencing these mapped fields during evaluation runs. For example, if your dataset items contain an expected output (the ground-truth), you map the evaluator's {{ground_truth}} variable to the dataset item's expected output field during the variable mapping step. This tells Langfuse to pass the expected answer from each dataset item to the evaluator as ground-truth. The same applies for other variables like input or output.

You configure this mapping in the UI when setting up the evaluator for a Dataset Run. Langfuse provides a prompt preview showing how actual dataset values fill the evaluator's variables, ensuring the ground-truth and other fields are correctly assigned(1)(2)(3).
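To make that mapping concrete, here is a hypothetical correctness-evaluator prompt. The wording and variable names are illustrative, not Langfuse's managed template; each {{...}} placeholder is an evaluator variable that you then map in the UI, e.g. {{ground_truth}} to the dataset item's expected output:

```python
# Hypothetical evaluator prompt template (illustrative only).
# {{input}}, {{output}} and {{ground_truth}} are the evaluator variables
# you map to dataset item / trace fields in the Langfuse UI.
EVALUATOR_PROMPT = """
You are grading a model answer for correctness.

Question:     {{input}}
Model answer: {{output}}
Ground truth: {{ground_truth}}

Score 1 if the model answer matches the ground truth, otherwise 0.
"""
```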

---

When running an LLM-as-a-Judge evaluator on a Dataset Run, each dataset item creates a trace and the evaluator operates on those traces. During variable mapping (e.g., for query, generation, ground_truth), you map each variable to properties on these trace objects—such as Input, Output, or Metadata. The dataset's fields (like expected output for ground_truth) are linked to these trace properties when the Dataset Run is executed.

Langfuse handles the connection: it populates the trace's input/output from the dataset item (input/expected_output), so mapping to trace variables in the evaluator setup effectively pulls from your dataset. You don't need to map directly to dataset fields; you map to the trace fields, which are already populated from your dataset during the run. If your dataset item has an expected_output, it is typically mapped to the trace's output or a custom metadata field, which you then select in the evaluator mapping step. Use JsonPath if your data is nested.

In summary: select the appropriate object (usually Trace or Generation), then pick the variable (Input/Output/Metadata) where your dataset's ground_truth is stored after the Dataset Run, and the evaluator will consume it as intended(1).
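If you drive the Dataset Run from code rather than the UI, the same linkage is visible in the SDK: whatever you write to the trace's input/output is what a trace-based variable mapping later reads. A hedged sketch, assuming the v3 Python SDK's item.run() context manager and update_trace() helper (check the SDK reference for your version) and a hypothetical my_app() application call:

```python
from langfuse import get_client

def my_app(question):
    # Hypothetical application under test (LLM chain, agent, ...).
    return "some answer"

langfuse = get_client()
dataset = langfuse.get_dataset("<dataset_name>")

for item in dataset.items:
    # item.run() links the trace created below to this dataset item and run
    # (assumed v3 SDK API; older SDK versions used item.observe()/item.link()).
    with item.run(run_name="exp-1") as root_span:
        output = my_app(item.input)
        # These trace fields are what "Object: Trace, Variable: Input/Output"
        # resolves to when the evaluator's variable mapping is applied.
        root_span.update_trace(input=item.input, output=output)

langfuse.flush()
```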

---

You're correct—the expected output from a dataset item is not automatically stored on the trace object; when running experiments, it stays on the dataset item itself.

When configuring LLM-as-a-Judge evaluators for Offline Experiment Data (datasets), the variable mapping works differently than for live production traces. You map evaluator variables directly to Dataset item properties, not trace properties. (1) For example, if your evaluator prompt includes a {{ground_truth}} variable, you can map it straight to the dataset item's expected output field.

The system will often autocomplete common mappings based on typical field names in experiments. You can map any properties of your experiment item (e.g., input, expected output, metadata) to the evaluator's variables.

If you're only seeing Trace/Span/Generation/Event options in the Object dropdown and not "Dataset item," ensure you've selected Offline Experiment Data as your evaluation target rather than Live Production Data. The variable mapping options differ based on this selection. (1)

---

Yes, when configuring an LLM-as-a-Judge evaluator for Offline Experiment Data (datasets), you can map the evaluator's variables directly to Dataset item properties rather than trace or generation properties. (1) For your use case, you would map the evaluator's {{ground_truth}} variable to the dataset item's expected output.

The variable mapping for experiment data allows you to select from:
- the Dataset item's fields (Input, Expected output, Metadata), and
- the usual trace-level objects (Trace/Span/Generation/Event) and their Input/Output/Metadata.

So your mapping could look like:
- {{ground_truth}} → Object: Dataset item, Object Variable: Expected output

Make sure you have Offline Experiment Data selected as your evaluation target—this is what enables mapping to dataset item fields directly. (1)
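For orientation, those Dataset item properties are the same fields you see on items fetched via the Python SDK; a small sketch (attribute names follow the SDK's dataset item model):

```python
from langfuse import get_client

dataset = get_client().get_dataset("<dataset_name>")
item = dataset.items[0]

# The properties offered under "Dataset item" in the variable mapping UI:
print(item.input)            # -> e.g. the evaluator's {{input}} variable
print(item.expected_output)  # -> ground truth, e.g. {{ground_truth}}
print(item.metadata)         # -> any extra fields stored on the item
```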

---

When mapping evaluator variables for Offline Experiment Data, you should be able to select Dataset item as the Object and then choose Input as the Object Variable. This should map to your dataset item's input field—not the generation's full prompt input. (1)

If you're seeing the full generation input (the complete prompt with variables substituted) instead of just your dataset item's input values, verify that:
- the evaluation target is set to Offline Experiment Data, and
- the Object in the variable mapping is "Dataset item," not Trace or Generation.

The mapping should look like:
- Object: Dataset item
- Object Variable: Input

If the data is nested within a JSON structure, you can use a JsonPath expression to extract just the field you need.

If you've confirmed the Object is set to "Dataset item" and you're still getting the full prompt, this may be a UI or configuration issue worth raising in GitHub Discussions. (2)
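As a side note on JsonPath: suppose a dataset item's input is a nested object (the field names below are made up for illustration); a JsonPath expression then selects just the piece you want to feed into the evaluator variable:

```python
# Hypothetical nested input stored on a dataset item (illustrative field names).
dataset_item_input = {
    "question": "What is the capital of France?",
    "context": {"source": "geography-faq", "lang": "en"},
}

# In the evaluator's variable mapping, a JsonPath such as
#   $.question
# applied to this object passes only "What is the capital of France?"
# into the mapped variable, instead of the whole JSON structure.
```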