run experiment on dataset #8433
In the code:

```python
from langfuse import get_client

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    ...

# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
```

Can my_eval_fn be one of the LLM-as-a-judge evaluators I set in langfuse?

---

No, you cannot directly use an LLM-as-a-Judge evaluator you set up in Langfuse as your my_eval_fn in the code example you provided. The my_eval_fn function in your script is meant for custom, local evaluation logic—such as comparing outputs or running your own checks—executed in your Python environment(1).

LLM-as-a-Judge evaluators are managed and executed within Langfuse itself. To use these, you should configure them in the Langfuse UI. When you run a Dataset Run (experiment) from the UI, Langfuse will automatically apply any selected LLM-as-a-Judge evaluators to the outputs of your application, scoring them based on the evaluator's prompts and logic(2)(3). This process is separate from your local Python code.

If you want to use LLM-as-a-Judge evaluation, set up the evaluator in Langfuse and trigger the Dataset Run through the UI, selecting the evaluators you want to apply. The scores will then be generated and stored in Langfuse automatically(2)(3).
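For contrast, this is roughly what a purely local my_eval_fn looks like inside that loop. A minimal sketch: run_my_app() and the exact-match metric are hypothetical placeholders, and item.input / item.expected_output refer to the dataset item fields discussed further down in this thread:

```python
from langfuse import get_client  # Langfuse Python SDK

def my_eval_fn(output, expected) -> float:
    """Custom, local evaluation logic that runs in your own Python process,
    not inside Langfuse. Here: a naive exact-match check (illustrative only)."""
    return 1.0 if str(output).strip() == str(expected).strip() else 0.0

def run_my_app(question):
    """Placeholder for your actual application call (LLM chain, agent, ...)."""
    return "some answer"

dataset = get_client().get_dataset("<dataset_name>")

scores = []
for item in dataset.items:
    output = run_my_app(item.input)                           # application under test
    scores.append(my_eval_fn(output, item.expected_output))   # local eval, no LLM judge

get_client().flush()  # make sure queued events reach the Langfuse server
print(f"mean local score: {sum(scores) / max(len(scores), 1):.2f}")
```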

---

When you set up an LLM-as-a-Judge evaluator (like for correctness), you can map the evaluator's variables (such as ground-truth) directly to fields in your dataset items. The evaluator does "read" the dataset by referencing these mapped fields during evaluation runs. For example, if your dataset items contain an expected output (the ground-truth), you map the evaluator's {{ground_truth}} variable to the dataset item's expected output field during the variable mapping step. This tells Langfuse to pass the expected answer from each dataset item to the evaluator as ground-truth. The same applies for other variables like input or output.

You configure this mapping in the UI when setting up the evaluator for a Dataset Run. Langfuse provides a prompt preview showing how actual dataset values fill the evaluator's variables, ensuring the ground-truth and other fields are correctly assigned(1)(2)(3).
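To make that mapping concrete, here is a hypothetical correctness-evaluator prompt. The wording and variable names are illustrative, not Langfuse's managed template; each {{...}} placeholder is an evaluator variable that you then map in the UI, e.g. {{ground_truth}} to the dataset item's expected output:

```python
# Hypothetical evaluator prompt template (illustrative only).
# {{input}}, {{output}} and {{ground_truth}} are the evaluator variables
# you map to dataset item / trace fields in the Langfuse UI.
EVALUATOR_PROMPT = """
You are grading a model answer for correctness.

Question:     {{input}}
Model answer: {{output}}
Ground truth: {{ground_truth}}

Score 1 if the model answer matches the ground truth, otherwise 0.
"""
```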

---

When running an LLM-as-a-Judge evaluator on a Dataset Run, each dataset item creates a trace and the evaluator operates on those traces. During variable mapping (e.g., for query, generation, ground_truth), you map each variable to properties on these trace objects—such as Input, Output, or Metadata. The dataset's fields (like expected output for ground_truth) are linked to these trace properties when the Dataset Run is executed.

Langfuse handles the connection: it populates the trace's input/output from the dataset item (input/expected_output), so mapping to trace variables in the evaluator setup effectively pulls from your dataset. You don't need to map directly to dataset fields; you map to the trace fields, which are already populated from your dataset during the run. If your dataset item has an expected_output, it is typically mapped to the trace's output or a custom metadata field, which you then select in the evaluator mapping step. Use JsonPath if your data is nested.

In summary: select the appropriate object (usually Trace or Generation), then pick the variable (Input/Output/Metadata) where your dataset's ground_truth is stored after the Dataset Run, and the evaluator will consume it as intended(1).
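If you drive the Dataset Run from code rather than the UI, the same linkage is visible in the SDK: whatever you write to the trace's input/output is what a trace-based variable mapping later reads. A hedged sketch, assuming the v3 Python SDK's item.run() context manager and update_trace() helper (check the SDK reference for your version) and a hypothetical my_app() application call:

```python
from langfuse import get_client

def my_app(question):
    # Hypothetical application under test (LLM chain, agent, ...).
    return "some answer"

langfuse = get_client()
dataset = langfuse.get_dataset("<dataset_name>")

for item in dataset.items:
    # item.run() links the trace created below to this dataset item and run
    # (assumed v3 SDK API; older SDK versions used item.observe()/item.link()).
    with item.run(run_name="exp-1") as root_span:
        output = my_app(item.input)
        # These trace fields are what "Object: Trace, Variable: Input/Output"
        # resolves to when the evaluator's variable mapping is applied.
        root_span.update_trace(input=item.input, output=output)

langfuse.flush()
```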

---

You're correct—the expected output from a dataset item is not automatically stored on the trace object; when running experiments, it stays on the dataset item itself.

When configuring LLM-as-a-Judge evaluators for Offline Experiment Data (datasets), the variable mapping works differently than for live production traces. You map evaluator variables directly to Dataset item properties, not trace properties. (1) For example, if your evaluator prompt includes a {{ground_truth}} variable, you can map it straight to the dataset item's expected output field.

The system will often autocomplete common mappings based on typical field names in experiments. You can map any properties of your experiment item (e.g., input, expected output, metadata) to the evaluator's variables.

If you're only seeing Trace/Span/Generation/Event options in the Object dropdown and not "Dataset item," ensure you've selected Offline Experiment Data as your evaluation target rather than Live Production Data. The variable mapping options differ based on this selection. (1)

---

Yes, when configuring an LLM-as-a-Judge evaluator for Offline Experiment Data (datasets), you can map the evaluator's variables directly to Dataset item properties rather than trace or generation properties. (1) For your use case, you would map the evaluator's {{ground_truth}} variable to the dataset item's expected output.

The variable mapping for experiment data allows you to select from:
- the Dataset item's fields (Input, Expected output, Metadata), and
- the usual trace-level objects (Trace/Span/Generation/Event) and their Input/Output/Metadata.

So your mapping could look like:
- {{ground_truth}} → Object: Dataset item, Object Variable: Expected output

Make sure you have Offline Experiment Data selected as your evaluation target—this is what enables mapping to dataset item fields directly. (1)
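For orientation, those Dataset item properties are the same fields you see on items fetched via the Python SDK; a small sketch (attribute names follow the SDK's dataset item model):

```python
from langfuse import get_client

dataset = get_client().get_dataset("<dataset_name>")
item = dataset.items[0]

# The properties offered under "Dataset item" in the variable mapping UI:
print(item.input)            # -> e.g. the evaluator's {{input}} variable
print(item.expected_output)  # -> ground truth, e.g. {{ground_truth}}
print(item.metadata)         # -> any extra fields stored on the item
```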

---

When mapping evaluator variables for Offline Experiment Data, you should be able to select Dataset item as the Object and then choose Input as the Object Variable. This should map to your dataset item's input field—not the generation's full prompt input. (1)

If you're seeing the full generation input (the complete prompt with variables substituted) instead of just your dataset item's input values, verify that:
- the evaluation target is set to Offline Experiment Data, and
- the Object in the variable mapping is "Dataset item," not Trace or Generation.

The mapping should look like:
- Object: Dataset item
- Object Variable: Input

If the data is nested within a JSON structure, you can use a JsonPath expression to extract just the field you need.

If you've confirmed the Object is set to "Dataset item" and you're still getting the full prompt, this may be a UI or configuration issue worth raising in GitHub Discussions. (2)
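As a side note on JsonPath: suppose a dataset item's input is a nested object (the field names below are made up for illustration); a JsonPath expression then selects just the piece you want to feed into the evaluator variable:

```python
# Hypothetical nested input stored on a dataset item (illustrative field names).
dataset_item_input = {
    "question": "What is the capital of France?",
    "context": {"source": "geography-faq", "lang": "en"},
}

# In the evaluator's variable mapping, a JsonPath such as
#   $.question
# applied to this object passes only "What is the capital of France?"
# into the mapped variable, instead of the whole JSON structure.
```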