[NA] [TS SDK] feat: indexed keys in LLMJudge schema, reasoning_effort, and UX improvements by alexkuzmik · Pull Request #6011 · comet-ml/opik

alexkuzmik · 2026-03-31T18:59:43Z

Details

Port Python SDK PRs #5690 and #5677 to the TypeScript SDK for cross-provider compatibility and UX parity.

Response schema (from #5690):

Use indexed keys (assertion_1, assertion_2, ...) instead of assertion text as JSON schema property names — fixes Anthropic's ^[a-zA-Z0-9_.-]{1,64}$ requirement and OpenAI's 15k combined character limit
Store original assertion text as Zod .describe() on each field
Refactor standalone buildResponseSchema() + parseResponse() into a ResponseSchema class with formatAssertions() and parse() methods
Update prompt to instruct the LLM to use field keys as JSON property names
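The indexed-key idea above can be sketched without any dependencies. This is a minimal illustration, not the PR's code: the PR builds its schema with Zod and stores the assertion text via `.describe()`, whereas this sketch uses plain JSON-Schema-like objects. The class name `IndexedResponseSchema`, the `passed`/`reason` verdict shape, and the missing-key behaviour are all assumptions.

```typescript
// Hypothetical, dependency-free sketch. The PR builds its schema with Zod;
// plain JSON-Schema-like objects are used here only to keep the example standalone.

// Assumed per-assertion verdict shape (not confirmed by the PR description).
interface Verdict {
  passed: boolean;
  reason: string;
}

interface AssertionResult {
  name: string; // original assertion text, restored from the indexed key
  passed: boolean;
  reason: string;
}

interface FieldSchema {
  type: "object";
  description: string;
  required: string[];
}

class IndexedResponseSchema {
  // Maps "assertion_1", "assertion_2", ... to the original assertion text.
  private readonly fieldMapping = new Map<string, string>();

  constructor(assertions: string[]) {
    assertions.forEach((text, i) => {
      this.fieldMapping.set(`assertion_${i + 1}`, text);
    });
  }

  // Property names stay short and provider-safe; the text lives in `description`.
  toJsonSchema(): { type: "object"; properties: Record<string, FieldSchema>; required: string[] } {
    const properties: Record<string, FieldSchema> = {};
    this.fieldMapping.forEach((text, key) => {
      properties[key] = { type: "object", description: text, required: ["passed", "reason"] };
    });
    return { type: "object", properties, required: Object.keys(properties) };
  }

  // Renders the assertion list for the prompt so the model sees each key.
  formatAssertions(): string {
    const lines: string[] = [];
    this.fieldMapping.forEach((text, key) => lines.push(`- ${key}: ${text}`));
    return lines.join("\n");
  }

  // Maps the model's keyed response back to the original assertion text.
  // A missing key is reported as a failure (assumed behaviour).
  parse(raw: Record<string, Verdict | undefined>): AssertionResult[] {
    const results: AssertionResult[] = [];
    this.fieldMapping.forEach((text, key) => {
      const verdict = raw[key];
      results.push(
        verdict
          ? { name: text, passed: verdict.passed, reason: verdict.reason }
          : { name: text, passed: false, reason: `Missing field ${key} in response` },
      );
    });
    return results;
  }
}
```

Because the property names depend only on the assertion's position, an assertion of any length or character set still yields a key that fits the pattern quoted above.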

UX improvements (from #5677):

Add reasoningEffort option to LLMJudge (defaults to "low") for faster assertion checking, serialized as customParameters.reasoning_effort in config
Add ---BEGIN INPUT---/---END INPUT--- and ---BEGIN OUTPUT---/---END OUTPUT--- delimiters in LLM judge prompt so short agent outputs don't blend into description text
Move dashboard link inside the result box (bold cyan, at the top)
Remove "Uploading results to Opik..." message
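The delimiter change in the list above might look roughly like the following. The delimiter strings are taken from the description; the helper names `wrapSection` and `buildJudgePrompt` and the surrounding prompt wording are hypothetical and will differ from the SDK's actual template.

```typescript
// Hypothetical helpers; only the delimiter strings come from the PR description.
function wrapSection(label: "INPUT" | "OUTPUT", content: string): string {
  return `---BEGIN ${label}---\n${content}\n---END ${label}---`;
}

function buildJudgePrompt(input: string, output: string, assertions: string): string {
  return [
    // Placeholder instructions, not the SDK's real prompt text.
    "Evaluate the agent output against each assertion.",
    "Use the field keys as JSON property names in your response.",
    "",
    wrapSection("INPUT", input),
    "",
    wrapSection("OUTPUT", output),
    "",
    "Assertions:",
    assertions,
  ].join("\n");
}
```

With explicit boundaries, a one-word output such as `ok` stays visually and syntactically separate from the instructions around it.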

Cross-source compatibility:

fromConfig() now reads schema[i].description (falling back to schema[i].name) to match Python SDK behavior and correctly handle UI-created configs where name is a short label (e.g. "Correctness") and description is the full assertion text
toConfig() now writes variables as path-style {"input": "input", "output": "output"} matching Python SDK and backend convention
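A rough sketch of the two compatibility rules above, assuming a simplified config shape. The `SchemaItem` interface and both function names are hypothetical; the fallback uses `??`, so only a missing description falls back to the name, and whether the SDK also treats an empty string as missing is not stated in the PR.

```typescript
// Hypothetical, simplified config item; the real config has more fields.
interface SchemaItem {
  name: string;
  description?: string;
}

// Prefer the full assertion text in `description`, fall back to `name`.
function assertionsFromSchema(schema: SchemaItem[]): string[] {
  return schema.map((item) => item.description ?? item.name);
}

// Path-style variables, as quoted in the PR description.
function defaultVariables(): Record<string, string> {
  return { input: "input", output: "output" };
}
```

For a UI-created item such as `{ name: "Correctness", description: "Whether the output is correct" }`, the judge would then evaluate the full sentence instead of the short label.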

Change checklist

User facing
Documentation update

Issues

Ports [OPIK-4992] [Python SDK] fix: use indexed keys in LLMJudge response schema for cross-provider compatibility #5690 and [OPIK-4957] [SDK] feat: improve evaluation suite run experience and performance #5677

AI-WATERMARK

AI-WATERMARK: yes

If yes:
- Tools: Claude Code
- Model(s): Claude Opus 4.6
- Scope: Implementation, tests, PR description
- Human verification: Code reviewed and iterated on by author

Testing

Unit tests: npx vitest run tests/unit/evaluation/suite_evaluators/ (44 passed)
E2E: npx tsx examples/evaluation_suite_example.ts — 3/3 items passed, 100% pass rate against local Opik server with real OpenAI LLM calls
Cross-provider: Verified indexed keys schema works with OpenAI gpt-5-nano (single, multiple, long assertions)

Documentation

No documentation updates required — these are internal implementation changes. The public LLMJudge API gains an optional reasoningEffort parameter but existing usage is unaffected.

…, and UX improvements

Port Python SDK PRs #5690 and #5677 to TypeScript SDK:
- Use indexed keys (assertion_1, assertion_2) instead of assertion text as JSON schema property names for cross-provider compatibility (Anthropic, OpenAI character limits)
- Refactor buildResponseSchema/parseResponse into ResponseSchema class
- Add reasoningEffort option to LLMJudge (defaults to "low")
- Add ---BEGIN/END--- delimiters around input/output in LLM judge prompt
- Move dashboard link inside result box, remove "Uploading results" message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…compatibility

Ensures UI-created LLM judge configs (where name="Correctness" but description="Whether the output is correct") deserialize correctly. Also fixes variables format to match Python SDK / backend convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-31T19:00:00Z

📋 PR Linter Failed

❌ Missing Section. The description is missing the ## Documentation section.

sdks/typescript/src/opik/evaluation/suite_evaluators/index.ts

sdks/typescript/src/opik/evaluation/suite_evaluators/llmJudgeParsers.ts

…nce-batch

The result box with the dashboard link was only displayed when metrics were present. Moved getUrl() to processResults so the link is always shown, fixing the evaluate.test.ts regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sdks/typescript/src/opik/evaluation/results/EvaluationResultProcessor.ts

Wrap experiment.getUrl() in try/catch so a missing dataset doesn't crash the evaluation results flow. The dashboard link is skipped gracefully if the URL cannot be resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
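The guard described in this commit could be sketched as below. The `HasUrl` interface and the helper name `resolveDashboardLink` are hypothetical, and treating `getUrl()` as asynchronous is an assumption; the point is only that a failed lookup yields no link and does not abort result processing.

```typescript
// Hypothetical stand-in for the experiment object; assumes getUrl() is async.
interface HasUrl {
  getUrl(): Promise<string>;
}

// Returns the dashboard URL, or undefined when it cannot be resolved
// (for example because the dataset is missing).
async function resolveDashboardLink(experiment: HasUrl): Promise<string | undefined> {
  try {
    return await experiment.getUrl();
  } catch {
    // Skip the link; the caller should still print the results.
    return undefined;
  }
}
```

A caller would print the link only when the returned value is defined.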

The test was missing a mock for Experiment.insert's underlying API call, causing unhandled 401 rejections in CI after test teardown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

awkoy

Nice work overall -- the indexed keys approach and ResponseSchema class are solid improvements. A few issues below, the main one being that reasoningEffort isn't wired through to the actual model call.

awkoy · 2026-04-01T11:02:07Z

sdks/typescript/src/opik/evaluation/suite_evaluators/LLMJudge.ts

          }),
          ...(this.seed !== undefined && { seed: this.seed }),
-          output: Output.object({ schema: responseSchema }),
+          output: Output.object({ schema: schema.responseSchema }),


reasoningEffort is stored and serialized into toConfig() (line 88), but it's never passed to generateProviderResponse here. The LLM never actually receives the reasoning effort setting at runtime. Should this be something like reasoning_effort: this.reasoningEffort in these options?

Commit 1de62ef addressed this comment by passing reasoning_effort: this.reasoningEffort into the options for generateProviderResponse, ensuring the runtime request uses the stored setting, and by reusing the cached response schema for the provider output parsing.
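As a hedged illustration of the fix, the options object might be assembled like this. The `JudgeSettings` interface and `buildProviderOptions` are hypothetical; only the `seed` conditional spread, the `reasoning_effort` key, and the `"low"` default are taken from the PR, and the real call to `generateProviderResponse` takes further arguments not shown here.

```typescript
// Hypothetical settings holder; the real LLMJudge has more fields.
interface JudgeSettings {
  seed?: number;
  reasoningEffort?: string;
}

// Builds the extra options forwarded with the model call.
function buildProviderOptions(settings: JudgeSettings): Record<string, unknown> {
  return {
    // Same conditional-spread pattern as the `seed` line in the diff above.
    ...(settings.seed !== undefined && { seed: settings.seed }),
    // The PR description states that reasoningEffort defaults to "low".
    reasoning_effort: settings.reasoningEffort ?? "low",
  };
}
```

Storing the value in `toConfig()` without also forwarding it here would leave the runtime request unchanged, which is the gap the review comment points out.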

awkoy · 2026-04-01T11:02:07Z

sdks/typescript/src/opik/evaluation/suite_evaluators/LLMJudge.ts

-
-function formatAssertionsList(assertions: string[]): string {
-  return assertions.map((a, i) => `${i + 1}. ${a}`).join("\n");
+  reasoningEffort?: string;


Nit: this accepts any arbitrary string. Would a union type like "low" | "medium" | "high" be more appropriate to prevent invalid values from being silently serialized into configs?
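One possible middle ground between a closed union and a bare `string`, sketched under assumptions: an open union that keeps editor autocompletion for the three values named in this comment while still accepting other strings. The names `KNOWN_EFFORTS`, `ReasoningEffort`, and `checkReasoningEffort` are hypothetical and do not appear in the PR.

```typescript
// The three values come from the review comment; treating them as the
// "known" set is an assumption, since providers may accept other values.
const KNOWN_EFFORTS = ["low", "medium", "high"] as const;
type KnownEffort = (typeof KNOWN_EFFORTS)[number];

// `string & {}` keeps the type open to any string while preserving
// autocompletion for the known literals.
type ReasoningEffort = KnownEffort | (string & {});

function isKnownEffort(value: string): value is KnownEffort {
  return (KNOWN_EFFORTS as readonly string[]).includes(value);
}

// Hypothetical helper: flag unknown values instead of rejecting them.
function checkReasoningEffort(value: ReasoningEffort): { value: string; known: boolean } {
  return { value, known: isKnownEffort(value) };
}
```

This does not prevent an invalid value from being serialized; it only makes the common values discoverable and lets a caller log or warn on anything else.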

awkoy · 2026-04-01T11:02:07Z

sdks/typescript/src/opik/evaluation/suite_evaluators/llmJudgeParsers.ts

-  const results: EvaluationScoreResult[] = [];
+export class ResponseSchema {
+  private readonly assertions: string[];
+  private readonly fieldMapping: Map<string, string>;


This field is assigned in the constructor but never read anywhere in the class -- only fieldMapping is used. Looks like dead code that can be removed.

Commit 1de62ef addressed this comment by removing the unused assertions field and its constructor assignment so only fieldMapping remains.

awkoy · 2026-04-01T11:02:07Z

sdks/typescript/src/opik/evaluation/suite_evaluators/LLMJudge.ts


-    const assertionsList = formatAssertionsList(this.assertions);
+    const schema = new ResponseSchema(this.assertions);



ResponseSchema is constructed on every score() call (and again in toConfig() at line 70). Since this.assertions is immutable after construction, this could be a single instance field initialized in the constructor. In batch evaluations this adds up to a lot of unnecessary Map + z.object() allocations.

Commit 1de62ef addressed this comment by introducing a single ResponseSchema instance stored on the judge (used in toConfig and score) instead of re-creating it on every call.
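The caching change can be illustrated with a small stand-in. `SchemaStub`, `CachedJudge`, and the `schemaBuilds` counter are hypothetical and exist only to show that the schema is built once in the constructor and then reused by `score()` and `toConfig()`.

```typescript
// Counts constructions so the effect of caching is observable.
let schemaBuilds = 0;

// Hypothetical stand-in for ResponseSchema.
class SchemaStub {
  readonly size: number;

  constructor(assertions: readonly string[]) {
    schemaBuilds += 1;
    this.size = assertions.length;
  }
}

class CachedJudge {
  // Built once; safe because the assertions do not change after construction.
  private readonly responseSchema: SchemaStub;

  constructor(assertions: readonly string[]) {
    this.responseSchema = new SchemaStub(assertions);
  }

  // Both methods reuse the cached instance and allocate nothing new.
  score(): number {
    return this.responseSchema.size;
  }

  toConfig(): { assertionCount: number } {
    return { assertionCount: this.responseSchema.size };
  }
}
```

In a batch run, the number of schema constructions then scales with the number of judges and no longer with the number of scored items.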

…Effort, remove dead code

- Cache ResponseSchema as instance field instead of recreating on every score()/toConfig() call
- Pass reasoning_effort to generateProviderResponse so the LLM actually receives it at runtime
- Remove unused assertions field from ResponseSchema class

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alexkuzmik · 2026-04-01T11:42:13Z

Addressing PR review comments

@awkoy's comments — all fixed in `1de62ef`

reasoningEffort not passed to generateProviderResponse — Fixed. Now passes reasoning_effort: this.reasoningEffort in the options to generateProviderResponse.
reasoningEffort should be a union type — Keeping as string intentionally to match the Python SDK and allow provider-specific values that may vary across OpenAI/Anthropic/Gemini. The Vercel AI SDK passes it through as-is.
assertions field in ResponseSchema is dead code — Fixed. Removed the unused field, only fieldMapping is kept.
ResponseSchema created on every score() call — Fixed. Now cached as an instance field this.responseSchema in the constructor, reused in both score() and toConfig().

@baz-reviewer comments — skipping

Removed buildResponseSchema/parseResponse exports — Intentional. These were internal helpers, SDK is pre-1.0, and ResponseSchema is the replacement.
Duplicate failure object literals in parse() — Keeping as-is. Three similar lines are clearer than a premature abstraction.
Add EvaluationResultProcessor.test.ts — Already covered by evaluate.test.ts which exercises the no-metrics and with-metrics paths.
getUrl() can throw — Already fixed in 06c14dc (wrapped in try/catch).

baz-reviewer · 2026-04-01T11:42:36Z

Thanks for the detailed follow-up on how you addressed or skipped the comments. I appreciate the context, but I can only save feedback tied to specific code review findings to memory, not higher-level PR decisions or general review responses.

alexkuzmik and others added 2 commits March 31, 2026 19:26

alexkuzmik requested a review from a team as a code owner March 31, 2026 18:59

github-actions bot added the tests, typescript, and TypeScript SDK labels Mar 31, 2026

github-actions bot assigned alexkuzmik Mar 31, 2026

baz-reviewer bot reviewed Mar 31, 2026


alexkuzmik and others added 2 commits March 31, 2026 23:09

Merge branch 'main' into alexkuzmik/ts-sdk-evaluation-suites-performa…

743c63d

…nce-batch

baz-reviewer bot reviewed Mar 31, 2026


alexkuzmik and others added 2 commits April 1, 2026 12:14

fix(ts-sdk): mock createExperimentItems in evaluateWithVersion test

e6d5d93


baz-reviewer bot approved these changes Apr 1, 2026

awkoy requested changes Apr 1, 2026

baz-reviewer bot approved these changes Apr 1, 2026

awkoy approved these changes Apr 1, 2026

alexkuzmik merged commit 3028b0d into main Apr 1, 2026
25 checks passed

alexkuzmik deleted the alexkuzmik/ts-sdk-evaluation-suites-performance-batch branch April 1, 2026 11:58

