Skip to content

[NA] [TS SDK] feat: indexed keys in LLMJudge schema, reasoning_effort, and UX improvements#6011

Merged
alexkuzmik merged 7 commits intomainfrom
alexkuzmik/ts-sdk-evaluation-suites-performance-batch
Apr 1, 2026
Merged

[NA] [TS SDK] feat: indexed keys in LLMJudge schema, reasoning_effort, and UX improvements#6011
alexkuzmik merged 7 commits intomainfrom
alexkuzmik/ts-sdk-evaluation-suites-performance-batch

Conversation

@alexkuzmik
Copy link
Copy Markdown
Collaborator

@alexkuzmik alexkuzmik commented Mar 31, 2026

Details

Port Python SDK PRs #5690 and #5677 to the TypeScript SDK for cross-provider compatibility and UX parity.

Response schema (from #5690):

  • Use indexed keys (assertion_1, assertion_2, ...) instead of assertion text as JSON schema property names — fixes Anthropic's ^[a-zA-Z0-9_.-]{1,64}$ requirement and OpenAI's 15k combined character limit
  • Store original assertion text as Zod .describe() on each field
  • Refactor standalone buildResponseSchema() + parseResponse() into a ResponseSchema class with formatAssertions() and parse() methods
  • Update prompt to instruct the LLM to use field keys as JSON property names

UX improvements (from #5677):

  • Add reasoningEffort option to LLMJudge (defaults to "low") for faster assertion checking, serialized as customParameters.reasoning_effort in config
  • Add ---BEGIN INPUT---/---END INPUT--- and ---BEGIN OUTPUT---/---END OUTPUT--- delimiters in LLM judge prompt so short agent outputs don't blend into description text
  • Move dashboard link inside the result box (bold cyan, at the top)
  • Remove "Uploading results to Opik..." message

Cross-source compatibility:

  • fromConfig() now reads schema[i].description (falling back to schema[i].name) to match Python SDK behavior and correctly handle UI-created configs where name is a short label (e.g. "Correctness") and description is the full assertion text
  • toConfig() now writes variables as path-style {"input": "input", "output": "output"} matching Python SDK and backend convention

Change checklist

  • User facing
  • Documentation update

Issues

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: Implementation, tests, PR description
    • Human verification: Code reviewed and iterated on by author

Testing

  • Unit tests: npx vitest run tests/unit/evaluation/suite_evaluators/ (44 passed)
  • E2E: npx tsx examples/evaluation_suite_example.ts — 3/3 items passed, 100% pass rate against local Opik server with real OpenAI LLM calls
  • Cross-provider: Verified indexed keys schema works with OpenAI gpt-5-nano (single, multiple, long assertions)

Documentation

No documentation updates required — these are internal implementation changes. The public LLMJudge API gains an optional reasoningEffort parameter but existing usage is unaffected.

alexkuzmik and others added 2 commits March 31, 2026 19:26
…, and UX improvements

Port Python SDK PRs #5690 and #5677 to TypeScript SDK:

- Use indexed keys (assertion_1, assertion_2) instead of assertion text as
  JSON schema property names for cross-provider compatibility (Anthropic,
  OpenAI character limits)
- Refactor buildResponseSchema/parseResponse into ResponseSchema class
- Add reasoningEffort option to LLMJudge (defaults to "low")
- Add ---BEGIN/END--- delimiters around input/output in LLM judge prompt
- Move dashboard link inside result box, remove "Uploading results" message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…compatibility

Ensures UI-created LLM judge configs (where name="Correctness" but
description="Whether the output is correct") deserialize correctly.
Also fixes variables format to match Python SDK / backend convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexkuzmik alexkuzmik requested a review from a team as a code owner March 31, 2026 18:59
@github-actions github-actions bot added tests Including test files, or tests related like configuration. typescript *.ts *.tsx TypeScript SDK labels Mar 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

📋 PR Linter Failed

Missing Section. The description is missing the ## Documentation section.

alexkuzmik and others added 2 commits March 31, 2026 23:09
The result box with the dashboard link was only displayed when metrics
were present. Moved getUrl() to processResults so the link is always
shown, fixing the evaluate.test.ts regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alexkuzmik and others added 2 commits April 1, 2026 12:14
Wrap experiment.getUrl() in try/catch so a missing dataset doesn't
crash the evaluation results flow. The dashboard link is skipped
gracefully if the URL cannot be resolved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test was missing a mock for Experiment.insert's underlying API call,
causing unhandled 401 rejections in CI after test teardown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@awkoy awkoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work overall -- the indexed keys approach and ResponseSchema class are solid improvements. A few issues below, the main one being that reasoningEffort isn't wired through to the actual model call.

}),
...(this.seed !== undefined && { seed: this.seed }),
output: Output.object({ schema: responseSchema }),
output: Output.object({ schema: schema.responseSchema }),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reasoningEffort is stored and serialized into toConfig() (line 88), but it's never passed to generateProviderResponse here. The LLM never actually receives the reasoning effort setting at runtime. Should this be something like reasoning_effort: this.reasoningEffort in these options?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 1de62ef addressed this comment by passing reasoning_effort: this.reasoningEffort into the options for generateProviderResponse, ensuring the runtime request uses the stored setting, and by reusing the cached response schema for the provider output parsing.


function formatAssertionsList(assertions: string[]): string {
return assertions.map((a, i) => `${i + 1}. ${a}`).join("\n");
reasoningEffort?: string;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this accepts any arbitrary string. Would a union type like "low" | "medium" | "high" be more appropriate to prevent invalid values from being silently serialized into configs?

const results: EvaluationScoreResult[] = [];
export class ResponseSchema {
private readonly assertions: string[];
private readonly fieldMapping: Map<string, string>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is assigned in the constructor but never read anywhere in the class -- only fieldMapping is used. Looks like dead code that can be removed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 1de62ef addressed this comment by removing the unused assertions field and its constructor assignment so only fieldMapping remains.


const assertionsList = formatAssertionsList(this.assertions);
const schema = new ResponseSchema(this.assertions);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResponseSchema is constructed on every score() call (and again in toConfig() at line 70). Since this.assertions is immutable after construction, this could be a single instance field initialized in the constructor. In batch evaluations this adds up to a lot of unnecessary Map + z.object() allocations.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 1de62ef addressed this comment by introducing a single ResponseSchema instance stored on the judge (used in toConfig and score) instead of re-creating it on every call.

…Effort, remove dead code

- Cache ResponseSchema as instance field instead of recreating on every
  score()/toConfig() call
- Pass reasoning_effort to generateProviderResponse so the LLM actually
  receives it at runtime
- Remove unused assertions field from ResponseSchema class

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexkuzmik
Copy link
Copy Markdown
Collaborator Author

Addressing PR review comments

@awkoy's comments — all fixed in 1de62ef

  1. reasoningEffort not passed to generateProviderResponse — Fixed. Now passes reasoning_effort: this.reasoningEffort in the options to generateProviderResponse.

  2. reasoningEffort should be a union type — Keeping as string intentionally to match the Python SDK and allow provider-specific values that may vary across OpenAI/Anthropic/Gemini. The Vercel AI SDK passes it through as-is.

  3. assertions field in ResponseSchema is dead code — Fixed. Removed the unused field, only fieldMapping is kept.

  4. ResponseSchema created on every score() call — Fixed. Now cached as an instance field this.responseSchema in the constructor, reused in both score() and toConfig().

@baz-reviewer comments — skipping

  1. Removed buildResponseSchema/parseResponse exports — Intentional. These were internal helpers, SDK is pre-1.0, and ResponseSchema is the replacement.

  2. Duplicate failure object literals in parse() — Keeping as-is. Three similar lines are clearer than a premature abstraction.

  3. Add EvaluationResultProcessor.test.ts — Already covered by evaluate.test.ts which exercises the no-metrics and with-metrics paths.

  4. getUrl() can throw — Already fixed in 06c14dc (wrapped in try/catch).

@baz-reviewer
Copy link
Copy Markdown
Contributor

baz-reviewer bot commented Apr 1, 2026

Thanks for the detailed follow-up on how you addressed or skipped the comments. I appreciate the context, but I can only save feedback tied to specific code review findings to memory, not higher-level PR decisions or general review responses.

@alexkuzmik alexkuzmik merged commit 3028b0d into main Apr 1, 2026
25 checks passed
@alexkuzmik alexkuzmik deleted the alexkuzmik/ts-sdk-evaluation-suites-performance-batch branch April 1, 2026 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Including test files, or tests related like configuration. TypeScript SDK typescript *.ts *.tsx

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants