[NA] [TS SDK] feat: indexed keys in LLMJudge schema, reasoning_effort, and UX improvements #6011
Conversation
…, and UX improvements

Port Python SDK PRs #5690 and #5677 to TypeScript SDK:
- Use indexed keys (assertion_1, assertion_2) instead of assertion text as JSON schema property names for cross-provider compatibility (Anthropic, OpenAI character limits)
- Refactor buildResponseSchema/parseResponse into ResponseSchema class
- Add reasoningEffort option to LLMJudge (defaults to "low")
- Add ---BEGIN/END--- delimiters around input/output in LLM judge prompt
- Move dashboard link inside result box, remove "Uploading results" message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…compatibility

Ensures UI-created LLM judge configs (where name="Correctness" but description="Whether the output is correct") deserialize correctly. Also fixes variables format to match Python SDK / backend convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📋 PR Linter Failed ❌ Missing Section. The description is missing the …
The result box with the dashboard link was only displayed when metrics were present. Moved getUrl() to processResults so the link is always shown, fixing the evaluate.test.ts regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sdks/typescript/src/opik/evaluation/results/EvaluationResultProcessor.ts
sdks/typescript/src/opik/evaluation/results/EvaluationResultProcessor.ts
Wrap experiment.getUrl() in try/catch so a missing dataset doesn't crash the evaluation results flow. The dashboard link is skipped gracefully if the URL cannot be resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test was missing a mock for Experiment.insert's underlying API call, causing unhandled 401 rejections in CI after test teardown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
awkoy
left a comment
Nice work overall -- the indexed keys approach and ResponseSchema class are solid improvements. A few issues below, the main one being that reasoningEffort isn't wired through to the actual model call.
```diff
   }),
   ...(this.seed !== undefined && { seed: this.seed }),
-  output: Output.object({ schema: responseSchema }),
+  output: Output.object({ schema: schema.responseSchema }),
```
`reasoningEffort` is stored and serialized into `toConfig()` (line 88), but it's never passed to `generateProviderResponse` here. The LLM never actually receives the reasoning effort setting at runtime. Should this be something like `reasoning_effort: this.reasoningEffort` in these options?
Commit 1de62ef addressed this comment by passing reasoning_effort: this.reasoningEffort into the options for generateProviderResponse, ensuring the runtime request uses the stored setting, and by reusing the cached response schema for the provider output parsing.
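A minimal sketch of the shape of that fix, covering only the option assembly (the type names, `buildProviderOptions` helper, and exact option fields here are assumptions from this thread, not the SDK's real signatures):

```typescript
// Hypothetical shapes -- the real SDK types may differ.
type JudgeState = { seed?: number; reasoningEffort?: string };
type ProviderOptions = { seed?: number; reasoning_effort?: string };

// Assemble the provider options, forwarding reasoningEffort so the
// runtime request actually carries the stored setting.
function buildProviderOptions(judge: JudgeState): ProviderOptions {
  return {
    // Conditional-spread pattern already used for `seed` in the diff above:
    // the key is only present when the value is defined.
    ...(judge.seed !== undefined && { seed: judge.seed }),
    ...(judge.reasoningEffort !== undefined && {
      reasoning_effort: judge.reasoningEffort,
    }),
  };
}
```

The conditional spread keeps undefined options out of the request payload entirely, rather than sending explicit `undefined` values.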
```typescript
function formatAssertionsList(assertions: string[]): string {
  return assertions.map((a, i) => `${i + 1}. ${a}`).join("\n");

reasoningEffort?: string;
```
Nit: this accepts any arbitrary string. Would a union type like `"low" | "medium" | "high"` be more appropriate, to prevent invalid values from being silently serialized into configs?
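One way the nit could be addressed (a sketch; the type alias and guard names are mine, not from the PR):

```typescript
// Restrict reasoningEffort to known values at compile time.
type ReasoningEffort = "low" | "medium" | "high";

interface LLMJudgeOptions {
  reasoningEffort?: ReasoningEffort;
}

// Runtime guard for configs deserialized from JSON, where the
// compile-time union alone cannot reject invalid strings.
function isReasoningEffort(value: unknown): value is ReasoningEffort {
  return value === "low" || value === "medium" || value === "high";
}
```

The runtime guard matters for `fromConfig()`-style paths: TypeScript's union only protects callers who construct the options in code, not values parsed from stored JSON.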
```typescript
const results: EvaluationScoreResult[] = [];

export class ResponseSchema {
  private readonly assertions: string[];
  private readonly fieldMapping: Map<string, string>;
```
The `assertions` field is assigned in the constructor but never read anywhere in the class -- only `fieldMapping` is used. Looks like dead code that can be removed.
Commit 1de62ef addressed this comment by removing the unused assertions field and its constructor assignment so only fieldMapping remains.
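For illustration, a trimmed class keeping only `fieldMapping` might look like this (a sketch under assumed names; the real `ResponseSchema` also builds the zod schema with `.describe()` and formats the assertion list):

```typescript
// Sketch: indexed keys (assertion_1, assertion_2, ...) keep JSON schema
// property names short and provider-safe; the Map recovers the original
// assertion text when parsing the model's response.
class ResponseSchemaSketch {
  private readonly fieldMapping: Map<string, string>;

  constructor(assertions: string[]) {
    this.fieldMapping = new Map(
      assertions.map((text, i) => [`assertion_${i + 1}`, text]),
    );
  }

  // Map the model's indexed-key response back to per-assertion results.
  parse(raw: Record<string, boolean>): { assertion: string; passed: boolean }[] {
    return [...this.fieldMapping.entries()].map(([key, assertion]) => ({
      assertion,
      passed: raw[key] === true,
    }));
  }
}
```

Because `Map` preserves insertion order, `parse()` returns results in the same order the assertions were declared.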
```typescript
const assertionsList = formatAssertionsList(this.assertions);
const schema = new ResponseSchema(this.assertions);
```
ResponseSchema is constructed on every score() call (and again in toConfig() at line 70). Since this.assertions is immutable after construction, this could be a single instance field initialized in the constructor. In batch evaluations this adds up to a lot of unnecessary Map + z.object() allocations.
Commit 1de62ef addressed this comment by introducing a single ResponseSchema instance stored on the judge (used in toConfig and score) instead of re-creating it on every call.
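The caching change might look roughly like this (a sketch; the class name and method bodies are simplified stand-ins, not the SDK's real code):

```typescript
// Minimal stand-in for the real ResponseSchema, just to show the shape.
class ResponseSchema {
  constructor(readonly assertions: string[]) {}
}

class LLMJudgeSketch {
  // Built once: this.assertions is immutable after construction, so the
  // schema (and its Map / zod allocations) can be shared by every call.
  private readonly schema: ResponseSchema;

  constructor(private readonly assertions: string[]) {
    this.schema = new ResponseSchema(assertions);
  }

  score(): ResponseSchema {
    return this.schema; // previously: new ResponseSchema(this.assertions)
  }

  toConfig(): ResponseSchema {
    return this.schema; // previously re-created here too
  }
}
```

In batch evaluations this turns N allocations per run into one per judge instance.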
…Effort, remove dead code

- Cache ResponseSchema as instance field instead of recreating on every score()/toConfig() call
- Pass reasoning_effort to generateProviderResponse so the LLM actually receives it at runtime
- Remove unused assertions field from ResponseSchema class

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressing PR review comments

@awkoy's comments — all fixed in 1de62ef
@baz-reviewer comments — skipping
Thanks for the detailed follow-up on how you addressed or skipped the comments. I appreciate the context, but I can only save feedback tied to specific code review findings to memory, not higher-level PR decisions or general review responses.
Details
Port Python SDK PRs #5690 and #5677 to the TypeScript SDK for cross-provider compatibility and UX parity.
Response schema (from #5690):
- Use indexed keys (`assertion_1`, `assertion_2`, ...) instead of assertion text as JSON schema property names — fixes Anthropic's `^[a-zA-Z0-9_.-]{1,64}$` requirement and OpenAI's 15k combined character limit
- Original assertion text is preserved via `.describe()` on each field
- Refactor `buildResponseSchema()` + `parseResponse()` into a `ResponseSchema` class with `formatAssertions()` and `parse()` methods

UX improvements (from #5677):
- Add `reasoningEffort` option to `LLMJudge` (defaults to `"low"`) for faster assertion checking, serialized as `customParameters.reasoning_effort` in config
- Add `---BEGIN INPUT---`/`---END INPUT---` and `---BEGIN OUTPUT---`/`---END OUTPUT---` delimiters in the LLM judge prompt so short agent outputs don't blend into description text

Cross-source compatibility:
- `fromConfig()` now reads `schema[i].description` (falling back to `schema[i].name`) to match Python SDK behavior and correctly handle UI-created configs where `name` is a short label (e.g. "Correctness") and `description` is the full assertion text
- `toConfig()` now writes `variables` as path-style `{"input": "input", "output": "output"}` matching Python SDK and backend convention

Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
- `npx vitest run tests/unit/evaluation/suite_evaluators/` (44 passed)
- `npx tsx examples/evaluation_suite_example.ts` — 3/3 items passed, 100% pass rate against local Opik server with real OpenAI LLM calls
- `gpt-5-nano` (single, multiple, long assertions)

Documentation
No documentation updates required — these are internal implementation changes. The public `LLMJudge` API gains an optional `reasoningEffort` parameter but existing usage is unaffected.