
fix: Fix builtin evaluator edge cases #11405

Open
anticorrelator wants to merge 2 commits into main from dustin/fix-builtin-evaluator-correctness-and-availability-issues

anticorrelator (Contributor) commented Feb 12, 2026

  • Reject None, list, and dict values for string-typed template fields instead of silently coercing them (e.g. str(None) → "None")
  • Change case_sensitive default from True to False for ExactMatch and Levenshtein evaluators
  • Cap Levenshtein distance inputs at 5000 characters to prevent expensive O(n*m) computations
  • Add early-exit for identical strings in Levenshtein to skip unnecessary computation
  • Fix json_diff_count to treat int/float as equivalent (1 == 1.0) using math.isclose, and distinguish bool from int (True != 1)

Note

Medium Risk
Behavior changes in evaluator defaults and input casting/serialization can affect existing evaluation results and traces, though scope is limited to evaluator logic and covered by unit tests.

Overview
Hardens evaluator input handling by making cast_template_variable_types fail fast on None for string fields and JSON-serializing dict/list values (instead of Python str() output), which also changes LLM prompt/span inputs to use JSON strings.

Changes ExactMatchEvaluator and LevenshteinDistanceEvaluator to default case_sensitive to False, and adds guardrails to Levenshtein evaluation (5000-char length cap plus early-exit when strings already match). json_diff_count now treats int/float as numerically equivalent (via math.isclose) while distinguishing bool from int, with tests updated/added accordingly.

Written by Cursor Bugbot for commit 95de673.
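The casting behavior described in the overview (fail fast on None, JSON-serialize dict/list values for string fields) can be sketched roughly as follows. This is only an illustration under assumptions: `cast_string_fields` and its `string_keys` parameter are hypothetical stand-ins, since the real signature of `cast_template_variable_types` is not shown on this page, and the fallback to `str()` for other scalar types is also assumed.

```python
import json
from typing import Any


def cast_string_fields(variables: dict[str, Any], string_keys: set[str]) -> dict[str, Any]:
    """Hypothetical stand-in for cast_template_variable_types (string fields only)."""
    casted: dict[str, Any] = {}
    for key, value in variables.items():
        if key not in string_keys:
            # Non-string fields pass through untouched in this sketch.
            casted[key] = value
            continue
        if value is None:
            # Fail fast instead of producing the literal string "None".
            raise ValueError(f"Field '{key}' expects a string but got NoneType")
        if isinstance(value, (dict, list)):
            # JSON text rather than Python repr: {"a": 1}, not {'a': 1}.
            casted[key] = json.dumps(value, default=str)
        else:
            # Assumption: other scalars are coerced with str().
            casted[key] = str(value)
    return casted


result = cast_string_fields({"output": {"a": [1, None]}}, {"output"})
print(result["output"])  # {"a": [1, null]}
```

Note that the JSON form differs from `str()` output in quoting and in how `None` is rendered, which is why prompt and span inputs change.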

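The numeric comparison rule for `json_diff_count` (int and float treated as equivalent via `math.isclose`, bool kept distinct from int) might look like the sketch below. `scalars_equal` is a hypothetical helper name, not the PR's actual code, and the default `math.isclose` tolerances are an assumption because the PR text does not state them.

```python
import math
from typing import Any


def scalars_equal(expected: Any, actual: Any) -> bool:
    """Hypothetical leaf comparison; not the PR's actual helper."""
    # bool is a subclass of int in Python, so handle it first:
    # True == 1 evaluates to True, but as JSON values they differ.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return type(expected) is type(actual) and expected == actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        # Assumption: default tolerances (rel_tol=1e-09, abs_tol=0.0).
        return math.isclose(expected, actual)
    return bool(expected == actual)


print(scalars_equal(1, 1.0))   # True
print(scalars_equal(True, 1))  # False
```

Checking for bool before the numeric branch is the important design choice: without it, `isinstance(True, int)` would route booleans into the numeric comparison.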
anticorrelator requested a review from a team as a code owner February 12, 2026 23:40
github-project-automation bot moved this to 📘 Todo in phoenix Feb 12, 2026
dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) Feb 12, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


if value is None:
    raise ValueError(f"Field '{key}' expects a string but got NoneType")
if isinstance(value, (dict, list)):
    casted_template_variables[key] = json.dumps(value, default=str)

String schema still accepts containers

Medium Severity

cast_template_variable_types still converts dict and list values for "string" fields into JSON text via json.dumps instead of rejecting them. This keeps non-string inputs silently passing validation, so template variables that are structurally wrong continue to be treated as valid strings.
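If the intent is the stricter behavior stated in the PR description (reject containers rather than serialize them), the check could be sketched as below. `cast_string_field_strict` is a hypothetical helper for illustration only; whether other scalar types should still be coerced with `str()` is an assumption.

```python
from typing import Any


def cast_string_field_strict(key: str, value: Any) -> str:
    """Hypothetical strict variant: None, dict, and list are rejected outright."""
    if value is None or isinstance(value, (dict, list)):
        raise ValueError(
            f"Field '{key}' expects a string but got {type(value).__name__}"
        )
    # Assumption: remaining scalar types are still coerced to text.
    return str(value)
```

With this variant, a structurally wrong input such as `{"a": 1}` surfaces as a validation error instead of reaching the template as JSON text.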


if max(len(compare_expected), len(compare_actual)) > 5000:
    raise ValueError(
        "Inputs too long for Levenshtein distance (max 5000 characters)"
    )

Length cap blocks identical long strings

Low Severity

LevenshteinDistanceEvaluator enforces the 5000-character limit before checking compare_expected == compare_actual. This causes identical over-limit strings to return an error instead of distance 0, even though the early-exit path avoids the expensive levenshtein_distance computation.
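One way to address this is to run the equality check before the length cap. The following is a self-contained sketch, not the evaluator's real code: `evaluate_distance` is a hypothetical wrapper, the `levenshtein_distance` here is a plain dynamic-programming implementation written for the example, and case-insensitive comparison via `.lower()` is an assumption.

```python
MAX_LEN = 5000  # cap taken from the PR description


def levenshtein_distance(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def evaluate_distance(expected: str, actual: str, case_sensitive: bool = False) -> int:
    """Hypothetical wrapper; the real evaluator's interface is not shown here."""
    compare_expected = expected if case_sensitive else expected.lower()
    compare_actual = actual if case_sensitive else actual.lower()
    # Early exit first, so identical over-limit strings score 0 instead of erroring.
    if compare_expected == compare_actual:
        return 0
    if max(len(compare_expected), len(compare_actual)) > MAX_LEN:
        raise ValueError("Inputs too long for Levenshtein distance (max 5000 characters)")
    return levenshtein_distance(compare_expected, compare_actual)


print(evaluate_distance("Kitten", "SITTING"))  # 3
```

The equality check is linear in the input length, so moving it ahead of the cap does not reintroduce the quadratic cost the cap is meant to prevent.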


mikeldking assigned axiomofjoy and ehutt and unassigned axiomofjoy Mar 2, 2026