fix(sem_search): add semantic search quality evaluation with llm judge and improve descriptions#2370
fix(sem_search): add semantic search quality evaluation with llm judge and improve descriptions #2370 — tusharmath merged 12 commits into main from
Conversation
Ensure EXPECTED_TYPES and SHOULD_AVOID are always passed - Always pass both parameters to llm_judge.ts, even when empty - Fixes issue where llm_judge.ts requires both parameters unconditionally - Sets empty string defaults for missing EXPECTED_TYPES and SHOULD_AVOID Co-Authored-By: ForgeCode <noreply@forgecode.dev>
|
Fixed in commit 65b896c ✅ The review comment was correct — the script was conditionally adding parameters that llm_judge.ts requires unconditionally. Changes made:
Co-Authored-By: ForgeCode <noreply@forgecode.dev> |
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | ||
| ./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then | ||
| echo " ✓ PASSED (score >= 70)" | ||
| PASSED=$((PASSED + 1)) | ||
| else | ||
| echo " ✗ FAILED (expected pass)" | ||
| FAILED=$((FAILED + 1)) | ||
| fi | ||
|
|
||
| # Test 2: Documentation queries - should PASS (may be marginal) | ||
| echo "Test 2: Documentation queries..." | ||
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | ||
| ./run_eval.sh /tmp/test_semantic_eval/doc_context.json > /dev/null 2>&1; then | ||
| echo " ✓ PASSED (score >= 70)" | ||
| PASSED=$((PASSED + 1)) | ||
| else | ||
| echo " ✗ FAILED (expected pass)" | ||
| FAILED=$((FAILED + 1)) | ||
| fi | ||
|
|
||
| # Test 3: Bad queries - should FAIL | ||
| echo "Test 3: Bad queries (generic keywords)..." | ||
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | ||
| ./run_eval.sh /tmp/test_semantic_eval/bad_context.json > /dev/null 2>&1; then | ||
| echo " ✗ FAILED (expected failure, got pass)" | ||
| FAILED=$((FAILED + 1)) | ||
| else | ||
| echo " ✓ PASSED (correctly failed - score < 70)" | ||
| PASSED=$((PASSED + 1)) | ||
| fi | ||
|
|
||
| # Test 4: Missing sem_search - should FAIL early | ||
| echo "Test 4: Missing sem_search tool..." | ||
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | ||
| ./run_eval.sh /tmp/test_semantic_eval/no_sem_search_context.json > /dev/null 2>&1; then | ||
| echo " ✗ FAILED (expected early exit failure)" | ||
| FAILED=$((FAILED + 1)) | ||
| else | ||
| echo " ✓ PASSED (correctly failed early)" | ||
| PASSED=$((PASSED + 1)) |
There was a problem hiding this comment.
Hardcoded absolute path /Users/amit/code-forge/benchmarks/evals/semantic_search_quality will fail on any machine other than the original developer's. This breaks portability and will cause test failures in CI/CD or on other developer machines.
Fix: Use relative paths or dynamically determine the script directory:
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if cd "$SCRIPT_DIR" && \
./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then | |
| echo " ✓ PASSED (score >= 70)" | |
| PASSED=$((PASSED + 1)) | |
| else | |
| echo " ✗ FAILED (expected pass)" | |
| FAILED=$((FAILED + 1)) | |
| fi | |
| # Test 2: Documentation queries - should PASS (may be marginal) | |
| echo "Test 2: Documentation queries..." | |
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/doc_context.json > /dev/null 2>&1; then | |
| echo " ✓ PASSED (score >= 70)" | |
| PASSED=$((PASSED + 1)) | |
| else | |
| echo " ✗ FAILED (expected pass)" | |
| FAILED=$((FAILED + 1)) | |
| fi | |
| # Test 3: Bad queries - should FAIL | |
| echo "Test 3: Bad queries (generic keywords)..." | |
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/bad_context.json > /dev/null 2>&1; then | |
| echo " ✗ FAILED (expected failure, got pass)" | |
| FAILED=$((FAILED + 1)) | |
| else | |
| echo " ✓ PASSED (correctly failed - score < 70)" | |
| PASSED=$((PASSED + 1)) | |
| fi | |
| # Test 4: Missing sem_search - should FAIL early | |
| echo "Test 4: Missing sem_search tool..." | |
| if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/no_sem_search_context.json > /dev/null 2>&1; then | |
| echo " ✗ FAILED (expected early exit failure)" | |
| FAILED=$((FAILED + 1)) | |
| else | |
| echo " ✓ PASSED (correctly failed early)" | |
| PASSED=$((PASSED + 1)) | |
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | |
| if cd "$SCRIPT_DIR" && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then | |
| echo " ✓ PASSED (score >= 70)" | |
| PASSED=$((PASSED + 1)) | |
| else | |
| echo " ✗ FAILED (expected pass)" | |
| FAILED=$((FAILED + 1)) | |
| fi | |
| # Test 2: Documentation queries - should PASS (may be marginal) | |
| echo "Test 2: Documentation queries..." | |
| if cd "$SCRIPT_DIR" && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/doc_context.json > /dev/null 2>&1; then | |
| echo " ✓ PASSED (score >= 70)" | |
| PASSED=$((PASSED + 1)) | |
| else | |
| echo " ✗ FAILED (expected pass)" | |
| FAILED=$((FAILED + 1)) | |
| fi | |
| # Test 3: Bad queries - should FAIL | |
| echo "Test 3: Bad queries (generic keywords)..." | |
| if cd "$SCRIPT_DIR" && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/bad_context.json > /dev/null 2>&1; then | |
| echo " ✗ FAILED (expected failure, got pass)" | |
| FAILED=$((FAILED + 1)) | |
| else | |
| echo " ✓ PASSED (correctly failed - score < 70)" | |
| PASSED=$((PASSED + 1)) | |
| fi | |
| # Test 4: Missing sem_search - should FAIL early | |
| echo "Test 4: Missing sem_search tool..." | |
| if cd "$SCRIPT_DIR" && \ | |
| ./run_eval.sh /tmp/test_semantic_eval/no_sem_search_context.json > /dev/null 2>&1; then | |
| echo " ✗ FAILED (expected early exit failure)" | |
| FAILED=$((FAILED + 1)) | |
| else | |
| echo " ✓ PASSED (correctly failed early)" | |
| PASSED=$((PASSED + 1)) | |
| fi |
Spotted by Graphite Agent
Is this helpful? React 👍 or 👎 to let us know.
| CMD="npx tsx llm_judge.ts --context \"$CONTEXT_FILE\" --intent \"$INTENT\" --expected-file-types \"$EXPECTED_TYPES\" --should-avoid \"$SHOULD_AVOID\"" | ||
|
|
||
| eval $CMD |
There was a problem hiding this comment.
Using eval with string interpolation is a shell injection vulnerability. If $CONTEXT_FILE, $INTENT, $EXPECTED_TYPES, or $SHOULD_AVOID contain shell metacharacters, arbitrary commands could be executed.
# Fix: Execute command directly without eval
npx tsx llm_judge.ts --context "$CONTEXT_FILE" --intent "$INTENT" --expected-file-types "$EXPECTED_TYPES" --should-avoid "$SHOULD_AVOID"
| CMD="npx tsx llm_judge.ts --context \"$CONTEXT_FILE\" --intent \"$INTENT\" --expected-file-types \"$EXPECTED_TYPES\" --should-avoid \"$SHOULD_AVOID\"" | |
| eval $CMD | |
| npx tsx llm_judge.ts --context "$CONTEXT_FILE" --intent "$INTENT" --expected-file-types "$EXPECTED_TYPES" --should-avoid "$SHOULD_AVOID" |
Spotted by Graphite Agent
Is this helpful? React 👍 or 👎 to let us know.
No description provided.