fix(sem_search): add semantic search quality evaluation with llm judge and improve descriptions #2370

Merged
tusharmath merged 12 commits into main from fix-sem-search-query on Feb 11, 2026

Conversation

@amitksingh1490
Contributor

No description provided.

@github-actions (bot) added the label "type: feature" (Brand new functionality, features, pages, workflows, endpoints, etc.) on Feb 9, 2026
@amitksingh1490 changed the title from "feat(evals): add semantic search quality evaluation with llm judge" to "fix(sem_search): add semantic search quality evaluation with llm judge and improve descriptions" on Feb 9, 2026
@amitksingh1490 added the label "type: fix" (Iterations on existing features or infrastructure.) and removed the label "type: feature" (Brand new functionality, features, pages, workflows, endpoints, etc.) on Feb 9, 2026
Comment thread benchmarks/evals/semantic_search_quality/run_eval.sh Outdated
…void are always passed

- Always pass both parameters to llm_judge.ts, even when empty
- Fixes issue where llm_judge.ts requires both parameters unconditionally
- Sets empty string defaults for missing EXPECTED_TYPES and SHOULD_AVOID

Co-Authored-By: ForgeCode <noreply@forgecode.dev>
@amitksingh1490
Contributor Author

Fixed in commit 65b896c

The review comment was correct - the script was conditionally adding parameters that llm_judge.ts requires unconditionally.

Changes made:

  • Always pass both --expected-file-types and --should-avoid to llm_judge.ts
  • Set empty string defaults when EXPECTED_TYPES or SHOULD_AVOID are not provided
  • This ensures the command always has the required arguments, preventing "Missing required arguments" errors
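
The defaulting described in these bullets can be sketched as follows. This is a minimal illustration, not the code from commit 65b896c: `count_args` is a hypothetical stand-in for the real `npx tsx llm_judge.ts` call, used only to show that both flags and their (possibly empty) values are always passed.

```shell
# Default both variables to the empty string when they are unset or empty.
EXPECTED_TYPES="${EXPECTED_TYPES:-}"
SHOULD_AVOID="${SHOULD_AVOID:-}"

# Hypothetical stand-in for llm_judge.ts: reports how many arguments it received.
count_args() { echo "$#"; }

# Quoting keeps an empty value as its own (empty) argument instead of dropping it,
# so the callee always sees two flags and two values.
ARGC="$(count_args --expected-file-types "$EXPECTED_TYPES" --should-avoid "$SHOULD_AVOID")"
echo "argc=$ARGC"
```

Because each expansion is quoted, the argument count stays at four whether or not the variables carry a value, which is what a parser that requires both flags needs.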

Co-Authored-By: ForgeCode <noreply@forgecode.dev>

Comment thread benchmarks/evals/semantic_search_quality/task.yml Outdated
Comment on lines +17 to +56
if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \
./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then
echo " ✓ PASSED (score >= 70)"
PASSED=$((PASSED + 1))
else
echo " ✗ FAILED (expected pass)"
FAILED=$((FAILED + 1))
fi

# Test 2: Documentation queries - should PASS (may be marginal)
echo "Test 2: Documentation queries..."
if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \
./run_eval.sh /tmp/test_semantic_eval/doc_context.json > /dev/null 2>&1; then
echo " ✓ PASSED (score >= 70)"
PASSED=$((PASSED + 1))
else
echo " ✗ FAILED (expected pass)"
FAILED=$((FAILED + 1))
fi

# Test 3: Bad queries - should FAIL
echo "Test 3: Bad queries (generic keywords)..."
if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \
./run_eval.sh /tmp/test_semantic_eval/bad_context.json > /dev/null 2>&1; then
echo " ✗ FAILED (expected failure, got pass)"
FAILED=$((FAILED + 1))
else
echo " ✓ PASSED (correctly failed - score < 70)"
PASSED=$((PASSED + 1))
fi

# Test 4: Missing sem_search - should FAIL early
echo "Test 4: Missing sem_search tool..."
if cd /Users/amit/code-forge/benchmarks/evals/semantic_search_quality && \
./run_eval.sh /tmp/test_semantic_eval/no_sem_search_context.json > /dev/null 2>&1; then
echo " ✗ FAILED (expected early exit failure)"
FAILED=$((FAILED + 1))
else
echo " ✓ PASSED (correctly failed early)"
PASSED=$((PASSED + 1))
Contributor

Hardcoded absolute path /Users/amit/code-forge/benchmarks/evals/semantic_search_quality will fail on any machine other than the original developer's. This breaks portability and will cause test failures in CI/CD or on other developer machines.

Fix: Use relative paths or dynamically determine the script directory:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

if cd "$SCRIPT_DIR" && \
   ./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then
Suggested change
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if cd "$SCRIPT_DIR" && \
   ./run_eval.sh /tmp/test_semantic_eval/full_context.json > /dev/null 2>&1; then
    echo " ✓ PASSED (score >= 70)"
    PASSED=$((PASSED + 1))
else
    echo " ✗ FAILED (expected pass)"
    FAILED=$((FAILED + 1))
fi

# Test 2: Documentation queries - should PASS (may be marginal)
echo "Test 2: Documentation queries..."
if cd "$SCRIPT_DIR" && \
   ./run_eval.sh /tmp/test_semantic_eval/doc_context.json > /dev/null 2>&1; then
    echo " ✓ PASSED (score >= 70)"
    PASSED=$((PASSED + 1))
else
    echo " ✗ FAILED (expected pass)"
    FAILED=$((FAILED + 1))
fi

# Test 3: Bad queries - should FAIL
echo "Test 3: Bad queries (generic keywords)..."
if cd "$SCRIPT_DIR" && \
   ./run_eval.sh /tmp/test_semantic_eval/bad_context.json > /dev/null 2>&1; then
    echo " ✗ FAILED (expected failure, got pass)"
    FAILED=$((FAILED + 1))
else
    echo " ✓ PASSED (correctly failed - score < 70)"
    PASSED=$((PASSED + 1))
fi

# Test 4: Missing sem_search - should FAIL early
echo "Test 4: Missing sem_search tool..."
if cd "$SCRIPT_DIR" && \
   ./run_eval.sh /tmp/test_semantic_eval/no_sem_search_context.json > /dev/null 2>&1; then
    echo " ✗ FAILED (expected early exit failure)"
    FAILED=$((FAILED + 1))
else
    echo " ✓ PASSED (correctly failed early)"
    PASSED=$((PASSED + 1))
fi
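
The four blocks in this suggestion differ only in the command they run and in whether a zero exit status counts as success. One possible way to factor that out is sketched below. `run_case` is a hypothetical helper that does not appear in the PR, and `true` / `false` stand in for the `./run_eval.sh` invocations so the sketch runs on its own.

```shell
PASSED=0
FAILED=0

# run_case LABEL EXPECT CMD [ARGS...]
# Hypothetical helper: runs CMD quietly and compares its outcome
# ("pass" for exit status 0, "fail" otherwise) with EXPECT.
run_case() {
    label="$1"
    expect="$2"
    shift 2
    if "$@" > /dev/null 2>&1; then
        actual=pass
    else
        actual=fail
    fi
    if [ "$actual" = "$expect" ]; then
        echo " ✓ PASSED ($label)"
        PASSED=$((PASSED + 1))
    else
        echo " ✗ FAILED ($label: expected $expect, got $actual)"
        FAILED=$((FAILED + 1))
    fi
}

# Stand-in commands; a real script would pass ./run_eval.sh and a context file.
run_case "good queries" pass true
run_case "bad queries" fail false
run_case "regression example" pass false

echo "passed=$PASSED failed=$FAILED"
```

In the actual test script a call might look like `run_case "Test 1" pass ./run_eval.sh /tmp/test_semantic_eval/full_context.json`, issued after a single `cd "$SCRIPT_DIR"`, so the directory change is not repeated per test.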

Spotted by Graphite Agent


@tusharmath enabled auto-merge (squash) on February 11, 2026 07:55
@tusharmath merged commit 6d5db68 into main on Feb 11, 2026
9 checks passed
@tusharmath deleted the fix-sem-search-query branch on February 11, 2026 07:57
Comment on lines +69 to +71
CMD="npx tsx llm_judge.ts --context \"$CONTEXT_FILE\" --intent \"$INTENT\" --expected-file-types \"$EXPECTED_TYPES\" --should-avoid \"$SHOULD_AVOID\""

eval $CMD
Contributor

Using eval with string interpolation is a shell injection vulnerability. If $CONTEXT_FILE, $INTENT, $EXPECTED_TYPES, or $SHOULD_AVOID contain shell metacharacters, arbitrary commands could be executed.

# Fix: Execute command directly without eval
npx tsx llm_judge.ts --context "$CONTEXT_FILE" --intent "$INTENT" --expected-file-types "$EXPECTED_TYPES" --should-avoid "$SHOULD_AVOID"
Suggested change
- CMD="npx tsx llm_judge.ts --context \"$CONTEXT_FILE\" --intent \"$INTENT\" --expected-file-types \"$EXPECTED_TYPES\" --should-avoid \"$SHOULD_AVOID\""
- eval $CMD
+ npx tsx llm_judge.ts --context "$CONTEXT_FILE" --intent "$INTENT" --expected-file-types "$EXPECTED_TYPES" --should-avoid "$SHOULD_AVOID"
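
The difference between the two forms can be seen with a harmless payload. In this sketch `show_intent` is a hypothetical stand-in for the `llm_judge.ts` call and the `INTENT` value is made up for illustration. The string-plus-`eval` form re-parses the value as shell code, while the direct call passes it through as plain data.

```shell
# Hypothetical stand-in for the judge command: prints the value after --intent.
show_intent() { printf 'intent=[%s]\n' "$2"; }

# A value containing a command substitution (single quotes keep it literal here).
INTENT='$(echo INJECTED)'

# Pattern under review: build a command string, then eval it.
# eval parses the string again, so the embedded $(...) is executed.
CMD="show_intent --intent \"$INTENT\""
unsafe_out="$(eval "$CMD")"

# Suggested pattern: call the command directly with quoted expansions.
# The value is passed as a single argument and never re-parsed.
safe_out="$(show_intent --intent "$INTENT")"

echo "eval:   $unsafe_out"
echo "direct: $safe_out"
```

With `eval`, the output shows `INJECTED`, meaning the embedded command ran. With the direct call, the literal text `$(echo INJECTED)` reaches the command unchanged.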

Spotted by Graphite Agent


Labels

type: fix Iterations on existing features or infrastructure.

2 participants