Detailed benchmark data for graph-tool-call. The README contains a 3-row summary; this document contains the full pipeline, retrieval-only, competitive, large-scale, and LangChain agent results.
- Model used (LLM benchmarks): `qwen3:4b` (4-bit, Ollama), unless noted
- Pipelines compared: `baseline` (all tools), `retrieve-k3` / `k5` / `k10`, plus `+ embedding`, `+ ontology`
- Reproduce: see Reproduce at the bottom
The graph-tool-call benchmarks verify two things:
- Can performance be maintained or improved by giving the LLM only a subset of retrieved tools?
- Does the retriever itself rank the correct tools within the top K?
These are different questions. High Gold Tool Recall@K does not automatically translate into high end-to-end accuracy, because the LLM still has to pick the right tool from the candidate set.
- End-to-end Accuracy — did the LLM ultimately succeed in selecting the correct tool / performing the correct workflow?
- Gold Tool Recall@K — was the canonical gold tool included in the top K at the retrieval stage?
- Avg tokens — average tokens passed to the LLM
- Token reduction — token savings vs. baseline
The two accuracy metrics often diverge. Evaluations that accept alternative tools or equivalent workflows as correct may show End-to-end Accuracy that doesn't exactly match Gold Tool Recall@K.
`baseline` has no retrieval stage, so Gold Tool Recall@K does not apply.
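To make the distinction between the two accuracy metrics concrete, the sketch below computes both over a handful of made-up evaluation records. The record fields (`retrieved`, `gold`, `chosen`) and the function names are illustrative assumptions, not the benchmark runner's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    retrieved: List[str]  # ranked tool names returned by the retriever
    gold: str             # canonical gold tool for the query
    chosen: str           # tool the LLM finally called


def gold_recall_at_k(records: List[Record], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k retrieved tools."""
    hits = sum(r.gold in r.retrieved[:k] for r in records)
    return hits / len(records)


def end_to_end_accuracy(records: List[Record]) -> float:
    """Fraction of queries where the LLM's final choice equals the gold tool."""
    return sum(r.chosen == r.gold for r in records) / len(records)


# Hypothetical data: the gold tool is retrieved for 3 of 4 queries,
# but the LLM picks it for only 2 of them.
records = [
    Record(["getPet", "addPet", "deletePet"], "getPet", "getPet"),
    Record(["addPet", "updatePet", "getPet"], "updatePet", "addPet"),
    Record(["listOrders", "getOrder", "deleteOrder"], "getOrder", "getOrder"),
    Record(["getUser", "createUser", "loginUser"], "deleteUser", "getUser"),
]

print(f"Gold Tool Recall@3: {gold_recall_at_k(records, 3):.1%}")   # 75.0%
print(f"End-to-end Accuracy: {end_to_end_accuracy(records):.1%}")  # 50.0%
```

This strict version counts only the canonical gold tool as correct. An evaluation that also accepts alternative tools or equivalent workflows would need a set of acceptable answers per query, which is one way End-to-end Accuracy can end up above Gold Tool Recall@K.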
| Dataset | Tools | Pipeline | End-to-end Accuracy | Gold Tool Recall@K | Avg tokens | Token reduction |
|---|---|---|---|---|---|---|
| Petstore | 19 | baseline | 100.0% | — | 1,239 | — |
| Petstore | 19 | retrieve-k3 | 90.0% | 93.3% | 305 | 75.4% |
| Petstore | 19 | retrieve-k5 | 95.0% | 98.3% | 440 | 64.4% |
| Petstore | 19 | retrieve-k10 | 100.0% | 98.3% | 720 | 41.9% |
| GitHub | 50 | baseline | 100.0% | — | 3,302 | — |
| GitHub | 50 | retrieve-k3 | 85.0% | 87.5% | 289 | 91.3% |
| GitHub | 50 | retrieve-k5 | 87.5% | 87.5% | 398 | 87.9% |
| GitHub | 50 | retrieve-k10 | 90.0% | 92.5% | 662 | 79.9% |
| Mixed MCP | 38 | baseline | 96.7% | — | 2,741 | — |
| Mixed MCP | 38 | retrieve-k3 | 86.7% | 93.3% | 328 | 88.0% |
| Mixed MCP | 38 | retrieve-k5 | 90.0% | 96.7% | 461 | 83.2% |
| Mixed MCP | 38 | retrieve-k10 | 96.7% | 100.0% | 826 | 69.9% |
| Kubernetes core/v1 | 248 | baseline | 12.0% | — | 8,192 | — |
| Kubernetes core/v1 | 248 | retrieve-k5 | 78.0% | 91.0% | 1,613 | 80.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding | 80.0% | 94.0% | 1,728 | 78.9% |
| Kubernetes core/v1 | 248 | retrieve-k5 + ontology | 82.0% | 96.0% | 1,699 | 79.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding + ontology | 82.0% | 98.0% | 1,924 | 76.5% |
- Small/medium APIs (19–50 tools): the baseline is already strong. graph-tool-call's main value here is 64–91% token savings with `retrieve-k3` / `retrieve-k5`, at a cost of 5–15 points of end-to-end accuracy.
- Large APIs (248 tools): the baseline collapses to 12% due to context overload. graph-tool-call recovers performance to 78–82% by narrowing candidates through retrieval. At this scale it is less an optimization than a required retrieval layer.
`retrieve-k5` is the best default: it offers a good token/accuracy tradeoff, and on large datasets adding embedding/ontology yields further gains.
The table below measures retrieval quality before the LLM stage, using only BM25 + graph traversal (no embedding or ontology).
| Dataset | Tools | Gold Tool Recall@3 | Gold Tool Recall@5 | Gold Tool Recall@10 |
|---|---|---|---|---|
| Petstore | 19 | 93.3% | 98.3% | 98.3% |
| GitHub | 50 | 87.5% | 87.5% | 92.5% |
| Mixed MCP | 38 | 93.3% | 96.7% | 100.0% |
| Kubernetes core/v1 | 248 | 82.0% | 91.0% | 92.0% |
- Gold Tool Recall@K measures the retriever's ability to include the correct tool in the candidate set, not final LLM accuracy.
- On small datasets, `k=5` already achieves high recall.
- On large datasets, increasing `k` raises recall but also increases tokens passed to the LLM, so consider both.
- Petstore / Mixed MCP: `k=5` alone includes nearly all correct tools.
- GitHub: there's a recall gap between `k=5` and `k=10`; choose `k=10` if recall matters more than tokens.
- Kubernetes core/v1: even with 248 tools, `k=5` already achieves 91.0% gold recall. The retrieval stage alone compresses the candidate set dramatically while retaining most correct tools.
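One way to turn these observations into a setting is to pick the smallest measured `k` that meets a recall target. The sketch below uses the recall values from the retrieval-only table; the helper `smallest_k` and the target threshold are hypothetical and not part of graph-tool-call.

```python
from typing import Optional

# Gold Tool Recall@K values copied from the retrieval-only table.
RECALL = {
    "Petstore": {3: 0.933, 5: 0.983, 10: 0.983},
    "GitHub": {3: 0.875, 5: 0.875, 10: 0.925},
    "Mixed MCP": {3: 0.933, 5: 0.967, 10: 1.000},
    "Kubernetes core/v1": {3: 0.820, 5: 0.910, 10: 0.920},
}


def smallest_k(dataset: str, target: float) -> Optional[int]:
    """Return the smallest measured k whose recall meets the target, else None."""
    for k in sorted(RECALL[dataset]):
        if RECALL[dataset][k] >= target:
            return k
    return None


for name in RECALL:
    print(name, smallest_k(name, target=0.90))
```

With a 90% target this selects `k=3` for Petstore and Mixed MCP, `k=5` for Kubernetes core/v1, and `k=10` for GitHub. Since tokens grow with `k`, the smallest qualifying value is also the cheapest among the measured options.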
Comparison on the largest dataset (Kubernetes core/v1, 248 tools), all on top of retrieve-k5.
| Pipeline | End-to-end Accuracy | Gold Tool Recall@5 | Interpretation |
|---|---|---|---|
| retrieve-k5 | 78.0% | 91.0% | BM25 + graph alone is a strong baseline |
| + embedding | 80.0% | 94.0% | Recovers semantically-similar but differently-worded queries |
| + ontology | 82.0% | 96.0% | LLM-generated keywords/example queries significantly improve retrieval |
| + embedding + ontology | 82.0% | 98.0% | Accuracy maintained, gold recall at its highest |
- Embedding compensates for semantic similarity that BM25 misses.
- Ontology expands the searchable representation itself when descriptions are short or non-standard.
- Using both together yields limited extra end-to-end gains, but gold recall reaches its highest.
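The ontology effect can be illustrated with a toy lexical matcher. This is a deliberately simplified term-overlap score, not BM25 and not graph-tool-call's implementation; the tool description, the appended keywords, and the query are all made up for illustration.

```python
def overlap_score(query: str, document: str) -> int:
    """Count distinct query terms that also appear in the document (exact match)."""
    return len(set(query.lower().split()) & set(document.lower().split()))


# Hypothetical short tool description and hypothetical LLM-generated keywords.
description = "read namespaced pod log"
keywords = "logs output container stdout tail"
query = "show container logs"

before = overlap_score(query, description)
after = overlap_score(query, description + " " + keywords)

print(before)  # 0: "logs" does not match "log" exactly
print(after)   # 2: "container" and "logs" now match
```

A purely lexical retriever cannot match a query against terms that are absent from the indexed text, so enriching that text is one plausible reason recall improves when descriptions are short or non-standard.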
Compared 6 retrieval strategies across 9 datasets (19–1068 tools):
| Strategy | Recall@5 | MRR | Latency |
|---|---|---|---|
| Vector Only (≈bigtool) | 96.8% | 0.897 | 176ms |
| BM25 Only | 91.6% | 0.819 | 1.5ms |
| BM25 + Graph (default) | 91.6% | 0.819 | 14ms |
| Full Pipeline (with embedding) | 96.8% | 0.897 | 172ms |
Key finding: without embedding, BM25 + Graph achieves 91.6% Recall@5, competitive with vector search (96.8%) at roughly 12× lower latency (14 ms vs. 176 ms). With embedding enabled, performance matches pure vector search.
| Strategy | Recall@5 | MRR | Miss% |
|---|---|---|---|
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |
At 1068 tools, the baseline (passing all tool definitions) is impractical due to context size. graph-tool-call provides a working retrieval layer at this scale, where Vector Only and the Full Pipeline tie at 88.0% Recall@5 and BM25 + Graph reaches 78.0%.
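For reference, Recall@5, MRR, and Miss% can all be derived from the rank of the gold tool per query. The sketch below assumes 1-based ranks, `None` for a query whose gold tool was not retrieved, and a miss contributing 0 to MRR; whether the benchmark runner truncates ranks at some cutoff is not stated here, so treat this as the textbook definition rather than the runner's exact code.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Share of queries whose gold tool is ranked within the top k."""
    return sum(r is not None and r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical 1-based ranks of the gold tool for five queries (None = not retrieved).
ranks = [1, 2, 1, None, 4]

print(f"Recall@5: {recall_at_k(ranks, 5):.1%}")    # 80.0%
print(f"MRR: {mean_reciprocal_rank(ranks):.3f}")   # 0.550
print(f"Miss%: {1 - recall_at_k(ranks, 5):.1%}")   # 20.0%
```

Recall@5 only records whether the gold tool made the cut, while MRR also rewards placing it near the top, which is why two strategies with similar recall can still differ in MRR.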
End-to-end accuracy when 200 simple tools are registered and invoked through a LangChain agent.
- Direct (D) — all 200 tool definitions passed to the LLM at once
- Graph (G) — tools managed via graph-tool-call gateway (search → call, 2 turns)
| Model | D-Acc | G-Acc | D-Turns | G-Turns | D-Tokens | G-Tokens | Savings | D-Time | G-Time |
|---|---|---|---|---|---|---|---|---|---|
| gpt-4.1 | 60.0% | 80.0% | 1.0 | 2.0 | 52,587 | 6,639 | 87.4% | 15.5s | 17.6s |
| gpt-5.2 | 60.0% | 100.0% | 1.0 | 2.0 | 53,645 | 10,508 | 80.4% | 20.5s | 17.1s |
| gpt-5.4 | 60.0% | 100.0% | 1.0 | 2.0 | 60,035 | 14,049 | 76.6% | 18.2s | 17.0s |
| claude-sonnet-4-20250514 | 100.0% | 100.0% | 1.0 | 2.0 | 196,183 | 17,349 | 91.2% | 58.2s | 49.4s |
| claude-sonnet-4-6 | 100.0% | 100.0% | 1.0 | 2.0 | 198,665 | 20,074 | 89.9% | 67.0s | 69.4s |
| claude-haiku-4-5 | 100.0% | 100.0% | 1.0 | 2.0 | 197,845 | 19,714 | 90.0% | 23.7s | 22.8s |
Acc = accuracy, Turns = average agent turns, Tokens = total tokens, Savings = token reduction (D→G), Time = wall-clock.
- GPT-series models drop to 60% accuracy when all 200 tools are passed directly; graph-tool-call recovers to 80–100%.
- Claude-series models maintain 100% accuracy either way, but graph-tool-call delivers 89–91% token savings.
- Graph mode adds 1 extra turn (search → call) but total latency stays comparable or decreases thanks to smaller context.
- Across all models, token reduction ranges from 76.6% to 91.2%.
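The Savings column follows directly from the two token columns as `1 - G-Tokens / D-Tokens`. The short check below recomputes it from the table values; the function name `token_savings` is illustrative.

```python
# (D-Tokens, G-Tokens, reported Savings %) copied from the table above.
ROWS = {
    "gpt-4.1": (52_587, 6_639, 87.4),
    "gpt-5.2": (53_645, 10_508, 80.4),
    "gpt-5.4": (60_035, 14_049, 76.6),
    "claude-sonnet-4-20250514": (196_183, 17_349, 91.2),
    "claude-sonnet-4-6": (198_665, 20_074, 89.9),
    "claude-haiku-4-5": (197_845, 19_714, 90.0),
}


def token_savings(direct: int, graph: int) -> float:
    """Percentage of tokens saved when moving from Direct to Graph mode."""
    return (1 - graph / direct) * 100


for model, (d, g, reported) in ROWS.items():
    print(f"{model}: {token_savings(d, g):.1f}% (reported {reported}%)")
```

Each recomputed value rounds to the reported figure, so the savings are consistent with the raw token counts even though Graph mode uses two turns.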
```bash
# Retrieval quality only (fast, no LLM needed)
python -m benchmarks.run_benchmark
python -m benchmarks.run_benchmark -d k8s -v

# Pipeline benchmark (LLM comparison)
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b
python -m benchmarks.run_benchmark --mode pipeline \
  --pipelines baseline retrieve-k3 retrieve-k5 retrieve-k10

# Save baseline and compare across runs
python -m benchmarks.run_benchmark --mode pipeline --save-baseline
python -m benchmarks.run_benchmark --mode pipeline --diff
```

See `benchmarks/` for dataset definitions, ground truth, and the runner source.