A lightweight library and dashboard for debugging multi-step, non-deterministic decision systems.
X-Ray provides transparency into complex decision processes. Unlike traditional logging that tells you what happened, X-Ray tells you why decisions were made.
Perfect for debugging:
- LLM-powered search and filtering
- Multi-stage ranking algorithms
- Complex business rule pipelines
- Any non-deterministic workflow where you need to understand "why this output?"
```
xray-sdk/
├── sdk/python/           # Core X-Ray library
│   └── xray/             # Tracer, Step, Storage, Types
├── dashboard/            # Next.js visualization UI
├── demo/                 # Competitor selection demo
│   ├── competitor_selection.py
│   └── mock-data/
├── api/                  # FastAPI backend
└── executions/           # Execution traces (JSON)
```
```bash
# Install SDK
cd sdk/python
pip install -e .
cd ../..
```
```bash
# Run competitor selection demo
python3 demo/competitor_selection.py
```

This creates an execution trace showing a 3-step workflow:
- Generate search keywords (simulated LLM)
- Search for candidates (mock API)
- Apply filters and select best competitor
```bash
# Terminal 1: Start API
cd api
pip install -r requirements.txt
python main.py
```
```bash
# Terminal 2: Start Dashboard
cd dashboard
npm install
npm run dev
```

Visit http://localhost:3000 to view your execution traces!
Or use Docker:

```bash
docker-compose up
```

```python
from xray import XRayTracer

# Create tracer
tracer = XRayTracer(use_case="competitor_selection")

# Step 1: Generate keywords
with tracer.step("keyword_generation", "generation") as step:
    step.set_input({"product_title": "Water Bottle"})
    keywords = your_llm_call(...)
    step.set_output({"keywords": keywords})
    step.set_reasoning("Extracted key product attributes")

# Step 2: Search
with tracer.step("search", "search") as step:
    step.set_input({"keyword": keywords[0]})
    results = your_search_api(...)
    step.set_output({"candidates": results})

# Step 3: Apply filters
with tracer.step("apply_filters", "filter") as step:
    step.set_filters({"price_range": {"min": 10, "max": 50}})
    for candidate in results:
        # Evaluate each candidate
        step.add_evaluation(
            candidate_id=candidate["id"],
            candidate_data=candidate,
            qualified=passes_filters(candidate),
            filter_results=[...],  # Details on each filter
        )
    step.set_output({"selected": best_candidate})

# Save execution
tracer.save()
```

- Executions List - See all captured executions with status, timestamps, and step counts
- Execution Detail - Deep dive into each step's inputs, outputs, reasoning
- Filter Visualization - See exactly why candidates passed or failed filters
- Candidate Evaluations - View filter results for each candidate
- XRayTracer - Main entry point, manages execution lifecycle
- StepContext - Context manager for capturing step data (timing, errors)
- Storage - Simple JSON file storage (easily extensible to DB)
- Types - Strongly-typed data structures (ExecutionTrace, StepData, FilterResult)
- List View - Shows all executions
- Detail View - Visualizes complete decision trail
- StepView Component - Renders step with expandable sections
- `GET /api/executions` - List all executions
- `GET /api/executions/:id` - Get execution detail
- `DELETE /api/executions/:id` - Delete execution
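A minimal stdlib client for these endpoints might look like the sketch below. `BASE_URL` assumes the FastAPI backend listens on port 8000 (uvicorn's default); adjust it to match however `api/main.py` starts the server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed port; check api/main.py

def list_executions() -> list:
    # GET /api/executions — all captured executions
    with urllib.request.urlopen(f"{BASE_URL}/api/executions") as resp:
        return json.load(resp)

def get_execution(execution_id: str) -> dict:
    # GET /api/executions/:id — full trace for one execution
    with urllib.request.urlopen(f"{BASE_URL}/api/executions/{execution_id}") as resp:
        return json.load(resp)

def delete_execution(execution_id: str) -> None:
    # DELETE /api/executions/:id
    req = urllib.request.Request(
        f"{BASE_URL}/api/executions/{execution_id}", method="DELETE"
    )
    urllib.request.urlopen(req).close()
```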
| Aspect | Traditional Logging | X-Ray |
|---|---|---|
| Focus | Events | Decision reasoning |
| Data | Messages, errors | Candidates, filters, selections |
| Question | "What happened?" | "Why this output?" |
| Granularity | Function level | Business logic level |
Given a seller's product, find the best competitor to benchmark against:
Step 1: Keywords
- Input: Product title, category
- Output: Search keywords
- Reasoning: "Extracted material, capacity, features"
Step 2: Search
- Input: Keywords, limit
- Output: 50 candidate products
- Reasoning: "Fetched top results by relevance"
Step 3: Filters
- Input: 50 candidates, reference product
- Filters: Price range (0.5x-2x), Min rating (3.8★), Min reviews (100)
- Evaluations: Each candidate with pass/fail details
- Output: Selected competitor
- Reasoning: "Highest review count among qualified candidates"
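The Step 3 evaluation could be sketched as below, using a hypothetical `passes_filters` helper with the demo's thresholds (price 0.5x-2x of the reference product, rating ≥ 3.8, reviews ≥ 100). It returns both the overall verdict and the per-filter detail that would feed `step.add_evaluation()`.

```python
def passes_filters(candidate: dict, reference_price: float) -> tuple[bool, list[dict]]:
    """Evaluate one candidate against the demo's three filters.

    Returns (qualified, filter_results), where filter_results mirrors
    the per-filter detail recorded for each candidate evaluation.
    """
    checks = [
        ("price_range",
         0.5 * reference_price <= candidate["price"] <= 2.0 * reference_price),
        ("min_rating", candidate["rating"] >= 3.8),
        ("min_reviews", candidate["review_count"] >= 100),
    ]
    results = [{"filter": name, "passed": ok} for name, ok in checks]
    return all(ok for _, ok in checks), results
```

Picking the "highest review count among qualified candidates" then reduces to a `max()` over the candidates whose verdict is `True`.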
When debugging, you can immediately see:
- Which products failed which filters (and why)
- Whether the problem is bad keywords, too-strict filters, or poor ranking
- The complete context for every decision
With more time, I would add:
- Database storage - PostgreSQL instead of JSON files
- Real-time updates - WebSocket for live execution streaming
- Filter builder - Query executions by status, use case, date range
- Comparison view - Side-by-side comparison of executions
- Export - Download execution traces
- TypeScript SDK - For Node.js applications
- Replay mode - Re-run past executions with different parameters
- Cost tracking - For LLM token usage
- SDK: Python 3.11+, zero dependencies
- API: FastAPI, Uvicorn
- Dashboard: Next.js 15, React 19, TailwindCSS v4, TypeScript
- Storage: JSON files (easily extensible)
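As a sketch of why the JSON storage is easy to swap for a database: the storage layer only needs a save/load pair keyed by execution ID. The class and method names below are hypothetical, not the SDK's actual API.

```python
import json
from pathlib import Path

class JsonStorage:
    """Writes one JSON file per execution under executions/."""

    def __init__(self, root: str = "executions"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def save(self, execution_id: str, trace: dict) -> Path:
        # One file per execution keeps traces independently readable
        path = self.root / f"{execution_id}.json"
        path.write_text(json.dumps(trace, indent=2))
        return path

    def load(self, execution_id: str) -> dict:
        return json.loads((self.root / f"{execution_id}.json").read_text())
```

A PostgreSQL-backed implementation would expose the same two methods, so the tracer and API never need to know which backend is in use.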
MIT