[OPIK-3397] [SDK] Add token usage tracking to DSPy integration #4388

Lothiraldan merged 8 commits into main
Conversation
- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking
Pull request overview
This PR adds token usage tracking to the DSPy integration by extracting usage data from DSPy's LM history after each LM call. Previously, LM spans were created without token consumption metrics (prompt_tokens, completion_tokens, total_tokens). The implementation stores LM instances and expected messages during on_lm_start, then extracts usage from lm.history[-1] in on_lm_end with message verification to handle concurrent calls safely. Additionally, the PR adds cache_hit metadata to distinguish cached responses.
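The flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual `callback.py` code: the class name and the shape of `lm.history` entries (dicts with `messages` and `usage` keys) are assumptions for the sketch.

```python
# Simplified sketch of the mechanism described above. The real implementation
# lives in sdks/python/src/opik/integrations/dspy/callback.py; names and the
# history-entry shape are assumptions.
from typing import Any, Dict, List, Optional, Tuple


class LMUsageTracker:
    def __init__(self) -> None:
        # call_id -> (LM instance, messages captured when the call started)
        self._pending: Dict[str, Tuple[Any, Any]] = {}

    def on_lm_start(self, call_id: str, lm: Any, messages: Any) -> None:
        # Remember which LM and which messages belong to this call so the
        # matching history entry can be identified later.
        self._pending[call_id] = (lm, messages)

    def on_lm_end(self, call_id: str) -> Optional[Dict[str, int]]:
        lm, expected_messages = self._pending.pop(call_id)
        history: List[Dict[str, Any]] = getattr(lm, "history", [])
        if not history:
            return None
        entry = history[-1]
        # Verify the last entry belongs to *this* call: under concurrency,
        # another call may have appended to history in the meantime.
        if entry.get("messages") != expected_messages:
            return None
        return entry.get("usage") or None
```

Returning `None` on a messages mismatch means a concurrent call simply produces a span without usage, rather than attributing another call's tokens to it.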
Key changes:
- Extract token usage from DSPy LM history and convert to OpikUsage format
- Add cache detection via the `response.cache_hit` attribute
- Implement message verification to prevent race conditions in concurrent LM calls
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `sdks/python/src/opik/integrations/dspy/callback.py` | Added `_extract_lm_info_from_history` method to extract usage and cache status from LM history, updated the `_end_span` signature to accept usage and metadata, stored LM info in `on_lm_start` for later extraction |
| `sdks/python/tests/library_integration/dspy/test_dspy.py` | Added custom matchers for usage dict and metadata validation, updated existing tests to verify usage presence, added three new tests for cache scenarios (disabled, first call, cached response) |
…is None

Some DSPy versions set `response.cache_hit=None` instead of `False` for non-cached responses. Normalize `None` to `False` for consistent behavior across versions.
- Infer cache_hit from empty usage when `response.cache_hit` is None or missing
- Handle different DSPy versions that may not set the cache_hit attribute
- Simplify logic with explicit conditionals for readability
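A minimal sketch of that normalization, assuming the behavior the commits describe (an empty usage dict signals a cached response); `normalize_cache_hit` is a hypothetical helper name, not the SDK's:

```python
# Hedged sketch of the cache_hit normalization described above. Assumes
# cached responses come back with empty or missing token usage.
from typing import Mapping, Optional


def normalize_cache_hit(cache_hit: Optional[bool], usage: Optional[Mapping]) -> bool:
    if cache_hit is not None:
        # Some DSPy versions set response.cache_hit explicitly.
        return bool(cache_hit)
    # cache_hit is None or missing on this DSPy version: infer it from usage,
    # since cached responses report no token consumption.
    return not usage
```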
Address PR review comment: move `import uuid` from inside the test functions to the top of the file, following Python conventions.
SDK Unit Tests Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit 39f0599.
Please take a look at this documentation about cost tracking from DSPy: https://dspy.ai/learn/programming/modules/?h=track_usag#how-do-i-track-lm-usage. I'm not saying you should rewrite your implementation, just advising you to double-check that we're not missing anything compared to the "official" approach.
IIRC, last time I looked into their code, this usage information was available in some of the callback methods.
Address PR review feedback from @alexkuzmik:

- Replace `_get_usage_dict_matcher()` with `ANY_DICT.containing()`
- Replace `_get_metadata_with_created_from_matcher()` with `ANY_DICT.containing()`

This is simpler and uses the existing testlib utilities.
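The real `ANY_DICT.containing()` lives in Opik's testlib; as a rough illustration of the idea, a "containing" matcher is an object that compares equal to any dict including a given subset of key/value pairs. This is a hypothetical re-implementation, not the actual testlib code:

```python
# Hypothetical minimal re-implementation of a "containing" dict matcher,
# illustrating the ANY_DICT.containing() idea used in the tests.
class DictContaining:
    def __init__(self, subset: dict):
        self.subset = subset

    def __eq__(self, other) -> bool:
        # Equal to any dict that includes every key/value pair in subset.
        if not isinstance(other, dict):
            return False
        return all(other.get(k) == v for k, v in self.subset.items())

    def __repr__(self) -> str:
        return f"DictContaining({self.subset!r})"
```

This lets a test assert on the keys it cares about (e.g. `created_from`) while ignoring incidental metadata.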
Thanks for the pointer! I investigated DSPy's official usage-tracking approach.

TL;DR: …

Key findings: …

Example with …

Why our approach is correct: the DSPy …
* [OPIK-3397] [SDK] Add token usage tracking to DSPy integration
  - Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
  - Add cache_hit metadata to detect cached responses
  - Verify history entry via messages matching for concurrent call safety
  - Add tests for cache behavior and usage tracking
* Revision 2: Address PR review comments
  - Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
  - Fix #2: Only add cache_hit to metadata when value is not None
  - Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher
* Revision 3: Fix cache_hit handling for DSPy versions where cache_hit is None
  - Some DSPy versions set response.cache_hit=None instead of False for non-cached responses. Normalize None to False for consistent behavior across versions.
* Revision 3: Fix cache_hit detection for Python 3.9 compatibility
  - Infer cache_hit from empty usage when response.cache_hit is None or missing
  - Handle different DSPy versions that may not set the cache_hit attribute
  - Simplify logic with explicit conditionals for readability
* Revision 4: Move uuid import to module level
  - Address PR review comment: move import uuid from inside test functions to the top of the file, following Python conventions.
* Revision 5: Replace custom matchers with ANY_DICT.containing()
  - Address PR review feedback from @alexkuzmik: replace _get_usage_dict_matcher() and _get_metadata_with_created_from_matcher() with ANY_DICT.containing(). This is simpler and uses the existing testlib utilities.
Details
Add token usage tracking to the DSPy integration. Previously, the LM spans were missing token consumption data (`prompt_tokens`, `completion_tokens`, `total_tokens`).

Changes:

- Extract token usage in the `on_lm_end` callback
- Add `cache_hit` metadata to LM spans to identify cached responses

Implementation approach:

- Store the LM instance and expected messages in `on_lm_start`
- In `on_lm_end`, access `lm.history[-1]` and verify messages match to prevent race conditions
- Convert usage with `llm_usage.build_opik_usage_from_unknown_provider()`
- Detect cache hits via the `response.cache_hit` attribute (DSPy/LiteLLM sets this)

Change checklist
Issues
Testing
- `test_dspy__happyflow` - Updated to verify usage is present on LM spans
- `test_dspy__cache_disabled__usage_present_and_cache_hit_false` - Verifies usage present and cache_hit=False when caching disabled
- `test_dspy__cache_enabled_first_call__has_usage_and_cache_hit_false` - Verifies first call has usage and cache_hit=False
- `test_dspy__cache_enabled_and_response_cached__no_usage_and_cache_hit_true` - Verifies cached call has no usage and cache_hit=True

All 12 DSPy integration tests pass.
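As a stand-in for the integration tests above (which exercise a real LM), the cache expectations can be illustrated with plain dicts. `summarize_call` is a hypothetical helper, not part of the SDK, and the history-entry shape is assumed:

```python
# Hypothetical stand-in illustrating what the cache tests assert: a first call
# carries token usage with cache_hit False, while a cached repeat carries no
# usage, which is reported as cache_hit True.
from typing import Any, Dict


def summarize_call(history_entry: Dict[str, Any]) -> Dict[str, Any]:
    usage = history_entry.get("usage") or None
    cache_hit = history_entry.get("cache_hit")
    if cache_hit is None:
        # Missing or None cache_hit: infer it from the absence of usage.
        cache_hit = usage is None
    return {"usage": usage, "cache_hit": cache_hit}
```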
Documentation
No documentation updates required - this is an internal enhancement to existing functionality.