
[OPIK-3397] [SDK] Add token usage tracking to DSPy integration#4388

Merged
Lothiraldan merged 8 commits into main from lothiraldan/OPIK-3397-dspy-add-token-usage-to-trace
Dec 11, 2025

Conversation

@Lothiraldan
Contributor

Details

Add token usage tracking to the DSPy integration. Previously, the LM spans were missing token consumption data (prompt_tokens, completion_tokens, total_tokens).

Changes:

  • Extract token usage from DSPy's LM history in on_lm_end callback
  • Add cache_hit metadata to LM spans to identify cached responses
  • Implement message verification to handle concurrent LM calls safely
  • Add comprehensive tests for usage tracking and cache behavior

Implementation approach:

  • Store LM instance and expected messages during on_lm_start
  • In on_lm_end, access lm.history[-1] and verify messages match to prevent race conditions
  • Extract usage dict and convert via llm_usage.build_opik_usage_from_unknown_provider()
  • Detect cached responses via response.cache_hit attribute (DSPy/LiteLLM sets this)
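A minimal sketch of that flow (the dict bookkeeping and exact callback signatures here are assumptions for illustration; the real code lives in sdks/python/src/opik/integrations/dspy/callback.py and converts the usage dict via llm_usage.build_opik_usage_from_unknown_provider()):

```python
# Simplified sketch of the on_lm_start / on_lm_end flow described above.
# call_id keys, attribute names, and return shape are illustrative.

class UsageTrackingCallback:
    def __init__(self):
        # call_id -> (lm_instance, messages the LM was called with)
        self._lm_info = {}

    def on_lm_start(self, call_id, instance, inputs):
        # Remember which LM ran and the messages it received.
        self._lm_info[call_id] = (instance, inputs.get("messages"))

    def on_lm_end(self, call_id, outputs, exception=None):
        lm, expected_messages = self._lm_info.pop(call_id, (None, None))
        if lm is None or not getattr(lm, "history", None):
            return None
        entry = lm.history[-1]
        # Guard against concurrent LM calls: only trust the last history
        # entry if its messages match what we stored in on_lm_start.
        if entry.get("messages") != expected_messages:
            return None
        usage = entry.get("usage") or {}
        cache_hit = getattr(entry.get("response"), "cache_hit", None)
        return usage, cache_hit
```

The message-matching guard is what makes the `lm.history[-1]` access safe when several LM calls are in flight at once.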

Change checklist

  • User facing
  • Documentation update

Issues

Testing

  • test_dspy__happyflow - Updated to verify usage is present on LM spans
  • test_dspy__cache_disabled__usage_present_and_cache_hit_false - Verifies usage present and cache_hit=False when caching disabled
  • test_dspy__cache_enabled_first_call__has_usage_and_cache_hit_false - Verifies first call has usage and cache_hit=False
  • test_dspy__cache_enabled_and_response_cached__no_usage_and_cache_hit_true - Verifies cached call has no usage and cache_hit=True

All 12 DSPy integration tests pass.

Documentation

No documentation updates required - this is an internal enhancement to existing functionality.

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking
@Lothiraldan Lothiraldan requested a review from a team as a code owner December 8, 2025 17:23
Copilot AI review requested due to automatic review settings December 8, 2025 17:23
@github-actions github-actions bot added the python (Pull requests that update Python code), Python-SDK, and tests (Including test files, or tests related like configuration) labels Dec 8, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR adds token usage tracking to the DSPy integration by extracting usage data from DSPy's LM history after each LM call. Previously, LM spans were created without token consumption metrics (prompt_tokens, completion_tokens, total_tokens). The implementation stores LM instances and expected messages during on_lm_start, then extracts usage from lm.history[-1] in on_lm_end with message verification to handle concurrent calls safely. Additionally, the PR adds cache_hit metadata to distinguish cached responses.

Key changes:

  • Extract token usage from DSPy LM history and convert to OpikUsage format
  • Add cache detection via response.cache_hit attribute
  • Implement message verification to prevent race conditions in concurrent LM calls

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description

  • sdks/python/src/opik/integrations/dspy/callback.py: Added _extract_lm_info_from_history method to extract usage and cache status from LM history, updated _end_span signature to accept usage and metadata, stored LM info in on_lm_start for later extraction
  • sdks/python/tests/library_integration/dspy/test_dspy.py: Added custom matchers for usage dict and metadata validation, updated existing tests to verify usage presence, added three new tests for cache scenarios (disabled, first call, cached response)

- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher
…is None

Some DSPy versions set response.cache_hit=None instead of False for non-cached
responses. Normalize None to False for consistent behavior across versions.
- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability
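The cache_hit handling described in these two revisions can be sketched as a small helper (the function name and return convention are assumptions for illustration, not the actual implementation):

```python
def resolve_cache_hit(response, usage):
    """Resolve cache_hit across DSPy versions.

    Some DSPy versions set response.cache_hit=None (or omit the
    attribute entirely) instead of False for non-cached responses.
    When the attribute is unusable, infer cache status from the usage
    dict: cached responses report no token consumption.
    """
    cache_hit = getattr(response, "cache_hit", None)
    if cache_hit is not None:
        # Attribute present and set: normalize to a plain bool.
        return bool(cache_hit)
    # Attribute missing or None: an empty usage dict implies the
    # response came from the cache.
    return not usage
```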
@Lothiraldan Lothiraldan requested a review from Copilot December 8, 2025 17:59
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Address PR review comment - move import uuid from inside test functions
to the top of the file following Python conventions.
@github-actions
Contributor

github-actions bot commented Dec 8, 2025

SDK E2E Tests Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 39f0599. ± Comparison against base commit cd0c828.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

SDK Unit Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 39f0599.

Collaborator

@alexkuzmik alexkuzmik left a comment


Please take a look at this documentation about cost tracking in DSPy: https://dspy.ai/learn/programming/modules/?h=track_usag#how-do-i-track-lm-usage. I'm not saying you should rewrite your implementation, just advising you to double-check that we're not missing anything compared to the "official" approach.

IIRC, last time I looked into their code, this usage information was available in some of the callback methods.

Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()

This is simpler and uses the existing testlib utilities.
@Lothiraldan
Contributor Author

Please take a look at this documentation about cost tracking in DSPy: https://dspy.ai/learn/programming/modules/?h=track_usag#how-do-i-track-lm-usage. I'm not saying you should rewrite your implementation, just advising you to double-check that we're not missing anything compared to the "official" approach.

IIRC, last time I looked into their code, this usage information was available in some of the callback methods.

Thanks for the pointer! I investigated DSPy's official track_usage approach thoroughly.

TL;DR: Our lm.history[-1] approach is correct and actually better for our use case. DSPy's get_lm_usage() is designed for module-level aggregation, not per-LM-call tracking.

Key findings:

  1. on_lm_end doesn't receive usage - The outputs parameter is just a list of strings (the generated text), not a Prediction object. Usage is not passed to callbacks at the LM level.

  2. get_lm_usage() only works on the outermost module - In on_module_end, only the outermost module in a call chain has get_lm_usage() populated. Inner modules return None.

  3. Usage is aggregated, not per-call - For modules like dspy.ReAct that make multiple LM calls in a single forward(), get_lm_usage() returns one aggregated total, not per-call breakdown.

Example with ReAct (4 LM calls in one forward):

  • Our approach: 4 individual usages (584/85, 678/109, 797/126, 542/170 tokens)
  • get_lm_usage(): 1 aggregated total (2601/490 tokens)

With get_lm_usage(), we'd have one usage value for 4 LM spans - losing granularity and making per-span cost attribution impossible.
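The arithmetic of that comparison can be checked with a quick sketch (the token counts are taken from the ReAct example above; the dict shape follows the usual prompt/completion usage keys and is illustrative):

```python
# Per-call usages extracted via lm.history, one per LM span.
per_call = [
    {"prompt_tokens": 584, "completion_tokens": 85},
    {"prompt_tokens": 678, "completion_tokens": 109},
    {"prompt_tokens": 797, "completion_tokens": 126},
    {"prompt_tokens": 542, "completion_tokens": 170},
]

# get_lm_usage() would instead yield one aggregated total for the whole
# forward() call, equivalent to summing the per-call entries:
aggregated = {
    "prompt_tokens": sum(u["prompt_tokens"] for u in per_call),
    "completion_tokens": sum(u["completion_tokens"] for u in per_call),
}
assert aggregated == {"prompt_tokens": 2601, "completion_tokens": 490}
```

With only the aggregated dict, the four LM spans would have to share a single usage value, which is what makes per-span cost attribution impossible.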

Why our approach is correct:

  • Captures usage at the LM call level (what we need for LM spans)
  • Works without requiring track_usage=True
  • Provides per-call breakdown for accurate cost tracking

The DSPy track_usage feature is designed for users who want to see total consumption at the program level, not for instrumentation/observability at the span level.

@Lothiraldan Lothiraldan merged commit 23e9381 into main Dec 11, 2025
104 checks passed
@Lothiraldan Lothiraldan deleted the lothiraldan/OPIK-3397-dspy-add-token-usage-to-trace branch December 11, 2025 10:23
juanferrub pushed a commit that referenced this pull request Dec 18, 2025
* [OPIK-3397] [SDK] Add token usage tracking to DSPy integration

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking

* Revision 2: Address PR review comments

- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher

* Revision 3: Fix cache_hit handling for DSPy versions where cache_hit is None

Some DSPy versions set response.cache_hit=None instead of False for non-cached
responses. Normalize None to False for consistent behavior across versions.

* Revision 3: Fix cache_hit detection for Python 3.9 compatibility

- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability

* Revision 4: Move uuid import to module level

Address PR review comment - move import uuid from inside test functions
to the top of the file following Python conventions.

* Revision 5: Replace custom matchers with ANY_DICT.containing()

Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()

This is simpler and uses the existing testlib utilities.

Labels

python (Pull requests that update Python code), Python-SDK, tests (Including test files, or tests related like configuration)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FR]: DSPy Integration - Add token usage to trace.

3 participants