
[OPIK-3397] [SDK] Add token usage tracking to DSPy integration#4388

Merged
Lothiraldan merged 8 commits into main from lothiraldan/OPIK-3397-dspy-add-token-usage-to-trace
Dec 11, 2025

Conversation

@Lothiraldan
Contributor

Details

Add token usage tracking to the DSPy integration. Previously, the LM spans were missing token consumption data (prompt_tokens, completion_tokens, total_tokens).

Changes:

  • Extract token usage from DSPy's LM history in on_lm_end callback
  • Add cache_hit metadata to LM spans to identify cached responses
  • Implement message verification to handle concurrent LM calls safely
  • Add comprehensive tests for usage tracking and cache behavior

Implementation approach:

  • Store LM instance and expected messages during on_lm_start
  • In on_lm_end, access lm.history[-1] and verify messages match to prevent race conditions
  • Extract usage dict and convert via llm_usage.build_opik_usage_from_unknown_provider()
  • Detect cached responses via response.cache_hit attribute (DSPy/LiteLLM sets this)
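A minimal sketch of that flow (the dict bookkeeping and exact callback signatures here are assumptions for illustration; the real code lives in sdks/python/src/opik/integrations/dspy/callback.py and converts the usage dict via llm_usage.build_opik_usage_from_unknown_provider()):

```python
# Simplified sketch of the on_lm_start / on_lm_end flow described above.
# call_id keys, attribute names, and return shape are illustrative.

class UsageTrackingCallback:
    def __init__(self):
        # call_id -> (lm_instance, messages the LM was called with)
        self._lm_info = {}

    def on_lm_start(self, call_id, instance, inputs):
        # Remember which LM ran and the messages it received.
        self._lm_info[call_id] = (instance, inputs.get("messages"))

    def on_lm_end(self, call_id, outputs, exception=None):
        lm, expected_messages = self._lm_info.pop(call_id, (None, None))
        if lm is None or not getattr(lm, "history", None):
            return None
        entry = lm.history[-1]
        # Guard against concurrent LM calls: only trust the last history
        # entry if its messages match what we stored in on_lm_start.
        if entry.get("messages") != expected_messages:
            return None
        usage = entry.get("usage") or {}
        cache_hit = getattr(entry.get("response"), "cache_hit", None)
        return usage, cache_hit
```

The message-matching guard is what makes the `lm.history[-1]` access safe when several LM calls are in flight at once.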

Change checklist

  • User facing
  • Documentation update

Issues

Testing

  • test_dspy__happyflow - Updated to verify usage is present on LM spans
  • test_dspy__cache_disabled__usage_present_and_cache_hit_false - Verifies usage present and cache_hit=False when caching disabled
  • test_dspy__cache_enabled_first_call__has_usage_and_cache_hit_false - Verifies first call has usage and cache_hit=False
  • test_dspy__cache_enabled_and_response_cached__no_usage_and_cache_hit_true - Verifies cached call has no usage and cache_hit=True

All 12 DSPy integration tests pass.

Documentation

No documentation updates required - this is an internal enhancement to existing functionality.

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking
@Lothiraldan Lothiraldan requested a review from a team as a code owner December 8, 2025 17:23
Copilot AI review requested due to automatic review settings December 8, 2025 17:23
@github-actions github-actions bot added the python (Pull requests that update Python code), Python-SDK, and tests (Including test files, or tests related like configuration) labels Dec 8, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR adds token usage tracking to the DSPy integration by extracting usage data from DSPy's LM history after each LM call. Previously, LM spans were created without token consumption metrics (prompt_tokens, completion_tokens, total_tokens). The implementation stores LM instances and expected messages during on_lm_start, then extracts usage from lm.history[-1] in on_lm_end with message verification to handle concurrent calls safely. Additionally, the PR adds cache_hit metadata to distinguish cached responses.

Key changes:

  • Extract token usage from DSPy LM history and convert to OpikUsage format
  • Add cache detection via response.cache_hit attribute
  • Implement message verification to prevent race conditions in concurrent LM calls

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description

  • sdks/python/src/opik/integrations/dspy/callback.py: Added _extract_lm_info_from_history method to extract usage and cache status from LM history, updated _end_span signature to accept usage and metadata, stored LM info in on_lm_start for later extraction
  • sdks/python/tests/library_integration/dspy/test_dspy.py: Added custom matchers for usage dict and metadata validation, updated existing tests to verify usage presence, added three new tests for cache scenarios (disabled, first call, cached response)

- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher
…is None

Some DSPy versions set response.cache_hit=None instead of False for non-cached
responses. Normalize None to False for consistent behavior across versions.
- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability
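The cache_hit handling described in these two revisions can be sketched as a small helper (the function name and return convention are assumptions for illustration, not the actual implementation):

```python
def resolve_cache_hit(response, usage):
    """Resolve cache_hit across DSPy versions.

    Some DSPy versions set response.cache_hit=None (or omit the
    attribute entirely) instead of False for non-cached responses.
    When the attribute is unusable, infer cache status from the usage
    dict: cached responses report no token consumption.
    """
    cache_hit = getattr(response, "cache_hit", None)
    if cache_hit is not None:
        # Attribute present and set: normalize to a plain bool.
        return bool(cache_hit)
    # Attribute missing or None: an empty usage dict implies the
    # response came from the cache.
    return not usage
```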
@Lothiraldan Lothiraldan requested a review from Copilot December 8, 2025 17:59
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Address PR review comment - move import uuid from inside test functions
to the top of the file following Python conventions.
@github-actions
Contributor

github-actions bot commented Dec 8, 2025

SDK E2E Tests Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 39f0599. ± Comparison against base commit cd0c828.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

SDK Unit Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 39f0599.

Collaborator

@alexkuzmik alexkuzmik left a comment


Please take a look at this documentation about cost tracking in DSPy: https://dspy.ai/learn/programming/modules/?h=track_usag#how-do-i-track-lm-usage. I'm not saying you should rewrite your implementation, just advising you to double-check that we're not missing anything compared to the "official" approach.

IIRC, last time I looked into their code, this usage information was available in some of the callback methods.

Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()

This is simpler and uses the existing testlib utilities.
@Lothiraldan
Contributor Author

Please take a look at this documentation about cost tracking in DSPy: https://dspy.ai/learn/programming/modules/?h=track_usag#how-do-i-track-lm-usage. I'm not saying you should rewrite your implementation, just advising you to double-check that we're not missing anything compared to the "official" approach.

IIRC, last time I looked into their code, this usage information was available in some of the callback methods.

Thanks for the pointer! I investigated DSPy's official track_usage approach thoroughly.

TL;DR: Our lm.history[-1] approach is correct and actually better for our use case. DSPy's get_lm_usage() is designed for module-level aggregation, not per-LM-call tracking.

Key findings:

  1. on_lm_end doesn't receive usage - The outputs parameter is just a list of strings (the generated text), not a Prediction object. Usage is not passed to callbacks at the LM level.

  2. get_lm_usage() only works on the outermost module - In on_module_end, only the outermost module in a call chain has get_lm_usage() populated. Inner modules return None.

  3. Usage is aggregated, not per-call - For modules like dspy.ReAct that make multiple LM calls in a single forward(), get_lm_usage() returns one aggregated total, not per-call breakdown.

Example with ReAct (4 LM calls in one forward):

  • Our approach: 4 individual usages (584/85, 678/109, 797/126, 542/170 tokens)
  • get_lm_usage(): 1 aggregated total (2601/490 tokens)

With get_lm_usage(), we'd have one usage value for 4 LM spans - losing granularity and making per-span cost attribution impossible.
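The arithmetic of that comparison can be checked with a quick sketch (the token counts are taken from the ReAct example above; the dict shape follows the usual prompt/completion usage keys and is illustrative):

```python
# Per-call usages extracted via lm.history, one per LM span.
per_call = [
    {"prompt_tokens": 584, "completion_tokens": 85},
    {"prompt_tokens": 678, "completion_tokens": 109},
    {"prompt_tokens": 797, "completion_tokens": 126},
    {"prompt_tokens": 542, "completion_tokens": 170},
]

# get_lm_usage() would instead yield one aggregated total for the whole
# forward() call, equivalent to summing the per-call entries:
aggregated = {
    "prompt_tokens": sum(u["prompt_tokens"] for u in per_call),
    "completion_tokens": sum(u["completion_tokens"] for u in per_call),
}
assert aggregated == {"prompt_tokens": 2601, "completion_tokens": 490}
```

With only the aggregated dict, the four LM spans would have to share a single usage value, which is what makes per-span cost attribution impossible.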

Why our approach is correct:

  • Captures usage at the LM call level (what we need for LM spans)
  • Works without requiring track_usage=True
  • Provides per-call breakdown for accurate cost tracking

The DSPy track_usage feature is designed for users who want to see total consumption at the program level, not for instrumentation/observability at the span level.

@Lothiraldan Lothiraldan merged commit 23e9381 into main Dec 11, 2025
104 checks passed
@Lothiraldan Lothiraldan deleted the lothiraldan/OPIK-3397-dspy-add-token-usage-to-trace branch December 11, 2025 10:23
juanferrub pushed a commit that referenced this pull request Dec 18, 2025
* [OPIK-3397] [SDK] Add token usage tracking to DSPy integration

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from LM history
- Add cache_hit metadata to detect cached responses
- Verify history entry via messages matching for concurrent call safety
- Add tests for cache behavior and usage tracking

* Revision 2: Address PR review comments

- Fix #1: Use specific type hints (Tuple[Any, Optional[Any]]) instead of generic tuple
- Fix #2: Only add cache_hit to metadata when value is not None
- Fix #4: Rename misleading matcher to MetadataWithCreatedFromMatcher

* Revision 3: Fix cache_hit handling for DSPy versions where cache_hit is None

Some DSPy versions set response.cache_hit=None instead of False for non-cached
responses. Normalize None to False for consistent behavior across versions.

* Revision 3: Fix cache_hit detection for Python 3.9 compatibility

- Infer cache_hit from empty usage when response.cache_hit is None or missing
- Handle different DSPy versions that may not set cache_hit attribute
- Simplify logic with explicit conditionals for readability

* Revision 4: Move uuid import to module level

Address PR review comment - move import uuid from inside test functions
to the top of the file following Python conventions.

* Revision 5: Replace custom matchers with ANY_DICT.containing()

Address PR review feedback from @alexkuzmik:
- Replace _get_usage_dict_matcher() with ANY_DICT.containing()
- Replace _get_metadata_with_created_from_matcher() with ANY_DICT.containing()

This is simpler and uses the existing testlib utilities.

Labels

python (Pull requests that update Python code), Python-SDK, tests (Including test files, or tests related like configuration)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FR]: DSPy Integration - Add token usage to trace.

3 participants