Conversation
… log verbosity Map OpenAI rate_limit_exceeded error to HTTP 429 so it's treated as a non-retryable client error, preventing unnecessary Redis stream retries. Remove full stack traces from error/retry logs across the online scoring pipeline to drastically cut log volume.
📋 PR Linter Failed❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the |
Backend Tests - Integration Group 5127 tests 127 ✅ 3m 13s ⏱️ Results for commit 702bbe6. ♻️ This comment has been updated with latest results. |
00faf36 to
f4595b1
Compare
apps/opik-backend/src/main/java/com/comet/opik/api/resources/v1/events/BaseRedisSubscriber.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/api/resources/v1/events/BaseRedisSubscriber.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/api/resources/v1/events/BaseRedisSubscriber.java
Show resolved
Hide resolved
...in/java/com/comet/opik/api/resources/v1/events/OnlineScoringTraceThreadLlmAsJudgeScorer.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/domain/llm/ChatCompletionService.java
Show resolved
Hide resolved
...k-backend/src/test/java/com/comet/opik/infrastructure/llm/openai/OpenAiErrorMessageTest.java
Show resolved
Hide resolved
andrescrz
left a comment
There was a problem hiding this comment.
This generally LGTM, just a bit of polishing before moving forward. Mostly about not losing debugging information and about double checking the quotas HTTP status code. The rest is minor.
apps/opik-backend/src/main/java/com/comet/opik/api/resources/v1/events/BaseRedisSubscriber.java
Outdated
Show resolved
Hide resolved
...in/java/com/comet/opik/api/resources/v1/events/OnlineScoringTraceThreadLlmAsJudgeScorer.java
Show resolved
Hide resolved
.../opik-backend/src/main/java/com/comet/opik/infrastructure/llm/openai/OpenAiErrorMessage.java
Outdated
Show resolved
Hide resolved
| var error = new OpenAiErrorMessage( | ||
| new OpenAiErrorMessage.OpenAiError("some error message", errorCode, "some_type")); |
There was a problem hiding this comment.
Very nit: Add builders (with to builder) and use them, to improve our code base.
Comment applies to all tests in this PR.
...k-backend/src/test/java/com/comet/opik/infrastructure/llm/openai/OpenAiErrorMessageTest.java
Outdated
Show resolved
Hide resolved
...k-backend/src/test/java/com/comet/opik/infrastructure/llm/openai/OpenAiErrorMessageTest.java
Outdated
Show resolved
Hide resolved
- Keep full Throwable in log calls (restore stack traces) but keep warn level for handled errors in BaseRedisSubscriber and ChatCompletionService - Consolidate duplicate log in OnlineScoringTraceThreadLlmAsJudgeScorer into single log.error with Throwable - Map insufficient_quota to 402 instead of 429 - Fix test naming convention, use whole-object assertions, merge standalone tests into parameterized
andrescrz
left a comment
There was a problem hiding this comment.
LGTM, just an optional comment that I didn't notice on the 1st review.
| .ifPresent(llmProviderError -> failHandlingLLMProviderError(runtimeException, llmProviderError)); | ||
|
|
||
| log.error(UNEXPECTED_ERROR_CALLING_LLM_PROVIDER, runtimeException); | ||
| log.warn(UNEXPECTED_ERROR_CALLING_LLM_PROVIDER, runtimeException); |
There was a problem hiding this comment.
Sorry as I didn't notice on the first review.
I'd maybe keep these two changes on this file at error level , as they're real errors and re-thrown.
Even better, probably these logs can and should be removed as the exception below should be printed somewhere if not caught or if logged by some Dropwizard error handle.
I'd say the best scenario would be removing these two if you confirm logs are duplicated.
Details
Fix the Online Scoring retry storm generating ~28M
rate_limit_exceedederrors/week and ~70M log lines/day.The root cause is
OpenAiErrorMessage.getCode()not mappingrate_limit_exceededto HTTP 429, causing rate limit errors to be misclassified as 500 (retryable) instead of 429 (non-retryable) — triggering 3x Redis-level retries on top of langchain4j's inner retries.Additionally, removes full Java stack traces from all error/retry log paths, replacing them with single-line error messages.
rate_limit_exceededandinsufficient_quota→ 429,model_not_found→ 404 inOpenAiErrorMessageBaseRedisSubscriber,ChatCompletionService, andOnlineScoringTraceThreadLlmAsJudgeScorererror logsChange checklist
Issues
Testing
mvn test -Dtest="**/OpenAiErrorMessageTest"— 8 tests pass (new, parameterized tests for all error code → HTTP status mappings includingrate_limit_exceeded→ 429)mvn test -Dtest="**/BaseRedisSubscriberTest"— 14 tests pass (verified new log format: error messages only, no stack traces)mvn test -Dtest="**/ChatCompletionServiceTest"— 11 tests passDocumentation