[OPIK-5689] [BE] fix: stop retrying rate-limited LLM calls and reduce log verbosity #6132

Merged
ldaugusto merged 3 commits into main from daniela/opik-5689-fix-onlinescoring-retry-storm
Apr 8, 2026

ldaugusto commented Apr 8, 2026

Details

Fix the Online Scoring retry storm, which generates ~28M rate_limit_exceeded errors/week and ~70M log lines/day.
The root cause is that OpenAiErrorMessage.getCode() does not map rate_limit_exceeded to HTTP 429, so rate limit errors are misclassified as 500 (retryable) instead of 429 (non-retryable). This triggers 3x Redis-level retries on top of langchain4j's inner retries.
Additionally, this PR removes full Java stack traces from all error/retry log paths, replacing them with single-line error messages.

  • Map rate_limit_exceeded and insufficient_quota → 429, model_not_found → 404 in OpenAiErrorMessage
  • Remove full stack traces from BaseRedisSubscriber, ChatCompletionService, and OnlineScoringTraceThreadLlmAsJudgeScorer error logs
  • Note: the fix on OPIK-5605 already tackled part of this problem by reducing the error volume ~50%; this PR addresses the remaining root cause and logging issues
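
As a rough illustration of the mapping listed in the bullets above, here is a minimal Java sketch. The class name `OpenAiErrorCodeSketch`, the method `toHttpStatus`, and the handling of a null code are assumptions made for illustration; this is not the actual `OpenAiErrorMessage` code. The 500 fallback for unknown codes follows the Testing notes in this description.

```java
import java.util.Map;

// Hypothetical stand-in for the code-to-status mapping; not the real OpenAiErrorMessage class.
class OpenAiErrorCodeSketch {

    // Error codes named in this PR description, paired with the HTTP status they map to.
    private static final Map<String, Integer> STATUS_BY_CODE = Map.of(
            "rate_limit_exceeded", 429,
            "insufficient_quota", 429,
            "model_not_found", 404);

    // Unknown codes fall back to 500. Treating a null code the same way is an assumption.
    static int toHttpStatus(String errorCode) {
        if (errorCode == null) {
            return 500; // Map.of(...) maps reject null keys, so handle null explicitly
        }
        return STATUS_BY_CODE.getOrDefault(errorCode, 500);
    }

    public static void main(String[] args) {
        System.out.println(toHttpStatus("rate_limit_exceeded")); // prints 429
        System.out.println(toHttpStatus("model_not_found"));     // prints 404
        System.out.println(toHttpStatus("some_unknown_code"));   // prints 500
    }
}
```

A lookup table keeps each new code a one-line change, and the single fallback makes the "unknown means 500" behavior explicit.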

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5689

Testing

  • mvn test -Dtest="**/OpenAiErrorMessageTest" — 8 tests pass (new, parameterized tests for all error code → HTTP status mappings including rate_limit_exceeded → 429)
  • mvn test -Dtest="**/BaseRedisSubscriberTest" — 14 tests pass (verified new log format: error messages only, no stack traces)
  • mvn test -Dtest="**/ChatCompletionServiceTest" — 11 tests pass
  • Scenarios validated: known error codes map to correct HTTP statuses, unknown codes fall back to 500, null message returns null
  • Environment: local macOS, Docker (Redis testcontainer for BaseRedisSubscriberTest)
  • Staging deploy pending to validate log volume reduction in production

Documentation

… log verbosity

Map OpenAI rate_limit_exceeded error to HTTP 429 so it's treated as a
non-retryable client error, preventing unnecessary Redis stream retries.
Remove full stack traces from error/retry logs across the online scoring
pipeline to drastically cut log volume.
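
To make the effect of the status code on retries concrete, here is a small sketch of a retry decision keyed on HTTP status. The names `RetryPolicySketch`, `isRetryable`, and `attemptsFor` are hypothetical, and the rule "5xx is retryable, everything else is not" plus the count of 3 retries are assumptions based on the PR description, not the actual subscriber logic.

```java
// Hypothetical sketch of a status-based retry policy; names are illustrative, not Opik's API.
class RetryPolicySketch {

    // Assumption: up to 3 retries at the Redis stream level, as mentioned in the PR description.
    static final int MAX_RETRIES = 3;

    // Assumption: only 5xx server errors are retried; 4xx client errors (such as 429) are not.
    static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    // Total number of attempts a message would see for a given status.
    static int attemptsFor(int httpStatus) {
        return isRetryable(httpStatus) ? 1 + MAX_RETRIES : 1;
    }

    public static void main(String[] args) {
        // A rate limit error misclassified as 500 is attempted four times in total.
        System.out.println("500 -> " + attemptsFor(500) + " attempts");
        // The same error classified as 429 is attempted once.
        System.out.println("429 -> " + attemptsFor(429) + " attempt");
    }
}
```

Under these assumptions, correcting the classification alone cuts the outer attempts per rate-limited call from four to one, before counting any inner retries.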
ldaugusto requested a review from a team as a code owner April 8, 2026 13:06
github-actions bot added the java, Backend, and tests labels Apr 8, 2026
github-actions bot commented Apr 8, 2026

📋 PR Linter Failed

Missing Section. The description is missing the ## Details section.


Missing Section. The description is missing the ## Change checklist section.


Missing Section. The description is missing the ## Issues section.


Missing Section. The description is missing the ## Testing section.


Missing Section. The description is missing the ## Documentation section.

github-actions bot commented Apr 8, 2026

Backend Tests - Integration Group 5

127 tests: 127 passed ✅, 0 skipped 💤, 0 failed ❌ (30 suites, 30 files, 3m 13s ⏱️)

Results for commit 702bbe6.

♻️ This comment has been updated with latest results.

ldaugusto force-pushed the daniela/opik-5689-fix-onlinescoring-retry-storm branch from 00faf36 to f4595b1 April 8, 2026 13:13

andrescrz left a comment

This generally LGTM, just a bit of polishing before moving forward. Mostly about not losing debugging information and about double checking the quotas HTTP status code. The rest is minor.

Comment on lines +28 to +29
var error = new OpenAiErrorMessage(
        new OpenAiErrorMessage.OpenAiError("some error message", errorCode, "some_type"));

Very nit: add builders (with toBuilder support) and use them, to improve our code base.

Comment applies to all tests in this PR.

- Keep full Throwable in log calls (restore stack traces) but keep
  warn level for handled errors in BaseRedisSubscriber and
  ChatCompletionService
- Consolidate duplicate log in OnlineScoringTraceThreadLlmAsJudgeScorer
  into single log.error with Throwable
- Map insufficient_quota to 402 instead of 429
- Fix test naming convention, use whole-object assertions, merge
  standalone tests into parameterized
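
The trade-off in the first bullet of this commit message can be sketched as follows. This example uses `java.util.logging` from the JDK as a stand-in so that it is self-contained; the project's own logger and message constants are not reproduced here, and the class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: message-only logging versus passing the Throwable, both at warning level.
class LogThrowableSketch {

    private static final Logger LOG = Logger.getLogger(LogThrowableSketch.class.getName());

    // Message only: a single log line, but the stack trace is not recorded.
    static void logMessageOnly(RuntimeException e) {
        LOG.log(Level.WARNING, "Unexpected error calling LLM provider: {0}", e.getMessage());
    }

    // Full Throwable: the record carries the exception, so handlers can print the stack trace.
    static void logWithThrowable(RuntimeException e) {
        LOG.log(Level.WARNING, "Unexpected error calling LLM provider", e);
    }

    public static void main(String[] args) {
        RuntimeException e = new RuntimeException("rate limit exceeded");
        logMessageOnly(e);   // one line on the console
        logWithThrowable(e); // same level, followed by the stack trace
    }
}
```

Keeping the Throwable preserves debugging information, while lowering the level to warning is what separates handled errors from real failures in this sketch.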

andrescrz left a comment

LGTM, just an optional comment that I didn't notice on the 1st review.

  .ifPresent(llmProviderError -> failHandlingLLMProviderError(runtimeException, llmProviderError));

- log.error(UNEXPECTED_ERROR_CALLING_LLM_PROVIDER, runtimeException);
+ log.warn(UNEXPECTED_ERROR_CALLING_LLM_PROVIDER, runtimeException);

Sorry, I didn't notice this on the first review.

I'd maybe keep these two changes in this file at error level, as they're real errors and are re-thrown.

Even better, these logs can and probably should be removed, as the exception below should be printed somewhere if it is not caught, or logged by some Dropwizard error handler.

I'd say the best scenario would be removing these two if you confirm logs are duplicated.

ldaugusto merged commit a0f8ef6 into main Apr 8, 2026
76 checks passed
ldaugusto deleted the daniela/opik-5689-fix-onlinescoring-retry-storm branch April 8, 2026 14:46