[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh#5863
Conversation
…add remote refresh - Registry-first lookup in getLlmProvider() with enum fallback - New getStructuredOutputStrategy() on LlmProviderFactory encapsulating registry-based structured output resolution - Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id - Remote YAML fetch via HttpClient with 30s timeout and URL scheme validation - ScheduledExecutorService refreshes registry from CDN on configurable interval - 3-tier merge: classpath defaults → remote CDN → local customer override - Remote fetch failure is non-fatal at every level (logs warning, keeps previous) - remoteEnabled defaults to false — zero behavior change for existing deployments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Show resolved
Hide resolved
ldaugusto
left a comment
There was a problem hiding this comment.
Besides the reusage of our infrastructured mentioned in the comments below, I believe we could have tests mocking the remote return for loadRemoteResource(): non-200 responses, malformed YAML, or any other problem you can think of.
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
- Replace java.net.http.HttpClient with JAX-RS Client (injected via Guice) for consistency with WorkspaceNameService, OllamaService, RemoteAuthService - Replace ScheduledExecutorService with LlmModelRegistryRefreshJob Quartz job matching TraceThreadsClosingJob, DailyUsageReportJob patterns - Add remote fetch tests: successful merge, non-200 fallback, malformed YAML fallback, disabled remote skips fetch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…test Quartz @on cron replaced the ScheduledExecutorService, making the configurable interval unused. Removed to avoid misleading operators. Strengthened disabled-remote test to verify mock client is never called. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Show resolved
Hide resolved
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Show resolved
Hide resolved
Backend Tests - Integration Group 12225 tests 223 ✅ 2m 6s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 15280 tests 280 ✅ 6m 49s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 11193 tests 193 ✅ 4m 23s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 41 485 tests 1 484 ✅ 10m 2s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 5118 tests 118 ✅ 2m 37s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 1410 tests 410 ✅ 12m 55s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 10230 tests 227 ✅ 8m 52s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 2266 tests 266 ✅ 19m 53s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 13545 tests 545 ✅ 5m 45s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 6260 tests 260 ✅ 2m 52s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 3324 tests 324 ✅ 10m 24s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 16186 tests 186 ✅ 5m 23s ⏱️ Results for commit 0b291b2. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 14276 tests 276 ✅ 7m 43s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 8302 tests 300 ✅ 4m 2s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 9401 tests 398 ✅ 9m 33s ⏱️ Results for commit 4ce75f7. ♻️ This comment has been updated with latest results. |
ldaugusto
left a comment
There was a problem hiding this comment.
Great improvement, still one thing to do and another one to check.
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Outdated
Show resolved
Hide resolved
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Show resolved
Hide resolved
- Restore refreshIntervalSeconds in LlmModelRegistryConfig (was removed but still in config.yml — caught by Daniel) - Replace @on cron with programmatic registration via OpikGuiceyLifecycleEventListener.scheduleRepeatingJob(), matching TraceThreadsClosingJob pattern - Job only scheduled when remoteEnabled=true - Add comment explaining why no distributed lock is needed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| private void setLlmModelRegistryRefreshJob() { | ||
| LlmModelRegistryConfig registryConfig = injector.get().getInstance(OpikConfiguration.class) | ||
| .getLlmModelRegistry(); | ||
|
|
||
| if (!registryConfig.isRemoteEnabled()) { | ||
| log.info("LLM model registry remote refresh is disabled, skipping job setup"); | ||
| return; | ||
| } | ||
|
|
||
| scheduleRepeatingJob(LlmModelRegistryRefreshJob.class, | ||
| Duration.ofSeconds(registryConfig.getRefreshIntervalSeconds()), null); | ||
| } |
There was a problem hiding this comment.
setLlmModelRegistryRefreshJob has no tests for scheduling—should we add unit tests mocking GuiceJobManager/Injector to assert scheduleRepeatingJob is called with LlmModelRegistryRefreshJob when llmModelRegistry is remote enabled and not called when disabled?
Finding types: Reliable parameterized tests AI Coding Guidelines | Severity: 🟠 Medium
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/bi/OpikGuiceyLifecycleEventListener.java
around lines 254-265, the setLlmModelRegistryRefreshJob method schedules a repeating job
when LlmModelRegistryConfig.remoteEnabled is true and uses refreshIntervalSeconds, but
there are no tests validating this behavior. In
apps/opik-backend/src/test/java/com/comet/opik/infrastructure/bi/OpikGuiceyLifecycleEventListenerTest.java,
add two unit tests: (1) when LlmModelRegistryConfig.remoteEnabled is false, verify
scheduleRepeatingJob is NOT invoked for LlmModelRegistryRefreshJob; (2) when
remoteEnabled is true with a specific refreshIntervalSeconds value, verify
scheduleRepeatingJob is invoked exactly once for LlmModelRegistryRefreshJob and that the
Duration argument equals Duration.ofSeconds(refreshIntervalSeconds). Use mocks/stubs for
Injector/OpikConfiguration/LlmModelRegistryConfig and spy or verify interactions on the
listener to assert the gating and the derived interval.
ldaugusto
left a comment
There was a problem hiding this comment.
LGTM, thanks for addressing all the changes
Details
Wires the YAML-based LLM model registry (shipped in OPIK-5019) into the backend's provider routing and structured output detection. After this, new models added to the YAML are automatically routable — no Java enum changes needed.
Also adds remote YAML fetch from a configurable CDN URL with scheduled periodic refresh, enabling new models to reach running deployments without a redeploy. Remote fetch is disabled by default (
remoteEnabled: false), so no behavior change for existing deployments.Key changes:
getLlmProvider()with enum fallback (backward compatible)getStructuredOutputStrategy()onLlmProviderFactory— encapsulates registry-based structured output resolution, simplifying all 3 online scoring callersfindModel()disambiguates VertexAI/Gemini by qualifiedName vs bare idHttpClientwith 30s timeout, URL scheme validation (SSRF defense)ScheduledExecutorServicerefreshes registry on configurable interval (daemon thread)Rollout context (OPIK-4866)
This is step 2 of the externalization initiative. The CDN infrastructure is being set up by DevOps in OPIK-5241 (in parallel). Once the CDN URL is available, we can set
remoteEnabled: trueand add the URL into the config.Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
Unit tests (570 passing):
Tests cover:
findModel(): qualifiedName match → VertexAI, bare id → OpenAI, bare id with qualifiedName → empty (disambiguation), nonexistent → emptygetStructuredOutputStrategy(): registry model withstructuredOutput=true→ ToolCallingStrategy,structuredOutput=false→ InstructionStrategyLlmProviderFactoryTestcases pass through the new registry-first pathLlmModelRegistryServiceTestcases pass (9 existing + 4 new findModel tests)Quality checks:
Local Docker testing:
GET /api/v1/private/llm/modelsreturns 525 models across 5 providers with correctsnake_caseserializationDocumentation
N/A — backend-only, no user-visible documentation changes needed. Self-hosted configuration docs will be updated when the full externalization is complete (OPIK-5022).