Skip to content

[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh#5863

Merged
AndreiCautisanu merged 9 commits intomainfrom
andreicautisanu/OPIK-5020-wire-registry-routing
Apr 8, 2026
Merged

[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh#5863
AndreiCautisanu merged 9 commits intomainfrom
andreicautisanu/OPIK-5020-wire-registry-routing

Conversation

@AndreiCautisanu
Copy link
Copy Markdown
Contributor

@AndreiCautisanu AndreiCautisanu commented Mar 25, 2026

Details

Wires the YAML-based LLM model registry (shipped in OPIK-5019) into the backend's provider routing and structured output detection. After this, new models added to the YAML are automatically routable — no Java enum changes needed.

Also adds remote YAML fetch from a configurable CDN URL with scheduled periodic refresh, enabling new models to reach running deployments without a redeploy. Remote fetch is disabled by default (remoteEnabled: false), so no behavior change for existing deployments.

Key changes:

  • Registry-first lookup in getLlmProvider() with enum fallback (backward compatible)
  • New getStructuredOutputStrategy() on LlmProviderFactory — encapsulates registry-based structured output resolution, simplifying all 3 online scoring callers
  • Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
  • Remote YAML fetch via HttpClient with 30s timeout, URL scheme validation (SSRF defense)
  • ScheduledExecutorService refreshes registry on configurable interval (daemon thread)
  • 3-tier merge: classpath defaults → remote CDN → local customer override
  • Remote fetch failure is non-fatal at every level (logs warning, keeps previous registry)

Rollout context (OPIK-4866)

This is step 2 of the externalization initiative. The CDN infrastructure is being set up by DevOps in OPIK-5241 (in parallel). Once the CDN URL is available, we can set remoteEnabled: true and add the URL into the config.

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5020
  • Part of OPIK-4866

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: Full implementation with human guidance on design decisions
    • Human verification: Code review (2x self-review iterations), local Docker testing, curl endpoint verification, Playground UI validation

Testing

Unit tests (570 passing):

cd apps/opik-backend && mvn test -Dtest="LlmProviderFactoryTest,LlmModelRegistryServiceTest"

Tests cover:

  • findModel(): qualifiedName match → VertexAI, bare id → OpenAI, bare id with qualifiedName → empty (disambiguation), nonexistent → empty
  • getStructuredOutputStrategy(): registry model with structuredOutput=true → ToolCallingStrategy, structuredOutput=false → InstructionStrategy
  • All 555 existing LlmProviderFactoryTest cases pass through the new registry-first path
  • All 13 LlmModelRegistryServiceTest cases pass (9 existing + 4 new findModel tests)

Quality checks:

mvn compile -DskipTests  # passes
mvn spotless:check        # passes

Local Docker testing:

  • Verified GET /api/v1/private/llm/models returns 525 models across 5 providers with correct snake_case serialization
  • Sent messages in Playground and scored traces using various models — full routing path works end-to-end through registry-first lookup

Documentation

N/A — backend-only, no user-visible documentation changes needed. Self-hosted configuration docs will be updated when the full externalization is complete (OPIK-5022).

…add remote refresh

- Registry-first lookup in getLlmProvider() with enum fallback
- New getStructuredOutputStrategy() on LlmProviderFactory encapsulating
  registry-based structured output resolution
- Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
- Remote YAML fetch via HttpClient with 30s timeout and URL scheme validation
- ScheduledExecutorService refreshes registry from CDN on configurable interval
- 3-tier merge: classpath defaults → remote CDN → local customer override
- Remote fetch failure is non-fatal at every level (logs warning, keeps previous)
- remoteEnabled defaults to false — zero behavior change for existing deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added java Pull requests that update Java code Backend tests Including test files, or tests related like configuration. labels Mar 25, 2026
@AndreiCautisanu AndreiCautisanu marked this pull request as ready for review March 25, 2026 18:35
@AndreiCautisanu AndreiCautisanu requested a review from a team as a code owner March 25, 2026 18:35
Copy link
Copy Markdown
Contributor

@ldaugusto ldaugusto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the reusage of our infrastructured mentioned in the comments below, I believe we could have tests mocking the remote return for loadRemoteResource(): non-200 responses, malformed YAML, or any other problem you can think of.

Andrei Căutișanu and others added 2 commits March 26, 2026 16:53
- Replace java.net.http.HttpClient with JAX-RS Client (injected via Guice)
  for consistency with WorkspaceNameService, OllamaService, RemoteAuthService
- Replace ScheduledExecutorService with LlmModelRegistryRefreshJob Quartz job
  matching TraceThreadsClosingJob, DailyUsageReportJob patterns
- Add remote fetch tests: successful merge, non-200 fallback, malformed YAML
  fallback, disabled remote skips fetch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…test

Quartz @on cron replaced the ScheduledExecutorService, making the
configurable interval unused. Removed to avoid misleading operators.
Strengthened disabled-remote test to verify mock client is never called.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 12

225 tests   223 ✅  2m 6s ⏱️
 39 suites    2 💤
 39 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 7

   10 files  ±0     10 suites  ±0   8m 51s ⏱️ +44s
1 241 tests ±0  1 241 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 237 runs   - 4  1 237 ✅  - 4  0 💤 ±0  0 ❌ ±0 

Results for commit 1c72467. ± Comparison against base commit 71bc039.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 15

280 tests   280 ✅  6m 49s ⏱️
 18 suites    0 💤
 18 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 11

193 tests   193 ✅  4m 23s ⏱️
 39 suites    0 💤
 39 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 4

1 485 tests   1 484 ✅  10m 2s ⏱️
    8 suites      1 💤
    8 files        0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 5

118 tests   118 ✅  2m 37s ⏱️
 25 suites    0 💤
 25 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 1

410 tests   410 ✅  12m 55s ⏱️
 23 suites    0 💤
 23 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 10

230 tests   227 ✅  8m 52s ⏱️
 20 suites    3 💤
 20 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 2

266 tests   266 ✅  19m 53s ⏱️
 20 suites    0 💤
 20 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 13

545 tests   545 ✅  5m 45s ⏱️
 31 suites    0 💤
 31 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 6

260 tests   260 ✅  2m 52s ⏱️
 28 suites    0 💤
 28 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 3

324 tests   324 ✅  10m 24s ⏱️
 31 suites    0 💤
 31 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 16

186 tests   186 ✅  5m 23s ⏱️
 14 suites    0 💤
 14 files      0 ❌

Results for commit 0b291b2.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 14

276 tests   276 ✅  7m 43s ⏱️
 16 suites    0 💤
 16 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 8

302 tests   300 ✅  4m 2s ⏱️
 24 suites    2 💤
 24 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 9

401 tests   398 ✅  9m 33s ⏱️
 35 suites    3 💤
 35 files      0 ❌

Results for commit 4ce75f7.

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Contributor

@ldaugusto ldaugusto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement, still one thing to do and another one to check.

- Restore refreshIntervalSeconds in LlmModelRegistryConfig (was removed
  but still in config.yml — caught by Daniel)
- Replace @on cron with programmatic registration via
  OpikGuiceyLifecycleEventListener.scheduleRepeatingJob(), matching
  TraceThreadsClosingJob pattern
- Job only scheduled when remoteEnabled=true
- Add comment explaining why no distributed lock is needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +254 to +265
private void setLlmModelRegistryRefreshJob() {
LlmModelRegistryConfig registryConfig = injector.get().getInstance(OpikConfiguration.class)
.getLlmModelRegistry();

if (!registryConfig.isRemoteEnabled()) {
log.info("LLM model registry remote refresh is disabled, skipping job setup");
return;
}

scheduleRepeatingJob(LlmModelRegistryRefreshJob.class,
Duration.ofSeconds(registryConfig.getRefreshIntervalSeconds()), null);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setLlmModelRegistryRefreshJob has no tests for scheduling—should we add unit tests mocking GuiceJobManager/Injector to assert scheduleRepeatingJob is called with LlmModelRegistryRefreshJob when llmModelRegistry is remote enabled and not called when disabled?

Finding types: Reliable parameterized tests AI Coding Guidelines | Severity: 🟠 Medium


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/bi/OpikGuiceyLifecycleEventListener.java
around lines 254-265, the setLlmModelRegistryRefreshJob method schedules a repeating job
when LlmModelRegistryConfig.remoteEnabled is true and uses refreshIntervalSeconds, but
there are no tests validating this behavior. In
apps/opik-backend/src/test/java/com/comet/opik/infrastructure/bi/OpikGuiceyLifecycleEventListenerTest.java,
add two unit tests: (1) when LlmModelRegistryConfig.remoteEnabled is false, verify
scheduleRepeatingJob is NOT invoked for LlmModelRegistryRefreshJob; (2) when
remoteEnabled is true with a specific refreshIntervalSeconds value, verify
scheduleRepeatingJob is invoked exactly once for LlmModelRegistryRefreshJob and that the
Duration argument equals Duration.ofSeconds(refreshIntervalSeconds). Use mocks/stubs for
Injector/OpikConfiguration/LlmModelRegistryConfig and spy or verify interactions on the
listener to assert the gating and the derived interval.

Copy link
Copy Markdown
Contributor

@ldaugusto ldaugusto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing all the changes

@AndreiCautisanu AndreiCautisanu merged commit f5e6239 into main Apr 8, 2026
71 checks passed
@AndreiCautisanu AndreiCautisanu deleted the andreicautisanu/OPIK-5020-wire-registry-routing branch April 8, 2026 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backend java Pull requests that update Java code tests Including test files, or tests related like configuration.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants