[issue-5397] [BE] Make ClickHouse health check timeout configurable#5414
[issue-5397] [BE] Make ClickHouse health check timeout configurable#5414
Conversation
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/ClickHouseHealthyCheck.java
Outdated
Show resolved
Hide resolved
Backend Tests Results 435 files 435 suites 1h 3m 22s ⏱️ Results for commit f9440e1. |
Backend Tests - Unit Tests1 433 tests 1 431 ✅ 55s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 7244 tests 244 ✅ 2m 1s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 14185 tests 185 ✅ 1m 40s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 5113 tests 113 ✅ 2m 11s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 11159 tests 157 ✅ 2m 35s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 12186 tests 185 ✅ 3m 22s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 15273 tests 273 ✅ 4m 43s ⏱️ Results for commit a29249d. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 16227 tests 225 ✅ 5m 23s ⏱️ Results for commit 3ccc041. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 8288 tests 288 ✅ 4m 1s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 61 129 tests 1 129 ✅ 6m 40s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 4 5 files 5 suites 3m 5s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 1413 tests 413 ✅ 13m 53s ⏱️ Results for commit a29249d. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 9326 tests 322 ✅ 8m 40s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 3307 tests 307 ✅ 9m 32s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 10 22 files 22 suites 6m 29s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 13430 tests 428 ✅ 7m 53s ⏱️ Results for commit 781f4f4. |
Backend Tests - Integration Group 2256 tests 256 ✅ 17m 38s ⏱️ Results for commit 781f4f4. |
TS SDK E2E Tests - Node 22236 tests 234 ✅ 19m 7s ⏱️ Results for commit 781f4f4. |
TS SDK E2E Tests - Node 20236 tests 234 ✅ 19m 24s ⏱️ Results for commit 781f4f4. |
TS SDK E2E Tests - Node 18236 tests 234 ✅ 22m 21s ⏱️ Results for commit 781f4f4. |
Python SDK E2E Tests Results (Python 3.13)243 tests 241 ✅ 8m 53s ⏱️ Results for commit 781f4f4. |
Python SDK E2E Tests Results (Python 3.10)243 tests 241 ✅ 9m 15s ⏱️ Results for commit 781f4f4. |
Python SDK E2E Tests Results (Python 3.12)243 tests 241 ✅ 8m 38s ⏱️ Results for commit 781f4f4. |
Python SDK E2E Tests Results (Python 3.14)243 tests 241 ✅ 8m 42s ⏱️ Results for commit 781f4f4. |
Python SDK E2E Tests Results (Python 3.11)243 tests 241 ✅ 9m 10s ⏱️ Results for commit 781f4f4. |
andrescrz
left a comment
There was a problem hiding this comment.
LGTM, all optional comments, except the missing new param in the test config file. Please add that before moving forward.
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/DatabaseAnalyticsModule.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/ClickHouseHealthyCheck.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/DatabaseAnalyticsFactory.java
Outdated
Show resolved
Hide resolved
|
Hi! Any updates on this PR ? We are waiting so much 🙏 |
781f4f4 to
1d6c12b
Compare
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/ClickHouseHealthyCheck.java
Show resolved
Hide resolved
…able Add ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS env var (default: 1) to allow overriding the hardcoded 1-second timeout in ClickHouseHealthyCheck. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…heckTimeout - Change healthCheckTimeout field type from int to Dropwizard Duration in DatabaseAnalyticsFactory, allowing flexible config values (e.g. 1s, 500ms) - Rename @nAmed binding to snake_case "health_check_timeout" - Update config.yml and config-test.yml to use Duration string format - Env var renamed from ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS to ANALYTICS_DB_HEALTH_CHECK_TIMEOUT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
63e64f5 to
e035e44
Compare
|
@andrescrz I think your review comments have been addressed. |
andrescrz
left a comment
There was a problem hiding this comment.
LGTM, just a couple of minor comments to follow-up. The one about the injector naming is highly recommended.
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/ClickHouseHealthyCheck.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/db/ClickHouseHealthyCheck.java
Outdated
Show resolved
Hide resolved
…nversion - Rename @nAmed binding from "health_check_timeout" to "clickhouse_health_check_timeout" for clarity (per andrescrz) - Replace manual Duration.ofMillis(toMilliseconds()) with toJavaDuration() (per andrescrz) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Details
The ClickHouse health check had a hardcoded 1-second timeout, causing pod restarts in environments where the round-trip to ClickHouse over HTTPS exceeds that threshold.
This PR introduces a new environment variable
ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS(default:1) so operators can tune the timeout without code changes.Changes:
DatabaseAnalyticsFactory.java: AddedhealthCheckTimeoutSecondsfield (default1) mapped from configconfig.yml: AddedhealthCheckTimeoutSeconds: ${ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS:-1}underdatabaseAnalyticsDatabaseAnalyticsModule.java: Exposes the value as a@NamedGuice bindingClickHouseHealthyCheck.java: Injects the configured timeout and uses it in place of the hardcodedDuration.ofSeconds(1)Change checklist
Issues
Testing
To verify the new default behaviour is unchanged, start the backend normally — the health check continues to use a 1-second timeout.
To test a custom timeout, set the env var before starting the backend:
export ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS=10Then check the
/health-checkendpoint; the ClickHouse check should now allow up to 10 seconds before reporting unhealthy.Documentation
The new environment variable should be added to the deployment / environment variable reference docs:
ANALYTICS_DB_HEALTH_CHECK_TIMEOUT_SECONDS1