Replace Semaphore with ReleasableLock in Engine and Translog related classes#17304
Replace Semaphore with ReleasableLock in Engine and Translog related classes#17304ahnyujin wants to merge 4 commits intoopensearch-project:mainfrom
Conversation
|
❌ Gradle check result for 93c1fbc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
a70f320 to
fb61223
Compare
|
❌ Gradle check result for f82640b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
This PR is stalled because it has been open for 30 days with no activity. |
f82640b to
dc6c118
Compare
|
❌ Gradle check result for dc6c118: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
dc6c118 to
cdb76cc
Compare
|
❌ Gradle check result for cdb76cc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
cdb76cc to
276c933
Compare
|
❌ Gradle check result for 276c933: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
276c933 to
c7eb6ca
Compare
📝 WalkthroughWalkthroughReplaces Semaphore-based syncPermit with ReentrantLock wrapped in ReleasableLock to guard in-flight remote upload synchronization in RemoteFsTranslog. Updates control flow checks and assertions to use lock availability instead of permit counts, with corresponding test updates. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
❗ AI-powered Code-Diff-Analyzer found issues on commit c7eb6ca.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java (1)
564-583:⚠️ Potential issue | 🟡 Minor
drainSynclacks a lock-release guard if the assertion firesIf
pauseSync.compareAndSet(false, true)returnsfalse(concurrentdrainSynccalls on the same instance),resultisfalseand theassertat Line 568 fires. With assertions enabled anAssertionErroris thrown while the lock is held — there is nofinallyblock to callsyncPermit.close(), leaking the lock.A concurrent
drainSynccall is unlikely by contract, but a defensive try/finally would prevent the leak on assertion failure:🛡️ Suggested defensive fix
if (syncPermit.tryAcquire(timeValueMinutes(1L)) != null) { - boolean result = pauseSync.compareAndSet(false, true); - assert result && syncPermit.isHeldByCurrentThread(); + try { + boolean result = pauseSync.compareAndSet(false, true); + assert result && syncPermit.isHeldByCurrentThread(); + } catch (AssertionError e) { + syncPermit.close(); + throw e; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java` around lines 564 - 583, drainSync can throw an AssertionError after acquiring the syncPermit when pauseSync.compareAndSet(false,true) returns false, leaking the permit; change drainSync so that after a successful syncPermit.tryAcquire(...) you guard the compareAndSet and the remainder with a try/finally: if compareAndSet fails or any throwable occurs before returning the Releasable, ensure you call syncPermit.close() (and revert pauseSync if you set it) so the permit is always released, and preserve the existing Releasables.releaseOnce behavior for the normal path; specifically adjust logic around syncPermit.tryAcquire, pauseSync.compareAndSet(false,true), syncPermit.close(), and pauseSync.getAndSet(false) to release the permit on error paths.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.javaserver/src/test/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslogTests.javaserver/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.
Applied to files:
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
🪛 GitHub Actions: Gradle Check (Jenkins)
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
[error] 96-96: Diff analyzer reported a high-severity issue: Lock changed from instance-level to static (reentrantLock and syncPermit now static final). This can cause system-wide synchronization bottlenecks and potential DoS across shards/indices.
🔇 Additional comments (2)
server/src/test/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslogTests.java (1)
550-554: Test assertions correctly reflect the new lock-based semanticsLine 550's
assertBusy(() -> assertFalse(translog.isLockAvailable()))correctly polls until the upload thread acquires the lock, and Line 554'sassertFalse(translog.isLockAvailable())correctly validates thatdrainSynccontinues to hold the lock after the upload finishes. Both replacements are semantically equivalent to the removedavailablePermits() == 0checks.Note: with the static lock bug in
RemoteFsTranslog(Lines 97–98),isLockAvailable()reflects JVM-wide lock state rather than per-instance state. Once the static issue is fixed, these assertions will correctly scope to this translog instance.server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java (1)
888-892: LGTM — same correct adaptation as in the timestamp-aware testLine 888's
assertBusy(() -> assertFalse(translog.isLockAvailable()))replaces the equivalentavailablePermits() == 0busy-wait, and Line 892 correctly asserts the lock remains held afterdrainSync()is called. The test logic is sound and will correctly validate the expected behavior once the static lock issue is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`:
- Around line 402-406: prepareAndUpload leaks the Releasable acquired from
syncPermit.tryAcquire() on early returns and exception paths (the lock is only
released in upload()'s finally), so change the acquisition site in
prepareAndUpload to hold the returned Releasable in a try-with-resources /
try-finally that guarantees close() on every exit (including the generation <
current.getGeneration() early return and any exceptions before
readLock.acquire()); ensure the scope covers both the branch that calls
upload(...) and the branches that return early, and remove reliance on
upload()'s finally as the sole release point; reference symbols:
syncPermit.tryAcquire(), prepareAndUpload(...), upload(...), readLock.acquire(),
and drainSync().
- Around line 97-98: The lock fields ReentrantLock reentrantLock and
ReleasableLock syncPermit are declared static final causing JVM-wide cross-shard
contention; change them to instance fields by removing the static modifier so
each RemoteFsTranslog instance has its own ReentrantLock and corresponding
ReleasableLock; update any code that relies on these being class-level (e.g.,
usages in prepareAndUpload, drainSync(), and isLockAvailable()) to use the
instance fields instead so lock checks and timeouts apply per-shard rather than
JVM-wide.
---
Outside diff comments:
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`:
- Around line 564-583: drainSync can throw an AssertionError after acquiring the
syncPermit when pauseSync.compareAndSet(false,true) returns false, leaking the
permit; change drainSync so that after a successful syncPermit.tryAcquire(...)
you guard the compareAndSet and the remainder with a try/finally: if
compareAndSet fails or any throwable occurs before returning the Releasable,
ensure you call syncPermit.close() (and revert pauseSync if you set it) so the
permit is always released, and preserve the existing Releasables.releaseOnce
behavior for the normal path; specifically adjust logic around
syncPermit.tryAcquire, pauseSync.compareAndSet(false,true), syncPermit.close(),
and pauseSync.getAndSet(false) to release the permit on error paths.
| private static final ReentrantLock reentrantLock = new ReentrantLock(); | ||
| private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock); |
There was a problem hiding this comment.
static final on lock fields causes JVM-wide cross-shard contention — confirmed by CI
Both reentrantLock and syncPermit are declared static final, meaning every RemoteFsTranslog instance in the JVM (one per shard/index) shares a single lock. Consequences:
- A single long-running upload on shard A blocks
prepareAndUploadon all other shards until it completes. - A
drainSync()call during primary relocation on shard A stalls uploads on every unrelated shard for up to 1 minute (thetryAcquiretimeout). isLockAvailable()(Line 781) reflects whether any shard's upload is in progress, not only this instance's — making the predicate semantically wrong for per-shard decisions.
The fix is straightforward: make both fields instance-level (remove static).
🔧 Proposed fix
- // This permit exists to allow any inflight background triggered upload.
- private static final ReentrantLock reentrantLock = new ReentrantLock();
- private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
+ // This lock exists to allow only one inflight remote upload at a time.
+ private final ReentrantLock reentrantLock = new ReentrantLock();
+ private final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| private static final ReentrantLock reentrantLock = new ReentrantLock(); | |
| private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock); | |
| private final ReentrantLock reentrantLock = new ReentrantLock(); | |
| private final ReleasableLock syncPermit = new ReleasableLock(reentrantLock); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`
around lines 97 - 98, The lock fields ReentrantLock reentrantLock and
ReleasableLock syncPermit are declared static final causing JVM-wide cross-shard
contention; change them to instance fields by removing the static modifier so
each RemoteFsTranslog instance has its own ReentrantLock and corresponding
ReleasableLock; update any code that relies on these being class-level (e.g.,
usages in prepareAndUpload, drainSync(), and isLockAvailable()) to use the
instance fields instead so lock checks and timeouts apply per-shard rather than
JVM-wide.
| if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || syncPermit.tryAcquire() == null) { | ||
| logger.debug("skipped uploading translog for {} {} isLockAvailable={}", primaryTerm, generation, isLockAvailable()); | ||
| // NO-OP | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Lock acquired at Line 402 is leaked when the generation < current.getGeneration() branch is taken
syncPermit.tryAcquire() at Line 402 acquires the lock if it returns non-null, but the only release point is syncPermit.close() in upload()'s finally block (Line 477). The else if (generation < current.getGeneration()) { return false; } branch at Line 431–432 exits prepareAndUpload without ever reaching upload(), leaving the ReentrantLock permanently held by this thread.
Downstream effects:
- Other threads calling
prepareAndUploadwill seetryAcquire()return null and skip uploads indefinitely. drainSync()will block for the full 1-minute timeout and then throwRuntimeException.
The exception propagation path at Lines 426–429 (inside the inner try block) carries the same defect: if an exception is thrown before readLock.acquire(), the exception propagates out of prepareAndUpload while syncPermit is still held.
The minimal targeted fix for the early-return path:
🔧 Proposed fix
} else if (generation < current.getGeneration()) {
+ syncPermit.close();
return false;
}For the exception path, wrap the lock in a try/finally at the acquisition site or restructure prepareAndUpload so the Releasable returned by tryAcquire() is held in a try-with-resources that also covers the early-return:
- if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || syncPermit.tryAcquire() == null) {
+ Releasable permit = syncPermit.tryAcquire();
+ if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || permit == null) {
logger.debug("skipped uploading translog for {} {} isLockAvailable={}", primaryTerm, generation, isLockAvailable());
// NO-OP
return false;
}
+ // permit is non-null; must be closed on every exit pathThen ensure syncPermit.close() (or the enclosing try-with-resources) covers all exit paths, instead of relying solely on the upload() finally block.
Also applies to: 431-432
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`
around lines 402 - 406, prepareAndUpload leaks the Releasable acquired from
syncPermit.tryAcquire() on early returns and exception paths (the lock is only
released in upload()'s finally), so change the acquisition site in
prepareAndUpload to hold the returned Releasable in a try-with-resources /
try-finally that guarantees close() on every exit (including the generation <
current.getGeneration() early return and any exceptions before
readLock.acquire()); ensure the scope covers both the branch that calls
upload(...) and the branches that return early, and remove reliance on
upload()'s finally as the sole release point; reference symbols:
syncPermit.tryAcquire(), prepareAndUpload(...), upload(...), readLock.acquire(),
and drainSync().
Description
This refactor replaces the use of Semaphore with ReleasableLock for better consistency across the codebase. By aligning with the existing locking mechanisms, this change improves readability and ensures uniform lock acquisition and release patterns.
Changes
Related Issues
Resolves #11360
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.