Skip to content

Replace Semaphore with ReleasableLock in Engine and Translog related classes#17304

Open
ahnyujin wants to merge 4 commits intoopensearch-project:mainfrom
ahnyujin:use-releasableLock
Open

Replace Semaphore with ReleasableLock in Engine and Translog related classes#17304
ahnyujin wants to merge 4 commits intoopensearch-project:mainfrom
ahnyujin:use-releasableLock

Conversation

@ahnyujin
Copy link
Copy Markdown

@ahnyujin ahnyujin commented Feb 8, 2025

Description

This refactor replaces the use of Semaphore with ReleasableLock for better consistency across the codebase. By aligning with the existing locking mechanisms, this change improves readability and ensures uniform lock acquisition and release patterns.

Changes

  • Replaced Semaphore with ReleasableLock in RemoteFsTranslog.
  • Ensured that lock handling follows the established project conventions.
  • Improved code maintainability by using a more structured locking approach.

Related Issues

Resolves #11360

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Storage:Durability Issues and PRs related to the durability framework labels Feb 8, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 8, 2025

❌ Gradle check result for 93c1fbc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for f82640b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Apr 10, 2025
@ahnyujin ahnyujin force-pushed the use-releasableLock branch from f82640b to dc6c118 Compare April 13, 2025 00:28
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for dc6c118: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Apr 13, 2025
@ahnyujin ahnyujin force-pushed the use-releasableLock branch from dc6c118 to cdb76cc Compare May 20, 2025 12:29
@ahnyujin ahnyujin requested a review from a team as a code owner May 20, 2025 12:29
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for cdb76cc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Nov 3, 2025

❌ Gradle check result for 276c933: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

yujini-ahn and others added 4 commits February 18, 2026 22:22
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
Signed-off-by: Yujin Ahn <ujahnn@gmail.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

Replaces Semaphore-based syncPermit with ReentrantLock wrapped in ReleasableLock to guard in-flight remote upload synchronization in RemoteFsTranslog. Updates control flow checks and assertions to use lock availability instead of permit counts, with corresponding test updates.

Changes

Cohort / File(s) Summary
Lock Mechanism Refactoring
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
Replaces Semaphore with ReentrantLock wrapped in ReleasableLock. Adds isLockAvailable() method (package-visible), removes availablePermits(). Updates prepareAndUpload control flow to check lock availability and pauseSync state. Modifies drainSync to use timeout-based lock acquisition and updates assertions to use isHeldByCurrentThread(). Adds imports for ReentrantLock and timeValueMinutes utility.
Test Assertion Updates
server/src/test/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslogTests.java, server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java
Updates test assertions in testDrainSync methods to check isLockAvailable() instead of availablePermits(), reflecting the shift from permit-based to lock-based synchronization semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: replacing Semaphore with ReleasableLock in the relevant classes.
Description check ✅ Passed The description covers the main changes, references the related issue #11360, and includes all checklist items completed as per the template requirements.
Linked Issues check ✅ Passed The PR successfully replaces Semaphore with ReleasableLock in RemoteFsTranslog, addressing the primary objective from issue #11360 to standardize locking mechanisms.
Out of Scope Changes check ✅ Passed All changes are focused on replacing Semaphore with ReleasableLock in translog classes and related tests, directly aligned with the PR objectives and linked issue #11360.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

❗ AI-powered Code-Diff-Analyzer found issues on commit c7eb6ca.

PathLineSeverityDescription
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java96highLock changed from instance-level to static: reentrantLock and syncPermit are now static final, meaning ALL RemoteFsTranslog instances across all shards/indices will share the same lock. This creates a system-wide synchronization bottleneck and potential DoS vector. Previous Semaphore was instance-level (non-static). This fundamentally changes concurrency behavior and could block all translog operations cluster-wide when any single instance holds the lock.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java (1)

564-583: ⚠️ Potential issue | 🟡 Minor

drainSync lacks a lock-release guard if the assertion fires

If pauseSync.compareAndSet(false, true) returns false (concurrent drainSync calls on the same instance), result is false and the assert at Line 568 fires. With assertions enabled an AssertionError is thrown while the lock is held — there is no finally block to call syncPermit.close(), leaking the lock.

A concurrent drainSync call is unlikely by contract, but a defensive try/finally would prevent the leak on assertion failure:

🛡️ Suggested defensive fix
         if (syncPermit.tryAcquire(timeValueMinutes(1L)) != null) {
-            boolean result = pauseSync.compareAndSet(false, true);
-            assert result && syncPermit.isHeldByCurrentThread();
+            try {
+                boolean result = pauseSync.compareAndSet(false, true);
+                assert result && syncPermit.isHeldByCurrentThread();
+            } catch (AssertionError e) {
+                syncPermit.close();
+                throw e;
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`
around lines 564 - 583, drainSync can throw an AssertionError after acquiring
the syncPermit when pauseSync.compareAndSet(false,true) returns false, leaking
the permit; change drainSync so that after a successful
syncPermit.tryAcquire(...) you guard the compareAndSet and the remainder with a
try/finally: if compareAndSet fails or any throwable occurs before returning the
Releasable, ensure you call syncPermit.close() (and revert pauseSync if you set
it) so the permit is always released, and preserve the existing
Releasables.releaseOnce behavior for the normal path; specifically adjust logic
around syncPermit.tryAcquire, pauseSync.compareAndSet(false,true),
syncPermit.close(), and pauseSync.getAndSet(false) to release the permit on
error paths.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14749f7 and c7eb6ca.

📒 Files selected for processing (3)
  • server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
  • server/src/test/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslogTests.java
  • server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
🪛 GitHub Actions: Gradle Check (Jenkins)
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java

[error] 96-96: Diff analyzer reported a high-severity issue: Lock changed from instance-level to static (reentrantLock and syncPermit now static final). This can cause system-wide synchronization bottlenecks and potential DoS across shards/indices.

🔇 Additional comments (2)
server/src/test/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslogTests.java (1)

550-554: Test assertions correctly reflect the new lock-based semantics

Line 550's assertBusy(() -> assertFalse(translog.isLockAvailable())) correctly polls until the upload thread acquires the lock, and Line 554's assertFalse(translog.isLockAvailable()) correctly validates that drainSync continues to hold the lock after the upload finishes. Both replacements are semantically equivalent to the removed availablePermits() == 0 checks.

Note: with the static lock bug in RemoteFsTranslog (Lines 97–98), isLockAvailable() reflects JVM-wide lock state rather than per-instance state. Once the static issue is fixed, these assertions will correctly scope to this translog instance.

server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java (1)

888-892: LGTM — same correct adaptation as in the timestamp-aware test

Line 888's assertBusy(() -> assertFalse(translog.isLockAvailable())) replaces the equivalent availablePermits() == 0 busy-wait, and Line 892 correctly asserts the lock remains held after drainSync() is called. The test logic is sound and will correctly validate the expected behavior once the static lock issue is resolved.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`:
- Around line 402-406: prepareAndUpload leaks the Releasable acquired from
syncPermit.tryAcquire() on early returns and exception paths (the lock is only
released in upload()'s finally), so change the acquisition site in
prepareAndUpload to hold the returned Releasable in a try-with-resources /
try-finally that guarantees close() on every exit (including the generation <
current.getGeneration() early return and any exceptions before
readLock.acquire()); ensure the scope covers both the branch that calls
upload(...) and the branches that return early, and remove reliance on
upload()'s finally as the sole release point; reference symbols:
syncPermit.tryAcquire(), prepareAndUpload(...), upload(...), readLock.acquire(),
and drainSync().
- Around line 97-98: The lock fields ReentrantLock reentrantLock and
ReleasableLock syncPermit are declared static final causing JVM-wide cross-shard
contention; change them to instance fields by removing the static modifier so
each RemoteFsTranslog instance has its own ReentrantLock and corresponding
ReleasableLock; update any code that relies on these being class-level (e.g.,
usages in prepareAndUpload, drainSync(), and isLockAvailable()) to use the
instance fields instead so lock checks and timeouts apply per-shard rather than
JVM-wide.

---

Outside diff comments:
In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`:
- Around line 564-583: drainSync can throw an AssertionError after acquiring the
syncPermit when pauseSync.compareAndSet(false,true) returns false, leaking the
permit; change drainSync so that after a successful syncPermit.tryAcquire(...)
you guard the compareAndSet and the remainder with a try/finally: if
compareAndSet fails or any throwable occurs before returning the Releasable,
ensure you call syncPermit.close() (and revert pauseSync if you set it) so the
permit is always released, and preserve the existing Releasables.releaseOnce
behavior for the normal path; specifically adjust logic around
syncPermit.tryAcquire, pauseSync.compareAndSet(false,true), syncPermit.close(),
and pauseSync.getAndSet(false) to release the permit on error paths.

Comment on lines +97 to +98
private static final ReentrantLock reentrantLock = new ReentrantLock();
private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

static final on lock fields causes JVM-wide cross-shard contention — confirmed by CI

Both reentrantLock and syncPermit are declared static final, meaning every RemoteFsTranslog instance in the JVM (one per shard/index) shares a single lock. Consequences:

  • A single long-running upload on shard A blocks prepareAndUpload on all other shards until it completes.
  • A drainSync() call during primary relocation on shard A stalls uploads on every unrelated shard for up to 1 minute (the tryAcquire timeout).
  • isLockAvailable() (Line 781) reflects whether any shard's upload is in progress, not only this instance's — making the predicate semantically wrong for per-shard decisions.

The fix is straightforward: make both fields instance-level (remove static).

🔧 Proposed fix
-    // This permit exists to allow any inflight background triggered upload.
-    private static final ReentrantLock reentrantLock = new ReentrantLock();
-    private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
+    // This lock exists to allow only one inflight remote upload at a time.
+    private final ReentrantLock reentrantLock = new ReentrantLock();
+    private final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private static final ReentrantLock reentrantLock = new ReentrantLock();
private static final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
private final ReentrantLock reentrantLock = new ReentrantLock();
private final ReleasableLock syncPermit = new ReleasableLock(reentrantLock);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`
around lines 97 - 98, The lock fields ReentrantLock reentrantLock and
ReleasableLock syncPermit are declared static final causing JVM-wide cross-shard
contention; change them to instance fields by removing the static modifier so
each RemoteFsTranslog instance has its own ReentrantLock and corresponding
ReleasableLock; update any code that relies on these being class-level (e.g.,
usages in prepareAndUpload, drainSync(), and isLockAvailable()) to use the
instance fields instead so lock checks and timeouts apply per-shard rather than
JVM-wide.

Comment on lines +402 to 406
if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || syncPermit.tryAcquire() == null) {
logger.debug("skipped uploading translog for {} {} isLockAvailable={}", primaryTerm, generation, isLockAvailable());
// NO-OP
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Lock acquired at Line 402 is leaked when the generation < current.getGeneration() branch is taken

syncPermit.tryAcquire() at Line 402 acquires the lock if it returns non-null, but the only release point is syncPermit.close() in upload()'s finally block (Line 477). The else if (generation < current.getGeneration()) { return false; } branch at Line 431–432 exits prepareAndUpload without ever reaching upload(), leaving the ReentrantLock permanently held by this thread.

Downstream effects:

  • Other threads calling prepareAndUpload will see tryAcquire() return null and skip uploads indefinitely.
  • drainSync() will block for the full 1-minute timeout and then throw RuntimeException.

The exception propagation path at Lines 426–429 (inside the inner try block) carries the same defect: if an exception is thrown before readLock.acquire(), the exception propagates out of prepareAndUpload while syncPermit is still held.

The minimal targeted fix for the early-return path:

🔧 Proposed fix
         } else if (generation < current.getGeneration()) {
+            syncPermit.close();
             return false;
         }

For the exception path, wrap the lock in a try/finally at the acquisition site or restructure prepareAndUpload so the Releasable returned by tryAcquire() is held in a try-with-resources that also covers the early-return:

-        if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || syncPermit.tryAcquire() == null) {
+        Releasable permit = syncPermit.tryAcquire();
+        if (startedPrimarySupplier.getAsBoolean() == false || pauseSync.get() || permit == null) {
             logger.debug("skipped uploading translog for {} {} isLockAvailable={}", primaryTerm, generation, isLockAvailable());
             // NO-OP
             return false;
         }
+        // permit is non-null; must be closed on every exit path

Then ensure syncPermit.close() (or the enclosing try-with-resources) covers all exit paths, instead of relying solely on the upload() finally block.

Also applies to: 431-432

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java`
around lines 402 - 406, prepareAndUpload leaks the Releasable acquired from
syncPermit.tryAcquire() on early returns and exception paths (the lock is only
released in upload()'s finally), so change the acquisition site in
prepareAndUpload to hold the returned Releasable in a try-with-resources /
try-finally that guarantees close() on every exit (including the generation <
current.getGeneration() early return and any exceptions before
readLock.acquire()); ensure the scope covers both the branch that calls
upload(...) and the branches that return early, and remove reliance on
upload()'s finally as the sole release point; reference symbols:
syncPermit.tryAcquire(), prepareAndUpload(...), upload(...), readLock.acquire(),
and drainSync().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Storage:Durability Issues and PRs related to the durability framework

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Use ReleasableLock within Engine and Translog related classes

4 participants