Force Refresh Listeners when Acquiring all Operation Permits#36835
Force Refresh Listeners when Acquiring all Operation Permits#36835original-brownbear merged 12 commits intoelastic:masterfrom original-brownbear:relocation-refresh-fix
Conversation
|
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
It seems the only production use case for this method is in relocation so admit it's a little noisy to add this kind of general callback here, but it still seems like the smallest possible change to get a hook to run the refresh conditionally here (after preventing new operations from piling on more waits concurrently).
ywelsch
left a comment
There was a problem hiding this comment.
This has still a race I think. The issue is that there's no guarantee that the refresh will happen after all pending requests have registered a refresh listener. To ensure this, we need multiple steps. First, ensure that no new listeners are registered (this can be achieved by setting getMaxRefreshListeners to 0), and then doing a manual refresh to free all existing listeners. There is no need I think to inline all this into blockOperations, it can be done before calling this method.
* Fixes the issue reproduced in the added tests: * When having open index requests on a shard that are waiting for a refresh, relocating that shard becomes blocked until that refresh happens (which could be never as in the test scenario). * Fixed by: * Before trying to aquire all permits for relocation, refresh if there are outstanding operations
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Show resolved
Hide resolved
ywelsch
left a comment
There was a problem hiding this comment.
Thanks @original-brownbear. The concurrency looks better. I think we need to extend this to all actions that can possibly acquire all operation permits. In particular, I think this might also cause problems on replicas, e.g. when a replica learns of a new primary and tries to bump its term (see IndexShard#bumpPrimaryTerm). If it then has a refresh=wait_for op waiting (from the old primary), it will run into the same issue, and indefinitely stop accepting any writes from the new primary.
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
| refreshListeners.disallowAdd(); | ||
| try { | ||
| if (refreshListeners.refreshNeeded()) { | ||
| refresh("relocated"); |
There was a problem hiding this comment.
As we always want to do the refresh after calling disallowAdd, I wonder if we should combine both into one method
There was a problem hiding this comment.
I couldn't find a neat way of doing that since we have the async case and the blocking case of acquiring all the permits and we want to enforce some try-finally semantics for allowing the listeners again for both now. I'm not sure it actually makes things more readable if we hide the handling of exceptions from refresh(...) in some other method. I can try finding a nice way though :)
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Outdated
Show resolved
Hide resolved
|
@ywelsch all points addressed I think => should be good for another review.
Done, I wrapped all cases of acquiring all permits I could find. It seems though, that |
|
Ok thanks, I'll add some tests :) |
| throw e; | ||
| } | ||
| } | ||
| return () -> runOnce.run(); |
There was a problem hiding this comment.
maybe we could assert before this line here that assert refreshListeners == null?
|
Please adapt PR title to make it clear this is not only for relocations. For example: "Force refresh listeners when acquiring all operation permits" |
* Force Refresh Listeners when Acquiring all Operation Permits (#36835) * Fixes the issue reproduced in the added tests: * When having open index requests on a shard that are waiting for a refresh, relocating that shard becomes blocked until that refresh happens (which could be never as in the test scenario).
|
Thanks for fixing this @original-brownbear ! |
becomes blocked until that refresh happens (which could be never as in the test scenario).
PS: I ran the added tests for a few thousand runs without trouble.