Fix Concurrent Index Auto Create Failing Bulk Requests #82541
original-brownbear merged 3 commits into elastic:master from original-brownbear:fix-auto-create-system-index-concurrency
Conversation
Batching these requests introduced a bug where auto-create requests for system indices would fail. System indices are always auto-created, so if multiple equal requests are batched together they throw resource-already-exists even though the index does not yet exist in the cluster state, only in the intermediary task executor state. This leads to bulk requests ignoring the exception (assuming the index already exists) in their auto-create callback even though the index does not yet exist. Fixed by deduplicating these requests for now; added a TODO to do this a little more cleanly down the road, but this fix is somewhat urgent as the bug breaks ML integ tests.
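The fix boils down to grouping equal requests in the batch so each distinct index is created once and all tasks that asked for it share that single outcome. A minimal sketch of that idea (the `CreateTask` record and `groupByIndex` helper are illustrative names, not the Elasticsearch API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupBatchSketch {

    // Hypothetical stand-in for a batched auto-create task.
    record CreateTask(String indexName) {}

    // Group equal requests: the executor can then create each distinct index
    // once and notify every task in the group with the same result, instead of
    // letting duplicates race into resource-already-exists within one batch.
    static Map<String, List<CreateTask>> groupByIndex(List<CreateTask> batch) {
        Map<String, List<CreateTask>> grouped = new LinkedHashMap<>();
        for (CreateTask task : batch) {
            grouped.computeIfAbsent(task.indexName(), k -> new ArrayList<>()).add(task);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<CreateTask> batch = List.of(
            new CreateTask(".ml-state"),
            new CreateTask(".ml-state"),        // duplicate from a concurrent bulk request
            new CreateTask(".ml-notifications")
        );
        // One creation per distinct index name; duplicates share the outcome.
        System.out.println(groupByIndex(batch).keySet()); // prints [.ml-state, .ml-notifications]
    }
}
```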
Pinging @elastic/es-data-management (Team:Data Management)
```java
    }
}
state = allocationService.reroute(state, "auto-create");
if (state != currentState) {
```
Small optimization: in case the full batch doesn't do anything, there's no need to reroute.
Could we move this to the clusterStatePublished event and pass it off to ClusterService#rerouteService for batching with other pending reroutes too?
I'm not sure about that. This adds one more CS update if no reroutes are pending and saves a reroute every now and then. A reroute is pretty fast these days though. With all the fixes we added recently I'm not sure it's worth it in 8.1.
Hmm it's not trivially fast in a large cluster still so I think we still prefer batching. Yes it's one more publication in an otherwise-quiet cluster but that's not really where the problems arise.
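The optimization being debated is the identity check in the diff above: the task executor builds a new state only when something actually changes, so a no-op batch hands back the very same instance and the reroute (or publication) can be skipped. A toy sketch of that pattern, with a plain `String` standing in for the immutable cluster state (all names here are illustrative):

```java
import java.util.List;

class NoopBatchSketch {

    // Apply a batch of index creations to an immutable "state". Only builds a
    // new object when something changes; otherwise returns the input as-is.
    static String applyBatch(String currentState, List<String> creations) {
        String state = currentState;
        for (String index : creations) {
            if (!state.contains(index)) {
                state = state + "," + index; // new immutable state instance
            }
        }
        return state;
    }

    public static void main(String[] args) {
        String current = "indices:a";
        String updated = applyBatch(current, List.of("a"));
        // The whole batch was a no-op, so the exact same instance comes back
        // and the comparatively expensive follow-up work can be skipped.
        System.out.println(updated == current); // prints true
    }
}
```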
```java
ClusterTasksResult.Builder<CreateIndexTask> builder = ClusterTasksResult.builder();
ClusterState state = currentState;
for (AckedClusterStateUpdateTask task : tasks) {
final Map<CreateIndexRequest, CreateIndexTask> successfulRequests = new HashMap<>(tasks.size());
```
NIT: this and line 109 are good candidates for using `var` to simplify the declaration.
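For context on the NIT: since Java 10, local-variable type inference lets a declaration like the `Map` above drop the repeated type on the left-hand side while keeping the same static type. A tiny illustration (the helper name is made up for this sketch):

```java
import java.util.HashMap;
import java.util.Map;

class VarSketch {

    static Map<String, Integer> build() {
        // Explicit form, as in the diff:
        Map<String, Integer> explicit = new HashMap<>();
        // The reviewer's suggestion: `var` infers HashMap<String, Integer>
        // from the right-hand side, with no loss of type safety.
        var inferred = new HashMap<String, Integer>();
        inferred.put("ok", 1);
        explicit.putAll(inferred);
        return explicit;
    }

    public static void main(String[] args) {
        System.out.println(build().get("ok")); // prints 1
    }
}
```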
DaveCTurner left a comment:
Looks ok but I left some thoughts and suggestions.
```java
private final class CreateIndexTask extends AckedClusterStateUpdateTask {

    final CreateIndexRequest request;
    final AtomicReference<String> indexNameRef;
```
Does this need to be an AtomicReference or could it just be a volatile field?
Annoyingly enough, yes: with the weird way the listener is constructed here, it currently seems like we need this.
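The constraint being described, as I read it: the listener is built before the task object exists, so the value holder has to be an object that both can share up front, which a `volatile` field on the task cannot provide. A minimal sketch of that shape (class and method names are invented for illustration, not the Elasticsearch code):

```java
import java.util.concurrent.atomic.AtomicReference;

class HolderSketch {

    // A volatile field would be enough for safe publication, but it only
    // exists once its owning object does.
    final AtomicReference<String> indexNameRef;

    HolderSketch(AtomicReference<String> ref) {
        this.indexNameRef = ref;
    }

    static String demo() {
        AtomicReference<String> ref = new AtomicReference<>();
        // The listener captures the holder *before* the task is constructed...
        Runnable listener = () -> ref.set("my-index");
        // ...and the task, created afterwards, shares the same holder.
        HolderSketch task = new HolderSketch(ref);
        listener.run();
        return task.indexNameRef.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints my-index
    }
}
```

This is also why the refactoring suggested below (implementing the listener interface directly on the task) would make the `AtomicReference` unnecessary.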
```java
    request.masterNodeTimeout(),
    request.timeout(),
    false
private final class CreateIndexTask extends AckedClusterStateUpdateTask {
```
Are we getting much benefit from the extends AckedClusterStateUpdateTask here? I think if we dropped that and just implemented AckedClusterStateTaskListener directly we wouldn't need to construct the listener up front and could probably solve the TODO about deduplicating the listeners.
++ I agree, the main motivation of this PR is to unblock some broken tests quickly (also in ML QA). Are we good doing that kind of refactoring in a follow-up? :)
Sure, maybe just leave a TODO here tho.
TODO added!
…stem-index-concurrency
Thanks David + Ievgen!
Non-issue as this hasn't been released yet.
relates #82159