SNAPSHOT: Deterministic ClusterState Tests#36644
SNAPSHOT: Deterministic ClusterState Tests#36644original-brownbear merged 47 commits intoelastic:masterfrom original-brownbear:deterministic-snapshot-tests
Conversation
* Use `DeterministicTaskQueue` infrastructure to reproduce #32265
|
Pinging @elastic/es-distributed |
original-brownbear
left a comment
There was a problem hiding this comment.
@ywelsch did my best to clean this up and make it readable now :) Take a look when you have some time, I think this should be much closer now to what you were envisioning :)
| @Override | ||
| public ScheduledExecutorService scheduler() { | ||
| throw new UnsupportedOperationException(); | ||
| return new ScheduledExecutorService() { |
There was a problem hiding this comment.
Needed to add a dummy return here since this was used by org.elasticsearch.index.shard.IndexShard#IndexShard (only used in the constructor though for the current test, so no actual implementation necessary otherwise).
| public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) { | ||
| throw new UnsupportedOperationException(); | ||
| // TODO: Implement fully like schedule | ||
| return new Cancellable() { |
There was a problem hiding this comment.
Added dummy return only here for now. This is used to schedule a task org.elasticsearch.indices.IndexingMemoryController.ShardsIndicesStatusChecker in org.elasticsearch.indices.IndexingMemoryController. Just a dummy for now sine that task doesn't seem relevant for this test.
There was a problem hiding this comment.
might as well properly implement this, looks not that difficult (I think it's just to call super. scheduleWithFixedDelay() here) and add a test to DeterministicTaskQueueTests.
| } | ||
| }; | ||
|
|
||
| TestClusterNode(DiscoveryNode node, DeterministicTaskQueue deterministicTaskQueue) throws IOException { |
There was a problem hiding this comment.
Setting up all the real services used for snapshotting with here with the exceptions:
- MockTransportService that just short-circuits the network.
- Mock ClusterStatePublisher that just short-circuits the network (I added a todo here, because I wasn't sure if we could maybe use the real thing here. It seemed very tricky to do so, but maybe it's not or worth the effort?)
| new IndexScopedSettings(settings, IndexScopedSettings.BUILT_IN_INDEX_SETTINGS); | ||
| indicesService = new IndicesService( | ||
| settings, | ||
| mock(PluginsService.class), |
There was a problem hiding this comment.
Just mocked this one out since it's not relevant for the test and just a bunch of code to get up and running.
| shardStateAction, | ||
| new NodeMappingRefreshAction(transportService, new MetaDataMappingService(clusterService, indicesService)), | ||
| repositoriesService, | ||
| mock(SearchService.class), |
There was a problem hiding this comment.
Just mocked this one out since it's not relevant for the test and just a bunch of code to get up and running.
|
|
||
| private final ClusterService clusterService; | ||
|
|
||
| private final RepositoriesService repositoriesService = mock(RepositoriesService.class); |
There was a problem hiding this comment.
Just mocked this one out since it's not that relevant for the test (we really only need it to return the repository, that's the only call we make to it) and just a bunch of code to get up and running.
| ) | ||
| ); | ||
|
|
||
| runOutstandingTasks(); |
There was a problem hiding this comment.
Running all tasks so that the index is fully set up when we create the snapshot.
|
@ywelsch fixed all issues we talk about today in e719046:
|
|
Jenkins run gradle build tests 2 |
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
| repositoriesService = new RepositoriesService( | ||
| settings, clusterService, transportService, | ||
| Collections.singletonMap(FsRepository.TYPE, metaData -> { | ||
| final Repository repository = new FsRepository(metaData, createEnvironment(), xContentRegistry()) { |
There was a problem hiding this comment.
I'm confused. With each node having their own environment, how do the nodes access a shared FS location for writing the snapshot?
There was a problem hiding this comment.
I think the reason this did not fail yet is that we aren't writing any data yet because all the shards are empty?
Regardless, I cleaned this up and made sure all nodes have the same repository path in their settings now :)
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
|
@ywelsch thanks for taking a look! All points addressed I think -> should be good for another review. |
| tempDir = createTempDir(); | ||
| deterministicTaskQueue = | ||
| new DeterministicTaskQueue(Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random()); | ||
| // TODO: Random number of master nodes and simulate master failover states |
There was a problem hiding this comment.
we're not simulating failovers yet?
There was a problem hiding this comment.
No, not yet. I was under the impression that we wanted to get the simple successful test case in first and then add those things when we last spoke about the steps here?
There was a problem hiding this comment.
yes, I found the comment just confusing here, given that we have no master failovers yet.
| final ClusterSettings clusterSettings = new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS); | ||
| final ThreadPool threadPool = deterministicTaskQueue.getThreadPool(); | ||
| clusterService = new ClusterService(settings, clusterSettings, threadPool, masterService); | ||
| mockTransport = new MockTransport() { |
There was a problem hiding this comment.
perhaps it's simpler to implement DisruptableMockTransport, see CoordinatorTests
| // Mock publisher that invokes other cluster change listeners directly | ||
| // TODO: Run state updates on the individual nodes out of order, this is currently not possible | ||
| // TODO: because it can lead to running the blocking recovery tasks on the deterministicTaskQueue | ||
| // TODO: when a DelayRecoveryException is thrown on the transport layer as a result of |
There was a problem hiding this comment.
as far as I understand, the problem is not the DelayRecoveryException, but the general blocking nature of peer recoveries (e.g. PeerRecoveryTargetService blockingly waits on the recovery to complete).
Perhaps we could only have this while allocating shards, but for the duration of the snapshot, while no shards are being allocated, revert to a more randomized mode. Alternatively, we can test without replica shards for now.
There was a problem hiding this comment.
Removed the replicas for now in 237f9e7, that also allows for a simpler mock transport until we have non blocking replication.
| }); | ||
| }); | ||
| masterService.setClusterStateSupplier(currentState::get); | ||
| if (node.isMasterNode()) { |
There was a problem hiding this comment.
is this if-clause necessary?
| public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) { | ||
| throw new UnsupportedOperationException(); | ||
| // TODO: Implement fully like schedule | ||
| return new Cancellable() { |
There was a problem hiding this comment.
might as well properly implement this, looks not that difficult (I think it's just to call super. scheduleWithFixedDelay() here) and add a test to DeterministicTaskQueueTests.
|
@ywelsch all points addressed :) |
| tempDir = createTempDir(); | ||
| deterministicTaskQueue = | ||
| new DeterministicTaskQueue(Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random()); | ||
| // TODO: Random number of master nodes and simulate master failover states |
There was a problem hiding this comment.
yes, I found the comment just confusing here, given that we have no master failovers yet.
|
@ywelsch thanks! |
Deterministic Cluster State Tests for Snapshots
TestClusterState) andDeterministicTaskQueueinfrastructure to as well as no networking to be able to iterate through every step of state updates and snapshot task execution in a reproducible manner