[BUG] Same translog metadata file uploaded from old primary during a race condition  #11322

@ashking94

Describe the bug
Across different nodes, the combination of primary term and translog generation used in the translog metadata file name has to be unique.
There is a bug where the old primary can still upload a translog metadata file with the same primary term and generation as the file generated by the new primary as part of the relocation handoff. This happens when an internal or background flush is triggered around the same time as the relocation handoff, but just before primary mode becomes false on the old primary. In the cases where we found the issue, the internal flush was triggered because the shard had received no writes in the last 5 minutes, and the relocation happened at around the same time as that internal flush.

public void flushOnIdle(long inactiveTimeNS) {
    Engine engineOrNull = getEngineOrNull();
    // Only flush if the engine exists and no write has happened for at least inactiveTimeNS
    if (engineOrNull != null && System.nanoTime() - engineOrNull.getLastWriteNanos() >= inactiveTimeNS) {
        // Flip the shard to inactive; only the caller that observes the transition schedules the flush
        boolean wasActive = active.getAndSet(false);
        if (wasActive) {
            logger.debug("flushing shard on inactive");
            // The flush runs asynchronously on the FLUSH thread pool
            threadPool.executor(ThreadPool.Names.FLUSH).execute(new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    if (state != IndexShardState.CLOSED) {
                        logger.warn("failed to flush shard on inactive", e);
                    }
                }

                @Override
                protected void doRun() {
                    flush(new FlushRequest().waitIfOngoing(false).force(false));
                    periodicFlushMetric.inc();
                }
            });
        }
    }
}
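
To make the collision concrete, here is a minimal, self-contained sketch. It is not OpenSearch code: the class `TranslogMetadataCollisionSketch`, the `metadata__<primaryTerm>__<generation>` name format, and the in-memory map standing in for the remote store are all assumptions made for illustration. It only shows why two nodes that use the same primary term and generation end up writing to the same file name.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: models a remote store keyed by metadata file name.
final class TranslogMetadataCollisionSketch {

    // Hypothetical remote store: file name -> id of the node that last wrote it.
    private final Map<String, String> remoteStore = new HashMap<>();

    // Assumed naming scheme: the name depends only on primary term and generation.
    static String metadataFileName(long primaryTerm, long generation) {
        return "metadata__" + primaryTerm + "__" + generation;
    }

    // Uploads a metadata file; returns true if it overwrote a file written by another node.
    boolean upload(String nodeId, long primaryTerm, long generation) {
        String name = metadataFileName(primaryTerm, generation);
        String previousWriter = remoteStore.put(name, nodeId);
        return previousWriter != null && !previousWriter.equals(nodeId);
    }

    public static void main(String[] args) {
        TranslogMetadataCollisionSketch store = new TranslogMetadataCollisionSketch();
        // The new primary uploads during the relocation handoff.
        boolean first = store.upload("new-primary", 3, 17);
        // A late idle flush on the old primary reuses the same term and generation.
        boolean second = store.upload("old-primary", 3, 17);
        System.out.println("first upload collided: " + first);
        System.out.println("second upload collided: " + second);
    }
}
```

In this model the second upload silently replaces the first, because the name carries nothing that distinguishes the two nodes.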

To Reproduce
This is very difficult to reproduce and shows up at very high scale. However, we can still attempt to reproduce it by creating multiple indexes and triggering a relocation around the 5th minute of no writes on a shard.

Expected behavior
The old primary must not upload translog metadata once control reaches the relocation handoff stage.
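
One way to picture this expectation is a guard that uploads must pass through and that the handoff closes. The sketch below is an assumption for discussion, not the fix adopted in OpenSearch: `HandoffUploadGuardSketch`, `tryUpload`, and `beginHandoff` are hypothetical names. It uses a read/write lock so that `beginHandoff` waits for in-flight uploads to finish and every later upload attempt is rejected.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: blocks uploads from the old primary once handoff has begun.
final class HandoffUploadGuardSketch {

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean handoffStarted = false; // guarded by lock

    // Runs the upload under the read lock; returns false if handoff has already started.
    boolean tryUpload(Runnable uploadAction) {
        lock.readLock().lock();
        try {
            if (handoffStarted) {
                return false; // old primary must not upload any more
            }
            uploadAction.run();
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Takes the write lock, so it waits for in-flight uploads and then closes the gate.
    void beginHandoff() {
        lock.writeLock().lock();
        try {
            handoffStarted = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        HandoffUploadGuardSketch guard = new HandoffUploadGuardSketch();
        System.out.println("before handoff: " + guard.tryUpload(() -> {}));
        guard.beginHandoff();
        System.out.println("after handoff: " + guard.tryUpload(() -> {}));
    }
}
```

With such a guard, an idle flush that reaches the upload step after `beginHandoff` would be turned away instead of racing with the new primary.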

Labels

Storage:Durability, Storage:Remote, bug, v2.12.0
