[BUG] S3 Multi-part upload fails for remote cluster state #14808

@soosinha

Describe the bug

When S3 is used as the backing store for remote cluster state, the multi-part upload of remote state files fails with the error below.

[2024-07-16T12:53:45,471][ERROR][o.o.g.r.RemoteClusterStateService] [7c09ef8cc274078bab152a013b5cbb55] Exception during transfer of Metadata Fragment to Remote nodes
org.opensearch.gateway.remote.RemoteStateTransferException: nodes, failed entity:org.opensearch.gateway.remote.model.RemoteDiscoveryNodes@1c832901
	at org.opensearch.gateway.remote.RemoteClusterStateAttributesManager.lambda$getActionListener$2(RemoteClusterStateAttributesManager.java:106)
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
	at org.opensearch.repositories.s3.S3BlobContainer.lambda$createFileCompletableFuture$7(S3BlobContainer.java:320)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Failed to send multipart upload requests.
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at org.opensearch.repositories.s3.async.AsyncTransferManager.handleException(AsyncTransferManager.java:326)
	... 61 more
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Request content was only 177910 bytes, but the specified content-length was 5288374 bytes.
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:218)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:182)
	... 24 more
Caused by: java.lang.IllegalStateException: Request content was only 177910 bytes, but the specified content-length was 5288374 bytes.
	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor$StreamedRequest$1.onComplete(NettyRequestExecutor.java:479)
	at software.amazon.awssdk.utils.async.SimplePublisher.doProcessQueue(SimplePublisher.java:275)
	at software.amazon.awssdk.utils.async.SimplePublisher.processEventQueue(SimplePublisher.java:224)

When the s3 async upload is invoked, it is passed an IndexInput created from the serialized bytes (code ref). Inside the s3 plugin, when the parts of a multi-part upload are initialized, each part sets the file pointer to the location in the IndexInput where it should start reading. But because every part is backed by the same IndexInput, the file pointer ends up at the start of the last part. When the s3 client then starts reading, only one of the parts can read any bytes, and it fails with the content-length mismatch. The other parts cannot read a single byte, because the file pointer has already reached the end of the IndexInput.
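The shared file pointer problem described above can be illustrated with a minimal sketch. It does not use the actual s3 plugin or Lucene classes: `SeekableInput` and `SharedCursorDemo` are hypothetical stand-ins, and the reads run sequentially here, whereas the real upload reads parts concurrently, so which part gets the short read may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a seekable input such as IndexInput:
// one byte array and one file pointer.
class SeekableInput {
    private final byte[] data;
    private int pointer;

    SeekableInput(byte[] data) {
        this.data = data;
    }

    void seek(int position) {
        this.pointer = position;
    }

    // Reads up to len bytes from the current pointer and
    // returns the number of bytes actually read.
    int read(int len) {
        int n = Math.max(0, Math.min(len, data.length - pointer));
        pointer += n;
        return n;
    }
}

class SharedCursorDemo {
    // Initializes every part first (seek), then reads them, all on ONE shared input.
    static List<Integer> bytesReadPerPart(int totalSize, int partSize) {
        SeekableInput shared = new SeekableInput(new byte[totalSize]);
        int parts = (totalSize + partSize - 1) / partSize;

        // Phase 1: each part seeks the shared input to its own start offset.
        // Every seek overwrites the previous one, so the last part wins.
        for (int i = 0; i < parts; i++) {
            shared.seek(i * partSize);
        }

        // Phase 2: each part tries to read the length it declared.
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            int expected = Math.min(partSize, totalSize - i * partSize);
            result.add(shared.read(expected));
        }
        return result;
    }

    public static void main(String[] args) {
        // 25 bytes in parts of 10: declared lengths are 10, 10, 5.
        System.out.println(bytesReadPerPart(25, 10)); // prints [5, 0, 0]
    }
}
```

In this model the first reader gets only the tail of the data, fewer bytes than the length it declared, and the remaining readers get nothing, which matches the "Request content was only ... bytes, but the specified content-length was ..." failure in the stack trace.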

Related component

Cluster Manager

To Reproduce

  1. Create a cluster with remote state publication enabled.
  2. Keep adding nodes to the cluster until the size of DiscoveryNodes in the cluster state exceeds 5 MB.
  3. Once the size reaches 5 MB, the s3 plugin attempts a multi-part upload.
  4. Check the logs for the upload failure exception.

Expected behavior

Multi-part upload should complete successfully.

Solution

When the s3 async upload is invoked, the stream supplier function should create a new IndexInput instance on each call, so that every part reads through its own file pointer.
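A minimal sketch of the proposed approach, under the assumption that the supplier can wrap the same serialized bytes in a fresh input on every call. `PartInput` and `FreshInputPerPartDemo` are hypothetical names for illustration only and are not the plugin's actual classes or the eventual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for IndexInput: a byte array with its own file pointer.
class PartInput {
    private final byte[] data;
    private int pointer;

    PartInput(byte[] data) {
        this.data = data;
    }

    void seek(int position) {
        this.pointer = position;
    }

    // Reads up to len bytes from the current pointer and
    // returns the number of bytes actually read.
    int read(int len) {
        int n = Math.max(0, Math.min(len, data.length - pointer));
        pointer += n;
        return n;
    }
}

class FreshInputPerPartDemo {
    static List<Integer> bytesReadPerPart(int totalSize, int partSize) {
        byte[] serialized = new byte[totalSize];
        // The supplier returns a NEW input over the same bytes on every call,
        // so no two parts share a file pointer.
        Supplier<PartInput> inputSupplier = () -> new PartInput(serialized);
        int parts = (totalSize + partSize - 1) / partSize;

        // Phase 1: each part obtains its own input and seeks to its start offset.
        List<PartInput> partInputs = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            PartInput in = inputSupplier.get();
            in.seek(i * partSize);
            partInputs.add(in);
        }

        // Phase 2: each part reads the length it declared from its own input.
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            int expected = Math.min(partSize, totalSize - i * partSize);
            result.add(partInputs.get(i).read(expected));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bytesReadPerPart(25, 10)); // prints [10, 10, 5]
    }
}
```

Because each part owns its pointer, the order in which parts are initialized and read no longer matters, and every part delivers exactly the number of bytes it declared.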

Additional Details

Plugins
s3 plugin

Host/Environment:
OpenSearch 2.15


Status

✅ Done
