Loading from AWS S3 large file gives "Required array length is too large" error #478

@msmygit

Command Executed:

export DSBULK_JAVA_OPTS="-Xmx10G"
./dsbulk load -k <keyspace> -t transactions -b secure-connect-<db_name>.zip -u <username> -p <password> -url "s3://path/to/transactions.csv?region=us-east-1"

Console Output:

Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.

Full Stacktrace:

2023-06-22 15:55:59 ERROR Operation LOAD_20230622-155525-014753 failed unexpectedly: Required array length 2147483639 + 96 is too large.
java.lang.OutOfMemoryError: Required array length 2147483639 + 96 is too large
        at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
        at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
        at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
        at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:132)
        at software.amazon.awssdk.utils.IoUtils.toByteArray(IoUtils.java:48)
        at software.amazon.awssdk.core.sync.ResponseTransformer.lambda$toBytes$3(ResponseTransformer.java:175)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.transformResponse(BaseSyncClientHandler.java:218)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$HttpResponseHandlerAdapter.handle(BaseSyncClientHandler.java:206)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleSuccessResponse(CombinedResponseHandler.java:99)
        at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:75)

FWIW, smaller CSV files (~200 MB) load with no problem, but CSVs larger than 1 GB fail with Java heap space errors, hence the export DSBULK_JAVA_OPTS="-Xmx10G". Are there any other throttling options available here? I tried a 187 GB CSV and a 2.3 GB CSV, and both ended with the same error.
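
For what it's worth, the numbers in the error line up with the JVM's array size limit rather than with the heap setting: 2147483639 is Integer.MAX_VALUE - 8, and the stack trace shows the whole S3 response being copied into a ByteArrayOutputStream (ResponseTransformer.toBytes -> IoUtils.toByteArray). If that reading is right, raising -Xmx would not help for any object larger than about 2 GB. Below is a minimal, stdlib-only sketch of that arithmetic, plus a streaming read for contrast. The class and method names (ArrayLimitDemo, fitsInOneArray, countLinesStreaming) are hypothetical and exist only for illustration; this is not DSBulk or AWS SDK code, and the file sizes are treated as decimal gigabytes.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of DSBulk or the AWS SDK.
class ArrayLimitDemo {
    // The first number in the error message: Integer.MAX_VALUE - 8 = 2147483639.
    public static final long SOFT_MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    // True if a payload of this size could be buffered in a single byte[].
    public static boolean fitsInOneArray(long sizeInBytes) {
        return sizeInBytes <= SOFT_MAX_ARRAY_LENGTH;
    }

    // Counts lines while reading incrementally, so memory use does not
    // grow with the total size of the stream.
    public static long countLinesStreaming(InputStream in) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // ~200 MB, 2.3 GB and 187 GB, the sizes mentioned in this report.
        long[] sizes = {200_000_000L, 2_300_000_000L, 187_000_000_000L};
        for (long size : sizes) {
            System.out.println(size + " bytes fits in one byte[]: " + fitsInOneArray(size));
        }
        // A tiny in-memory CSV stands in for the S3 object stream.
        byte[] csv = "id,amount\n1,10\n2,20\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("lines streamed: " + countLinesStreaming(new ByteArrayInputStream(csv)));
    }
}
```

Only the ~200 MB size passes the check, which matches the behaviour described above: the 2.3 GB and 187 GB files cannot fit in one array no matter how large the heap is, whereas a reader that consumes the stream incrementally has no such ceiling.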

┆Issue is synchronized with this Jira Task by Unito
