[Bug] Reading local shuffle data in high-pressure scenarios may lead to high system load #1596

@rickyma

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

  1. When Netty is enabled, readMemory is released at the wrong time. The method client.getChannel().writeAndFlush() is asynchronous, and the file is only read after writeAndFlush is called. Releasing readMemory right after the call therefore frees it before the read has happened, so the limit never takes effect. We should add a ChannelFutureListener and release readMemory in its callback, which ensures the writeAndFlush operation has truly completed.

  2. There is no limit on the number of requests for reading local shuffle data that can be processed concurrently; we only cap the readCapacity of the buffer used for these reads. This is not enough. In our tests, the metric read_used_buffer_size reached at most 3.75GB, yet the system load on the shuffle server was already high. This suggests that relying on readCapacity alone to control local shuffle data reads is inadequate.
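
To make the two points above concrete, here is a minimal sketch of the idea, not the actual Uniffle change. It uses only the JDK: a CompletableFuture stands in for the ChannelFuture returned by Netty's writeAndFlush(), and its whenComplete callback plays the role that a ChannelFutureListener registered with addListener() would play in the real server. The class name LocalReadGate and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names come from the Uniffle code base.
final class LocalReadGate {
    // Caps how many local shuffle read requests may be in flight at once.
    private final Semaphore permits;
    // Tracks bytes reserved for in-flight reads (the "readMemory" of the issue).
    private final AtomicLong usedReadMemory = new AtomicLong();

    LocalReadGate(int maxConcurrentReads) {
        this.permits = new Semaphore(maxConcurrentReads);
    }

    // Admits a read only if a permit is free; returns false so the caller can reject or retry.
    boolean tryStartRead(long bytes) {
        if (!permits.tryAcquire()) {
            return false;
        }
        usedReadMemory.addAndGet(bytes);
        return true;
    }

    // Releases memory and permit only once the asynchronous write has finished,
    // whether it succeeded or failed. With Netty this would be done inside a
    // ChannelFutureListener instead of whenComplete (assumption: same semantics).
    void releaseWhenDone(CompletableFuture<?> writeFuture, long bytes) {
        writeFuture.whenComplete((result, error) -> {
            usedReadMemory.addAndGet(-bytes);
            permits.release();
        });
    }

    long usedReadMemory() {
        return usedReadMemory.get();
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        LocalReadGate gate = new LocalReadGate(1);
        // Stand-in for the future returned by an asynchronous writeAndFlush().
        CompletableFuture<Void> write = new CompletableFuture<>();

        System.out.println(gate.tryStartRead(1024)); // true
        gate.releaseWhenDone(write, 1024);
        System.out.println(gate.usedReadMemory());   // 1024, the write is still pending
        System.out.println(gate.tryStartRead(1024)); // false, concurrency limit reached
        write.complete(null);                        // the write finishes
        System.out.println(gate.usedReadMemory());   // 0
    }
}
```

The point of the sketch is the ordering: memory and the concurrency permit stay reserved while the write is pending and are returned in the completion callback, never directly after the asynchronous call returns. Whether rejected requests should fail fast, queue, or be retried by the client is a separate design choice that this sketch does not settle.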

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
