[Serious Bug] Inaccurate flow control leads to Shuffle server OOM when enabling Netty #1472

@rickyma

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

Under high load with Netty enabled, inaccurate flow-control accounting (possibly in usedMemory or preAllocatedMemory) lets the Shuffle server run out of direct memory (OOM).

The SQL used to reproduce the bug (TPC-DS):
select * from (
select s.*, c.* from store_sales s join customer c on s.ss_customer_sk=c.c_customer_sk
) sc DISTRIBUTE BY sc.ss_customer_sk,sc.ss_item_sk;

Affects Version(s)

master

Uniffle Server Log Output

[2024-01-19 03:18:50.483] [Grpc-714] [DEBUG] org.apache.uniffle.server.buffer.ShuffleBufferManager.requireMemory - Require memory succeeded with 1023891 bytes, usedMemory[117888810505] include preAllocation[180869913], inFlushSize[110004009963]
[2024-01-19 03:18:50.485] [Grpc-19] [DEBUG] org.apache.uniffle.server.buffer.ShuffleBufferManager.requireMemory - Require memory succeeded with 3093288 bytes, usedMemory[117891903793] include preAllocation[183963201], inFlushSize[110004009963]
[2024-01-19 03:18:50.485] [epollEventLoopGroup-3-19] [DEBUG] org.apache.uniffle.server.netty.ShuffleServerNettyHandler.handleSendShuffleDataRequest - Cache Shuffle Data for appId[application_1703049085550_1651917_1705170927963], shuffleId[0], cost 0 ms with 52 blocks and 1189625 bytes
[2024-01-19 03:18:50.487] [epollEventLoopGroup-3-44] [WARN] org.apache.uniffle.common.netty.handle.TransportChannelHandler.exceptionCaught - Exception in connection from /9.23.12.144:45176
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 171798691840, max: 171798691840)
        at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
        at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:710)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:685)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:212)
        at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:194)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
        at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
        at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)
        at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decodePartitionData(SendShuffleDataRequest.java:92)
        at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decode(SendShuffleDataRequest.java:104)
        at org.apache.uniffle.common.netty.protocol.Message.decode(Message.java:145)
        at org.apache.uniffle.common.netty.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:72)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
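
To make the mismatch in the log above easier to see, here is a minimal sketch that pulls the counters out of the two requireMemory lines and the OutOfDirectMemoryError line and compares them. The helper names (`parse_counters`, `parse_oom`) are made up for this illustration and are not part of Uniffle. It also assumes that usedMemory and Netty's direct-memory counter are measured in comparable bytes, which the log itself does not confirm.

```python
import re

# Message text copied verbatim from the server log above.
LOG_LINES = [
    "Require memory succeeded with 1023891 bytes, usedMemory[117888810505] "
    "include preAllocation[180869913], inFlushSize[110004009963]",
    "Require memory succeeded with 3093288 bytes, usedMemory[117891903793] "
    "include preAllocation[183963201], inFlushSize[110004009963]",
]
OOM_LINE = (
    "failed to allocate 4194304 byte(s) of direct memory "
    "(used: 171798691840, max: 171798691840)"
)

GIB = 1024 ** 3


def parse_counters(line):
    """Return every name[value] counter of a requireMemory line as a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+)\[(\d+)\]", line)}


def parse_oom(line):
    """Return (used, max) in bytes from the OutOfDirectMemoryError message."""
    match = re.search(r"used: (\d+), max: (\d+)", line)
    return int(match.group(1)), int(match.group(2))


first, latest = (parse_counters(line) for line in LOG_LINES)
direct_used, direct_max = parse_oom(OOM_LINE)

# Bytes Netty reports as allocated that the buffer manager's counter does not cover.
gap = direct_used - latest["usedMemory"]

print(f"usedMemory    : {latest['usedMemory'] / GIB:.1f} GiB")
print(f"inFlushSize   : {latest['inFlushSize'] / GIB:.1f} GiB")
print(f"direct used   : {direct_used / GIB:.1f} GiB (max {direct_max / GIB:.1f} GiB)")
print(f"unaccounted   : {gap / GIB:.1f} GiB")
```

With these log values, the buffer manager reports about 109.8 GiB in use while Netty reports all 160 GiB of direct memory as allocated, so roughly 50 GiB is not visible to the counter that drives flow control. The growth of usedMemory between the two lines equals the size of the second request (3093288 bytes), which suggests the counter itself is updated consistently and that the missing memory is allocated outside of it.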

Uniffle Engine Log Output

No response

Uniffle Server Configurations

xmx:120g
capacity:110g
read.capacity:20g
max_direct_mem:160g
rss.server.netty.epoll.enable true
rss.rpc.server.type GRPC_NETTY
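
As a rough budget check on the configuration above, the sketch below compares the configured capacities with the direct-memory limit. It assumes that the `g` suffix means GiB (the `max` value in the OutOfDirectMemoryError message, 171798691840 bytes, is exactly 160 GiB, which is consistent with that reading for max_direct_mem) and that both capacity and read.capacity are served from direct memory when Netty is enabled. The second assumption is not stated in the report.

```python
GIB = 1024 ** 3

# Server settings from the report, in GiB (assumed unit).
config_gib = {
    "xmx": 120,
    "capacity": 110,
    "read.capacity": 20,
    "max_direct_mem": 160,
}

# "max" value reported by io.netty.util.internal.OutOfDirectMemoryError.
direct_max_from_log = 171798691840

# The configured direct-memory limit matches what Netty reports.
limit_matches = config_gib["max_direct_mem"] * GIB == direct_max_from_log

# Memory that flow control is expected to account for, under the assumption above.
tracked_gib = config_gib["capacity"] + config_gib["read.capacity"]

# Direct memory left over for anything flow control does not track.
headroom_gib = config_gib["max_direct_mem"] - tracked_gib

print(f"limit matches log : {limit_matches}")
print(f"tracked budget    : {tracked_gib} GiB")
print(f"untracked headroom: {headroom_gib} GiB")
```

Under these assumptions the tracked budget is 130 GiB, leaving 30 GiB of headroom, so hitting the 160 GiB limit means allocations beyond the tracked budget consumed all of that headroom.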

Uniffle Engine Configurations

set spark.sql.files.maxPartitionBytes=1073741824;
set spark.executor.cores=8;
set spark.task.cpus=4;
set spark.executor.memory=49g;
set spark.driver.memory=20g;
set spark.dynamicAllocation.maxExecutors=150;
set spark.dynamicAllocation.minExecutors=150;
set spark.rss.client.type = GRPC_NETTY;
set spark.rss.client.netty.io.mode = EPOLL;

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
