Closed
Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
Under high load, inaccurate flow control (possibly in the accounting of usedMemory or preAllocatedMemory) leads to an OOM on the shuffle server.
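As a rough illustration of the suspected gap, the sketch below compares figures taken from the server log and configuration in this report. It assumes that `capacity:110g` means 110 GiB and that `usedMemory` and the Netty direct-memory counter are comparable byte counts; the class and method names are hypothetical and not part of Uniffle.

```java
// Hypothetical helper, not Uniffle code: compares the buffer manager's
// tracked usage with the direct memory Netty reports as allocated.
class MemoryGap {
    static final long GIB = 1024L * 1024 * 1024;

    // Bytes the flow-control counter still considers free.
    static long headroom(long capacity, long trackedUsedMemory) {
        return capacity - trackedUsedMemory;
    }

    // Direct memory in use that the tracked counter does not cover.
    static long unaccountedBytes(long nettyDirectUsed, long trackedUsedMemory) {
        return nettyDirectUsed - trackedUsedMemory;
    }

    public static void main(String[] args) {
        long capacity = 110L * GIB;            // capacity:110g (assumed GiB)
        long usedMemory = 117_891_903_793L;    // last usedMemory[...] value in the log
        long directUsed = 171_798_691_840L;    // "used" in OutOfDirectMemoryError
        long directMax = 160L * GIB;           // max_direct_mem:160g

        System.out.println("tracked headroom (bytes): " + headroom(capacity, usedMemory));
        System.out.println("untracked direct memory (bytes): " + unaccountedBytes(directUsed, usedMemory));
        System.out.println("direct memory exhausted: " + (directUsed >= directMax));
    }
}
```

With these inputs the tracked counter is still below capacity while Netty has no direct memory left, and roughly 50 GiB of direct memory is not covered by usedMemory. That is consistent with the accounting missing some allocations, though it does not show where.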
The SQL used to reproduce the bug:
tpcds:
select * from (
select s.*, c.* from store_sales s join customer c on s.ss_customer_sk=c.c_customer_sk
) sc DISTRIBUTE BY sc.ss_customer_sk,sc.ss_item_sk;
Affects Version(s)
master
Uniffle Server Log Output
[2024-01-19 03:18:50.483] [Grpc-714] [DEBUG] org.apache.uniffle.server.buffer.ShuffleBufferManager.requireMemory - Require memory succeeded with 1023891 bytes, usedMemory[117888810505] include preAllocation[180869913], inFlushSize[110004009963]
[2024-01-19 03:18:50.485] [Grpc-19] [DEBUG] org.apache.uniffle.server.buffer.ShuffleBufferManager.requireMemory - Require memory succeeded with 3093288 bytes, usedMemory[117891903793] include preAllocation[183963201], inFlushSize[110004009963]
[2024-01-19 03:18:50.485] [epollEventLoopGroup-3-19] [DEBUG] org.apache.uniffle.server.netty.ShuffleServerNettyHandler.handleSendShuffleDataRequest - Cache Shuffle Data for appId[application_1703049085550_1651917_1705170927963], shuffleId[0], cost 0 ms with 52 blocks and 1189625 bytes
[2024-01-19 03:18:50.487] [epollEventLoopGroup-3-44] [WARN] org.apache.uniffle.common.netty.handle.TransportChannelHandler.exceptionCaught - Exception in connection from /9.23.12.144:45176
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 171798691840, max: 171798691840)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:710)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:685)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:212)
at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:194)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decodePartitionData(SendShuffleDataRequest.java:92)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decode(SendShuffleDataRequest.java:104)
at org.apache.uniffle.common.netty.protocol.Message.decode(Message.java:145)
at org.apache.uniffle.common.netty.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:72)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
Uniffle Engine Log Output
No response
Uniffle Server Configurations
xmx:120g
capacity:110g
read.capacity:20g
max_direct_mem:160g
rss.server.netty.epoll.enable true
rss.rpc.server.type GRPC_NETTY
Uniffle Engine Configurations
set spark.sql.files.maxPartitionBytes=1073741824;
set spark.executor.cores=8;
set spark.task.cpus=4;
set spark.executor.memory=49g;
set spark.driver.memory=20g;
set spark.dynamicAllocation.maxExecutors=150;
set spark.dynamicAllocation.minExecutors=150;
set spark.rss.client.type = GRPC_NETTY;
set spark.rss.client.netty.io.mode = EPOLL;
Additional context
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
