This repository contains Yotta Labs' optimized kernels for the AMD Developer Challenge 2025: Distributed Inference Kernels. It provides high-performance implementations of three critical distributed GPU kernels for single-node 8× AMD MI300X systems.
The challenge focuses on optimizing three fundamental distributed primitives that are essential for modern large language model (LLM) training and inference:
- All-to-All Communication for Mixture-of-Experts (MoE) models
- GEMM-ReduceScatter for tensor parallelism
- AllGather-GEMM for distributed inference
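The collective patterns these fused kernels build on can be pinned down with a small host-side sketch. The snippet below (pure Python, toy sizes of our choosing, not taken from the challenge) models the semantics of reduce-scatter and all-gather that the GEMM-ReduceScatter and AllGather-GEMM kernels fuse with the matrix multiply:

```python
# Toy model of the collective semantics the fused kernels implement.
# Rank count and vector lengths are illustrative only.

def reduce_scatter(partials):
    """Each rank contributes a full-length partial result; rank r keeps
    shard r of the element-wise sum (the pattern GEMM-ReduceScatter fuses)."""
    n_ranks = len(partials)
    total = [sum(vals) for vals in zip(*partials)]
    shard = len(total) // n_ranks
    return [total[r * shard:(r + 1) * shard] for r in range(n_ranks)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards
    (the pattern AllGather-GEMM fuses in front of the GEMM)."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

partials = [[r] * 8 for r in range(4)]   # rank r holds eight copies of r
print(reduce_scatter(partials))          # each rank keeps 2 elements of the sum
print(all_gather([[0, 1], [2, 3], [4, 5], [6, 7]])[0])
```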
**Symmetric memory**: memory allocated with an identical layout across all GPUs, enabling direct remote writes at the same relative offset without complex per-peer addressing.
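Because every GPU allocates the buffer with the same layout, a peer's copy of any element lives at the same offset from that peer's base pointer. A minimal sketch of that addressing rule (plain Python lists standing in for device buffers; all names here are ours, not the repository's):

```python
# Eight "GPUs", each with an identically laid-out buffer (symmetric layout).
NUM_GPUS = 8
BUF_LEN = 16
buffers = [[0] * BUF_LEN for _ in range(NUM_GPUS)]  # buffers[g] ~ base pointer on GPU g

def remote_write(dst_gpu, offset, value):
    """With a symmetric layout, a remote write needs only (peer id, offset):
    the address is the peer's base pointer plus the same relative offset."""
    buffers[dst_gpu][offset] = value

# GPU 0 writes a flag into slot 0 of every peer at the same relative offset.
for peer in range(NUM_GPUS):
    remote_write(peer, offset=0, value=42)
print([buf[0] for buf in buffers])  # every GPU sees the value at offset 0
```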
**IPC memory handles**: direct GPU-to-GPU memory access via `hipIpcGetMemHandle`/`hipIpcOpenMemHandle`, allowing zero-copy data transfer between processes.
**XCD (Accelerator Complex Die)**: a compute chiplet of AMD's MI300-series GPUs. Our optimizations remap thread blocks across the MI300X's 8 XCDs for maximum parallelism.
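A common form of this remapping assumes the hardware dispatches consecutive workgroup IDs round-robin across the 8 XCDs; a block-index swizzle can then regroup work so that blocks sharing data land on the same XCD. This sketch shows the index arithmetic only, under that round-robin assumption; it is not necessarily the repository's exact scheme:

```python
NUM_XCDS = 8  # the MI300X has 8 XCDs

def xcd_remap(block_id, num_blocks, num_xcds=NUM_XCDS):
    """Invert round-robin dispatch: after this swizzle, the hardware's
    round-robin assignment places runs of consecutive logical blocks
    on the same XCD, improving L2 locality within each chiplet."""
    xcd = block_id % num_xcds            # XCD this physical block runs on
    slot = block_id // num_xcds          # its position within that XCD
    blocks_per_xcd = num_blocks // num_xcds
    return xcd * blocks_per_xcd + slot   # logical block to execute

# With 16 blocks on 8 XCDs: physical blocks 0 and 8 (both on XCD 0) now run
# logical blocks 0 and 1, which are adjacent and likely to share data.
print([xcd_remap(b, 16) for b in range(16)])
```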
**Cache modifiers**: directives such as `.cg` (cache global) and `.cv` (cache volatile) that control cache behavior of GPU memory operations.
Our optimizations demonstrate significant performance improvements through:
- Communication-computation overlap
- Reduced memory allocations
- Hardware-aware optimizations
- Custom launchers and barriers
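Of these, communication-computation overlap typically contributes the most: the output is split into chunks, and chunk k's communication is issued while chunk k+1 is still being computed. A toy makespan model of that pipeline (our illustration; real code would use HIP streams and events, and the timings below are invented):

```python
def pipeline_time(n_chunks, t_compute, t_comm):
    """Makespan when chunk k's communication overlaps chunk k+1's compute:
    one in-order compute queue plus one in-order copy queue."""
    compute_done = comm_done = 0.0
    for _ in range(n_chunks):
        compute_done += t_compute                  # compute chunks run back to back
        comm_done = max(comm_done, compute_done) + t_comm  # send starts when its chunk is ready
    return comm_done

serial = 4 * (10 + 6)  # no overlap: 4 chunks * (compute + comm) = 64 time units
print(pipeline_time(4, 10, 6), "vs serial", serial)  # 46 vs 64
```

With these (made-up) numbers, overlapping hides all but the last chunk's communication behind compute.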
The geometric mean performance metric ensures solutions perform well across diverse workloads rather than being tuned for specific cases.
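Concretely, the geometric mean penalizes a single bad benchmark far more than an arithmetic mean would, so over-tuning for one case does not pay off. A small demonstration (the speedup numbers are made up):

```python
import math

def geomean(xs):
    """Geometric mean computed via logs to avoid overflow on long products."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

balanced = [2.0, 2.0, 2.0]   # consistent speedups across all workloads
spiky    = [8.0, 1.5, 0.5]   # tuned for one case, regressed on another
print(round(geomean(balanced), 3), round(geomean(spiky), 3))  # 2.0 vs ~1.817
```

The spiky profile wins on the arithmetic mean (about 3.33 vs 2.0) but loses on the geometric mean, which is the behavior the scoring rewards.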
- AMD Developer Challenge 2025
- AMD Instinct MI300X Accelerator
- ROCm Documentation
- Reference Kernels Repository
- Yotta Labs Blog: Optimizing Distributed Inference Kernels for AMD Developer Challenge 2025
We thank AMD for organizing the Developer Challenge 2025 and InnoMatrix.ai for providing access to MI300X hardware. We also thank GPUMode and all the organizers for making this competition possible.