
AMD Developer Challenge 2025: Distributed Inference Kernels

This repository contains Yotta Labs' optimized implementations for the AMD Developer Challenge 2025: Distributed Inference Kernels. We developed high-performance implementations of three critical distributed GPU kernels for single-node 8× AMD MI300X configurations.

Overview

The challenge focuses on optimizing three fundamental distributed primitives that are essential for modern large language model (LLM) training and inference:

  1. All-to-All Communication for Mixture-of-Experts (MoE) models
  2. GEMM-ReduceScatter for tensor parallelism
  3. AllGather-GEMM for distributed inference

Key Technical Concepts

Symmetric Heap

Memory allocated with identical layout across multiple GPUs, enabling direct remote writes at the same relative offsets without complex addressing.

IPC (Inter-Process Communication)

Direct GPU-to-GPU memory access using hipIpcGetMemHandle/hipIpcOpenMemHandle for zero-copy data transfer.
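The handle exchange in HIP looks roughly like the following sketch (error handling and the inter-process transport for the handle are omitted; this is illustrative, not code from the repository, and it needs a ROCm system plus two processes to actually run):

```cpp
// Producer process: export a handle for a device allocation.
hipIpcMemHandle_t handle;
void* dev_buf = nullptr;
hipMalloc(&dev_buf, bytes);
hipIpcGetMemHandle(&handle, dev_buf);
// ... send `handle` to the peer process (e.g. over a pipe or shared memory)

// Consumer process: map the same allocation for zero-copy access.
void* peer_buf = nullptr;
hipIpcOpenMemHandle(&peer_buf, handle, hipIpcMemLazyEnablePeerAccess);
// ... read/write peer_buf directly, then:
hipIpcCloseMemHandle(peer_buf);
```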

XCD (eXtreme Compute Die)

A compute chiplet on AMD's CDNA accelerators; the MI300X packages 8 XCDs, each with its own L2 cache. Our optimizations remap thread blocks across the MI300X's 8 XCDs for maximum parallelism and cache locality.

Cache Modifiers

Directives like .cg (cache global) and .cv (cache volatile) for controlling cache behavior in GPU memory operations.

Performance Results

Our optimizations demonstrate significant performance improvements through:

  • Communication-computation overlap
  • Reduced memory allocations
  • Hardware-aware optimizations
  • Custom launchers and barriers

The geometric mean performance metric ensures solutions perform well across diverse workloads rather than being tuned for specific cases.

References

  1. AMD Developer Challenge 2025
  2. AMD Instinct MI300X Accelerator
  3. ROCm Documentation
  4. Reference Kernels Repository
  5. Yotta Labs Blog: Optimizing Distributed Inference Kernels for AMD Developer Challenge 2025

Acknowledgments

We thank AMD for organizing the Developer Challenge 2025, InnoMatrix.ai for providing access to MI300X hardware, and GPUMode and all the organizers for making this competition possible.