
AMD Developer Challenge 2025: Distributed Inference Kernels

This repository contains Yotta Labs' optimized implementations for the AMD Developer Challenge 2025: Distributed Inference Kernels. We developed high-performance implementations of three critical distributed GPU kernels for single-node 8× AMD MI300X configurations.

Overview

The challenge focuses on optimizing three fundamental distributed primitives that are essential for modern large language model (LLM) training and inference:

  1. All-to-All Communication for Mixture-of-Experts (MoE) models
  2. GEMM-ReduceScatter for tensor parallelism
  3. AllGather-GEMM for distributed inference

Key Technical Concepts

Symmetric Heap

Memory allocated with identical layout across multiple GPUs, enabling direct remote writes at the same relative offsets without complex addressing.

IPC (Inter-Process Communication)

Direct GPU-to-GPU memory access using hipIpcGetMemHandle/hipIpcOpenMemHandle for zero-copy data transfer.
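The handle exchange in HIP looks roughly like the following sketch (error handling and the inter-process transport for the handle are omitted; this is illustrative, not code from the repository, and it needs a ROCm system plus two processes to actually run):

```cpp
// Producer process: export a handle for a device allocation.
hipIpcMemHandle_t handle;
void* dev_buf = nullptr;
hipMalloc(&dev_buf, bytes);
hipIpcGetMemHandle(&handle, dev_buf);
// ... send `handle` to the peer process (e.g. over a pipe or shared memory)

// Consumer process: map the same allocation for zero-copy access.
void* peer_buf = nullptr;
hipIpcOpenMemHandle(&peer_buf, handle, hipIpcMemLazyEnablePeerAccess);
// ... read/write peer_buf directly, then:
hipIpcCloseMemHandle(peer_buf);
```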

XCD (eXtreme Compute Die)

A compute chiplet on AMD's CDNA accelerators; the MI300X packages 8 XCDs, each with its own L2 cache. Our optimizations remap thread blocks across the MI300X's 8 XCDs for maximum parallelism and cache locality.

Cache Modifiers

Directives like .cg (cache global) and .cv (cache volatile) for controlling cache behavior in GPU memory operations.

Performance Results

Our optimizations demonstrate significant performance improvements through:

  • Communication-computation overlap
  • Reduced memory allocations
  • Hardware-aware optimizations
  • Custom launchers and barriers

The geometric mean performance metric ensures solutions perform well across diverse workloads rather than being tuned for specific cases.

References

  1. AMD Developer Challenge 2025
  2. AMD Instinct MI300X Accelerator
  3. ROCm Documentation
  4. Reference Kernels Repository
  5. Yotta Labs Blog: Optimizing Distributed Inference Kernels for AMD Developer Challenge 2025

Acknowledgments

We thank AMD for organizing the Developer Challenge 2025, InnoMatrix.ai for providing access to MI300X hardware, and GPUMode and all the organizers for making this competition possible.