This repository contains Yotta Labs' optimized kernels for the AMD Developer Challenge 2025: Distributed Inference Kernels. It provides high-performance implementations of three critical distributed GPU kernels for single-node 8× AMD MI300X systems.
The challenge focuses on optimizing three fundamental distributed primitives that are essential for modern large language model (LLM) training and inference:
- All-to-All Communication for Mixture-of-Experts (MoE) models
- GEMM-ReduceScatter for tensor parallelism
- AllGather-GEMM for distributed inference
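The collective patterns these fused kernels build on can be pinned down with a small host-side sketch. The snippet below (pure Python, toy sizes of our choosing, not taken from the challenge) models the semantics of reduce-scatter and all-gather that the GEMM-ReduceScatter and AllGather-GEMM kernels fuse with the matrix multiply:

```python
# Toy model of the collective semantics the fused kernels implement.
# Rank count and vector lengths are illustrative only.

def reduce_scatter(partials):
    """Each rank contributes a full-length partial result; rank r keeps
    shard r of the element-wise sum (the pattern GEMM-ReduceScatter fuses)."""
    n_ranks = len(partials)
    total = [sum(vals) for vals in zip(*partials)]
    shard = len(total) // n_ranks
    return [total[r * shard:(r + 1) * shard] for r in range(n_ranks)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards
    (the pattern AllGather-GEMM fuses in front of the GEMM)."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

partials = [[r] * 8 for r in range(4)]   # rank r holds eight copies of r
print(reduce_scatter(partials))          # each rank keeps 2 elements of the sum
print(all_gather([[0, 1], [2, 3], [4, 5], [6, 7]])[0])
```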
**Symmetric memory**: memory allocated with an identical layout across all GPUs, enabling direct remote writes at the same relative offset without complex per-peer addressing.
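Because every GPU allocates the buffer with the same layout, a peer's copy of any element lives at the same offset from that peer's base pointer. A minimal sketch of that addressing rule (plain Python lists standing in for device buffers; all names here are ours, not the repository's):

```python
# Eight "GPUs", each with an identically laid-out buffer (symmetric layout).
NUM_GPUS = 8
BUF_LEN = 16
buffers = [[0] * BUF_LEN for _ in range(NUM_GPUS)]  # buffers[g] ~ base pointer on GPU g

def remote_write(dst_gpu, offset, value):
    """With a symmetric layout, a remote write needs only (peer id, offset):
    the address is the peer's base pointer plus the same relative offset."""
    buffers[dst_gpu][offset] = value

# GPU 0 writes a flag into slot 0 of every peer at the same relative offset.
for peer in range(NUM_GPUS):
    remote_write(peer, offset=0, value=42)
print([buf[0] for buf in buffers])  # every GPU sees the value at offset 0
```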
**IPC memory handles**: direct GPU-to-GPU memory access via `hipIpcGetMemHandle`/`hipIpcOpenMemHandle`, allowing zero-copy data transfer between processes.
**XCD (Accelerator Complex Die)**: a compute chiplet of AMD's MI300-series GPUs. Our optimizations remap thread blocks across the MI300X's 8 XCDs for maximum parallelism.
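A common form of this remapping assumes the hardware dispatches consecutive workgroup IDs round-robin across the 8 XCDs; a block-index swizzle can then regroup work so that blocks sharing data land on the same XCD. This sketch shows the index arithmetic only, under that round-robin assumption; it is not necessarily the repository's exact scheme:

```python
NUM_XCDS = 8  # the MI300X has 8 XCDs

def xcd_remap(block_id, num_blocks, num_xcds=NUM_XCDS):
    """Invert round-robin dispatch: after this swizzle, the hardware's
    round-robin assignment places runs of consecutive logical blocks
    on the same XCD, improving L2 locality within each chiplet."""
    xcd = block_id % num_xcds            # XCD this physical block runs on
    slot = block_id // num_xcds          # its position within that XCD
    blocks_per_xcd = num_blocks // num_xcds
    return xcd * blocks_per_xcd + slot   # logical block to execute

# With 16 blocks on 8 XCDs: physical blocks 0 and 8 (both on XCD 0) now run
# logical blocks 0 and 1, which are adjacent and likely to share data.
print([xcd_remap(b, 16) for b in range(16)])
```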
**Cache modifiers**: directives such as `.cg` (cache global) and `.cv` (cache volatile) that control cache behavior of GPU memory operations.
Our optimizations demonstrate significant performance improvements through:
- Communication-computation overlap
- Reduced memory allocations
- Hardware-aware optimizations
- Custom launchers and barriers
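Of these, communication-computation overlap typically contributes the most: the output is split into chunks, and chunk k's communication is issued while chunk k+1 is still being computed. A toy makespan model of that pipeline (our illustration; real code would use HIP streams and events, and the timings below are invented):

```python
def pipeline_time(n_chunks, t_compute, t_comm):
    """Makespan when chunk k's communication overlaps chunk k+1's compute:
    one in-order compute queue plus one in-order copy queue."""
    compute_done = comm_done = 0.0
    for _ in range(n_chunks):
        compute_done += t_compute                  # compute chunks run back to back
        comm_done = max(comm_done, compute_done) + t_comm  # send starts when its chunk is ready
    return comm_done

serial = 4 * (10 + 6)  # no overlap: 4 chunks * (compute + comm) = 64 time units
print(pipeline_time(4, 10, 6), "vs serial", serial)  # 46 vs 64
```

With these (made-up) numbers, overlapping hides all but the last chunk's communication behind compute.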
The geometric mean performance metric ensures solutions perform well across diverse workloads rather than being tuned for specific cases.
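Concretely, the geometric mean penalizes a single bad benchmark far more than an arithmetic mean would, so over-tuning for one case does not pay off. A small demonstration (the speedup numbers are made up):

```python
import math

def geomean(xs):
    """Geometric mean computed via logs to avoid overflow on long products."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

balanced = [2.0, 2.0, 2.0]   # consistent speedups across all workloads
spiky    = [8.0, 1.5, 0.5]   # tuned for one case, regressed on another
print(round(geomean(balanced), 3), round(geomean(spiky), 3))  # 2.0 vs ~1.817
```

The spiky profile wins on the arithmetic mean (about 3.33 vs 2.0) but loses on the geometric mean, which is the behavior the scoring rewards.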
- AMD Developer Challenge 2025
- AMD Instinct MI300X Accelerator
- ROCm Documentation
- Reference Kernels Repository
- Yotta Labs Blog: Optimizing Distributed Inference Kernels for AMD Developer Challenge 2025
We thank AMD for organizing the Developer Challenge 2025 and InnoMatrix.ai for providing access to MI300X hardware. We also thank GPUMode and all the organizers for making this competition possible.