Conversation
Summary of Changes
Hello @artetaout, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a significant optimization to our Tensor Parallelism strategy by enabling the overlap of GEMM computations with AllReduce communication. This change introduces specialized operators for attention and MLP layers, designed to reduce communication overhead and enhance the overall efficiency of distributed model inference. The new functionality is integrated into the model's linear layers and can be activated via an environment variable, providing a flexible way to leverage this performance improvement.
Highlights
- Tensor Parallelism Overlap: I've introduced support for overlapping GEMM (General Matrix Multiply) operations with AllReduce communication during Tensor Parallelism. This aims to improve performance by hiding communication latency.
- New GEMM+AllReduce Operators: I've added two new GEMM+AllReduce operators, specifically designed for attention (ATTN) and Multi-Layer Perceptron (MLP) operations, leveraging the `triton_dist` library.
- Distributed State Management: The system now initializes an NVSHMEM-enabled GLOO process group for Tensor Parallelism, which is crucial for the new overlap functionality.
- Conditional Operator Execution: The `Linear` layer's forward pass has been modified to conditionally use these new overlapped operators based on the `SGL_USE_TP_OVERLAP` environment variable, allowing for easy activation of this feature (a sketch of this dispatch appears below).
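To make the dispatch concrete, here is a minimal sketch of an env-gated row-parallel linear. It is not the PR's actual code: the `fused_op` hook is a hypothetical placeholder for the Triton-Distributed GEMM+AllReduce kernels, and only the `SGL_USE_TP_OVERLAP` variable comes from the PR itself.

```python
# Illustrative sketch only: shows the shape of the env-gated dispatch described
# above. The fused-op helper is a hypothetical placeholder, not the PR's API.
import os

import torch
import torch.distributed as dist

_USE_TP_OVERLAP = os.environ.get("SGL_USE_TP_OVERLAP", "0") == "1"


def row_parallel_linear(x: torch.Tensor, weight: torch.Tensor, fused_op=None) -> torch.Tensor:
    """Row-parallel linear: local GEMM followed by an AllReduce over the TP group."""
    if _USE_TP_OVERLAP and fused_op is not None:
        # Hypothetical fused GEMM+AllReduce kernel (e.g. built on Triton-Distributed)
        # that hides the AllReduce latency behind the GEMM tiles.
        return fused_op(x, weight)
    # Baseline path: compute locally, then reduce across tensor-parallel ranks.
    out = torch.matmul(x, weight.t())
    dist.all_reduce(out)
    return out
```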
Code Review
This pull request introduces support for overlapping Tensor Parallelism communication (AllReduce) with computation (GEMM) to improve performance. This is achieved by adding new GEMM+AllReduce operators for attention and MLP layers, controlled by an environment variable. The changes look promising, but I've found a critical typo that needs to be fixed, along with some leftover debug code and unused imports that should be cleaned up.
@artetaout Nice work!
Thanks! We lack AMD GPUs to develop and debug the code. But Triton-Distributed itself supports AMD, so you can try to add support for it!
Hi! I'm interested in using this feature. Is there any update on the status of this PR or what's blocking it from being merged? @artetaout @FlamingoPg |
Great work! By the way, we may need to discuss how to install Triton-distributed for use in sgl. Should we compile it within sgl-kernel, or can we install it directly via pip?
It needs to uninstall |
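Whatever packaging route is chosen, one common pattern is to treat `triton_dist` as an optional dependency so the overlap path only activates when the library is importable. Below is a minimal sketch under that assumption; the import name comes from the bot summary above, but the guard itself is not from this PR.

```python
# Sketch of an optional-dependency guard; the fallback behavior is an
# assumption, not this PR's actual packaging decision.
import importlib.util
import os

HAS_TRITON_DIST = importlib.util.find_spec("triton_dist") is not None


def tp_overlap_enabled() -> bool:
    """Enable GEMM+AllReduce overlap only when requested AND the library is present."""
    requested = os.environ.get("SGL_USE_TP_OVERLAP", "0") == "1"
    if requested and not HAS_TRITON_DIST:
        raise RuntimeError(
            "SGL_USE_TP_OVERLAP=1 but triton_dist is not installed; "
            "install Triton-Distributed or unset the flag."
        )
    return requested and HAS_TRITON_DIST
```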
Hello, I have a question about the shmem allocation in this implementation. I am wondering if there is a good way to co-use the allocated shmem buffer for the same type of layer (the MLP and O_proj layers).
It also seems that the PR test errors indicate there is not enough memory, because the shared-memory allocation takes too much of it.
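One way to co-use the buffer, sketched below, is to cache one workspace per (shape, dtype) and hand the same tensor to every layer whose GEMM+AllReduce has matching dimensions (e.g. all O_proj layers, or all MLP down-projections). The `alloc_symm_buffer` callable here is a hypothetical placeholder, not a Triton-Distributed API.

```python
# Hypothetical sketch of sharing one shmem workspace per (shape, dtype).
# `alloc_symm_buffer` stands in for whatever symmetric-memory allocator the
# real integration uses; it is not an actual Triton-Distributed function.
from typing import Callable, Dict, Tuple

import torch

_WORKSPACES: Dict[Tuple[tuple, torch.dtype], torch.Tensor] = {}


def get_shared_workspace(
    shape: tuple,
    dtype: torch.dtype,
    alloc_symm_buffer: Callable[[tuple, torch.dtype], torch.Tensor],
) -> torch.Tensor:
    """Return one shared workspace per (shape, dtype) instead of one per layer."""
    key = (shape, dtype)
    if key not in _WORKSPACES:
        _WORKSPACES[key] = alloc_symm_buffer(shape, dtype)
    return _WORKSPACES[key]
```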
… into feat/overlap
Agree, this
Motivation
We overlap GEMM with AllReduce during Tensor Parallelism via Triton-Distributed to speed it up!
https://github.com/ByteDance-Seed/Triton-distributed
Modifications
Accuracy Tests
In addition, every layer's hidden_states are checked to be allclose with the original implementation's.
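A minimal sketch of such a per-layer check is shown below; the function name and tolerances are illustrative, not the exact test used for this PR.

```python
# Illustrative per-layer hidden_states comparison; tolerances are placeholders,
# not the exact values used in this PR's accuracy tests.
import torch


def compare_hidden_states(ref_outputs, tp_overlap_outputs, atol=1e-2, rtol=1e-2):
    """Check that every layer's hidden_states match the baseline within tolerance."""
    for i, (ref, out) in enumerate(zip(ref_outputs, tp_overlap_outputs)):
        if not torch.allclose(ref, out, atol=atol, rtol=rtol):
            max_diff = (ref - out).abs().max().item()
            raise AssertionError(f"layer {i}: hidden_states mismatch, max abs diff {max_diff}")
```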
Benchmarking and Profiling
bench_one_batch
bench_serving
Checklist