[CPU] Fix TP padding case with weight block size #8243
Conversation
Summary of Changes
Hello @jianan-gu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a tensor parallelism (TP) padding issue on CPU, specifically when a `weight_block_size` is set (i.e., block-wise quantized weights). It refines the padding logic for attention heads by introducing `head_dim` as a critical parameter, ensuring that padding accounts for cases where head dimensions are not aligned with the weight block size. This change improves the robustness of CPU-based tensor parallelism configurations.
Highlights
- Refined Padding Logic: The `get_num_heads_padding_size` function now incorporates `head_dim` into its padding calculation. Previously, padding was applied if `tp_size` was odd and `weight_block_size` was present. The updated logic adds a further condition: padding is also applied if `head_dim` is not evenly divisible by the first element of `weight_block_size` (i.e., `weight_block_size[0]`), ensuring better alignment for tensor parallelism (a sketch of this rule follows the list).
- Dynamic Head Dimension Calculation: The `adjust_config_with_unaligned_cpu_tp` function has been enhanced to dynamically determine the appropriate `head_dim` to pass to the padding function. It now checks for the presence of `qk_nope_head_dim` and `qk_rope_head_dim` to compute a combined `qk_head_dim`, which is then prioritized as the `head_dim` for padding calculations. This ensures that models with more complex head dimension configurations (e.g., different attention head types) are handled correctly.
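For readers who want the rule spelled out, here is a minimal sketch of the padding condition described above. The function name and parameters mirror the description in the highlights, but the body and the return convention (the multiple that `num_heads` is rounded up to) are assumptions, not the exact SGLang code.

```python
from typing import List, Optional


def get_num_heads_padding_size(
    tp_size: int,
    weight_block_size: Optional[List[int]],
    head_dim: Optional[int] = None,
) -> int:
    """Return the multiple that num_heads should be rounded up to for CPU TP."""
    if weight_block_size is None:
        # No block-wise quantization: aligning num_heads to tp_size is enough.
        return tp_size
    # With block-wise quantization, double the padding granularity when either
    # tp_size is odd or head_dim is not a multiple of weight_block_size[0],
    # so that each shard's output partition stays block-aligned.
    misaligned = tp_size % 2 == 1 or (
        head_dim is not None and head_dim % weight_block_size[0] != 0
    )
    return tp_size * 2 if misaligned else tp_size


def pad_num_heads(num_heads: int, pad_to: int) -> int:
    """Round num_heads up to the next multiple of pad_to."""
    return -(-num_heads // pad_to) * pad_to


# Kimi-K2 (FP8) example from this PR: 64 heads, qk_head_dim 192, TP=6, block 128.
pad_to = get_num_heads_padding_size(6, [128, 128], head_dim=192)
assert pad_num_heads(64, pad_to) == 72
```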
Code Review
This pull request addresses a padding issue for Tensor Parallelism on CPUs when using block-wise quantized weights. The changes introduce a new condition for padding based on the model's head_dim and the weight_block_size. The logic correctly identifies the head_dim for various model architectures and passes it to an updated padding calculation function.
I reckon we still need to move fast on DP-Attn and EPMoE to skip padding for the head dimension... having TP=6 is not ideal in any scenario.
Yes, agreed and noted.
Motivation
Fixes the Kimi-K2-Instruct (FP8) TP=6 failure on CPU.
ValueError: Weight output_partition_size = 2112 is not divisible by weight quantization block_n = 128.
Modifications
Consider `weight_block_size` when padding for TP, so that `self.num_heads * self.qk_head_dim / tp_size` is divisible by `weight_block_size` in `ColumnParallelLinear`.
In our case, `self.num_heads = 64`, `self.qk_head_dim = 192`, `tp_size = 6`, `block_size = 128`.
Before this PR, `self.num_heads` is padded to 66, and `self.num_heads * self.qk_head_dim / tp_size` = 2112, which is not divisible by 128.
After this PR, `self.num_heads` is padded to 72, and `self.num_heads * self.qk_head_dim / tp_size` = 2304, which is divisible by 128.
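A quick way to sanity-check those numbers (illustrative only, not taken from the PR diff):

```python
qk_head_dim, tp_size, block_n = 192, 6, 128

before = 66 * qk_head_dim // tp_size  # num_heads padded to 66 before this PR
after = 72 * qk_head_dim // tp_size   # num_heads padded to 72 after this PR

print(before, before % block_n)  # 2112 64 -> not block-aligned, triggers the ValueError
print(after, after % block_n)    # 2304 0  -> block-aligned
```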
Misc.
This PR also includes a minor fix for the unquantized MoE module with the `apply_router_weight_on_input` config, since the CPU AMX path now supports it.
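For context, `apply_router_weight_on_input` means the router's top-k weights are multiplied into the expert inputs before the expert GEMMs rather than into the outputs afterwards. The sketch below only illustrates that idea; the function, argument names, and looping structure are assumptions, not the SGLang CPU AMX implementation.

```python
import torch


def moe_forward(
    x: torch.Tensor,             # [num_tokens, hidden]
    topk_weights: torch.Tensor,  # [num_tokens, topk]
    topk_ids: torch.Tensor,      # [num_tokens, topk], expert index per slot
    experts: list,               # per-expert callables, e.g. small MLPs
    apply_router_weight_on_input: bool,
) -> torch.Tensor:
    out = torch.zeros_like(x)
    for k in range(topk_ids.shape[1]):
        w = topk_weights[:, k : k + 1]  # [num_tokens, 1]
        # Scale the activations before the expert when the flag is set,
        # otherwise scale the expert outputs afterwards.
        expert_in = x * w if apply_router_weight_on_input else x
        for e, expert in enumerate(experts):
            mask = topk_ids[:, k] == e
            if mask.any():
                y = expert(expert_in[mask])
                out[mask] += y if apply_router_weight_on_input else y * w[mask]
    return out
```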