Conversation
Merge reduce scatter op
fix reduce scatter return
ag & rs fix and zero buff version update
Summary of ChangesHello @littleyellowbicycle, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new ZCCL (Zero-Copy Collective Communication Library) module, providing optimized collective communication primitives for Ascend NPUs. It includes the core implementations for AllReduce, AllGather, and ReduceScatter operations, meticulously crafted for both host and device execution. The module incorporates advanced tiling strategies to efficiently manage data distribution and processing across multiple cores. Accompanying these implementations is a robust testing framework, complete with build scripts, data generation utilities, and performance analysis tools, ensuring the correctness and efficiency of the new collective operations. It's noted that the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces new collective communication operations (AllReduce, AllGather, ReduceScatter) under a zccl module, along with corresponding tests. While the additions are significant, the current implementation contains several critical issues that must be addressed. These include use-after-free bugs due to missing stream synchronization before freeing resources, memory management errors such as leaks and incorrect deallocation, and logical flaws in the kernel implementations. I've provided detailed comments and suggestions to fix these problems.
| aclrtFreeHost(tiling_host.get()); | ||
| aclrtFree(tiling_device_ptr); | ||
| shmem_free(gva); |
There was a problem hiding this comment.
The kernel is launched asynchronously, but resources (tiling_device_ptr, gva) are freed immediately without waiting for the kernel to complete. This is a use-after-free bug. You must synchronize the stream before freeing these resources. Also, aclrtFreeHost(tiling_host.get()) is incorrect as tiling_host is managed by a shared_ptr and was not allocated with aclrtMallocHost.
aclrtSynchronizeStream(stream);
aclrtFree(tiling_device_ptr);
shmem_free(gva);| uint32_t rankSize = shmem_team_n_pes(teamId); | ||
| ZCCLDataType zcclDataType = static_cast<ZCCLDataType>(dataType); | ||
| size_t typeSize = getSizeFromTypeEnum(zcclDataType); | ||
| bool smallFlag = (totalLength >= BIG_DATA_SIZE / typeSize) ? false : true; |
| tilingconfig->outputLastNumCore_ = | ||
| elements - ((tilingconfig->outputCorePerRank_) - 1) * (tilingconfig->outputCorePerRank_); |
There was a problem hiding this comment.
There appears to be a copy-paste error in the calculation of outputLastNumCore_. It's using tilingconfig->outputCorePerRank_ instead of tilingconfig->outputNumPerCore_ in the multiplication. This will lead to an incorrect tiling configuration.
tilingconfig->outputLastNumCore_ =
elements - (tilingconfig->outputCorePerRank_ - 1) * tilingconfig->outputNumPerCore_;| shmem_free(gva); | ||
| return 0; |
There was a problem hiding this comment.
The kernel is launched asynchronously on the stream. However, shmem_free(gva) is called immediately after, without waiting for the kernel to finish. This is a race condition that can lead to a use-after-free error. The same issue applies to deviceTilingConfig, which is a shared_ptr whose memory will be freed when it goes out of scope at the end of the function. You should synchronize the stream before freeing these resources.
aclrtSynchronizeStream(stream);
shmem_free(gva);| /* launch the kernel function via ACLRT_LAUNCH_KERNEL */ | ||
| ACLRT_LAUNCH_KERNEL(ShmemReduceScatter)(blockDim, stream, inp, out, (uint8_t *)ptr, | ||
| fftsAddr, dataTypeNum, inpNumel, teamId, reduceOp); | ||
| shmem_free(ptr); |
There was a problem hiding this comment.
The kernel is launched asynchronously, but shmem_free(ptr) is called immediately without waiting for the kernel to finish. This can lead to a use-after-free error. You should synchronize the stream before freeing the memory.
ACLRT_LAUNCH_KERNEL(ShmemReduceScatter)(blockDim, stream, inp, out, (uint8_t *)ptr,
fftsAddr, dataTypeNum, inpNumel, teamId, reduceOp);
aclrtSynchronizeStream(stream);
shmem_free(ptr);| uint32_t output_last_num_core; | ||
| }; | ||
|
|
||
| } // namespace npu_kernel |
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
|
|
||
| #ifndef SGL_KERNEL_NPU_ALL_GATHER_KERNEL_H |
| uint32_t outputLastNumCore_; | ||
| }; | ||
|
|
||
| } // namespace npu_kernel |
| void* rawPtr = nullptr; | ||
| auto res = aclrtMalloc(&rawPtr, size, policy); | ||
| if (res != 0) { | ||
| return nullptr; // 分配失败 |
|
|
||
|
|
||
| def gen_random_data(size, dtype): | ||
| return np.ones_like(size, dtype=dtype) |
There was a problem hiding this comment.
Using np.ones_like for generating test data is not ideal for testing reduction operations like SUM, MAX, MIN, or PROD, as the results will be trivial. Using random data (as in the commented-out line) would provide more robust testing.
| return np.ones_like(size, dtype=dtype) | |
| return np.random.uniform(low=0.0, high=10.0, size=size).astype(dtype) |
39929cc to
dc76ca0
Compare
No description provided.