Skip to content

Fix bus error or segfault from roi_align with large batchsize#9441

Draft
zy1git wants to merge 3 commits intopytorch:mainfrom
zy1git:issue-8206
Draft

Fix bus error or segfault from roi_align with large batchsize#9441
zy1git wants to merge 3 commits intopytorch:mainfrom
zy1git:issue-8206

Conversation

@zy1git
Copy link
Contributor

@zy1git zy1git commented Mar 13, 2026

Summary
Bug: roi_align in torchvision crashes with a bus error/segfault on CPU or returns silently wrong (all-zero) results on CUDA when the total number of output elements exceeds INT_MAX (~2.1 billion). This is caused by 32-bit int overflow in index arithmetic within the C++ and CUDA kernels.

Root Cause: The kernels use int for composite index calculations like n × channels × pooled_width × pooled_height and pointer offsets like (roi_batch_ind × channels + c) × height × width. When these products exceed 2,147,483,647, the int wraps to a negative value, causing out-of-bounds memory access.

Example: FasterRCNN with batch_size=172 generates ~172,000 ROIs. The output index reaches 171,999 × 256 × 7 × 7 = 2,157,555,456 > INT_MAX, which matches the reporter's observed threshold exactly.

Fix: Promoted int to int64_t for all index, offset, and stride variables in the relevant files.

Test Plan
New overflow regression test
pytest test/test_ops.py::TestRoIAlign::test_roi_align_large_index -v

Existing tests — verify no regressions
pytest test/test_ops.py::TestRoIAlign -v

Fixes #8206

@pytorch-bot
Copy link

pytorch-bot bot commented Mar 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9441

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d9ab5ce with merge base 6f131f1 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Mar 13, 2026
@zy1git zy1git marked this pull request as draft March 13, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bus error or segfault from roi_align with large batchsize

1 participant