fix(ci): add swap and limit CPU cores to prevent arm64 runner OOM #6957

Merged
xmfcx merged 2 commits into main from fix/limit-cpu-cores-docker-build
Mar 28, 2026

xmfcx (Contributor) commented Mar 27, 2026

Changes

  1. Add 8 GB swapfile for arm64 health-check builds -- the arm64 runner (4 vCPU / 16 GB RAM) intermittently OOM-kills during heavy C++ compilation. Adding swap increases virtual memory from ~18 GB to ~26 GB, providing headroom for transient memory spikes.

  2. Use taskset to restrict colcon build to nproc - 1 cores -- reserves one core for OS/Docker/BuildKit overhead. Under taskset, nproc returns 3, so colcon builds 3 packages in parallel instead of 4.

  3. Print runner info (CPU, memory, swap, disk) before the build for easier debugging.
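A minimal sketch of how changes 1 and 3 might look as shell steps, assuming a Linux runner with passwordless `sudo`. The helper names (`run`, `add_swap`, `print_runner_info`), the `DRY_RUN` switch, and the `/swapfile` path are illustrative and are not taken from this PR's workflow.

```shell
# Hypothetical helpers; names, paths, and DRY_RUN are illustrative only.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create and enable a swapfile of the given size (e.g. 8G) at the given path.
add_swap() {
  local size="$1" path="$2"
  run sudo fallocate -l "$size" "$path"
  run sudo chmod 600 "$path"   # swap files must not be readable by other users
  run sudo mkswap "$path"
  run sudo swapon "$path"
}

# Print CPU, memory, swap, and disk information for debugging.
print_runner_info() {
  run nproc
  run free -h
  run swapon --show
  run df -h /
}

# Example (not executed here):
#   print_runner_info
#   add_swap 8G /swapfile
#   print_runner_info
```

Setting `DRY_RUN=1` prints the commands instead of running them, which makes it possible to check the step without root access. On filesystems where a `fallocate`-created file cannot be used as swap, writing the file with `dd` is the usual fallback.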

Why

The docker-build (main-arm64) job intermittently fails because colcon build defaults to using all available cores. With multiple packages compiling in parallel, each spawning multiple cmake compile jobs, the runner exceeds its memory budget and loses communication with the server.

taskset alone was insufficient (see #6956 comment) because it only limits CPU affinity, not memory. 9 concurrent compiler processes (3 packages x 3 jobs) can still exceed 16 GB. The additional swap provides the extra headroom needed.
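The affinity restriction can be sketched as follows. `build_cpu_list` is a hypothetical helper, and the sketch assumes CPUs are numbered contiguously from 0; how the PR actually invokes `taskset` is not shown in this conversation.

```shell
# Hypothetical helper: build a taskset CPU list that leaves one core free.
# Assumes CPUs are numbered 0..N-1. Example: 4 cores -> "0-2".
# With 1 or 2 cores it pins to CPU 0 only.
build_cpu_list() {
  local total="$1"
  local last=$(( total - 2 ))
  if [ "$last" -lt 1 ]; then
    echo "0"
  else
    echo "0-${last}"
  fi
}

# Example (not executed here): restrict the build to all cores but one.
# taskset only sets CPU affinity; it does not cap memory use.
#   taskset -c "$(build_cpu_list "$(nproc)")" colcon build
```

Because GNU `nproc` reports the processing units available to the current process, tools that size their parallelism from it see the reduced count under `taskset`.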

If this is still not enough, the next step is to set CMAKE_BUILD_PARALLEL_LEVEL to directly limit per-package compile jobs.
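The worst-case arithmetic behind the "3 packages x 3 jobs" figure can be sketched as below. The helper names are hypothetical, the per-process numbers are simple averages rather than measurements, and the cap of 2 jobs is an illustrative value, not one chosen in this PR.

```shell
# Worst-case concurrent compiler processes:
# packages built in parallel x compile jobs per package.
max_compilers() {
  echo $(( $1 * $2 ))
}

# Average memory available per compiler process in MiB (integer division).
mib_per_compiler() {
  local total_gib="$1" compilers="$2"
  echo $(( total_gib * 1024 / compilers ))
}

# Example (not executed here):
#   max_compilers 3 3        # 3 packages x 3 jobs each
#   mib_per_compiler 16 9    # 16 GB RAM shared by 9 compilers
#   # Hypothetical cap of 2 compile jobs per package:
#   export CMAKE_BUILD_PARALLEL_LEVEL=2
#   mib_per_compiler 16 "$(max_compilers 3 2)"
```

With 9 compilers sharing 16 GB, each gets roughly 1820 MiB on average; capping at 2 jobs per package would leave 6 compilers with roughly 2730 MiB each. Whether `CMAKE_BUILD_PARALLEL_LEVEL` takes effect depends on how the build tool invokes CMake, so this is an estimate of the intent rather than a verified setting.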

Test plan

  • Verify docker-build (main-arm64) health-check job passes without runner communication loss
  • Verify other matrix entries (main, nightly) are unaffected
  • Verify runner info step prints CPU/memory/swap/disk stats

Use taskset to restrict colcon build to (nproc - 1) cores, reserving
one core for OS/Docker/BuildKit overhead. This prevents the ARM64
public runners (4 vCPUs / 16GB RAM) from being starved during heavy
C++ compilation, which causes the runner to lose communication with
the server and fail the job.

Closes #6956

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

xmfcx self-assigned this Mar 27, 2026

github-actions (Bot) commented Mar 27, 2026

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

xmfcx requested a review from mitsudome-r March 27, 2026 22:37
xmfcx added the run:health-check (Run health-check) label Mar 27, 2026
xmfcx enabled auto-merge (squash) March 27, 2026 22:42
xmfcx disabled auto-merge March 28, 2026 06:23
xmfcx force-pushed the fix/limit-cpu-cores-docker-build branch 2 times, most recently from de59c08 to ca7fc93 March 28, 2026 07:32
xmfcx changed the title from "fix(docker): limit CPU cores in colcon build to prevent runner OOM" to "fix(ci): add swap and limit CPU cores to prevent arm64 runner OOM" Mar 28, 2026

xmfcx (Contributor, Author) commented Mar 28, 2026

Now testing with the added 8G swap.

image

The arm64 health-check runner (4 vCPU / 16 GB RAM) intermittently
OOM-kills during heavy C++ compilation. Adding an 8 GB swapfile
increases available virtual memory from ~20 GB to ~28 GB, providing
headroom for transient memory spikes.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

xmfcx force-pushed the fix/limit-cpu-cores-docker-build branch from ca7fc93 to ee0f44c March 28, 2026 07:42

mitsudome-r (Member) left a comment

Looks good to me.

It might be more useful if we could add an argument for workflow dispatch to enable limiting the number of cores further whenever CI fails due to out of resource. But that doesn't have to be done in this PR.

xmfcx (Contributor, Author) commented Mar 28, 2026

It took 3 hours but at least it passed.

image

My guess is that the build is slower because it now relies more heavily on swap.

CMAKE_BUILD_PARALLEL_LEVEL could also reduce memory use by limiting the parallel compile jobs within each package build, but that complicates things even more. Let's merge for now.

> It might be more useful if we could add an argument for workflow dispatch to enable limiting the number of cores further whenever CI fails due to out of resource. But that doesn't have to be done in this PR.

The health-check is a required check and runs on: pull_request. So even if we ran it with different parameters through workflow_dispatch, we would still have to override the check on this PR in order to merge. I wanted a fix that needs no manual intervention.

xmfcx merged commit 2222ad8 into main Mar 28, 2026
18 checks passed
xmfcx deleted the fix/limit-cpu-cores-docker-build branch March 28, 2026 10:54

Labels

run:health-check Run health-check

2 participants