
[Docs] Add initial user guide for Ray resource isolation with writable cgroups #59051

Merged
edoakes merged 4 commits into ray-project:master from andrewsykim:ray-resource-isolation-guide
Jan 22, 2026
Conversation

@andrewsykim
Member

Description

Add a user guide for enabling Ray resource isolation on Kubernetes using writable cgroups

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@andrewsykim andrewsykim requested review from a team as code owners November 27, 2025 23:27
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a user guide for enabling Ray resource isolation on Kubernetes with writable cgroups. The documentation is comprehensive and well-written. My review focuses on improving the copy-paste-friendliness of the example commands by removing hardcoded values and fixing an invalid Python snippet in one of the commands.


Run a task on your Ray cluster:
```bash
$ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray; ray.init(); sleep(100)"
```

high

The Python code in the ray job submit command is invalid because sleep is not a built-in function and needs to be imported from the time module. This will cause the command to fail.

Suggested change
```diff
-$ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray; ray.init(); sleep(100)"
+$ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray, time; ray.init(); time.sleep(100)"
```
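To see the reviewer's point in isolation, here is a minimal, Ray-free sketch showing that a bare `sleep` call raises `NameError` until the `time` module is imported:

```python
import time

def can_call_bare_sleep() -> bool:
    """Return True only if a bare sleep() call works without an import."""
    try:
        sleep(0)  # NameError: "sleep" is not a Python builtin
        return True
    except NameError:
        return False

print(can_call_bare_sleep())  # False, which is why the original command fails
time.sleep(0.05)              # fine once imported via the time module
print("time.sleep: ok")
```

The same applies inside `python -c`: the inline script is ordinary Python, so `time.sleep(100)` (or `from time import sleep`) is required.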

2 CPUs and reserves 1 CPU for system processes, expect a CPU weight of 5000 for the system processes.

```bash
(base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/cpu.weight
```

medium

The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard * so the shell can expand it to the correct path. This pattern is already used in other commands in this guide.

Suggested change
```diff
-(base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/cpu.weight
+(base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_*/system/cpu.weight
```
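The glob expansion the reviewer relies on can be demonstrated without a real cgroup filesystem; the sketch below recreates the layout under a temporary directory (the node-ID directory name is made up for the demonstration):

```python
import glob
import os
import shutil
import tempfile

# Recreate the relevant layout: <root>/ray-node_<node-id>/system/cpu.weight,
# where the node-id hash differs on every node.
root = tempfile.mkdtemp()
node_dir = os.path.join(root, "ray-node_1cb40f18deadbeef", "system")
os.makedirs(node_dir)
with open(os.path.join(node_dir, "cpu.weight"), "w") as f:
    f.write("5000\n")

# ray-node_* matches the concrete directory regardless of the hash, which is
# exactly what the shell does when the wildcard form of the command runs.
matches = glob.glob(os.path.join(root, "ray-node_*", "system", "cpu.weight"))
print(len(matches))  # 1

shutil.rmtree(root)
```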


Verify the list of processes under the `system` cgroup hierarchy by inspecting the `cgroup.procs` file:
```
$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/leaf/cgroup.procs
```

medium

The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard * so the shell can expand it to the correct path.

Suggested change
```diff
-$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/leaf/cgroup.procs
+$ cat /sys/fs/cgroup/ray-node_*/system/leaf/cgroup.procs
```


Verify no user processes:
```
$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
```

medium

The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard * so the shell can expand it to the correct path.

Suggested change
```diff
-$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
+$ cat /sys/fs/cgroup/ray-node_*/user/workers/cgroup.procs
```


Observe the new processes:
```
$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
```

medium

The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard * so the shell can expand it to the correct path.

Suggested change
```diff
-$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
+$ cat /sys/fs/cgroup/ray-node_*/user/workers/cgroup.procs
```

@andrewsykim andrewsykim force-pushed the ray-resource-isolation-guide branch 2 times, most recently from 4a4696f to 87066c9 Compare November 28, 2025 00:15
@andrewsykim andrewsykim force-pushed the ray-resource-isolation-guide branch from 87066c9 to 47a1cf7 Compare November 28, 2025 00:23
@ray-gardener ray-gardener bot added the docs, core, community-contribution, and kubernetes labels Nov 28, 2025
@edoakes
Collaborator

edoakes commented Nov 30, 2025

@israbbani PTAL

@andrewsykim andrewsykim force-pushed the ray-resource-isolation-guide branch from 47a1cf7 to 1a9931d Compare December 4, 2025 19:50

This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
resources for critical system processes. Historically, enabling resource isolation required privileged containers capable of writing to


nit: I think "Historically" might be better placed at the start of a new paragraph.

Member Author

split this part into a new paragraph

```bash
--cluster-version=1.34 \
--machine-type=e2-standard-16 \
--num-nodes=3 \
--scopes="cloud-platform" \
```


Is this required for Ray on GKE? It's not required for the writable cgroups feature

Member Author

It was in the writable cgroups documentation, and I got an error when trying without it


Can you point to where you see that? That wasn't mentioned in https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups


I see -- I believe that field is only required when configuring privateRegistryAccessConfig and/or registryHosts. I had successfully created clusters with writableCgroups without it

the `/sys/fs/cgroup` file system. This approach was not recommended due to the security risks associated with privileged containers. In newer versions of GKE,
you can enable writable cgroups, granting containers read-write access to the cgroups API without requiring privileged mode.

## Create a GKE Cluster with writable cgroups enabled


Would it be worth adding a prerequisites section calling out the requirements (e.g. gcloud and kubectl installed and configured)? If this is obvious to the reader feel free to ignore

Member Author

added the section

```bash
enabled: true
EOF

$ gcloud container clusters create ray-resource-isolation \
```


This may fail if the user's default configuration does not specify a zone. We might want to add:

--zone=$ZONE

Member Author

Will leave this out for now to be consistent with other docs

Contributor

For my understanding, what does --zone do in this case?


It specifies the location of the cluster:
https://docs.cloud.google.com/compute/docs/regions-zones

# Resource Isolation with Kubernetes Writable Cgroups

This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory


The cgroups(7) manpage refers to version 2 via "cgroups v2". We may want to standardize on that phrasing across the doc.

Member Author

fixed

## Verify resource isolation for Ray processes

Verify that resource isolation is enabled by inspecting the cgroup filesystem within a Ray container.

This shell output is pretty dense. I'd recommend introducing newlines between subsequent commands for easier reading.

Member Author

done

Contributor

Ray also prints a log line on startup that will give you this information. It's in raylet.out:

```
{"asctime":"2026-01-14 13:53:13,853","levelname":"I","message":"Initializing CgroupManager at base cgroup at '/sys/fs/cgroup'. Ray's cgroup hierarchy will under the node cgroup at '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04' with [memory, cpu] controllers enabled. The system cgroup at '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/system' will have [memory] controllers enabled with [cpu.weight=666, memory.min=25482231398] constraints. The user cgroup '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user' will have no controllers enabled with [cpu.weight=9334] constraints. The user cgroup will contain the [/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user/workers, /sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user/non-ray] cgroups.","component":"raylet","filename":"cgroup_manager.cc","lineno":212}
```

This is still pretty dense, but it only includes cgroup-specific information that's relevant to the user.
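Because `raylet.out` entries like the one above are structured JSON, the cgroup details can be pulled out mechanically. A minimal sketch, using an abridged sample line rather than a verbatim Ray log:

```python
import json

# Abridged sample of a raylet.out line (field names follow the log quoted
# above; the message text here is shortened for the example).
line = (
    '{"asctime":"2026-01-14 13:53:13,853","levelname":"I",'
    '"message":"Initializing CgroupManager at base cgroup at \'/sys/fs/cgroup\'.",'
    '"component":"raylet","filename":"cgroup_manager.cc","lineno":212}'
)

record = json.loads(line)
is_cgroup_line = record["message"].startswith("Initializing CgroupManager")
print(record["component"], is_cgroup_line)  # raylet True
```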

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for documentation purposes the current approach is better as user can copy the commands directly. We can update this based on feedback as well

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale and unstale labels and removed the stale label Dec 20, 2025
@edoakes edoakes added the go label (add ONLY when ready to merge, run all tests) Dec 22, 2025
# Resource Isolation with Kubernetes Writable Cgroups

This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
Collaborator

@israbbani @Kunchd this should be updated to link to resource isolation docs/guide once published

Contributor

Makes sense. We'll publish our docs shortly after this and x-link.

@edoakes edoakes enabled auto-merge (squash) December 22, 2025 22:16
@andrewsykim andrewsykim force-pushed the ray-resource-isolation-guide branch from afa2328 to 5e065fa Compare December 23, 2025 11:46
@github-actions github-actions bot disabled auto-merge December 23, 2025 11:46
Contributor

@israbbani israbbani left a comment

This looks pretty close to ideal. I've left a few comments.


Comment on lines +117 to +127
You can inspect specific files to confirm the reserved CPU and memory for system and user processes.
The RayCluster created in an earlier step creates containers requesting a total of 2 CPUs.
Based on Ray's default calculation of system resources (`min(3.0, max(1.0, 0.05 * num_cores_on_the_system))`),
we should expect 1 CPU for system processes. However, since CPU is a compressible resource, cgroupsv2 expresses
CPU resources using weights rather than core units, with a total weight of 10000. If Ray has
2 CPUs and reserves 1 CPU for system processes, expect a CPU weight of 5000 for the system processes.

```bash
(base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node*/system/cpu.weight
5000
```
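The expected value can be reproduced from the formula quoted above; the helper names below are illustrative, not Ray APIs:

```python
def default_system_reserved_cpus(num_cores: float) -> float:
    # Ray's default system reservation, as quoted in the guide:
    # min(3.0, max(1.0, 0.05 * num_cores_on_the_system))
    return min(3.0, max(1.0, 0.05 * num_cores))

def system_cpu_weight(total_cpus: float, reserved_cpus: float) -> int:
    # cgroups v2 expresses compressible CPU as weights; the system and user
    # cgroups split a total weight of 10000 in proportion to the reservation.
    return round(10000 * reserved_cpus / total_cpus)

reserved = default_system_reserved_cpus(2)  # containers request 2 CPUs total
print(reserved)                             # 1.0
print(system_cpu_weight(2, reserved))       # 5000, matching cpu.weight above
```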
Contributor

If you use the log line, it'll give you all of this information as well.

Member Author

Do these logs go to stdout or does a user have to look through specific file in the logs dir?

Member Author

ah, I missed your other comment saying it's from raylet.out

Member Author

I'm okay to keep the current approach to check filesystem directly but let's follow-up and add logs if users find it confusing

Member Author

Or even better, if there are plans to add the cgroups details in Ray Dashboard then we should just document that

@@ -0,0 +1,201 @@
(resource-isolation-with-writable-cgroups)=

# Resource Isolation with Kubernetes Writable Cgroups
Contributor

Suggested change
```diff
-# Resource Isolation with Kubernetes Writable Cgroups
+# Resource Isolation with Kubernetes Writable Cgroups on Google Kubernetes Engine
```

Same suggestion for the link in user-guides.md.

# Resource Isolation with Kubernetes Writable Cgroups

This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
Contributor

Suggested change
```diff
-Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
+Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroups v2 to reserve dedicated CPU and memory
```

s/cgroupsv2/cgroups v2/g.

@edoakes
Collaborator

edoakes commented Jan 21, 2026

ping on comments when you have a chance @andrewsykim

or let me know if you'd like us to make the changes directly

@andrewsykim
Member Author

Sorry I missed the latest comments, I'll update tomorrow

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim andrewsykim force-pushed the ray-resource-isolation-guide branch from b118997 to 2115eb7 Compare January 22, 2026 19:33
@andrewsykim
Copy link
Member Author

@israbbani @edoakes comments addressed

@edoakes edoakes merged commit fe2eddb into ray-project:master Jan 22, 2026
6 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…e cgroups (ray-project#59051)

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

- community-contribution: Contributed by the community
- core: Issues that should be addressed in Ray Core
- docs: An issue or change related to documentation
- go: add ONLY when ready to merge, run all tests
- kubernetes
- unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

4 participants