[Docs] Add initial user guide for Ray resource isolation with writable cgroups (#59051)
Conversation
Code Review
This pull request introduces a user guide for enabling Ray resource isolation on Kubernetes with writable cgroups. The documentation is comprehensive and well-written. My review focuses on improving the copy-paste-friendliness of the example commands by removing hardcoded values and fixing an invalid Python snippet in one of the commands.
> Run a task on your Ray cluster:
>
> ```bash
> $ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray; ray.init(); sleep(100)"
> ```
The Python code in the `ray job submit` command is invalid because `sleep` is not a built-in function and needs to be imported from the `time` module. This will cause the command to fail.
```diff
- $ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray; ray.init(); sleep(100)"
+ $ ray job submit --address http://localhost:8265 --no-wait -- python -c "import ray, time; ray.init(); time.sleep(100)"
```
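A quick way to see why the original snippet fails and the fix works — a minimal sketch run with plain Python, no Ray required (`ray.init()` is deliberately omitted so the check stays self-contained):

```python
# sleep is not a Python builtin, so the original -c string raises NameError.
try:
    exec('sleep(0.1)')
except NameError as e:
    print(f"original fails: {e}")

# The suggested fix imports it from the time module first.
exec('import time; time.sleep(0.1)')
print("fixed version runs")
```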
> 2 CPUs and reserves 1 CPU for system processes, expect a CPU weight of 5000 for the system processes.
>
> ```bash
> (base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/cpu.weight
> ```
The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard `*` so the shell can expand it to the correct path. This pattern is already used in other commands in this guide.
```diff
- (base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/cpu.weight
+ (base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node_*/system/cpu.weight
```
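A minimal sketch of why the wildcard fix works: the shell expands `ray-node_*` to whatever hashed directory name actually exists, so the same command runs unchanged on any node. The `/tmp` tree below is a stand-in for the real `/sys/fs/cgroup` hierarchy:

```shell
# Simulate a cgroup tree whose node directory has an unpredictable hash suffix.
root=$(mktemp -d)
mkdir -p "$root/ray-node_1cb40f18034c/system"
echo 5000 > "$root/ray-node_1cb40f18034c/system/cpu.weight"

# The glob expands to the real directory name, whatever the hash is.
cat "$root"/ray-node_*/system/cpu.weight
# → 5000
```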
> Verify the list of processes under the `system` cgroup hierarchy by inspecting the `cgroup.procs` file:
>
> ```
> $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/leaf/cgroup.procs
> ```
The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard `*` so the shell can expand it to the correct path.
```diff
- $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/system/leaf/cgroup.procs
+ $ cat /sys/fs/cgroup/ray-node_*/system/leaf/cgroup.procs
```
> Verify no user processes:
>
> ```
> $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
> ```
The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard `*` so the shell can expand it to the correct path.
```diff
- $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
+ $ cat /sys/fs/cgroup/ray-node_*/user/workers/cgroup.procs
```
> Observe the new processes:
>
> ```
> $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
> ```
The node ID in the cgroup path is hardcoded. This will be different for each user and will cause the command to fail if copied directly. Please use a wildcard `*` so the shell can expand it to the correct path.
```diff
- $ cat /sys/fs/cgroup/ray-node_1cb40f18034c57379e688f9126d7671b2ac48dd6f84a9f49265bc4fb/user/workers/cgroup.procs
+ $ cat /sys/fs/cgroup/ray-node_*/user/workers/cgroup.procs
```
doc/source/cluster/kubernetes/user-guides/resource-isolation-with-writable-cgroups.md
Force-pushed 4a4696f to 87066c9
doc/source/cluster/kubernetes/user-guides/resource-isolation-with-writable-cgroups.md
Force-pushed 87066c9 to 47a1cf7
@israbbani PTAL
Force-pushed 47a1cf7 to 1a9931d
> This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
> Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
> resources for critical system processes. Historically, enabling resource isolation required privileged containers capable of writing to
nit: I think "Historically" might be better placed at the start of a new paragraph.
split this part into a new paragraph
> ```
> --cluster-version=1.34 \
> --machine-type=e2-standard-16 \
> --num-nodes=3 \
> --scopes="cloud-platform" \
> ```
Is this required for Ray on GKE? It's not required for the writable cgroups feature.

It was in the writable cgroups documentation, and I got an error when trying without it.

Can you point to where you see that? That wasn't mentioned in https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups

It's referenced in the general guide for updating containerd config: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/customize-containerd-configuration#apply-containerd-configuration-to-new-clusters

I see -- I believe that field is only required when configuring privateRegistryAccessConfig and/or registryHosts. I had successfully created clusters with writableCgroups without it.
> the `/sys/fs/cgroup` file system. This approach was not recommended due to the security risks associated with privileged containers. In newer versions of GKE,
> you can enable writable cgroups, granting containers read-write access to the cgroups API without requiring privileged mode.
>
> ## Create a GKE Cluster with writable cgroups enabled
Would it be worth adding a prerequisites section calling out the requirements (e.g. gcloud and kubectl installed and configured)? If this is obvious to the reader feel free to ignore
> ```
> enabled: true
> EOF
>
> $ gcloud container clusters create ray-resource-isolation \
> ```
This may fail if the user's default configuration does not specify a zone. We might want to add:

```
--zone=$ZONE
```

Will leave this out for now to be consistent with other docs.

For my understanding, what does `--zone` do in this case?

It specifies the location of the cluster: https://docs.cloud.google.com/compute/docs/regions-zones
> # Resource Isolation with Kubernetes Writable Cgroups
>
> This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
> Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
The cgroups(7) manpage refers to version 2 via "cgroups v2". We may want to standardize on that phrasing across the doc.
> ## Verify resource isolation for Ray processes
>
> Verify that resource isolation is enabled by inspecting the cgroup filesystem within a Ray container.
This shell output is pretty dense. I'd recommend introducing newlines between subsequent commands for easier reading.

Ray also prints a log line on startup that will give you this information. It's in raylet.out:

```json
{"asctime":"2026-01-14 13:53:13,853","levelname":"I","message":"Initializing CgroupManager at base cgroup at '/sys/fs/cgroup'. Ray's cgroup hierarchy will under the node cgroup at '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04' with [memory, cpu] controllers enabled. The system cgroup at '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/system' will have [memory] controllers enabled with [cpu.weight=666, memory.min=25482231398] constraints. The user cgroup '/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user' will have no controllers enabled with [cpu.weight=9334] constraints. The user cgroup will contain the [/sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user/workers, /sys/fs/cgroup/ray-node_b9e4de7636296bc3e8a75f5e345eebfc4c423bb4c99706a64196ec04/user/non-ray] cgroups.","component":"raylet","filename":"cgroup_manager.cc","lineno":212}
```

This is still pretty dense, but it only includes cgroup-specific information that's relevant to the user.

I think for documentation purposes the current approach is better as users can copy the commands directly. We can update this based on feedback as well.
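For a reader who does want to pull the cgroup details out of that raylet.out record rather than eyeballing it, a small sketch like this works. The record below is abridged from the log line quoted above; the field names and message format are taken from that sample and may change between Ray versions:

```python
import json
import re

# Abridged raylet.out record in the JSON format shown above (hypothetical short hash).
record = (
    '{"asctime":"2026-01-14 13:53:13,853","levelname":"I",'
    '"message":"Initializing CgroupManager at base cgroup at '
    "'/sys/fs/cgroup'. The system cgroup at "
    "'/sys/fs/cgroup/ray-node_b9e4/system' will have [memory] "
    'controllers enabled with [cpu.weight=666, memory.min=25482231398] '
    'constraints.","component":"raylet","filename":"cgroup_manager.cc","lineno":212}'
)

msg = json.loads(record)["message"]

# Every quoted cgroup path, and every dotted key=value constraint, in the message.
paths = re.findall(r"'(/sys/fs/cgroup[^']*)'", msg)
constraints = dict(re.findall(r"(\w+(?:\.\w+)+)=(\d+)", msg))
print(paths)        # base cgroup and system cgroup paths
print(constraints)  # {'cpu.weight': '666', 'memory.min': '25482231398'}
```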
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
> # Resource Isolation with Kubernetes Writable Cgroups
>
> This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
> Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
@israbbani @Kunchd this should be updated to link to resource isolation docs/guide once published

Makes sense. We'll publish our docs shortly after this and x-link.
Force-pushed afa2328 to 5e065fa
**israbbani** left a comment:

This looks pretty close to ideal. I've left a few comments.
> # Resource Isolation with Kubernetes Writable Cgroups
>
> This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
> Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
Makes sense. We'll publish our docs shortly after this and x-link.
> ```
> enabled: true
> EOF
>
> $ gcloud container clusters create ray-resource-isolation \
> ```
For my understanding, what does `--zone` do in this case?
> ## Verify resource isolation for Ray processes
>
> Verify that resource isolation is enabled by inspecting the cgroup filesystem within a Ray container.
Ray also prints a log line on startup that will give you this information. It's in raylet.out (quoted in an earlier comment). This is still pretty dense, but it only includes cgroup-specific information that's relevant to the user.
> You can inspect specific files to confirm the reserved CPU and memory for system and user processes.
> The RayCluster created in an earlier step creates containers requesting a total of 2 CPUs.
> Based on Ray's default calculation of system resources (`min(3.0, max(1.0, 0.05 * num_cores_on_the_system))`),
> we should expect 1 CPU for system processes. However, since CPU is a compressible resource, cgroupsv2 expresses
> CPU resources using weights rather than core units, with a total weight of 10000. If Ray has
> 2 CPUs and reserves 1 CPU for system processes, expect a CPU weight of 5000 for the system processes.
>
> ```bash
> (base) ray@raycluster-resource-isolation-head-p2xqx:~$ cat /sys/fs/cgroup/ray-node*/system/cpu.weight
> 5000
> ```
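The arithmetic in the quoted passage can be sketched directly. This is a sanity check, not Ray's actual implementation: the reservation formula and the 10000 total weight are quoted from the guide, while the proportional split is an assumption that matches the observed value:

```python
def system_cpu_weight(num_cores_on_the_system: float, ray_cpus: float) -> int:
    # Default system CPU reservation, as quoted in the guide.
    system_cpus = min(3.0, max(1.0, 0.05 * num_cores_on_the_system))
    # cgroups v2 expresses CPU as weights out of a total of 10000, so the
    # system cgroup gets its fraction of Ray's CPUs (assumed proportional split).
    return round(10000 * system_cpus / ray_cpus)

# e2-standard-16 node, Ray container requesting 2 CPUs:
print(system_cpu_weight(16, 2))  # → 5000, matching the cpu.weight read above
```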
If you use the log line, it'll give you all of this information as well.

Do these logs go to stdout or does a user have to look through a specific file in the logs dir?

ah, I missed your other comment saying it's from raylet.out

I'm okay to keep the current approach to check the filesystem directly, but let's follow up and add logs if users find it confusing.

Or even better, if there are plans to add the cgroups details in the Ray Dashboard then we should just document that.
> ```
> @@ -0,0 +1,201 @@
> (resource-isolation-with-writable-cgroups)=
>
> # Resource Isolation with Kubernetes Writable Cgroups
> ```
```diff
- # Resource Isolation with Kubernetes Writable Cgroups
+ # Resource Isolation with Kubernetes Writable Cgroups on Google Kubernetes Engine
```

Same suggestion for the link in user-guides.md.
> # Resource Isolation with Kubernetes Writable Cgroups
>
> This guide covers how to enable Ray resource isolation on GKE using [writable cgroups](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups).
> Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
```diff
- Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroupsv2 to reserve dedicated CPU and memory
+ Ray resource isolation (introduced in v2.51.0) significantly improves Ray's reliability by using cgroups v2 to reserve dedicated CPU and memory
```

s/cgroupsv2/cgroups v2/g.
ping on comments when you have a chance @andrewsykim or let me know if you'd like us to make the changes directly

Sorry I missed the latest comments, I'll update tomorrow
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Force-pushed b118997 to 2115eb7
@israbbani @edoakes comments addressed
…e cgroups (ray-project#59051)

Add a user guide for enabling Ray resource isolation on Kubernetes using writable cgroups

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
**Description**

Add a user guide for enabling Ray resource isolation on Kubernetes using writable cgroups