doc/source/ray-observability/reference/system-metrics.rst
Lines changed: 37 additions & 16 deletions
@@ -21,10 +21,10 @@ Ray exports a number of system metrics, which provide introspection into the sta
   - `Name`, `State`
   - Current number of actors in each state described in `rpc::ActorTableData::ActorState <https://github.com/ray-project/ray/blob/b3799a53dcabd8d1a4d20f22faa98e781b0059c7/src/ray/protobuf/gcs.proto#L79>`_. ALIVE has two sub-states: ALIVE_IDLE and ALIVE_RUNNING_TASKS. An actor is considered ALIVE_IDLE if it is not running any tasks.
 * - `ray_resources`
-  - `Name`, `State`, `InstanceId`
+  - `Name`, `State`, `instance`
   - Logical resource usage for each node of the cluster. Each resource has some quantity that is in either `USED or AVAILABLE state <https://github.com/ray-project/ray/blob/9eab65ed77bdd9907989ecc3e241045954a09cb4/src/ray/stats/metric_defs.cc#L188>`_. The Name label defines the resource name (e.g., CPU, GPU).
 * - `ray_object_store_memory`
-  - `Location`, `ObjectState`, `InstanceId`
+  - `Location`, `ObjectState`, `instance`
   - Object store memory usage in bytes, `broken down <https://github.com/ray-project/ray/blob/9eab65ed77bdd9907989ecc3e241045954a09cb4/src/ray/stats/metric_defs.cc#L231>`_ by logical Location (SPILLED, MMAP_DISK, MMAP_SHM, and WORKER_HEAP). Definitions are as follows. SPILLED--Objects that have spilled to disk or a remote storage solution (for example, AWS S3). The default is the disk. MMAP_DISK--Objects stored on a memory-mapped page on disk. This mode is very slow and only happens under severe memory pressure. MMAP_SHM--Objects stored on a memory-mapped page in shared memory. This mode is the default in the absence of memory pressure. WORKER_HEAP--Objects, usually smaller, stored in the memory of the Ray worker process itself. Small objects are stored in the worker heap.
 * - `ray_placement_groups`
   - `State`
@@ -33,46 +33,67 @@ Ray exports a number of system metrics, which provide introspection into the sta
   - `Type`, `Name`
   - The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html), broken down by type (task or actor) and name (the name of the task or actor).
 * - `ray_node_cpu_utilization`
-  - `InstanceId`
+  - `instance`
   - The CPU utilization per node as a percentage (0..100). Scale this value by the number of cores per node to convert it into cores.
 * - `ray_node_cpu_count`
-  - `InstanceId`
+  - `instance`
   - The number of CPU cores per node.
 * - `ray_node_gpus_utilization`
-  - `InstanceId`, `GpuDeviceName`, `GpuIndex`
+  - `instance`, `GpuDeviceName`, `GpuIndex`
   - The GPU utilization per GPU as a percentage quantity (0..NGPU*100). `GpuDeviceName` is the name of the GPU device (e.g., NVIDIA A10G) and `GpuIndex` is the index of the GPU.
 * - `ray_node_disk_usage`
-  - `InstanceId`
+  - `instance`
   - The amount of disk space used per node, in bytes.
 * - `ray_node_disk_free`
-  - `InstanceId`
+  - `instance`
   - The amount of disk space available per node, in bytes.
+* - `ray_node_disk_write_iops`
+  - `instance`, `node_type`
+  - The disk write operations per second per node.
 * - `ray_node_disk_io_write_speed`
-  - `InstanceId`
+  - `instance`
   - The disk write throughput per node, in bytes per second.
+* - `ray_node_disk_read_iops`
+  - `instance`, `node_type`
+  - The disk read operations per second per node.
 * - `ray_node_disk_io_read_speed`
-  - `InstanceId`
+  - `instance`
   - The disk read throughput per node, in bytes per second.
+* - `ray_node_mem_available`
+  - `instance`, `node_type`
+  - The amount of physical memory available per node, in bytes.
+* - `ray_node_mem_shared_bytes`
+  - `instance`, `node_type`
+  - The amount of shared memory per node, in bytes.
 * - `ray_node_mem_used`
-  - `InstanceId`
+  - `instance`
   - The amount of physical memory used per node, in bytes.
 * - `ray_node_mem_total`
-  - `InstanceId`
+  - `instance`
   - The total amount of physical memory per node, in bytes.
 * - `ray_component_uss_mb`
-  - `Component`, `InstanceId`
+  - `Component`, `instance`
   - The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
 * - `ray_component_cpu_percentage`
-  - `Component`, `InstanceId`
+  - `Component`, `instance`
   - The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
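
The `ray_node_cpu_utilization` row in the hunk above says the percentage should be scaled by the node's core count to express usage in cores. A minimal sketch of that arithmetic, assuming the two inputs were already read from `ray_node_cpu_utilization` and `ray_node_cpu_count` for the same node. The function name `cpu_percent_to_cores` is hypothetical and not part of Ray:

```python
def cpu_percent_to_cores(utilization_percent: float, cpu_count: int) -> float:
    """Convert a per-node CPU utilization percentage (0..100) into cores in use.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    if not 0.0 <= utilization_percent <= 100.0:
        raise ValueError("utilization_percent must be within 0..100")
    if cpu_count <= 0:
        raise ValueError("cpu_count must be a positive integer")
    # Scale the fraction of the node that is busy by the number of cores.
    return utilization_percent / 100.0 * cpu_count


if __name__ == "__main__":
    # A node reporting 50% utilization with 8 cores is using 4 cores' worth of CPU.
    print(cpu_percent_to_cores(50.0, 8))
```

Both values must come from the same node, so any query that combines the two series would need to match them on the node label (`instance` after this change).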
Ray Train exports Prometheus metrics including the Ray Train controller state, worker group start times, checkpointing times and more. You can use these metrics to monitor Ray Train runs.
+The Ray dashboard displays these metrics in the Ray Train Grafana Dashboard. See :ref:`Ray Dashboard documentation<observability-getting-started>` for more information.
+
+The Ray Train dashboard also displays a subset of Ray Core metrics that are useful for monitoring training but are not listed in the table below.
+For more information about these metrics, see the :ref:`System Metrics documentation<system-metrics>`.
+
+The following table lists the Prometheus metrics emitted by Ray Train: