
Commit eb51930

JasonLi1909 and peterxcli authored and committed

[train] Ray Train Metrics Doc Page (ray-project#58235)

This PR:

- Adds a new page to the Ray Train docs called "Monitor your Application" that lists and describes the Prometheus metrics emitted by Ray Train.
- Updates the Ray Core system metrics docs to include some missing metrics.

Link to example build: https://anyscale-ray--58235.com.readthedocs.build/en/58235/train/user-guides/monitor-your-application.html

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>

1 parent 4d7df2f commit eb51930

File tree

3 files changed: +68 -16 lines changed

doc/source/ray-observability/reference/system-metrics.rst

Lines changed: 37 additions & 16 deletions
@@ -21,10 +21,10 @@ Ray exports a number of system metrics, which provide introspection into the sta
     - `Name`, `State`
     - Current number of actors in each state described in `rpc::ActorTableData::ActorState <https://github.com/ray-project/ray/blob/b3799a53dcabd8d1a4d20f22faa98e781b0059c7/src/ray/protobuf/gcs.proto#L79>`_. ALIVE has two sub-states: ALIVE_IDLE and ALIVE_RUNNING_TASKS. An actor is considered ALIVE_IDLE if it is not running any tasks.
   * - `ray_resources`
-    - `Name`, `State`, `InstanceId`
+    - `Name`, `State`, `instance`
     - Logical resource usage for each node of the cluster. Each resource has some quantity that is in either `USED or AVAILABLE state <https://github.com/ray-project/ray/blob/9eab65ed77bdd9907989ecc3e241045954a09cb4/src/ray/stats/metric_defs.cc#L188>`_. The Name label defines the resource name (e.g., CPU, GPU).
   * - `ray_object_store_memory`
-    - `Location`, `ObjectState`, `InstanceId`
+    - `Location`, `ObjectState`, `instance`
     - Object store memory usage in bytes, `broken down <https://github.com/ray-project/ray/blob/9eab65ed77bdd9907989ecc3e241045954a09cb4/src/ray/stats/metric_defs.cc#L231>`_ by logical Location (SPILLED, MMAP_DISK, MMAP_SHM, and WORKER_HEAP). Definitions are as follows. SPILLED--Objects that have spilled to disk or a remote storage solution (for example, AWS S3). The default is the disk. MMAP_DISK--Objects stored on a memory-mapped page on disk. This mode is very slow and only happens under severe memory pressure. MMAP_SHM--Objects stored on a memory-mapped page in shared memory. This mode is the default in the absence of memory pressure. WORKER_HEAP--Objects, usually smaller, stored in the memory of the Ray worker process itself.
   * - `ray_placement_groups`
     - `State`
@@ -33,46 +33,67 @@ Ray exports a number of system metrics, which provide introspection into the sta
     - `Type`, `Name`
     - The number of tasks and actors killed by the `Ray Out of Memory killer <https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html>`_, broken down by type (tasks or actors) and name (the name of the task or actor).
   * - `ray_node_cpu_utilization`
-    - `InstanceId`
+    - `instance`
     - The CPU utilization per node as a percentage quantity (0..100). Scale this by the number of cores per node to convert the units into cores.
   * - `ray_node_cpu_count`
-    - `InstanceId`
+    - `instance`
     - The number of CPU cores per node.
   * - `ray_node_gpus_utilization`
-    - `InstanceId`, `GpuDeviceName`, `GpuIndex`
+    - `instance`, `GpuDeviceName`, `GpuIndex`
     - The GPU utilization per GPU as a percentage quantity (0..NGPU*100). `GpuDeviceName` is the name of the GPU device (e.g., NVIDIA A10G) and `GpuIndex` is the index of the GPU.
   * - `ray_node_disk_usage`
-    - `InstanceId`
+    - `instance`
     - The amount of disk space used per node, in bytes.
   * - `ray_node_disk_free`
-    - `InstanceId`
+    - `instance`
     - The amount of disk space available per node, in bytes.
+  * - `ray_node_disk_write_iops`
+    - `instance`, `node_type`
+    - The disk write operations per second per node.
   * - `ray_node_disk_io_write_speed`
-    - `InstanceId`
+    - `instance`
     - The disk write throughput per node, in bytes per second.
+  * - `ray_node_disk_read_iops`
+    - `instance`, `node_type`
+    - The disk read operations per second per node.
   * - `ray_node_disk_io_read_speed`
-    - `InstanceId`
+    - `instance`
     - The disk read throughput per node, in bytes per second.
+  * - `ray_node_mem_available`
+    - `instance`, `node_type`
+    - The amount of physical memory available per node, in bytes.
+  * - `ray_node_mem_shared_bytes`
+    - `instance`, `node_type`
+    - The amount of shared memory per node, in bytes.
   * - `ray_node_mem_used`
-    - `InstanceId`
+    - `instance`
     - The amount of physical memory used per node, in bytes.
   * - `ray_node_mem_total`
-    - `InstanceId`
+    - `instance`
     - The total amount of physical memory per node, in bytes.
   * - `ray_component_uss_mb`
-    - `Component`, `InstanceId`
+    - `Component`, `instance`
     - The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
   * - `ray_component_cpu_percentage`
-    - `Component`, `InstanceId`
+    - `Component`, `instance`
     - The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
+  * - `ray_node_gram_available`
+    - `instance`, `node_type`, `GpuIndex`, `GpuDeviceName`
+    - The amount of GPU memory available per GPU, in megabytes.
   * - `ray_node_gram_used`
-    - `InstanceId`, `GpuDeviceName`, `GpuIndex`
+    - `instance`, `GpuDeviceName`, `GpuIndex`
     - The amount of GPU memory used per GPU, in bytes.
+  * - `ray_node_network_received`
+    - `instance`, `node_type`
+    - The total network traffic received per node, in bytes.
+  * - `ray_node_network_sent`
+    - `instance`, `node_type`
+    - The total network traffic sent per node, in bytes.
   * - `ray_node_network_receive_speed`
-    - `InstanceId`
+    - `instance`
     - The network receive throughput per node, in bytes per second.
   * - `ray_node_network_send_speed`
-    - `InstanceId`
+    - `instance`
     - The network send throughput per node, in bytes per second.
   * - `ray_cluster_active_nodes`
     - `node_type`
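The table notes that `ray_node_cpu_utilization` is a per-node percentage (0..100) that you scale by `ray_node_cpu_count` to express it in cores. A minimal Python sketch of that conversion, using the metric names from the table; the sample values below are made up for illustration:

```python
def cpu_cores_in_use(node_cpu_utilization_pct: float, node_cpu_count: int) -> float:
    """Convert ray_node_cpu_utilization (a 0..100 percentage per node)
    into a count of busy cores, as described in the metrics table."""
    return node_cpu_utilization_pct / 100.0 * node_cpu_count

# Hypothetical samples for two nodes (fabricated values, keyed by the
# `instance` label):
samples = {
    "node-a": {"ray_node_cpu_utilization": 75.0, "ray_node_cpu_count": 8},
    "node-b": {"ray_node_cpu_utilization": 12.5, "ray_node_cpu_count": 16},
}

busy_cores = {
    instance: cpu_cores_in_use(m["ray_node_cpu_utilization"], m["ray_node_cpu_count"])
    for instance, m in samples.items()
}
print(busy_cores)  # node-a: 6.0 busy cores, node-b: 2.0 busy cores
```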

doc/source/train/user-guides.rst

Lines changed: 1 addition & 0 deletions
@@ -16,5 +16,6 @@ Ray Train User Guides
     user-guides/experiment-tracking
     user-guides/results
     user-guides/fault-tolerance
+    user-guides/monitor-your-application
     user-guides/reproducibility
     Hyperparameter Optimization <user-guides/hyperparameter-optimization>
doc/source/train/user-guides/monitor-your-application.rst

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+.. _train-metrics:
+
+Ray Train Metrics
+-----------------
+Ray Train exports Prometheus metrics, including the Ray Train controller state, worker group start times, checkpointing times, and more. You can use these metrics to monitor Ray Train runs.
+The Ray dashboard displays these metrics in the Ray Train Grafana dashboard. See the :ref:`Ray Dashboard documentation <observability-getting-started>` for more information.
+
+The Ray Train dashboard also displays a subset of Ray Core metrics that are useful for monitoring training but are not listed in the table below.
+For more information about these metrics, see the :ref:`System Metrics documentation <system-metrics>`.
+
+The following table lists the Prometheus metrics emitted by Ray Train:
+
+.. list-table:: Train Metrics
+   :header-rows: 1
+
+   * - Prometheus Metric
+     - Labels
+     - Description
+   * - `ray_train_controller_state`
+     - `ray_train_run_name`, `ray_train_run_id`, `ray_train_controller_state`
+     - Current state of the Ray Train controller.
+   * - `ray_train_worker_group_start_total_time_s`
+     - `ray_train_run_name`, `ray_train_run_id`
+     - Total time taken to start the worker group.
+   * - `ray_train_worker_group_shutdown_total_time_s`
+     - `ray_train_run_name`, `ray_train_run_id`
+     - Total time taken to shut down the worker group.
+   * - `ray_train_report_total_blocked_time_s`
+     - `ray_train_run_name`, `ray_train_run_id`, `ray_train_worker_world_rank`, `ray_train_worker_actor_id`
+     - Cumulative time in seconds that a worker spends blocked while reporting a checkpoint to storage.
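As a quick sanity check that a cluster is emitting the Train metrics above, you can scrape the Prometheus text exposition format and filter for the `ray_train_` prefix. A minimal sketch, assuming the exposition payload has already been fetched from the metrics endpoint; the payload below is fabricated for illustration:

```python
# Fabricated Prometheus text-format scrape; the metric and label names
# come from the tables above, but the values and label contents are made up.
EXPOSITION = """\
ray_train_controller_state{ray_train_run_name="run1",ray_train_run_id="abc",ray_train_controller_state="RUNNING"} 1
ray_train_worker_group_start_total_time_s{ray_train_run_name="run1",ray_train_run_id="abc"} 12.4
ray_node_cpu_count{instance="10.0.0.1"} 8
"""

def train_metrics(text: str) -> dict:
    """Return {series: value} for ray_train_* samples only.

    Each exposition line is "<name>{<labels>} <value>"; split off the
    trailing value and keep the series string as the key.
    """
    out = {}
    for line in text.splitlines():
        if not line.startswith("ray_train_"):
            continue  # skip comments, blank lines, and non-Train metrics
        series, value = line.rsplit(" ", 1)
        out[series] = float(value)
    return out

for series, value in train_metrics(EXPOSITION).items():
    print(series, "=", value)
```

This filters out Ray Core series such as `ray_node_cpu_count` while keeping both Train series in the sample payload.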
