Commit ec680fa

[data] documentation for ray data metrics (#58610)
## Description

Adds ray data metrics documentation for visibility. This should be periodically updated with the latest metrics.

## Related issues

None

## Additional information

None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
1 parent aecff3c commit ec680fa

doc/source/data/monitoring-your-workload.rst

Lines changed: 233 additions & 0 deletions
@@ -118,6 +118,239 @@ The metrics recorded include:

To learn more about the Ray dashboard, including detailed setup instructions, see :ref:`Ray Dashboard <observability-getting-started>`.

Prometheus metrics
~~~~~~~~~~~~~~~~~~

Ray Data emits Prometheus metrics that you can use to monitor dataset execution. Each metric is tagged with `dataset` and `operator` labels that identify the dataset and operator it comes from.

To access these metrics, query the Prometheus server running on the Ray head node. The default Prometheus server URL is `http://<head-node-ip>:8080`.

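As a minimal sketch of such a query, the following Python builds a PromQL selector filtered by the `dataset` and `operator` labels and sends it to the Prometheus HTTP API endpoint `/api/v1/query`. The metric name `ray_data_output_rows` (including the `ray_` prefix), the label values, and the helper names are assumptions for illustration; check the names that your deployment actually exposes.

```python
import json
import urllib.parse
import urllib.request


def _escape(value):
    """Escape backslashes and double quotes inside a PromQL label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(metric, dataset=None, operator=None):
    """Build a PromQL instant selector such as metric{dataset="...",operator="..."}."""
    labels = []
    if dataset is not None:
        labels.append(f'dataset="{_escape(dataset)}"')
    if operator is not None:
        labels.append(f'operator="{_escape(operator)}"')
    return metric + ("{" + ",".join(labels) + "}" if labels else "")


def build_query_url(base_url, promql):
    """Return the instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def fetch_metric(base_url, promql, timeout=5.0):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]


if __name__ == "__main__":
    # Hypothetical metric name and label values; no network call is made here.
    selector = build_selector("ray_data_output_rows", dataset="my_dataset", operator="Map")
    print(build_query_url("http://127.0.0.1:8080", selector))
```

Calling `fetch_metric` with the same base URL and selector would issue the request against a live Prometheus server.
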
The following tables list all available Ray Data metrics grouped by category.

Overview metrics
^^^^^^^^^^^^^^^^

These metrics provide high-level information about dataset execution and resource usage.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `data_spilled_bytes`
     - Bytes spilled by dataset operators. Set `DataContext.enable_get_object_locations_for_metrics` to `True` to report this metric.
   * - `data_freed_bytes`
     - Bytes freed by dataset operators
   * - `data_current_bytes`
     - Bytes of object store memory used by dataset operators
   * - `data_cpu_usage_cores`
     - CPUs allocated to dataset operators
   * - `data_gpu_usage_cores`
     - GPUs allocated to dataset operators
   * - `data_output_bytes`
     - Bytes output by dataset operators
   * - `data_output_rows`
     - Rows output by dataset operators

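The `data_spilled_bytes` row depends on a `DataContext` flag. A minimal sketch of turning it on follows. The helper name `enable_spill_metrics` is hypothetical, and a stand-in object replaces the real context so that the snippet runs without a Ray cluster; with Ray installed, the context would typically come from `ray.data.DataContext.get_current()`.

```python
from types import SimpleNamespace


def enable_spill_metrics(ctx):
    """Set the flag that the table above ties to `data_spilled_bytes`."""
    ctx.enable_get_object_locations_for_metrics = True
    return ctx


# With Ray installed, the real context would be obtained like this:
#   import ray
#   ctx = ray.data.DataContext.get_current()
# A stand-in namespace is used here purely for illustration.
ctx = enable_spill_metrics(
    SimpleNamespace(enable_get_object_locations_for_metrics=False)
)
print(ctx.enable_get_object_locations_for_metrics)
```
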
Input metrics
^^^^^^^^^^^^^

These metrics track input data flowing into operators.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_inputs_received`
     - Number of input blocks received by operator
   * - `num_row_inputs_received`
     - Number of input rows received by operator
   * - `bytes_inputs_received`
     - Byte size of input blocks received by operator
   * - `num_task_inputs_processed`
     - Number of input blocks that the operator's tasks finished processing
   * - `bytes_task_inputs_processed`
     - Byte size of input blocks that the operator's tasks finished processing
   * - `bytes_inputs_of_submitted_tasks`
     - Byte size of input blocks passed to submitted tasks
   * - `rows_inputs_of_submitted_tasks`
     - Number of rows in the input blocks passed to submitted tasks
   * - `average_num_inputs_per_task`
     - Average number of input blocks per task, or `None` if no tasks have finished
   * - `average_bytes_inputs_per_task`
     - Average size in bytes of ref bundles passed to tasks, or `None` if no tasks have been submitted
   * - `average_rows_inputs_per_task`
     - Average number of rows in input blocks per task, or `None` if no tasks have been submitted

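The `average_*` rows above return `None` when there is nothing to average. The following sketch shows that convention in isolation. It assumes an average is a running total divided by a task count (for example, `num_task_inputs_processed` divided by `num_tasks_finished`); that pairing and the function name `average_per_task` are illustrative assumptions, not a statement of how Ray Data computes these values.

```python
from typing import Optional


def average_per_task(total: float, num_tasks: int) -> Optional[float]:
    """Return total / num_tasks, or None when no tasks are counted yet."""
    if num_tasks <= 0:
        # Mirrors the "or None" wording in the table: no tasks, no average.
        return None
    return total / num_tasks


print(average_per_task(12, 4))
print(average_per_task(0, 0))
```

Code that consumes these averages should handle the `None` case explicitly rather than treating it as zero.
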
Output metrics
^^^^^^^^^^^^^^

These metrics track output data generated by operators.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_task_outputs_generated`
     - Number of output blocks generated by tasks
   * - `bytes_task_outputs_generated`
     - Byte size of output blocks generated by tasks
   * - `rows_task_outputs_generated`
     - Number of output rows generated by tasks
   * - `row_outputs_taken`
     - Number of rows that are already taken by downstream operators
   * - `block_outputs_taken`
     - Number of blocks that are already taken by downstream operators
   * - `num_outputs_taken`
     - Number of output blocks that are already taken by downstream operators
   * - `bytes_outputs_taken`
     - Byte size of output blocks that are already taken by downstream operators
   * - `num_outputs_of_finished_tasks`
     - Number of generated output blocks that are from finished tasks
   * - `bytes_outputs_of_finished_tasks`
     - Total byte size of generated output blocks produced by finished tasks
   * - `rows_outputs_of_finished_tasks`
     - Number of rows generated by finished tasks
   * - `num_external_inqueue_blocks`
     - Number of blocks in the external inqueue
   * - `num_external_inqueue_bytes`
     - Byte size of blocks in the external inqueue
   * - `num_external_outqueue_blocks`
     - Number of blocks in the external outqueue
   * - `num_external_outqueue_bytes`
     - Byte size of blocks in the external outqueue
   * - `average_num_outputs_per_task`
     - Average number of output blocks per task, or `None` if no tasks have finished
   * - `average_bytes_per_output`
     - Average size in bytes of output blocks
   * - `average_bytes_outputs_per_task`
     - Average total output size of task in bytes, or `None` if no tasks have finished
   * - `average_rows_outputs_per_task`
     - Average number of rows produced per task, or `None` if no tasks have finished
   * - `num_output_blocks_per_task_s`
     - Average number of output blocks per task per second

Task metrics
^^^^^^^^^^^^

These metrics track task execution and scheduling.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_tasks_submitted`
     - Number of submitted tasks
   * - `num_tasks_running`
     - Number of running tasks
   * - `num_tasks_have_outputs`
     - Number of tasks with at least one output
   * - `num_tasks_finished`
     - Number of finished tasks
   * - `num_tasks_failed`
     - Number of failed tasks
   * - `block_generation_time`
     - Time spent generating blocks in tasks
   * - `task_submission_backpressure_time`
     - Time spent in task submission backpressure
   * - `task_output_backpressure_time`
     - Time spent in task output backpressure
   * - `task_completion_time`
     - Histogram of time spent running tasks to completion
   * - `block_completion_time`
     - Histogram of time spent running a single block to completion. If multiple blocks are generated per task, Ray Data approximates this by assuming each block took an equal amount of time to process.
   * - `task_completion_time_s`
     - Time spent running tasks to completion
   * - `task_completion_time_excl_backpressure_s`
     - Time spent running tasks to completion without backpressure
   * - `block_size_bytes`
     - Histogram of block sizes in bytes generated by tasks
   * - `block_size_rows`
     - Histogram of number of rows in blocks generated by tasks
   * - `average_total_task_completion_time_s`
     - Average task completion time in seconds including throttling. This includes Ray Core and Ray Data backpressure.
   * - `average_task_completion_excl_backpressure_time_s`
     - Average task completion time in seconds excluding throttling
   * - `average_max_uss_per_task`
     - Average Unique Set Size (USS) memory usage of tasks. USS is the amount of memory unique to a process that would be freed if the process were terminated.

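The equal-time approximation described for `block_completion_time` can be illustrated with a short sketch. The function name and the example numbers are hypothetical, and the code only restates the rule in the table: a task's total time is divided evenly across the blocks it produced.

```python
from typing import List


def approx_block_completion_times(task_time_s: float, num_blocks: int) -> List[float]:
    """Split a task's completion time evenly across the blocks it generated."""
    if num_blocks <= 0:
        # A task that produced no blocks contributes no per-block samples.
        return []
    per_block = task_time_s / num_blocks
    return [per_block] * num_blocks


# A task that ran for 6 seconds and produced 3 blocks yields three equal samples.
print(approx_block_completion_times(6.0, 3))
```

Under this approximation, a task whose blocks actually took very different amounts of time still contributes identical samples, so the histogram reflects per-task averages rather than true per-block timings.
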
Actor metrics
^^^^^^^^^^^^^

These metrics track actor lifecycle for operations that use actors.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_alive_actors`
     - Number of alive actors
   * - `num_restarting_actors`
     - Number of restarting actors
   * - `num_pending_actors`
     - Number of pending actors

Object store memory metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

These metrics track memory usage in the Ray object store.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `obj_store_mem_internal_inqueue_blocks`
     - Number of blocks in the operator's internal input queue
   * - `obj_store_mem_internal_outqueue_blocks`
     - Number of blocks in the operator's internal output queue
   * - `obj_store_mem_freed`
     - Byte size of freed memory in object store
   * - `obj_store_mem_spilled`
     - Byte size of spilled memory in object store
   * - `obj_store_mem_used`
     - Byte size of used memory in object store
   * - `obj_store_mem_internal_inqueue`
     - Byte size of input blocks in the operator's internal input queue
   * - `obj_store_mem_internal_outqueue`
     - Byte size of output blocks in the operator's internal output queue
   * - `obj_store_mem_pending_task_inputs`
     - Byte size of input blocks used by pending tasks

Scheduling and resource metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These metrics track resource allocation and scheduling behavior in the streaming executor.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `data_sched_loop_duration_s`
     - Duration of the scheduling loop in seconds
   * - `data_cpu_budget`
     - CPU budget allocated per operator
   * - `data_gpu_budget`
     - GPU budget allocated per operator
   * - `data_memory_budget`
     - Memory budget allocated per operator
   * - `data_object_store_memory_budget`
     - Object store memory budget allocated per operator
   * - `data_max_bytes_to_read`
     - Maximum bytes to read from streaming generator buffer per operator

.. _ray-data-logs:

Ray Data logs
