[data] documentation for ray data metrics (#58610)
## Description
Adds documentation for Ray Data metrics to improve visibility. This page should be
updated periodically with the latest metrics.
## Related issues
None
## Additional information
None
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
doc/source/data/monitoring-your-workload.rst (233 additions, 0 deletions)
@@ -118,6 +118,239 @@ The metrics recorded include:
To learn more about the Ray dashboard, including detailed setup instructions, see :ref:`Ray Dashboard <observability-getting-started>`.

Prometheus metrics
~~~~~~~~~~~~~~~~~~

Ray Data emits Prometheus metrics that you can use to monitor dataset execution. Each metric is tagged with `dataset` and `operator` labels so that you can identify the dataset and operator that produced it.
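
As a rough illustration of how the `dataset` and `operator` labels separate series, the following sketch groups a few samples in the Prometheus text format by those two labels. The sample lines, the `ray_` metric prefix, and the dataset and operator names are made up for illustration, and `group_by_labels` is a hypothetical helper, not part of Ray.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def group_by_labels(lines, metric):
    """Sum the samples of one metric per (dataset, operator) label pair."""
    totals = defaultdict(float)
    for line in lines:
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("dataset"), labels.get("operator"))
        totals[key] += float(match.group("value"))
    return dict(totals)


# Hypothetical samples; real metric names and label values may differ.
SAMPLE_LINES = [
    'ray_data_output_rows{dataset="ds_1",operator="ReadRange"} 100',
    'ray_data_output_rows{dataset="ds_1",operator="Map"} 40',
    'ray_data_output_bytes{dataset="ds_1",operator="Map"} 2048',
]

rows_per_operator = group_by_labels(SAMPLE_LINES, "ray_data_output_rows")
```

Each key of `rows_per_operator` is a `(dataset, operator)` pair, so the same metric name yields one value per operator of each dataset.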

To access these metrics, query the Prometheus server running on the Ray head node. The default Prometheus server URL is `http://<head-node-ip>:8080`.
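
A minimal sketch of building such a query follows. It assumes that the server exposes the standard Prometheus HTTP API endpoint `/api/v1/query` and that the metric is exported with a `ray_` prefix; both are assumptions to verify against your deployment. The address, the dataset name, and the helper `build_query_url` are placeholders for illustration.

```python
from urllib.parse import urlencode


def build_query_url(base_url, metric, dataset=None, operator=None):
    """Build an instant-query URL for one metric, optionally filtered by label."""
    matchers = []
    if dataset is not None:
        matchers.append(f'dataset="{dataset}"')
    if operator is not None:
        matchers.append(f'operator="{operator}"')
    # A selector looks like: metric{dataset="...",operator="..."}
    selector = metric + ("{" + ",".join(matchers) + "}" if matchers else "")
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": selector})


# Placeholder address; replace it with the address of your Prometheus server.
PROMETHEUS_URL = "http://127.0.0.1:8080"

url = build_query_url(PROMETHEUS_URL, "ray_data_output_bytes", dataset="my_dataset")
```

You can pass the resulting URL to any HTTP client, for example `urllib.request.urlopen(url)`, and read the JSON response.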

The following tables list all available Ray Data metrics grouped by category.

Overview metrics
^^^^^^^^^^^^^^^^

These metrics provide high-level information about dataset execution and resource usage.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `data_spilled_bytes`
     - Bytes spilled by dataset operators. Set `DataContext.enable_get_object_locations_for_metrics` to `True` to report this metric.
   * - `data_freed_bytes`
     - Bytes freed by dataset operators
   * - `data_current_bytes`
     - Bytes of object store memory used by dataset operators
   * - `data_cpu_usage_cores`
     - CPUs allocated to dataset operators
   * - `data_gpu_usage_cores`
     - GPUs allocated to dataset operators
   * - `data_output_bytes`
     - Bytes output by dataset operators
   * - `data_output_rows`
     - Rows output by dataset operators

Input metrics
^^^^^^^^^^^^^

These metrics track input data flowing into operators.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_inputs_received`
     - Number of input blocks received by operator
   * - `num_row_inputs_received`
     - Number of input rows received by operator
   * - `bytes_inputs_received`
     - Byte size of input blocks received by operator
   * - `num_task_inputs_processed`
     - Number of input blocks that the operator's tasks finished processing
   * - `bytes_task_inputs_processed`
     - Byte size of input blocks that the operator's tasks finished processing
   * - `bytes_inputs_of_submitted_tasks`
     - Byte size of input blocks passed to submitted tasks
   * - `rows_inputs_of_submitted_tasks`
     - Number of rows in the input blocks passed to submitted tasks
   * - `average_num_inputs_per_task`
     - Average number of input blocks per task, or `None` if no task finished
   * - `average_bytes_inputs_per_task`
     - Average size in bytes of ref bundles passed to tasks, or `None` if no tasks were submitted
   * - `average_rows_inputs_per_task`
     - Average number of rows in input blocks per task, or `None` if no tasks were submitted
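
The `None` convention used by the average metrics can be sketched as follows. This assumes that each average is a running total divided by a task count, which the descriptions suggest but don't state. The helper `average_per_task` is hypothetical and isn't Ray's implementation.

```python
from typing import Optional


def average_per_task(total: float, num_tasks: int) -> Optional[float]:
    """Return total / num_tasks, or None when no task has been counted yet."""
    if num_tasks <= 0:
        # No task finished (or was submitted), so the average is undefined.
        return None
    return total / num_tasks


# Hypothetical values: 12 input blocks processed by 4 finished tasks.
avg_inputs = average_per_task(12, 4)
# No tasks counted yet: the average is reported as None instead of 0.
avg_before_start = average_per_task(0, 0)
```

Returning `None` instead of `0` distinguishes "no data yet" from a real average of zero.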

Output metrics
^^^^^^^^^^^^^^

These metrics track output data generated by operators.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_task_outputs_generated`
     - Number of output blocks generated by tasks
   * - `bytes_task_outputs_generated`
     - Byte size of output blocks generated by tasks
   * - `rows_task_outputs_generated`
     - Number of output rows generated by tasks
   * - `row_outputs_taken`
     - Number of rows that are already taken by downstream operators
   * - `block_outputs_taken`
     - Number of blocks that are already taken by downstream operators
   * - `num_outputs_taken`
     - Number of output blocks that are already taken by downstream operators
   * - `bytes_outputs_taken`
     - Byte size of output blocks that are already taken by downstream operators
   * - `num_outputs_of_finished_tasks`
     - Number of generated output blocks that are from finished tasks
   * - `bytes_outputs_of_finished_tasks`
     - Total byte size of generated output blocks produced by finished tasks
   * - `rows_outputs_of_finished_tasks`
     - Number of rows generated by finished tasks
   * - `num_external_inqueue_blocks`
     - Number of blocks in the external inqueue
   * - `num_external_inqueue_bytes`
     - Byte size of blocks in the external inqueue
   * - `num_external_outqueue_blocks`
     - Number of blocks in the external outqueue
   * - `num_external_outqueue_bytes`
     - Byte size of blocks in the external outqueue
   * - `average_num_outputs_per_task`
     - Average number of output blocks per task, or `None` if no task finished
   * - `average_bytes_per_output`
     - Average size in bytes of output blocks
   * - `average_bytes_outputs_per_task`
     - Average total output size per task in bytes, or `None` if no task finished
   * - `average_rows_outputs_per_task`
     - Average number of rows produced per task, or `None` if no task finished
   * - `num_output_blocks_per_task_s`
     - Average number of output blocks per task per second

Task metrics
^^^^^^^^^^^^

These metrics track task execution and scheduling.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_tasks_submitted`
     - Number of submitted tasks
   * - `num_tasks_running`
     - Number of running tasks
   * - `num_tasks_have_outputs`
     - Number of tasks with at least one output
   * - `num_tasks_finished`
     - Number of finished tasks
   * - `num_tasks_failed`
     - Number of failed tasks
   * - `block_generation_time`
     - Time spent generating blocks in tasks
   * - `task_submission_backpressure_time`
     - Time spent in task submission backpressure
   * - `task_output_backpressure_time`
     - Time spent in task output backpressure
   * - `task_completion_time`
     - Histogram of time spent running tasks to completion
   * - `block_completion_time`
     - Histogram of time spent running a single block to completion. If a task generates multiple blocks, Ray Data approximates this value by assuming each block took an equal amount of time to process.
   * - `task_completion_time_s`
     - Time spent running tasks to completion
   * - `task_completion_time_excl_backpressure_s`
     - Time spent running tasks to completion without backpressure
   * - `block_size_bytes`
     - Histogram of block sizes in bytes generated by tasks
   * - `block_size_rows`
     - Histogram of number of rows in blocks generated by tasks
   * - `average_total_task_completion_time_s`
     - Average task completion time in seconds including throttling. This includes Ray Core and Ray Data backpressure.
   * - ...
     - Average task completion time in seconds excluding throttling
   * - `average_max_uss_per_task`
     - Average Unique Set Size (USS) memory usage of tasks. USS is the amount of memory unique to a process that would be freed if the process was terminated.
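
The equal-time approximation described for `block_completion_time` can be sketched as follows. This is an illustration of the stated assumption only, not Ray Data's implementation. The helper `approximate_block_times` and the example values are hypothetical.

```python
from typing import List


def approximate_block_times(task_duration_s: float, num_blocks: int) -> List[float]:
    """Split a task's duration evenly across the blocks it generated."""
    if num_blocks <= 0:
        # A task that produced no blocks contributes no samples.
        return []
    per_block_s = task_duration_s / num_blocks
    return [per_block_s] * num_blocks


# Hypothetical task: 6 seconds to produce 3 blocks, so 2 seconds per block.
block_times = approximate_block_times(6.0, 3)
```

Under this approximation, every block of a task records the same completion time, so the histogram reflects per-task averages instead of true per-block timings.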

Actor metrics
^^^^^^^^^^^^^

These metrics track actor lifecycle for operations that use actors.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `num_alive_actors`
     - Number of alive actors
   * - `num_restarting_actors`
     - Number of restarting actors
   * - `num_pending_actors`
     - Number of pending actors

Object store memory metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

These metrics track memory usage in the Ray object store.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `obj_store_mem_internal_inqueue_blocks`
     - Number of blocks in the operator's internal input queue
   * - `obj_store_mem_internal_outqueue_blocks`
     - Number of blocks in the operator's internal output queue
   * - `obj_store_mem_freed`
     - Byte size of freed memory in object store
   * - `obj_store_mem_spilled`
     - Byte size of spilled memory in object store
   * - `obj_store_mem_used`
     - Byte size of used memory in object store
   * - `obj_store_mem_internal_inqueue`
     - Byte size of input blocks in the operator's internal input queue
   * - `obj_store_mem_internal_outqueue`
     - Byte size of output blocks in the operator's internal output queue
   * - `obj_store_mem_pending_task_inputs`
     - Byte size of input blocks used by pending tasks

Scheduling and resource metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These metrics track resource allocation and scheduling behavior in the streaming executor.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Metric name
     - Description
   * - `data_sched_loop_duration_s`
     - Duration of the scheduling loop in seconds
   * - `data_cpu_budget`
     - CPU budget allocated per operator
   * - `data_gpu_budget`
     - GPU budget allocated per operator
   * - `data_memory_budget`
     - Memory budget allocated per operator
   * - `data_object_store_memory_budget`
     - Object store memory budget allocated per operator
   * - `data_max_bytes_to_read`
     - Maximum bytes to read from streaming generator buffer per operator