[data] Default return 0s for object locality instead of -1s #58754
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Code Review
This pull request correctly addresses an issue where negative metrics were being reported when object locality tracking was disabled. The fix involves checking for the sentinel value (-1, -1, -1) and skipping the metrics aggregation, which is the right approach. Additionally, the new log message that appears once is a helpful diagnostic for users, and the updated docstring clarifies the purpose of the configuration flag. I have one minor suggestion to simplify the check for the sentinel value.
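The sentinel-handling approach described here can be sketched as follows (hypothetical helper names; the actual Ray Data code differs):

```python
# Sketch of the sentinel check described in the review, not the real
# Ray Data implementation; names here are hypothetical.
SENTINEL = (-1, -1, -1)  # returned when locality tracking is disabled

def aggregate_locality_metrics(hits, misses, unknowns, sample):
    """Add a (hits, misses, unknowns) sample to running totals,
    skipping the disabled-tracking sentinel."""
    if sample == SENTINEL:
        # Locality tracking is off; leave the totals untouched.
        return hits, misses, unknowns
    h, m, u = sample
    return hits + h, misses + m, unknowns + u
```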
Force-pushed from f6b259e to c34bb21
    "Set `DataContext.get_current().enable_get_object_locations_for_metrics=True` to enable this."
    )
    ...
    return -1, -1, -1
The `(-1, -1, -1)` sentinel wasn't really being used, so we default to 0s instead.
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
python/ray/data/tests/test_stats.py
Outdated
    assert final_stats == ds_stats
    assert f"dataset_{ds._uuid}_0" == update_fn.call_args_list[-1].args[1]
    assert f"dataset_{ds._uuid}_1" == update_fn.call_args_list[-1].args[1]
Bug: Incorrect test expectation after run index change
The test expects the dataset ID to be "dataset_{uuid}_1" after calling take_all(), but with the new logic _run_index starts at 0 and is incremented only after the executor is created. Since the executor is created with get_dataset_id(), which returns the current _run_index value before the increment, the first execution uses "dataset_{uuid}_0", not "dataset_{uuid}_1". The assertion should check for "dataset_{uuid}_0".
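The ordering this comment describes can be modeled with a small sketch (hypothetical class; the real tracking lives in Ray Data's iterator internals):

```python
class RunTracker:
    """Toy model of the run-index ordering described above."""
    def __init__(self, uuid):
        self._uuid = uuid
        self._run_index = 0  # starts at 0 under the new logic

    def get_dataset_id(self):
        # Returns the id for the current run index, before any increment.
        return f"dataset_{self._uuid}_{self._run_index}"

    def create_executor(self):
        # The executor is tagged with the current id...
        dataset_id = self.get_dataset_id()
        # ...and the run index is incremented only afterwards.
        self._run_index += 1
        return dataset_id
```

Under this model the first execution is tagged `_0` and only the second is tagged `_1`, which is why the assertion should check `_0`.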
bveeramani
left a comment
@iamjustinhsu Does this prevent the issue where you get -1000 blocks, like in the "Iteration Blocks Local" chart? If so, how?
    elif log_once("enable_get_object_locations_for_metrics"):
        logger.info(
            "Tracking the number of object cache hits and misses is currently disabled. "
            "Set `DataContext.get_current().enable_get_object_locations_for_metrics=True` to enable this."
        )
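`log_once` gates the message so it fires only the first time its key is seen; a minimal stand-in for that behavior (not Ray's actual implementation):

```python
# Hypothetical stand-in for the log_once helper used in the diff above.
_logged_keys = set()

def log_once(key):
    """Return True only the first time `key` is seen, so the guarded
    logger.info(...) call runs at most once per process."""
    if key in _logged_keys:
        return False
    _logged_keys.add(key)
    return True
```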
There was a problem hiding this comment.
Since enable_get_object_locations_for_metrics is disabled by default, we'll also emit this warning by default, right?
If the metric is niche, it might be better to surface it only when the feature is explicitly relevant. Otherwise, the warning might clutter logs or confuse users.
This feature is only used when they call iter_batches(), iter_rows(), etc., which I don't think is exclusively Train-specific (these metrics are on the Data dashboard). I can remove this info statement altogether, or change it to a debug statement and specify that iteration metrics (Data dashboard) are not being tracked. Thoughts?
> Does this prevent the issue where you get -1000 blocks, like in the "Iteration Blocks Local" chart? If so, how?
It does. Before, we returned -1 when this feature was disabled, and then blindly added the number of cache hits, cache misses, and unknowns to the running totals even when those values were -1. Here is a local run I did:

In the first two runs I set it to False; in the third (in blue) I set it to True.
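The before/after difference can be sketched like this (hypothetical function, assuming the behavior described in the thread):

```python
def locality_stats(enabled, hits=4, misses=1, unknowns=0):
    """Return (hits, misses, unknowns) for one iteration.
    Previously the disabled path returned (-1, -1, -1)."""
    if not enabled:
        # Returning 0s keeps blind accumulation from going negative
        # (e.g. the -1000 blocks seen in "Iteration Blocks Local").
        return 0, 0, 0
    return hits, misses, unknowns

# With tracking disabled, 1000 iterations now accumulate to 0, not -1000:
total_hits = sum(locality_stats(enabled=False)[0] for _ in range(1000))
```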
IMO this might confuse users. From the log message, I don't think it's clear whether you should enable this, and what the tradeoffs are (I think there are performance implications?)
In the majority of cases, do you want to enable this? If so, I think we should by default. If not, I don't think we need to emit a log.
Also, I think we log this once per train worker, and I think that could become noisy.
…/handle-disabled-object-locality-stats
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
bveeramani
left a comment
One comment about the log, but otherwise LGTM.
…ect#58754)

## Description

Previously, if `DataContext.get_current().enable_get_object_locations_for_metrics=False` (which it is by default), then we would default to returning `(-1, -1, -1)`. This wasn't properly handled, so we would get negative metrics. This PR addresses that.

This PR also fixes run_index=-1 for **streaming split**. For all iterators but streaming split:

1. create executor with `dataset_tag` from 2)
2. increment `dataset_tag`
3. Get `dataset_tag` (dataset_-1)

However, streaming_split skips step 2. This PR addresses that.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Description

Previously, if `DataContext.get_current().enable_get_object_locations_for_metrics=False` (which it is by default), then we would default to returning `(-1, -1, -1)`. This wasn't properly handled, so we would get negative metrics. This PR addresses that.

This PR also fixes run_index=-1 for streaming split. For all iterators but streaming split:

1. create executor with `dataset_tag` from 2)
2. increment `dataset_tag`
3. Get `dataset_tag` (dataset_-1)

However, streaming_split skips step 2. This PR addresses that.

Related issues

Additional information