Skip to content

[data] Default return 0s for object locality instead of -1s#58754

Merged
bveeramani merged 9 commits intoray-project:masterfrom
iamjustinhsu:jhsu/handle-disabled-object-locality-stats
Nov 25, 2025
Merged

[data] Default return 0s for object locality instead of -1s#58754
bveeramani merged 9 commits intoray-project:masterfrom
iamjustinhsu:jhsu/handle-disabled-object-locality-stats

Conversation

@iamjustinhsu
Copy link
Contributor

@iamjustinhsu iamjustinhsu commented Nov 18, 2025

Description

Previously, if DataContext.get_current().enable_get_object_locations_for_metrics=False (which it is by default), then we would default return (-1, -1, -1). This wasn't properly being handled, so we would get negative metrics. This PR addresses that

This PR also fixes run_index=-1 for streaming split. For all iterators but streaming split:

  1. create executor with dataset_tag from 2)
  2. increment dataset_tag
  3. Get dataset_tag (dataset_-1)

However, streaming_split skips step 2. This PR addreses that.

Related issues

Additional information

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner November 18, 2025 20:21
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses an issue where negative metrics were being reported when object locality tracking was disabled. The fix involves checking for the sentinel value (-1, -1, -1) and skipping the metrics aggregation, which is the right approach. Additionally, the new log message that appears once is a helpful diagnostic for users, and the updated docstring clarifies the purpose of the configuration flag. I have one minor suggestion to simplify the check for the sentinel value.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/handle-disabled-object-locality-stats branch from f6b259e to c34bb21 Compare November 18, 2025 20:23
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
"Set `DataContext.get_current().enable_get_object_locations_for_metrics=True` to enable this."
)

return -1, -1, -1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(-1, -1, -1) wasn't really being used, so defaulting to 0s instead.

@iamjustinhsu iamjustinhsu changed the title [data] Skip if object locality if disabled [data] Default return 0s for object locality instead of -1s Nov 18, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

assert final_stats == ds_stats
assert f"dataset_{ds._uuid}_0" == update_fn.call_args_list[-1].args[1]
assert f"dataset_{ds._uuid}_1" == update_fn.call_args_list[-1].args[1]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect test expectation after run index change

The test expects the dataset ID to be "dataset_{uuid}_1" after calling take_all(), but with the new logic where _run_index starts at 0 and is incremented after creating the executor, the first execution uses "dataset_{uuid}_0". The executor is created with get_dataset_id() which returns the current _run_index value before incrementing it. Since _run_index starts at 0, the first execution uses index 0, not 1. The test assertion should be checking for "dataset_{uuid}_0" instead of "dataset_{uuid}_1".

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 19, 2025
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iamjustinhsu Does this prevent the issue where you get -1000 blocks, like in the "Iteration Blocks Local" chart? If so, how?

image

Comment on lines +40 to +44
elif log_once("enable_get_object_locations_for_metrics"):
logger.info(
"Tracking the number of object cache hits and misses is currently disabled. "
"Set `DataContext.get_current().enable_get_object_locations_for_metrics=True` to enable this."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since enable_get_object_locations_for_metrics is disabled by default, we'll also emit this warning by default too, right?

If the metric is niche, it might be better to surface it only when the feature is explicitly relevant. Otherwise, the warning might clutter logs or confuse users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature is only used when they do iter_batches(), iter_rows, iterations etc..., which I don't think is exclusively train specific (these are on the data dashboard). I can remove this info statement alltogether, or change it to a debug statement and specify that iteration metrics (data dashboard) are not being tracked. Thoughts?

Does this prevent the issue where you get -1000 blocks, like in the "Iteration Blocks Local" chart? If so, how?

It does because before, we were returning -1 when this feature is disabled. Then we were doing some blind logic to always add the # of cache hits, # of cache misses, and unknowns, even if it was -1. Here is a local run i did:
image
the first 2 runs I set it to False, the 3rd (in blue) i set it to true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this might confuse users. From the log message, I don't think it's clear whether you should enable this, and what the tradeoffs are (I think there are performance implications?)

In the majority of cases, do you want to enable this? If so, I think we should by default. If not, I don't think we need to emit a log.

Also, I think we log this once per train worker, and I think that could become noisy.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment abut the log, but otherwise LGTM

Comment on lines +40 to +44
elif log_once("enable_get_object_locations_for_metrics"):
logger.info(
"Tracking the number of object cache hits and misses is currently disabled. "
"Set `DataContext.get_current().enable_get_object_locations_for_metrics=True` to enable this."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this might confuse users. From the log message, I don't think it's clear whether you should enable this, and what the tradeoffs are (I think there are performance implications?)

In the majority of cases, do you want to enable this? If so, I think we should by default. If not, I don't think we need to emit a log.

Also, I think we log this once per train worker, and I think that could become noisy.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@bveeramani bveeramani enabled auto-merge (squash) November 25, 2025 19:48
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 25, 2025
@bveeramani bveeramani merged commit 8f6ce88 into ray-project:master Nov 25, 2025
8 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/handle-disabled-object-locality-stats branch November 25, 2025 20:52
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…ect#58754)

## Description
Previously, if
`DataContext.get_current().enable_get_object_locations_for_metrics=False`
(which it is by default), then we would default return `(-1, -1, -1)`.
This wasn't properly being handled, so we would get negative metrics.
This PR addresses that

This PR also fixes run_index=-1 for **streaming split**. For all
iterators but streaming split:
1. create executor with `dataset_tag` from 2)
2. increment `dataset_tag`
3. Get dataset_tag (dataset_-1)

However, streaming_split skips step 2. This PR addreses that.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ect#58754)

## Description
Previously, if
`DataContext.get_current().enable_get_object_locations_for_metrics=False`
(which it is by default), then we would default return `(-1, -1, -1)`.
This wasn't properly being handled, so we would get negative metrics.
This PR addresses that

This PR also fixes run_index=-1 for **streaming split**. For all
iterators but streaming split:
1. create executor with `dataset_tag` from 2)
2. increment `dataset_tag`
3. Get dataset_tag (dataset_-1)

However, streaming_split skips step 2. This PR addreses that.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants