[Data] Enable Concurrency cap backpressure with tuning #59392
alexeykudinkin merged 11 commits into master
Conversation
Code Review
This pull request tunes the concurrency cap backpressure policy by adjusting EWMA parameters and, most notably, adding a hard guardrail based on the initial queue size to make the system more responsive to downstream queueing. The changes are generally well-implemented and include relevant test updates. My review includes a high-severity comment about a potential bug in detecting downstream materializing operators, which could lead to incorrect backpressure behavior. I've also included a couple of medium-severity suggestions to improve code clarity and maintainability by refactoring a magic number and removing unrelated changes from this PR.
python/ray/data/_internal/execution/resource_manager.py (409-414)
The current implementation of has_materializing_downstream_op only checks for immediate downstream operators. This can lead to incorrect behavior if there are non-materializing operators (like limit or filter) between the current operator and a materializing one (e.g., map -> limit -> all_to_all). In such a scenario, backpressure might be incorrectly applied to the map operator, potentially starving the all_to_all operator. The check should traverse the full downstream DAG to be correct.
A breadth-first search would be a robust way to implement this traversal.
def has_materializing_downstream_op(self, op: PhysicalOperator) -> bool:
    """Check if the operator has a downstream materializing operator."""
    q = list(op.output_dependencies)
    visited = set(op.output_dependencies)
    while q:
        curr_op = q.pop(0)
        if isinstance(curr_op, MATERIALIZING_OPERATORS):
            return True
        for next_op in curr_op.output_dependencies:
            if next_op not in visited:
                visited.add(next_op)
                q.append(next_op)
    return False
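A self-contained demo of the suggested traversal, using stub operator classes since `PhysicalOperator` and `MATERIALIZING_OPERATORS` live in Ray Data internals (the names `Op` and `MaterializingOp` below are illustrative only):

```python
from collections import deque

class Op:
    """Illustrative stand-in for a physical operator with downstream edges."""
    def __init__(self, name, deps=None):
        self.name = name
        self.output_dependencies = deps or []

class MaterializingOp(Op):
    """Stand-in for a materializing operator (e.g. an all-to-all shuffle)."""

def has_materializing_downstream_op(op) -> bool:
    """BFS over the downstream DAG looking for a materializing operator."""
    q = deque(op.output_dependencies)
    visited = set(op.output_dependencies)
    while q:
        curr = q.popleft()
        if isinstance(curr, MaterializingOp):
            return True
        for nxt in curr.output_dependencies:
            if nxt not in visited:
                visited.add(nxt)
                q.append(nxt)
    return False

# map -> limit -> all_to_all: the materializing op is two hops away,
# so an immediate-children check would miss it, but BFS finds it.
all_to_all = MaterializingOp("all_to_all")
limit = Op("limit", [all_to_all])
map_op = Op("map", [limit])
assert has_materializing_downstream_op(map_op)
assert not has_materializing_downstream_op(all_to_all)
```

Using `collections.deque` here makes each dequeue O(1); the `visited` set keeps the traversal linear even if the DAG has shared downstream operators.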
python/ray/data/_internal/execution/backpressure_policy/concurrency_cap_backpressure_policy.py (231)
The formula for max_queue_bytes uses a magic number 2 for division. To improve readability and make the tuning parameter more explicit, consider defining this factor as a named constant at the class level, for example: HARD_GUARDRAIL_BUDGET_FACTOR = 0.5. This makes the logic clearer and easier to adjust in the future.
max_queue_bytes = initial_q * (1.0 + self.OBJECT_STORE_BUDGET_RATIO * self.HARD_GUARDRAIL_BUDGET_FACTOR)
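A minimal sketch of that refactor, showing where the named constant would live. The class name, `OBJECT_STORE_BUDGET_RATIO` value, and the `_max_queue_bytes` helper are assumptions drawn from the review context, not the actual source:

```python
class ConcurrencyCapBackpressurePolicy:
    # Assumed existing tuning knob; the real value lives in the policy source.
    OBJECT_STORE_BUDGET_RATIO = 0.5
    # Named factor replacing the magic `* 0.5` (i.e. `/ 2`) in the guardrail,
    # per the review suggestion above.
    HARD_GUARDRAIL_BUDGET_FACTOR = 0.5

    def _max_queue_bytes(self, initial_q: int) -> float:
        # Hard guardrail: let the queue grow past its initial size by only
        # a fraction of the object-store budget.
        return initial_q * (
            1.0 + self.OBJECT_STORE_BUDGET_RATIO * self.HARD_GUARDRAIL_BUDGET_FACTOR
        )
```

With these example values, an initial queue of 100 bytes yields a cap of 125 bytes, and retuning means editing one named constant instead of hunting for a literal.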
python/ray/data/context.py (254-256)
This new context setting DEFAULT_DOWNSTREAM_CAPACITY_OUTPUTS_RATIO is added but appears to be unused within this pull request. To keep PRs focused and easier to review, it's generally better to introduce new configurations in the same PR where they are first used. Consider moving this to a future PR where it's implemented. The same applies to other unrelated changes in this file, such as the polars rename and _epoch_idx addition.
python/ray/data/_internal/execution/backpressure_policy/concurrency_cap_backpressure_policy.py (outdated, resolved)
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
        * actor_pool.max_tasks_in_flight_per_actor()
        for actor_pool in op.get_autoscaling_actor_pools()
    )
)
Bug: ActorPoolMapOperator cap uses different formula than TaskPoolMapOperator
The concurrency cap calculation for ActorPoolMapOperator uses max_tasks_in_flight_per_actor() (lines 85-86), but for TaskPoolMapOperator the code uses op.get_max_concurrency_limit() (line 91). The existing ActorPoolMapOperator.get_max_concurrency_limit() method uses max_actor_concurrency() instead of max_tasks_in_flight_per_actor(). Since max_tasks_in_flight_per_actor defaults to 2x max_actor_concurrency, this results in an inconsistent cap that is twice as high for actor pools compared to what get_max_concurrency_limit() would return.
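An illustrative stub reproducing the mismatch described above. The real operators live in Ray Data internals with different signatures; the 2x default is taken from the review comment, and the concrete numbers are only for demonstration:

```python
class ActorPoolStub:
    """Stand-in for an autoscaling actor pool, per the review comment."""
    def max_actor_concurrency(self) -> int:
        return 4  # illustrative value

    def max_tasks_in_flight_per_actor(self) -> int:
        # Defaults to 2x max_actor_concurrency, per the review comment.
        return 2 * self.max_actor_concurrency()

pool = ActorPoolStub()
# Cap as the new code path computes it (per-actor in-flight tasks):
cap_from_in_flight = pool.max_tasks_in_flight_per_actor()
# Cap as get_max_concurrency_limit() would compute it (actor concurrency):
cap_from_concurrency = pool.max_actor_concurrency()
# The two formulas disagree by the 2x default, which is the inconsistency
# flagged above.
assert cap_from_in_flight == 2 * cap_from_concurrency
```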
# Initialize concurrency caps from operators (infinite if unset)
for op in self._topology:
    concurrency_cap: float = float("inf")
    if isinstance(op, ActorPoolMapOperator):
Let's avoid piling up changes.
This cap is not necessary -- we won't schedule more tasks than max_tasks_inflight.
We don't need the TPMA cap either, but let's clean that up in a separate change.
  DEFAULT_ENABLE_DYNAMIC_OUTPUT_QUEUE_SIZE_BACKPRESSURE: bool = env_bool(
-     "RAY_DATA_ENABLE_DYNAMIC_OUTPUT_QUEUE_SIZE_BACKPRESSURE", False
+     "RAY_DATA_ENABLE_DYNAMIC_OUTPUT_QUEUE_SIZE_BACKPRESSURE", True
Enabled by default, but will disable in 2.53.
@srinathk10 please re-run release tests on the latest state to confirm results are looking good still
Release test runs comparison with baseline.
[Data] Concurrency cap backpressure tuning (cherry-pick of #59392):
EWMA_ALPHA: updated from 0.2 to 0.1, making the adjusted level more sensitive to downstream queuing and thus more likely to limit concurrency.
K_DEV: updated from 2.0 to 1.0, tightening the standard-deviation band so the policy limits concurrency more readily under downstream queuing.
Description
[Data] Concurrency cap backpressure tuning
EWMA_ALPHA
Updated EWMA_ALPHA from 0.2 to 0.1. The adjusted level now reacts more slowly to queue growth, so a downstream queuing spike crosses the threshold sooner, biasing the policy toward limiting concurrency.
K_DEV
Updated K_DEV from 2.0 to 1.0. The standard-deviation band is now tighter, likewise biasing the policy toward limiting concurrency when downstream queues grow.
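The policy's exact trigger condition is internal, but the effect of both changes can be sketched with a generic EWMA-plus-deviation band. The assumption here is a trigger of the form `queue_size > level + K_DEV * stddev`; the sample values are illustrative only:

```python
import statistics

def ewma(samples, alpha):
    """Exponentially weighted moving average; alpha weights the newest sample."""
    level = samples[0]
    for x in samples[1:]:
        level = alpha * x + (1.0 - alpha) * level
    return level

# Downstream queue size is flat, then spikes.
samples = [10.0, 10.0, 10.0, 10.0, 50.0]
std = statistics.pstdev(samples)  # 16.0 for these samples

# Lower alpha keeps the tracked level closer to the pre-spike baseline,
# and lower K_DEV narrows the tolerance band, so the spike crosses the
# (assumed) threshold sooner.
old_threshold = ewma(samples, alpha=0.2) + 2.0 * std  # wide band, slow to trip
new_threshold = ewma(samples, alpha=0.1) + 1.0 * std  # tight band, trips on the spike
assert new_threshold < old_threshold
assert samples[-1] > new_threshold  # the spike exceeds the tuned threshold
```

With these numbers the tuned settings flag the 50-unit spike while the old settings sit right at the edge of tolerating it, which matches the stated intent of being more sensitive to downstream queuing.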