[Data] Add approximate quantile to aggregator (#57598)
## Why are these changes needed?
Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL.
Reason:
• Enables efficient support for the summary API.
• More scalable than exact Quantile on large datasets.
Note:
• DataSketches is not added as a Ray dependency; if missing, users are
prompted to install it.
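
The optional-dependency behavior in the note above can be sketched as a guarded import. This is only an illustration: `require_optional` is a hypothetical helper, not the code in this PR, and the `pip install datasketches` hint assumes that is the package name on PyPI.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f"Install it with: {install_hint}"
        ) from exc


# Modules that are present import normally through the helper.
math = require_optional("math", "(bundled with Python)")
print(math.sqrt(16.0))

# An aggregator could defer the import until first use, for example:
#   datasketches = require_optional("datasketches", "pip install datasketches")
```

Deferring the import this way keeps the package out of the required dependencies while still giving users a clear next step when it is missing.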
---
Here's a simple test to show the efficiency difference between
`ApproximateQuantile` and `Quantile`:
```py
import time

import ray
import ray.data
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

# Approximate median via the sketch-based aggregator.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

# Exact median on a fresh dataset of the same size, for comparison.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")
```
In this run with 1e8 rows, the approximate median returned 49,979,428.0
in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s.
The difference reflects the sketch’s accuracy trade-off for significant
speed and scalability gains.
With k=800 (the default), the sketch's error bound is < 0.45%. In this
test the error relative to the exact median is
`(49,999,999.5-49,979,428.0)/49,999,999.5` = 0.00041143 = 0.041143%,
which is well under 0.45%, and the approximate median is computed
**13.11x** faster.
```
{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
```
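
The error and speedup figures quoted above can be re-derived from the printed output. The snippet below is a small check, not part of the PR. The rank-error line rests on two assumptions: that KLL accuracy is stated as normalized rank error, and that for the uniform ids 0..1e8-1 this can be estimated as the value difference divided by the row count.

```python
# Figures copied from the run reported above.
exact_median = 49_999_999.5
approx_median = 49_979_428.0
approx_seconds = 12.457247257232666
exact_seconds = 163.32705521583557
row_count = 10**8

# Error relative to the exact median, as computed in the description.
relative_error = abs(exact_median - approx_median) / exact_median

# Assumption: ids are uniform over 0..row_count-1, so the normalized rank
# error is roughly the value difference divided by the row count.
rank_error = abs(exact_median - approx_median) / row_count

speedup = exact_seconds / approx_seconds

print(f"relative error: {relative_error:.6%}")
print(f"estimated rank error: {rank_error:.6%}")
print(f"speedup: {speedup:.2f}x")
```

Under either measure the observed error sits well below the 0.45% bound quoted for k=800.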
## Related issue number
## Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
---------
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
on: The name of the column to calculate the quantile on. Must be a numeric column.
quantiles: The list of quantiles to compute. Must be between 0 and 1 inclusive. For example, quantiles=[0.5] computes the median. Null entries in the source column are skipped.
quantile_precision: Controls the accuracy and memory footprint of the sketch (K in KLL); higher values yield lower error but use more memory. Defaults to 800. See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.
alias_name: Optional name for the resulting column. If not provided, defaults to "approx_quantile({column_name})".
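
The argument contract documented above (quantiles within [0, 1], nulls skipped, default output name) can be illustrated with an exact, pure-Python stand-in. This is a sketch for intuition only: `exact_quantiles_reference` is a hypothetical helper, it computes exact values with linear interpolation rather than using a KLL sketch, and it therefore has no counterpart to `quantile_precision`.

```python
from typing import Optional, Sequence


def exact_quantiles_reference(
    rows: Sequence[dict],
    on: str,
    quantiles: Sequence[float],
    alias_name: Optional[str] = None,
) -> dict:
    """Exact stand-in mirroring the documented arguments (hypothetical helper)."""
    for q in quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"quantile {q} is outside [0, 1]")

    # Null entries in the source column are skipped.
    values = sorted(row[on] for row in rows if row.get(on) is not None)
    if not values:
        raise ValueError(f"column '{on}' has no non-null values")

    results = []
    for q in quantiles:
        # Linear interpolation between the two nearest ranks (an assumption
        # of this stand-in; a sketch returns an approximation instead).
        position = q * (len(values) - 1)
        lower = int(position)
        upper = min(lower + 1, len(values) - 1)
        fraction = position - lower
        results.append(values[lower] + (values[upper] - values[lower]) * fraction)

    # Default output name follows the documented pattern.
    name = alias_name if alias_name is not None else f"approx_quantile({on})"
    return {name: results}


rows = [{"id": 0}, {"id": None}, {"id": 1}, {"id": 2}, {"id": 3}]
print(exact_quantiles_reference(rows, on="id", quantiles=[0.5]))
print(exact_quantiles_reference(rows, on="id", quantiles=[0.0, 1.0], alias_name="id_range"))
```

The result is a list with one value per requested quantile, keyed by the alias, which matches the shape of the `{'approx_quantile(id)': [...]}` output shown in the description.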