[Data] Add approximate quantile to aggregator #57598

alexeykudinkin merged 12 commits into ray-project:master
Conversation
Force-pushed from e0584b6 to 45381b1
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Force-pushed from 45381b1 to 024f199
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
python/ray/data/aggregate.py (Outdated)

```py
        """
        self._require_datasketches()
        self._quantiles = quantiles
        self._k = k
```
Instead of `k`, let's use `capacity_per_level`.
`capacity_per_level` doesn't feel accurate to me. I think we don't need to hide the detail of `k`, since users will need to read the DataSketches docs anyway.
I added a link to the `k` parameter description to guide users to the docs for more info.
The problem is that it's not obvious to a user what `k` represents.
They'd have to look up the algorithm to build intuition. Curious why you say `capacity_per_level` is inaccurate?
It's just that I think the concept of "accuracy" should be in the parameter name,
and from the user's view "capacity" might be confusing.
How about `accuracy_factor`?
`quantile_precision`?
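For context, the whole naming question is about which constructor keyword surfaces KLL's `k`. A minimal sketch of how the rename could be plumbed through (the `on`/`quantiles`/`quantile_precision` names come from the diff above; the class body itself is illustrative, not the real implementation):

```py
class ApproximateQuantile:
    """Illustrative wrapper: `quantile_precision` maps one-to-one onto KLL's k."""

    def __init__(self, on, quantiles, quantile_precision=800):
        self._on = on
        self._quantiles = quantiles
        # Larger quantile_precision (k) -> lower rank error, more memory per sketch.
        self._k = quantile_precision

    def describe(self):
        return f"approx_quantile({self._on}) with k={self._k}"
```

The user-facing name carries the intuition ("higher = more precise"), while internally the value is still passed straight to the sketch as `k`.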
python/ray/data/aggregate.py (Outdated)

```py
    def zero(self, quantile_precision: int):
        sketch_cls = self._require_datasketches()
```
This should only be needed in the constructor.
## Why are these changes needed?

Add an `ApproximateQuantile` aggregator to Ray Data using the DataSketches KLL sketch.

Reasons:
- Enables efficient support for the summary API.
- More scalable than the exact `Quantile` aggregator on large datasets.

Note:
- DataSketches is not added as a Ray dependency; if it's missing, users are prompted to install it.

---

Here's a simple test to show the efficiency difference between `ApproximateQuantile` and `Quantile`:

```py
import time

import ray
import ray.data
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")
```

In this run with 1e8 rows, the approximate median returned 49,979,428.0 in ~12.46 s, while the exact `Quantile` returned 49,999,999.5 in ~163.33 s. The difference reflects the sketch's accuracy trade-off for significant speed and scalability gains.

With k=800 (the default), the error rate is guaranteed to be < 0.45%. In this test the error rate is `(49,999,999.5 - 49,979,428.0) / 49,999,999.5` = 0.00041143 = 0.041143%, which is well under 0.45%, and we get the approximate median **13.11x** faster.

```
{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
```

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
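The error-rate and speedup arithmetic quoted in the description can be checked directly from the reported numbers:

```py
# Numbers reported in the PR description's benchmark run.
approx_median = 49_979_428.0
exact_median = 49_999_999.5
approx_seconds = 12.457247257232666
exact_seconds = 163.32705521583557

# Relative error of the sketch's median estimate.
error = (exact_median - approx_median) / exact_median
# Wall-clock speedup of ApproximateQuantile over exact Quantile.
speedup = exact_seconds / approx_seconds

print(f"relative error: {error:.6%}")  # prints "relative error: 0.041143%"
print(f"speedup: {speedup:.2f}x")      # prints "speedup: 13.11x"
```

Both match the description: ~0.041% observed error, comfortably under the 0.45% bound for k=800, at a ~13x speedup.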
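The reason the approximate aggregator scales is that sketches are mergeable: each block is summarized independently, the per-block summaries are merged, and quantiles are read off the merged summary. That pattern can be sketched with a toy bounded-memory quantile sketch (a deliberately simplified stand-in for the real KLL algorithm, assuming equal-sized blocks so all retained samples carry equal weight):

```py
class ToyQuantileSketch:
    """Toy mergeable quantile sketch: an evenly strided sample per block.

    NOT the real KLL algorithm from datasketches; just illustrates the
    per-block summarize -> merge -> finalize flow, assuming equal-sized blocks.
    """

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.samples = []

    @classmethod
    def from_block(cls, values, capacity=1024):
        """Summarize one block with at most `capacity` evenly spaced samples."""
        sketch = cls(capacity)
        ordered = sorted(values)
        step = max(1, len(ordered) // capacity)
        sketch.samples = ordered[::step][:capacity]
        return sketch

    def merge(self, other):
        """Combine two block summaries, keeping memory bounded."""
        merged = sorted(self.samples + other.samples)
        while len(merged) > self.capacity:
            merged = merged[::2]  # halve uniformly; equal weights are preserved
        self.samples = merged
        return self

    def get_quantile(self, q):
        """Estimate the q-quantile from the retained samples."""
        idx = min(int(q * len(self.samples)), len(self.samples) - 1)
        return self.samples[idx]
```

Each block's summary is small and fixed-size, so the merge step moves kilobytes instead of the full column, which is where the speedup over the exact `Quantile` (which must see all values) comes from.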