
[Data] Add approximate quantile to aggregator #57598

Merged
alexeykudinkin merged 12 commits into ray-project:master from owenowenisme:data/add-approximate-quantile-to-aggregrator
Oct 16, 2025

Conversation

Member

@owenowenisme owenowenisme commented Oct 9, 2025

Why are these changes needed?

Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL.

Reason:
• Enables efficient support for the summary API.
• More scalable than exact Quantile on large datasets.

Note:
• DataSketches is not added as a Ray dependency; if missing, users are prompted to install it.
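
The optional-dependency behavior described in the note can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `require_optional` and the wording of the error message are assumptions made for the example.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, or raise with an install hint.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that tells the user how to fix the problem.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with `pip install {pip_name}`."
        ) from exc
```

A caller would use something like `require_optional("datasketches", "datasketches")` before building a sketch, so users who never touch the approximate aggregator do not need the package installed.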


Here's a simple test to show the efficiency difference between ApproximateQuantile and Quantile:

import time

import ray
import ray.data
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

# Approximate median, computed with the KLL sketch.
ds = ray.data.range(10**8)
start_time = time.time()

print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

# Exact median on an identical dataset, for comparison.
ds = ray.data.range(10**8)
start_time = time.time()

print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")

In this run with 1e8 rows, the approximate median returned 49,979,428.0 in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s. The difference reflects the sketch’s accuracy trade-off for significant speed and scalability gains.

With k=800 (the default), the error rate is bounded below 0.45%. In this test the error rate is (49,999,999.5 - 49,979,428.0) / 49,999,999.5 = 0.00041143 = 0.041143%, well under that bound, and the approximate median is computed 13.11x faster.

{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
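
The figures quoted above can be re-derived from the printed output with a few lines of arithmetic. This is only a check of the reported numbers, not a new benchmark. The `rank_error` line is an added assumption: KLL accuracy is usually stated in terms of normalized rank, and because `ray.data.range(10**8)` holds each integer exactly once, a shift in value corresponds to roughly the same shift in rank.

```python
# Values copied from the run output above.
exact_median = 49_999_999.5
approx_median = 49_979_428.0
n_rows = 10**8
t_approx = 12.457247257232666  # seconds, ApproximateQuantile
t_exact = 163.32705521583557   # seconds, Quantile

abs_diff = exact_median - approx_median

# Relative error in the returned value, as computed in the description.
relative_error = abs_diff / exact_median

# Approximate normalized rank error (assumption: one row per integer value).
rank_error = abs_diff / n_rows

speedup = t_exact / t_approx

print(f"relative value error: {relative_error:.6%}")
print(f"approx. rank error:   {rank_error:.6%}")
print(f"speedup:              {speedup:.2f}x")
```

Under either reading of the error, the observed deviation stays well below the 0.45% figure quoted for k=800.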

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@owenowenisme owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from e0584b6 to 45381b1 Compare October 9, 2025 13:20
@owenowenisme owenowenisme added the go add ONLY when ready to merge, run all tests label Oct 9, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from 45381b1 to 024f199 Compare October 9, 2025 23:55
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme owenowenisme marked this pull request as ready for review October 10, 2025 08:27
@owenowenisme owenowenisme requested a review from a team as a code owner October 10, 2025 08:27
cursor[bot]

This comment was marked as outdated.

Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 10, 2025
"""
self._require_datasketches()
self._quantiles = quantiles
self._k = k
Contributor

instead of k, let's use capacity_per_level

Member Author

@owenowenisme owenowenisme Oct 11, 2025

capacity_per_level does not feel accurate to me. I think we don't need to hide the detail of k, since users will need to read the datasketches docs anyway.

I added a link to the k parameter's description to point users to the docs for more info.

Contributor

The problem there is it's not obvious to a user what k represents.

They have to look up the algorithm to build intuition. Curious, why do you say capacity_per_level is inaccurate?

Member Author

It's just that I think the concept of "accuracy" should be in the param name.
Also, from a user's point of view, "capacity" might be confusing.
How about accuracy_factor?

Contributor

quantile_precision?

Member Author

SG!

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
cursor[bot]

This comment was marked as outdated.

@bveeramani bveeramani enabled auto-merge (squash) October 15, 2025 17:33
@github-actions github-actions bot disabled auto-merge October 15, 2025 17:33
)

def zero(self, quantile_precision: int):
sketch_cls = self._require_datasketches()
Contributor

This should only be needed in the ctor

Contributor

Ditto everywhere
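
The review suggestion above (check the dependency only in the constructor) might look roughly like this. It is a hypothetical skeleton, not the merged implementation: `SketchAggregatorSketch`, `FakeSketch`, and `fake_loader` are stand-ins invented for the example so that it runs without datasketches installed. Only the `quantile_precision` name and its default of 800 come from the discussion.

```python
from typing import Any, Callable


class SketchAggregatorSketch:
    """Hypothetical skeleton; not Ray's ApproximateQuantile implementation."""

    def __init__(self, load_sketch_cls: Callable[[], Any], quantile_precision: int = 800):
        # Resolve the optional dependency once. A missing package fails here,
        # at construction time, rather than later inside zero() or merge().
        self._sketch_cls = load_sketch_cls()
        self._quantile_precision = quantile_precision

    def zero(self) -> Any:
        # Reuse the class cached by the constructor; no second dependency check.
        return self._sketch_cls(self._quantile_precision)


class FakeSketch:
    """Stand-in for a real sketch class; only stores its precision."""

    def __init__(self, k: int):
        self.k = k


loader_calls = []


def fake_loader():
    # Stand-in for the real import check; records each invocation.
    loader_calls.append(1)
    return FakeSketch


agg = SketchAggregatorSketch(fake_loader)
first, second = agg.zero(), agg.zero()
print(len(loader_calls), first.k)
```

However many times `zero()` is called, the loader runs exactly once, which is the behavior the reviewer asks for.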

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@alexeykudinkin alexeykudinkin merged commit 81cf351 into ray-project:master Oct 16, 2025
6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests


4 participants