[BLOG]: Supporting dask arrays in scipy via the Python Array API standard #904
lithomas1 wants to merge 6 commits into Quansight:main
Not done yet, but just wanted to put up a quick draft of my blog post. The main omission is probably the case study section where I port a scipy.stats workflow to dask arrays and look at performance. All other sections are basically complete.

This looks like a good start @lithomas1, thanks. The flow of the story at a high level looks good to me. The case study section is important indeed; the post is still pretty draft now so I didn't review in detail.

@rgommers This should be ready for a look now.
pavithraes left a comment
@lithomas1 Thank you! This is a good read!
I've shared some suggestions, mainly around phrasing :)
| --- | ||
| title: 'Supporting dask arrays in scipy via the Python Array API standard' | ||
| authors: [thomas-li] | ||
| published: May 26, 2025 |
Just noting that we'll need to update this before merging. :)
| In this post, I describe my journey getting SciPy to work with Dask arrays natively via the array API and the current | ||
| limitations and future outlook. |
Suggested change:
| In this post, I describe my journey getting SciPy to work with Dask arrays natively via the array API and the current | |
| limitations and future outlook. | |
| In this post, I describe my journey getting SciPy to work with Dask Arrays natively via the Array API standard, | |
| and discuss the current limitations and future outlook of this work. |
| ## Introduction: A quick refresher of the Python Array API standard | ||
| For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/), |
Suggested change:
| For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/), | |
| The [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/), |
I think it'll be nicer to start with a simpler statement
| ## Introduction: A quick refresher of the Python Array API standard | ||
| For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/), | ||
| is a specification aimed at unifying the various APIs of different array libraries (e.g. Numpy, PyTorch, JAX, Dask, etc.). |
Suggested change:
| is a specification aimed at unifying the various APIs of different array libraries (e.g. Numpy, PyTorch, JAX, Dask, etc.). | |
| is a specification aimed at unifying the various APIs of different array libraries (e.g. NumPy, PyTorch, JAX, Dask, etc.). |
I see there are several different capitalizations for various libraries through the blog. Could you please do a find and replace to have one style for all?
I think these are the capitalizations: NumPy, SciPy, Dask, PyTorch, pandas, JAX, and CuPy, unless you're referring to the API, in which case it's all lowercase and presented as inline code like dask.array. I think this is already the case for the most part, but I noticed a few deviations here and there, hence the explicit comment. :)
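
For readers of this thread, the unification that the introduction describes can be sketched in a few lines. This is a minimal illustration, not code from the draft: the function name `standardize` is hypothetical, and the fallback to NumPy for arrays that lack `__array_namespace__` is an assumption added so the sketch also runs on older NumPy releases.

```python
import numpy as np


def standardize(x):
    """Center and scale x using only functions from the array's own namespace."""
    # Arrays that implement the standard expose their namespace through
    # __array_namespace__(); falling back to NumPy is an assumption of this sketch.
    xp = x.__array_namespace__() if hasattr(x, "__array_namespace__") else np
    # mean and std are both part of the Array API standard.
    return (x - xp.mean(x)) / xp.std(x)


z = standardize(np.asarray([1.0, 2.0, 3.0, 4.0]))
print(z)
```

Because the function only touches `xp`, the same body should work for any array object whose library implements the standard, which is the property the post relies on.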
| users to treat arbitrary array objects as numpy arrays via duck typing. | ||
| Today, [array api support](https://scipy.github.io/devdocs/dev/api-dev/array_api.html) in scipy has progressed a long | ||
| way since mid 2023 when array API support was first experimentally introduced within the libary. While the array API |
Suggested change:
| way since mid 2023 when array API support was first experimentally introduced within the libary. While the array API | |
| way since mid-2023 when array API support was first experimentally introduced within the library. While the array API |
| `*` - Some public API functions/methods in this module have not yet been ported to the Array API standard. | ||
| (Status refers to the status of dask.array with ) | ||
| See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality) |
Suggested change:
| See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality) | |
| See the [SciPy developer docs](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality) |
| See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality) | ||
| for a list of supported functions/methods. | ||
| As of today, the `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to |
Suggested change:
| As of today, the `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to | |
| The `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to |
| In the next section, we will take a look more closely at how array API compatibility enables better performance with | ||
| dask arrays within the `scipy.stats` module. | ||
| ## Example |
| From this p-value, we can reject our null hypothesis that the average fare for trips with one passenger is the same as the average fare for trips with multiple passengers. | ||
| While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degredation and out-of-memory errors) |
Suggested change:
| While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degredation and out-of-memory errors) | |
| While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degredation and out-of-memory errors). |
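
The unknown-shape limitation mentioned in this paragraph can be reproduced in a few lines. This is a minimal sketch assuming `dask` and `numpy` are installed; the toy data is made up for illustration and is not the dataset used in the draft.

```python
import numpy as np
import dask.array as da

# A small chunked array standing in for a column of real data.
x = da.from_array(np.arange(10), chunks=5)

# Boolean masking is lazy, so the number of surviving elements is not known yet.
masked = x[x > 3]
lazy_shape = masked.shape
print(lazy_shape)

# Evaluating the mask once lets dask record the real chunk sizes (done in place).
masked.compute_chunk_sizes()
known_shape = masked.shape
print(known_shape)
```

The call to `compute_chunk_sizes()` is one way to resolve the shape, and it is an example of the kind of intermediate computation the paragraph says could not be avoided entirely.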
| Looking forward, we'd also like to enable `dask.array` support via the Array API in other Array API | ||
| compatible libraries, most notably scikit-learn. A previous | ||
| [attempt](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled |
Suggested change:
| [attempt](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled | |
| [attempt (scikit-learn PR#28588)](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled |

@rgommers @pavithraes Do you think it's worth it for me to address the feedback comments on this and try to get it merged or should we close it?

I think it's close to done, and would be nice to publish still as it captures interesting work done under a grant. @lithomas1 I think you planned to revisit this still, and it fell through the cracks? Any thoughts here?