Refactor: use Apache Arrow compute for string function by maartenbreddels · Pull Request #885 · vaexio/vaex

maartenbreddels · 2020-07-03T10:47:01Z

This is a draft PR to check the status of Arrow compute with vaex. I think we will likely cherry-pick from this branch as Arrow makes new releases.

xhochy · 2020-07-03T15:43:01Z

FYI: If you have a bit of patience (like 24h of patience), you could use the arrow conda packages in the arrow-nightlies channel instead of building it yourself.

maartenbreddels · 2020-07-03T18:39:34Z

Great, I didn't know it existed, and it was difficult to find. Thanks a lot!

maartenbreddels · 2020-07-13T08:36:56Z

@JovanVeljanoski it would be great if you could add/finish the str->boolean functions added in apache/arrow#7656.
There are a few new ones (see compute.py), and binary_isascii in particular is something we may want to think about. Maybe we want to expose this under Expression.str.isascii(), and later also under Expression.binary.isascii() if we add that accessor.
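To illustrate why one ASCII check could serve both a string accessor and a possible binary accessor, here is a minimal pure-Python sketch of the semantics. It does not call Arrow or vaex; the function names `isascii_binary` and `isascii_str` are hypothetical and only show that the check can be done at the byte level, with missing values propagated as None.

```python
from typing import List, Optional


def isascii_binary(values: List[Optional[bytes]]) -> List[Optional[bool]]:
    """True where every byte is below 0x80; None where the value is missing."""
    return [None if v is None else all(b < 0x80 for b in v) for v in values]


def isascii_str(values: List[Optional[str]]) -> List[Optional[bool]]:
    """Same check for strings, done on their UTF-8 bytes.

    UTF-8 encodes ASCII code points as single bytes below 0x80 and every
    other code point as bytes of 0x80 or above, so the byte-level test
    is sufficient.
    """
    encoded = [None if v is None else v.encode("utf-8") for v in values]
    return isascii_binary(encoded)


print(isascii_str(["vaex", "café", None, ""]))
print(isascii_binary([b"abc", b"\xff\x00"]))
```

Because the string version only delegates to the byte-level check, the same kernel could in principle back both accessors.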

maartenbreddels · 2020-07-15T16:59:37Z

@JovanVeljanoski I think I want to merge this early and leave the rest for you to do in a different PR. We need some of this in #865, and I also want to merge that soon.
This means that master will very likely be breaking, so the next release will be vaex v4. Do you agree?

JovanVeljanoski · 2020-07-16T06:47:12Z

 _doc_snippets['chunk_size_export'] = 'Number of rows to be written to disk in a single iteration'
 _doc_snippets['evaluate_parallel'] = 'Evaluate the (virtual) columns in parallel'
 _doc_snippets['array_type'] = 'Type of output array, possible values are None/"numpy" (ndarray), "xarray" for a xarray.DataArray, or "list" for a Python list'
+_doc_snippets['ascii'] = 'Transform only ascii character (usually faster).'


character -> characters
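As a rough illustration of what an ASCII-only transform means, the sketch below contrasts it with a full Unicode transform in plain Python. It is not the vaex or Arrow implementation; the function `upper` and its `ascii` parameter are assumptions modelled on the doc snippet key above.

```python
import string
from typing import List, Optional

# Translation table that maps only a-z to A-Z; all other characters pass through.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def upper(values: List[Optional[str]], ascii: bool = False) -> List[Optional[str]]:
    """Uppercase each string; with ascii=True only ASCII letters are changed.

    Note: the parameter name shadows the builtin `ascii` inside this function;
    it is kept only to mirror the doc snippet key.
    """
    if ascii:
        return [None if v is None else v.translate(_ASCII_UPPER) for v in values]
    return [None if v is None else v.upper() for v in values]


print(upper(["straße", "héllo", None], ascii=True))
print(upper(["straße", "héllo", None]))
```

The ASCII-only path never changes the length of a string, whereas the Unicode path can (for example "ß" becomes "SS"), which is one plausible reason such a path is usually faster.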

maartenbreddels · 2020-08-21T14:05:21Z

Windows CI has become crazy slow btw; we'll have to trace back when/why that happened. It seems the conda env creation takes ages.

maartenbreddels force-pushed the refactor_use_arrow_compute branch 10 times, most recently from adf48db to da0a232 Compare July 3, 2020 14:49

maartenbreddels force-pushed the refactor_use_arrow_compute branch from da0a232 to 229f9ce Compare July 3, 2020 18:38

maartenbreddels force-pushed the refactor_use_arrow_compute branch 2 times, most recently from 820f61c to d55a96a Compare July 10, 2020 17:05

maartenbreddels force-pushed the refactor_use_arrow_compute branch from dbc8a7b to 0f89f54 Compare July 16, 2020 06:43

JovanVeljanoski reviewed Jul 16, 2020

maartenbreddels force-pushed the refactor_use_arrow_compute branch from 0f89f54 to 62e8302 Compare July 16, 2020 08:32

maartenbreddels marked this pull request as ready for review July 16, 2020 08:36

maartenbreddels force-pushed the refactor_use_arrow_compute branch 9 times, most recently from 025988f to 2a85f7c Compare July 17, 2020 11:49

maartenbreddels mentioned this pull request Jul 17, 2020

Fix: value counts on added/concatenated strings #908

Merged

2 tasks

maartenbreddels force-pushed the refactor_use_arrow_compute branch 3 times, most recently from e24fc95 to 5ed5b5b Compare July 18, 2020 17:41

maartenbreddels added 3 commits August 21, 2020 12:29

refactor: use arrow.compute for string functions

283f78a

core(fix): offsets for arrow string array not respected in conversion

e189b3f

chore: we can use pyarrow 1.0

394e70a

maartenbreddels force-pushed the refactor_use_arrow_compute branch from 5e50d73 to 394e70a Compare August 21, 2020 10:30

maartenbreddels merged commit 48531b5 into master Aug 21, 2020

maartenbreddels deleted the refactor_use_arrow_compute branch August 21, 2020 14:05
