
Conversation

@hoesler
Contributor

@hoesler hoesler commented May 29, 2020

This PR replaces tsfresh's intermediate dataframe with a flexible TsData class, which can be used to iterate directly over the different input data formats and yield the time series tuples used to extract features. Because it skips some heavy data transformations, this approach consumes far less memory.

Closes #702

I also spent some time trying to improve the performance of partitioning the data. From what I can tell from non-exhaustive tests, I had some success with the specific data mentioned in the issue:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6024604 entries, 0 to 1299438
Columns: 77 entries, index to machine_id
dtypes: float32(75), int64(1), object(1)
memory usage: 1.8+ GB

RelevantFeatureAugmenter(column_id='machine_id', column_sort='index', default_fc_parameters = MinimalFCParameters(), n_jobs=..., chunksize=...)

# default partitioning

0 jobs, default chunksize 1
766125/766125 [03:06<00:00, 4115.68it/s]

0 jobs, chunksize 766125
1/1 [02:28<00:00, 148.64s/it]

4 jobs, default chunksize ~38000
20/20 [01:45<00:00,  5.28s/it]

4 jobs, chunksize 191532
4/4 [02:16<00:00, 34.25s/it]

4 jobs, chunksize 10000
77/77 [01:50<00:00,  1.44s/it]


# custom partitioning

4 jobs, default chunksize ~38000
20/20 [02:21<00:00,  7.05s/it]

4 jobs, chunksize 191532
4/4 [01:21<00:00, 20.28s/it]

It most probably depends heavily on the data, but I leave that open for discussion and more tests. Also, iteration performance might matter less if the feature extraction computation is the bottleneck. Anyway, I think a more performant strategy might be to iterate the data once and map it to async Future objects, but I guess that's not compatible with the current map-reduce framework.
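For illustration, a rough sketch of the "iterate once and map to futures" idea using plain concurrent.futures; the chunk generator and per-chunk extraction function are hypothetical stand-ins, not tsfresh's distributor API:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract_on_chunk(chunk):
    # Hypothetical placeholder for the per-chunk feature extraction.
    return [(ts_id, kind, series.mean()) for ts_id, kind, series in chunk]

def extract_with_futures(chunk_iterator, n_workers=4):
    # Consume the (id, kind, series) chunks exactly once and submit each one
    # as soon as it is produced; results are collected as futures complete.
    results = []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(extract_on_chunk, chunk) for chunk in chunk_iterator]
        for future in as_completed(futures):
            results.extend(future.result())
    return results
```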

@coveralls

coveralls commented May 29, 2020


Coverage increased (+0.03%) to 97.132% when pulling faf2b46 on hoesler:distribute-melt into 29cdae3 on blue-yonder:master.

@hoesler hoesler changed the title Distribute melt Refactor input data iteration May 29, 2020
@MaxBenChrist MaxBenChrist self-requested a review May 30, 2020 15:05
@nils-braun
Collaborator

@hoesler So first of all thank you very very much for this very good PR!
I just had a very quick look over the code, and it looks quite well written. As @MaxBenChrist added himself as a reviewer, I will let him comment first, but it already looks quite promising.

Just to understand: the speed improvement comes from two facts, namely that you no longer need any pre-processing and that you need less memory, right? So the first part will especially be seen in "wide-column" (as you named it) data, but probably less in already melted data, whereas the second part will always be seen, correct?

@hoesler
Contributor Author

hoesler commented Jun 3, 2020

Hi @nils-braun.

Thank you very much. My primary goal was to improve memory consumption. This is achieved by iterating the input data directly and skipping costly transformations. I tried to make this iteration as performant as possible, but have not yet compared it to the speed of the original implementation. Under the new data model, however, it seems that accessing the "wide" dataframe in slices can speed things up, even though I am iterating over the groups more than once using itertools.islice. I haven't spent time improving the narrow and dict formats yet; this still needs to be done.
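For illustration, a minimal sketch of what slicing such an iteration with itertools.islice can look like; the generator below is a simplified stand-in for the wide-format adapter, not the PR's actual TsData code:

```python
from itertools import islice

import pandas as pd

def iter_wide(df, column_id, value_columns):
    # Yield one (id, kind, series) tuple per time series in the wide dataframe.
    for ts_id, group in df.groupby(column_id):
        for kind in value_columns:
            yield ts_id, kind, group[kind]

df = pd.DataFrame({
    "machine_id": [1, 1, 2, 2],
    "v1": [1.0, 2.0, 3.0, 4.0],
    "v2": [5.0, 6.0, 7.0, 8.0],
})

# Take only the second and third time series of the full iteration,
# without keeping the rest of the iteration in memory.
chunk = list(islice(iter_wide(df, "machine_id", ["v1", "v2"]), 1, 3))
```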

@TKlerx
Contributor

TKlerx commented Jun 5, 2020

Nice work @hoesler.
Any comment from @MaxBenChrist?
I guess the coverage needs to be increased?!

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Just the start of my review; I will try to finish it this weekend.

("b", 'v2', pd.Series([11], index=[2], name="v2"))]
self.assert_data_chunk_object_equal(result, expected)

def test_with_wrong_input(self):
Collaborator


It seems like this test is checking a lot of side conditions; normally I check those with pytest parametrize.
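For illustration, a minimal sketch of how such side conditions could be expressed with pytest.mark.parametrize; the validation helper and the concrete bad inputs are made up, not the actual tsfresh tests:

```python
import pandas as pd
import pytest

def validate_input(df, column_id):
    # Hypothetical stand-in for the input check under test.
    if not isinstance(df, pd.DataFrame) or column_id not in df.columns:
        raise ValueError("invalid input")
    return df

@pytest.mark.parametrize("bad_input", [
    None,                                # not a dataframe at all
    pd.DataFrame(),                      # empty, missing the id column
    pd.DataFrame({"value": [1, 2, 3]}),  # has data, but still no id column
])
def test_with_wrong_input(bad_input):
    with pytest.raises(ValueError):
        validate_input(bad_input, column_id="id")
```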

Contributor Author


Same here:

def test_with_wrong_input(self):

But I can try to improve these tests.

@MaxBenChrist
Collaborator

MaxBenChrist commented Jun 5, 2020

> Nice work @hoesler.
> Any comment from @MaxBenChrist?
> I guess the coverage needs to be increased?!

Sorry, I was busy lately. Very interesting work @hoesler, I will try to understand everything this weekend.

We should aim for high coverage, yes. But if we know why the coverage drops and there are good reasons not to test something, then I am also fine with merging PRs that reduce coverage.

@hoesler
Contributor Author

hoesler commented Jun 5, 2020

No problem. I am happy to improve coverage; I just wanted to hold off on the cumbersome work until I get some feedback on the overall design.

Collaborator

@MaxBenChrist MaxBenChrist left a comment


I am getting the overall design, and I like the clean split of input formats in feature_extraction/data.py.

However, I still need to review and think about the interplay with the distributors in distribution/data.py.

Will your solution work with all distributors?

I will need a few more rounds of reviews, also to get back into the low-level tsfresh codebase. So far I don't see any showstopper from a design point of view, so I can see this being merged in the near future. However, we will probably need a few rounds of reviews to polish everything, modernize the tests, etc. Good work again 👍

def __iter__(self):
    for kind, grouped_df in self.grouped_dict.items():
        for ts_id, group in grouped_df:
            yield ts_id, str(kind), group[self.column_value]
Collaborator

@MaxBenChrist MaxBenChrist Jun 7, 2020


Shouldn't we raise here if kind is not a string? I am not sure recasting is the right way to go in such a low-level method.

Contributor Author


This is something you have to tell me. I was recreating this:

timeseries_container[column_kind] = timeseries_container[column_kind].astype(str)

Maybe the conversion should happen in from_columns?
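For illustration, the two options under discussion side by side; this helper is hypothetical and only contrasts the old astype(str) recasting with an explicit error:

```python
def normalize_kind(kind, strict=False):
    """Return the kind as a string, or raise if strict checking is requested."""
    if isinstance(kind, str):
        return kind
    if strict:
        # Option raised in the review: fail loudly in the low-level method.
        raise ValueError(f"column kind must be a string, got {type(kind).__name__}")
    # Option matching the previous behaviour: silently recast, like astype(str).
    return str(kind)
```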

Collaborator

@nils-braun nils-braun left a comment


Thank you so much for this PR @hoesler
It is really nice. I added a couple of comments, mostly to understand more about the motivation behind your implementation.
In general, I also like the approach with the classes, but we need to make sure we understand what is going on here, so please do not feel overwhelmed by all the remarks ;-)

def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
Collaborator


Can you add a docstring here describing what offset actually is?
Just to understand if I am correct: I need to think of a large matrix of (id groups) x (value columns), and the offset starts counting from the top left to the bottom right, correct?

Contributor Author


It depends on the implementation, but for the wide df, yes. TsData, so far, does not guarantee any order of iteration.
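For illustration, one way to read that (id groups) x (value columns) picture for the wide format; this assumes the iteration goes group by group with the value columns varying fastest, which is only how the wide adapter happens to behave, not a guarantee of TsData:

```python
def offset_to_position(offset, n_value_columns):
    # Map a flat offset back to (group index, value column index),
    # counting from the top left to the bottom right of the matrix.
    group_index, column_index = divmod(offset, n_value_columns)
    return group_index, column_index

# With 3 value columns, offset 7 maps to group index 2 and value-column
# index 1, i.e. the 3rd id group's 2nd value column.
assert offset_to_position(7, 3) == (2, 1)
```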

Collaborator


Can you add this as an explanation to the docstring?

Contributor Author


I already documented this in the base class SliceableTsData. Are you missing something?

@nils-braun
Collaborator

Thanks for answering all my questions @hoesler !

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Just some random comments; I still have this on my todo list.

def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
Collaborator


Can you add this as an explanation to the docstring?

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Sorry for taking so long to review it. Thank you for this PR @hoesler.

So, if I see it correctly, the next steps would be to optimize the long and dict adapters, right?

for key, df in ts_dict.items():
    _check_nan(df, column_sort)

self.grouped_dict = {key: df.sort_values([column_sort]).groupby(column_id)
Collaborator


Just out of curiosity, how expensive is that groupby? Is it saving the groups in memory?

Contributor Author


Thanks for reviewing.

Regarding your first question: yes, one could investigate whether the long and dict formats could be iterated more efficiently, but I don't think this is critical. To your second question, I must say that I am relatively new to pandas and the way it behaves is a bit fuzzy to me (and, judging by SO, to most other people). For the case of groupby, I found this: https://stackoverflow.com/questions/52711823/does-groupby-in-pandas-create-a-copy-of-the-data-or-just-a-view
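For illustration, a small, informal way to see where the cost of groupby shows up; the data and column names are made up, and the comments reflect the behaviour described in the linked SO discussion rather than anything specific to this PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "machine_id": np.repeat(np.arange(1000), 100),
    "value": np.random.rand(100_000),
})

# The groupby call itself is cheap: it returns a lazy DataFrameGroupBy object
# and does not build per-group dataframes up front.
grouped = df.sort_values("machine_id").groupby("machine_id")
print(grouped.ngroups)  # 1000, derived from the group keys

# Per-group frames are only materialized when groups are accessed or iterated.
first_key, first_frame = next(iter(grouped))
print(first_key, len(first_frame))
```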

Collaborator


Yeah, the internal workings of pandas, like the BlockManager (https://uwekorn.com/2020/05/24/the-one-pandas-internal.html), are also fuzzy to me.

@nils-braun
Collaborator

I would vote for merging this first and tackling this in follow-ups!

@nils-braun nils-braun merged commit 6f18e18 into blue-yonder:master Jun 22, 2020
@hoesler
Contributor Author

hoesler commented Jun 23, 2020

Thank you @nils-braun and @MaxBenChrist for reviewing and accepting this PR!

Some thoughts on follow-ups:

  1. As @MaxBenChrist identified, the unit tests could use some refactoring, especially the DataTestCase class with its "fixtures".
  2. The default number of partitions might not be optimal any more. In general, this depends not only on the distributor but also on the underlying data, especially if you also want to support naturally partitioned data like dask or spark dataframes in the future.
  3. Maybe performance and memory benchmarking could be part of the automated tests.
  4. Optimizations of the long and dict formats could be evaluated.

@dbarbier
Contributor

Hello, point 3 is discussed at #710.

This is an unpopular opinion, but I would drop _check_nan; the performance penalty is noticeable with huge datasets (I did not perform any benchmark, this is based on my experience with similar tools). Or at the very least, please provide an argument to disable these checks if the user knows what she is doing.
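For illustration, a minimal sketch of what such an opt-out could look like; the wrapper function and the validate flag are made up here, only the idea of guarding the existing NaN check comes from the comment above:

```python
def _check_nan(df, *columns):
    # Simplified version of the kind of check being discussed.
    for col in columns:
        if df[col].isnull().any():
            raise ValueError(f"Column {col} must not contain NaN values")

def prepare_input(df, column_sort, validate=True):
    # Users who know their data is clean could pass validate=False
    # and skip the full scan over the column.
    if validate:
        _check_nan(df, column_sort)
    return df
```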

@MaxBenChrist
Collaborator

> Thank you @nils-braun and @MaxBenChrist for reviewing and accepting this PR!
>
> Some thoughts on follow-ups:
>
>   1. As @MaxBenChrist identified, the unit tests could use some refactoring, especially the DataTestCase class with its "fixtures".
>   2. The default number of partitions might not be optimal any more. In general, this depends not only on the distributor but also on the underlying data, especially if you also want to support naturally partitioned data like dask or spark dataframes in the future.
>   3. Maybe performance and memory benchmarking could be part of the automated tests.
>   4. Optimizations of the long and dict formats could be evaluated.

Well-thought-out list; feel free to create tickets for those points (you already seem to have a good grasp of the topic).

Thanks for your contribution, looking forward to receiving some more PRs in the future! 👍

@MaxBenChrist
Collaborator

> Hello, point 3 is discussed at #710.
>
> This is an unpopular opinion, but I would drop _check_nan; the performance penalty is noticeable with huge datasets (I did not perform any benchmark, this is based on my experience with similar tools). Or at the very least, please provide an argument to disable these checks if the user knows what she is doing.

Well, if we remove _check_nan globally, every feature calculator that cannot operate on NaNs in the value column will have to perform that check itself, probably duplicating checks, right?

I am not sure I understand why removing _check_nan would speed up the calculation?

@nils-braun nils-braun mentioned this pull request Jul 24, 2020

Successfully merging this pull request may close these issues.

High memory requirements during feature extraction
