
Conversation

@hoesler
Contributor

@hoesler hoesler commented May 29, 2020

This PR replaces tsfresh's intermediate dataframe with a flexible TsData class, which can be used to iterate directly over the different input data formats and yield the time series tuples used to extract features. Because it skips some heavy data transformations, this approach consumes far less memory.

Closes #702

I also spent some time trying to improve the performance of partitioning the data. From what I can tell from non-exhaustive tests, I had some success with the specific data mentioned in the issue:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6024604 entries, 0 to 1299438
Columns: 77 entries, index to machine_id
dtypes: float32(75), int64(1), object(1)
memory usage: 1.8+ GB

RelevantFeatureAugmenter(column_id='machine_id', column_sort='index', default_fc_parameters = MinimalFCParameters(), n_jobs=..., chunksize=...)

# default partitioning

0 jobs, default chunksize 1
766125/766125 [03:06<00:00, 4115.68it/s]

0 jobs, chunksize 766125
1/1 [02:28<00:00, 148.64s/it]

4 jobs, default chunksize ~38000
20/20 [01:45<00:00,  5.28s/it]

4 jobs, chunksize 191532
4/4 [02:16<00:00, 34.25s/it]

4 jobs, chunksize 10000
77/77 [01:50<00:00,  1.44s/it]


# custom partitioning

4 jobs, default chunksize ~38000
20/20 [02:21<00:00,  7.05s/it]

4 jobs, chunksize 191532
4/4 [01:21<00:00, 20.28s/it]

It most probably depends heavily on the data, but I leave that open for discussion and more tests. Also, iteration performance might matter less if the feature extraction computation is the bottleneck. Anyway, I think a more performant strategy might be to iterate the data once and map it to async Future objects, but I guess that's not compatible with the current map-reduce framework.
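For illustration, a rough sketch of the "iterate once and map to futures" idea using plain concurrent.futures; the chunk generator and per-chunk extraction function are hypothetical stand-ins, not tsfresh's distributor API:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract_on_chunk(chunk):
    # Hypothetical placeholder for the per-chunk feature extraction.
    return [(ts_id, kind, series.mean()) for ts_id, kind, series in chunk]

def extract_with_futures(chunk_iterator, n_workers=4):
    # Consume the (id, kind, series) chunks exactly once and submit each one
    # as soon as it is produced; results are collected as futures complete.
    results = []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(extract_on_chunk, chunk) for chunk in chunk_iterator]
        for future in as_completed(futures):
            results.extend(future.result())
    return results
```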

@coveralls

coveralls commented May 29, 2020


Coverage increased (+0.03%) to 97.132% when pulling faf2b46 on hoesler:distribute-melt into 29cdae3 on blue-yonder:master.

@hoesler hoesler changed the title Distribute melt Refactor input data iteration May 29, 2020
@MaxBenChrist MaxBenChrist self-requested a review May 30, 2020 15:05
@nils-braun
Collaborator

@hoesler So first of all thank you very very much for this very good PR!
I just had a very quick look over the code, and it looks quite well written. As @MaxBenChrist added himself as a reviewer, I will let him comment first, but it already looks quite promising.

Just to understand: the speed improvement comes from two facts, namely that you no longer need any pre-processing and that you need less memory, right? So the first part will especially be seen in "wide-column" (as you named it) data, but probably less in already melted data, whereas the second part will always be seen, correct?

@hoesler
Contributor Author

hoesler commented Jun 3, 2020

Hi @nils-braun.

Thank you very much. My primary goal was to improve memory consumption. This is achieved by iterating the input data directly and skipping costly transformations. I tried to make this iteration as performant as possible, but have not yet compared it to the speed of the original implementation. Under the new data model, however, it seems that accessing the "wide" dataframe in slices can speed things up, even though I am iterating over the groups more than once using itertools.islice. I haven't spent time improving the narrow and dict formats yet; this still needs to be done.
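For illustration, a minimal sketch of what slicing such an iteration with itertools.islice can look like; the generator below is a simplified stand-in for the wide-format adapter, not the PR's actual TsData code:

```python
from itertools import islice

import pandas as pd

def iter_wide(df, column_id, value_columns):
    # Yield one (id, kind, series) tuple per time series in the wide dataframe.
    for ts_id, group in df.groupby(column_id):
        for kind in value_columns:
            yield ts_id, kind, group[kind]

df = pd.DataFrame({
    "machine_id": [1, 1, 2, 2],
    "v1": [1.0, 2.0, 3.0, 4.0],
    "v2": [5.0, 6.0, 7.0, 8.0],
})

# Take only the second and third time series of the full iteration,
# without keeping the rest of the iteration in memory.
chunk = list(islice(iter_wide(df, "machine_id", ["v1", "v2"]), 1, 3))
```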

@TKlerx
Contributor

TKlerx commented Jun 5, 2020

Nice work @hoesler.
Any comment from @MaxBenChrist?
I guess the coverage needs to be increased?!

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Just the start of my review; I will try to finish it this weekend.

("b", 'v2', pd.Series([11], index=[2], name="v2"))]
self.assert_data_chunk_object_equal(result, expected)

def test_with_wrong_input(self):
Collaborator


It seems like this test is checking a lot of side conditions; normally I check those with pytest parametrize.
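For illustration, a minimal sketch of how such side conditions could be expressed with pytest.mark.parametrize; the validation helper and the concrete bad inputs are made up, not the actual tsfresh tests:

```python
import pandas as pd
import pytest

def validate_input(df, column_id):
    # Hypothetical stand-in for the input check under test.
    if not isinstance(df, pd.DataFrame) or column_id not in df.columns:
        raise ValueError("invalid input")
    return df

@pytest.mark.parametrize("bad_input", [
    None,                                # not a dataframe at all
    pd.DataFrame(),                      # empty, missing the id column
    pd.DataFrame({"value": [1, 2, 3]}),  # has data, but still no id column
])
def test_with_wrong_input(bad_input):
    with pytest.raises(ValueError):
        validate_input(bad_input, column_id="id")
```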

Contributor Author


Same here:

def test_with_wrong_input(self):

But I can try to improve these tests.

@MaxBenChrist
Collaborator

MaxBenChrist commented Jun 5, 2020

> Nice work @hoesler.
> Any comment from @MaxBenChrist?
> I guess the coverage needs to be increased?!

Sorry, I was busy lately. Very interesting work @hoesler, I will try to understand everything this weekend.

We should aim for high coverage, yes. But if we know why the coverage drops and there are good reasons not to test something, then I am also fine with merging PRs that reduce coverage.

@hoesler
Contributor Author

hoesler commented Jun 5, 2020

No problem. I am happy to improve coverage; I just wanted to hold off on the cumbersome work until I get some feedback on the overall design.

Collaborator

@MaxBenChrist MaxBenChrist left a comment


I am getting the overall design, and I like the clean split of input formats in feature_extraction/data.py.

However, I still need to review and think about the interplay with the distributors in distribution/data.py.

Will your solution work with all distributors?

I will need a few more rounds of reviews, also to get back into the low-level tsfresh codebase. So far I don't see any showstopper from a design point of view, so I can see this being merged in the near future. However, we will probably need a few rounds of reviews to polish everything, modernize the tests, etc. Good work again 👍

def __iter__(self):
    for kind, grouped_df in self.grouped_dict.items():
        for ts_id, group in grouped_df:
            yield ts_id, str(kind), group[self.column_value]
Collaborator

@MaxBenChrist MaxBenChrist Jun 7, 2020


Shouldn't we raise here if kind is not a string? I am not sure recasting is the right way to go in such a low-level method.

Contributor Author


This is something you have to tell me. I was recreating this:

timeseries_container[column_kind] = timeseries_container[column_kind].astype(str)

Maybe the conversion should happen in from_columns?
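For illustration, the two options under discussion side by side; this helper is hypothetical and only contrasts the old astype(str) recasting with an explicit error:

```python
def normalize_kind(kind, strict=False):
    """Return the kind as a string, or raise if strict checking is requested."""
    if isinstance(kind, str):
        return kind
    if strict:
        # Option raised in the review: fail loudly in the low-level method.
        raise ValueError(f"column kind must be a string, got {type(kind).__name__}")
    # Option matching the previous behaviour: silently recast, like astype(str).
    return str(kind)
```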

Collaborator

@nils-braun nils-braun left a comment


Thank you so much for this PR @hoesler
It is really nice. I added a couple of comments, mostly to understand more about the motivation behind your implementation.
In general, I also like the approach with the classes, but we need to make sure we understand what is going on here, so please do not feel overwhelmed by all the remarks ;-)

def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
Collaborator


Can you add a docstring here describing what offset actually is?
Just to understand if I am correct: I need to think of a large matrix of (id groups) x (value columns), and the offset starts counting from the top left to the bottom right, correct?

Contributor Author


It depends on the implementation, but for the wide df, yes. TsData, so far, does not guarantee any order of iteration.
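For illustration, one way to read that (id groups) x (value columns) picture for the wide format; this assumes the iteration goes group by group with the value columns varying fastest, which is only how the wide adapter happens to behave, not a guarantee of TsData:

```python
def offset_to_position(offset, n_value_columns):
    # Map a flat offset back to (group index, value column index),
    # counting from the top left to the bottom right of the matrix.
    group_index, column_index = divmod(offset, n_value_columns)
    return group_index, column_index

# With 3 value columns, offset 7 maps to group index 2 and value-column
# index 1, i.e. the 3rd id group's 2nd value column.
assert offset_to_position(7, 3) == (2, 1)
```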

Collaborator


Can you add this as an explanation to the docstring?

Contributor Author


I already documented this in the base class SliceableTsData. Are you missing something?

@nils-braun
Collaborator

Thanks for answering all my questions @hoesler !

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Just some random comments; I still have this on my todo list.

def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
Collaborator


Can you add this as an explanation to the docstring?

Collaborator

@MaxBenChrist MaxBenChrist left a comment


Sorry for taking so long to review it. Thank you for this PR @hoesler.

So, if I see it correctly, the next steps would be to optimize the long and dict adapters, right?

for key, df in ts_dict.items():
    _check_nan(df, column_sort)

self.grouped_dict = {key: df.sort_values([column_sort]).groupby(column_id)
Collaborator


Just out of curiosity, how expensive is that groupby? Is it saving the groups in memory?

Contributor Author


Thanks for reviewing.

Regarding your first question: yes, one could investigate whether the long and dict formats could be iterated more efficiently, but I don't think this is critical. To your second question, I must say that I am relatively new to pandas and the way it behaves is a bit fuzzy to me (and, judging by SO, to most other people). For the case of groupby, I found this: https://stackoverflow.com/questions/52711823/does-groupby-in-pandas-create-a-copy-of-the-data-or-just-a-view
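For illustration, a small, informal way to see where the cost of groupby shows up; the data and column names are made up, and the comments reflect the behaviour described in the linked SO discussion rather than anything specific to this PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "machine_id": np.repeat(np.arange(1000), 100),
    "value": np.random.rand(100_000),
})

# The groupby call itself is cheap: it returns a lazy DataFrameGroupBy object
# and does not build per-group dataframes up front.
grouped = df.sort_values("machine_id").groupby("machine_id")
print(grouped.ngroups)  # 1000, derived from the group keys

# Per-group frames are only materialized when groups are accessed or iterated.
first_key, first_frame = next(iter(grouped))
print(first_key, len(first_frame))
```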

Collaborator


Yeah, the internal workings of pandas, like the BlockManager (https://uwekorn.com/2020/05/24/the-one-pandas-internal.html), are also fuzzy to me.

@nils-braun
Collaborator

I would vote for merging this first and tackling this in follow-ups!

@nils-braun nils-braun merged commit 6f18e18 into blue-yonder:master Jun 22, 2020
@hoesler
Contributor Author

hoesler commented Jun 23, 2020

Thank you @nils-braun and @MaxBenChrist for reviewing and accepting this PR!

Some thoughts on follow-ups:

  1. As @MaxBenChrist identified, the unit tests could use some refactoring, especially the DataTestCase class with its "fixtures".
  2. The default number of partitions might not be optimal any more. In general, this depends not only on the distributor but also on the underlying data, especially if you also want to support naturally partitioned data like dask or spark dataframes in the future.
  3. Maybe performance and memory benchmarking could be part of the automated tests.
  4. Optimizations of the long and dict formats could be evaluated.

@dbarbier
Contributor

Hello, point 3 is discussed at #710.

This is an unpopular opinion, but I would drop _check_nan; the performance penalty is noticeable with huge datasets (I did not perform any benchmark, this is based on my experience with similar tools). Or at the very least, please provide an argument to disable these checks if the user knows what she is doing.
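For illustration, a minimal sketch of what such an opt-out could look like; the wrapper function and the validate flag are made up here, only the idea of guarding the existing NaN check comes from the comment above:

```python
def _check_nan(df, *columns):
    # Simplified version of the kind of check being discussed.
    for col in columns:
        if df[col].isnull().any():
            raise ValueError(f"Column {col} must not contain NaN values")

def prepare_input(df, column_sort, validate=True):
    # Users who know their data is clean could pass validate=False
    # and skip the full scan over the column.
    if validate:
        _check_nan(df, column_sort)
    return df
```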

@MaxBenChrist
Collaborator

> Thank you @nils-braun and @MaxBenChrist for reviewing and accepting this PR!
>
> Some thoughts on follow-ups:
>
>   1. As @MaxBenChrist identified, the unit tests could use some refactoring, especially the DataTestCase class with its "fixtures".
>   2. The default number of partitions might not be optimal any more. In general, this depends not only on the distributor but also on the underlying data, especially if you also want to support naturally partitioned data like dask or spark dataframes in the future.
>   3. Maybe performance and memory benchmarking could be part of the automated tests.
>   4. Optimizations of the long and dict formats could be evaluated.

Well-thought-out list; feel free to create tickets for those points (you already seem to have a good grasp of the topic).

Thanks for your contribution, looking forward to receiving some more PRs in the future! 👍

@MaxBenChrist
Collaborator

> Hello, point 3 is discussed at #710.
>
> This is an unpopular opinion, but I would drop _check_nan; the performance penalty is noticeable with huge datasets (I did not perform any benchmark, this is based on my experience with similar tools). Or at the very least, please provide an argument to disable these checks if the user knows what she is doing.

Well, if we remove _check_nan globally, every feature calculator that cannot operate on NaNs in the value column will have to perform that check itself, probably duplicating checks, right?

I am not sure I understand why removing _check_nan would speed up the calculation?

@nils-braun nils-braun mentioned this pull request Jul 24, 2020

Successfully merging this pull request may close these issues.

High memory requirements during feature extraction
