Refactor input data iteration #707
Conversation
@hoesler So first of all, thank you very much for this very good PR! Just to understand: the speed improvement comes from two facts, namely that you no longer need to do any pre-processing and that you need less memory, right? The first part will especially show up for "wide-column" (as you named it) data, but probably less for already melted data, whereas the second part will always be seen, correct?

Hi @nils-braun. Thank you very much. My primary goal was to improve memory consumption. This is achieved by directly iterating over the input data and skipping costly transformations. I tried to implement this iteration to be as performant as possible, but I have not yet compared its speed to the original implementation. Under the new data model, however, it seems that accessing the "wide" dataframe in slices can speed things up, even though I am iterating over the groups more than once.

Nice work @hoesler.
MaxBenChrist
left a comment
Just a start of my review, I will try to finish the review this weekend.
```python
        ("b", 'v2', pd.Series([11], index=[2], name="v2"))]
    self.assert_data_chunk_object_equal(result, expected)

def test_with_wrong_input(self):
```
It seems like this test is checking a lot of side conditions; normally I check those with pytest parametrize.
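For reference, a minimal sketch of how such side-condition checks could be split into separate cases with `pytest.mark.parametrize`. The `validate_input` helper here is a hypothetical stand-in for the real input check, not tsfresh's actual code:

```python
import pandas as pd
import pytest


def validate_input(data):
    # hypothetical stand-in for the real input validation
    if not isinstance(data, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")
    return data


# each bad input becomes its own test case instead of one big test body
@pytest.mark.parametrize("bad_input", [None, 42, "not a frame", [1, 2, 3]])
def test_with_wrong_input(bad_input):
    with pytest.raises(TypeError):
        validate_input(bad_input)
```

Each parametrized case then shows up individually in the test report, which makes it obvious which side condition broke.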
Same here:
```python
def test_with_wrong_input(self):
```
But I can try to improve these tests.
Sorry, I was busy lately. Very interesting work @hoesler, I will try to understand everything this weekend. We should aim for high coverage, yes. But if we know why the coverage drops and there are good reasons not to test it, then I am also fine with merging PRs that reduce coverage.
No problem. I am happy to improve coverage, I just wanted to wait with the cumbersome work until I get some feedback on the overall design.
I am getting the overall design, I like the clean splitting of input formats in feature_extraction/data.py.
However, I still need to review and think about the interplay with the distributors in distribution/data.py.
Will your solution work with all distributors?
I will need a few more rounds of reviews, also to get into the low-level tsfresh codebase again. So far I don't see any show stopper from a design point of view, so I can see this being merged in the near future. However, we will probably need a few rounds of reviews to polish everything, modernize the tests, etc. Good work again 👍
tsfresh/feature_extraction/data.py
Outdated
```python
def __iter__(self):
    for kind, grouped_df in self.grouped_dict.items():
        for ts_id, group in grouped_df:
            yield ts_id, str(kind), group[self.column_value]
```
shouldn't we raise here if kind is not a string? I am not sure if recasting is the right way to go in such a low level method.
This is something you have to tell me. I was recreating this:
```python
timeseries_container[column_kind] = timeseries_container[column_kind].astype(str)
```
Maybe the conversion should happen in from_columns?
nils-braun
left a comment
Thank you so much for this PR @hoesler
It is really nice. I added a couple of comments, mostly to understand more on the motivations behind your implementation.
In general, I also like the approach with the classes - but we need to make sure we understand what is going on here so please do not feel overwhelmed by all the remarks ;-)
```python
def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
```
Can you add a docstring here describing what offset actually is?
Just to understand if I am correct: I need to think of a large matrix of (id groups) x (value columns) and the offset starts counting from the top left to the bottom right, correct?
It depends on the implementation, but for the wide df, yes. TsData, so far, does not guarantee any order of iteration.
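Under the assumption of a fixed group-major order for the wide dataframe (groups outermost, value columns innermost), the offset arithmetic can be sketched like this; `offset_to_chunk` is illustrative only, not the PR's actual code:

```python
def offset_to_chunk(offset, n_groups, value_columns):
    # assumes iteration order: all value columns of group 0,
    # then all value columns of group 1, and so on
    n_cols = len(value_columns)
    group_index, col_index = divmod(offset, n_cols)
    if group_index >= n_groups:
        raise IndexError("offset out of range")
    return group_index, value_columns[col_index]
```

With 3 id groups and value columns `["v1", "v2"]`, offset 5 would land on the last group's `"v2"` series, matching the top-left-to-bottom-right picture described above.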
Can you add this as an explanation to the docstring?
I already documented this in the base class SliceableTsData. Are you missing something?
Thanks for answering all my questions @hoesler!
MaxBenChrist
left a comment
Just some random comments; I still have this on my todo list.
```python
def __len__(self):
    return self.df_grouped.ngroups * len(self.value_columns)

def slice(self, offset, length=None):
```
Can you add this as an explanation to the docstring?
MaxBenChrist
left a comment
Sorry for taking so long with the review. Thank you for this PR @hoesler!
So, if I see it correctly, the next steps would be to optimize the long and dict adapters, right?
```python
for key, df in ts_dict.items():
    _check_nan(df, column_sort)

self.grouped_dict = {key: df.sort_values([column_sort]).groupby(column_id)
```
Just out of curiosity, how expensive is that groupby? Is it saving the groups in memory?
Thanks for reviewing.
Regarding your first question: yes, one could investigate whether the long and dict formats could be iterated more efficiently, but I don't think this is critical. To your second question, I must say that I am relatively new to pandas, and the way it behaves is a bit fuzzy to me (and, judging by Stack Overflow, to most other people as well). For the case of groupby, I found this: https://stackoverflow.com/questions/52711823/does-groupby-in-pandas-create-a-copy-of-the-data-or-just-a-view
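As far as I understand it, the `GroupBy` object itself is cheap: `groupby` only computes the grouping, and the per-group frames are materialized lazily when you iterate or call `get_group`. A quick sketch of that behaviour:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "value": [10.0, 11.0, 20.0, 21.0],
})

# this does not yet split the data into per-group frames
grouped = df.groupby("id")
print(grouped.ngroups)  # 2

# the group frames are built when iterating over the GroupBy object
sizes = {key: len(group) for key, group in grouped}
print(sizes)  # {1: 2, 2: 2}
```

So repeated iteration, as the PR does for the wide dataframe, re-splits the groups each time rather than holding all group frames in memory at once.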
Yeah, the internal workings of pandas, like the BlockManager (https://uwekorn.com/2020/05/24/the-one-pandas-internal.html), are also fuzzy to me.

I would vote for merging this first and tackling this in follow-ups!
Thank you @nils-braun and @MaxBenChrist for reviewing and accepting this PR! Some thoughts on follow-ups:
Hello, point 3 is discussed at #710. This is an unpopular opinion, but I would drop
Well thought-out list! Feel free to create tickets for those points (you seem to already have a good grasp of the topic). Thanks for your contribution, looking forward to receiving some more PRs in the future! 👍
Well, if we remove the
I am not sure I understand why removing
This PR replaces tsfresh's intermediate dataframe with a flexible TsData class which can be used to directly iterate over the different input data formats and yield the time series tuples used to extract features. Because it skips some heavy data transformations, this approach consumes far less memory.
Closes #702
I also spent some time trying to improve the performance of partitioning the data. From what I can tell from non-exhaustive tests, I had some success given my specific data mentioned in the issue:
It most probably depends heavily on the data, but I leave that open for discussion and more tests. Also, iteration performance might be less important if the feature extraction computation is the bottleneck. Anyway, I think a more performant strategy might be to iterate the data once and map it to async Future objects, but I guess that's not compatible with the current map-reduce framework.
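For illustration, the core idea of iterating a wide dataframe directly, without melting it into long format first, can be sketched roughly like this; `iter_wide` is a simplified stand-in for the PR's TsData iteration, not its actual code:

```python
import pandas as pd


def iter_wide(df, column_id, value_columns):
    # yield one (id, kind, values) tuple per time series,
    # skipping the intermediate long-format dataframe entirely
    for ts_id, group in df.groupby(column_id):
        for kind in value_columns:
            yield ts_id, kind, group[kind]


df = pd.DataFrame({
    "id": ["a", "a", "b"],
    "v1": [1.0, 2.0, 3.0],
    "v2": [4.0, 5.0, 6.0],
})
chunks = list(iter_wide(df, "id", ["v1", "v2"]))
print(len(chunks))  # 4 = 2 ids x 2 value columns
```

Because the generator yields one series at a time, peak memory stays close to the size of a single group instead of a fully melted copy of the input.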