Extend STLDecomposer to Support Multiseries#4253
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #4253 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 355 355
Lines 39155 39458 +303
=======================================
+ Hits 39035 39338 +303
Misses 120 120
☔ View full report in Codecov by Sentry. |
* Add unstacking function * Add stacking function * Add tests for both functions
* Squashed changes * Ignored index * Disabled column checking * Reverted deleted code * Updated pyproject.toml * Replaced version check code
1faf23d to
4a8cc0f
Compare
christopherbunn
left a comment
There was a problem hiding this comment.
Some smaller nits. Only reviewed implemetation so far but will review tests later.
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
eccabay
left a comment
There was a problem hiding this comment.
Making awesome progress! I left a few more comments, mostly just condensing code
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
| series_y = y[id] | ||
|
|
||
| # Determine the period of the seasonal component | ||
| if id not in self.periods or self.period is None: |
There was a problem hiding this comment.
A very small use case, but users should be able to pass in self.periods when initializing the decomposer, rather than necessarily defining it here (our detection is decent, but if a user knows what the periods should be, they should be able to pass that in for better results)
And unless I'm missing something, it doesn't look like we actually use self.period anywhere any more, because even in the single series case, we're still indexing into self.periods. Am I correct? If so, I think we should just swap out one for the other entirely.
There was a problem hiding this comment.
Yeah, I should have taken self.period out. I think I kept it to keep the functionality of set_period the same, but is the function used in the PolynomialDecomposer at all because I see that it is tested on both decomposers in test_decomposer_set_period. Also would the periods parameter be same for the single series case or would it be fine for period to be an integer for single series and then take in a dictionary for multiseries?
There was a problem hiding this comment.
set_period is tested on both decomposers because it's implemented in the parent class, since we have to test its implementation on one subclass, why not test it on both? Feel free to bastardize whatever preexisting functions you need to - I could totally see the case for eliminating set_period entirely and just calling _determine_periodicity directly, or having set_period calculate the periods for all series and save them in self.periods instead of self.period. It's up to you, IMO
There was a problem hiding this comment.
In order to allow users to still pass in a period for both multivariate and univariate cases, I was thinking of keeping both self.period and self.periods as parameters, but that probably isn't the most efficient implementation so I'm open to other suggestions 😅
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
eccabay
left a comment
There was a problem hiding this comment.
A few final comments, but nothing blocking. Awesome work!
| if self.period is None and len(y.columns) == 1: | ||
| self.period = period | ||
| self.update_parameters({"period": self.period}) | ||
| elif self.period is not None and len(y.columns) == 1: | ||
| period = self.period |
There was a problem hiding this comment.
IMO, it's ok if we call self.update_parameters an extra time - we can collapse these into a single if len(y.columns)==1
There was a problem hiding this comment.
Since everything is accessing the period through self.periods I'm thinking we might not even need to update self.period at all, instead just use it to set period if its given by a user
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
| y = y.to_frame() | ||
| series_results = {} | ||
| # Iterate through each series id | ||
| for id in y.columns: |
There was a problem hiding this comment.
This is so late in the process to realize this, but do we actually need to be calling _decompose_target in a loop? Since the decomposer handles multiseries generally, it should be able to handle multiseries here too, we can pass self.periods through instead of period=period, and switch around the logic to return the data in the format we need? That will prevent us from calling Decomposer.fit() too many times, which can be slow
There was a problem hiding this comment.
I modified this so that get_trend_dataframe returns a dictionary for multiseries, but for single series it is still returning a list of dataframes. In the PolynomialDecomposer, get_trend_dataframe returns a list of dataframes. I think this is because there is some multivariate implementation written here. Since they're both using plot_decomposition for single series, I wanted to keep the indexing the same so I left it as a list for STLDecomposer for now (even though it can be modified later to just return a dataframe). This is pretty inconsistent so I wanted to see what others think about it and if I should consider changing the return type of get_trend_dataframe for the multiseries and/or single series STLDecomposer.
There was a problem hiding this comment.
I updated get_trend_dataframe() in the STLDecomposer to return list(pd.DataFrame) for single series and dict(list(pd.DataFrame)) for multiseries, but it should be updated in the future to no longer be in a list #4294
christopherbunn
left a comment
There was a problem hiding this comment.
LGTM once the test case Becca mentioned is covered!
Resolves #4244