Added support for additional estimators for multiseries datasets #4385
christopherbunn merged 10 commits into main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main   #4385   +/-   ##
=======================================
+ Coverage    99.7%   99.7%   +0.1%
=======================================
  Files         357     357
  Lines       39928   39965     +37
=======================================
+ Hits        39802   39840     +38
+ Misses        126     125      -1
jeremyliweishih
left a comment
LGTM pending performance tests
      assert algo.default_max_batches == 1
      estimators = get_estimators(problem_type)
-     decomposer = [STLDecomposer] if is_regression(problem_type) else []
+     decomposer = [True, False] if is_regression(problem_type) else [True]
Can you add a comment clarifying why you're using `True`/`False` instead of the decomposer name?
I think it would be better to just parametrize the include_decomposer argument here - this and the below section are confusing to read out of context
For this test, we are basically only checking that the number of pipelines matches up. Before, we only needed to add the decomposer once, since there was a single estimator type (VARMAX).
Now that we have multiple estimator types, each estimator type will have one pipeline with a decomposer and another without one. As such, we need this [True, False] list and to iterate through it in order to generate the correct number of pipelines.
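The counting logic described above can be sketched as follows. The estimator names and the `include_decomposer` pairing here are illustrative stand-ins, not evalml's actual test code or API:

```python
from itertools import product

# Hypothetical stand-ins for the estimators returned for a multiseries
# regression problem; the real test pulls these from get_estimators().
estimators = ["VARMAXRegressor", "AnotherMultiseriesRegressor"]

# One pipeline per (estimator, include_decomposer) combination: each
# estimator gets one pipeline with the decomposer and one without it.
include_decomposer = [True, False]
pipelines = list(product(estimators, include_decomposer))

# With two estimators and two decomposer options we expect four
# pipelines, versus two when VARMAX was the only estimator type.
assert len(pipelines) == len(estimators) * len(include_decomposer)
```

This is why the test iterates over `[True, False]` rather than over a list of decomposer classes.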
I think a clarifying comment would be useful here 👍
Added a short one to this test
self.statistically_significant_lags = {}
for column in y.columns:
    self.statistically_significant_lags[
        column
    ] = self._find_significant_lags(
        y[column],
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
We can make this section more concise/easier to maintain by folding the single series case into the multiseries case by converting the single series to a dataframe and keeping this code for both cases - following the pattern in other files that already support multiseries (stl decomposer might be a good example?)
Good point, I just consolidated it.
Actually, I just remembered that it's structured this way so that we're still able to run self._find_significant_lags even when y is None. Is there a way you had in mind to structure it so that y can still be None?
Hm, seems like y being None is something we'd want to have explicit behavior for, since right now the behavior is unclear. I think we should just handle it entirely separately
We handle y being None in self._find_significant_lags, since we calculate all lags in that function (and just set the significant lags to all_lags if y is None). Should I pull it out into its own separate branch, even though the code would be identical to the case where y is a series? e.g.
# For the multiseries case, each series ID has individualized lag values
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = {}
    for column in y.columns:
        self.statistically_significant_lags[
            column
        ] = self._find_significant_lags(
            y[column],
            conf_level=self.conf_level,
            start_delay=self.start_delay,
            max_delay=self.max_delay,
        )
elif y is None:
    self.statistically_significant_lags = self._find_significant_lags(
        y,
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
else:
    self.statistically_significant_lags = self._find_significant_lags(
        y,
        conf_level=self.conf_level,
        start_delay=self.start_delay,
        max_delay=self.max_delay,
    )
Ok, sorry for drilling into this so much, but I think I understand now. My new potentially hot take proposal is something like:
if y is None:
    self.statistically_significant_lags = np.arange(
        self.start_delay, self.start_delay + self.max_delay + 1
    )
else:
    if isinstance(y, pd.Series):
        y = y.to_frame()
    for column in y.columns:
        self.statistically_significant_lags = ...
And then we can remove the handling of y being None from the static function. My argument for doing this is that calling all lags the statistically significant lags is a misnomer, since we didn't actually check statistical significance. This is me getting very into the weeds though, so I very much understand if you would rather keep things closer to the way they are 😅
Regardless, even with your new proposal, we'd still be able to combine the two non y=None cases by casting the series to a dataframe
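A runnable sketch of that combined handling, assuming the proposal above: the inner `_find_significant_lags` is reduced to a placeholder (it just returns every lag) because the real statistical test isn't shown in this thread, and the function name and defaults are hypothetical:

```python
import numpy as np
import pandas as pd


def compute_significant_lags(y, start_delay=1, max_delay=3):
    """Sketch: all lags when y is None, else per-column significant lags."""

    def _find_significant_lags(series, start_delay, max_delay):
        # Placeholder for the real significance test in the component.
        return np.arange(start_delay, start_delay + max_delay + 1)

    if y is None:
        # No target available: fall back to every lag in the window,
        # handled explicitly instead of inside the static helper.
        return np.arange(start_delay, start_delay + max_delay + 1)

    # Fold the single-series case into the multiseries case by casting
    # the Series to a one-column DataFrame.
    if isinstance(y, pd.Series):
        y = y.to_frame()
    return {
        column: _find_significant_lags(y[column], start_delay, max_delay)
        for column in y.columns
    }
```

With this shape, the `y is None` branch no longer pretends its output passed a significance test, and the Series and DataFrame paths share one loop.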
Your example makes sense to me, I don't see our behavior for y is None changing anytime soon so I'm comfortable with pulling that out and changing the function. Will update!
evalml/pipelines/components/transformers/preprocessing/drop_nan_rows_transformer.py (outdated, resolved)
      unstacked_predictions.index = X_unstacked[self.time_index]
      stacked_predictions = stack_data(
          unstacked_predictions,
-         include_series_id=include_series_id,
+         include_series_id=True,
          series_id_name=self.series_id,
      )

      stacked_predictions = stacked_predictions.reset_index()
What's the reasoning behind setting the index and then immediately resetting the index? The value of the index shouldn't impact the order of stacking, right?
Either way, we can explicitly control the index in stack_data with the starting_index argument
The goal of this snippet is to set the index to the time index column, stack the data (thus using the dates in the time index column to generate the new stacked dates), and then reset the index so that the resulting time index column can be used in the later pd.merge on line 193.
While it's possible to just copy over the time_index column from X after stacking, I think it's safer to generate it from the X_unstacked index like this, since we know for sure that the X_unstacked time_index aligns with unstacked_predictions, whereas it's technically possible for an X time_index to be out of order (and thus incorrect if we simply copied the column over). I'm open to suggestions for a cleaner implementation!
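The index round-trip being discussed can be illustrated with a small pandas sketch. The column names are made up, and `stack` plus `rename_axis` stand in for evalml's `stack_data` helper, whose internals aren't shown in this thread:

```python
import pandas as pd

# Unstacked predictions: one column per series, positional index.
unstacked = pd.DataFrame({"series_a": [1.0, 2.0], "series_b": [3.0, 4.0]})
X_unstacked = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-01", "2024-01-02"])}
)

# 1. Carry the time index over from X_unstacked so stacking reuses the
#    known-aligned dates rather than trusting column order elsewhere.
unstacked.index = X_unstacked["date"]

# 2. Stand-in for stack_data: fold columns into (date, series_id) rows.
# 3. reset_index(drop=False) keeps "date" as a real column afterwards,
#    even if pandas were to change the default.
stacked = (
    unstacked.stack()
    .rename("prediction")
    .rename_axis(["date", "series_id"])
    .reset_index(drop=False)
)

# The recovered "date" column can now drive a later pd.merge, even if
# X's rows arrive out of order.
X = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-02", "2024-01-01"]), "feature": [10, 20]}
)
merged = stacked.merge(X, on="date")
```

The point of setting the index before stacking and resetting it after is exactly this: the dates travel with the values through the reshape, then come back out as an ordinary merge key.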
Ok, I think I understand now! I think a comment would be great. I also wonder if it would be useful to explicitly say reset_index(drop=False), so that even if pandas changes their defaults we don't get screwed.
My motivation here is that this is something that might be confusing to someone looking back at it in the future, since the goal isn't clear from the code itself. I hope that makes sense!
Good call on the reset index parameter, I'll add that in. I'll add a clarifying comment or two so that it's clear what's going on here.
Your motivation makes sense! I feel like I've been so lost in the weeds of this implementation for a while now so it's good to have multiple pairs of eyes on this to highlight what's intuitive and what isn't 😅
eccabay
left a comment
Looks solid! I think with a final few code comments this is all set 😎
Resolves #4386