Skip to content

Add baseline multiseries regressor#4246

Merged
eccabay merged 17 commits intomainfrom
4241_baseline_multiseries
Aug 1, 2023
Merged

Add baseline multiseries regressor#4246
eccabay merged 17 commits intomainfrom
4241_baseline_multiseries

Conversation

@eccabay
Copy link
Copy Markdown
Contributor

@eccabay eccabay commented Jul 20, 2023

Closes #4241

Implementation assumes one column per target series, which is not the final state of the input. Changes will probably have to be made once stacking/unstacking functions are implemented and this is integrated into search.

Implementation also assumes integer indices.

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 24, 2023

Codecov Report

Merging #4246 (38eb434) into main (5b80a8e) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4246     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        349     351      +2     
  Lines      38413   38497     +84     
=======================================
+ Hits       38293   38376     +83     
- Misses       120     121      +1     
Files Changed Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.2% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.6% <ø> (-<0.1%) ⬇️
evalml/utils/gen_utils.py 99.3% <ø> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/estimators/regressors/__init__.py 100.0% <100.0%> (ø)
...sors/multiseries_time_series_baseline_regressor.py 100.0% <100.0%> (ø)
...ansformers/preprocessing/time_series_featurizer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (-0.2%) ⬇️
... and 4 more

@eccabay eccabay marked this pull request as ready for review July 24, 2023 20:56
@eccabay eccabay marked this pull request as draft July 25, 2023 19:31
@eccabay eccabay marked this pull request as ready for review July 26, 2023 19:40
Copy link
Copy Markdown
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work 😸

for t in self.statistically_significant_lags:
lagged_features[self.target_colname_prefix.format(t)] = y.shift(t)
if isinstance(y, pd.DataFrame):
lagged_features.update(self._delay_df(y, y.columns))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: should we just run self._encode_y_while_preserving_index(y) even though we won't expect categorical columns just yet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 interesting point, will we ever expect categorical columns? We're only supporting regression problems for multiseries

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially sometime in the distant future! Just thought it would make it one step easier for whoever implements that 😄 it'll be a no-op anyways right now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea in theory, but I'm worried about increasing runtime with checking if y contains any categorical columns. We'd have to do so in all cases, which feels wasteful when we know we won't be dealing with it.

Copy link
Copy Markdown
Contributor

@christopherbunn christopherbunn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions for my clarification + a suggestion but once answered LGTM



@pytest.fixture
def X_y_multiseries_regression():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to utilize this for my VARMAX testing. In that case, does it make sense to have it align more with the inputs we expect for a mutliseries regressor (e.g. a series_id column and the column names having series_id value suffixes)

Copy link
Copy Markdown
Contributor

@christopherbunn christopherbunn Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also since this is a time series pipeline should we be extending ts_data() instead? If we do decide to keep this we should also make it a time series dataset by adding a datetime column + changing the name to identify it as a time series dataset.

My bad I forgot we have the multiseries_ts_data_unstacked and multiseries_ts_data_unstacked functions. In that case, is there a reason why the test_test_multiseries_baseline_regressor.py test cases use those mocks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout that this should more closely match the actual expected input - I wrote these before I wrote the stacking and unstacking functions and never updated it 😅 - same with why these tests don't use multiseries_ts_data_unstacked. I'll refactor the test fixtures a bit to be consolidated and up to date.

Comment on lines +127 to +128
if isinstance(y, pd.DataFrame):
self.statistically_significant_lags = [self.start_delay]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the multiseries case, do we not try to find the significant lags? Does this just use all or none of the lags?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to find the significant lags because we're not actually doing feature engineering here, just getting the properly lagged column that our baseline regressor relies on. By setting the lags that we calculate to be just self.start_delay, we only compute the one we know we need.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add an explicit comment here for the case we're splitting on...e.g. if y is a dataframe, we expect it to be multiseries.

Comment on lines +82 to +86
Args:
X (pd.DataFrame): Data of shape [n_samples, n_features].

Returns:
pd.Series: Predicted values.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is correct, but just to double check we're returning the predictions with the predicted values stacked right (i.e. in series form and not as a dataframe)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, this is a copypasta fail. We're returning the unstacked dataframe here

@eccabay eccabay requested a review from christopherbunn July 31, 2023 15:42
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits, great work

Comment on lines +102 to +112
@property
def feature_importance(self):
"""Returns importance associated with each feature.

Since baseline estimators do not use input features to calculate predictions, returns an array of zeroes.

Returns:
np.ndarray (float): An array of zeroes.
"""
importance = np.array([0] * self._num_features)
return importance
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably another nit, but if you're calling out all Baseline Estimators...is it worth putting together a story to add a BaselineEstimator class to the inheritance chain and have them all inherit this prop def?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #4255

Comment on lines +127 to +128
if isinstance(y, pd.DataFrame):
self.statistically_significant_lags = [self.start_delay]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add an explicit comment here for the case we're splitting on...e.g. if y is a dataframe, we expect it to be multiseries.

if categorical_columns and col_name in categorical_columns:
col = X_categorical[col_name]
for t in self.statistically_significant_lags:
lagged_features[f"{col_name}_delay_{t}"] = col.shift(t)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not going to be doing any external matching on this name format, right? If so, I think we might want to establish a pattern of making this string format like a module level thing or accessible via the class

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call - adjusting!

@eccabay eccabay enabled auto-merge (squash) July 31, 2023 20:53
@eccabay eccabay merged commit 7468580 into main Aug 1, 2023
@eccabay eccabay deleted the 4241_baseline_multiseries branch August 1, 2023 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Baseline Multiseries Time Series Regressor

4 participants