Conversation
Codecov Report
```
@@ Coverage Diff @@
##            main   #4246   +/-  ##
=======================================
- Coverage   99.7%   99.7%   -0.0%
=======================================
  Files        349     351      +2
  Lines      38413   38497     +84
=======================================
+ Hits       38293   38376     +83
- Misses       120     121      +1
```
```python
for t in self.statistically_significant_lags:
    lagged_features[self.target_colname_prefix.format(t)] = y.shift(t)
if isinstance(y, pd.DataFrame):
    lagged_features.update(self._delay_df(y, y.columns))
```
thought: should we just run self._encode_y_while_preserving_index(y) even though we won't expect categorical columns just yet?
🤔 interesting point, will we ever expect categorical columns? We're only supporting regression problems for multiseries
potentially sometime in the distant future! Just thought it would make it one step easier for whoever implements that 😄 it'll be a no-op anyways right now
I like the idea in theory, but I'm worried about increasing runtime with checking if y contains any categorical columns. We'd have to do so in all cases, which feels wasteful when we know we won't be dealing with it.
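One way to address the runtime concern above is to gate the encoding on a cheap dtype check, so the cost is only paid when a categorical column is actually present. This is a hypothetical sketch, not evalml's implementation; `maybe_encode_y` and the `encode` callback are illustrative names standing in for `self._encode_y_while_preserving_index`:

```python
import pandas as pd


def maybe_encode_y(y, encode):
    """Run `encode` only when y could actually contain categorical columns.

    Hypothetical sketch of the guard discussed in this thread: the dtype
    scan is O(n_columns), so the no-op path stays cheap for the common
    all-numeric regression case.
    """
    if isinstance(y, pd.DataFrame) and any(
        isinstance(dtype, pd.CategoricalDtype) for dtype in y.dtypes
    ):
        # Only pay the encoding cost when a categorical column exists.
        return encode(y)
    return y
```

With this guard, the encoder becomes a true no-op for purely numeric multiseries targets while leaving the door open for categorical support later.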
christopherbunn
left a comment
A few questions for my clarification plus a suggestion, but once answered, LGTM
evalml/tests/conftest.py (Outdated)

```python
@pytest.fixture
def X_y_multiseries_regression():
```
I think I'm going to utilize this for my VARMAX testing. In that case, does it make sense to have it align more with the inputs we expect for a multiseries regressor (e.g. a series_id column and the column names having series_id value suffixes)?
Also since this is a time series pipeline should we be extending ts_data() instead? If we do decide to keep this we should also make it a time series dataset by adding a datetime column + changing the name to identify it as a time series dataset.
My bad, I forgot we have the multiseries_ts_data_stacked and multiseries_ts_data_unstacked fixtures. In that case, is there a reason why the test_multiseries_baseline_regressor.py test cases use those mocks?
Good callout that this should more closely match the actual expected input - I wrote these before I wrote the stacking and unstacking functions and never updated it 😅 - same with why these tests don't use multiseries_ts_data_unstacked. I'll refactor the test fixtures a bit to be consolidated and up to date.
```python
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = [self.start_delay]
```
So for the multiseries case, do we not try to find the significant lags? Does this just use all or none of the lags?
We don't need to find the significant lags because we're not actually doing feature engineering here, just getting the properly lagged column that our baseline regressor relies on. By setting the lags that we calculate to be just self.start_delay, we only compute the one we know we need.
Might want to add an explicit comment here for the case we're splitting on, e.g. if y is a DataFrame, we expect it to be multiseries.
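The suggested comment might look like the sketch below. The class wrapper and the single-series `else` branch are illustrative scaffolding (the real component's lag search is omitted); only the `isinstance` branch and attribute names mirror the quoted diff:

```python
import pandas as pd


class LagSelectorSketch:
    """Minimal sketch of the branch under discussion."""

    def __init__(self, start_delay=2):
        self.start_delay = start_delay
        self.statistically_significant_lags = None

    def set_lags(self, y):
        if isinstance(y, pd.DataFrame):
            # A DataFrame target implies a multiseries problem. No feature
            # engineering happens here, so skip the statistical lag search
            # and compute only the one lag the baseline regressor needs.
            self.statistically_significant_lags = [self.start_delay]
        else:
            # Single-series path: the real component searches for
            # statistically significant lags here (omitted in this sketch).
            self.statistically_significant_lags = []
```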
```python
Args:
    X (pd.DataFrame): Data of shape [n_samples, n_features].

Returns:
    pd.Series: Predicted values.
```
I'm pretty sure this is correct, but just to double-check: we're returning the predictions with the predicted values stacked, right (i.e. in series form and not as a dataframe)?
Lol, this is a copypasta fail. We're returning the unstacked dataframe here
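Given that answer, the docstring fix would presumably change the `Returns:` section. A hypothetical sketch (the function name and `...` body are placeholders, not the component's actual `predict`):

```python
def predict_docstring_sketch(X):
    """Predict using the multiseries baseline.

    Args:
        X (pd.DataFrame): Data of shape [n_samples, n_features].

    Returns:
        pd.DataFrame: Predicted values, unstacked with one target column
        per series (not stacked into a single pd.Series).
    """
    ...
```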
evalml/pipelines/components/estimators/regressors/multiseries_time_series_baseline_regressor.py
chukarsten
left a comment
Just a few nits, great work
evalml/pipelines/components/estimators/regressors/multiseries_time_series_baseline_regressor.py
```python
@property
def feature_importance(self):
    """Returns importance associated with each feature.

    Since baseline estimators do not use input features to calculate predictions, returns an array of zeroes.

    Returns:
        np.ndarray (float): An array of zeroes.
    """
    importance = np.array([0] * self._num_features)
    return importance
```
Probably another nit, but if you're calling out all baseline estimators: is it worth putting together a story to add a BaselineEstimator class to the inheritance chain and have them all inherit this property definition?
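Such a shared base could be sketched as a small mixin. `BaselineEstimatorMixin` and the subclass below are hypothetical names for illustration; only the zeroed `feature_importance` behavior comes from the quoted code:

```python
import numpy as np


class BaselineEstimatorMixin:
    """Hypothetical shared base: baseline estimators ignore input
    features, so they can all inherit the same zeroed importance."""

    @property
    def feature_importance(self):
        """An array of zeroes, one entry per input feature."""
        return np.zeros(self._num_features)


class MultiseriesBaselineSketch(BaselineEstimatorMixin):
    def __init__(self, num_features):
        self._num_features = num_features
```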
```python
if categorical_columns and col_name in categorical_columns:
    col = X_categorical[col_name]
    for t in self.statistically_significant_lags:
        lagged_features[f"{col_name}_delay_{t}"] = col.shift(t)
```
We're not going to be doing any external matching on this name format, right? If we are, I think we might want to establish a pattern of making this string format a module-level constant or accessible via the class.
Good call - adjusting!
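The adjustment being agreed on here might look like this. The constant name `DELAYED_COLNAME_FORMAT` and the standalone `delay_column` helper are illustrative assumptions; only the `{col_name}_delay_{t}` pattern comes from the quoted diff:

```python
import pandas as pd

# Hypothetical module-level constant, so any external code matching
# delayed-column names can import the pattern instead of hardcoding it.
DELAYED_COLNAME_FORMAT = "{col_name}_delay_{t}"


def delay_column(col, col_name, lags):
    """Return a dict mapping formatted delayed-column names to shifted copies."""
    return {
        DELAYED_COLNAME_FORMAT.format(col_name=col_name, t=t): col.shift(t)
        for t in lags
    }
```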
Closes #4241
Implementation assumes one column per target series, which is not the final state of the input. Changes will probably have to be made once stacking/unstacking functions are implemented and this is integrated into search.
Implementation also assumes integer indices.