Extend DateTimeFormatCheck to work for multiseries#4300
Conversation
| ... } | ||
| ... ] | ||
|
|
||
| The column "Weeks" has a date that does not follow the weekly pattern, which is considered misaligned. |
There was a problem hiding this comment.
Removed because it looks like an exact copy of line 156-200
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #4300 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 357 357
Lines 39587 39694 +107
=======================================
+ Hits 39467 39574 +107
Misses 120 120
☔ View full report in Codecov by Sentry. |
| if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing: | ||
| messages.append( | ||
| DataCheckError( | ||
| message="Datetime values must be sorted in ascending order.", |
There was a problem hiding this comment.
Unsure if I need to change the message here to say something like "Datetime values {at series x} must be sorted in ascending order" or if it's alright to just keep it general (note: this line is not currently in the for loop that loops through all the different series)
There was a problem hiding this comment.
I think it's fine to keep it general here!
eccabay
left a comment
There was a problem hiding this comment.
Just a few nitpicks and some testing consolidation, but otherwise looks great! Thanks for doing this
| if not pd.DatetimeIndex(no_nan_datetime_values).is_monotonic_increasing: | ||
| messages.append( | ||
| DataCheckError( | ||
| message="Datetime values must be sorted in ascending order.", |
There was a problem hiding this comment.
I think it's fine to keep it general here!
| dates = ( | ||
| pd.date_range("2021-01-01", periods=15, freq="2D") | ||
| .drop("2021-01-13") | ||
| .append(pd.date_range("2021-01-30", periods=1)) | ||
| .append(pd.date_range("2021-01-31", periods=86, freq="2D")) | ||
| ) | ||
| X = pd.DataFrame({"dates": dates}, dtype="datetime64[ns]") | ||
|
|
||
| ww_payload = infer_frequency( | ||
| X["dates"], | ||
| debug=True, | ||
| window_length=WINDOW_LENGTH, | ||
| threshold=THRESHOLD, | ||
| ) | ||
|
|
||
| assert datetime_format_check.validate(X, y) == [ | ||
| DataCheckError( | ||
| message="Column 'dates' has datetime values missing between start and end date.", | ||
| data_check_name=datetime_format_check_name, | ||
| message_code=DataCheckMessageCode.DATETIME_IS_MISSING_VALUES, | ||
| ).to_dict(), | ||
| DataCheckError( | ||
| message="Column 'dates' has datetime values that do not align with the inferred frequency.", | ||
| data_check_name=datetime_format_check_name, | ||
| message_code=DataCheckMessageCode.DATETIME_HAS_MISALIGNED_VALUES, | ||
| ).to_dict(), | ||
| get_uneven_error("dates", ww_payload), | ||
| ] | ||
|
|
There was a problem hiding this comment.
It looks like an exact copy of line 323 - 349
evalml/tests/data_checks_tests/test_datetime_format_data_check.py
Outdated
Show resolved
Hide resolved
evalml/tests/data_checks_tests/test_datetime_format_data_check.py
Outdated
Show resolved
Hide resolved
jeremyliweishih
left a comment
There was a problem hiding this comment.
LGTM - just some punctuation requested.
Co-authored-by: Jeremy Shih <jeremyliweishih@gmail.com>
Co-authored-by: Jeremy Shih <jeremyliweishih@gmail.com>
Pull Request Description
Extend DateTimeFormatCheck to work for multiseries, works on stacked datasets.
Closes #4301
Small example of what this does
Say we had a multiseries problem with a stacked dataset, with two series (0, 1) and goes from 2021-01-01 to 2021-01-15.

For Series 1, the date "2021-01-07" is missing (line 14) while Series 0 has an extra date "2021-01-29" (line 31).
When we use the datacheck and run validate:

It gives the following result:
