Skip to content

MismatchedSeriesLength Datacheck#4296

Merged
MichaelFu512 merged 20 commits intomainfrom
TML-8376
Sep 5, 2023
Merged

MismatchedSeriesLength Datacheck#4296
MichaelFu512 merged 20 commits intomainfrom
TML-8376

Conversation

@MichaelFu512
Copy link
Copy Markdown
Contributor

@MichaelFu512 MichaelFu512 commented Aug 31, 2023

Pull Request Description

Closes #4297
Adds mismatched series length datacheck

Visual of what this PR does

Say there was a multiseries dataset with 5 unique series_id that each show up 20 times (so there are 100 entries in the dataset).
image

Let's say the first row was dropped, meaning that series_id "0" only has 19 entries while the rest of the series_id have 20 entries.
image

Using the mismatched series length datacheck, the validate function will return:
image

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 31, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: +0.1% 🎉

Comparison is base (93e7b97) 99.7% compared to head (4422f0e) 99.7%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4296     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        355     357      +2     
  Lines      39511   39577     +66     
=======================================
+ Hits       39391   39457     +66     
  Misses       120     120             
Files Changed Coverage Δ
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
...data_checks/mismatched_series_length_data_check.py 100.0% <100.0%> (ø)
..._tests/test_mismatched_series_length_data_check.py 100.0% <100.0%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichaelFu512 MichaelFu512 marked this pull request as ready for review August 31, 2023 16:02
Copy link
Copy Markdown
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks, but nothing blocking!

Copy link
Copy Markdown
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just some nit suggestions. I would also recommend adding adding a test case where there are multiple majority lengths i.e two modes for the majority length) and note the behavior considering multiple modes in the docstring!

for id in X[self.series_id].unique():
series_id_len[id] = len(X[X[self.series_id] == id])

# get the majority length
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# get the majority length
majority_len = statistics.mode(series_id_len.values())

Copy link
Copy Markdown
Contributor Author

@MichaelFu512 MichaelFu512 Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this faster than what I currently have or is it more concise (or both)?

Copy link
Copy Markdown
Collaborator

@jeremyliweishih jeremyliweishih Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll stick to what I have currently since what I have lets me easily exit out early if all the series have the same length, which is something that Becca asked to do in one of her comments.

@MichaelFu512 MichaelFu512 merged commit 1329988 into main Sep 5, 2023
@MichaelFu512 MichaelFu512 deleted the TML-8376 branch September 5, 2023 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add datacheck for mismatch series length

3 participants