fingerprinting base case handles empty __dict__#1243
Merged
Conversation
Contributor
There was a problem hiding this comment.
👍 Looks good to me! Reviewed everything up to c701383 in 5 seconds
More details
- Looked at
50lines of code in2files - Skipped
0files when reviewing. - Skipped posting
2drafted comments based on config settings.
1. tests/caching/test_fingerprinting.py:34
- Draft comment:
The testtest_empty_dict_attr_is_unhashablecorrectly verifies that objects with an empty__dict__are considered unhashable, aligning with the PR's intent to handle such cases. This ensures that the cache key collision issue is resolved. - Reason this comment was not posted:
Confidence changes required:0%
The PR introduces a change in thehash_valuefunction to handle objects with an empty__dict__. This change is reflected in the testtest_empty_dict_attr_is_unhashable. The test ensures that objects with an empty__dict__are considered unhashable, which aligns with the PR's intent to fix the cache key collision issue.
2. hamilton/caching/fingerprinting.py:78
- Draft comment:
Consider adding documentation indocs/to explain the behavior ofhash_valuewhen encountering objects with an empty__dict__. This will help users understand why certain objects might be skipped during hashing. - Reason this comment was not posted:
Confidence changes required:50%
The functionhash_valueis responsible for hashing objects, and it uses the__dict__attribute to do so. However, the function does not handle cases where the__dict__is empty, which can lead to issues with certain types likepd.Timestamp. The PR addresses this by checking if__dict__is empty and skipping hashing in such cases. This change should be documented in the sphinx documentation underdocs/to inform users about this behavior.
Workflow ID: wflow_mJ8rUtPcjyQZ8hGb
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
elijahbenizzy
approved these changes
Nov 26, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1242; copying reply from the issues thread
Problem
Using the cache with these two functions would create a collision and both would return
pd.Timestamp("2021-01-01")on the 2nd execution (i.e., when retrieving values from cache)Solution
The source of the bug is an oddity of
pandas. Some objects have an empty__dict__attached. A one line condition check now properly displays the intended warning messages.When encountering this case, Hamilton gives a random UUID instead of hashing the value. Caching can still work because the cache key is
NODE_NAME-CODE_VERSION-DATA_VERSIONwhere data version is now a random UUIDInvestigation
dr.cache.code_versionsshow thatfirst()andsecond()are produce differentcode_versionhashdr.cache.data_versionsproduce the samedata_versionhash (source of the bug)hamilton.caching.fingerprintinghash_pandas_obj()expectspd.Seriesorpd.DataFrameand doesn't handlepd.Timestamp(it would handle a series ofpd.Timestampvalues without collisions though)data_versionsmatch the result of the default implementationhash_value()and falls underhash_value(pd.Timestamp(...).__dict__)pd.Timestamp(...).__dict__ == {}, which is an odd behavior by pandas (it shouldn't be defined instead of empty)hash_value()handles objects without a__dict__(i.e., uses slots), but doesn't account for empty.__dict__attributeside notes