Co-authored by: Noah Holm <32292420+noppaz@users.noreply.github.com> Co-authored by: Jeremy Cohen <jeremy@dbtlabs.com>
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #11177 +/- ##
==========================================
- Coverage 91.35% 87.93% -3.43%
==========================================
Files 203 203
Lines 25683 25717 +34
==========================================
- Hits 23462 22613 -849
- Misses 2221 3104 +883
jtcohen6 left a comment:
🎩 Given a my_small_seed.csv (<1 MB) and my_large_seed.csv (~3 MB)
Using dbt-core@main:
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:29:42 Running with dbt=1.10.0-a1
16:29:42 Registered adapter: duckdb=1.9.1
16:29:42 Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:29:43 Running with dbt=1.10.0-a1
16:29:43 Registered adapter: duckdb=1.9.1
16:29:43 Found 1 operation, 1 model, 2 seeds, 424 macros
16:29:43 Found a seed (testy.my_large_seed) >1MB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:29:43 The selection criterion 'state:modified' does not match any enabled nodes
16:29:43 No nodes selected!
Switching to dbt-core@jerco/redo-pr-7125 (without recreating state/manifest.json):
% dbt ls -s state:modified --state state
16:30:03 Running with dbt=1.10.0-a1
16:30:03 Registered adapter: duckdb=1.9.1
16:30:03 Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:03 Found a seed (testy.my_large_seed) >1MiB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:30:03 The selection criterion 'state:modified' does not match any enabled nodes
16:30:03 No nodes selected!
This proves that the checksum of my_small_seed has not changed.
Now, let's set the config and redo:
% export DBT_MAXIMUM_SEED_SIZE_MIB=10
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:30:45 Running with dbt=1.10.0-a1
16:30:45 Registered adapter: duckdb=1.9.1
16:30:45 Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:30:46 Running with dbt=1.10.0-a1
16:30:46 Registered adapter: duckdb=1.9.1
16:30:46 Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:46 The selection criterion 'state:modified' does not match any enabled nodes
16:30:46 No nodes selected!
Manually edit one row in the large seed:
% dbt ls -s state:modified --state state
16:36:58 Running with dbt=1.10.0-a1
16:36:58 Registered adapter: duckdb=1.9.1
16:36:58 Found 1 operation, 1 model, 2 seeds, 424 macros
testy.my_large_seed
  else:
-     file_contents = load_file_contents(match.absolute_path, strip=True)
-     checksum = FileHash.from_contents(file_contents)
+     checksum = FileHash.from_path(match.absolute_path)
I have confirmed that this is not a "breaking" change, insofar as the same seed produces the same checksum before and after this change.
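The equivalence described here can be illustrated with a small sketch. The helpers below are hypothetical stand-ins and not dbt-core's `FileHash` methods: `hash_contents` hashes text that is fully loaded and stripped, while `hash_path` streams the file in chunks and strips leading and trailing whitespace as it goes. The chunk size and the stripping strategy are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def hash_contents(contents: str) -> str:
    """Hash text that has already been loaded into memory."""
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


def hash_path(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file chunk by chunk, so the digest equals
    hash_contents(text.strip()) without loading the whole file."""
    digest = hashlib.sha256()
    held_back = ""  # trailing whitespace, kept until more text follows
    started = False  # becomes True once non-whitespace text has been seen
    # newline="" disables newline translation, so bytes are hashed as stored.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            if not started:
                chunk = chunk.lstrip()
                if not chunk:
                    continue
                started = True
            chunk = held_back + chunk
            body = chunk.rstrip()
            held_back = chunk[len(body):]
            digest.update(body.encode("utf-8"))
    return digest.hexdigest()


def write_temp_seed(text: str) -> str:
    """Write text to a temporary CSV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return path


seed_text = "\n id,name\n1,alice\n2,bob\n\n"
seed_path = write_temp_seed(seed_text)
# A tiny chunk size forces many chunk boundaries.
same = hash_path(seed_path, chunk_size=4) == hash_contents(seed_text.strip())
os.remove(seed_path)
print(same)
```

Streaming keeps memory use bounded by the chunk size, which matters once the seed size limit can be raised.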
Update: The failing test on Windows seems to indicate that the seed checksum does indeed change as a result of this PR, but only on Windows. The contributor pointed to this code comment as an indication that we actually expect different checksums on Windows versus macOS / Linux.
In that test, the checksum for 'seed.test.seed':
- on macOS + Linux (before/after): '381eb2f...'
- on Windows (before): '54a28a3...'
- on Windows (after): '381eb2f...'
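One plausible cause of a Windows-only difference is line endings (CRLF versus LF). That is an assumption here, since the thread does not spell out the mechanism. The sketch below uses made-up seed contents and hypothetical helper names to show that the same rows hash differently depending on line endings, and that normalizing line endings before hashing removes the difference.

```python
import hashlib

# Same rows, stored with LF and with CRLF line endings (illustrative data).
lf_bytes = b"id,name\n1,alice\n2,bob\n"
crlf_bytes = lf_bytes.replace(b"\n", b"\r\n")


def sha256_hex(data: bytes) -> str:
    """Hash the bytes exactly as stored."""
    return hashlib.sha256(data).hexdigest()


def sha256_normalized(data: bytes) -> str:
    """Hash after converting CRLF line endings to LF."""
    return sha256_hex(data.replace(b"\r\n", b"\n"))


print(sha256_hex(lf_bytes) == sha256_hex(crlf_bytes))  # False
print(sha256_normalized(lf_bytes) == sha256_normalized(crlf_bytes))  # True
```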
I think our options are:
1. Restore the behavior of returning different checksums on different systems. That might be easiest, but I agree it's more desirable to return the same checksum regardless of OS.
2. Place this behind a behavior-change flag, which we very quickly flip to "True" by default (in the next minor version of dbt-core), with a targeted advisory for users on Windows: "When this change rolls out, all your seeds will be detected as modified, if comparing against a manifest produced before this code change."
3. Roll this out in the next minor version of dbt-core, without a behavior-change flag. It won't affect the vast majority of users (who are not running on Windows). We mention it in the upgrade guide.
I lean toward option (2) for thoroughness.
I wonder whether we want to just use logic like this:
- If we find the hash is different on Windows (and maybe also if the manifest was computed with a previous version of dbt), compute it with the other method to see if it changed. If not, we return "not changed" but persist the new hash in the artifact.
This should provide a seamless upgrade experience, and we wouldn't need a flag at all.
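The fallback idea above could look roughly like the following sketch. Everything in it is hypothetical: the `Checksum` class, the two hashing helpers, and `seed_is_modified` are illustrative names, and the assumption that the old and new methods differ only in line-ending handling is made purely so the example has something to compare.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Checksum:
    name: str
    checksum: str


def legacy_checksum(raw: bytes) -> Checksum:
    """Assumed 'old' method: hash the stripped text exactly as stored."""
    text = raw.decode("utf-8").strip()
    return Checksum("sha256", hashlib.sha256(text.encode("utf-8")).hexdigest())


def current_checksum(raw: bytes) -> Checksum:
    """Assumed 'new' method: normalize CRLF to LF before hashing."""
    text = raw.decode("utf-8").replace("\r\n", "\n").strip()
    return Checksum("sha256", hashlib.sha256(text.encode("utf-8")).hexdigest())


def seed_is_modified(stored: Checksum, raw: bytes) -> Tuple[bool, Checksum]:
    """Return (modified, checksum_to_persist) for a seed's current bytes."""
    new = current_checksum(raw)
    if new == stored:
        return False, new
    # The stored hash may predate the change: retry with the legacy method.
    if legacy_checksum(raw) == stored:
        return False, new  # content unchanged; persist the new-style hash
    return True, new


raw_seed = b"id,name\r\n1,alice\r\n"
old_manifest_checksum = legacy_checksum(raw_seed)
modified, to_persist = seed_is_modified(old_manifest_checksum, raw_seed)
print(modified)  # False
```

The cost of this design is a second hash computation whenever the first comparison fails, which is why restricting the recomputation to affected platforms comes up later in the review.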
As long as option 2 is a minor addition of work in the PR, I think that's great for completeness. But as you say, with option 3 the impact should be minimal.
ChenyuLInx left a comment:
LGTM with one thought around migration.
Hey @jtcohen6, happy to see some attention here again. While I'm a bit late to the party, know that I'd be available to assist further in getting this on the road.
Additional Artifact Review Required: Changes to artifact directory files require at least 2 approvals from core team members.
def seed_too_large(self) -> bool:
    """Return whether the file this represents is over the seed size limit"""
    return os.stat(self.full_path).st_size > MAXIMUM_SEED_SIZE
def file_size(self) -> int:
This could be a @property to conform nicely with the rest of the FilePath implementation.
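A minimal sketch of that suggestion follows. `FilePathSketch` is a hypothetical stand-in with only the fields needed here, not dbt-core's actual `FilePath` class; the `os.stat(...).st_size` call is the one shown in the diff above.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FilePathSketch:
    """Hypothetical stand-in for a FilePath-like class."""

    searched_path: str
    relative_path: str

    @property
    def full_path(self) -> str:
        return os.path.join(self.searched_path, self.relative_path)

    @property
    def file_size(self) -> int:
        """Size in bytes of the file this path points to."""
        return os.stat(self.full_path).st_size


# Usage: attribute access, no call parentheses.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "my_seed.csv"), "wb") as handle:
        handle.write(b"id\n1\n")
    size = FilePathSketch(tmp_dir, "my_seed.csv").file_size

print(size)  # 5
```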
if (
    not result
    and self.checksum.name == other.checksum.name
    and self.checksum.name not in ("path", "none")
    and self.root_path
):
Am I understanding correctly that we're always using the legacy path implementation anytime there is a checksum difference on a seed with any contents (or one that is a large seed)?
Yes. Now that I think about it more, I should probably add an os.name == "nt" check to only recompute on Windows :D
Hey dbt team, I'm very happy to see some progress here. This is still a pain point for me and my team, so I'd be super happy to get this out in 1.12. Let me know if I can help.
resolves #7117
resolves #7124
Reapply changes from #7125. (This proved easier than rebasing the commits directly.)
Problem
We apply an arbitrary limit of 1 MiB to seeds (CSVs), for the specific purpose of hashing contents and comparing those hashed contents during state:modified comparison. (That's a mebibyte, not a megabyte, for those who care to distinguish between the two.)
That is, dbt doesn't raise an error if it detects a seed larger than MAXIMUM_SEED_SIZE_MIB, but certain features become unavailable and dbt does not make any guarantees about acceptable performance.
We could adjust this for inflation (1 MiB in 2020 is worth ~1.2 MiB today), but this PR takes the more forward-looking approach of making the value configurable by end users. Users can instruct dbt to compare the contents of larger seeds, so long as they're willing to "pay the price" of hashing the contents of large seeds.
We will need to update docs: https://docs.getdbt.com/reference/node-selection/state-comparison-caveats
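To make the configurable limit concrete, here is a rough sketch of a size check driven by the DBT_MAXIMUM_SEED_SIZE_MIB environment variable used in the walkthrough above. The function names, the default constant, and the way the variable is read are assumptions for illustration; the PR's actual flag plumbing is not shown in this thread.

```python
import os
from typing import Mapping

MIB = 1024 * 1024  # one mebibyte in bytes
DEFAULT_MAXIMUM_SEED_SIZE_MIB = 1  # the existing 1 MiB limit


def maximum_seed_size_bytes(environ: Mapping[str, str] = os.environ) -> int:
    """Return the seed size limit in bytes (hypothetical helper)."""
    raw = environ.get("DBT_MAXIMUM_SEED_SIZE_MIB")
    mib = int(raw) if raw else DEFAULT_MAXIMUM_SEED_SIZE_MIB
    return mib * MIB


def seed_too_large(size_bytes: int, environ: Mapping[str, str] = os.environ) -> bool:
    """A seed is too large to hash when it is strictly over the limit."""
    return size_bytes > maximum_seed_size_bytes(environ)


three_mib_seed = 3 * MIB
print(seed_too_large(three_mib_seed, {}))  # True: default limit is 1 MiB
print(seed_too_large(three_mib_seed, {"DBT_MAXIMUM_SEED_SIZE_MIB": "10"}))  # False
```

With the limit raised to 10 MiB, a ~3 MB seed is hashed by content, so an edit to one row shows up under state:modified, as in the walkthrough.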
Solution
From #7125:
Checklist