Make MAXIMUM_SEED_SIZE_MIB configurable by jtcohen6 · Pull Request #11177 · dbt-labs/dbt-core

jtcohen6 · 2024-12-24T16:26:52Z

resolves #7117
resolves #7124

Reapply changes from #7125. (This proved easier than rebasing the commits directly.)

Problem

We apply an arbitrary limit of 1 MiB to seeds (CSVs), for the specific purpose of hashing contents and comparing those hashed contents during state:modified comparison. (That's a mebibyte, not a megabyte, for those who care to distinguish between the two).

That is, dbt doesn't raise an error if it detects a seed larger than MAXIMUM_SEED_SIZE_MIB, but certain features become unavailable and dbt does not make any guarantees about acceptable performance.

We could adjust this for inflation (1 MiB in 2020 is worth ~1.2 MiB today), but this PR takes the more forward-looking approach of making the value configurable by end users. Users can instruct dbt to compare the contents of larger seeds, so long as they're willing to "pay the price" of hashing the contents of large seeds.

We will need to update docs: https://docs.getdbt.com/reference/node-selection/state-comparison-caveats

Solution

From #7125:

Increasing the maximum size of seed files where the content is hashed for state comparison will enable a greater use of deferred runs with updated seed file contents. By making this a configuration with an environment variable users are able to override the default 1 MiB limit.

Furthermore, to support reading larger files in memory constrained environments, a new method is added to read and compute file contents incrementally. As this is performed on bytes for better performance we continue using the previous UTF-8 content method for small files to not mess up current states where seed files are not stored with UTF-8.

Checklist

I have read the contributing guide and understand what's expected of me.
I have run this code in development, and it appears to resolve the stated issue.
This PR includes tests, or tests are not required or relevant for this PR.
This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
This PR includes type annotations for new and modified functions.

Co-authored by: Noah Holm <32292420+noppaz@users.noreply.github.com> Co-authored by: Jeremy Cohen <jeremy@dbtlabs.com>

codecov · 2024-12-24T16:29:12Z

Codecov Report

❌ Patch coverage is 81.39535% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.93%. Comparing base (befdd4e) to head (a4277b8).

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11177      +/-   ##
==========================================
- Coverage   91.35%   87.93%   -3.43%     
==========================================
  Files         203      203              
  Lines       25683    25717      +34     
==========================================
- Hits        23462    22613     -849     
- Misses       2221     3104     +883

Flag	Coverage Δ
integration	`87.93% <81.39%> (-0.20%)`	⬇️
unit	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Unit Tests	`87.93% <81.39%> (-3.43%)`	⬇️
Integration Tests	`87.93% <81.39%> (-0.20%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jtcohen6

🎩 Given a my_small_seed.csv (<1 MB) and my_large_seed.csv (~3 MB)

Using dbt-core@main:

% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state    
16:29:42  Running with dbt=1.10.0-a1
16:29:42  Registered adapter: duckdb=1.9.1
16:29:42  Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:29:43  Running with dbt=1.10.0-a1
16:29:43  Registered adapter: duckdb=1.9.1
16:29:43  Found 1 operation, 1 model, 2 seeds, 424 macros
16:29:43  Found a seed (testy.my_large_seed) >1MB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:29:43  The selection criterion 'state:modified' does not match any enabled nodes
16:29:43  No nodes selected!

Switching to dbt-core@jerco/redo-pr-7125 (without recreating state/manifest.json):

% dbt ls -s state:modified --state state    
16:30:03  Running with dbt=1.10.0-a1
16:30:03  Registered adapter: duckdb=1.9.1
16:30:03  Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:03  Found a seed (testy.my_large_seed) >1MiB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:30:03  The selection criterion 'state:modified' does not match any enabled nodes
16:30:03  No nodes selected!

This proves that the checksum of my_small_seed has not changed.

Now, let's set the config and redo:

% export DBT_MAXIMUM_SEED_SIZE_MIB=10
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:30:45  Running with dbt=1.10.0-a1
16:30:45  Registered adapter: duckdb=1.9.1
16:30:45  Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:30:46  Running with dbt=1.10.0-a1
16:30:46  Registered adapter: duckdb=1.9.1
16:30:46  Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:46  The selection criterion 'state:modified' does not match any enabled nodes
16:30:46  No nodes selected!

Manually edit one row in the large seed:

% dbt ls -s state:modified --state state
16:36:58  Running with dbt=1.10.0-a1
16:36:58  Registered adapter: duckdb=1.9.1
16:36:58  Found 1 operation, 1 model, 2 seeds, 424 macros
testy.my_large_seed

jtcohen6 · 2024-12-24T16:33:36Z

core/dbt/parser/read_files.py

    else:
-        file_contents = load_file_contents(match.absolute_path, strip=True)
-        checksum = FileHash.from_contents(file_contents)
+        checksum = FileHash.from_path(match.absolute_path)


I have confirmed that this is not a "breaking" change, insofar as the same seed produces the same checksum before and after this change.

Update: The failing test on Windows seems to indicate that the seed checksum does indeed change, as a result of this PR, but only on Windows. The contributor mentioned this code comment as indication that we expect actually different checksums on Windows versus MacOS / Linux.

In that test, the checksum for 'seed.test.seed':

on MacOS + Linux (before/after): '381eb2f...'

on Windows (before): '54a28a3...'

on Windows (after): '381eb2f...'

I think our options are:

Restore the behavior to return actually different checksums on different systems. That might be easiest, but I agree it's more desirable to return the same checksum regardless of OS.

Place this behind a behavior-change flag, which we very quickly flip to "True" by default (in the next minor version of dbt-core), with targeted advisory for users on Windows: "When this change rolls out, all your seeds will be detected as modified, if comparing against a manifest produced before this code change."

Roll this out in the next minor version of dbt-core, without a behavior-change flag. It won't affect the vast majority of users (who are not running on Windows). We mention it in the upgrade guide.

I lean toward option (2) for thoroughness.

I wonder weather we want to just do a logic of:

if we find the hash is different on Windows(maybe also if the manifest was computed with a previous version of dbt), compute with the other method to see if it changed, if not, we return not changed but persist the new hash in artifact.

This should provide a seamless upgrade experience and we don't need to have a flag at all.

As long as option 2 is a minor addition of work in the PR I think that's great for completeness. But as you say with option 3 the impact should be minimal.

core/dbt/artifacts/resources/base.py

ChenyuLInx

LGTM with one thought around migration.

ChenyuLInx · 2025-01-07T19:45:21Z

core/dbt/parser/read_files.py

    else:
-        file_contents = load_file_contents(match.absolute_path, strip=True)
-        checksum = FileHash.from_contents(file_contents)
+        checksum = FileHash.from_path(match.absolute_path)


I wonder weather we want to just do a logic of:

if we find the hash is different on Windows(maybe also if the manifest was computed with a previous version of dbt), compute with the other method to see if it changed, if not, we return not changed but persist the new hash in artifact.

This should provide a seamless upgrade experience and we don't need to have a flag at all.

noppaz · 2025-02-24T13:43:46Z

Hey @jtcohen6, happy to see some attention here again. While I'm a bit late to the party know that I'd be available to assist further in getting this on the road.

github-actions · 2026-03-15T14:20:36Z

Additional Artifact Review Required

Changes to artifact directory files requires at least 2 approvals from core team members.

MichelleArk · 2026-03-16T17:49:58Z

core/dbt/contracts/files.py

-    def seed_too_large(self) -> bool:
-        """Return whether the file this represents is over the seed size limit"""
-        return os.stat(self.full_path).st_size > MAXIMUM_SEED_SIZE
+    def file_size(self) -> int:


this could be a @Property to conform nicely with the rest of the FilePath implementation

core/hatch.toml

core/dbt/cli/params.py

MichelleArk · 2026-03-16T18:22:19Z

core/dbt/contracts/graph/nodes.py

+        if (
+            not result
+            and self.checksum.name == other.checksum.name
+            and self.checksum.name not in ("path", "none")
+            and self.root_path
+        ):


Am I understanding correctly that we're always using the legacy path implementation anytime there is a checksum difference on a seed with any contents (or one that is a large seed)?

Yes. Now that I think about it more, I should probably add os.name == "nt" check to only re-compute on windows :D

noppaz · 2026-03-17T19:17:20Z

Hey dbt team, I'm very happy to see some progress here. This is still a pain point for me and my team so I'd be super happy to get this out in 1.12. Let me know if I can help.

Reapply changes from #7125

704120c

Co-authored by: Noah Holm <32292420+noppaz@users.noreply.github.com> Co-authored by: Jeremy Cohen <jeremy@dbtlabs.com>

jtcohen6 requested a review from a team as a code owner December 24, 2024 16:26

cla-bot bot added the cla:yes label Dec 24, 2024

jtcohen6 mentioned this pull request Dec 24, 2024

Make MAXIMUM_SEED_SIZE_MIB configurable #7125

Open

6 tasks

jtcohen6 commented Dec 24, 2024

View reviewed changes

Update MB -> MiB in functional test

20be925

jtcohen6 added the artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking label Dec 24, 2024

ChenyuLInx previously approved these changes Jan 7, 2025

View reviewed changes

mirnawong1 mentioned this pull request Dec 11, 2025

Configurable MAXIMUM_SEED_SIZE for state hash comparison dbt-labs/docs.getdbt.com#2958

Closed

1 task

Merge branch 'main' into jerco/redo-pr-7125

5771956

sriramr98 dismissed ChenyuLInx’s stale review via 5771956 March 15, 2026 14:20

sriramr98 added 2 commits March 16, 2026 12:30

added smooth backward compatibility for seeds hash compute in windows

ac846fa

fixed buggy windows tests

37fcb33

graciegoheen added this to the v1.12 milestone Mar 16, 2026

graciegoheen added the needs_fusion_implementation label Mar 16, 2026

MichelleArk reviewed Mar 16, 2026

View reviewed changes

core/hatch.toml Outdated Show resolved Hide resolved

MichelleArk reviewed Mar 16, 2026

View reviewed changes

core/dbt/cli/params.py Outdated Show resolved Hide resolved

MichelleArk reviewed Mar 16, 2026

View reviewed changes

core/dbt/cli/params.py Outdated Show resolved Hide resolved

MichelleArk reviewed Mar 16, 2026

View reviewed changes

sriramr98 added 2 commits March 17, 2026 11:27

Merge branch 'main' into jerco/redo-pr-7125

41213f9

small improvements

a4277b8

Conversation

jtcohen6 commented Dec 24, 2024

Problem

Solution

Checklist

Uh oh!

codecov bot commented Dec 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jtcohen6 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ChenyuLInx left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

noppaz commented Feb 24, 2025

Uh oh!

github-actions bot commented Mar 15, 2026

Additional Artifact Review Required

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

noppaz commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov bot commented Dec 24, 2024 •

edited

Loading