fix(cli.update): deduplicate on file_id instead of sha #138

Merged
svenrdz merged 2 commits into main from fix-update-deduplicate-fileid
Feb 5, 2026

svenrdz (Collaborator) commented Feb 4, 2026

Follow-up to #132 that handles the missing case:

  • a File in the database shares its file_id with a new one from an index_node
  • the two Files have different checksums

When a query is removed and then re-added, a fetched file may have the
same file_id as a file already in the database but a different checksum
(SHA). The previous code deduplicated by SHA, so it treated such files as
new and tried to insert them, violating the unique constraint on file_id.
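
The failure mode described above can be reproduced in isolation. The following is a minimal sketch using the standard-library `sqlite3` module, not the esgpull code: the `file` table, the `sha_based_insert` helper, and the sample identifiers are all hypothetical and only assume a unique constraint on `file_id`.

```python
import sqlite3


def sha_based_insert(conn, file_id, sha):
    """Hypothetical SHA-based check: a file counts as new if its SHA is unknown."""
    known = conn.execute("SELECT 1 FROM file WHERE sha = ?", (sha,)).fetchone()
    if known is None:
        # The SHA is unknown, so the file is inserted even if its file_id exists.
        conn.execute(
            "INSERT INTO file (file_id, sha) VALUES (?, ?)", (file_id, sha)
        )
        return "inserted"
    return "skipped"


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed schema: file_id is unique, sha is not.
    conn.execute(
        "CREATE TABLE file (file_id TEXT NOT NULL UNIQUE, sha TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO file VALUES ('proj.ds.v1.file.nc', 'aaa')")
    try:
        # Same file_id as the stored row, different checksum.
        return sha_based_insert(conn, "proj.ds.v1.file.nc", "bbb")
    except sqlite3.IntegrityError as exc:
        return f"IntegrityError: {exc}"
```

Calling `demo()` returns a string starting with `IntegrityError`, because the SHA lookup misses the stored row and the insert then collides on `file_id`.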

Now we check by file_id instead:
- Fetch existing file_ids from DB before processing
- For each fetched file, check if its file_id exists in DB  
- If yes, fetch the existing file object from DB
- If no, it's a new file that needs to be added
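
The steps above can be sketched roughly as follows. This is an illustration only, not the code in esgpull/cli/update.py: the `File` dataclass and the `split_by_file_id` helper are hypothetical, and the database is stood in for by a plain dict keyed on file_id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class File:
    """Hypothetical stand-in for a file record."""
    file_id: str
    sha: str


def split_by_file_id(fetched, db_files):
    """Partition fetched files into (existing, new) using file_id only.

    db_files maps file_id -> File already stored in the database.
    """
    existing_ids = set(db_files)  # fetched once, before processing
    existing, new = [], []
    for file in fetched:
        if file.file_id in existing_ids:
            # Reuse the stored object, even if the checksum differs.
            existing.append(db_files[file.file_id])
        else:
            new.append(file)
    return existing, new
```

For example, with `db_files = {"a": File("a", "sha-old")}` and fetched files `File("a", "sha-new")` and `File("b", "sha-b")`, the first is resolved to the stored `File("a", "sha-old")` and only the second is reported as new.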

Changes:
- esgpull/cli/update.py: Check by file_id instead of SHA
- esgpull/models/sql.py: Add all_file_ids() method
- tests/conftest.py: Add db fixture
- tests/test_context.py: Add test for file_id checking

This keeps database logic in update.py (not Context) and properly handles
the remove-and-re-add query scenario.
svenrdz mentioned this pull request Feb 5, 2026
svenrdz merged commit 76731e5 into main Feb 5, 2026
3 checks passed
svenrdz deleted the fix-update-deduplicate-fileid branch February 13, 2026 10:19