Add materialize read reliability #1094

Merged: dhruvkaliraman7 merged 17 commits into main from Add-Materialize-Read-Reliability on Feb 6, 2025

@dhruvkaliraman7 (Contributor)

1. MaterializeReadReliability: A new class that enables reliable batch processing of materialized files by:

  • Tracking already processed files

  • Limiting batch sizes

  • Supporting incremental processing through batch resets

  • Maintaining state between batch executions

2. Added utility functions:

  • name_from_docid: Custom naming function that names files using path-based SHA-256 hashes

  • docid_from_path: Generates a document ID (SHA-256 hash) from a file path

  • doc_only_to_binary: Serialization helper
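
The ideas above can be illustrated with a minimal sketch. This is not the code in this PR: `sketch_docid_from_path`, `BatchTracker`, and `run_batches` are hypothetical names, the `path-sha256-` prefix is an assumption, and the sketch marks a path as processed as soon as it is accepted, whereas a real implementation would likely record it only after the batch succeeds.

```python
import hashlib


def sketch_docid_from_path(path):
    """Derive a stable ID from a path by hashing it with SHA-256.

    Hypothetical helper; the "path-sha256-" prefix is an assumption.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return "path-sha256-" + digest


class BatchTracker:
    """Hypothetical tracker: remembers seen paths and caps each batch."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.seen = set()        # IDs accepted in any earlier batch
        self.current_batch = 0   # number accepted in the current batch

    def accept(self, path):
        """Return True if the path belongs in the current batch."""
        doc_id = sketch_docid_from_path(path)
        if doc_id in self.seen or self.current_batch >= self.max_batch:
            return False
        self.seen.add(doc_id)
        self.current_batch += 1
        return True

    def reset_batch(self):
        """Start a new batch while keeping the set of seen paths."""
        self.current_batch = 0


def run_batches(paths, max_batch):
    """Repeatedly filter the same path list until nothing new is left."""
    tracker = BatchTracker(max_batch)
    batches = []
    while True:
        tracker.reset_batch()
        batch = [p for p in paths if tracker.accept(p)]
        if not batch:
            break
        batches.append(batch)
    return batches
```

Because the seen set survives `reset_batch()`, each pass over the same path list picks up only files that earlier passes skipped, which is the incremental behavior the description refers to.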

* Initial dev

* Remove debugging code

* Remove old code which passed reliability object to context

* Add exception handling after all files processed on ray, lint fix

* Add unit tests

* Switch to using Path Partition Filter

* Add logging, fix assertions in tests

* lint

* refactor tests

* mypy fix, make tests efficient, uniform naming convention

* Remove print

* Change func call from merge

* Better docs and logging

* Yet better docs

* nits

* lint smh

* Address comments
@dhruvkaliraman7 merged commit ccd78b7 into main on Feb 6, 2025
@dhruvkaliraman7 deleted the Add-Materialize-Read-Reliability branch on February 6, 2025 01:48
3 participants