Skip to content

Standardize materialize filenames a little#1283

Merged
HenryL27 merged 11 commits into
mainfrom
hml-materialize-docids
May 5, 2025
Merged

Standardize materialize filenames a little#1283
HenryL27 merged 11 commits into
mainfrom
hml-materialize-docids

Conversation

@HenryL27
Copy link
Copy Markdown
Collaborator

@HenryL27 HenryL27 commented May 2, 2025

Intent: add ability to materialize from a list of doc_ids

doc_ids = ["doc1", "doc2", ...]
ctx.read.materialize(path={"root": "materializedir", "filter": DocIdFilter(doc_ids)})

Resulting changes:

  1. Group functions to translate between document, doc_id, materialize filename to what's compatible with what.
  2. Plumb it through materialize and mrr
  3. Make mrr work with a pre-existing materialize filter
  4. Make mrr keep metadata documents (unrelated to this particular thing but I think the correct behavior)
  5. A couple more hacks

Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
@HenryL27 HenryL27 requested a review from dhruvkaliraman7 May 2, 2025 20:58
HenryL27 added 2 commits May 2, 2025 14:17
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Copy link
Copy Markdown
Contributor

@dhruvkaliraman7 dhruvkaliraman7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits, nice work!

assert len(self.metadata["lineage_links"]["from_ids"]) > 0

self.data["doc_id"] = mkdocid()
if "doc_id" not in self.data:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o/w metadata documents get a new docid every time they're materialized (as materialize call the constructor) which I'm pretty sure is the incorrect behavior


# Create a function that fails for specific documents
def failing_map(doc):
# logger.info(doc)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@@ -674,9 +688,9 @@ def test_materialize_read_reliability_retries_successful(self):
docs = make_docs(10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of this test can we assert that we have 11 docs? The make_docs should give you 10 Documents and 1 MetadataDocument

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on how docs get batched through noop I can't guarantee that there will be 11 but I can assert that there are at least one

return d


logger = logging.getLogger(__name__)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Comment thread lib/sycamore/sycamore/materialize.py Outdated
ret = []
count = 0
for fi in self._fshelper.list_files(self._root):
# logger.info(fi)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
@HenryL27 HenryL27 merged commit b6ba67d into main May 5, 2025
11 of 15 checks passed
@HenryL27 HenryL27 deleted the hml-materialize-docids branch August 30, 2025 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants