add `TaskEncodingFormatter` by ArneBinder · Pull Request #200 · ArneBinder/pie-datasets

ArneBinder · 2025-08-05T00:40:06Z

In addition, this:

registers TaskEncodingFormatter as "task_encoding"
moves DocumentFormatter to formatter.py
adds the parameter format (Optional[str], optional) to (Iterable)Dataset.map_to_hf(): The format to set for the dataset. Defaults to None.
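A rough sketch of the idea behind a named formatter, not the code in this PR: a registry maps a format name to a formatter class, and the formatter rebuilds per-example objects from columnar data. `SimpleTaskEncoding`, `SketchTaskEncodingFormatter`, `FORMATTERS`, and `apply_format` are hypothetical stand-ins invented for illustration; the actual `TaskEncodingFormatter` and its registration may look quite different.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union


@dataclass
class SimpleTaskEncoding:
    """Hypothetical stand-in for a task encoding (inputs plus optional targets)."""

    inputs: Any
    targets: Optional[Any] = None


class SketchTaskEncodingFormatter:
    """Hypothetical formatter: turns a columnar batch back into encoding objects."""

    def format_batch(self, batch: Dict[str, List[Any]]) -> List[SimpleTaskEncoding]:
        inputs = batch["inputs"]
        # Targets may be absent, e.g. when encoding without targets.
        targets = batch.get("targets", [None] * len(inputs))
        return [SimpleTaskEncoding(i, t) for i, t in zip(inputs, targets)]


# Hypothetical registry: format name -> formatter class.
FORMATTERS = {"task_encoding": SketchTaskEncodingFormatter}


def apply_format(
    batch: Dict[str, List[Any]], format: Optional[str] = None
) -> Union[Dict[str, List[Any]], List[SimpleTaskEncoding]]:
    """Return the batch unchanged if format is None, else apply the named formatter."""
    if format is None:
        return batch
    return FORMATTERS[format]().format_batch(batch)


batch = {"inputs": [[1, 2], [3]], "targets": [0, 1]}
encodings = apply_format(batch, format="task_encoding")
```

With `format=None` the columnar batch is passed through untouched, which mirrors the documented default of the new parameter.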

This functionality is useful for converting a PIE dataset to a non-PIE dataset, for instance to encode a dataset for training via map(), as in the following variant of a PieDataModule:

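from typing import Any, Dict, Iterable, List, Sequence, Union

from pie_core.utils.dictionary import list_of_dicts2dict_of_lists
from pytorch_ie import PieDataModule

# NOTE: Dataset, IterableDataset, DocumentType, TaskEncodingDataset and
# IterableTaskEncodingDataset also need to be imported; their import paths are not shown here.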

class MyPieDataModule(PieDataModule):

    def _encode_document_batch(self, documents: Sequence[DocumentType]) -> Dict[str, List[Any]]:
        task_encodings, documents_in_order = self.taskmodule.batch_encode(
            documents=documents,
            encode_target=True,
        )
        task_encodings_list = [
            {"inputs": task_encoding.inputs, "targets": task_encoding.targets}
            for task_encoding in task_encodings
        ]
        return list_of_dicts2dict_of_lists(task_encodings_list)
    
    def encode_documents(self, documents: Iterable[DocumentType]) -> Union[TaskEncodingDataset, IterableTaskEncodingDataset]:
        # use dataset.map when input is a dataset ...
        if isinstance(documents, (Dataset, IterableDataset)):
            encoded_documents = documents.map_to_hf(
                function=self._encode_document_batch,
                batched=True,
                batch_size=self.taskmodule.encode_document_batch_size,
                format="task_encoding",
            )
        # ... otherwise fall back to usual encode
        else:
            encoded_documents = self.taskmodule.encode(
                documents=documents,
                encode_target=True,
                show_progress=self.show_progress_for_encode,
            )
        if isinstance(encoded_documents, Sequence):
            task_encoding_dataset = TaskEncodingDataset(encodings=encoded_documents)
        else:
            task_encoding_dataset = IterableTaskEncodingDataset(encodings=encoded_documents)
        return task_encoding_dataset
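
The call to `list_of_dicts2dict_of_lists` is there because a batched map function is expected to return one list per column rather than one dict per example. A minimal pure-Python sketch of that conversion follows; `to_dict_of_lists` is a hypothetical stand-in written for illustration, not the `pie_core` implementation, and it assumes all dicts share the same keys.

```python
from typing import Any, Dict, List


def to_dict_of_lists(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert per-example dicts into one list per key (columnar layout)."""
    if not rows:
        return {}
    # Assumption: every row has the same keys as the first one.
    keys = list(rows[0].keys())
    return {key: [row[key] for row in rows] for key in keys}


rows = [
    {"inputs": [101, 7, 102], "targets": 1},
    {"inputs": [101, 9, 102], "targets": 0},
]
columns = to_dict_of_lists(rows)
```

Here `columns` holds an `"inputs"` list and a `"targets"` list of equal length, one entry per encoded example.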

This supersedes #173.

TODO: add tests for

parameter format
TaskEncodingFormatter

codecov · 2025-08-05T00:41:52Z

Codecov Report

❌ Patch coverage is 74.35897% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.72%. Comparing base (f3c904d) to head (4466de1).
⚠️ Report is 15 commits behind head on main.

Files with missing lines	Patch %	Lines
src/pie_datasets/core/formatter.py	72.41%	8 Missing ⚠️
src/pie_datasets/core/dataset.py	80.00%	2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (74.35%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #200      +/-   ##
==========================================
- Coverage   93.52%   92.72%   -0.81%     
==========================================
  Files          10       10              
  Lines         942      962      +20     
==========================================
+ Hits          881      892      +11     
- Misses         61       70       +9
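
As a sanity check, the percentages in the report can be reproduced from the hit and line counts. The project figures come straight from the diff above; the patch totals (29 of 39 changed lines covered) are inferred from the reported 74.35897% and the 10 missing lines, they are not stated in the report itself.

```python
def coverage(hits: int, lines: int) -> float:
    """Coverage in percent."""
    return 100.0 * hits / lines


base = coverage(881, 942)  # main
head = coverage(892, 962)  # this PR

# Inferred per-file patch totals: 21 of 29 lines (formatter.py), 8 of 10 lines (dataset.py).
formatter_patch = coverage(21, 29)
dataset_patch = coverage(8, 10)
patch = coverage(21 + 8, 29 + 10)

print(f"base={base:.2f}% head={head:.2f}% patch={patch:.5f}%")
```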

This PR makes it easy to override that functionality in downstream projects. Some context/motivation: ArneBinder/pie-datasets#200.

move DocumentFormatter to formatter.py; implement TaskEncodingFormatter

67c68ac

ArneBinder added the enhancement New feature or request label Aug 5, 2025

add parameter format to (Iterable)Dataset.map_to_hf()

4466de1

ArneBinder marked this pull request as draft August 5, 2025 00:53

This was referenced Aug 5, 2025

add result_format to Dataset.map() #173

Closed

outsource encode_documents() from PieDataModule.setup() ArneBinder/pytorch-ie#490

Merged

ArneBinder wants to merge 2 commits into main from dataset/task_encoding_formatter
