add TaskEncodingFormatter #200

Draft
ArneBinder wants to merge 2 commits into main from dataset/task_encoding_formatter

ArneBinder (Owner) commented Aug 5, 2025

In addition, this PR:

  • registers TaskEncodingFormatter as "task_encoding"
  • moves DocumentFormatter to formatter.py
  • adds the parameter format (Optional[str], optional) to (Iterable)Dataset.map_to_hf(): The format to set for the dataset. Defaults to None.
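
As a rough illustration of what registering a formatter under a name and selecting it through a `format` argument could look like, here is a self-contained toy sketch. It does not use the real pie_datasets or Hugging Face datasets machinery: `ToyFormatter`, `FORMATTERS`, `register_formatter`, `ToyTaskEncodingFormatter`, and `map_rows` are hypothetical names invented for this example, and the actual implementation in this PR may differ.

```python
from typing import Any, Callable, Dict, List, Optional, Type


class ToyFormatter:
    """Base class: turns a raw row (a plain dict) into the value handed to the user."""

    def format_row(self, row: Dict[str, Any]) -> Any:
        return row


# Hypothetical name -> formatter class registry (an assumption, not the real API).
FORMATTERS: Dict[str, Type[ToyFormatter]] = {}


def register_formatter(name: str) -> Callable[[Type[ToyFormatter]], Type[ToyFormatter]]:
    """Class decorator that makes a formatter selectable by name."""

    def decorator(cls: Type[ToyFormatter]) -> Type[ToyFormatter]:
        FORMATTERS[name] = cls
        return cls

    return decorator


@register_formatter("task_encoding")
class ToyTaskEncodingFormatter(ToyFormatter):
    def format_row(self, row: Dict[str, Any]) -> Any:
        # In this sketch a task encoding only carries inputs and targets.
        return {"inputs": row["inputs"], "targets": row["targets"]}


def map_rows(
    rows: List[Dict[str, Any]],
    function: Callable[[Dict[str, Any]], Dict[str, Any]],
    format: Optional[str] = None,
) -> List[Any]:
    """Apply `function` to each row, then format the result if a format name is given."""
    mapped = [function(row) for row in rows]
    if format is None:
        return mapped
    formatter = FORMATTERS[format]()
    return [formatter.format_row(row) for row in mapped]
```

With `format=None` the mapped rows are returned unchanged, which mirrors the documented default of the new parameter.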

This functionality is useful for converting a PIE dataset to a non-PIE dataset. For instance, it allows using map() to encode a dataset for training, as in the following version of a PieDataModule:

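from typing import Any, Dict, Iterable, List, Sequence, Union

from pie_core.utils.dictionary import list_of_dicts2dict_of_lists
from pytorch_ie import PieDataModule

# Dataset, IterableDataset, TaskEncodingDataset, IterableTaskEncodingDataset and
# DocumentType also need to be imported; their module paths are not given here.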

class MyPieDataModule(PieDataModule):

    def _encode_document_batch(self, documents: Sequence[DocumentType]) -> Dict[str, List[Any]]:
        task_encodings, documents_in_order = self.taskmodule.batch_encode(
            documents=documents,
            encode_target=True,
        )
        task_encodings_list = [
            {"inputs": task_encoding.inputs, "targets": task_encoding.targets}
            for task_encoding in task_encodings
        ]
        return list_of_dicts2dict_of_lists(task_encodings_list)
    
    def encode_documents(self, documents: Iterable[DocumentType]) -> Union[TaskEncodingDataset, IterableTaskEncodingDataset]:
        # use dataset.map when input is a dataset ...
        if isinstance(documents, (Dataset, IterableDataset)):
            encoded_documents = documents.map_to_hf(
                function=self._encode_document_batch,
                batched=True,
                batch_size=self.taskmodule.encode_document_batch_size,
                format="task_encoding",
            )
        # ... otherwise fall back to usual encode
        else:
            encoded_documents = self.taskmodule.encode(
                documents=documents,
                encode_target=True,
                show_progress=self.show_progress_for_encode,
            )
        if isinstance(encoded_documents, Sequence):
            task_encoding_dataset = TaskEncodingDataset(encodings=encoded_documents)
        else:
            task_encoding_dataset = IterableTaskEncodingDataset(encodings=encoded_documents)
        return task_encoding_dataset
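
Assuming `map_to_hf` follows the Hugging Face `map()` convention for `batched=True`, the mapped function has to return one list per column, which is why the per-example dicts above are passed through `list_of_dicts2dict_of_lists`. The sketch below shows that conversion and its inverse in plain Python. `to_columns` and `to_rows` are hypothetical stand-ins written for this example, not the pie_core implementation, and they assume that every dict has the same keys.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-example dicts into a dict with one list per key."""
    if not rows:
        return {}
    keys = list(rows[0].keys())
    return {key: [row[key] for row in rows] for key in keys}


def to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Inverse of to_columns: rebuild the per-example dicts from the columns."""
    keys = list(columns.keys())
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


# Made-up example data standing in for encoded inputs and targets.
encodings = [
    {"inputs": [101, 7, 102], "targets": 1},
    {"inputs": [101, 9, 102], "targets": 0},
]
columns = to_columns(encodings)
```

A formatter such as the one registered as "task_encoding" presumably has to perform the second step, turning stored columns back into per-example objects, though the details are specific to this PR.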

This supersedes #173.

TODO: add tests for

  • parameter format
  • TaskEncodingFormatter

ArneBinder added the enhancement label (New feature or request) Aug 5, 2025
codecov (Bot) commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 74.35897% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.72%. Comparing base (f3c904d) to head (4466de1).
⚠️ Report is 15 commits behind head on main.

Files with missing lines              Patch %   Lines
src/pie_datasets/core/formatter.py    72.41%    8 Missing ⚠️
src/pie_datasets/core/dataset.py      80.00%    2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (74.35%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #200      +/-   ##
==========================================
- Coverage   93.52%   92.72%   -0.81%     
==========================================
  Files          10       10              
  Lines         942      962      +20     
==========================================
+ Hits          881      892      +11     
- Misses         61       70       +9     

ArneBinder marked this pull request as draft August 5, 2025 00:53
ArneBinder added a commit to ArneBinder/pytorch-ie that referenced this pull request Aug 5, 2025
This PR makes it easy to override that functionality in downstream projects.

Some context/motivation: ArneBinder/pie-datasets#200.