[ENH] Add optimized D1 layer categorical encoder for v2 #2211

Open
Siddhazntx wants to merge 3 commits into sktime:main from Siddhazntx:feature/v2-label-encoder

@Siddhazntx
Contributor

Reference Issues/PRs

Addresses the label_encoders task mentioned in the v2 roadmap tracking issue: #1974

What does this implement/fix? Explain your changes.

This PR introduces an optimized D1CategoricalEncoder to the v2 data pipeline to handle categorical and text variables, preventing PyTorch tensor conversion crashes.

Key Changes:

  • New Encoder Class: Created _encoders_v2.py featuring a D1CategoricalEncoder that strictly follows the scikit-learn API (fit, transform, inverse_transform).
  • C-Level Optimization: Utilized pd.factorize() instead of native Python dictionaries to ensure the encoding process is efficient and scalable for large datasets.
  • Robust Edge-Case Handling:
    • Preserves original NaN values instead of silently dropping them.
    • Handles unseen categories (values not present during fit) in the transform phase, defaulting to 0.
    • Uses a _warned_cols set so that the warning for unseen categories is triggered only once per column, preventing terminal flooding during dataloader loops.
  • D1 Layer Integration: Integrated the encoder into the __init__ of TimeSeries inside _timeseries_v2.py. Columns specified in the cat argument are now automatically encoded.
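
To make the behaviour described above easier to review, here is a minimal sketch of a pd.factorize()-based encoder. It is not the code in _encoders_v2.py: the class name SketchCategoricalEncoder, the explicit columns argument to fit, and the code layout (1 for NaN, known categories starting at 2) are assumptions made for illustration. Only the mapping of unseen categories to 0 and the _warned_cols set come from this description.

```python
import warnings

import numpy as np
import pandas as pd


class SketchCategoricalEncoder:
    """Illustrative stand-in only; not the D1CategoricalEncoder from this PR."""

    # Assumed code layout: 0 = unseen, 1 = missing (NaN), 2.. = known categories.
    UNSEEN, MISSING, OFFSET = 0, 1, 2

    def __init__(self):
        self.categories_ = {}
        self._warned_cols = set()

    def fit(self, df, columns):
        for col in columns:
            # pd.factorize leaves NaN out of `uniques` by default.
            _, uniques = pd.factorize(df[col])
            self.categories_[col] = pd.Index(uniques)
        return self

    def transform(self, df):
        out = df.copy()
        for col, cats in self.categories_.items():
            # Position of each value in the fitted categories, -1 if absent.
            positions = cats.get_indexer(df[col])
            is_na = df[col].isna().to_numpy()
            unseen = (positions == -1) & ~is_na
            codes = positions + self.OFFSET
            codes[is_na] = self.MISSING
            codes[unseen] = self.UNSEEN
            # Warn only once per column to avoid flooding dataloader loops.
            if unseen.any() and col not in self._warned_cols:
                warnings.warn(f"Column {col!r} has unseen categories; mapping to 0.")
                self._warned_cols.add(col)
            out[col] = codes
        return out

    def inverse_transform(self, df):
        out = df.copy()
        for col, cats in self.categories_.items():
            codes = df[col].to_numpy()
            known = codes >= self.OFFSET
            # Missing and unseen codes are both restored as NaN in this sketch.
            restored = np.full(len(codes), np.nan, dtype=object)
            restored[known] = cats.to_numpy()[codes[known] - self.OFFSET]
            out[col] = restored
        return out


df = pd.DataFrame({"store": ["a", "b", np.nan, "a"], "sales": [1.0, 2.0, 3.0, 4.0]})
enc = SketchCategoricalEncoder().fit(df, columns=["store"])
encoded = enc.transform(df)
```

Reserving separate codes for unseen and missing values is one possible design; it keeps NaN recoverable in inverse_transform, while unseen categories cannot be restored because their original labels were never stored.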

What should a reviewer concentrate their feedback on?

  • Integration Point: Please review the placement of the encoding logic within _timeseries_v2.py's __init__ method to ensure it aligns with the intended v2 data ingestion flow.
  • Unseen Category Strategy: The default is handle_unknown="assign_new", which maps unseen categories to 0. Let me know if the core team prefers a different default behavior for the v2 release!

Did you add any tests for the change?

Yes. I added a comprehensive pytest suite in a new test_encoders_v2.py file. Tests include:

  • test_encoder_fit_transform: Validates integer conversion and preservation of numeric columns using pd.api.types.is_integer_dtype.
  • test_encoder_inverse_transform: Ensures perfect reverse translation, including restoring true NaN values.
  • test_unseen_variables_warning: Confirms correct fallback assignment and verifies the custom warning triggers.
  • test_only_categorical_columns_selected: Ensures the auto-detect feature properly ignores numeric columns when columns=None.
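
As a rough illustration of what the first and last checks verify, the sketch below auto-detects non-numeric columns, encodes them with pd.factorize(), and asserts the resulting dtypes. The helper detect_categorical_columns is hypothetical and only approximates the columns=None behaviour; the actual detection rule in the PR may differ.

```python
import pandas as pd


def detect_categorical_columns(df):
    # Hypothetical helper: treat every non-numeric column as categorical.
    return [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]


df = pd.DataFrame({"store": ["a", "b", "a"], "sales": [1.5, 2.0, 3.5]})
cols = detect_categorical_columns(df)

encoded = df.copy()
for col in cols:
    codes, _ = pd.factorize(df[col])
    encoded[col] = codes

# Only the text column is selected, and it becomes an integer column.
assert cols == ["store"]
assert pd.api.types.is_integer_dtype(encoded["store"])
# The numeric column is left untouched.
assert encoded["sales"].equals(df["sales"])
```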

Any other comments?

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 94.91525% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@edbdeb4).

Files with missing lines                                  Patch %   Lines
pytorch_forecasting/data/_encoders_v2.py                  96.07%    2 Missing ⚠️
...orch_forecasting/data/timeseries/_timeseries_v2.py     87.50%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2211   +/-   ##
=======================================
  Coverage        ?   86.67%           
=======================================
  Files           ?      166           
  Lines           ?     9795           
  Branches        ?        0           
=======================================
  Hits            ?     8490           
  Misses          ?     1305           
  Partials        ?        0           
Flag Coverage Δ
cpu 86.67% <94.91%> (?)
pytest 86.67% <94.91%> (?)
