[WIP] feat(narwhals): implement the new unifying backend#2223
deepyaman wants to merge 8 commits into unionai-oss:dev/narwhals
Codecov Report
❌ Patch coverage is
@@               Coverage Diff                @@
##           dev/narwhals    #2223      +/-   ##
================================================
- Coverage       83.76%   78.00%    -5.77%
================================================
  Files             137      147      +10
  Lines           10764    11962    +1198
================================================
+ Hits             9017     9331     +314
- Misses           1747     2631     +884
hey @deepyaman let me know if you need any help here
Hey @cosmicBboy, updated with a rough description of where things are at/what's necessary before moving forward with this.
hey @deepyaman how are you thinking about the .planning directory? should those be checked in for posterity (and future agent reference)?
I also made a dev/narwhals branch that we can use to merge this PR... can you do an interactive rebase to squash some of the commits into smaller chunks (so as to not lose granularity of commits)?
I don't intend to check them in, but I thought it could be useful to share for now. From my list of pre-merge TODOs above:
I don't know what you think, but I don't think it makes sense to leave project-level artifacts just for agents, even if we want to support them? TBH I'm not super familiar with this.
Sure, will look into doing this.
deepyaman left a comment
There seem to be a number of potential issues, most of which fall into two categories: executing too eagerly and backend-specific logic.
pandera/api/base/error_handler.py
Outdated
    # Import is guarded so ibis remains an optional dependency.
    try:
        import ibis as _ibis
        if isinstance(failure_cases, _ibis.Table):
Suggested change:
    - if isinstance(failure_cases, _ibis.Table):
    + import ibis
    + if isinstance(failure_cases, ibis.Table):
Why alias, why not just import ibis?
pandera/api/base/error_handler.py
Outdated
    if isinstance(failure_cases, str):  # Avoid returning str length
        return 1

    # ibis.Table raises ExpressionError for len(); use .count().execute() instead.
It seems wrong to add an Ibis-specific branch to the base ErrorHandler; isinstance(failure_cases, ibis.Table) made a lot more sense in the Ibis ErrorHandler, and this doesn't seem like the right way to handle it for Narwhals. Maybe the Narwhals backend needs its own ErrorHandler that can dispatch based on type; or, much better, it could just use the Narwhals way of counting. Not sure why that wouldn't work...
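To make the suggestion concrete, here is a rough sketch of keeping the shared handler backend-free and moving the lazy-table branch into a backend-specific subclass. Everything in it is a stand-in I made up for illustration (LazyTableStub, BaseErrorHandler, LazyBackendErrorHandler, count_failure_cases, row_count); it is not pandera's, Ibis's, or Narwhals's actual API, and the real fix may well be to count through Narwhals instead.

```python
from typing import Any


class LazyTableStub:
    """Stand-in for a lazy table (hypothetical; not a real Ibis or Narwhals type)."""

    def __init__(self, rows: list) -> None:
        self._rows = rows

    def __len__(self) -> int:
        # Mimic lazy tables that refuse len() until the query is executed.
        raise TypeError("len() is not supported on a lazy table")

    def row_count(self) -> int:
        """Pretend to execute a COUNT(*) query and return the result."""
        return len(self._rows)


class BaseErrorHandler:
    """Simplified stand-in for the shared handler: no backend imports here."""

    @staticmethod
    def count_failure_cases(failure_cases: Any) -> int:
        if isinstance(failure_cases, str):  # avoid returning the str length
            return 1
        try:
            return len(failure_cases)
        except TypeError:
            return 1


class LazyBackendErrorHandler(BaseErrorHandler):
    """Backend-specific handler: only this class knows about lazy tables."""

    @staticmethod
    def count_failure_cases(failure_cases: Any) -> int:
        if isinstance(failure_cases, LazyTableStub):
            return failure_cases.row_count()
        # Everything else falls back to the shared behaviour.
        return BaseErrorHandler.count_failure_cases(failure_cases)
```

With this split, the base class never imports an optional dependency, and each backend decides how to count its own failure cases.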
    Auto-detects narwhals: if narwhals is installed, registers narwhals backends
    (NarwhalsCheckBackend, narwhals ColumnBackend, narwhals DataFrameSchemaBackend)
    and emits a UserWarning. If narwhals is not installed, registers the native
Nit: Don't know why we're not capitalizing Narwhals.
Suggested change:
    - Auto-detects narwhals: if narwhals is installed, registers narwhals backends
    - (NarwhalsCheckBackend, narwhals ColumnBackend, narwhals DataFrameSchemaBackend)
    - and emits a UserWarning. If narwhals is not installed, registers the native
    + Auto-detects Narwhals: if Narwhals is installed, registers Narwhals backends
    + (NarwhalsCheckBackend, Narwhals ColumnBackend, Narwhals DataFrameSchemaBackend)
    + and emits a UserWarning. If Narwhals is not installed, registers the native
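For readers following along, the auto-detection this docstring describes could look roughly like the sketch below. It uses only the standard library; select_backend and its return values are hypothetical names for illustration, and the PR's actual register functions may detect Narwhals differently (the description mentions a try/import).

```python
import importlib.util
import warnings
from functools import lru_cache


@lru_cache(maxsize=None)
def select_backend(optional_module: str = "narwhals") -> str:
    """Return which backend family to register (hypothetical helper).

    Returns the optional module's name when it is importable, else "native".
    The lru_cache makes repeated registration calls a no-op, so the warning
    is emitted at most once per module name.
    """
    # find_spec returns None for a missing top-level module without importing it.
    if importlib.util.find_spec(optional_module) is not None:
        warnings.warn(
            f"{optional_module} is installed; registering its backends.",
            UserWarning,
            stacklevel=2,
        )
        return optional_module
    return "native"
```

Calling select_backend() twice only warns once because the second call is served from the cache.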
pandera/backends/narwhals/base.py
Outdated
    import narwhals.stable.v1 as nw
    import polars as pl

    from pandera.api.base.error_handler import ErrorHandler
Again, seems like we should be importing the Narwhals ErrorHandler here, rather than modifying the base one.
pandera/backends/narwhals/checks.py
Outdated
    )

    @staticmethod
    def _materialize(frame) -> nw.DataFrame:
This is eagerly executing? I think executing early for the check output is not ideal, but still reasonable. However, this is also getting called elsewhere above. Furthermore, the conditional logic seems overly complicated—not sure why this is needed.
    if issubclass(return_type, pl.DataFrame):
        return native.collect()
    return native
What is this for? Type-specific .collect() call seems like a red flag. Once again, Ibis and Polars are following different paths, and that can't be right.
    components = self.collect_schema_components(
        check_lf, schema, column_info
    )
    check_obj_parsed = _to_frame_kind_nw(check_lf, return_type)
Why is the object potentially getting collected here? It seems like, if the user passes a pl.DataFrame, we .collect()—for what reason?
    check_results = []
    check_passed = []
    # Convert to native pl.LazyFrame for column component dispatch.
Not necessarily pl.LazyFrame, right? What if it's an Ibis table?
    ):
        """Collects all schema components to use for validation."""

        from pandera.api.polars.components import Column
Why is something from the Polars backend being used here? This makes no sense.
        column_info: Any,
    ) -> list[CoreCheckResult]:
        """Check that all columns in the schema are present in the dataframe."""
        from pandera.api.narwhals.utils import _to_native
Why is _to_native necessary here? Why isn't this being handled in a backend-agnostic way?
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds NarwhalsSchemaBackend, ColumnBackend, and DataFrameSchemaBackend with lazy-first materialization and drop_invalid_rows support via nw.Expr accumulation.
2a02d7e to 44967b2
Tip
I've used Claude to significantly squash the commits and remove GSD artifacts; however, I'm still working off of the branch that has all of those, now pushed to https://github.com/deepyaman/pandera/tree/feat/narwhals/create-backend-all-artifacts
Prologue
Most of the below description, as well as almost all of the PR, is AI-generated. However, I have been very closely guiding the process and reviewing each step of the way. I have still not deeply reviewed the code in its entirety, which I plan to do.
I've also verified the functionality at a high level manually. With this change, you can run checks using the existing Polars and Ibis APIs, and they leverage the newly-added Narwhals backend. Pretty good for a first pass!
I've currently started exploring adding support for the PySpark backend, but that could be a separate PR. What I think are necessary steps in this PR, before merging:
.planning/ (I left them there for now, both as a backup and in case I wanted to share the context)
Turning it over to my agentic intern...
Warning
The below is outdated and needs updating for milestone v1.1.
Narwhals backend for Polars and Ibis (v1.0)
This PR introduces a unified Narwhals-backed validation engine that replaces library-specific backends for Polars and Ibis with a single shared implementation. Users continue to pass native frames — pandera routes them through Narwhals internally.
Scope: 14 new files, ~2,950 lines of new production code and tests across 5 implementation phases.
What was built
New packages and modules
pandera/api/narwhals/types.py: NarwhalsData NamedTuple (frame, key), the data container passed to builtin checks
pandera/api/narwhals/utils.py: _to_native() helper to unwrap narwhals wrappers
pandera/engines/narwhals_engine.py
pandera/backends/narwhals/checks.py: NarwhalsCheckBackend, which routes builtins through NarwhalsData and user-defined checks through native frame unwrapping or ibis delegation
pandera/backends/narwhals/builtin_checks.py: nw.Expr: equal_to, not_equal_to, greater_than, greater_than_or_equal_to, less_than, less_than_or_equal_to, in_range, isin, notin, str_matches, str_contains, str_startswith, str_endswith, str_length
pandera/backends/narwhals/components.py: ColumnBackend, per-column validation (dtype check, nullable, unique, run_checks)
pandera/backends/narwhals/container.py: DataFrameSchemaBackend, full container validation (parsers, checks, strict/ordered, lazy mode, drop_invalid_rows)
pandera/backends/narwhals/base.py: NarwhalsSchemaBackend, shared helpers: run_check, subsample, failure_cases_metadata, drop_invalid_rows, is_float_dtype
Modified files
pandera/backends/polars/register.py: UserWarning.
pandera/backends/ibis/register.py: @lru_cache (was missing). Registers NarwhalsCheckBackend for ibis.Table / ibis.Column / nw.LazyFrame.
Test suite
tests/backends/narwhals/ (backend-agnostic, parameterized against both Polars and Ibis):
- conftest.py: make_narwhals_frame fixture producing nw.LazyFrame from either pl.LazyFrame or ibis.memtable; autouse fixture that calls both register functions
- test_checks.py: 14 builtin checks × 2 backends (valid + invalid data paths)
- test_components.py: column-level dtype, nullable, unique, and check validation
- test_container.py: full container validation: strict, ordered, lazy mode, failure cases, drop_invalid_rows
- test_parity.py: behavioral parity between Polars and Ibis paths: valid, invalid, lazy, strict, filter, decorator, DataFrameModel
- test_narwhals_dtypes.py: dtype engine registration and coerce/try_coerce
Architecture decisions
Narwhals is internal plumbing, not a user-facing API. Users pass pl.DataFrame, pl.LazyFrame, or ibis.Table; no changes to call sites.
Auto-detection over configuration. register_polars_backends() and register_ibis_backends() check for narwhals via try/import and swap backends transparently. No config flag needed.
SQL-lazy element_wise raises NotImplementedError. Row-level Python functions cannot be applied to lazy query plans (Ibis, DuckDB, PySpark). The error is surfaced as a SchemaError with CHECK_ERROR reason code. NOTE(@deepyaman): See narwhals-dev/narwhals#3512
Ibis drop_invalid_rows delegates to IbisSchemaBackend. Narwhals has no positional-join / row_number abstraction for SQL-lazy backends. NOTE(@deepyaman): I'll verify this later.
How to verify
Install with narwhals:
Polars — object API
Polars — class-based model
Polars — lazy mode (collect all errors before raising)
Ibis — same schema, different backend
Ibis — user-defined check (delegates to IbisCheckBackend)
Known gaps and next steps
- coerce for Ibis: xfail(strict=True), intentionally deferred; will break CI when implemented so the mark gets cleaned up
- add_missing_columns parser
- set_default for Column fields
- group_by().agg() pattern designed; not implemented
- narwhals stable.v2 migration
- sample= subsampling: NotImplementedError; only head=/tail= are supported
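As a rough illustration of the last gap, the head=/tail=-only behaviour might be sketched as follows. This is a toy version over a plain Python sequence, not the PR's subsample helper; the signature and the choice to de-duplicate overlapping head and tail rows by position are assumptions for the example.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def subsample(
    rows: Sequence[T],
    head: Optional[int] = None,
    tail: Optional[int] = None,
    sample: Optional[int] = None,
) -> list[T]:
    """Toy subsample: supports head= and tail=, rejects sample= (assumed behaviour)."""
    if sample is not None:
        # Random sampling is the unsupported path in this sketch.
        raise NotImplementedError("sample= is not supported; use head= or tail=")
    if head is None and tail is None:
        return list(rows)
    n = len(rows)
    keep: set[int] = set()
    if head is not None:
        keep.update(range(min(head, n)))
    if tail is not None:
        keep.update(range(max(n - tail, 0), n))
    # Preserve the original row order and avoid duplicating overlapping rows.
    return [rows[i] for i in sorted(keep)]
```

For example, subsample(rows, head=2, tail=2) keeps the first two and last two rows, while subsample(rows, sample=10) raises NotImplementedError.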