feat: add pyspark 4 support with 50+ new functions by eakmanrq · Pull Request #616 · eakmanrq/sqlframe

eakmanrq · 2026-03-15T21:36:39Z

Summary

Bumps pyspark dependency to >=4,<4.2 (closes fix(deps): update dependency pyspark to v4 #407)
Removes F.JVMView assertion from tests (removed in PySpark 4)
Adds 50+ new PySpark 4 SQL functions to SQLFrame

New Functions Added

Date/Time: current_time, dayname, monthname, make_time, to_time, time_diff, time_trunc, timestamp_diff, try_to_date, try_to_time

Null Handling: nullifzero, zeroifnull

String/UTF-8: collate, collation, is_valid_utf8, validate_utf8, make_valid_utf8, try_validate_utf8, quote, randstr

Aggregate: listagg, listagg_distinct, string_agg, string_agg_distinct, bitmap_and_agg

Random: uniform, uuid, random (alias for rand)

JSON/Variant: parse_json, try_parse_json, schema_of_variant, schema_of_variant_agg, is_variant_null, variant_get, try_variant_get, to_variant_object

XML: from_xml, to_xml, schema_of_xml

Try functions: try_mod, try_parse_url, try_reflect, try_url_decode, try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz

Session: session_user, input_file_block_length, input_file_block_start

Aliases: column=col, chr=char, current_schema=current_database

Engine Support Notes

Functions without wide dialect support are marked unsupported_engines="*" (Spark/Databricks only)
dayname/monthname: DuckDB + Spark supported (not BigQuery/Postgres)
current_time: DuckDB + Postgres + Spark (not BigQuery; Postgres uses LOCALTIME)
timestamp_diff: DuckDB + Spark (sqlglot handles DuckDB translation via DATE_DIFF)
collate: DuckDB + Spark (not Postgres/BigQuery; different collation systems)
uuid: All engines (Postgres uses gen_random_uuid()::text)
uniform: Spark/Databricks/Snowflake only (DuckDB doesn't have UNIFORM)

Test Plan

All 1820 unit tests pass
DuckDB integration tests pass for all new functions
Postgres integration tests pass for all new functions (unsupported ones skip gracefully)
Type checking passes (ty check)
CI will verify BigQuery, Snowflake, Databricks, Spark engines

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR updates SQLFrame’s Spark/PySpark support to PySpark 4.x, expands the available SQL function surface area (50+ new functions), and adjusts unit/integration tests and documentation to reflect the new compatibility baseline.

Changes:

Bump PySpark dependency to >=4,<4.2 and adjust tests for PySpark 4 behavioral changes.
Add new Spark SQL function wrappers (and aliases) to sqlframe/base/functions.py.
Add unit + integration tests and update engine docs to include the new functions / Spark 4.0 support statement.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`sqlframe/base/functions.py`	Adds new PySpark 4 function wrappers (date/time, utf8, xml/json/variant, aggregates, random) and new aliases.
`pyproject.toml`	Updates `pyspark` dependency constraints for dev and spark extras.
`tests/unit/conftest.py`	Removes assertion relying on `F.JVMView` (removed in PySpark 4).
`tests/unit/standalone/test_functions.py`	Adds SQL string-generation unit tests for new functions and adjusts anonymous invocation ignore list.
`tests/integration/fixtures.py`	Updates schema comparison to account for PySpark 4 `StructField.__eq__` / metadata behavior.
`tests/integration/engines/test_int_functions.py`	Updates existing tests for PySpark 4 behavior + adds integration tests for new functions across engines.
`docs/standalone.md`	Updates Spark function support statement to “through 4.0”.
`docs/spark.md`	Updates Spark function support statement to “through 4.0”.
`docs/snowflake.md`	Adds new functions to the Snowflake “supported functions” list.
`docs/postgres.md`	Adds new functions to the Postgres “supported functions” list and fixes formatting.
`docs/duckdb.md`	Adds new functions to the DuckDB “supported functions” list.
`docs/bigquery.md`	Adds new functions to the BigQuery “supported functions” list.
`docs/docs/postgres.md`	Mirrors Postgres docs updates in the generated docs tree.
`docs/docs/duckdb.md`	Mirrors DuckDB docs updates in the generated docs tree.
`docs/docs/bigquery.md`	Mirrors BigQuery docs updates in the generated docs tree.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sqlframe/base/functions.py

+def try_to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column:
+    if format is not None:
+        return Column.invoke_anonymous_function(str, "try_to_time", format)


sqlframe/base/functions.py

+        return Column(expression.Var(this="LOCALTIME"))
+    return Column(expression.CurrentTime())


sqlframe/base/functions.py

+@meta(unsupported_engines="*")
+def from_xml(
+    col: ColumnOrName,
+    schema: t.Union["StructType", Column, str],
+    options: t.Optional[t.Mapping[str, str]] = None,
+) -> Column:
+    if isinstance(schema, Column):
+        schema_col = schema
+    elif isinstance(schema, str):
+        schema_col = lit(schema)
+    else:
+        schema_col = lit(schema.simpleString())
+    return Column.invoke_anonymous_function(col, "from_xml", schema_col)


sqlframe/base/functions.py

+@meta(unsupported_engines="*")
+def schema_of_xml(
+    xml: t.Union[Column, str], options: t.Optional[t.Mapping[str, str]] = None
+) -> Column:


sqlframe/base/functions.py

+def to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column:
+    if format is not None:
+        return Column.invoke_anonymous_function(str, "to_time", format)


- Bump pyspark dependency to >=4,<4.2 in dev and spark extras - Add 50+ new PySpark 4 functions: current_time, dayname, monthname, collate, collation, timestamp_diff, time_diff, time_trunc, make_time, to_time, try_to_date, try_to_time, nullifzero, zeroifnull, session_user, uniform, randstr, uuid, listagg, listagg_distinct, string_agg, string_agg_distinct, parse_json, try_parse_json, schema_of_variant, schema_of_variant_agg, is_variant_null, variant_get, try_variant_get, to_variant_object, from_xml, to_xml, schema_of_xml, bitmap_and_agg, is_valid_utf8, validate_utf8, make_valid_utf8, try_validate_utf8, try_mod, try_parse_url, try_reflect, try_url_decode, try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz, input_file_block_length, input_file_block_start, quote, column alias - Add aliases: column=col, random=rand, chr=char, current_schema=current_database, string_agg=listagg, string_agg_distinct=listagg_distinct - Remove F.JVMView assertion (removed in PySpark 4) - Add unit and integration tests for all new functions - Locally verified DuckDB and Postgres engines work correctly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…yname assertion - Use getattr with default False for _is_postgres in current_time() to avoid AttributeError when called from PySpark's native SparkSession (not SQLFrame) - Fix test_dayname to accept both 'Wednesday' and 'Wed' since Spark returns abbreviated day names while DuckDB returns full names Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- collate(): use Var instead of Literal for collation name — Spark 4 requires unquoted COLLATE syntax (e.g. COLLATE UTF8_LCASE not COLLATE 'UTF8_LCASE') - test_timestamp_add: use getattr() for _is_postgres check on SparkSession - test_make_ym_interval: skip for spark/pyspark — PySpark 4 does not implement YearMonthIntervalType.fromInternal - test_to_unix_timestamp: skip null-return assertion for spark/pyspark — PySpark 4 raises DateTimeException instead of returning NULL for unparseable values - test_monthname: accept both 'April' and 'Apr' (Spark returns abbreviated form) - compare_schemas: clear struct_field.metadata instead of func_metadata — PySpark 4 now includes metadata in StructField equality checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- test_timestamp_add: use isinstance(session, PostgresSession) consistently with the rest of the file instead of fragile getattr() attribute check - uuid(): use expression.Uuid() for the no-seed case and direct expression.Anonymous for seeded case — removes invoke_anonymous_function - Remove collate from ignore_funcs (uses expression.Collate directly, not anonymous) - Re-document remaining ignore_funcs entries with concrete reasons: - listagg/string_agg: GroupConcat always emits a separator; anonymous preserves optional no-separator LISTAGG(col) behavior - parse_json: ParseJSON in Spark dialect is a no-op; anonymous emits PARSE_JSON(col) - time_diff: exp.TimeDiff generates TIMEDIFF, not Spark's TIME_DIFF(unit, start, end) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All added functions are unsupported_engines="*" (Spark-only) so these tests run against pyspark and spark engines in CI. Functions covered: - bitmap_and_agg, collation, from_xml - input_file_block_length, input_file_block_start - is_valid_utf8, is_variant_null - listagg, listagg_distinct, string_agg_distinct - make_time, make_valid_utf8 - parse_json, quote, randstr - schema_of_variant, schema_of_variant_agg, schema_of_xml - time_diff, time_trunc, to_time - to_variant_object, to_xml - try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz - try_mod, try_parse_json, try_parse_url, try_reflect - try_to_date, try_to_time, try_url_decode - try_validate_utf8, try_variant_get - validate_utf8, variant_get Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PySpark 4 changed StructField.__eq__ to use __dict__ comparison and started populating metadata with internal keys (e.g. __autoGeneratedAlias) on aggregation columns. The previous func_metadata={} was a no-op that happened to work in PySpark 3 (which compared fields individually). Clearing the real metadata attribute fixes the equality check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace weak `is not None` / `isinstance(result, str)` checks with exact expected values from PySpark documentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Skip TIME-type tests (make_time, time_diff, time_trunc, to_time, try_to_time) for Spark/PySpark: PySpark 4 does not support TIME type - Skip try_make_interval for Spark/PySpark: CalendarIntervalType.fromInternal not implemented in PySpark 4 - Skip listagg_distinct/string_agg_distinct for SQLFrame SparkSession: Spark SQL has no LISTAGG_DISTINCT function - Fix VARIANT tests to use parse_json() function instead of session.sql() to preserve VARIANT type through SQLFrame's SQL translation layer - Fix parse_json/try_parse_json tests: pass plain strings to variant_get (not lit() Columns) to match native PySpark variant_get signature - Fix try_reflect assertion: reflect always returns a string, not int - Fix to_xml() implementation: pass options as MAP literal in SQL so rowTag and other options are applied in the Spark SQL execution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- BigQuery: add nullifzero, session_user, uuid, zeroifnull - DuckDB: add collate, current_time, dayname, monthname, nullifzero, session_user, timestamp_diff, uuid, zeroifnull - Postgres: add current_time, nullifzero, session_user, uuid, zeroifnull - Snowflake: add collate, current_time, dayname, monthname, nullifzero, session_user, timestamp_diff, uniform, uuid, zeroifnull - Spark/Standalone: bump supported functions version from 3.5 to 4.0 Functions marked unsupported_engines="*" (Spark/PySpark-only) are not listed in engine-specific docs since they require native Spark execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- current_time: implement precision parameter — emit CURRENT_TIME(n) or LOCALTIME(n) instead of silently ignoring precision argument - from_xml: pass options as MAP literal to SQL, matching from_json behavior - schema_of_xml: pass options as MAP literal to SQL - to_time/try_to_time: fix format parameter handling — accept str and wrap with lit() so plain strings become SQL literals, not column refs; also rename parameter from 'str' to 'col' to avoid shadowing built-in Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eakmanrq requested a review from Copilot March 16, 2026 01:50

Copilot started reviewing on behalf of eakmanrq March 16, 2026 01:53 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

eakmanrq and others added 11 commits March 15, 2026 19:51

style: apply ruff-format to integration test file

08258c4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: strengthen assertions in PySpark 4 integration tests

ecce2b1

Replace weak `is not None` / `isinstance(result, str)` checks with exact expected values from PySpark documentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eakmanrq force-pushed the eakmanrq/pyspark4-functions-support branch from d4beb4a to 8f3856c Compare March 16, 2026 02:51

eakmanrq merged commit 28c5ba0 into main Mar 17, 2026
4 checks passed

themattmorris mentioned this pull request Mar 17, 2026

Enable VariantType #355

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add pyspark 4 support with 50+ new functions#616

feat: add pyspark 4 support with 50+ new functions#616
eakmanrq merged 11 commits intomainfrom
eakmanrq/pyspark4-functions-support

eakmanrq commented Mar 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		return Column(expression.Var(this="LOCALTIME"))
		return Column(expression.CurrentTime())

Conversation

eakmanrq commented Mar 15, 2026

Summary

New Functions Added

Engine Support Notes

Test Plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants