
feat: add pyspark 4 support with 50+ new functions #616

Merged
eakmanrq merged 11 commits into main from eakmanrq/pyspark4-functions-support
Mar 17, 2026

Conversation

@eakmanrq
Owner

Summary

New Functions Added

Date/Time: current_time, dayname, monthname, make_time, to_time, time_diff, time_trunc, timestamp_diff, try_to_date, try_to_time

Null Handling: nullifzero, zeroifnull

String/UTF-8: collate, collation, is_valid_utf8, validate_utf8, make_valid_utf8, try_validate_utf8, quote, randstr

Aggregate: listagg, listagg_distinct, string_agg, string_agg_distinct, bitmap_and_agg

Random: uniform, uuid, random (alias for rand)

JSON/Variant: parse_json, try_parse_json, schema_of_variant, schema_of_variant_agg, is_variant_null, variant_get, try_variant_get, to_variant_object

XML: from_xml, to_xml, schema_of_xml

Try functions: try_mod, try_parse_url, try_reflect, try_url_decode, try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz

Session: session_user, input_file_block_length, input_file_block_start

Aliases: column=col, chr=char, current_schema=current_database
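
To illustrate the semantics of the simplest additions, the null-handling pair follows standard SQL behavior. A plain-Python sketch of what the SQL functions compute (not SQLFrame code):

```python
def nullifzero(x):
    """Mirror SQL NULLIFZERO: yield NULL (None) when the value is zero."""
    return None if x == 0 else x

def zeroifnull(x):
    """Mirror SQL ZEROIFNULL: substitute 0 for NULL (None)."""
    return 0 if x is None else x

print(nullifzero(0), nullifzero(5))     # None 5
print(zeroifnull(None), zeroifnull(7))  # 0 7
```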

Engine Support Notes

  • Functions without wide dialect support are marked unsupported_engines="*" (Spark/Databricks only)
  • dayname/monthname: DuckDB + Spark supported (not BigQuery/Postgres)
  • current_time: DuckDB + Postgres + Spark (not BigQuery; Postgres uses LOCALTIME)
  • timestamp_diff: DuckDB + Spark (sqlglot handles DuckDB translation via DATE_DIFF)
  • collate: DuckDB + Spark (not Postgres/BigQuery; different collation systems)
  • uuid: All engines (Postgres uses gen_random_uuid()::text)
  • uniform: Spark/Databricks/Snowflake only (DuckDB doesn't have UNIFORM)
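
The `unsupported_engines` marker mentioned above can be pictured as a decorator that attaches engine metadata to each wrapper. A simplified sketch of the pattern (hypothetical, not SQLFrame's actual `meta` implementation):

```python
import typing as t

def meta(unsupported_engines: t.Optional[str] = None):
    """Attach engine-support metadata to a function (simplified sketch)."""
    def decorator(func):
        func.unsupported_engines = unsupported_engines
        return func
    return decorator

@meta(unsupported_engines="*")  # "*" = Spark/Databricks only
def uniform(min_val, max_val):
    """Placeholder body; the real wrapper would build a Column expression."""
    ...

print(uniform.unsupported_engines)  # *
```

Doc generators and tests can then read `unsupported_engines` off each function to decide which engine pages list it and which integration tests skip it.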

Test Plan

  • All 1820 unit tests pass
  • DuckDB integration tests pass for all new functions
  • Postgres integration tests pass for all new functions (unsupported ones skip gracefully)
  • Type checking passes (ty check)
  • CI will verify BigQuery, Snowflake, Databricks, Spark engines

🤖 Generated with Claude Code


Copilot AI left a comment


Pull request overview

This PR updates SQLFrame’s Spark/PySpark support to PySpark 4.x, expands the available SQL function surface area (50+ new functions), and adjusts unit/integration tests and documentation to reflect the new compatibility baseline.

Changes:

  • Bump PySpark dependency to >=4,<4.2 and adjust tests for PySpark 4 behavioral changes.
  • Add new Spark SQL function wrappers (and aliases) to sqlframe/base/functions.py.
  • Add unit + integration tests and update engine docs to include the new functions / Spark 4.0 support statement.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
sqlframe/base/functions.py Adds new PySpark 4 function wrappers (date/time, utf8, xml/json/variant, aggregates, random) and new aliases.
pyproject.toml Updates pyspark dependency constraints for dev and spark extras.
tests/unit/conftest.py Removes assertion relying on F.JVMView (removed in PySpark 4).
tests/unit/standalone/test_functions.py Adds SQL string-generation unit tests for new functions and adjusts anonymous invocation ignore list.
tests/integration/fixtures.py Updates schema comparison to account for PySpark 4 StructField.__eq__ / metadata behavior.
tests/integration/engines/test_int_functions.py Updates existing tests for PySpark 4 behavior + adds integration tests for new functions across engines.
docs/standalone.md Updates Spark function support statement to “through 4.0”.
docs/spark.md Updates Spark function support statement to “through 4.0”.
docs/snowflake.md Adds new functions to the Snowflake “supported functions” list.
docs/postgres.md Adds new functions to the Postgres “supported functions” list and fixes formatting.
docs/duckdb.md Adds new functions to the DuckDB “supported functions” list.
docs/bigquery.md Adds new functions to the BigQuery “supported functions” list.
docs/docs/postgres.md Mirrors Postgres docs updates in the generated docs tree.
docs/docs/duckdb.md Mirrors DuckDB docs updates in the generated docs tree.
docs/docs/bigquery.md Mirrors BigQuery docs updates in the generated docs tree.


Comment on lines +7537 to +7539

def try_to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column:
    if format is not None:
        return Column.invoke_anonymous_function(str, "try_to_time", format)

Comment on lines +7225 to +7226

        return Column(expression.Var(this="LOCALTIME"))
    return Column(expression.CurrentTime())

Comment on lines +7234 to +7246

@meta(unsupported_engines="*")
def from_xml(
    col: ColumnOrName,
    schema: t.Union["StructType", Column, str],
    options: t.Optional[t.Mapping[str, str]] = None,
) -> Column:
    if isinstance(schema, Column):
        schema_col = schema
    elif isinstance(schema, str):
        schema_col = lit(schema)
    else:
        schema_col = lit(schema.simpleString())
    return Column.invoke_anonymous_function(col, "from_xml", schema_col)

@meta(unsupported_engines="*")
def schema_of_xml(
    xml: t.Union[Column, str], options: t.Optional[t.Mapping[str, str]] = None
) -> Column:

Comment on lines +7404 to +7406

def to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column:
    if format is not None:
        return Column.invoke_anonymous_function(str, "to_time", format)
eakmanrq and others added 11 commits March 15, 2026 19:51
- Bump pyspark dependency to >=4,<4.2 in dev and spark extras
- Add 50+ new PySpark 4 functions: current_time, dayname, monthname,
  collate, collation, timestamp_diff, time_diff, time_trunc, make_time,
  to_time, try_to_date, try_to_time, nullifzero, zeroifnull, session_user,
  uniform, randstr, uuid, listagg, listagg_distinct, string_agg,
  string_agg_distinct, parse_json, try_parse_json, schema_of_variant,
  schema_of_variant_agg, is_variant_null, variant_get, try_variant_get,
  to_variant_object, from_xml, to_xml, schema_of_xml, bitmap_and_agg,
  is_valid_utf8, validate_utf8, make_valid_utf8, try_validate_utf8,
  try_mod, try_parse_url, try_reflect, try_url_decode, try_make_interval,
  try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz,
  input_file_block_length, input_file_block_start, quote, column alias
- Add aliases: column=col, random=rand, chr=char, current_schema=current_database,
  string_agg=listagg, string_agg_distinct=listagg_distinct
- Remove F.JVMView assertion (removed in PySpark 4)
- Add unit and integration tests for all new functions
- Locally verified DuckDB and Postgres engines work correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…yname assertion

- Use getattr with default False for _is_postgres in current_time() to avoid
  AttributeError when called from PySpark's native SparkSession (not SQLFrame)
- Fix test_dayname to accept both 'Wednesday' and 'Wed' since Spark returns
  abbreviated day names while DuckDB returns full names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collate(): use Var instead of Literal for collation name — Spark 4 requires
  unquoted COLLATE syntax (e.g. COLLATE UTF8_LCASE not COLLATE 'UTF8_LCASE')
- test_timestamp_add: use getattr() for _is_postgres check on SparkSession
- test_make_ym_interval: skip for spark/pyspark — PySpark 4 does not implement
  YearMonthIntervalType.fromInternal
- test_to_unix_timestamp: skip null-return assertion for spark/pyspark — PySpark 4
  raises DateTimeException instead of returning NULL for unparseable values
- test_monthname: accept both 'April' and 'Apr' (Spark returns abbreviated form)
- compare_schemas: clear struct_field.metadata instead of func_metadata — PySpark 4
  now includes metadata in StructField equality checks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_timestamp_add: use isinstance(session, PostgresSession) consistently
  with the rest of the file instead of fragile getattr() attribute check
- uuid(): use expression.Uuid() for the no-seed case and direct
  expression.Anonymous for seeded case — removes invoke_anonymous_function
- Remove collate from ignore_funcs (uses expression.Collate directly, not anonymous)
- Re-document remaining ignore_funcs entries with concrete reasons:
  - listagg/string_agg: GroupConcat always emits a separator; anonymous preserves
    optional no-separator LISTAGG(col) behavior
  - parse_json: ParseJSON in Spark dialect is a no-op; anonymous emits PARSE_JSON(col)
  - time_diff: exp.TimeDiff generates TIMEDIFF, not Spark's TIME_DIFF(unit, start, end)
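
The separator point can be made concrete with a toy rendering comparison (hypothetical helpers, not sqlglot internals):

```python
def group_concat_sql(col: str, separator: str = ",") -> str:
    # A GroupConcat-style node always renders a separator argument,
    # even when the caller never supplied one.
    return f"LISTAGG({col}, '{separator}')"

def anonymous_sql(name: str, *args: str) -> str:
    # Anonymous invocation renders exactly the arguments it was given,
    # so the optional no-separator form LISTAGG(col) survives.
    return f"{name}({', '.join(args)})"

print(group_concat_sql("col"))          # LISTAGG(col, ',')
print(anonymous_sql("LISTAGG", "col"))  # LISTAGG(col)
```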

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All added functions are unsupported_engines="*" (Spark-only) so these tests
run against pyspark and spark engines in CI.

Functions covered:
- bitmap_and_agg, collation, from_xml
- input_file_block_length, input_file_block_start
- is_valid_utf8, is_variant_null
- listagg, listagg_distinct, string_agg_distinct
- make_time, make_valid_utf8
- parse_json, quote, randstr
- schema_of_variant, schema_of_variant_agg, schema_of_xml
- time_diff, time_trunc, to_time
- to_variant_object, to_xml
- try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz
- try_mod, try_parse_json, try_parse_url, try_reflect
- try_to_date, try_to_time, try_url_decode
- try_validate_utf8, try_variant_get
- validate_utf8, variant_get

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PySpark 4 changed StructField.__eq__ to use __dict__ comparison and
started populating metadata with internal keys (e.g. __autoGeneratedAlias)
on aggregation columns. The previous func_metadata={} was a no-op that
happened to work in PySpark 3 (which compared fields individually).
Clearing the real metadata attribute fixes the equality check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
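
The equality change described above can be reproduced with a minimal mimic (not PySpark's actual class):

```python
class StructField:
    """Minimal mimic of the PySpark 4 behavior described above:
    __eq__ compares __dict__, so metadata differences break equality."""
    def __init__(self, name, dataType, metadata=None):
        self.name = name
        self.dataType = dataType
        self.metadata = metadata or {}

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

left = StructField("x", "int", {"__autoGeneratedAlias": "true"})
right = StructField("x", "int")
assert left != right   # internal metadata key breaks the comparison
left.metadata = {}     # clearing the real attribute restores equality
assert left == right
```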
Replace weak `is not None` / `isinstance(result, str)` checks with
exact expected values from PySpark documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Skip TIME-type tests (make_time, time_diff, time_trunc, to_time,
  try_to_time) for Spark/PySpark: PySpark 4 does not support TIME type
- Skip try_make_interval for Spark/PySpark: CalendarIntervalType.fromInternal
  not implemented in PySpark 4
- Skip listagg_distinct/string_agg_distinct for SQLFrame SparkSession:
  Spark SQL has no LISTAGG_DISTINCT function
- Fix VARIANT tests to use parse_json() function instead of session.sql()
  to preserve VARIANT type through SQLFrame's SQL translation layer
- Fix parse_json/try_parse_json tests: pass plain strings to variant_get
  (not lit() Columns) to match native PySpark variant_get signature
- Fix try_reflect assertion: reflect always returns a string, not int
- Fix to_xml() implementation: pass options as MAP literal in SQL so
  rowTag and other options are applied in the Spark SQL execution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BigQuery: add nullifzero, session_user, uuid, zeroifnull
- DuckDB: add collate, current_time, dayname, monthname, nullifzero,
  session_user, timestamp_diff, uuid, zeroifnull
- Postgres: add current_time, nullifzero, session_user, uuid, zeroifnull
- Snowflake: add collate, current_time, dayname, monthname, nullifzero,
  session_user, timestamp_diff, uniform, uuid, zeroifnull
- Spark/Standalone: bump supported functions version from 3.5 to 4.0

Functions marked unsupported_engines="*" (Spark/PySpark-only) are not
listed in engine-specific docs since they require native Spark execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- current_time: implement precision parameter — emit CURRENT_TIME(n)
  or LOCALTIME(n) instead of silently ignoring precision argument
- from_xml: pass options as MAP literal to SQL, matching from_json behavior
- schema_of_xml: pass options as MAP literal to SQL
- to_time/try_to_time: fix format parameter handling — accept str and
  wrap with lit() so plain strings become SQL literals, not column refs;
  also rename parameter from 'str' to 'col' to avoid shadowing built-in

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
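
The to_time/try_to_time format fix can be sketched with toy stand-ins for Column and lit() (hypothetical, not SQLFrame's classes):

```python
class Column:
    """Toy stand-in for a column/expression wrapper."""
    def __init__(self, sql: str):
        self.sql = sql

def lit(value) -> Column:
    # Render a plain Python value as a quoted SQL literal.
    return Column(f"'{value}'")

def to_time(col: str, format=None) -> str:
    # The fix described above: wrap a plain str so it becomes a SQL
    # literal rather than being treated as a column reference.
    if isinstance(format, str):
        format = lit(format)
    args = [col] + ([format.sql] if format else [])
    return f"TO_TIME({', '.join(args)})"

print(to_time("t", "HH:mm:ss"))  # TO_TIME(t, 'HH:mm:ss')
print(to_time("t"))              # TO_TIME(t)
```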
@eakmanrq eakmanrq force-pushed the eakmanrq/pyspark4-functions-support branch from d4beb4a to 8f3856c on March 16, 2026 02:51
@eakmanrq eakmanrq merged commit 28c5ba0 into main on Mar 17, 2026
4 checks passed
@themattmorris themattmorris mentioned this pull request Mar 17, 2026
