feat: add pyspark 4 support with 50+ new functions#616
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates SQLFrame’s Spark/PySpark support to PySpark 4.x, expands the available SQL function surface area (50+ new functions), and adjusts unit/integration tests and documentation to reflect the new compatibility baseline.
Changes:
- Bump PySpark dependency to
>=4,<4.2and adjust tests for PySpark 4 behavioral changes. - Add new Spark SQL function wrappers (and aliases) to
sqlframe/base/functions.py. - Add unit + integration tests and update engine docs to include the new functions / Spark 4.0 support statement.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
sqlframe/base/functions.py |
Adds new PySpark 4 function wrappers (date/time, utf8, xml/json/variant, aggregates, random) and new aliases. |
pyproject.toml |
Updates pyspark dependency constraints for dev and spark extras. |
tests/unit/conftest.py |
Removes assertion relying on F.JVMView (removed in PySpark 4). |
tests/unit/standalone/test_functions.py |
Adds SQL string-generation unit tests for new functions and adjusts anonymous invocation ignore list. |
tests/integration/fixtures.py |
Updates schema comparison to account for PySpark 4 StructField.__eq__ / metadata behavior. |
tests/integration/engines/test_int_functions.py |
Updates existing tests for PySpark 4 behavior + adds integration tests for new functions across engines. |
docs/standalone.md |
Updates Spark function support statement to “through 4.0”. |
docs/spark.md |
Updates Spark function support statement to “through 4.0”. |
docs/snowflake.md |
Adds new functions to the Snowflake “supported functions” list. |
docs/postgres.md |
Adds new functions to the Postgres “supported functions” list and fixes formatting. |
docs/duckdb.md |
Adds new functions to the DuckDB “supported functions” list. |
docs/bigquery.md |
Adds new functions to the BigQuery “supported functions” list. |
docs/docs/postgres.md |
Mirrors Postgres docs updates in the generated docs tree. |
docs/docs/duckdb.md |
Mirrors DuckDB docs updates in the generated docs tree. |
docs/docs/bigquery.md |
Mirrors BigQuery docs updates in the generated docs tree. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sqlframe/base/functions.py
Outdated
Comment on lines
+7537
to
+7539
| def try_to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column: | ||
| if format is not None: | ||
| return Column.invoke_anonymous_function(str, "try_to_time", format) |
Comment on lines
+7225
to
+7226
| return Column(expression.Var(this="LOCALTIME")) | ||
| return Column(expression.CurrentTime()) |
Comment on lines
+7234
to
+7246
| @meta(unsupported_engines="*") | ||
| def from_xml( | ||
| col: ColumnOrName, | ||
| schema: t.Union["StructType", Column, str], | ||
| options: t.Optional[t.Mapping[str, str]] = None, | ||
| ) -> Column: | ||
| if isinstance(schema, Column): | ||
| schema_col = schema | ||
| elif isinstance(schema, str): | ||
| schema_col = lit(schema) | ||
| else: | ||
| schema_col = lit(schema.simpleString()) | ||
| return Column.invoke_anonymous_function(col, "from_xml", schema_col) |
| @meta(unsupported_engines="*") | ||
| def schema_of_xml( | ||
| xml: t.Union[Column, str], options: t.Optional[t.Mapping[str, str]] = None | ||
| ) -> Column: |
sqlframe/base/functions.py
Outdated
Comment on lines
+7404
to
+7406
| def to_time(str: ColumnOrName, format: t.Optional[ColumnOrName] = None) -> Column: | ||
| if format is not None: | ||
| return Column.invoke_anonymous_function(str, "to_time", format) |
- Bump pyspark dependency to >=4,<4.2 in dev and spark extras - Add 50+ new PySpark 4 functions: current_time, dayname, monthname, collate, collation, timestamp_diff, time_diff, time_trunc, make_time, to_time, try_to_date, try_to_time, nullifzero, zeroifnull, session_user, uniform, randstr, uuid, listagg, listagg_distinct, string_agg, string_agg_distinct, parse_json, try_parse_json, schema_of_variant, schema_of_variant_agg, is_variant_null, variant_get, try_variant_get, to_variant_object, from_xml, to_xml, schema_of_xml, bitmap_and_agg, is_valid_utf8, validate_utf8, make_valid_utf8, try_validate_utf8, try_mod, try_parse_url, try_reflect, try_url_decode, try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz, input_file_block_length, input_file_block_start, quote, column alias - Add aliases: column=col, random=rand, chr=char, current_schema=current_database, string_agg=listagg, string_agg_distinct=listagg_distinct - Remove F.JVMView assertion (removed in PySpark 4) - Add unit and integration tests for all new functions - Locally verified DuckDB and Postgres engines work correctly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…yname assertion - Use getattr with default False for _is_postgres in current_time() to avoid AttributeError when called from PySpark's native SparkSession (not SQLFrame) - Fix test_dayname to accept both 'Wednesday' and 'Wed' since Spark returns abbreviated day names while DuckDB returns full names Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collate(): use Var instead of Literal for collation name — Spark 4 requires unquoted COLLATE syntax (e.g. COLLATE UTF8_LCASE not COLLATE 'UTF8_LCASE') - test_timestamp_add: use getattr() for _is_postgres check on SparkSession - test_make_ym_interval: skip for spark/pyspark — PySpark 4 does not implement YearMonthIntervalType.fromInternal - test_to_unix_timestamp: skip null-return assertion for spark/pyspark — PySpark 4 raises DateTimeException instead of returning NULL for unparseable values - test_monthname: accept both 'April' and 'Apr' (Spark returns abbreviated form) - compare_schemas: clear struct_field.metadata instead of func_metadata — PySpark 4 now includes metadata in StructField equality checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_timestamp_add: use isinstance(session, PostgresSession) consistently
with the rest of the file instead of fragile getattr() attribute check
- uuid(): use expression.Uuid() for the no-seed case and direct
expression.Anonymous for seeded case — removes invoke_anonymous_function
- Remove collate from ignore_funcs (uses expression.Collate directly, not anonymous)
- Re-document remaining ignore_funcs entries with concrete reasons:
- listagg/string_agg: GroupConcat always emits a separator; anonymous preserves
optional no-separator LISTAGG(col) behavior
- parse_json: ParseJSON in Spark dialect is a no-op; anonymous emits PARSE_JSON(col)
- time_diff: exp.TimeDiff generates TIMEDIFF, not Spark's TIME_DIFF(unit, start, end)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All added functions are unsupported_engines="*" (Spark-only) so these tests run against pyspark and spark engines in CI. Functions covered: - bitmap_and_agg, collation, from_xml - input_file_block_length, input_file_block_start - is_valid_utf8, is_variant_null - listagg, listagg_distinct, string_agg_distinct - make_time, make_valid_utf8 - parse_json, quote, randstr - schema_of_variant, schema_of_variant_agg, schema_of_xml - time_diff, time_trunc, to_time - to_variant_object, to_xml - try_make_interval, try_make_timestamp, try_make_timestamp_ltz, try_make_timestamp_ntz - try_mod, try_parse_json, try_parse_url, try_reflect - try_to_date, try_to_time, try_url_decode - try_validate_utf8, try_variant_get - validate_utf8, variant_get Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PySpark 4 changed StructField.__eq__ to use __dict__ comparison and
started populating metadata with internal keys (e.g. __autoGeneratedAlias)
on aggregation columns. The previous func_metadata={} was a no-op that
happened to work in PySpark 3 (which compared fields individually).
Clearing the real metadata attribute fixes the equality check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace weak `is not None` / `isinstance(result, str)` checks with exact expected values from PySpark documentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Skip TIME-type tests (make_time, time_diff, time_trunc, to_time, try_to_time) for Spark/PySpark: PySpark 4 does not support TIME type - Skip try_make_interval for Spark/PySpark: CalendarIntervalType.fromInternal not implemented in PySpark 4 - Skip listagg_distinct/string_agg_distinct for SQLFrame SparkSession: Spark SQL has no LISTAGG_DISTINCT function - Fix VARIANT tests to use parse_json() function instead of session.sql() to preserve VARIANT type through SQLFrame's SQL translation layer - Fix parse_json/try_parse_json tests: pass plain strings to variant_get (not lit() Columns) to match native PySpark variant_get signature - Fix try_reflect assertion: reflect always returns a string, not int - Fix to_xml() implementation: pass options as MAP literal in SQL so rowTag and other options are applied in the Spark SQL execution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BigQuery: add nullifzero, session_user, uuid, zeroifnull - DuckDB: add collate, current_time, dayname, monthname, nullifzero, session_user, timestamp_diff, uuid, zeroifnull - Postgres: add current_time, nullifzero, session_user, uuid, zeroifnull - Snowflake: add collate, current_time, dayname, monthname, nullifzero, session_user, timestamp_diff, uniform, uuid, zeroifnull - Spark/Standalone: bump supported functions version from 3.5 to 4.0 Functions marked unsupported_engines="*" (Spark/PySpark-only) are not listed in engine-specific docs since they require native Spark execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- current_time: implement precision parameter — emit CURRENT_TIME(n) or LOCALTIME(n) instead of silently ignoring precision argument - from_xml: pass options as MAP literal to SQL, matching from_json behavior - schema_of_xml: pass options as MAP literal to SQL - to_time/try_to_time: fix format parameter handling — accept str and wrap with lit() so plain strings become SQL literals, not column refs; also rename parameter from 'str' to 'col' to avoid shadowing built-in Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d4beb4a to
8f3856c
Compare
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
>=4,<4.2(closes fix(deps): update dependency pyspark to v4 #407)F.JVMViewassertion from tests (removed in PySpark 4)New Functions Added
Date/Time:
current_time,dayname,monthname,make_time,to_time,time_diff,time_trunc,timestamp_diff,try_to_date,try_to_timeNull Handling:
nullifzero,zeroifnullString/UTF-8:
collate,collation,is_valid_utf8,validate_utf8,make_valid_utf8,try_validate_utf8,quote,randstrAggregate:
listagg,listagg_distinct,string_agg,string_agg_distinct,bitmap_and_aggRandom:
uniform,uuid,random(alias for rand)JSON/Variant:
parse_json,try_parse_json,schema_of_variant,schema_of_variant_agg,is_variant_null,variant_get,try_variant_get,to_variant_objectXML:
from_xml,to_xml,schema_of_xmlTry functions:
try_mod,try_parse_url,try_reflect,try_url_decode,try_make_interval,try_make_timestamp,try_make_timestamp_ltz,try_make_timestamp_ntzSession:
session_user,input_file_block_length,input_file_block_startAliases:
column=col,chr=char,current_schema=current_databaseEngine Support Notes
unsupported_engines="*"(Spark/Databricks only)dayname/monthname: DuckDB + Spark supported (not BigQuery/Postgres)current_time: DuckDB + Postgres + Spark (not BigQuery; Postgres uses LOCALTIME)timestamp_diff: DuckDB + Spark (sqlglot handles DuckDB translation via DATE_DIFF)collate: DuckDB + Spark (not Postgres/BigQuery; different collation systems)uuid: All engines (Postgres usesgen_random_uuid()::text)uniform: Spark/Databricks/Snowflake only (DuckDB doesn't have UNIFORM)Test Plan
ty check)🤖 Generated with Claude Code