Skip to content

chore(typing): Use SQLFrame instead of PySpark for _spark_like internally#2190

Merged
dangotbanned merged 19 commits intonarwhals-dev:mainfrom
MarcoGorelli:replace-pyspark-typing
Mar 13, 2025
Merged

chore(typing): Use SQLFrame instead of PySpark for _spark_like internally#2190
dangotbanned merged 19 commits intonarwhals-dev:mainfrom
MarcoGorelli:replace-pyspark-typing

Conversation

@MarcoGorelli
Copy link
Copy Markdown
Member

@MarcoGorelli MarcoGorelli commented Mar 11, 2025

closes #2185

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 11, 2025 19:20
SQLFrameDataFrame: TypeAlias = _SQLFrameDataFrame[Any, Any, Any, Any, Any]
_NativeDataFrame: TypeAlias = "DataFrame | SQLFrameDataFrame"
NativeFrame = BaseDataFrame[Any, Any, Any, Any, Any]
NativeSession = _BaseSession[Any, Any, Any, Any, Any, Any, Any]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I'll ever get used to seeing 7 TypeVars 😂

@dangotbanned
Copy link
Copy Markdown
Member

Thanks for the updates @MarcoGorelli

Do you mind holding off on merging?
I'd like to have a look in an IDE today - just to be confident we're not regressing anywhere 🙂

@dangotbanned dangotbanned self-assigned this Mar 12, 2025
Comment on lines +732 to +734
native_object, # NOTE: this is leaking the internal lie
# error: Argument of type "DataFrame" cannot be assigned to parameter "native_dataframe" of type "SQLFrameDataFrame" in function "__init__"
# "DataFrame" is not assignable to "BaseDataFrame[Any, Any, Any, Any, Any]" (reportArgumentType)
Copy link
Copy Markdown
Member

@dangotbanned dangotbanned Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli this seems like it'll impact LazyFrame[T] - since we need to say that'll only be SQLFrameDataFrame for SparkLikeLazyFrame

Copy link
Copy Markdown
Member

@dangotbanned dangotbanned Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2185 (comment)

Turned out that while Series isn't an issue - Frame may well be

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how so?

I tried

# ruff: noqa
from __future__ import annotations

import narwhals as nw
import pandas as pd
import polars as pl

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb.dataframe import DuckDBDataFrame
import sqlframe.duckdb.functions as F
from pyspark.sql.dataframe import DataFrame as SparkDataFrame

def func(a: SparkDataFrame) -> None:
    reveal_type(nw.from_native(a).to_native())
def func2(a: DuckDBDataFrame) -> None:
    reveal_type(nw.from_native(a).to_native())

and get

  /home/marcogorelli/polars-api-compat-dev/t.py:14:17 - information: Type of "nw.from_native(a).to_native()" is "DataFrame"
  /home/marcogorelli/polars-api-compat-dev/t.py:16:17 - information: Type of "nw.from_native(a).to_native()" is "DuckDBDataFrame"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try changing that to this and you'll see it matches DataFrame and not LazyFrame

def func(a: SparkDataFrame) -> None:
    b = nw.from_native(a)
    reveal_type(b)
    c = b.to_native()
    reveal_type(c)

Type of "b" is "narwhals.dataframe.DataFrame[pyspark.sql.dataframe.DataFrame]"

Bit confused on this one though 🤔

Type of "c" is "DataFrame"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli just in case you missed this while writing #2190 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thanks

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this an issue on main as well:

(.310venv) marcogorelli@DESKTOP-U8OKFP3:~/polars-api-compat-dev$ cat t.py 
# ruff: noqa
from __future__ import annotations

import narwhals as nw
import pandas as pd
import polars as pl

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb.dataframe import DuckDBDataFrame
import sqlframe.duckdb.functions as F
from pyspark.sql.dataframe import DataFrame as SparkDataFrame

def func(a: SparkDataFrame) -> None:
    reveal_type(nw.from_native(a).to_native())
    reveal_type(nw.from_native(a))
def func2(a: DuckDBDataFrame) -> None:
    reveal_type(nw.from_native(a).to_native())
    reveal_type(nw.from_native(a))
(.310venv) marcogorelli@DESKTOP-U8OKFP3:~/polars-api-compat-dev$ pyright t.py 
/home/marcogorelli/polars-api-compat-dev/t.py
  /home/marcogorelli/polars-api-compat-dev/t.py:14:17 - information: Type of "nw.from_native(a).to_native()" is "DataFrame"
  /home/marcogorelli/polars-api-compat-dev/t.py:15:17 - information: Type of "nw.from_native(a)" is "narwhals.dataframe.DataFrame[pyspark.sql.dataframe.DataFrame]"
  /home/marcogorelli/polars-api-compat-dev/t.py:17:17 - information: Type of "nw.from_native(a).to_native()" is "DuckDBDataFrame"
  /home/marcogorelli/polars-api-compat-dev/t.py:18:17 - information: Type of "nw.from_native(a)" is "DataFrame[DuckDBDataFrame]"
0 errors, 0 warnings, 4 informations 

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this an issue on main as well

Yeah I also noticed it seems to happen for any LazyFrame when passed without any keywords

More motivation towards #2116 I suppose

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking in (#2194)

@dangotbanned
Copy link
Copy Markdown
Member

dangotbanned commented Mar 12, 2025

@MarcoGorelli more of a general note

A lot of the stuff I've touched is because I still have pyspark installed locally - which I think you may not anymore?

We still need to have that factored into the CI - even if you want to speed things up locally.
Otherwise it'll make it harder for contributors to work with pyspark - without conflicting with CI on where to ignore.
If you remember 😉 #1963 (comment)

@MarcoGorelli
Copy link
Copy Markdown
Member Author

We still need to have that factored into the CI - even if you want to speed things up locally.
Otherwise it'll make it harder for contributors to work with pyspark - without conflicting with CI on where to ignore.
If you remember 😉 #1963 (comment)

could you elaborate please? can't we run pyspark tests in CI, but just use sqlframe for internal type checking?

@dangotbanned
Copy link
Copy Markdown
Member

We still need to have that factored into the CI - even if you want to speed things up locally.
Otherwise it'll make it harder for contributors to work with pyspark - without conflicting with CI on where to ignore.
If you remember 😉 #1963 (comment)

could you elaborate please?
@MarcoGorelli

Sure thing
With pyspark installed - I'm getting this warning in translate.py

native_object,

Screenshot 1

image

Which is a result of this change

_NativeDataFrame: TypeAlias = "DataFrame | SQLFrameDataFrame"

native_dataframe: _NativeDataFrame,

Screenshot 2

image

This is what I meant in (#2185 (comment)) when I said

The problem is they're all connected.

Screenshot 3

image


can't we run pyspark tests in CI
@MarcoGorelli

This is what I want - but the fact that pyright is passing in CI is what I want to fix.
It should be exiting on that error:

>>> pyright
c:\Users\danie\Documents\GitHub\narwhals\narwhals\translate.py
  c:\Users\danie\Documents\GitHub\narwhals\narwhals\translate.py:732:17 - error: Argument of type "DataFrame" cannot be assigned to parameter "native_dataframe" of type "SQLFrameDataFrame" in function "__init__"
    "DataFrame" is not assignable to "BaseDataFrame[Any, Any, Any, Any, Any]" (reportArgumentType)
1 error, 0 warnings, 0 informations 

I'm thinking the problem is in these changes:

https://github.com/narwhals-dev/narwhals/pull/2190/files#diff-8da2d7afa6668fbe866ba5f8a8f24dd448aa57df874700c659e8ad93abf29478
https://github.com/narwhals-dev/narwhals/pull/2190/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711

@MarcoGorelli
Copy link
Copy Markdown
Member Author

is it ok if we just type ignore / pyright ignore that line in translate? I think that's what I'd have expected if as you rightly note we're lying about the native objects always being sqlframe ones. But I think that's OK, if passing type-checking is a requirement for any pull request, and the APIs are similar enough, then it seems OK to ignore an internal line in translate.py

From having tried #2190 (comment), it looks like it all works fine with pyspark installed from the user's perspective

@dangotbanned
Copy link
Copy Markdown
Member

is it ok if we just type ignore / pyright ignore that line in translate?

Yeah that's fine if it isn't causing an issue externally.

But we do need to have pyspark installed in CI for typing.
The workflow isn't slow with it and it would've caught that warning
https://github.com/narwhals-dev/narwhals/actions/workflows/typing.yml

@dangotbanned dangotbanned removed their assignment Mar 12, 2025
validate_column_names: bool = False,
) -> None:
self._native_frame = native_dataframe
self._native_frame: SQLFrameDataFrame = native_dataframe
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an annotation to try and trigger an [arg-type] from mypy.

Still not sure why pyright is the only one to detect (d5e899a)

Copy link
Copy Markdown
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with where everything is at now @MarcoGorelli

Thanks for proposing and working on this 🎉

@dangotbanned dangotbanned merged commit 7611bd4 into narwhals-dev:main Mar 13, 2025
29 checks passed
@dangotbanned dangotbanned changed the title type: Use SQLFrame instead of PySpark to type _spark_like internally chore(typing): Use SQLFrame instead of PySpark for _spark_like internally Mar 13, 2025
@MarcoGorelli
Copy link
Copy Markdown
Member Author

Awesome, thanks for your review and help here Dan, much appreciated! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

typing: use SQLFrame instead of PySpark for typing

2 participants