[SPARK-28837][SQL] CTAS/RTAS should use nullable schema #25536
Conversation
Test build #109492 has finished for PR 25536 at commit
+1
```scala
case (catalog, identifier) =>
  spark.sql(s"CREATE TABLE $identifier USING foo AS SELECT 1 i")
  val table = catalog.loadTable(Identifier.of(Array(), "table_name"))
```
Minor and non-blocking: "table_name" is repeated many times; would it be better to make it a test class variable that each test case references?
Sorry, maybe I'm being too nitpicky lol
PR looks good to me :)
To me it reads better as-is: you don't have to jump around too much when reading the code.
In your PR https://github.com/apache/spark/pull/25507/files, DataSourceV2DataFrameSessionCatalogSuite.scala has a class variable `v2Format` that is referenced by all test cases:

```scala
protected val v2Format: String = classOf[InMemoryTableProvider].getName
```

so I thought it was a style convention...
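For illustration, a minimal sketch of the suggested refactor inside the suite (the test name and `testcat.` prefix are hypothetical, not from the PR):

```scala
import org.apache.spark.sql.connector.catalog.Identifier
import org.apache.spark.sql.types.{IntegerType, StructType}

// Hypothetical refactor: hoist the repeated literal into a suite-level val,
// mirroring the v2Format convention referenced above.
protected val tableName: String = "table_name"

test("CTAS should create the table with a nullable schema") {
  spark.sql(s"CREATE TABLE testcat.$tableName USING foo AS SELECT 1 i")
  val table = catalog.loadTable(Identifier.of(Array(), tableName))
  // SELECT 1 produces a non-nullable column, but the created table's
  // schema should be relaxed to nullable by this PR.
  assert(table.schema === new StructType().add("i", IntegerType, nullable = true))
}
```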
I'm also a +1 on this (and the PR LGTM). However, a follow-up question: can we ensure that the written data really has a nullable schema as well? I've seen many Parquet files corrupted with nulls written to fields that are non-nullable according to Parquet's schema.
I've not hit this problem. What version of Parquet are you using? Usually, if the Spark plan is wrong about nullability, we get NPEs thrown in codegen sections because of assignments to primitive types, though that doesn't affect strings and decimals.
```scala
}

Utils.tryWithSafeFinallyAndFailureCallbacks({
  val schema = query.schema.asNullable
```
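For context, a minimal sketch of the transformation `asNullable` performs (to my knowledge `asNullable` is `private[spark]`, so user code outside Spark's packages would relax nullability by hand, as below):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("i", IntegerType, nullable = false),
  StructField("s", StringType, nullable = true)))

// For a flat schema this matches asNullable: mark every field nullable.
// (The real asNullable also recurses into nested struct/array/map types.)
val relaxed = StructType(schema.fields.map(_.copy(nullable = true)))

assert(relaxed.forall(_.nullable)) // StructType is a Seq[StructField]
```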
Don't we need to update the schema nullability of the corresponding logical plans, CreateTableAsSelect and ReplaceTable, in the analyzer phase? Is there any reason to update the nullability directly in the physical plans?
I think it's too much work if we need to transform the logical plan and add an extra Project to change the nullability.
I think it would be incorrect to change the logical plan. The behavior of CTAS should be that tables are created with nullable types. The query used by CTAS should not be changed.
Ah, I see. Thanks. Looks ok to me.
+1; the change looks reasonable to me.

thanks for your reviews! merging to master
Can CTAS support creating a table with an explicit schema? Spark doesn't seem to support this. Can Spark support it?
I don't think this behavior is reasonable. The table schema should be the same as the DataFrame's schema; if the schema differs from what we want, we should explicitly modify it and then create the table, instead of creating a table whose schema doesn't match the query's, which can be confusing. @maropu
@sketchmind Note that we don't insert into a table only once. It doesn't make sense to define the table schema from its first insertion (CTAS). I think this is also why Postgres uses the same design.
This is not only a problem with the first insert; a DataFrame written through the DataSource createOrReplace API is also affected by this logic, which makes it impossible to create a NOT NULL table.
But people can always create an empty table first and then insert? If we really want to support this case, we should allow people to define the table schema in CTAS (or the Scala createOrReplace) so that it can override the DataFrame schema.
createOrReplace (create an empty table first and then insert) is a very frequently used function; it's just that most tables are not sensitive to whether a column is NULL or not, especially in Hive, so this issue is not widely mentioned.
@cloud-fan how about adding a configuration to allow users to control the behavior?
Can you explain the use cases? It seems people agreed with the SQL behavior (consistent with Postgres) but had concerns about the Scala API. Note that Spark also provides Scala APIs to create a table, so you can keep the nullability of the query schema.
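For illustration, a minimal sketch of that route, assuming a hypothetical table `t` and an existing DataFrame `df`: create the table with an explicit non-nullable schema up front, then write into it, instead of relying on CTAS.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Declare the schema explicitly, keeping the NOT NULL constraint.
val schema = StructType(Seq(StructField("i", IntegerType, nullable = false)))

// Create the table first (Catalog API), then insert the query result,
// so CTAS's nullable-schema relaxation never applies.
spark.catalog.createTable("t", source = "parquet", schema = schema,
  options = Map.empty[String, String])
df.write.insertInto("t")
```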
I am stuck in the SQL case. The background: ClickHouse is extremely fast for single wide-table OLAP queries, and some data engineers want to use Spark to do the heavy data preparation and save the result as a temp table into ClickHouse through pure SQL. Usually the result set is quite large and won't see append/update/delete during the table's lifecycle. Since NULL is not good for performance, ClickHouse has quite strict restrictions on table schema, e.g. the sorting keys are not allowed to be nullable:
https://clickhouse.com/docs/en/sql-reference/data-types/nullable

In the above use case, things become simple if the connector knows and respects the nullability of the DataFrame's schema on CTAS.
Can you split your SQL CTAS into CREATE TABLE and INSERT? Or ALTER the table later to change column nullability? It seems like a special case for ClickHouse, and I'm a bit reluctant to change Spark's behavior to fit it.
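For illustration, a sketch of the suggested split in Spark SQL (table and column names are hypothetical):

```scala
// Declare nullability explicitly at CREATE time, then populate the table.
spark.sql("CREATE TABLE t (k INT NOT NULL, v STRING) USING parquet")
spark.sql("INSERT INTO t SELECT id AS k, name AS v FROM src")

// Some v2 catalogs also allow relaxing nullability after the fact:
// spark.sql("ALTER TABLE t ALTER COLUMN k DROP NOT NULL")
```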
Yes, it's a workaround, but it's not convenient since users need to explicitly list every column's definition.

I'm afraid not; the CTAS failed because the generated schema violates the restriction. Back to the case mentioned in this PR description, I think table schema evolution is an essential feature of modern data lake table formats. BTW, I tested the provided CTAS case on MySQL 8, and the result is different.
This is a good point. If Spark can do schema evolution during table insertion, then having a non-nullable table schema in CTAS is not a problem anymore. We need a bit of design around how data sources can claim support for schema evolution, and what happens if they do want to enforce the non-nullable property.
### What changes were proposed in this pull request?
Add a new method `useNullableQuerySchema` in `Table`, to allow the DataSource implementation to declare whether they need to preserve schema nullability on CTAS/RTAS.

### Why are the changes needed?
SPARK-28837 forcibly uses the nullable schema on CTAS/RTAS, which seems too aggressive:
1. The existing mature RDBMSs have different behaviors for preserving schema nullability on CTAS/RTAS. As mentioned in #25536, PostgreSQL forcibly uses the nullable schema, but MySQL respects the query's output schema nullability.
2. Some OLAP systems (e.g. ClickHouse) are perf-sensitive to nullable columns and have strict restrictions on table schema, e.g. the primary keys are not allowed to be nullable.

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds a new DSv2 API, but the default implementation preserves backward compatibility.

### How was this patch tested?
UTs are updated.

Closes #41070 from pan3793/SPARK-43390.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
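For illustration, a sketch of the semantics this follow-up describes (the helper name and its placement are assumptions based on the commit message above, not the merged code):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical decision point in CTAS/RTAS planning: honor the source's
// declared preference instead of unconditionally relaxing nullability.
def tableSchemaFor(querySchema: StructType, useNullableQuerySchema: Boolean): StructType = {
  if (useNullableQuerySchema) {
    // Assumed default: the SPARK-28837 behavior, relax all fields to nullable
    // (top-level only here; Spark's internal asNullable also recurses).
    StructType(querySchema.fields.map(_.copy(nullable = true)))
  } else {
    // Opt-out for sources like ClickHouse: keep the query's nullability.
    querySchema
  }
}
```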
### What changes were proposed in this pull request?
When running CTAS/RTAS, use the nullable schema of the input query to create the table.
### Why are the changes needed?
It's very likely to run CTAS/RTAS with a non-nullable input query, e.g. `CREATE TABLE t AS SELECT 1`. However, it's surprising to users if they can't write null to this table later. Non-nullability is a kind of constraint on the column and should be specified by users explicitly. For reference, Postgres also uses a nullable schema for CTAS. File source V1 has the same behavior.
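A minimal sketch of the user-facing effect (table name hypothetical):

```scala
// The literal 1 is non-nullable in the query's output schema, but after
// this PR the created table's column is nullable.
spark.sql("CREATE TABLE t USING parquet AS SELECT 1 AS i")
spark.sql("INSERT INTO t VALUES (NULL)") // succeeds: column i is nullable
```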
### Does this PR introduce any user-facing change?
Yes, after this PR CTAS/RTAS creates tables with a nullable schema, so users can insert null values later.

### How was this patch tested?
New test.