Skip to content

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented May 5, 2023

What changes were proposed in this pull request?

Add a new method useNullableQuerySchema in TableCatalog, to allow the DataSource implementation to declare whether they need to reserve schema nullability on CTAS/RTAS.

Why are the changes needed?

SPARK-28837 forcibly uses the nullable schema on CTAS/RTAS, which seems too aggressive:

  1. The existing matured RDBMSs have different behaviors for reserving schema nullability on CTAS/RTAS, as mentioned in [SPARK-28837][SQL] CTAS/RTAS should use nullable schema #25536, PostgreSQL forcibly uses the nullable schema, but MySQL respects the query's output schema nullability.
  2. Some OLAP systems(e.g. ClickHouse) are perf-sensitive for nullable, and have strict restrictions on table schema, e.g. the primary keys are not allowed to be nullable.

Does this PR introduce any user-facing change?

Yes, this PR adds a new DSv2 API, but the default implementation reserves backward compatibility.

How was this patch tested?

UTs are updated.

@pan3793
Copy link
Member Author

pan3793 commented May 5, 2023

cc @cloud-fan @sunchao @beliefer

* Return whether to reserve schema nullability of query output or forcibly use nullable schema
* on creating table implicitly, e.g. CTAS/RTAS.
*/
default boolean createTableReserveSchemaNullability() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

/**
 * If true, mark all the fields of the query schema as nullable when executing
 * CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
 */
default boolean useNullableQuerySchema

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, changed

}
}

class TestCreateTableReserveSchemaNullabilityCatalog extends InMemoryCatalog {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about ForbidSchemaNullabilityCatalog ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"forbid" seems not accurate here, would prefer to use "reserve" to represent respecting the query schema nullability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I means the name seems too long. Could you rename it with simple one ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "ReserveSchemaNullabilityCatalog"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to "ReserveSchemaNullabilityCatalog"

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ee9e36d May 9, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
### What changes were proposed in this pull request?

Add a new method `useNullableQuerySchema` in `Table`, to allow the DataSource implementation to declare whether they need to reserve schema nullability on CTAS/RTAS.

### Why are the changes needed?

SPARK-28837 forcibly uses the nullable schema on CTAS/RTAS, which seems too aggressive:

1. The existing matured RDBMSs have different behaviors for reserving schema nullability on CTAS/RTAS, as mentioned in apache#25536, PostgreSQL forcibly uses the nullable schema, but MySQL respects the query's output schema nullability.
2. Some OLAP systems(e.g. ClickHouse) are perf-sensitive for nullable, and have strict restrictions on table schema, e.g. the primary keys are not allowed to be nullable.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new DSv2 API, but the default implementation reserves backward compatibility.

### How was this patch tested?

UTs are updated.

Closes apache#41070 from pan3793/SPARK-43390.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants