
Support for Spark Catalogs and ErrorIfExists save mode.#19

Merged
MaxKsyunz merged 29 commits into integ/save_mode_simple from dev/save_mode_simple_graph_separate on Mar 2, 2026
Conversation

@MaxKsyunz

Changes:

  • Connector implements Catalog API.
  • ErrorIfExists and Ignore save modes are supported (via Catalog API).
  • Separate format (`cloud-spanner-graph`) and table provider for Spanner Graphs.

Introducing a separate format for graphs was necessary because the current implementation tightly couples the Spark table definition and the graph query. Supporting the Catalog API for Spanner graphs with one Spark table provider would require significant refactoring of graph support.

Specifically, I ran into a case where loading a graph table and then querying it resulted in type conversion errors, because the query code treats Id columns as `String` even though they are defined as `INT64`.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
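The two save modes can be illustrated with a minimal sketch of their semantics (plain Java, no Spark dependency; the table name and the `write` helper are hypothetical, but the behavior follows Spark's SaveMode contract: ErrorIfExists fails when the target table already exists, Ignore silently skips the write):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the SaveMode semantics the connector now supports via
// the Catalog API. The table name and in-memory "catalog" are hypothetical.
public class SaveModeSketch {
  enum SaveMode { ERROR_IF_EXISTS, IGNORE }

  static final Set<String> existingTables = new HashSet<>();

  // Returns true if a write happened, false if it was skipped.
  static boolean write(String table, SaveMode mode) {
    if (existingTables.contains(table)) {
      switch (mode) {
        case ERROR_IF_EXISTS:
          throw new IllegalStateException("Table already exists: " + table);
        case IGNORE:
          return false; // silently skip the write
      }
    }
    existingTables.add(table); // create the table and write the data
    return true;
  }

  public static void main(String[] args) {
    System.out.println(write("Singers", SaveMode.ERROR_IF_EXISTS)); // first write succeeds
    System.out.println(write("Singers", SaveMode.IGNORE));          // second write is skipped
  }
}
```

In the connector itself the existence check is performed through the Catalog API rather than an in-memory set.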
- Add a script to set up a database for integration tests.
- Add SPANNER_USE_EXISTING_DATABASE env variable to tell test framework to not create a new database.
- Update CONTRIBUTING.md with details on how to run tests.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
This was a case-sensitivity issue (the option was converted to lower case, then put back into a case-sensitive map), so `Map<String, String>` was replaced with `CaseInsensitiveStringMap`.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
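The bug can be reproduced without Spark: a plain map misses a mixed-case lookup once the key has been lower-cased, while a map ordered by `String.CASE_INSENSITIVE_ORDER` (standing in here for Spark's `CaseInsensitiveStringMap`) still finds it. The `graphquery` option value is made up for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the case-sensitivity issue: Spark may hand the connector option
// keys normalized to lower case, so a lookup with the original mixed-case
// key misses in a case-sensitive map.
public class OptionLookup {
  // Wrap options in a case-insensitive map; a TreeMap with
  // String.CASE_INSENSITIVE_ORDER stands in for Spark's CaseInsensitiveStringMap.
  static Map<String, String> caseInsensitive(Map<String, String> options) {
    Map<String, String> m = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    m.putAll(options);
    return m;
  }

  public static void main(String[] args) {
    // Key arrived lower-cased from Spark.
    Map<String, String> options = Map.of("graphquery", "GRAPH FinGraph MATCH (n) RETURN n");

    System.out.println(options.get("graphQuery"));                  // null: case-sensitive miss
    System.out.println(caseInsensitive(options).get("graphQuery")); // value is found
  }
}
```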
Simplifies option management.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
…ces right now.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Note that SpannerCatalog.createTable looks for columns with "spanner.primaryKey" metadata set to true and uses those columns to define the primary key on the table it creates.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
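As a rough sketch of that createTable behavior (plain Java instead of Spark's StructType metadata; the helper and column names are hypothetical, and identifier quoting is omitted for brevity):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: build a Spanner CREATE TABLE statement where columns flagged as
// primary-key columns (via "spanner.primaryKey" metadata in the real
// connector) form the PRIMARY KEY clause.
public class CreateTableSketch {
  record Column(String name, String type, boolean primaryKey) {}

  static String createTableDdl(String table, List<Column> columns) {
    String cols = columns.stream()
        .map(c -> c.name() + " " + c.type())
        .collect(Collectors.joining(", "));
    String pk = columns.stream()
        .filter(Column::primaryKey)
        .map(Column::name)
        .collect(Collectors.joining(", "));
    return "CREATE TABLE " + table + " (" + cols + ") PRIMARY KEY (" + pk + ")";
  }

  public static void main(String[] args) {
    List<Column> cols = List.of(
        new Column("SingerId", "INT64", true),
        new Column("Name", "STRING(MAX)", false));
    System.out.println(createTableDdl("Singers", cols));
    // CREATE TABLE Singers (SingerId INT64, Name STRING(MAX)) PRIMARY KEY (SingerId)
  }
}
```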
- Check whether the database is PostgreSQL and lower-case the table name, since that is what the PostgreSQL server does.
- Include primary key information in the SpannerTable.schema return value so that it is identical to the schema struct passed on creation.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
… PostgreSql databases.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
@MaxKsyunz
Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant enhancements by adding support for Spark's Catalog API, enabling DDL operations and new save modes like ErrorIfExists and Ignore. The refactoring to separate graph support into its own data source format (cloud-spanner-graph) is a solid architectural improvement. The code changes, including the new SpannerCatalog and updates to related classes, are well-structured and accompanied by a comprehensive set of new tests. I have a few suggestions to improve the robustness of the new test setup script and a minor style improvement in the catalog implementation.

MaxKsyunz and others added 3 commits February 27, 2026 18:59
…nner/SpannerCatalog.java

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…nner/integration/SparkSpannerTableProviderIntegrationTestBase.java

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@MaxKsyunz MaxKsyunz self-assigned this Feb 28, 2026
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
@MaxKsyunz MaxKsyunz merged commit 034e703 into integ/save_mode_simple Mar 2, 2026
1 check passed
@MaxKsyunz MaxKsyunz deleted the dev/save_mode_simple_graph_separate branch March 2, 2026 21:24
MaxKsyunz added a commit that referenced this pull request Mar 2, 2026
Main Changes:
- Connector implements Catalog API.
- ErrorIfExists and Ignore save modes are supported (via Catalog API).
- Separate format (`cloud-spanner-graph`) and table provider for
Spanner Graphs.

Introducing a separate format for graphs was necessary because the current
implementation tightly couples the Spark table definition and the graph query.
Supporting the Catalog API for Spanner graphs and tables with one Spark
table provider would require significant refactoring of graph support.

Specifically, I ran into a case where loading a graph table and then
querying resulted in type conversion errors because query code casts Id
columns to `String` even though they are defined as `INT64`.

---------

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
MaxKsyunz added a commit that referenced this pull request Mar 9, 2026
…oogleCloudDataproc#170)

* Support for Spark Catalogs and ErrorIfExists save mode. (#19)

Main Changes:
- Connector implements Catalog API.
- ErrorIfExists and Ignore save modes are supported (via Catalog API).
- Separate format (`cloud-spanner-graph`) and table provider for
Spanner Graphs.

Introducing a separate format for graphs was necessary because the current
implementation tightly couples the Spark table definition and the graph query.
Supporting the Catalog API for Spanner graphs and tables with one Spark
table provider would require significant refactoring of graph support.

Specifically, I ran into a case where loading a graph table and then
querying resulted in type conversion errors because query code casts Id
columns to `String` even though they are defined as `INT64`.

---------

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update scripts/setup-test-db.sh

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Spotless cleanup

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Addressing Gemini feedback

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Refactor SpannerCatalog to support graph tables in `cloud-spanner` format (#22)

- `SpannerCatalogTableProviderBase.extractIdentifier` encodes the dataframe options supported by graphs into the returned identifier.
- These are: `graph`, `type`, `configs`, `graphQuery`, `timestamp`, `viewsEnabled`.
- `SpannerCatalog.loadTable` extracts these options to correctly instantiate `SpannerGraph` class.
- Removed classes related to `cloud-spanner-graph`.
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
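The round-trip idea behind encoding options into the identifier can be sketched as follows (the `name;key=value` separator scheme here is purely illustrative, not the connector's actual format, and the helper names are hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of round-tripping graph dataframe options through a table
// identifier string, in the spirit of extractIdentifier/loadTable above.
public class IdentifierCodec {
  // Append options to the table name as ";key=value" pairs.
  static String encode(String table, Map<String, String> options) {
    if (options.isEmpty()) return table;
    return table + ";" + options.entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(";"));
  }

  // Recover the options from an encoded identifier (the table name is parts[0]).
  static Map<String, String> decodeOptions(String identifier) {
    Map<String, String> options = new TreeMap<>();
    String[] parts = identifier.split(";");
    for (int i = 1; i < parts.length; i++) {
      String[] kv = parts[i].split("=", 2);
      options.put(kv[0], kv[1]);
    }
    return options;
  }

  public static void main(String[] args) {
    String id = encode("PersonGraph", Map.of("type", "node", "graph", "FinGraph"));
    System.out.println(decodeOptions(id)); // {graph=FinGraph, type=node}
  }
}
```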

* Support arrays in SpannerCatalog.createTable. (#23)

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* get emulatorHost from current properties.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* quote identifiers used to construct create and drop DDL statements.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Quote table names in constructed SQL statements

Other refactorings and clean-up

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
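A minimal sketch of such identifier quoting (the dialect switch and escaping rules here are assumptions, not the connector's actual code; GoogleSQL quotes identifiers with backticks, PostgreSQL with double quotes):

```java
// Sketch: quote an identifier before splicing it into CREATE/DROP DDL, so
// reserved words and mixed-case names survive. Dialect handling is assumed.
public class IdentifierQuoter {
  enum Dialect { GOOGLE_SQL, POSTGRESQL }

  static String quote(String identifier, Dialect dialect) {
    if (dialect == Dialect.POSTGRESQL) {
      // PostgreSQL escapes an embedded double quote by doubling it.
      return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }
    // GoogleSQL escapes an embedded backtick with a backslash.
    return "`" + identifier.replace("`", "\\`") + "`";
  }

  public static void main(String[] args) {
    System.out.println("DROP TABLE " + quote("Order", Dialect.POSTGRESQL)); // DROP TABLE "Order"
    System.out.println("DROP TABLE " + quote("Order", Dialect.GOOGLE_SQL)); // DROP TABLE `Order`
  }
}
```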

* Resolve merge conflicts with main.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Update examples

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Validate projectId, databaseId, and instanceId as passed by the user

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* fix copyright year.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Separating dataframe-based and catalog-based integration tests

- Make sure most integration tests continue to exercise the dataframe API without a catalog
- Add integration tests specifically for catalog-based write operations
- Store dataframe options for the writer as a case-insensitive map. Spark options are case-insensitive and sometimes arrive normalized to all lower case.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Small improvements to ErrorIfExists and Ignore integration tests

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Examples of writing using catalog.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

* Rewrite the testErrorIfExistsSaveMode and testIgnoreSaveMode tests for clarity

They both perform the same write operation twice with different results.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>

---------

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>