Support for Spark Catalogs and ErrorIfExists save mode. #19
Merged
MaxKsyunz merged 29 commits into integ/save_mode_simple on Mar 2, 2026
Conversation
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
- Add a script to set up a database for integration tests.
- Add a SPANNER_USE_EXISTING_DATABASE env variable to tell the test framework not to create a new database.
- Update CONTRIBUTING.md with details on how to run tests.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
This was a case-sensitivity issue (the option was converted to lower case, then back into a case-sensitive map), so Map<String, String> was replaced with CaseInsensitiveStringMap. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
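The lookup failure described above can be reproduced with a minimal stdlib sketch. Spark's `CaseInsensitiveStringMap` behaves like a map keyed with a case-insensitive comparator; the class and option names below are illustrative, not the connector's actual code.

```java
import java.util.TreeMap;

public class CaseInsensitiveOptions {
    // Build a case-insensitive option map, mirroring what Spark's
    // CaseInsensitiveStringMap provides for data source options.
    static TreeMap<String, String> options(String key, String value) {
        TreeMap<String, String> opts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        opts.put(key, value);
        return opts;
    }

    public static void main(String[] args) {
        TreeMap<String, String> opts = options("graphQuery", "MATCH (n) RETURN n");
        // Lookup succeeds even after Spark normalizes the key to lower case.
        System.out.println(opts.get("graphquery"));
    }
}
```

With a plain `HashMap`, the same lower-cased lookup would return `null`, which is the bug the commit fixes.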
Simplifies option management. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
…ces right now. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Note that SpannerCatalog.createTable looks for a column with "spanner.primaryKey" metadata set to true and uses this column (or columns) to define the primary key on the table it creates. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
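The createTable behavior described above can be sketched in plain Java: scan the column descriptors for the "spanner.primaryKey" metadata flag and emit a Spanner-style CREATE TABLE statement with a PRIMARY KEY clause. The `Column` record and `createTableDdl` helper are hypothetical stand-ins for the connector's Spark StructField handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PrimaryKeyDdl {
    // Hypothetical column descriptor: name, Spanner type, and metadata flags.
    record Column(String name, String type, Map<String, String> metadata) {}

    // Columns whose metadata sets "spanner.primaryKey" to true
    // become the created table's primary key.
    static String createTableDdl(String table, List<Column> cols) {
        List<String> defs = new ArrayList<>();
        List<String> pk = new ArrayList<>();
        for (Column c : cols) {
            defs.add("`" + c.name() + "` " + c.type());
            if ("true".equalsIgnoreCase(c.metadata().getOrDefault("spanner.primaryKey", "false"))) {
                pk.add("`" + c.name() + "`");
            }
        }
        // Identifiers are back-quoted, as a later commit in this PR does
        // for the generated CREATE and DROP statements.
        return "CREATE TABLE `" + table + "` (" + String.join(", ", defs)
                + ") PRIMARY KEY (" + String.join(", ", pk) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createTableDdl("Singers", List.of(
                new Column("SingerId", "INT64", Map.of("spanner.primaryKey", "true")),
                new Column("Name", "STRING(MAX)", Map.of()))));
    }
}
```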
- Check if the database is PostgreSQL and lower-case the table name, since that is what the PostgreSQL server does.
- Include primary key information in the SpannerTable.schema return value so that it is identical to the schema struct passed on creation.

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
… PostgreSql databases. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Author commented: /gemini review
Code Review
This pull request introduces significant enhancements by adding support for Spark's Catalog API, enabling DDL operations and new save modes like ErrorIfExists and Ignore. The refactoring to separate graph support into its own data source format (cloud-spanner-graph) is a solid architectural improvement. The code changes, including the new SpannerCatalog and updates to related classes, are well-structured and accompanied by a comprehensive set of new tests. I have a few suggestions to improve the robustness of the new test setup script and a minor style improvement in the catalog implementation.
Review threads (resolved):
- spark-3.1-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SpannerCatalog.java
- ...com/google/cloud/spark/spanner/integration/SparkSpannerTableProviderIntegrationTestBase.java
…nner/SpannerCatalog.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…nner/integration/SparkSpannerTableProviderIntegrationTestBase.java Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
MaxKsyunz added a commit that referenced this pull request on Mar 2, 2026
Main Changes:
- Connector implements the Catalog API.
- ErrorIfExists and Ignore save modes are supported (via the Catalog API).
- Separate format (`cloud-spanner-graph`) and table provider for Spanner Graphs.

Introducing a separate format for graphs was necessary because the current implementation tightly couples the Spark table definition and the graph query. Supporting the Catalog API for Spanner graphs and tables with one Spark table provider would require significant refactoring of graph support. Specifically, I ran into a case where loading a graph table and then querying it resulted in type conversion errors, because the query code casts Id columns to `String` even though they are defined as `INT64`.

---------

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
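The semantics of the two new save modes can be sketched without Spark: ErrorIfExists fails when the target table already exists, while Ignore turns the write into a no-op. This is a minimal stdlib illustration of the contract, not the connector's implementation; the `write` helper and the table-set standing in for the catalog are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

public class SaveModeDemo {
    enum SaveMode { ERROR_IF_EXISTS, IGNORE, OVERWRITE, APPEND }

    // Returns true if data was written, false if the write was skipped.
    static boolean write(Set<String> existingTables, String table, SaveMode mode) {
        if (existingTables.contains(table)) {
            if (mode == SaveMode.ERROR_IF_EXISTS) {
                throw new IllegalStateException("Table already exists: " + table);
            }
            if (mode == SaveMode.IGNORE) {
                return false; // silently skip, as the Ignore contract requires
            }
        }
        existingTables.add(table); // create-if-missing, then write
        return true;
    }

    public static void main(String[] args) {
        Set<String> catalog = new HashSet<>();
        System.out.println(write(catalog, "Singers", SaveMode.ERROR_IF_EXISTS)); // true
        System.out.println(write(catalog, "Singers", SaveMode.IGNORE));          // false
    }
}
```

A second ErrorIfExists write against the same table would throw, which is exactly the behavior the PR's testErrorIfExistsSaveMode and testIgnoreSaveMode integration tests exercise.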
MaxKsyunz added a commit that referenced this pull request on Mar 9, 2026
…oogleCloudDataproc#170)

* Support for Spark Catalogs and ErrorIfExists save mode. (#19)

  Main Changes:
  - Connector implements the Catalog API.
  - ErrorIfExists and Ignore save modes are supported (via the Catalog API).
  - Separate format (`cloud-spanner-graph`) and table provider for Spanner Graphs.

  Introducing a separate format for graphs was necessary because the current implementation tightly couples the Spark table definition and the graph query. Supporting the Catalog API for Spanner graphs and tables with one Spark table provider would require significant refactoring of graph support. Specifically, I ran into a case where loading a graph table and then querying it resulted in type conversion errors, because the query code casts Id columns to `String` even though they are defined as `INT64`.

* Update scripts/setup-test-db.sh

* Spotless cleanup

* Addressing Gemini feedback

* Refactor SpannerCatalog to support graph tables in `cloud-spanner` format (#22)
  - `SpannerCatalogTableProviderBase.extractIdentifier` encodes the dataframe options supported by graphs into the returned identifier. These are: `graph`, `type`, `configs`, `graphQuery`, `timestamp`, `viewsEnabled`.
  - `SpannerCatalog.loadTable` extracts these options to correctly instantiate the `SpannerGraph` class.
  - Removed classes related to `cloud-spanner-graph`.

* Support arrays in SpannerCatalog.createTable. (#23)

* Get emulatorHost from current properties.

* Quote identifiers used to construct CREATE and DROP DDL statements.

* Quote the table name in constructed SQL statements. Other refactorings and clean-up.

* Resolve merge conflicts with main.

* Update examples.

* Validate projectId, databaseId, and instanceId as passed by the user.

* Fix copyright year.

* Separate dataframe-based and catalog-based integration tests
  - Make sure most integration tests continue to exercise the dataframe API without a catalog.
  - Add integration tests specifically for catalog-based write operations.
  - Store dataframe options for the writer as a case-insensitive map; Spark options are case-insensitive and sometimes arrive normalized to all lower case.

* Small improvements to ErrorIfExists and Ignore integration tests.

* Examples of writing using a catalog.

* Re-write the testErrorIfExistsSaveMode and testIgnoreSaveMode tests for clarity. They both perform the same write operation twice with different results.

---------

Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>