Refactor SpannerCatalog to support graph tables #22
MaxKsyunz merged 9 commits into integ/save_mode_simple
Conversation
…ctly generate table schema. Also remove cloud-spanner-graph implementation as it is no longer necessary. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
/gemini review
Code Review
This pull request refactors the Spanner catalog to support graph tables by unifying the table and graph providers under a single cloud-spanner data source, using a special table identifier for graph properties. A security audit has identified critical vulnerabilities: a configuration overwrite vulnerability that could allow unauthorized access to arbitrary Spanner databases, and a potential NullPointerException leading to a denial of service. Both issues arise from insecure handling of user-controlled data within Spark identifiers. Additionally, general improvements can be made to error handling, immutability, and resource usage.
Review comments:
- spark-3.1-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SpannerCatalog.java
- spark-3.1-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SpannerCatalog.java (outdated)
- ...-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SparkSpannerTableProviderBase.java
- ...-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SparkSpannerTableProviderBase.java (outdated)
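The configuration-overwrite issue flagged in the review above can be mitigated by refusing to let identifier-supplied options override connection-scoping settings. A minimal sketch of that idea; the class, method, and the exact deny-list of option names are illustrative assumptions, not the connector's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OptionValidation {
    // Hypothetical deny-list: connection-scoping options that must never be
    // overridden by a user-controlled identifier payload.
    static final Set<String> PROTECTED =
        Set.of("projectId", "instanceId", "databaseId", "emulatorHost");

    // Merge options decoded from an identifier over the base options,
    // rejecting any attempt to redefine a protected key.
    static Map<String, String> merge(
            Map<String, String> base, Map<String, String> fromIdentifier) {
        Map<String, String> out = new HashMap<>(base);
        for (Map.Entry<String, String> e : fromIdentifier.entrySet()) {
            if (PROTECTED.contains(e.getKey())) {
                throw new IllegalArgumentException(
                    "Option may not be set via identifier: " + e.getKey());
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> base = Map.of("projectId", "p1");
        try {
            merge(base, Map.of("projectId", "attacker-project"));
            System.out.println("merged");
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected");
        }
    }
}
```

An allow-list of the graph-specific options the connector actually supports would be stricter still, since unknown keys would then be rejected by default.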
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
/gemini review
Code Review
This pull request refactors the Spanner Spark connector by consolidating graph table handling into the main SpannerCatalog and SparkSpannerTableProviderBase, removing the previously dedicated SparkSpannerGraphTableProvider classes. The changes introduce a special identifier prefix (__spanner_graph__) to embed JSON-encoded graph properties within table identifiers; these are then parsed and used to build SpannerGraph objects. New unit and integration tests cover the updated graph identifier handling. However, review comments highlight a critical SQL injection vulnerability: the graphQuery option, when passed through user-controlled identifier names or options, allows arbitrary SQL execution because it is used directly in SpannerGraphBuilder.build. The reviewer suggests remediating this by avoiding raw SQL, validating graphQuery against an allow-list, restricting its use, or preventing sensitive options from being overridden via identifiers. A minor code improvement was also suggested for simplifying property copying in factorySpannerGraph.
Review comments:
- spark-3.1-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SpannerCatalog.java
- ...-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SparkSpannerTableProviderBase.java (outdated)
- ...-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SparkSpannerTableProviderBase.java (outdated)
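The identifier scheme the review describes — a marker prefix followed by a JSON payload of graph options — can be sketched as follows. This is a simplified illustration: the class name and the hand-rolled JSON handling (no escaping, flat string values only) are assumptions, not the connector's real implementation, which would use a proper JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GraphIdentifier {
    // The marker prefix named in the review.
    static final String PREFIX = "__spanner_graph__";

    // Encode options as a minimal JSON object appended to the prefix.
    static String encode(Map<String, String> options) {
        StringBuilder sb = new StringBuilder(PREFIX).append('{');
        boolean first = true;
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    static boolean isGraphIdentifier(String name) {
        return name.startsWith(PREFIX);
    }

    // Parse the minimal JSON payload back into a map (no escaping handled).
    static Map<String, String> decode(String name) {
        Map<String, String> out = new LinkedHashMap<>();
        String json = name.substring(PREFIX.length());
        json = json.substring(1, json.length() - 1); // strip the braces
        if (json.isEmpty()) return out;
        for (String pair : json.split(",")) {
            String[] kv = pair.split(":", 2);
            out.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("graph", "FinGraph");
        opts.put("type", "node");
        String id = encode(opts);
        System.out.println(isGraphIdentifier(id));
        System.out.println(decode(id).get("graph"));
    }
}
```

Because the payload round-trips through a user-visible identifier, the decode side is exactly where the review's validation concerns apply.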
…ion. Remove incorrect tests. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
…t resources. Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
/gemini review
Code Review
This pull request refactors the Spanner data source to unify support for both regular tables and graph tables under a single cloud-spanner provider, deprecating the separate cloud-spanner-graph provider. This is achieved by encoding graph-specific options into a JSON payload within the table identifier. The changes are well-structured, include necessary updates to tests and service provider configurations across multiple Spark versions, and improve resource management by making SpannerCatalog AutoCloseable. The overall change improves the usability and maintainability of the connector. I have one suggestion to simplify the code.
Review comments:
- spark-3.1-spanner-lib/src/main/java/com/google/cloud/spark/spanner/SpannerCatalog.java
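Making the catalog AutoCloseable, as the review notes, lets callers release the underlying Spanner client deterministically with try-with-resources. A minimal sketch with a stand-in client class; the names are illustrative, not the connector's:

```java
public class CatalogLifecycle {
    // Stand-in for the real Spanner client, which holds network resources.
    static class FakeSpannerClient implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // A catalog that owns its client and closes it when the catalog closes.
    static class ClosableCatalog implements AutoCloseable {
        final FakeSpannerClient client = new FakeSpannerClient();
        @Override public void close() { client.close(); }
    }

    public static void main(String[] args) {
        FakeSpannerClient observed;
        try (ClosableCatalog catalog = new ClosableCatalog()) {
            observed = catalog.client;
            System.out.println("open=" + !observed.closed);
        } // catalog.close() runs here, even if the body threw
        System.out.println("closed=" + observed.closed);
    }
}
```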
…oogleCloudDataproc#170)

* Support for Spark Catalogs and ErrorIfExists save mode. (#19)

  Main changes:
  - Connector implements the Catalog API.
  - ErrorIfExists and Ignore save modes are supported (via the Catalog API).
  - Separate format (`cloud-spanner-graph`) and table provider for Spanner Graphs.

  Introducing a separate format for graphs was necessary because the current implementation tightly couples the Spark table definition and the graph query. Supporting the Catalog API for Spanner graphs and tables with one Spark table provider would require significantly refactoring graph support. Specifically, I ran into a case where loading a graph table and then querying it resulted in type conversion errors, because the query code casts Id columns to `String` even though they are defined as `INT64`.

  Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update scripts/setup-test-db.sh

* Spotless cleanup

* Addressing Gemini feedback

* Refactor SpannerCatalog to support graph tables in `cloud-spanner` format (#22)
  - `SpannerCatalogTableProviderBase.extractIdentifier` encodes the dataframe options supported by graphs into the returned identifier.
  - These are: `graph`, `type`, `configs`, `graphQuery`, `timestamp`, `viewsEnabled`.
  - `SpannerCatalog.loadTable` extracts these options to correctly instantiate the `SpannerGraph` class.
  - Removed classes related to `cloud-spanner-graph`.

* Support arrays in SpannerCatalog.createTable. (#23)

* Get emulatorHost from current properties.

* Quote identifiers used to construct CREATE and DROP DDL statements.

* Quote table names in constructed SQL statements; other refactorings and clean-up.

* Resolve merge conflicts with main.

* Update examples.

* Validate projectId, databaseId, and instanceId as passed by the user.

* Fix copyright year.

* Separate dataframe-based and catalog-based integration tests
  - Make sure most integration tests continue to exercise the dataframe API without a catalog.
  - Add integration tests specifically for Catalog-based write operations.
  - Store dataframe options for the writer in a case-insensitive map. Spark options are case-insensitive and sometimes arrive normalized to all lower case.

* Small improvements to the ErrorIfExists and Ignore integration tests.

* Examples of writing using a catalog.

* Re-write the testErrorIfExistsSaveMode and testIgnoreSaveMode tests for clarity: both perform the same write operation twice with different results.

---------
Signed-off-by: Max Ksyunz <max.ksyunz@improving.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
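Two of the commits above — storing writer options in a case-insensitive map and quoting identifiers in generated DDL — can be sketched briefly. `TreeMap` with `String.CASE_INSENSITIVE_ORDER` is one standard way to get case-insensitive keys; the `quote` helper is an illustrative assumption, backtick-quoting an identifier in GoogleSQL style rather than the connector's actual code:

```java
import java.util.Map;
import java.util.TreeMap;

public class WriterOptionsSketch {
    // Backtick-quote an identifier for GoogleSQL DDL, escaping embedded backticks.
    static String quote(String ident) {
        return "`" + ident.replace("`", "\\`") + "`";
    }

    public static void main(String[] args) {
        // Case-insensitive map: Spark may deliver option keys lower-cased,
        // so "viewsEnabled" must be found via "viewsenabled" too.
        Map<String, String> opts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        opts.put("viewsEnabled", "true");
        System.out.println(opts.get("viewsenabled"));

        // Quoting keeps table names with reserved words or odd casing valid in DDL.
        System.out.println("CREATE TABLE " + quote("Orders") + " (Id INT64) PRIMARY KEY (Id)");
    }
}
```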
No description provided.