Adding PCNT notation support #1470
davidrabinowitz wants to merge 3 commits into GoogleCloudDataproc:master
Summary of Changes

This pull request introduces support for a more comprehensive table ID notation.
/gcbrun
Code Review
This pull request adds support for PCNT (Project.Catalog.Namespace.Table) notation for identifying BigQuery tables. The changes include updating the dataset pattern regex and adding logic in BigQueryUtil.parseTableId to correctly parse these complex identifiers. Additionally, the code is refactored to use TableId.getIAMResourceName() for generating table paths, which is a good improvement. The changes are logical, but I have a couple of suggestions: one to fix a minor typo in a comment, and another to enhance test coverage for the new parsing logic.
...-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryUtil.java
public void testProjectCatalogNamespaceTableNotation_independent() {
  TableId tableId =
      BigQueryUtil.parseTableId(
          "table", Optional.of("catalog.namespace"), Optional.of("project"));
  assertThat(tableId).isEqualTo(TableId.of("project", "catalog.namespace", "table"));
}
The added test is good for verifying catalog.namespace as a dataset when components are passed separately. However, it doesn't cover the new parsing logic in parseTableId for a single fully qualified table string. To ensure the new logic is robust and prevent regressions, it would be beneficial to add test cases that specifically exercise parsing PCNT notation from a single string.
For example:
@Test
public void testProjectCatalogNamespaceTableNotation_singleString() {
  TableId tableId = BigQueryUtil.parseTableId("project.catalog.namespace.table");
  assertThat(tableId).isEqualTo(TableId.of("project", "catalog.namespace", "table"));
}

@Test
public void testProjectCatalogNamespaceTableNotation_singleStringMultipleLevels() {
  TableId tableId = BigQueryUtil.parseTableId("project.catalog1.catalog2.namespace.table");
  assertThat(tableId).isEqualTo(TableId.of("project", "catalog1.catalog2.namespace", "table"));
}
/gemini review
Code Review
The pull request introduces support for the PCNT (Project, Catalog, Namespace, Table) notation, which is a valuable addition for complex BigQuery environments. The implementation correctly handles the parsing of multi-part table identifiers and cleans up the resource name generation in ReadSessionCreator. I have identified a minor typo and a potential issue with domain-scoped projects that should be addressed to ensure full compatibility with all project ID formats.
private static final String PROJECT_PATTERN = "\\S+";
private static final String DATASET_PATTERN = "\\w+";
// The TableId dataset may be `catalog.namesapce`
if (projectStr.contains(".") && !projectStr.contains(":")) {
  int dotIndex = projectStr.indexOf(".");
  parsedProject = Optional.of(projectStr.substring(0, dotIndex));
  parsedDataset = Optional.of(projectStr.substring(dotIndex + 1) + "." + parsedDataset.get());
}
The current logic for splitting the project string into project and catalog does not support domain-scoped projects (e.g., example.com:project.catalog). In such cases, the project string contains both a colon and a dot, and the first dot is part of the domain name rather than a separator for the catalog. To support this, you should look for the first dot that appears after the colon (if a colon exists).
Suggested change:
- if (projectStr.contains(".") && !projectStr.contains(":")) {
-   int dotIndex = projectStr.indexOf(".");
-   parsedProject = Optional.of(projectStr.substring(0, dotIndex));
-   parsedDataset = Optional.of(projectStr.substring(dotIndex + 1) + "." + parsedDataset.get());
- }
+ int dotIndex = projectStr.indexOf(".", projectStr.lastIndexOf(":") + 1);
+ if (dotIndex != -1) {
+   parsedProject = Optional.of(projectStr.substring(0, dotIndex));
+   parsedDataset = Optional.of(projectStr.substring(dotIndex + 1) + "." + parsedDataset.get());
+ }
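The suggested lookup can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper class `ProjectCatalogSplitter` that is not part of the connector; it works on plain strings instead of the `Optional` values used in `parseTableId`, and only illustrates how searching for the dot after the last colon treats domain-scoped projects.

```java
/** Hypothetical helper for illustration only; not part of the connector. */
final class ProjectCatalogSplitter {

  /**
   * Returns a two-element array {project, dataset}. If the project string
   * carries a catalog suffix, the suffix is moved to the front of the dataset.
   */
  static String[] split(String projectStr, String dataset) {
    // Start the search after the last colon (index 0 when there is none),
    // so dots inside a domain prefix such as "example.com:" are skipped.
    int dotIndex = projectStr.indexOf('.', projectStr.lastIndexOf(':') + 1);
    if (dotIndex == -1) {
      // No catalog part: leave both values untouched.
      return new String[] {projectStr, dataset};
    }
    return new String[] {
      projectStr.substring(0, dotIndex),
      projectStr.substring(dotIndex + 1) + "." + dataset
    };
  }

  public static void main(String[] args) {
    String[] plain = split("project.catalog", "namespace");
    String[] scoped = split("example.com:project.catalog", "namespace");
    System.out.println(plain[0] + " | " + plain[1]);
    System.out.println(scoped[0] + " | " + scoped[1]);
  }
}
```

With this approach a domain-scoped project without a catalog, such as `example.com:project`, is returned unchanged, because no dot follows the colon.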
public void testProjectCatalogNamespaceTableNotation_independent() {
  TableId tableId =
      BigQueryUtil.parseTableId(
          "table", Optional.of("catalog.namespace"), Optional.of("project"));
  assertThat(tableId).isEqualTo(TableId.of("project", "catalog.namespace", "table"));
}
The added test case testProjectCatalogNamespaceTableNotation_independent uses a simple table name and relies on the Optional parameters. It would be beneficial to add a test case that specifically exercises the new parsing logic for a 4-part qualified table string (e.g., project.catalog.namespace.table) to ensure the regex and splitting logic correctly handle the combined project/catalog string.
@gemini-code-assist check testProjectCatalogNamespaceTableNotation_four_part()
You are absolutely right, testProjectCatalogNamespaceTableNotation_four_part() perfectly covers the scenario I suggested. Thank you for adding it!
/gcbrun
/gemini review
Code Review
The pull request successfully adds support for the PCNT (Project.Catalog.Namespace.Table) notation. This is achieved by updating the DATASET_PATTERN regex to allow for catalog.namespace in the dataset part and introducing a new parsing logic in BigQueryUtil.parseTableId to correctly re-interpret table IDs where the project component might implicitly contain a catalog part (e.g., project.catalog.dataset.table). The changes also include a refactoring in ReadSessionCreator.java to use the standard TableId.getIAMResourceName() method, which simplifies the code. Comprehensive unit tests have been added to validate the new parsing logic, including edge cases like illegal nested catalogs. Overall, the changes are well-implemented and improve the flexibility of table ID parsing.
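The updated `DATASET_PATTERN` itself is not shown in this thread. As a rough illustration of the idea only, the sketch below assumes a pattern that accepts either a plain dataset or a single `catalog.namespace` pair; both the regex and the class name `DatasetPatternSketch` are assumptions, not the values used in the PR.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration; the real DATASET_PATTERN in the PR may differ. */
final class DatasetPatternSketch {

  // Assumed pattern: one word, optionally followed by a single ".word" part,
  // so "dataset" and "catalog.namespace" match but deeper nesting does not.
  static final Pattern DATASET = Pattern.compile("\\w+(\\.\\w+)?");

  static boolean isValidDataset(String candidate) {
    // matches() requires the whole string to fit the pattern.
    return DATASET.matcher(candidate).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValidDataset("dataset"));           // true
    System.out.println(isValidDataset("catalog.namespace")); // true
    System.out.println(isValidDataset("c1.c2.namespace"));   // false
  }
}
```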
if (parsedProject.isPresent() && parsedDataset.isPresent()) {
  String projectStr = parsedProject.get();
  if (projectStr.contains(".") && !projectStr.contains(":")) {
    int dotIndex = projectStr.indexOf(".");
    parsedProject = Optional.of(projectStr.substring(0, dotIndex));
    String newDataset = projectStr.substring(dotIndex + 1) + "." + parsedDataset.get();
    // The dataset part should not have more than one dot (i.e. catalog.namespace)
    if (newDataset.indexOf(".") != newDataset.lastIndexOf(".")) {
      throw new IllegalArgumentException(
          format("Invalid Table ID '%s'. Must match '%s'", rawTable, QUALIFIED_TABLE_REGEX));
    }
    parsedDataset = Optional.of(newDataset);
  }
}
This block of code performs a crucial re-interpretation of the parsed project and dataset, especially when the initial regex might have captured a project.catalog segment as the project. While the logic appears correct, adding a Javadoc comment to this if block would greatly enhance readability and maintainability. It should explain the specific scenario this re-interpretation handles (e.g., project.catalog.dataset.table where project.catalog is initially grouped as the project) and why it's necessary to adjust parsedProject and parsedDataset.
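One way to act on this would be to extract the block into a small documented helper, since a Javadoc comment attaches to a method rather than to an `if` statement. The sketch below is an assumption about how that could look: the class `PcntReinterpreter` and the method `reinterpret` are hypothetical names, the method works on plain strings rather than `Optional` values, and the error message is shortened because `QUALIFIED_TABLE_REGEX` is not reproduced here.

```java
/** Hypothetical extraction of the block above; names are illustrative only. */
final class PcntReinterpreter {

  /**
   * Re-interprets a project/dataset pair produced by the initial regex match.
   *
   * <p>For input such as {@code project.catalog.dataset.table}, the regex may
   * capture {@code project.catalog} as the project. This method moves the part
   * after the first dot into the dataset, giving {@code project} and
   * {@code catalog.dataset}. Projects containing a colon are left untouched,
   * as in the original block.
   *
   * @return a two-element array {project, dataset}
   * @throws IllegalArgumentException if the resulting dataset has more than one dot
   */
  static String[] reinterpret(String project, String dataset, String rawTable) {
    if (!project.contains(".") || project.contains(":")) {
      return new String[] {project, dataset};
    }
    int dotIndex = project.indexOf('.');
    String newDataset = project.substring(dotIndex + 1) + "." + dataset;
    // The dataset part may hold at most one dot (catalog.namespace).
    if (newDataset.indexOf('.') != newDataset.lastIndexOf('.')) {
      throw new IllegalArgumentException(
          String.format("Invalid Table ID '%s'", rawTable));
    }
    return new String[] {project.substring(0, dotIndex), newDataset};
  }

  public static void main(String[] args) {
    String[] result =
        reinterpret("project.catalog", "namespace", "project.catalog.namespace.table");
    System.out.println(result[0] + " | " + result[1]);
  }
}
```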
/gcbrun
if (parsedProject.isPresent() && parsedDataset.isPresent()) {
  String projectStr = parsedProject.get();
  if (projectStr.contains(".") && !projectStr.contains(":")) {
Should we create a constant for this, given that it is used multiple times here?


