Fix schema name conflicts #1336

rdblue · 2020-08-13T16:49:53Z

Iceberg indexes schemas by name to look up fields, but uses short names by omitting "element" and "value" names when a list or map has a struct element or value. For example, a list, locations, of structs struct<lat: double, long: double> will use names locations.lat instead of locations.element.lat.

This works most of the time, but can lead to conflicts when a map value contains a field name key. In that case, the schema is rejected with a failure message like this: ValidationException: Invalid schema: multiple fields for name some_map.key: 146 and 144

In addition, Spark passes names through alterTable that include the value and element names that are omitted.

This PR fixes the problem by keeping track of two names, the full name with element and value, and secondary "short" names that omit them. Indexing now uses all of the full names and adds any short names that are not ambiguous.

This change also requires indexing IDs separately rather than using a BiMap because IDs can have multiple names, which is not valid when inverting the BiMap.

RussellSpitzer · 2020-08-14T04:23:39Z

I'm new to this but why are we shortening names at all? Were we running into some really long strings? Seems like we could just fix this by using the full names?

RussellSpitzer · 2020-08-14T04:25:14Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

+    ValidationException.check(existingFieldId == null,
+        "Invalid schema: multiple fields for name %s: %s and %s", fullName, existingFieldId, fieldId);
+
+    // if the short name is not


unfinished comment?

If the short name is not empty and we haven't already added an identical short name?

If this is a nested field. For top-level fields, the short name is the same as the canonical name.

RussellSpitzer · 2020-08-14T04:34:21Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

  private final Map<String, Integer> nameToId = Maps.newHashMap();
+  private final Map<String, Integer> shortNameToId = Maps.newHashMap();
+
+  public Map<String, Integer> byName() {


I think it would be helpful to have a little comment here explaining the mapping of both Long and Short Names in the returned map but mostly because I don't understand the short names yet. They may be more obvious to a more experienced reader of this code.

I think if the doc commented read something like
"Short names for maps.... For example.... will produce short name 'l.x' in addition to the canonical long name 'l.element.x'"

That small addition in the doc comment would clarify things pretty heavily for me, but I might be biased as I've read the explanation of short vs long names now. Overall I am in favor of the short names though as they feel much more natural over l.element.x.

I added canonical here.

RussellSpitzer · 2020-08-14T04:34:53Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

+    return builder.build();
+  }
+
+  public Map<Integer, String> byId() {


I think here it may also be good to note this isn't a bijection anymore and short names will not be returned by this function, only long names.

It does somewhat seem implied that this isn't a bijection (and that long names are returned) via Canonical names, not short names are returned, for example 'list.element.field' instead of 'list.field'.

Perhaps it might be specifically useful to call this out in the @return via something like @return A map from field ID to the long / canonical field name.?

rdblue · 2020-08-14T16:45:24Z

Thanks for reviewing, @RussellSpitzer! I'll fix those issues.

The short names are more obvious because they don't contain a hard-coded string that users need to know about. Consider the schema from the description, locations list<struct<lat: double, long: double>>. How does someone know to add element to reference locations.element.lat? I think it's more natural to allow referencing that field as locations.lat, as long as it doesn't conflict with other names.

We've also already supported these names, so I think we should continue to support them. But adding the full names removes ambiguity introduced by the short names.

RussellSpitzer · 2020-08-14T16:47:44Z

Ah that makes sense to me, I thought these identifiers were internal only, but it makes much more sense to me now!

…

On Fri, Aug 14, 2020 at 11:45 AM Ryan Blue ***@***.***> wrote: Thanks for reviewing, @RussellSpitzer <https://github.com/RussellSpitzer>! I'll fix those issues. The short names are more obvious because they don't contain a hard-coded string that users need to know about. Consider the schema from the description, locations list<struct<lat: double, long: double>>. How does someone know to add element to reference locations.element.lat? I think it's more natural to allow referencing that field as locations.lat, as long as it doesn't conflict with other names. We've also already supported these names, so I think we should continue to support them. But adding the full names removes ambiguity introduced by the short names. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1336 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADE2YO6LJS2MP3DFDJC6LTSAVS3DANCNFSM4P6UNE3A> .

fbocse · 2020-09-02T18:41:36Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

+   * Returns a mapping from full field name to ID.
+   * <p>
+   * Short names for maps and lists are included for any name that does not conflict with a canonical name. For example,
+   * a list, 'l', with of structs with field 'x' will produce short name 'l.x' in addition to 'l.element.x'.


a list, 'l', with of structs... or just a list, 'l', of structs ...?

* Core: Fix ParallelIterable memory leak where queue continues to be populated even after iterator close (apache#9402) (cherry picked from commit d3cb1b6) * Core: Limit ParallelIterable memory consumption by yielding in tasks (apache#10691) ParallelIterable schedules 2 * WORKER_THREAD_POOL_SIZE tasks for processing input iterables. This defaults to 2 * # CPU cores. When one or some of the input iterables are considerable in size and the ParallelIterable consumer is not quick enough, this could result in unbounded allocation inside `ParallelIterator.queue`. This commit bounds the queue. When queue is full, the tasks yield and get removed from the executor. They are resumed when consumer catches up. (cherry picked from commit 7831a8d) * Drop ParallelIterable's queue low water mark (apache#10978) As part of the change in commit 7831a8d, queue low water mark was introduced. However, it resulted in increased number of manifests being read when planning LIMIT queries in Trino Iceberg connector. To avoid increased I/O, back out the change for now. (cherry picked from commit bcb3281) --------- Co-authored-by: Helt <heltman@qq.com> Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>

RussellSpitzer reviewed Aug 14, 2020

View reviewed changes

rdblue added 2 commits August 21, 2020 13:11

Fix schema name conflicts.

bec6afe

Add more docs and comments for review.

feaad20

rdblue force-pushed the fix-schema-name-conflicts branch from fb01c55 to feaad20 Compare August 21, 2020 20:26

probot-autolabeler bot added the API label Aug 21, 2020

fbocse reviewed Sep 2, 2020

View reviewed changes

Fixup Javadoc.

304358a

aokolnychyi approved these changes Sep 2, 2020

View reviewed changes

rdblue merged commit d98224a into apache:master Sep 2, 2020

rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020

Fix schema name conflicts #1336

Fix schema name conflicts #1336

Uh oh!

Conversation

rdblue commented Aug 13, 2020

Uh oh!

RussellSpitzer commented Aug 14, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kbendick Aug 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue commented Aug 14, 2020

Uh oh!

RussellSpitzer commented Aug 14, 2020 via email

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

kbendick Aug 30, 2020 •

edited

Loading