Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Aug 13, 2020

Iceberg indexes schemas by name to look up fields, but uses short names by omitting "element" and "value" names when a list or map has a struct element or value. For example, a list, locations, of structs struct<lat: double, long: double> will use names locations.lat instead of locations.element.lat.

This works most of the time, but can lead to conflicts when a map value contains a field name key. In that case, the schema is rejected with a failure message like this: ValidationException: Invalid schema: multiple fields for name some_map.key: 146 and 144

In addition, Spark passes names through alterTable that include the value and element names that are omitted.

This PR fixes the problem by keeping track of two names, the full name with element and value, and secondary "short" names that omit them. Indexing now uses all of the full names and adds any short names that are not ambiguous.

This change also requires indexing IDs separately rather than using a BiMap because IDs can have multiple names, which is not valid when inverting the BiMap.

@RussellSpitzer
Copy link
Member

I'm new to this but why are we shortening names at all? Were we running into some really long strings? Seems like we could just fix this by using the full names?

ValidationException.check(existingFieldId == null,
"Invalid schema: multiple fields for name %s: %s and %s", fullName, existingFieldId, fieldId);

// if the short name is not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfinished comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the short name is not empty and we haven't already added an identical short name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a nested field. For top-level fields, the short name is the same as the canonical name.

private final Map<String, Integer> nameToId = Maps.newHashMap();
private final Map<String, Integer> shortNameToId = Maps.newHashMap();

public Map<String, Integer> byName() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to have a little comment here explaining the mapping of both Long and Short Names in the returned map but mostly because I don't understand the short names yet. They may be more obvious to a more experienced reader of this code.

Copy link
Contributor

@kbendick kbendick Aug 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the doc commented read something like
"Short names for maps.... For example.... will produce short name 'l.x' in addition to the canonical long name 'l.element.x'"

That small addition in the doc comment would clarify things pretty heavily for me, but I might be biased as I've read the explanation of short vs long names now. Overall I am in favor of the short names though as they feel much more natural over l.element.x.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added canonical here.

return builder.build();
}

public Map<Integer, String> byId() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here it may also be good to note this isn't a bijection anymore and short names will not be returned by this function, only long names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does somewhat seem implied that this isn't a bijection (and that long names are returned) via Canonical names, not short names are returned, for example 'list.element.field' instead of 'list.field'.

Perhaps it might be specifically useful to call this out in the @return via something like @return A map from field ID to the long / canonical field name.?

@rdblue
Copy link
Contributor Author

rdblue commented Aug 14, 2020

Thanks for reviewing, @RussellSpitzer! I'll fix those issues.

The short names are more obvious because they don't contain a hard-coded string that users need to know about. Consider the schema from the description, locations list<struct<lat: double, long: double>>. How does someone know to add element to reference locations.element.lat? I think it's more natural to allow referencing that field as locations.lat, as long as it doesn't conflict with other names.

We've also already supported these names, so I think we should continue to support them. But adding the full names removes ambiguity introduced by the short names.

@RussellSpitzer
Copy link
Member

RussellSpitzer commented Aug 14, 2020 via email

@rdblue rdblue force-pushed the fix-schema-name-conflicts branch from fb01c55 to feaad20 Compare August 21, 2020 20:26
* Returns a mapping from full field name to ID.
* <p>
* Short names for maps and lists are included for any name that does not conflict with a canonical name. For example,
* a list, 'l', with of structs with field 'x' will produce short name 'l.x' in addition to 'l.element.x'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a list, 'l', with of structs... or just a list, 'l', of structs ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@rdblue rdblue merged commit d98224a into apache:master Sep 2, 2020
@rdblue rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024
* Core: Fix ParallelIterable memory leak where queue continues to be populated even after iterator close (apache#9402)

(cherry picked from commit d3cb1b6)

* Core: Limit ParallelIterable memory consumption by yielding in tasks (apache#10691)

ParallelIterable schedules 2 * WORKER_THREAD_POOL_SIZE tasks for
processing input iterables. This defaults to 2 * # CPU cores.  When one
or some of the input iterables are considerable in size and the
ParallelIterable consumer is not quick enough, this could result in
unbounded allocation inside `ParallelIterator.queue`. This commit bounds
the queue. When queue is full, the tasks yield and get removed from the
executor. They are resumed when consumer catches up.

(cherry picked from commit 7831a8d)

* Drop ParallelIterable's queue low water mark (apache#10978)

As part of the change in commit
7831a8d, queue low water mark was
introduced. However, it resulted in increased number of manifests being
read when planning LIMIT queries in Trino Iceberg connector. To avoid
increased I/O, back out the change for now.

(cherry picked from commit bcb3281)

---------

Co-authored-by: Helt <heltman@qq.com>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants