-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix schema name conflicts #1336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm new to this but why are we shortening names at all? Were we running into some really long strings? Seems like we could just fix this by using the full names? |
| ValidationException.check(existingFieldId == null, | ||
| "Invalid schema: multiple fields for name %s: %s and %s", fullName, existingFieldId, fieldId); | ||
|
|
||
| // if the short name is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfinished comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the short name is not empty and we haven't already added an identical short name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a nested field. For top-level fields, the short name is the same as the canonical name.
| private final Map<String, Integer> nameToId = Maps.newHashMap(); | ||
| private final Map<String, Integer> shortNameToId = Maps.newHashMap(); | ||
|
|
||
| public Map<String, Integer> byName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to have a little comment here explaining the mapping of both Long and Short Names in the returned map but mostly because I don't understand the short names yet. They may be more obvious to a more experienced reader of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the doc commented read something like
"Short names for maps.... For example.... will produce short name 'l.x' in addition to the canonical long name 'l.element.x'"
That small addition in the doc comment would clarify things pretty heavily for me, but I might be biased as I've read the explanation of short vs long names now. Overall I am in favor of the short names though as they feel much more natural over l.element.x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added canonical here.
| return builder.build(); | ||
| } | ||
|
|
||
| public Map<Integer, String> byId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here it may also be good to note this isn't a bijection anymore and short names will not be returned by this function, only long names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does somewhat seem implied that this isn't a bijection (and that long names are returned) via Canonical names, not short names are returned, for example 'list.element.field' instead of 'list.field'.
Perhaps it might be specifically useful to call this out in the @return via something like @return A map from field ID to the long / canonical field name.?
|
Thanks for reviewing, @RussellSpitzer! I'll fix those issues. The short names are more obvious because they don't contain a hard-coded string that users need to know about. Consider the schema from the description, We've also already supported these names, so I think we should continue to support them. But adding the full names removes ambiguity introduced by the short names. |
|
Ah that makes sense to me, I thought these identifiers were internal only,
but it makes much more sense to me now!
…On Fri, Aug 14, 2020 at 11:45 AM Ryan Blue ***@***.***> wrote:
Thanks for reviewing, @RussellSpitzer <https://github.com/RussellSpitzer>!
I'll fix those issues.
The short names are more obvious because they don't contain a hard-coded
string that users need to know about. Consider the schema from the
description, locations list<struct<lat: double, long: double>>. How does
someone know to add element to reference locations.element.lat? I think
it's more natural to allow referencing that field as locations.lat, as
long as it doesn't conflict with other names.
We've also already supported these names, so I think we should continue to
support them. But adding the full names removes ambiguity introduced by the
short names.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1336 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADE2YO6LJS2MP3DFDJC6LTSAVS3DANCNFSM4P6UNE3A>
.
|
fb01c55 to
feaad20
Compare
| * Returns a mapping from full field name to ID. | ||
| * <p> | ||
| * Short names for maps and lists are included for any name that does not conflict with a canonical name. For example, | ||
| * a list, 'l', with of structs with field 'x' will produce short name 'l.x' in addition to 'l.element.x'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a list, 'l', with of structs... or just a list, 'l', of structs ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
* Core: Fix ParallelIterable memory leak where queue continues to be populated even after iterator close (apache#9402) (cherry picked from commit d3cb1b6) * Core: Limit ParallelIterable memory consumption by yielding in tasks (apache#10691) ParallelIterable schedules 2 * WORKER_THREAD_POOL_SIZE tasks for processing input iterables. This defaults to 2 * # CPU cores. When one or some of the input iterables are considerable in size and the ParallelIterable consumer is not quick enough, this could result in unbounded allocation inside `ParallelIterator.queue`. This commit bounds the queue. When queue is full, the tasks yield and get removed from the executor. They are resumed when consumer catches up. (cherry picked from commit 7831a8d) * Drop ParallelIterable's queue low water mark (apache#10978) As part of the change in commit 7831a8d, queue low water mark was introduced. However, it resulted in increased number of manifests being read when planning LIMIT queries in Trino Iceberg connector. To avoid increased I/O, back out the change for now. (cherry picked from commit bcb3281) --------- Co-authored-by: Helt <heltman@qq.com> Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Iceberg indexes schemas by name to look up fields, but uses short names by omitting "element" and "value" names when a list or map has a struct element or value. For example, a list,
locations, of structsstruct<lat: double, long: double>will use nameslocations.latinstead oflocations.element.lat.This works most of the time, but can lead to conflicts when a map value contains a field name
key. In that case, the schema is rejected with a failure message like this:ValidationException: Invalid schema: multiple fields for name some_map.key: 146 and 144In addition, Spark passes names through
alterTablethat include thevalueandelementnames that are omitted.This PR fixes the problem by keeping track of two names, the full name with
elementandvalue, and secondary "short" names that omit them. Indexing now uses all of the full names and adds any short names that are not ambiguous.This change also requires indexing IDs separately rather than using a
BiMapbecause IDs can have multiple names, which is not valid when inverting theBiMap.