Conversation
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
This fixes dimensions with unicode characters in their name. It also make the sort order of these dimensions something we can explain: convert all dimensions to utf-8 and sort by that.
| * to build the _tsid field for the document. | ||
| */ | ||
| private SortedMap<String, BytesReference> dimensionBytes; | ||
| private SortedMap<BytesRef, BytesReference> dimensionBytes; |
There was a problem hiding this comment.
I switched to bytes ref here to make it really obvious these are sorted by utf-8.
There was a problem hiding this comment.
Ulterior motive: I'll want them in utf-8 in more than one place.
| "Dimension name must be less than [%s] bytes but [%s] was [%s].", | ||
| DIMENSION_NAME_LIMIT, | ||
| fieldName.utf8ToString(), | ||
| fieldName.length |
There was a problem hiding this comment.
The formatter made a mess of this when I made the parameters a little longer so I switched it to String.format so human can read the message.
|
|
||
| for (int i = 0; i < size; i++) { | ||
| String name = in.readString(); | ||
| String name = in.readBytesRef().utf8ToString(); |
There was a problem hiding this comment.
Here's the bug that made this not work - sometimes BytesRef and String don't write the same.
There was a problem hiding this comment.
I wonder if there is a value in hiding the fact that it's BytesRef all the way down and can this hurt us beyond serialization thing.
There was a problem hiding this comment.
Are you thinking this should be it's own class? Like one that just wraps BytesRef or something?
There was a problem hiding this comment.
Well, if I am completely honest, I was thinking why do we have 2 different maps with String in one case and BytesRef in another case that essentially represent the same data. But I think I like your interpretation much better. It has quite a bit of complexity that could probably benefit from some encapsulation.
There was a problem hiding this comment.
Just to be clear, maybe we need a TimeSeriesId class to represent this sort of data.
There was a problem hiding this comment.
Would you like me to do it in this or a followup?
There was a problem hiding this comment.
Yeah, let's not mix refactoring with the actual change.
server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java
Outdated
Show resolved
Hide resolved
|
|
||
| public void testUnicodeKeys() throws IOException { | ||
| String fire = new String(new int[] { 0x1F525 }, 0, 1); | ||
| String coffee = "\u2615"; |
|
hi, @nik9000 , as the type of lucene field name is |
|
|
And to force them to be used in a safe way. |
hi, @nik9000 can you explain it? I would like to know the reason why |
It does. But it won't necessarily sort them in the same way as encoding the dimensions to utf-8 and then sorting them. I mean, there was a bug, but we don't need to invoke BytesRef in the Map to solve the bug. But I was mostly concerned about the sorting here. Mostly out of paranoia too. I think it'd be fairly hard to come up with field names that don't sort the same way. But we encode the data using utf-8 anyway so sorting by utf-8 was easy. |
This fixes dimensions with unicode characters in their name. It also
make the sort order of these dimensions something we can explain:
convert all dimensions to utf-8 and sort by that.