Use index-prefix fields for terms of length min_chars - 1#36703
Use index-prefix fields for terms of length min_chars - 1#36703romseygeek merged 8 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/es-search |
|
@elasticmachine retest this please |
jpountz
left a comment
There was a problem hiding this comment.
This is a great idea! I left some suggestions.
| if (strValue.length() >= minChars) { | ||
| return super.termQuery(value, context); | ||
| } | ||
| WildcardQuery query = new WildcardQuery(new Term(name(), value + "?")); |
There was a problem hiding this comment.
I think it'd be safer to create an automaton manually and then instantiate an AutomatonQuery, otherwise there could be surprises if value contains ? or *.
|
|
||
| boolean accept(int length) { | ||
| return length >= minChars && length <= maxChars; | ||
| return length >= minChars - 1 && length <= maxChars; |
There was a problem hiding this comment.
Let's go even further, change this to just length <= maxChars, and then append minChars - prefixTerm.length wildcards to the automaton that is used for querying?
|
Now that I'm thinking about it again... I'm afraid that this refactoring will fail to match terms whose length is exactly |
|
It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above. |
What about text fields that don't have |
jtibshirani
left a comment
There was a problem hiding this comment.
I had a couple questions for my knowledge, to help understand the trade-offs we're making: in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?
| builder.endObject(); | ||
| } | ||
|
|
||
| public Query termQuery(Object value, MultiTermQuery.RewriteMethod method, QueryShardContext context) { |
There was a problem hiding this comment.
Maybe we should rename this method, now that it is not a pure term query (for example prefixQuery could make more sense)?
There was a problem hiding this comment.
I made this just override prefixQuery, it makes much more sense!
| q = fieldType.prefixQuery("internationalisatio", CONSTANT_SCORE_REWRITE, queryShardContext); | ||
| assertEquals(new PrefixQuery(new Term("field", "internationalisatio")), q); | ||
|
|
||
| q = fieldType.prefixQuery("g", CONSTANT_SCORE_REWRITE, queryShardContext); |
There was a problem hiding this comment.
It seems like these detailed query construction tests would fit better in a unit test like TextFieldTypeTests.
|
Thanks @jtibshirani, have pushed some changes to address your comments.
The reasoning is that each extra ngram length adds to index size; so min_chars of 1 will end up with a very large index indeed, and 2 seems to be a reasonable default. But if you know that you will only ever do prefix searches of length 4 or more, for example, then you can up the min_chars setting to save on disk space. |
|
@romseygeek I'm wondering if you addressed @jpountz's comment above?
|
I had missed that, thank you! Will open a separate PR to deal with it. |
I can't find the Github issue now, but it has been occasionally asked that we add a flag that allows to disable slow queries entirely, such as multi-term queries that match lots of terms. We could have a switch for all queries rather than only prefix queries, eg. by enforcing a rewrite method that fails if more than X terms match? And |
The default
index_prefixsettings will index prefixes of between 2 and 5 characters in length. Currently, if a prefix search falls outside of this range at either end we fall back to a standard prefix expansion, which is still very expensive for single character prefixes. However, we have an option here to use a wildcard expansion rather than a prefix expansion, so that a query ofa*gets remapped toa?against the_index_prefixfield - likely to be a very small set of terms, and certain to be much smaller thana*against the whole index.This pull request adds this extra level of mapping for any prefix term whose length is one less than the
min_charsparameter of theindex_prefixesfield.A possible follow-up could be to disallow single-character wildcards against a field unless
index_prefixis enabled with amin_charsettings of 2.