[BUG] MinHash token filter parameters not working #15183
Description
Describe the bug
See elastic/elasticsearch#84578
The MinHash token filter has four configurable parameters: bucket_count, hash_count, hash_set_size, and with_rotation. Currently, bucket_count and hash_set_size have no effect because of what appears to be a bug in the code that passes them to the underlying Lucene class.
Snippet from MinHashTokenFilterFactory:
    Map<String, String> settingMap = new HashMap<>();
    if (settings.hasValue("hash_count")) {
        settingMap.put("hashCount", settings.get("hash_count"));
    }
    if (settings.hasValue("bucketCount")) {
        settingMap.put("bucketCount", settings.get("bucket_count"));
    }
    if (settings.hasValue("hashSetSize")) {
        settingMap.put("hashSetSize", settings.get("hash_set_size"));
    }
    if (settings.hasValue("with_rotation")) {
        settingMap.put("withRotation", settings.get("with_rotation"));
    }
Notice the camel-case bucketCount and hashSetSize in the hasValue lines. These should be bucket_count and hash_set_size, since those are the parameter names that would appear (and be fetched on the following lines) from the settings object.
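For illustration, here is a minimal sketch of what a corrected mapping could look like. It is not the actual factory code: a plain Map stands in for the settings object, and the class name, the KEYS table, and toLuceneArgs are hypothetical names introduced for this example. The point is only that each snake_case name is used both for the presence check and for the lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for the settings object (assumption).
class MinHashSettingsFixSketch {

    // Snake-case setting name -> camel-case key used in the quoted snippet.
    private static final String[][] KEYS = {
        {"hash_count", "hashCount"},
        {"bucket_count", "bucketCount"},
        {"hash_set_size", "hashSetSize"},
        {"with_rotation", "withRotation"},
    };

    // Copies every setting that is present, checking and reading the SAME key.
    static Map<String, String> toLuceneArgs(Map<String, String> settings) {
        Map<String, String> settingMap = new LinkedHashMap<>();
        for (String[] pair : KEYS) {
            String value = settings.get(pair[0]);
            if (value != null) {
                settingMap.put(pair[1], value);
            }
        }
        return settingMap;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hash_count", "1");
        settings.put("bucket_count", "1");
        settings.put("hash_set_size", "1");
        settings.put("with_rotation", "true");
        // All four parameters are carried over, including the two that were dropped.
        System.out.println(toLuceneArgs(settings));
    }
}
```

Driving the mapping from a single table keeps the check and the lookup from drifting apart, which is the mistake in the quoted snippet.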
Related component
Indexing
To Reproduce
Here are two examples that demonstrate the issue with the bucket_count parameter.
Example 1
Define a custom analyzer that includes a MinHash token filter configured with a single hash function and a single bucket:
PUT my-index-1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_minhash_filter"]
        }
      },
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 1,
          "hash_set_size": 1,
          "with_rotation": true
        }
      }
    }
  }
}

POST my-index-1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "test"
}
Expected output: { "tokens": [<1 token>] }
Observed output: { "tokens": [<512 identical tokens>] }
The reason for the 512 tokens is that the single input token is being hashed and placed into one of the hash function's 512 buckets (the default number), and because rotation is enabled that hashed token is being used as output from the empty buckets as well. If the bucket_count: 1 configuration had worked, we would only have seen a single output token.
Example 2
With k hash functions each paired with a single bucket (and hash set size 1), we should get k MinHash tokens regardless of the number of input tokens:
...
"filter": {
  "my_minhash_filter": {
    "type": "min_hash",
    "hash_count": 1,
    "bucket_count": 1,
    "hash_set_size": 1,
    "with_rotation": false
  }
}
...

POST my-index-1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "another, longer test"
}
Expected output: { "tokens": [<1 token>] }
Observed output: { "tokens": [<3 tokens>] }
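Both observations fit a deliberately simplified model of the filter, sketched below. This is not Lucene's MinHashFilter code: it assumes one hash function and hash set size 1, and the bucket indices passed in are made up (the sketch also assumes the three tokens of Example 2 land in three different buckets). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the observed token counts, NOT Lucene's implementation.
class MinHashOutputModel {

    // hitBuckets: hypothetical bucket index that each input token hashes into.
    static int outputTokens(int bucketCount, boolean withRotation, List<Integer> hitBuckets) {
        boolean[] filled = new boolean[bucketCount];
        for (int b : hitBuckets) {
            filled[Math.floorMod(b, bucketCount)] = true;
        }
        int nonEmpty = 0;
        for (boolean f : filled) {
            if (f) {
                nonEmpty++;
            }
        }
        if (nonEmpty == 0) {
            return 0; // no input tokens, nothing to emit
        }
        // With rotation, empty buckets borrow a value, so every bucket emits a token.
        return withRotation ? bucketCount : nonEmpty;
    }

    public static void main(String[] args) {
        // Example 1: bucket_count ignored (default 512), rotation on, one input token.
        System.out.println(outputTokens(512, true, Arrays.asList(17)));            // 512
        // Example 2: bucket_count ignored, rotation off, three input tokens.
        System.out.println(outputTokens(512, false, Arrays.asList(17, 203, 400))); // 3
        // What both examples would give if bucket_count: 1 were honoured.
        System.out.println(outputTokens(1, true, Arrays.asList(17)));              // 1
        System.out.println(outputTokens(1, false, Arrays.asList(17, 203, 400)));   // 1
    }
}
```

Under this model, the observed 512 and 3 tokens both correspond to the default bucket count being used, and both drop to 1 once the configured value is applied.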
Expected behavior
Filter configuration parameters should be applied to the filter.
Additional context
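You can work around the issue by including BOTH the camel-case and snake-case config parameters. Judging by the snippet above, the camel-case key only needs to be present to pass the hasValue check; the value that is actually applied is read from the snake-case key: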
"minhash_filter": {
  "type": "min_hash",
  "hash_count": 1,
  "bucketCount": 3,
  "bucket_count": 30,
  "hashSetSize": 10,
  "hash_set_size": 10,
  "with_rotation": true
}
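To see which values the workaround ends up passing along, the lookup logic quoted from MinHashTokenFilterFactory can be replayed over a plain Map. This is a sketch under assumptions: the Map stands in for the settings object, containsKey approximates hasValue, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Replays the quoted (buggy) lookup logic over a plain Map; sketch only.
class MinHashWorkaroundSketch {

    static Map<String, String> buggyMapping(Map<String, String> settings) {
        Map<String, String> settingMap = new HashMap<>();
        if (settings.containsKey("hash_count")) {
            settingMap.put("hashCount", settings.get("hash_count"));
        }
        // Presence is checked on the camel-case key, the value is read from the snake-case key.
        if (settings.containsKey("bucketCount")) {
            settingMap.put("bucketCount", settings.get("bucket_count"));
        }
        if (settings.containsKey("hashSetSize")) {
            settingMap.put("hashSetSize", settings.get("hash_set_size"));
        }
        if (settings.containsKey("with_rotation")) {
            settingMap.put("withRotation", settings.get("with_rotation"));
        }
        return settingMap;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("hash_count", "1");
        settings.put("bucketCount", "3");
        settings.put("bucket_count", "30");
        settings.put("hashSetSize", "10");
        settings.put("hash_set_size", "10");
        settings.put("with_rotation", "true");

        Map<String, String> mapped = buggyMapping(settings);
        System.out.println("bucketCount = " + mapped.get("bucketCount")); // bucketCount = 30
        System.out.println("hashSetSize = " + mapped.get("hashSetSize")); // hashSetSize = 10
    }
}
```

In this replay the bucket count that gets through is 30, the snake-case value; the camel-case 3 only serves to satisfy the presence check.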