[BUG] MinHash token filter parameters not working #15183

@tgreasley-depop

Describe the bug

See elastic/elasticsearch#84578

The MinHash token filter has four configurable parameters: bucket_count, hash_count, hash_set_size, and with_rotation. Currently bucket_count and hash_set_size have no effect, apparently because the factory code that passes settings to the underlying Lucene class checks for the wrong key names.

Snippet from MinHashTokenFilterFactory:

Map<String, String> settingMap = new HashMap<>();
if (settings.hasValue("hash_count")) {
    settingMap.put("hashCount", settings.get("hash_count"));
}
if (settings.hasValue("bucketCount")) {
    settingMap.put("bucketCount", settings.get("bucket_count"));
}
if (settings.hasValue("hashSetSize")) {
    settingMap.put("hashSetSize", settings.get("hash_set_size"));
}
if (settings.hasValue("with_rotation")) {
    settingMap.put("withRotation", settings.get("with_rotation"));
}

Notice the camel-case bucketCount and hashSetSize in the hasValue calls. These should be bucket_count and hash_set_size, since those are the parameter names that appear in the settings object and the names read by the get calls on the following lines.
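A minimal sketch of the corrected key checks follows. It assumes a plain Map<String, String> as a stand-in for the real settings object so that it runs on its own; the class name MinHashSettingsSketch and the helpers toLuceneArgs and copyIfPresent are hypothetical and not part of the actual factory. Only the key names are taken from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

public class MinHashSettingsSketch {

    // Hypothetical helper: copies each snake_case setting to the camelCase
    // argument name used in the snippet above. A plain Map stands in for the
    // real settings object (an assumption made to keep this runnable).
    public static Map<String, String> toLuceneArgs(Map<String, String> settings) {
        Map<String, String> settingMap = new HashMap<>();
        copyIfPresent(settings, "hash_count", settingMap, "hashCount");
        copyIfPresent(settings, "bucket_count", settingMap, "bucketCount");
        copyIfPresent(settings, "hash_set_size", settingMap, "hashSetSize");
        copyIfPresent(settings, "with_rotation", settingMap, "withRotation");
        return settingMap;
    }

    // The presence check and the read use the same key, so they cannot
    // drift apart the way they did in the quoted snippet.
    private static void copyIfPresent(Map<String, String> from, String fromKey,
                                      Map<String, String> to, String toKey) {
        String value = from.get(fromKey);
        if (value != null) {
            to.put(toKey, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("hash_count", "1");
        settings.put("bucket_count", "1");
        settings.put("hash_set_size", "1");
        settings.put("with_rotation", "true");
        // All four settings should now be forwarded under their camelCase names.
        System.out.println(toLuceneArgs(settings));
    }
}
```

Routing every parameter through one helper is a design choice for the sketch, not a claim about how the fix should be made upstream; changing the two string literals in the hasValue calls would have the same effect.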

Related component

Indexing

To Reproduce

Here are two examples that demonstrate the issue with the bucket_count parameter:

Example 1
Define a custom analyzer that includes a MinHash token filter configured with a single hash function and a single bucket:

PUT my-index-1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_minhash_filter"]
        }
      },
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 1,
          "hash_set_size": 1,
          "with_rotation": true
        }
      }
    }
  }
}

POST my-index-1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "test"
}

Expected output: { "tokens": [<1 token>] }
Observed output: { "tokens": [<512 identical tokens>] }

The 512 tokens appear because the single input token is hashed and placed into one of the hash function's 512 buckets (the default bucket count), and, because rotation is enabled, that hashed token is also emitted for each of the empty buckets. Had the bucket_count: 1 setting taken effect, there would have been a single output token.

Example 2
With k hash functions each paired with a single bucket (and hash set size 1), we should get k MinHash tokens regardless of the number of input tokens:

...
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 1,
          "hash_set_size": 1,
          "with_rotation": false
        }
      }
...

POST my-index-1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "another, longer test"
}

Expected output: { "tokens": [<1 token>] }
Observed output: { "tokens": [<3 tokens>] }

Expected behavior

Filter configuration parameters should be applied to the filter.

Additional Details

Additional context
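You can work around the issue by including BOTH the camel-case and snake-case config parameters. Going by the snippet above, the camel-case key only has to be present to pass the hasValue check; the value that is applied is read from the snake-case key: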

"minhash_filter": {
  "type": "min_hash",
  "hash_count": 1,
  "bucketCount": 3,
  "bucket_count": 30,
  "hashSetSize": 10,
  "hash_set_size": 10,
  "with_rotation": true
}
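Why the workaround behaves this way can be replayed with a small sketch that mirrors the key checks quoted in the bug description. It assumes a plain Map<String, String> in place of the real settings object, with containsKey standing in for hasValue; the class name MinHashWorkaroundSketch and the method buggyArgs are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class MinHashWorkaroundSketch {

    // Mirrors the key checks quoted in the bug description: presence is tested
    // on the camelCase key, but the value is read from the snake_case key.
    public static Map<String, String> buggyArgs(Map<String, String> settings) {
        Map<String, String> settingMap = new HashMap<>();
        if (settings.containsKey("hash_count")) {
            settingMap.put("hashCount", settings.get("hash_count"));
        }
        if (settings.containsKey("bucketCount")) {
            settingMap.put("bucketCount", settings.get("bucket_count"));
        }
        if (settings.containsKey("hashSetSize")) {
            settingMap.put("hashSetSize", settings.get("hash_set_size"));
        }
        if (settings.containsKey("with_rotation")) {
            settingMap.put("withRotation", settings.get("with_rotation"));
        }
        return settingMap;
    }

    public static void main(String[] args) {
        Map<String, String> workaround = new HashMap<>();
        workaround.put("hash_count", "1");
        workaround.put("bucketCount", "3");
        workaround.put("bucket_count", "30");
        workaround.put("hashSetSize", "10");
        workaround.put("hash_set_size", "10");
        workaround.put("with_rotation", "true");

        // bucketCount only unlocks the check; the forwarded value is "30".
        System.out.println(buggyArgs(workaround).get("bucketCount")); // prints 30

        Map<String, String> snakeOnly = new HashMap<>();
        snakeOnly.put("bucket_count", "30");
        // Without the camelCase key, the setting is silently dropped.
        System.out.println(buggyArgs(snakeOnly).containsKey("bucketCount")); // prints false
    }
}
```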
