[Proposal] Add fingerprint filter for event deduplication#1872

Closed
andrewkroh wants to merge 1 commit into elastic:master from andrewkroh:feature/fingerprint-filter

Conversation

@andrewkroh
Member

@andrewkroh andrewkroh commented Jun 15, 2016

The fingerprint filter uses a cryptographic hash function to calculate a hash value from specified fields in the event. The resulting hex encoded hash value is stored in the id field. The id field is used as the _id field when the event is sent through the elasticsearch output. Since there cannot be duplicate _id values stored in an elasticsearch index, this prevents an event from being duplicated.

The fingerprint filter is not appropriate for all use cases. It should only be applied when the combination of input fields is sufficiently unique.

Some config examples:

# Filebeat
filters:
  - fingerprint:
      fields: [beat.hostname, source, message]

# Winlogbeat
filters:
  - fingerprint:
      fields: [computer_name, log_name, record_number, message]

Benchmarks:

BenchmarkFingerprintFilterSHA1-4     1000000          2218 ns/op         664 B/op         12 allocs/op
BenchmarkFingerprintFilterSHA256-4    500000          3203 ns/op         696 B/op         12 allocs/op
BenchmarkFingerprintFilterSHA512-4    500000          3332 ns/op        1016 B/op         13 allocs/op
BenchmarkFingerprintFilterMD5-4      1000000          1880 ns/op         552 B/op         11 allocs/op
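The core of the proposed filter is straightforward: hash the configured field values in order and hex-encode the digest. A minimal sketch in Go, assuming SHA-256 and plain string field values (the real filter would look values up in a libbeat event by the configured keys; the `fingerprint` function name and separator byte here are illustrative, not from the PR):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint computes a hex-encoded SHA-256 digest over the given
// field values. A zero byte separates fields so that, for example,
// ["ab", "c"] and ["a", "bc"] do not hash to the same value.
func fingerprint(values []string) string {
	h := sha256.New()
	for _, v := range values {
		h.Write([]byte(v))
		h.Write([]byte{0}) // field separator
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical Filebeat-style inputs: hostname, source path, message.
	id := fingerprint([]string{"host1", "/var/log/syslog", "hello world"})
	fmt.Println(id)
}
```

Because the hash is deterministic, re-reading the same log line on the same host always yields the same `_id`, which is what makes Elasticsearch reject the duplicate.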

@andrewkroh andrewkroh added enhancement discuss Issue needs further discussion. libbeat :Processors labels Jun 15, 2016
@andrewkroh andrewkroh force-pushed the feature/fingerprint-filter branch from 5b9748a to 539c7f0 Compare June 15, 2016 21:27
@ruflin
Contributor

ruflin commented Jun 16, 2016

I really like the idea as it makes the id generation also configurable. Some thoughts / questions:

  • I was first confused by the name fingerprint. I would probably call it something more like "id generation" or "id hashing".
  • I don't think this belongs in filters but in processors (or similar), as it modifies / enhances the data. But this is a more general discussion about naming.

Definitely a +1 on this feature.

@tsg
Contributor

tsg commented Jun 25, 2016

Great that you started with this, @andrewkroh!

It's nice that the hashing function can be selected, but something to worry about with all these algorithms is that the IDs they generate are so random that Lucene won't have a chance in compressing them.

Perhaps an idea would be to have an algorithm that concatenates hashes of the fields. So for Filebeat we could use, for example, hash(host information):hash(file dev+inode):line-number.
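The concatenation idea above can be sketched as follows. This is a hypothetical illustration, not code from the PR: the `shortHash` truncation length and the `structuredID` format are assumptions, but they show why such IDs compress better, since events from the same host and file share a long common prefix:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// shortHash returns the first 8 hex characters of a SHA-1 digest.
// The truncation length is an arbitrary choice for this sketch.
func shortHash(s string) string {
	sum := sha1.Sum([]byte(s))
	return hex.EncodeToString(sum[:4])
}

// structuredID builds hash(host):hash(dev+inode):line-number, so IDs
// from the same host and file share a common, compressible prefix.
func structuredID(host, devInode string, line int64) string {
	return fmt.Sprintf("%s:%s:%d", shortHash(host), shortHash(devInode), line)
}

func main() {
	fmt.Println(structuredID("hostFoo", "2049:131843", 42))
	fmt.Println(structuredID("hostFoo", "2049:131843", 43))
}
```

Consecutive lines from one file differ only in the trailing line number, unlike a full-event hash where every ID is effectively random from Lucene's point of view.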

I think @bleskes might have more thoughts in this area. Also pinging @kimchy and @djschny since they were interested in this feature before.

@djschny

djschny commented Jun 25, 2016

Why even bother with the complexity of calculating a hash of the fields? Instead, just generate a random ID that is time/sequence based?

@kimchy
Member

kimchy commented Jun 26, 2016

I wonder if this is something Beats needs to do at all, or whether it should happen in an ingest pipeline. I am concerned about the fact that it will be used as the ID; I think we can solve potential duplicate data differently (an internal spool queue with the generated ID, for example). I do think fingerprinting in general is useful, as a separate field, but then it can easily be done in ingest node. /cc @polyfractal who has been playing with it

@tsg
Contributor

tsg commented Jun 26, 2016

@djschny the issue with time based IDs is that if Filebeat restarts and has to re-read the log lines they would get different IDs, so it doesn't remove all duplicates. An internal spooling queue might mitigate this, indeed.

@polyfractal

For reference, here's the ingest issue that I opened a while ago: elastic/elasticsearch#16938

It mainly focused on "fuzzy fingerprinting" using things like minhash, simhash, etc. where you want to group similar documents under a single "fingerprint". But it could easily be extended to include exact hash functions that act as a de-duplication field.

And somewhat related, we recently merged the Fingerprint Analyzer, which can be used for fingerprinting text fields. Although this is definitely more for grouping/clustering/ML on text than de-duping.

Just a thought: if this is added to Beats but the user doesn't specify all the fields, it'd be easy for accidental "collisions" to clobber existing data. E.g.

  1. an event arrives with the tuple "192.168.0.160-hostFoo-abcxyz" which hashes to _id: 12345. The doc also has an additional, non-hashed field: "bar": "baz".
  2. a second event arrives with the same tuple "192.168.0.160-hostFoo-abcxyz" but now with "bar": "bizzbuzz". It'll clobber the existing "bar" field.

Obviously it's up to the user to configure the set of hashed fields to prevent this from happening, but it seems like it could be very trappy, with problems that are hard to detect?

@djschny

djschny commented Jun 27, 2016

the issue with time based IDs is that if Filebeat restarts and has to re-read the log lines they would get different IDs, so it doesn't remove all duplicates.

Unless I'm missing something, Filebeat keeps track of where it has read into the file, and the offset only moves forward for events that shipped successfully. I don't believe handling the scenario of re-running the same file multiple times without creating dupes is a concern here. That functionality doesn't exist today, so this would still be a step forward for the vast majority of users.

@kimchy
Member

kimchy commented Jun 27, 2016

The way forward here, architecture-wise across all the various Beats, is relying on spooling and generating the ID once. We can't rely on the behavior of one specific Beat, like Filebeat, even though it still doesn't apply (but that's irrelevant).

I am not sure what additional benefit fingerprint will bring in beats if it can be done in ingest.

@andrewkroh
Member Author

I agree that if we can do the fingerprinting in an ingest pipeline, it's not necessary to have the feature in Beats. I wasn't aware of elastic/elasticsearch#16938 when I opened this PR.

@andrewkroh andrewkroh closed this Jun 30, 2016
@andrewkroh andrewkroh deleted the feature/fingerprint-filter branch April 19, 2018 23:30
6 participants