[Proposal] Add fingerprint filter for event deduplication #1872
andrewkroh wants to merge 1 commit into elastic:master from
Conversation
The fingerprint filter uses a cryptographic hash function to calculate a hash value from specified fields in the event. The resulting hex-encoded hash value is stored in the id field, and the id field is used as the _id field when the event is sent through the elasticsearch output. Since duplicate _id values cannot be stored in an Elasticsearch index, this prevents an event from being duplicated. The fingerprint filter is not appropriate for all use cases; it should only be applied when the combination of input fields is sufficiently unique.
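The idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation from the PR; the field names and the `fingerprint` helper are hypothetical, and it assumes SHA-256 as the selected hash function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// fingerprint concatenates the values of the configured fields in a stable
// order and hashes the result with SHA-256. The hex digest would be stored
// in the event's id field and used as the Elasticsearch _id.
// (Illustrative sketch only; not the actual Beats code.)
func fingerprint(event map[string]string, fields []string) string {
	// Sort the field names so the hash is stable regardless of config order.
	sorted := append([]string(nil), fields...)
	sort.Strings(sorted)

	var b strings.Builder
	for _, f := range sorted {
		b.WriteString(f)
		b.WriteByte(0) // separator to avoid ambiguous concatenations
		b.WriteString(event[f])
		b.WriteByte(0)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	event := map[string]string{
		"beat.hostname": "web-01",
		"source":        "/var/log/app.log",
		"offset":        "1024",
	}
	// Identical events always produce the same 64-char hex id, so a
	// re-sent duplicate maps to the same _id and is not stored twice.
	fmt.Println(fingerprint(event, []string{"beat.hostname", "source", "offset"}))
}
```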
5b9748a to 539c7f0
I really like the idea, as it also makes the ID generation configurable. Some thoughts/questions:

Definitely a +1 on this feature.
Great that you started with this, @andrewkroh! It's nice that the hashing function can be selected, but something to worry about with all these algorithms is that the IDs they generate are so random that Lucene won't have a chance of compressing them. Perhaps an idea would be to have an algorithm that concatenates hashes of the fields. So for filebeat we could use, for example, I think @bleskes might have more thoughts in this area. Also pinging @kimchy and @djschny since they were interested in this feature before.
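The concatenation idea above could look something like this. It is a hypothetical sketch, not a proposed API: each field is hashed separately and the short per-field hashes are joined in a fixed order, so events from the same host and file share a long common ID prefix that a sorted term dictionary can compress much better than a single flat hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// shortHash returns the first n bytes of SHA-256(value), hex encoded.
func shortHash(value string, n int) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:n])
}

// prefixFriendlyID concatenates per-field hashes in the configured order.
// Field names are illustrative examples, not a real Beats config.
func prefixFriendlyID(event map[string]string, fields []string) string {
	id := ""
	for _, f := range fields {
		id += shortHash(event[f], 8)
	}
	return id
}

func main() {
	fields := []string{"beat.hostname", "source", "offset"}
	a := map[string]string{"beat.hostname": "web-01", "source": "/var/log/app.log", "offset": "1"}
	b := map[string]string{"beat.hostname": "web-01", "source": "/var/log/app.log", "offset": "2"}
	// Same host and file: the first 32 hex chars of the two IDs are
	// identical, and only the final per-field hash differs.
	fmt.Println(prefixFriendlyID(a, fields))
	fmt.Println(prefixFriendlyID(b, fields))
}
```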
Why even bother with the complexity of calculating a hash of the fields? Instead, just generate a random ID that is time/sequence based?
I wonder if this is something Beats needs to do; maybe it should be done in an ingest pipeline? I am concerned about the fact that it will be used as the ID; I think we can solve potential duplicate data differently (an internal spool queue with the generated ID, for example). I do think a fingerprint in general is useful, as a different field, but then it can easily be done in ingest node. /cc @polyfractal, who has been playing with it
@djschny the issue with time-based IDs is that if Filebeat restarts and has to re-read the log lines, they would get different IDs, so it doesn't remove all duplicates. An internal spooling queue might indeed mitigate this.
For reference, here's the ingest issue that I opened a while ago: elastic/elasticsearch#16938. It mainly focused on "fuzzy fingerprinting" using things like MinHash, SimHash, etc., where you want to group similar documents under a single "fingerprint", but it could easily be extended to include exact hash functions that act as a deduplication field. Somewhat related, we recently merged the Fingerprint Analyzer, which can be used for fingerprinting text fields, although this is definitely more for grouping/clustering/ML on text than de-duping. Just a thought: if this is added to Beats but the user doesn't specify all the fields, it'd be easy for accidental "collisions" to clobber existing data. E.g.
Obviously it's up to the user to configure the set of hashed fields to prevent this from happening, but it seems like it could be very trappy, and such problems could be hard to detect.
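The clobbering risk described above is easy to demonstrate. In this hypothetical sketch (field names invented for illustration), the user hashes only `source` and omits the fields that actually distinguish events, so two different log lines get the same _id and the second silently overwrites the first in Elasticsearch.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes only the listed fields. Illustrative sketch only;
// not the actual Beats code.
func fingerprint(event map[string]string, fields []string) string {
	h := sha256.New()
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator between field name and value
		h.Write([]byte(event[f]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Two distinct log lines from the same file...
	e1 := map[string]string{"source": "/var/log/app.log", "message": "started"}
	e2 := map[string]string{"source": "/var/log/app.log", "message": "stopped"}

	// ...but only "source" is hashed, so the _ids collide and the second
	// event would silently replace the first in the index.
	fields := []string{"source"}
	fmt.Println(fingerprint(e1, fields) == fingerprint(e2, fields)) // true
}
```

Including `message` (or an offset) in the hashed fields restores distinct IDs, which is why the choice of fields is the trappy part.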
Unless I'm missing something, filebeat keeps track of how far it has read into each file, and only advances that position for events that shipped successfully. Handling the scenario of re-running the same file multiple times without creating dupes is, I believe, not a concern here. It's not like that functionality exists today, so this would still be a step forward for the vast majority of users.
The way forward here, architecturally, across all the various Beats is to rely on spooling and generate the ID once. We can't rely on one specific beat, like filebeat, even though it still doesn't apply (but that's irrelevant). I am not sure what additional benefit fingerprinting will bring in Beats if it can be done in ingest.
I agree that if we can do the fingerprinting in an ingest pipeline, it's not necessary to have the feature in Beats. I wasn't aware of elastic/elasticsearch#16938 when I opened this PR.
Some config examples:
Benchmarks: