[Proposal] Add fingerprint filter for event deduplication #1872
andrewkroh wants to merge 1 commit into elastic:master from
Conversation
The fingerprint filter uses a cryptographic hash function to calculate a hash value from specified fields in the event. The resulting hex-encoded hash value is stored in the id field, and the id field is used as the _id field when the event is sent through the elasticsearch output. Since duplicate _id values cannot be stored in an Elasticsearch index, this prevents an event from being duplicated. The fingerprint filter is not appropriate for all use cases; it should only be applied when the combination of input fields is sufficiently unique.
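The idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation from the PR; the field names and the `fingerprint` helper are hypothetical, and it assumes SHA-256 as the selected hash function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// fingerprint concatenates the values of the configured fields in a stable
// order and hashes the result with SHA-256. The hex digest would be stored
// in the event's id field and used as the Elasticsearch _id.
// (Illustrative sketch only; not the actual Beats code.)
func fingerprint(event map[string]string, fields []string) string {
	// Sort the field names so the hash is stable regardless of config order.
	sorted := append([]string(nil), fields...)
	sort.Strings(sorted)

	var b strings.Builder
	for _, f := range sorted {
		b.WriteString(f)
		b.WriteByte(0) // separator to avoid ambiguous concatenations
		b.WriteString(event[f])
		b.WriteByte(0)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	event := map[string]string{
		"beat.hostname": "web-01",
		"source":        "/var/log/app.log",
		"offset":        "1024",
	}
	// Identical events always produce the same 64-char hex id, so a
	// re-sent duplicate maps to the same _id and is not stored twice.
	fmt.Println(fingerprint(event, []string{"beat.hostname", "source", "offset"}))
}
```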
5b9748a to 539c7f0
I really like the idea, as it also makes the ID generation configurable. Some thoughts/questions:

Definitely a +1 on this feature.
Great that you started with this, @andrewkroh! It's nice that the hashing function can be selected, but something to worry about with all these algorithms is that the IDs they generate are so random that Lucene won't have a chance of compressing them. Perhaps an idea would be to have an algorithm that concatenates hashes of the fields. So for filebeat we could use, for example, I think @bleskes might have more thoughts in this area. Also pinging @kimchy and @djschny since they were interested in this feature before.
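The concatenation idea above could look something like this. It is a hypothetical sketch, not a proposed API: each field is hashed separately and the short per-field hashes are joined in a fixed order, so events from the same host and file share a long common ID prefix that a sorted term dictionary can compress much better than a single flat hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// shortHash returns the first n bytes of SHA-256(value), hex encoded.
func shortHash(value string, n int) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:n])
}

// prefixFriendlyID concatenates per-field hashes in the configured order.
// Field names are illustrative examples, not a real Beats config.
func prefixFriendlyID(event map[string]string, fields []string) string {
	id := ""
	for _, f := range fields {
		id += shortHash(event[f], 8)
	}
	return id
}

func main() {
	fields := []string{"beat.hostname", "source", "offset"}
	a := map[string]string{"beat.hostname": "web-01", "source": "/var/log/app.log", "offset": "1"}
	b := map[string]string{"beat.hostname": "web-01", "source": "/var/log/app.log", "offset": "2"}
	// Same host and file: the first 32 hex chars of the two IDs are
	// identical, and only the final per-field hash differs.
	fmt.Println(prefixFriendlyID(a, fields))
	fmt.Println(prefixFriendlyID(b, fields))
}
```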
Why even bother with the complexity of calculating a hash of the fields? Instead, just generate a random ID that is time/sequence based?
I wonder if this is something Beats needs to do; maybe it should be done in an ingest pipeline? I am concerned about the fact that it will be used as the ID; I think we can solve potential duplicate data differently (an internal spool queue with the generated ID, for example). I do think a fingerprint in general is useful, as a different field, but then it can easily be done in ingest node. /cc @polyfractal, who has been playing with it
@djschny the issue with time-based IDs is that if Filebeat restarts and has to re-read the log lines, they would get different IDs, so it doesn't remove all duplicates. An internal spooling queue might indeed mitigate this.
For reference, here's the ingest issue that I opened a while ago: elastic/elasticsearch#16938. It mainly focused on "fuzzy fingerprinting" using things like MinHash, SimHash, etc., where you want to group similar documents under a single "fingerprint", but it could easily be extended to include exact hash functions that act as a deduplication field. Somewhat related, we recently merged the Fingerprint Analyzer, which can be used for fingerprinting text fields, although this is definitely more for grouping/clustering/ML on text than de-duping. Just a thought: if this is added to Beats but the user doesn't specify all the fields, it'd be easy for accidental "collisions" to clobber existing data. E.g.
Obviously it's up to the user to configure the set of hashed fields to prevent this from happening, but it seems like it could be very trappy, and such problems could be hard to detect.
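The clobbering risk described above is easy to demonstrate. In this hypothetical sketch (field names invented for illustration), the user hashes only `source` and omits the fields that actually distinguish events, so two different log lines get the same _id and the second silently overwrites the first in Elasticsearch.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes only the listed fields. Illustrative sketch only;
// not the actual Beats code.
func fingerprint(event map[string]string, fields []string) string {
	h := sha256.New()
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator between field name and value
		h.Write([]byte(event[f]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Two distinct log lines from the same file...
	e1 := map[string]string{"source": "/var/log/app.log", "message": "started"}
	e2 := map[string]string{"source": "/var/log/app.log", "message": "stopped"}

	// ...but only "source" is hashed, so the _ids collide and the second
	// event would silently replace the first in the index.
	fields := []string{"source"}
	fmt.Println(fingerprint(e1, fields) == fingerprint(e2, fields)) // true
}
```

Including `message` (or an offset) in the hashed fields restores distinct IDs, which is why the choice of fields is the trappy part.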
Unless I'm missing something, filebeat keeps track of how far it has read into each file, and only advances that position for events that shipped successfully. Handling the scenario of re-running the same file multiple times without creating dupes is, I believe, not a concern here. It's not like that functionality exists today, so this would still be a step forward for the vast majority of users.
The way forward here, architecturally, across all the various Beats is to rely on spooling and generate the ID once. We can't rely on one specific beat, like filebeat, even though it still doesn't apply (but that's irrelevant). I am not sure what additional benefit fingerprinting will bring in Beats if it can be done in ingest.
I agree that if we can do the fingerprinting in an ingest pipeline, it's not necessary to have the feature in Beats. I wasn't aware of elastic/elasticsearch#16938 when I opened this PR.
Some config examples:
Benchmarks: