Skip to content

Fingerprint ingest processor#68415

Merged
danhermann merged 6 commits intoelastic:masterfrom
danhermann:53578_fingerprint_processor
Feb 9, 2021
Merged

Fingerprint ingest processor#68415
danhermann merged 6 commits intoelastic:masterfrom
danhermann:53578_fingerprint_processor

Conversation

@danhermann
Copy link
Copy Markdown
Contributor

Adds a fingerprint processor that computes hashes of document content for content fingerprinting use cases.

E.g.:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "fingerprint": {
          "fields": ["user"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "user": {
          "last_name": "Smith",
          "first_name": "John",
          "date_of_birth": "1980-01-15",
          "is_active": true
        }
      }
    }
  ]
}

Which produces:

"_source" : {
  "fingerprint" : "WbSUPW4zY1PBPehh2AA/sSxiRjw=",
  "user" : {
    "last_name" : "Smith",
    "first_name" : "John",
    "date_of_birth" : "1980-01-15",
    "is_active" : true
  }
}

Supports any number of document fields, nested document content, any hash from [MD5, SHA-1, SHA-256, SHA-512], and a per-processor salt.

Closes #53578 though it addresses only content fingerprinting and not anonymization use cases.

@danhermann danhermann added >feature :Distributed/Ingest Node Execution or management of Ingest Pipelines v8.0.0 v7.12.0 labels Feb 2, 2021
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Feb 2, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann
Copy link
Copy Markdown
Contributor Author

@graphaelli, could you or a member of your team look this over to verify that it provides the functionality that the APM use case requires?

@danhermann
Copy link
Copy Markdown
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@graphaelli
Copy link
Copy Markdown
Member

Thanks @danhermann! @elastic/apm-server - can you provide feedback here?

@mark-vieira
Copy link
Copy Markdown
Contributor

@elasticmachine update branch

@axw
Copy link
Copy Markdown
Member

axw commented Feb 3, 2021

@danhermann thank you, this will do nicely for APM.

We'll need MD5 (which I see is available) for now, but later I would like to migrate to something faster, such as Murmurhash3 (non-cryptographic would be fine). Any reason why we couldn't add that later? Doesn't look like it to me, just wanted to double check.

@danhermann
Copy link
Copy Markdown
Contributor Author

@danhermann thank you, this will do nicely for APM.

We'll need MD5 (which I see is available) for now, but later I would like to migrate to something faster, such as Murmurhash3 (non-cryptographic would be fine). Any reason why we couldn't add that later? Doesn't look like it to me, just wanted to double check.

Yes, adding support for the murmur3 hash should be doable.

@dakrone dakrone self-requested a review February 4, 2021 16:44
Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @danhermann, I left a few really minor comments, but I think we need to add regular docs also right?

Comment on lines +98 to +101
var setList = set.stream().sorted().collect(Collectors.toList());
for (int k = setList.size() - 1; k >= 0; k--) {
values.push(setList.get(k));
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streams are super great, but they aren't quite as fast compared to foreach clauses yet, and since this is performance sensitive code, what do you think about doing:

Suggested change
var setList = set.stream().sorted().collect(Collectors.toList());
for (int k = setList.size() - 1; k >= 0; k--) {
values.push(setList.get(k));
}
var setList = new ArrayList<?>(set);
setList.sort(Comparator.naturalOrder());
for (Object thing : setList) {
values.push(thing);
}

?

(I might be missing something about why you're using a for loop indexed rather than foreach, as I think the foreach compiles to the same bytecode, but I could totally be wrong)

Copy link
Copy Markdown
Contributor Author

@danhermann danhermann Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for loop iterates backward over the sorted list to push the values onto the stack. That means that the values are processed in sorted order after being popped off the stack which is not strictly necessary since only a stable hash is required, but having the values in sorted order is certainly nicer when debugging.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did replace the usage of streams both here and below.

@danhermann
Copy link
Copy Markdown
Contributor Author

Thanks for the review, @dakrone. I'll open a separate PR with the docs for this one shortly.

@danhermann
Copy link
Copy Markdown
Contributor Author

@dakrone, I've updated this with all your suggestions except the indexed for loop. Let me know if you're ok with the reverse iteration so elements from the stack are processed in sorted order.

@danhermann
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@danhermann
Copy link
Copy Markdown
Contributor Author

@elasticmachine run elasticsearch-ci/2

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'm totally cool with the reverse ordering, thanks for iterating!

@danhermann
Copy link
Copy Markdown
Contributor Author

cc: @elastic/es-ui in case Kibana auto-complete needs to be updated with this new processor.

@danhermann danhermann merged commit 0083e9c into elastic:master Feb 9, 2021
@danhermann danhermann deleted the 53578_fingerprint_processor branch February 9, 2021 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Ingest Node Execution or management of Ingest Pipelines >feature Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v7.12.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add "fingerprint" ingest processor

7 participants