Add automatic X-Opaque-Id headers to datastore msearch requests#1135
Add automatic X-Opaque-Id headers to datastore msearch requests#1135
Conversation
myronmarston
left a comment
There was a problem hiding this comment.
Left some mild feedback. I'm also curious if you can describe how you plan to use these ids (as I've not seen them before)? During an incident will you list the tasks on nodes having problems (which would include the opaque ids, I guess?) and then use those opaque ids to cancel requests? Are the properties that matter uniqueness and ability to correlate it to a client of your application so that you can understand the consequences of cancelling a task?
2fcc7a8 to
ac08f3c
Compare
4124689 to
f6eaa34
Compare
Elasticsearch and OpenSearch propagate the X-Opaque-Id HTTP header into the tasks API (GET _tasks?detailed=true), search slow logs, and deprecation logs. For any non-trivial deployment, being able to attribute stuck/slow/expensive datastore work back to the originating application is essential operational visibility. Operators can locate a specific task by opaque id in the tasks API output and cancel it individually via POST _tasks/<task_id>/_cancel. Motivated by a production incident on an OpenSearch 2.19 cluster where a single data node ended up holding 13 stuck search tasks for hours; without X-Opaque-Id we could not identify the originating application. Changes: - Add ElasticGraph::Support::OpaqueID.build_header(parts): joins string parts with ";", drops empty values, and replaces ";" / "\r" / "\n" inside parts with "," so each part stays a single segment. - Add OPAQUE_ID_HEADER = "X-Opaque-Id" constant in elastic_graph/constants. - DatastoreSearchRouter#msearch accepts an opaque_id_parts: kwarg (default ["elasticgraph-graphql"]) and sets X-Opaque-Id when the computed value is non-empty. - QueryExecutor sets :elastic_graph_client in the resolver context so QuerySource.execute_many can derive opaque_id_parts of ["elasticgraph-graphql", "client=<name>", *client.extra_opaque_id_parts, "query=<fingerprint>"] for datastore msearch calls. - Client (Data.define) gains extra_opaque_id_parts (default []) so an HTTPRequest resolver can attach per-client tags (e.g. tenant=acme). - HealthChecker tags its msearch with ["elasticgraph-health_check"]. - DatastoreIndexingRouter#source_event_versions_in_index tags its msearch with ["elasticgraph-indexer", "purpose=source_event_versions", "cluster=<name>"]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f6eaa34 to
6b3c83f
Compare
There was a problem hiding this comment.
Would it be useful to set OPAQUE_ID from the bulk calls as well?
| OPAQUE_ID_HEADER => Support::OpaqueID.build_header([ | ||
| "elasticgraph-indexer", | ||
| "purpose=source_event_versions", | ||
| "cluster=#{client_name}" |
There was a problem hiding this comment.
It's not clear to me why cluster is useful here. At the point you're seeing these opaque ids (e.g. in an OpenSearch task list) you'll know what cluster you're hitting, right? Or are you expecting these opaque ids to wind up in a centralized logging system that contains opaque ids data from multiple clusters?
In terms of other potential useful bits of data to include...maybe include the count of operations per type? e.g. widget_ops=7;team_ops=12
That would indicate at a glance how large of request it was and which types the request is for.
|
|
||
| def datastore_msearch(queries) | ||
| @datastore_search_router.msearch(queries) | ||
| @datastore_search_router.msearch(queries, opaque_id_parts: ["elasticgraph-health_check"]) |
There was a problem hiding this comment.
Is it worth including of types being queried here? You could change the caller from datastore_msearch(recency_queries_by_type_name.values) to datastore_msearch(recency_queries_by_type_name) and then you'd have the queried types as keys and the queries themselves as values. The types could nicely go in the opaque id.
| ) | ||
|
|
||
| @datastore_router.msearch([query]).fetch(query).documents.first | ||
| @datastore_router.msearch([query], opaque_id_parts: ["elasticgraph-apollo", "resolver=product"]).fetch(query).documents.first |
There was a problem hiding this comment.
This doesn't ship as part of ElasticGraph--it's just part of the implementation we use to run the apollo federation subgraph test suite against elasticgraph-apollo. As such, it only runs locally and on CI and I think we can simplify and revert this.
| def msearch: ( | ||
| ::Array[DatastoreQuery], | ||
| ?query_tracker: QueryDetailsTracker, | ||
| ?opaque_id_parts: ::Array[::String] |
There was a problem hiding this comment.
| ?opaque_id_parts: ::Array[::String] | |
| ?opaque_id_parts: ::Array[::String?] |
...since Support::OpaqueID.build_header accepts an array of ::String?.
| # timeout in a specific header that the `SupportTimeouts` Faraday middleware will use. | ||
| headers = {TIMEOUT_MS_HEADER => msearch_request_timeout_from(queries)&.to_s}.compact | ||
| opaque_id = Support::OpaqueID.build_header(opaque_id_parts) | ||
| headers[OPAQUE_ID_HEADER] = opaque_id unless opaque_id.empty? |
There was a problem hiding this comment.
It feels odd for build_header to return an empty string. Should we have it return nil if it can't build anything meaningful? Then this could be:
if (opaque_id = Support::OpaqueID.build_header(opaque_id_parts))
headers[OPAQUE_ID_HEADER] = opaque_id
end|
|
||
| # `:elastic_graph_client` is set by `QueryExecutor` but not by all entry points | ||
| # (e.g. Apollo's `_entities` resolver dispatches through `execute_many` without | ||
| # going through `QueryExecutor` first). When it's absent we fall back to a |
There was a problem hiding this comment.
Apollo's
_entitiesresolver dispatches throughexecute_manywithout
going throughQueryExecutorfirst
I don't think this is accurate. QueryExecutor is the entry point before any of the resolvers get involved. It delegates to Query#result which invokes all the configured resolvers. When the resolvers execute, they get context which has al the context entries set by QueryExecutor. I'd expect for_context[:elastic_graph_client] to be populated when we get here via the Apollo _entities resolver.
What am I missing? (Or did an LLM hallucinate this?)
| parts = QuerySource.send(:datastore_opaque_id_parts_for, for_context) | ||
|
|
||
| expect(parts).to eq(["elasticgraph-graphql", "client=client-name", "query=(no query string)"]) | ||
| end |
There was a problem hiding this comment.
If we're going to test query_string being nil, it would be good to also include a test for when query_string is non-nil.
That said, the original reason we did query.query_string ? query.fingerprint : "(no query string)" for the fingerprint no longer applies. It was a work around from back when I reported this bug:
That was fixed in 2024 (and we are more than up-to-date with the GraphQL gem) so I think we can remove that bit of conditional logic entirely, from both query_executor.rb and from query_source.rb. We can just use query.fingerprint with no need to check query.string. Then this test can probably go away.
| # `datastore_opaque_id_parts_for` is driven off `for_context[:elastic_graph_client]`, | ||
| # which `QueryExecutor` sets for every query it runs. Some entry points (notably | ||
| # Apollo's `_entities` resolver test harness) reach `QuerySource.execute_many` | ||
| # without going through `QueryExecutor`, so the fallback needs to keep working. |
There was a problem hiding this comment.
Some entry points (notably Apollo's
_entitiesresolver test harness) reachQuerySource.execute_manywithout going throughQueryExecutor
This may need to be updated if (as per my comment above) it's inaccurate.
| it "includes `(no query string)` in place of the fingerprint when the query has no source string" do | ||
| client = Client.new(source_description: "client-description", name: "client-name") | ||
| graphql_query = instance_double(::GraphQL::Query, query_string: nil, fingerprint: "unused") | ||
| for_context = instance_double(::GraphQL::Query::Context, :[] => client, :query => graphql_query) |
There was a problem hiding this comment.
It's a little weird to stub :[] to always return client--if the implementation changes to access different entries this spec is going to act weird.
It's not too hard to construct a Context--we do it in some other specs (if you search for Context.new you'll see how). I'd prefer using the real thing here.
Elasticsearch and OpenSearch propagate the
X-Opaque-IdHTTP header into the tasks API, slow logs, and deprecation logs. This PR now makes ElasticGraph set useful opaque ids itself for datastore_msearchtraffic instead of offering an opt-in Faraday middleware that applications wire manually.What changed
ElasticGraph::Support::FaradayMiddleware::OpaqueIdmiddleware and its opt-indatastore_client_customization_blockdesign.ElasticGraph::GraphQL::Client#extra_opaque_id_partsas the GraphQL customization hook.elasticgraph-graphqlnow automatically includesX-Opaque-Idon datastore_msearchrequests using stable EG-owned parts:elasticgraph-graphqlclient=<client.name>client.extra_opaque_id_partsquery=<fingerprint>QueryExecutor->QuerySource->DatastoreSearchRouterrather than relying on Faraday env customization._msearchcall sites:elasticgraph-health_check->elasticgraph-health_checkelasticgraph-apollo;resolver=productelasticgraph-indexer#source_event_versions_in_index->elasticgraph-indexer;purpose=source_event_versions;cluster=<cluster>Why this shape
_bulkfor now. Indexer write traffic still needs a better request-identity model before we choose default opaque-id semantics there.Notes
ElasticGraph::Support::OpaqueIDformatter remains for assembling and sanitizing header values.X-Opaque-Idheader name now lives inelastic_graph/constants.Verification
GraphQL::Clientnormalization, health-check_msearch, and indexer source-event-version lookups.ruby -cpassed on the touched Ruby files.git diff --checkis clean.aws_lambda_ric,faker,httpx, andnokogiri.