Skip to content

Add automatic X-Opaque-Id headers to datastore msearch requests#1135

Open
jwils wants to merge 2 commits intomainfrom
joshuaw/add-opaque-id-middleware
Open

Add automatic X-Opaque-Id headers to datastore msearch requests#1135
jwils wants to merge 2 commits intomainfrom
joshuaw/add-opaque-id-middleware

Conversation

@jwils
Copy link
Copy Markdown
Collaborator

@jwils jwils commented Apr 17, 2026

Elasticsearch and OpenSearch propagate the X-Opaque-Id HTTP header into the tasks API, slow logs, and deprecation logs. This PR now makes ElasticGraph set useful opaque ids itself for datastore _msearch traffic instead of offering an opt-in Faraday middleware that applications wire manually.

What changed

  • Removes the standalone ElasticGraph::Support::FaradayMiddleware::OpaqueId middleware and its opt-in datastore_client_customization_block design.
  • Adds ElasticGraph::GraphQL::Client#extra_opaque_id_parts as the GraphQL customization hook.
  • elasticgraph-graphql now automatically includes X-Opaque-Id on datastore _msearch requests using stable EG-owned parts:
    • elasticgraph-graphql
    • client=<client.name>
    • any client.extra_opaque_id_parts
    • query=<fingerprint>
  • Threads those parts through QueryExecutor -> QuerySource -> DatastoreSearchRouter rather than relying on Faraday env customization.
  • Adds coarse opaque ids for other EG-owned _msearch call sites:
    • elasticgraph-health_check -> elasticgraph-health_check
    • Apollo test resolver -> elasticgraph-apollo;resolver=product
    • elasticgraph-indexer#source_event_versions_in_index -> elasticgraph-indexer;purpose=source_event_versions;cluster=<cluster>
  • Uses stable/readable values only; no per-request random suffix.

Why this shape

  • Addresses the review feedback that the original PR was useful but did not really belong in ElasticGraph as generic opt-in support code.
  • Keeps the feature in EG request-layer abstractions where EG already knows meaningful request metadata.
  • Avoids _bulk for now. Indexer write traffic still needs a better request-identity model before we choose default opaque-id semantics there.

Notes

  • A small internal ElasticGraph::Support::OpaqueID formatter remains for assembling and sanitizing header values.
  • The shared X-Opaque-Id header name now lives in elastic_graph/constants.

Verification

  • Updated unit/spec coverage for GraphQL routing, GraphQL::Client normalization, health-check _msearch, and indexer source-event-version lookups.
  • ruby -c passed on the touched Ruby files.
  • git diff --check is clean.
  • I could not run the full Ruby test suite in this workspace because the local bundle is incomplete and is missing gems such as aws_lambda_ric, faker, httpx, and nokogiri.

Copy link
Copy Markdown
Collaborator

@myronmarston myronmarston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some mild feedback. I'm also curious if you can describe how you plan to use these ids (as I've not seen them before)? During an incident will you list the tasks on nodes having problems (which would include the opaque ids, I guess?) and then use those opaque ids to cancel requests? Are the properties that matter uniqueness and ability to correlate it to a client of your application so that you can understand the consequences of cancelling a task?

Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
Comment thread elasticgraph-support/lib/elastic_graph/support/faraday_middleware/opaque_id.rb Outdated
@jwils jwils force-pushed the joshuaw/add-opaque-id-middleware branch 2 times, most recently from 2fcc7a8 to ac08f3c Compare April 21, 2026 20:52
@jwils jwils changed the title Add Faraday middleware for tagging requests with X-Opaque-Id Add automatic X-Opaque-Id headers to datastore msearch requests Apr 22, 2026
Comment thread elasticgraph-graphql/lib/elastic_graph/graphql/resolvers/query_source.rb Outdated
Comment thread elasticgraph-graphql/lib/elastic_graph/graphql/client.rb Outdated
Comment thread elasticgraph-graphql/lib/elastic_graph/graphql/client.rb Outdated
Comment thread elasticgraph-graphql/lib/elastic_graph/graphql/client.rb Outdated
Comment thread elasticgraph-graphql/lib/elastic_graph/graphql/client.rb Outdated
Comment thread elasticgraph-graphql/sig/elastic_graph/graphql/client.rbs
Comment thread elasticgraph-graphql/sig/elastic_graph/graphql/datastore_search_router.rbs Outdated
Comment thread elasticgraph-graphql/spec/unit/elastic_graph/graphql/client_spec.rb Outdated
Comment thread elasticgraph-graphql/spec/unit/elastic_graph/graphql/client_spec.rb Outdated
Comment thread elasticgraph-graphql/spec/unit/elastic_graph/graphql/client_spec.rb Outdated
@jwils jwils force-pushed the joshuaw/add-opaque-id-middleware branch 5 times, most recently from 4124689 to f6eaa34 Compare April 23, 2026 16:45
Elasticsearch and OpenSearch propagate the X-Opaque-Id HTTP header into
the tasks API (GET _tasks?detailed=true), search slow logs, and
deprecation logs. For any non-trivial deployment, being able to attribute
stuck/slow/expensive datastore work back to the originating application
is essential operational visibility. Operators can locate a specific task
by opaque id in the tasks API output and cancel it individually via
POST _tasks/<task_id>/_cancel.

Motivated by a production incident on an OpenSearch 2.19 cluster where a
single data node ended up holding 13 stuck search tasks for hours;
without X-Opaque-Id we could not identify the originating application.

Changes:
- Add ElasticGraph::Support::OpaqueID.build_header(parts): joins string
  parts with ";", drops empty values, and replaces ";" / "\r" / "\n"
  inside parts with "," so each part stays a single segment.
- Add OPAQUE_ID_HEADER = "X-Opaque-Id" constant in elastic_graph/constants.
- DatastoreSearchRouter#msearch accepts an opaque_id_parts: kwarg
  (default ["elasticgraph-graphql"]) and sets X-Opaque-Id when the
  computed value is non-empty.
- QueryExecutor sets :elastic_graph_client in the resolver context so
  QuerySource.execute_many can derive opaque_id_parts of
  ["elasticgraph-graphql", "client=<name>", *client.extra_opaque_id_parts,
   "query=<fingerprint>"] for datastore msearch calls.
- Client (Data.define) gains extra_opaque_id_parts (default []) so an
  HTTPRequest resolver can attach per-client tags (e.g. tenant=acme).
- HealthChecker tags its msearch with ["elasticgraph-health_check"].
- DatastoreIndexingRouter#source_event_versions_in_index tags its msearch
  with ["elasticgraph-indexer", "purpose=source_event_versions",
  "cluster=<name>"].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jwils jwils force-pushed the joshuaw/add-opaque-id-middleware branch from f6eaa34 to 6b3c83f Compare April 28, 2026 14:26
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to set OPAQUE_ID from the bulk calls as well?

OPAQUE_ID_HEADER => Support::OpaqueID.build_header([
"elasticgraph-indexer",
"purpose=source_event_versions",
"cluster=#{client_name}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why cluster is useful here. At the point you're seeing these opaque ids (e.g. in an OpenSearch task list) you'll know what cluster you're hitting, right? Or are you expecting these opaque ids to wind up in a centralized logging system that contains opaque ids data from multiple clusters?

In terms of other potential useful bits of data to include...maybe include the count of operations per type? e.g. widget_ops=7;team_ops=12

That would indicate at a glance how large of request it was and which types the request is for.


def datastore_msearch(queries)
@datastore_search_router.msearch(queries)
@datastore_search_router.msearch(queries, opaque_id_parts: ["elasticgraph-health_check"])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth including of types being queried here? You could change the caller from datastore_msearch(recency_queries_by_type_name.values) to datastore_msearch(recency_queries_by_type_name) and then you'd have the queried types as keys and the queries themselves as values. The types could nicely go in the opaque id.

)

@datastore_router.msearch([query]).fetch(query).documents.first
@datastore_router.msearch([query], opaque_id_parts: ["elasticgraph-apollo", "resolver=product"]).fetch(query).documents.first
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't ship as part of ElasticGraph--it's just part of the implementation we use to run the apollo federation subgraph test suite against elasticgraph-apollo. As such, it only runs locally and on CI and I think we can simplify and revert this.

def msearch: (
::Array[DatastoreQuery],
?query_tracker: QueryDetailsTracker,
?opaque_id_parts: ::Array[::String]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
?opaque_id_parts: ::Array[::String]
?opaque_id_parts: ::Array[::String?]

...since Support::OpaqueID.build_header accepts an array of ::String?.

# timeout in a specific header that the `SupportTimeouts` Faraday middleware will use.
headers = {TIMEOUT_MS_HEADER => msearch_request_timeout_from(queries)&.to_s}.compact
opaque_id = Support::OpaqueID.build_header(opaque_id_parts)
headers[OPAQUE_ID_HEADER] = opaque_id unless opaque_id.empty?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels odd for build_header to return an empty string. Should we have it return nil if it can't build anything meaningful? Then this could be:

if (opaque_id = Support::OpaqueID.build_header(opaque_id_parts))
  headers[OPAQUE_ID_HEADER] = opaque_id
end


# `:elastic_graph_client` is set by `QueryExecutor` but not by all entry points
# (e.g. Apollo's `_entities` resolver dispatches through `execute_many` without
# going through `QueryExecutor` first). When it's absent we fall back to a
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apollo's _entities resolver dispatches through execute_many without
going through QueryExecutor first

I don't think this is accurate. QueryExecutor is the entry point before any of the resolvers get involved. It delegates to Query#result which invokes all the configured resolvers. When the resolvers execute, they get context which has al the context entries set by QueryExecutor. I'd expect for_context[:elastic_graph_client] to be populated when we get here via the Apollo _entities resolver.

What am I missing? (Or did an LLM hallucinate this?)

parts = QuerySource.send(:datastore_opaque_id_parts_for, for_context)

expect(parts).to eq(["elasticgraph-graphql", "client=client-name", "query=(no query string)"])
end
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to test query_string being nil, it would be good to also include a test for when query_string is non-nil.

That said, the original reason we did query.query_string ? query.fingerprint : "(no query string)" for the fingerprint no longer applies. It was a work around from back when I reported this bug:

rmosolgo/graphql-ruby#4942

That was fixed in 2024 (and we are more than up-to-date with the GraphQL gem) so I think we can remove that bit of conditional logic entirely, from both query_executor.rb and from query_source.rb. We can just use query.fingerprint with no need to check query.string. Then this test can probably go away.

# `datastore_opaque_id_parts_for` is driven off `for_context[:elastic_graph_client]`,
# which `QueryExecutor` sets for every query it runs. Some entry points (notably
# Apollo's `_entities` resolver test harness) reach `QuerySource.execute_many`
# without going through `QueryExecutor`, so the fallback needs to keep working.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some entry points (notably Apollo's _entities resolver test harness) reach QuerySource.execute_many without going through QueryExecutor

This may need to be updated if (as per my comment above) it's inaccurate.

it "includes `(no query string)` in place of the fingerprint when the query has no source string" do
client = Client.new(source_description: "client-description", name: "client-name")
graphql_query = instance_double(::GraphQL::Query, query_string: nil, fingerprint: "unused")
for_context = instance_double(::GraphQL::Query::Context, :[] => client, :query => graphql_query)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little weird to stub :[] to always return client--if the implementation changes to access different entries this spec is going to act weird.

It's not too hard to construct a Context--we do it in some other specs (if you search for Context.new you'll see how). I'd prefer using the real thing here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants