[Rollup] Add more diagnostic stats to job#35471
Conversation
To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for search and bulk phase. This latency is the total service time including transfer between nodes, not just the `took` time. It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster
|
Pinging @elastic/es-search-aggs |
| static final ParseField BULK_LATENCY = new ParseField("bulk_latency_in_ms"); | ||
| static final ParseField SEARCH_LATENCY = new ParseField("search_latency_in_ms"); | ||
| static final ParseField SEARCH_FAILURES = new ParseField("search_failures"); | ||
| static final ParseField BULK_FAILURES = new ParseField("bulk_failures"); |
There was a problem hiding this comment.
Nit: I think it would be nicer to call it INDEX_FAILURES, BULK is an implementation detail about how indexing is internally implemented.
|
nice addition! |
jimczi
left a comment
There was a problem hiding this comment.
The change looks good @polyfractal . I left some comments
| static final ParseField MAX = new ParseField("max"); | ||
| static final ParseField AVG = new ParseField("avg"); | ||
| static final ParseField COUNT = new ParseField("count"); | ||
| static final ParseField TOTAL = new ParseField("total"); |
There was a problem hiding this comment.
To be consistent with the _stats API can we call these bulk_time_in_millis and query_time_in_millis ? I am also not sure if we need the min, the max and the avg. It should be enough to have the total time spent in these operations and the number of calls per action ?
There was a problem hiding this comment.
@polyfractal Are MIN, ..., ..., TOTAL leftovers from previous iterations? They look unused to me.
| "trigger_count" : 0 | ||
| "trigger_count" : 0, | ||
| "bulk_failures": 0, | ||
| "bulk_latency_in_ms": { |
There was a problem hiding this comment.
Can we simplify this to:
"bulk_time_in_ms": 0,
"bulk_total": 0,
"search_time_in_ms": 0,
"search_total": 0
?
I don't think we need more than the total time and the number of invocations.
There was a problem hiding this comment.
Sure, I can simplify these. :)
| try { | ||
| stats.markStartSearch(); | ||
| doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc))); | ||
| } catch (Exception e) { |
There was a problem hiding this comment.
missing stats.incrementSearchFailures() ?
|
Thanks @hendrikmuhs @jimczi! |
This adds some new statistics to the job to help with debugging performance issues: - Total search and index time (in milliseconds) encounteed by the indexer during runtime. This time is the total service time including transfer between nodes, not just the `took` time. - Total count of search and index requests. Together with the total times, this can be used to determine average request time. - Count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster
* master: DOCS Audit event attributes in new format (elastic#35510) Scripting: Actually add joda time back to whitelist (elastic#35965) [DOCS] fix HLRC ILM doc misreferenced tag Add realm information for Authenticate API (elastic#35648) [ILM] add HLRC docs to remove-policy-from-index (elastic#35759) [Rollup] Update serialization version after backport [Rollup] Add more diagnostic stats to job (elastic#35471) Build: Fix gradle build for Mac OS (elastic#35968) Adds deprecation logging to ScriptDocValues#getValues. (elastic#34279) [Monitoring] Make Exporters Async (elastic#35765) [ILM] reduce time restriction on IndexLifecycleExplainResponse (elastic#35954) Remove use of AbstractComponent in xpack (elastic#35394) Deprecate types in search and multi search templates. (elastic#35669) Remove fromXContent from IndexUpgradeInfoResponse (elastic#35934)
To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for search and bulk phase. This latency is the total service time including transfer between nodes, not just the
tooktime.It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.
Also updates the HLRC with the new response elements.
/cc @hendrikmuhs This adds the stats to the
IndexerJobStatssuperclass, although all the xcontent stuff is done in the Rollup implementation