[Rollup] Add more diagnostic stats to job by polyfractal · Pull Request #35471 · elastic/elasticsearch

polyfractal · 2018-11-12T19:58:19Z

To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for the search and bulk phases. This latency is the total service time including transfer between nodes, not just the `took` time.

It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.

Also updates the HLRC with the new response elements.

/cc @hendrikmuhs This adds the stats to the IndexerJobStats superclass, although all the xcontent stuff is done in the Rollup implementation
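As a rough illustration of the kind of bookkeeping described above, here is a minimal sketch of a latency accumulator that tracks min/max/avg/count/total in milliseconds. The class name `LatencyStats` and its methods are hypothetical and are not the actual `IndexerJobStats` code from this PR.

```java
// Hypothetical sketch: NOT the IndexerJobStats implementation from this PR.
// Tracks count, total, min, and max service time; the average is derived on read.
final class LatencyStats {
    private long count;
    private long totalMs;
    private long minMs = Long.MAX_VALUE;
    private long maxMs = Long.MIN_VALUE;

    /** Records the service time of one completed request, in milliseconds. */
    synchronized void record(long elapsedMs) {
        count++;
        totalMs += elapsedMs;
        minMs = Math.min(minMs, elapsedMs);
        maxMs = Math.max(maxMs, elapsedMs);
    }

    synchronized long count() { return count; }

    synchronized long totalMs() { return totalMs; }

    // Report 0 rather than the sentinel values when nothing has been recorded yet.
    synchronized long minMs() { return count == 0 ? 0 : minMs; }

    synchronized long maxMs() { return count == 0 ? 0 : maxMs; }

    synchronized double avgMs() { return count == 0 ? 0.0 : (double) totalMs / count; }

    public static void main(String[] args) {
        LatencyStats search = new LatencyStats();
        // The caller measures wall-clock time around the whole request,
        // so the value includes transfer between nodes.
        long start = System.nanoTime();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        search.record(elapsedMs);
        System.out.println("count=" + search.count() + " avg=" + search.avgMs());
    }
}
```

Measuring around the whole request on the caller's side is what makes the number a service time rather than the `took` value reported in the response.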


elasticmachine · 2018-11-12T19:58:21Z

Pinging @elastic/es-search-aggs

hendrikmuhs · 2018-11-13T13:10:57Z

client/rest-high-level/src/main/java/org/elasticsearch/client/rollup/GetRollupJobResponse.java

+    static final ParseField BULK_LATENCY = new ParseField("bulk_latency_in_ms");
+    static final ParseField SEARCH_LATENCY = new ParseField("search_latency_in_ms");
+    static final ParseField SEARCH_FAILURES = new ParseField("search_failures");
+    static final ParseField BULK_FAILURES = new ParseField("bulk_failures");


Nit: I think it would be nicer to call it INDEX_FAILURES. BULK is an implementation detail of how indexing is done internally.

hendrikmuhs · 2018-11-13T13:18:47Z

nice addition!

jimczi

The change looks good, @polyfractal. I left some comments.

jimczi · 2018-11-15T00:14:40Z

client/rest-high-level/src/main/java/org/elasticsearch/client/rollup/GetRollupJobResponse.java

+    static final ParseField MAX = new ParseField("max");
+    static final ParseField AVG = new ParseField("avg");
+    static final ParseField COUNT = new ParseField("count");
+    static final ParseField TOTAL = new ParseField("total");


To be consistent with the _stats API, can we call these bulk_time_in_millis and query_time_in_millis? I am also not sure we need the min, the max, and the avg. It should be enough to have the total time spent in these operations and the number of calls per action.

@polyfractal Are MIN, ..., ..., TOTAL leftovers from previous iterations? They look unused to me.

Right you are!

jimczi · 2018-11-15T00:18:32Z

docs/reference/rollup/apis/get-job.asciidoc

-            "trigger_count" : 0
+            "trigger_count" : 0,
+            "bulk_failures": 0,
+            "bulk_latency_in_ms": {


Can we simplify this to the following?

"bulk_time_in_ms": 0, "bulk_total": 0, "search_time_in_ms": 0, "search_total": 0

I don't think we need more than the total time and the number of invocations.

Sure, I can simplify these. :)

jimczi · 2018-11-15T00:20:17Z

...ck/plugin/core/src/main/java/org/elasticsearch/xpack/core/indexing/AsyncTwoPhaseIndexer.java

                    try {
+                        stats.markStartSearch();
                        doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc)));
                    } catch (Exception e) {


Missing stats.incrementSearchFailures()?

++ good catch
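To make the review point concrete, the sketch below shows a failure counter being bumped when dispatching a search throws synchronously. Only the method names `markStartSearch` and `incrementSearchFailures` come from the thread. The `Stats` class, the `SearchAction` interface, and `dispatchSearch` are hypothetical stand-ins, and the real catch block in `AsyncTwoPhaseIndexer` presumably does more than count.

```java
// Hypothetical sketch of the pattern discussed in the review; not the actual
// AsyncTwoPhaseIndexer code.
final class SearchFailureSketch {

    /** Stand-in for the stats object; only the two method names come from the review. */
    static final class Stats {
        long searchStarts;
        long searchFailures;

        void markStartSearch() { searchStarts++; }

        void incrementSearchFailures() { searchFailures++; }
    }

    /** Stand-in for the call that builds and dispatches the search request. */
    interface SearchAction {
        void run() throws Exception;
    }

    static void dispatchSearch(Stats stats, SearchAction search) {
        try {
            stats.markStartSearch();
            search.run();
        } catch (Exception e) {
            // Without this line, a search that fails while being dispatched
            // would never show up in the failure counter.
            stats.incrementSearchFailures();
        }
    }

    public static void main(String[] args) {
        Stats stats = new Stats();
        dispatchSearch(stats, () -> { });
        dispatchSearch(stats, () -> { throw new IllegalStateException("simulated failure"); });
        System.out.println("starts=" + stats.searchStarts + " failures=" + stats.searchFailures);
    }
}
```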

hendrikmuhs

LGTM

polyfractal · 2018-11-27T20:47:46Z

Thanks @hendrikmuhs @jimczi!

This adds some new statistics to the job to help with debugging performance issues:

- Total search and index time (in milliseconds) encountered by the indexer during runtime. This time is the total service time including transfer between nodes, not just the `took` time.
- Total count of search and index requests. Together with the total times, this can be used to determine average request time.
- Count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.
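The average request time mentioned in the commit message can be derived by dividing a total time by the matching request count. The helper below is a minimal sketch of that arithmetic. The class and method names are hypothetical, and the inputs stand for whatever total-time and total-count values the job stats report.

```java
// Hypothetical helper: derives an average from the two totals described above.
final class RollupStatsMath {

    /**
     * Average service time per request in milliseconds.
     * Returns 0.0 when no requests have been counted, to avoid dividing by zero.
     */
    static double averageMillis(long totalTimeMs, long totalRequests) {
        if (totalRequests <= 0) {
            return 0.0;
        }
        return (double) totalTimeMs / totalRequests;
    }

    public static void main(String[] args) {
        // Example values only; they are not taken from a real job.
        System.out.println(averageMillis(1200, 4));
    }
}
```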

* master:
  DOCS Audit event attributes in new format (elastic#35510)
  Scripting: Actually add joda time back to whitelist (elastic#35965)
  [DOCS] fix HLRC ILM doc misreferenced tag
  Add realm information for Authenticate API (elastic#35648)
  [ILM] add HLRC docs to remove-policy-from-index (elastic#35759)
  [Rollup] Update serialization version after backport
  [Rollup] Add more diagnostic stats to job (elastic#35471)
  Build: Fix gradle build for Mac OS (elastic#35968)
  Adds deprecation logging to ScriptDocValues#getValues. (elastic#34279)
  [Monitoring] Make Exporters Async (elastic#35765)
  [ILM] reduce time restriction on IndexLifecycleExplainResponse (elastic#35954)
  Remove use of AbstractComponent in xpack (elastic#35394)
  Deprecate types in search and multi search templates. (elastic#35669)
  Remove fromXContent from IndexUpgradeInfoResponse (elastic#35934)

polyfractal added the >enhancement, v7.0.0, :StorageEngine/Rollup, and v6.6.0 labels Nov 12, 2018

hendrikmuhs reviewed Nov 13, 2018


polyfractal requested review from hendrikmuhs and jimczi November 14, 2018 15:03

jimczi reviewed Nov 15, 2018


Zachary Tong added 4 commits November 21, 2018 15:08

review cleanup

6505b73

Merge remote-tracking branch 'origin/master' into rollup_more_stats

6b25535

Remove dead ParseFields

1856fbe

Merge remote-tracking branch 'origin/master' into rollup_more_stats

98e90ed

hendrikmuhs approved these changes Nov 26, 2018


polyfractal merged commit 48fa251 into elastic:master Nov 27, 2018

hendrikmuhs mentioned this pull request Nov 28, 2018

[ML-DataFrame] add a stats endpoint #35911

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

codebrain mentioned this pull request May 21, 2019

Additional Rollup Stats elastic/elasticsearch-net#3759

Merged
