More debugging info for significant_text #72727
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics)
Adds some extra debugging information to make it clear that you are running `significant_text`. Also adds timing information around the `_source` fetch and the `terms` accumulation. This lets you calculate a third useful timing number: the analysis time. It is `collect_ns - fetch_ns - accumulation_ns`. This also adds a half dozen extra REST tests to get a *fairly* comprehensive set of the operations this supports. It doesn't cover all of the significance heuristic parsing, but it's certainly much better than what we had.
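The analysis-time arithmetic above is easy to sketch. This is a minimal, purely illustrative helper, assuming the three timings are read out of the profile's debug map by the key names described above; the `AnalysisTime` wrapper class and the plain `Map` are not part of Elasticsearch.

```java
import java.util.Map;

public class AnalysisTime {
    // analysis time = total collect time minus the two timings the
    // profile reports explicitly (_source fetch and terms accumulation).
    static long analysisNanos(Map<String, Long> debug) {
        return debug.get("collect_ns") - debug.get("fetch_ns") - debug.get("accumulation_ns");
    }

    public static void main(String[] args) {
        // Hypothetical profile output: 16s collecting, of which 5s was
        // fetching _source and 5s was accumulating terms.
        Map<String, Long> debug = Map.of(
            "collect_ns", 16_000_000_000L,
            "fetch_ns", 5_000_000_000L,
            "accumulation_ns", 5_000_000_000L
        );
        // leaves 6s of analysis
        System.out.println(analysisNanos(debug));
    }
}
```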
===== Google normalized distance
- Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
+ Google normalized distance as described in https://arxiv.org/pdf/cs/0412098v3.pdf["The Google Similarity Distance", Cilibrasi and Vitanyi, 2007] can be used as significance score by adding the parameter
This was invalid syntax I bumped into while reading the docs so I could write the tests. I can move this into its own change if you'd like, but it's pretty small and isolated as is.
@@ -1,151 +0,0 @@
---
I renamed this file, replaced `index` with `_bulk`, and added a ton more tests. It doesn't seem to detect the rename because it's barely the same file any more.
- (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE) + (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_MULTI),
- greaterThan(0)
- );
+ assertThat(termsAggResult.getDebugInfo().toString(), (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE), greaterThan(0));
I bumped into this while double checking that I hadn't broken this test. This was fixed in #71241.
super.collectDebugInfo(add);
add.accept("total_buckets", bucketOrds.size());
add.accept("collection_strategy", collectorSource.describe());
collectorSource.collectDebugInfo(add);
Without describing the collection strategy it isn't obvious from reading the profile that you are looking at `significant_text` rather than `significant_terms`. The extra debugging information should be useful in figuring out what makes a particular `significant_text` execution slow. Maybe it's fetching from source. Maybe it's analysis. Maybe it's just generating a zillion terms. That's all there in the debug output.
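The pattern in the diff above can be sketched in isolation: each layer appends its own entries through a shared `BiConsumer`, so the profile output names the collection strategy explicitly. The class and key names below are hypothetical stand-ins, not the real Elasticsearch types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class DebugSketch {
    // Stand-in for the CollectorSource abstraction: each source can
    // describe itself and contribute its own debug entries.
    interface CollectorSource {
        String describe();

        void collectDebugInfo(BiConsumer<String, Object> add);
    }

    // A toy significant_text-like source reporting the two counters
    // discussed in this PR.
    static class SigTextSource implements CollectorSource {
        long valuesFetched = 1000;
        long charsFetched = 52_000;

        @Override
        public String describe() {
            return "significant_text";
        }

        @Override
        public void collectDebugInfo(BiConsumer<String, Object> add) {
            add.accept("values_fetched", valuesFetched);
            add.accept("chars_fetched", charsFetched);
        }
    }

    // The aggregator adds its own entries, then delegates to the source,
    // mirroring the collectDebugInfo chain in the diff.
    static Map<String, Object> debugInfo(CollectorSource source, long totalBuckets) {
        Map<String, Object> debug = new LinkedHashMap<>();
        BiConsumer<String, Object> add = debug::put;
        add.accept("total_buckets", totalBuckets);
        add.accept("collection_strategy", source.describe());
        source.collectDebugInfo(add);
        return debug;
    }

    public static void main(String[] args) {
        System.out.println(debugInfo(new SigTextSource(), 42));
    }
}
```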
CollectConsumer consumer
) throws IOException {
    valuesFetched++;
    charsFetched += text.length();
With these two counters and the Timer around extract we can tell whether we are extracting many fields or a few. We can see if we're extracting many fields and filtering them away. And we can see how long the fields are in UTF-16 chars, which is better than nothing. Not as good as bytes, but such is life.
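One hypothetical way to read the two counters: their ratio gives the average extracted value length in UTF-16 chars, which hints whether a slow fetch means many small values or a few huge ones. The `FetchStats` class and method name are illustrative, not part of the PR.

```java
public class FetchStats {
    // Average extracted value length in UTF-16 chars, derived from the
    // valuesFetched and charsFetched counters added in this PR.
    static double avgCharsPerValue(long valuesFetched, long charsFetched) {
        return valuesFetched == 0 ? 0.0 : (double) charsFetched / valuesFetched;
    }

    public static void main(String[] args) {
        // e.g. 1,000 values totalling 52,000 chars -> 52 chars each
        System.out.println(avgCharsPerValue(1_000, 52_000));
    }
}
```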
Runtime fields won't have them and that claim isn't central to the test we're running.
I've used elastic/rally-tracks#176 to seed a bunch of data into a local cluster and used this to learn some things.

Running one query, the timings (all in nanos from the profile API) mean that we spent about 5 seconds building the results and picking the best buckets. Yikes! Worse, we spent six seconds re-analyzing the text. And five seconds reading the `_source`.

On the other extreme, running the other query, the first thing to know is that the whole thing takes about 43ms. Much faster! Good? Maybe. The performance is dominated by extracting the data from `_source`.
imotov
left a comment
LGTM, left a couple of suggestions.
 */
public static class ValuesSourceCollectorSource implements CollectorSource {
-   private final ValuesSource valuesSource;
+   private final ValuesSourceConfig valuesSource;
Why not rename the variable as well?
Because I wasn't being careful. I'll do it.
    throw new IllegalArgumentException("No analyzer configured for field " + f);
});
if (context.profiling()) {
    return new SignificantTextCollectorSource(
There are so many overloaded methods here that I feel like extracting this into an old-fashioned static class would greatly increase readability.
Now that elastic#72727 has landed in 7.x we can run the bwc tests against its changes.