ESQL: Pushdown count(field) to Lucene by costin · Pull Request #100122 · elastic/elasticsearch

costin · 2023-10-01T20:09:03Z

Use the LuceneCountOperator also for ungrouped count(field) queries

Fix #99840

elasticsearchmachine · 2023-10-01T20:09:27Z

Pinging @elastic/es-ql (Team:QL)

elasticsearchmachine · 2023-10-01T20:09:27Z

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

costin · 2023-10-01T20:11:58Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/FragmentExec.java

int is a better default that prevents subtle conversion errors when trying to convert the null Integer to int.

Can you please explain a little more about the situation causing the errors? Is there any documentation about it?
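As a minimal illustration of the conversion error mentioned above (a sketch, not code from this PR; the class and method names are hypothetical): assigning a null Integer to an int triggers auto-unboxing, which fails at runtime with a NullPointerException rather than at compile time.

```java
public class UnboxingDemo {
    // Returns true if unboxing the given Integer into an int throws.
    // Auto-unboxing calls Integer.intValue() implicitly, which fails on null.
    static boolean unboxingThrows(Integer value) {
        try {
            int primitive = value; // implicit unboxing happens here
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(unboxingThrows(null)); // a null Integer cannot be unboxed
        System.out.println(unboxingThrows(42));   // a non-null Integer unboxes fine
    }
}
```

Declaring the field as a primitive int removes the null state entirely, so the failure mode cannot occur.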

dnhatn · 2023-10-01T21:22:12Z

I think this optimization doesn't account for cases where a field has multiple values in a document.

astefan

LGTM

Unrelated to this specific PR.... I want to point out a behavior that for me seems slightly confusing. If I use from employees | stats c = count() by gender I get

       c       |    gender     
---------------+---------------
10             |null           
33             |F              
57             |M

but if I use from employees | stats c = count(gender) by gender I get

       c       |    gender     
---------------+---------------
0              |null           
57             |M              
33             |F

I get that count(Field) ignores null values, but it is confusing to even show the null group for count(Field). If count(Field) ignores nulls, then let's ignore them all the way, because the result above tells me (as a user who doesn't extrapolate the "ignoring nulls" aspect) that there are 0 null values for gender. But I have other queries that contradict this statement. Imo, showing the null group is wrong.

astefan · 2023-10-02T13:05:08Z

I think this optimization doesn't account for cases where a field has multiple values in a document.

You are correct @dnhatn.

from employees | where emp_no == 10010 | stats c = count(job_positions) by job_positions

       c       |  job_positions  
---------------+-----------------
4              |Architect        
4              |Purchase Manager 
4              |Reporting Analyst
4              |Tech Lead
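To make the multi-value concern concrete, here is a small sketch (plain Java, not the ESQL or Lucene implementation; the data and names are hypothetical). It assumes a pushed-down count is effectively a count of documents that have the field, whereas count(field) counts individual values, so the two diverge as soon as one document holds several values.

```java
import java.util.List;

public class MultiValueCountDemo {
    // Each inner list holds the values of one field in one document;
    // an empty list stands for a document without a value.
    static long countValues(List<List<String>> docs) {
        return docs.stream().mapToLong(List::size).sum();
    }

    // What a per-document count would report: documents with at least one value.
    static long countDocsWithField(List<List<String>> docs) {
        return docs.stream().filter(values -> !values.isEmpty()).count();
    }

    public static void main(String[] args) {
        List<List<String>> jobPositions = List.of(
            List.of("Architect", "Purchase Manager", "Reporting Analyst", "Tech Lead"),
            List.of("Tech Lead"),
            List.of()
        );
        System.out.println(countValues(jobPositions));        // 5 values in total
        System.out.println(countDocsWithField(jobPositions)); // 2 documents have the field
    }
}
```

The two numbers agree only when every document has at most one value, which is why the pushdown needs a single-value check.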

bpintea · 2023-10-02T10:45:48Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java


+                // for the moment support pushing count just for one field
+                List<Stat> stats = tuple.v2();
+                if (stats.size() > 1) {


is this check necessary (given the following one)? Or is it meant as an optimisation, to avoid reducing the stats if not necessary?

Yes, since the rule might pick up two different fields; however, at runtime we don't know how to combine the operators.
For example, count(salary), count(emp_no) by gender results in one EsStatsQuery with two Stat entries; because these are two different queries, two different sources are required, and we currently allow only one.

bpintea · 2023-10-02T12:22:07Z

...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java

+              from test | eval s = salary | rename s as sr | eval hidden_s = sr | rename emp_no as e | where e < 10050
+            | stats c = count(hidden_s)


bpintea · 2023-10-02T12:23:05Z

...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java

+    }
+
+    private PhysicalPlan optimizedPlan(PhysicalPlan plan, SearchStats searchStats) {
+        // System.out.println("* Physical Before\n" + plan);


leftover; here and below.

bpintea · 2023-10-02T13:43:27Z

Unrelated to this specific PR.... I want to point out a behavior that for me seems slightly confusing.

Related and potentially unexpected:

from hosts | keep ip0 | stats c=count(ip0) by ip0 | sort c | limit 1 returns:

       c       |      ip0      
---------------+---------------
0              |null

but dropping the argument from count(): from hosts | keep ip0 | stats c=count() by ip0 | sort c | limit 1 returns:

       c       |      ip0      
---------------+---------------
1              |null

which makes sense, the latter counting the groups and the former counting the values within the group (null being no value), but we could document this -- currently the count docs don't mention the latter functionality.
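The difference between the two forms can be sketched in plain Java (an illustration of the semantics described above, not the ESQL implementation; the class name and sample data are hypothetical). Both variants create a group for the null key, but only the argument-less count adds to it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullGroupCountDemo {
    // Like "stats c = count() by key": every row adds 1 to its group, null included.
    static Map<String, Long> countRows(List<String> keys) {
        Map<String, Long> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.merge(key, 1L, Long::sum);
        }
        return groups;
    }

    // Like "stats c = count(key) by key": the null group still exists,
    // but null is not a value, so it contributes 0.
    static Map<String, Long> countValues(List<String> keys) {
        Map<String, Long> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.merge(key, key == null ? 0L : 1L, Long::sum);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<String> gender = Arrays.asList("F", "M", null, "M", null);
        System.out.println(countRows(gender));   // {F=1, M=2, null=2}
        System.out.println(countValues(gender)); // {F=1, M=2, null=0}
    }
}
```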

astefan · 2023-10-02T13:58:04Z

but we could document this -- currently the count docs don't mention the latter functionality.

#99954

alex-spies

Generally LGTM.

The only thing worrying me is that we never really test the pushed down filter against Lucene; there may be some unforeseen weirdnesses that lead to unexpected exceptions.

alex-spies · 2023-10-02T13:13:29Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

@@ -356,9 +366,21 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate
                            Expression child = as.child();


nit: the lambda passed to computeIfAbsent is beginning to become a bit hard to read (I have trouble figuring out what exactly the lambda is trying to achieve). Consider factoring this into a well-named helper method or adding a comment.

alex-spies · 2023-10-02T13:45:56Z

...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java

            throw new EsqlIllegalArgumentException("EsStatsQuery should only occur against a Lucene backend");
        }
-        EsPhysicalOperationProviders esProvider = (EsPhysicalOperationProviders) physicalOperationProviders;
+        if (statsQuery.stats().size() > 1) {


Below we depend on having exactly one stat, so we should check this to avoid harder-to-find out-of-bounds exceptions.

Suggested change

-        if (statsQuery.stats().size() > 1) {
+        if ((statsQuery.stats().size() == 1) == false) {
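The two guards differ only for an empty list, which is exactly the case that would surface later as an out-of-bounds error. A minimal sketch of that difference (the class and method names are hypothetical, and a List of String stands in for the real list of stats):

```java
import java.util.List;

public class SingleStatGuardDemo {
    // Guard as originally written: rejects only lists with more than one element.
    static boolean looseGuardRejects(List<String> stats) {
        return stats.size() > 1;
    }

    // Guard as suggested: rejects anything that is not exactly one element.
    static boolean strictGuardRejects(List<String> stats) {
        return (stats.size() == 1) == false;
    }

    // Mimics later code that reads the single stat unconditionally.
    static String firstStatOrError(List<String> stats) {
        try {
            return stats.get(0);
        } catch (IndexOutOfBoundsException e) {
            return "out of bounds";
        }
    }

    public static void main(String[] args) {
        List<String> empty = List.of();
        System.out.println(looseGuardRejects(empty));  // false: the empty list slips through
        System.out.println(strictGuardRejects(empty)); // true: rejected up front
        System.out.println(firstStatOrError(empty));   // out of bounds
    }
}
```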

alex-spies · 2023-10-02T14:00:27Z

...sql/src/test/java/org/elasticsearch/xpack/esql/optimizer/TestLocalPhysicalPlanOptimizer.java


 public class TestLocalPhysicalPlanOptimizer extends LocalPhysicalPlanOptimizer {

+    private final boolean esRules;


nit: the variable name used in the rules method is clearer (needed to look this up to understand what's going on), consider renaming:

Suggested change

-    private final boolean esRules;
+    private final boolean optimizeForEsSource;

alex-spies · 2023-10-02T16:11:29Z

The only thing worrying me is that we never really test the pushed down filter against Lucene; there may be some unforeseen weirdnesses that lead to unexpected exceptions.

Scratch that, I forgot that our csv tests are also being run against full server instances. So the tests are all good, of course :)

costin · 2023-10-03T03:02:01Z

I think this optimization doesn't account for cases where a field has multiple values in a document.

Thanks Nhat, I've extended SearchStats to include an isSingleValue method, which uses an approach similar to #80730.

costin · 2023-10-03T03:05:26Z

I think this optimization doesn't account for cases where a field has multiple values in a document.

You are correct @dnhatn.

from employees | where emp_no == 10010 | stats c = count(job_positions) by job_positions
       c       |  job_positions  
---------------+-----------------
4              |Architect        
4              |Purchase Manager 
4              |Reporting Analyst
4              |Tech Lead        

Great comment as always @astefan - I've incorporated it into the PR.

costin added the :Analytics/ES|QL label Oct 1, 2023

costin requested review from alex-spies, astefan, bpintea and luigidellaquila October 1, 2023 20:09

costin self-assigned this Oct 1, 2023

elasticsearchmachine added the v8.11.0 and Team:QL labels Oct 1, 2023

costin force-pushed the fix/99840 branch from 3a4b2ca to c8475d6 Compare October 1, 2023 20:13

costin force-pushed the fix/99840 branch from c8475d6 to c43698b Compare October 1, 2023 22:12

ESQL: Pushdown count(field) to Lucene

f29cf83

Use the LuceneCountOperator also for ungrouped count(field) queries Fix elastic#99840

costin force-pushed the fix/99840 branch from c43698b to f29cf83 Compare October 1, 2023 22:16

costin added 2 commits October 1, 2023 16:02

Checkstyle for javadocs...

935e608

Update tests

671f7ce

Prevent optimization across multiple fields

costin force-pushed the fix/99840 branch from df27192 to 671f7ce Compare October 2, 2023 01:12

costin added the >non-issue label Oct 2, 2023

astefan approved these changes Oct 2, 2023

bpintea reviewed Oct 2, 2023

alex-spies approved these changes Oct 2, 2023

wip

05d2296

costin added 5 commits October 2, 2023 11:31

wip

0cdd533

Merge remote-tracking branch 'remotes/upstream/main' into fix/99840

deef678

Add infrastructure for checking whether a field is a SV or MV

c9646e1

Moved local physical tests to the right test

a4cffc3

Add mv test

939fa1d

costin merged commit 2e86d25 into elastic:main Oct 3, 2023

costin deleted the fix/99840 branch October 3, 2023 03:37

7 participants
