Skip to content

ESQL: Pushdown count(field) to Lucene#100122

Merged
costin merged 9 commits intoelastic:mainfrom
costin:fix/99840
Oct 3, 2023
Merged

ESQL: Pushdown count(field) to Lucene#100122
costin merged 9 commits intoelastic:mainfrom
costin:fix/99840

Conversation

@costin
Copy link
Copy Markdown
Member

@costin costin commented Oct 1, 2023

Use the LuceneCountOperator also for ungrouped count(field) queries

Fix #99840

@costin costin added the :Analytics/ES|QL AKA ESQL label Oct 1, 2023
@costin costin self-assigned this Oct 1, 2023
@elasticsearchmachine elasticsearchmachine added v8.11.0 Team:QL (Deprecated) Meta label for query languages team labels Oct 1, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int is a better default that prevents subtle conversion errors when trying to convert the null Integer to int.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please explain a little more about the situation causing errors? is there any doc regarding the same?

@dnhatn
Copy link
Copy Markdown
Member

dnhatn commented Oct 1, 2023

I think this optimization doesn't account for cases where a field has multiple values in a document.

Use the LuceneCountOperator also for ungrouped count(field) queries

Fix elastic#99840
costin added 2 commits October 1, 2023 16:02
Prevent optimization across multiple fields
Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Unrelated to this specific PR.... I want to point out a behavior that for me seems slightly confusing. If I use from employees | stats c = count() by gender I get

       c       |    gender     
---------------+---------------
10             |null           
33             |F              
57             |M              

but if I use from employees | stats c = count(gender) by gender I get

       c       |    gender     
---------------+---------------
0              |null           
57             |M              
33             |F              

I get it that count(Field) ignores null values, but it is confusing to even show the null group for count(Field). If count(Field) ignores nulls then let's ignore them all the way because the result above tells me (as an user that doesn't extrapolate "ignoring nulls" aspect) that there are 0 null values for gender. But I have other queries to reject this statement. Imo, showing the null group is wrong.

@astefan
Copy link
Copy Markdown
Contributor

astefan commented Oct 2, 2023

I think this optimization doesn't account for cases where a field has multiple values in a document.

You are correct @dnhatn.

from employees | where emp_no == 10010 | stats c = count(job_positions) by job_positions

       c       |  job_positions  
---------------+-----------------
4              |Architect        
4              |Purchase Manager 
4              |Reporting Analyst
4              |Tech Lead        


// for the moment support pushing count just for one field
List<Stat> stats = tuple.v2();
if (stats.size() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this check necessary (given the following one)? Or is it meant as an optimisation, to avoid reducing the stats if not necessary?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes since the rule might pick up two different fields however at runtime we don't know how to combine the operators.
For example count(salary), count(emp_no) by gender - this results into one EsStatsQuery with two Stat however because these are two different queries, two different sources are required and we currently allow only one.

Comment on lines +211 to +212
from test | eval s = salary | rename s as sr | eval hidden_s = sr | rename emp_no as e | where e < 10050
| stats c = count(hidden_s)
Copy link
Copy Markdown
Contributor

@bpintea bpintea Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
from test | eval s = salary | rename s as sr | eval hidden_s = sr | rename emp_no as e | where e < 10050
| stats c = count(hidden_s)
from test
| eval s = salary | rename s as sr
| eval hidden_s = sr
| rename emp_no as e
| where e < 10050
| stats c = count(hidden_s)

or some other line breaking approach.

}

private PhysicalPlan optimizedPlan(PhysicalPlan plan, SearchStats searchStats) {
// System.out.println("* Physical Before\n" + plan);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover; here and below.

@bpintea
Copy link
Copy Markdown
Contributor

bpintea commented Oct 2, 2023

Unrelated to this specific PR.... I want to point out a behavior that for me seems slightly confusing.

Related and potentially unexpected:

from hosts | keep ip0 | stats c=count(ip0) by ip0 | sort c | limit 1 returns:

       c       |      ip0      
---------------+---------------
0              |null           

, but dropping argument from count(): from hosts | keep ip0 | stats c=count() by ip0 | sort c | limit 1 returns:

       c       |      ip0      
---------------+---------------
1              |null           

which makes sense, the latter counting the groups and former the values within the group (and null being no value), but we could document this -- currently the count docs don't mention the latter functionality.

@astefan
Copy link
Copy Markdown
Contributor

astefan commented Oct 2, 2023

but we could document this -- currently the count docs don't mention the latter functionality.

#99954

Copy link
Copy Markdown
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM.

The only thing worrying me is that we never really test the pushed down filter against Lucene; there may be some unforeseen weirdnesses that lead to unexpected exceptions.

@@ -356,9 +366,21 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate
Expression child = as.child();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the lambda passed to computeIfAbsent is beginning to become a bit hard to read (I have trouble figuring out what exactly the lambda is trying to achieve). Consider factoring this into a well-named helper method or adding a comment.

throw new EsqlIllegalArgumentException("EsStatsQuery should only occur against a Lucene backend");
}
EsPhysicalOperationProviders esProvider = (EsPhysicalOperationProviders) physicalOperationProviders;
if (statsQuery.stats().size() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below we depend on having exactly one stat, so we should check this to avoid harder to find out of bounds exceptions.

Suggested change
if (statsQuery.stats().size() > 1) {
if ((statsQuery.stats().size() == 1) == false) {


public class TestLocalPhysicalPlanOptimizer extends LocalPhysicalPlanOptimizer {

private final boolean esRules;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the variable name used in the rules method is clearer (needed to look this up to understand what's going on), consider renaming:

Suggested change
private final boolean esRules;
private final boolean optimizeForEsSource;

@alex-spies
Copy link
Copy Markdown
Contributor

The only thing worrying me is that we never really test the pushed down filter against Lucene; there may be some unforeseen weirdnesses that lead to unexpected exceptions.

Scratch that, I forgot that our csv tests are also being run against full server instances. So the tests are all good, of course :)

@costin
Copy link
Copy Markdown
Member Author

costin commented Oct 3, 2023

I think this optimization doesn't account for cases where a field has multiple values in a document.

Thanks Nhat, I've extended SearchStats to include a isSingleValue method which uses a similar approach to #80730

@costin
Copy link
Copy Markdown
Member Author

costin commented Oct 3, 2023

I think this optimization doesn't account for cases where a field has multiple values in a document.

You are correct @dnhatn.

from employees | where emp_no == 10010 | stats c = count(job_positions) by job_positions

       c       |  job_positions  
---------------+-----------------
4              |Architect        
4              |Purchase Manager 
4              |Reporting Analyst
4              |Tech Lead        

Great comment as always @astefan - I've incorporated it into the PR.

@costin costin merged commit 2e86d25 into elastic:main Oct 3, 2023
@costin costin deleted the fix/99840 branch October 3, 2023 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:QL (Deprecated) Meta label for query languages team v8.11.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: search-level count for a given field

7 participants