The aggregations framework defines several similar but subtly different classes & enums to represent what type of data an aggregation operates on. Their correct usage and interaction is not obvious, especially to new developers. This refactoring aims to make input type specification for aggregations easier to understand.
Plan
To replace the assorted hard coded ValuesSource references, we want to build a dynamic registry which will map Field types to Values Sources and specific aggregator implementations.
TODO - before Fast follow after merging to master
TODO - backport
TODO - Post merge
Out of Scope:
After considerable discussion, we've decided to deal with the multivalued aggregations at a later point. The current abstractions there aren't working as well as we'd like, so investing more in them right now doesn't seem like a good use of time.
Background
Involved Classes
ValuesSource & its subclasses
These are thin wrappers which provide a unified interface to different ways of getting input data (e.g. DocValues from Lucene, or script output). A class hierarchy defines the type of values returned by the source.
ValueSourceType
A small enum who's values roughly correspond to the first level subclasses of ValuesSource. This gets passed as a constructor argument to ValuesSourceConifg as a hint for what type of values source it should later construct.
ValueType
A more robust enum specifying a wider range of types than ValuesSourceType. The instances of this enum provide a mapping back to a ValuesSourceType. This is used when users supply type hints, e.g. as part of a missing value specification.
ValuesSourceConfig
This class resolves the actual ValuesSource to use on a given shard for the aggregation. It considers script values, missing values, and mapped fields.
ValuesSourceAggregationBuilder
This is the base class for all aggregations using the value source model, and it ties together usage of all of the above classes. It is generic over a ValuesSource class, and accepts a ValuesSourceType and two ValueType parameters. How those different value classes interact is not at all obvious; a good first step at this refactoring would be just documenting those relationships.
Next Steps
I'm open to discussion as to what the best way to clean this up should be. In the interest of seeding that discussion, the following are my first-pass suggestions.
Clarify ValuesSourceAggregationBuilder
This would consist of descriptive field names & comments around the intended use of the various type parameters ValuesSourceAggregationBuilder accepts. Some javadoc on the intended roles of ValueSourceType and ValueType would help too.
Formalize relationship between ValuesSourceType and ValuesSource
Currently, these are related via some cascading if statements in ValuesSourceConfig, but this misses out on many of the benefits of having an enum in the first place. Notably, nothing enforces that all enum values are accounted for, and developers wishing to understand the relationship between two classes must look to a third class for that information.
ValueSourceType.ANY
ValuesSourceType.ANY creates a few edge cases. It's the only ValuesSourceType that doesn't map cleanly back to a ValuesSource subclass. ValuesSourceConfig interprets it as a bytes source, except in the case of a script, which allows BYTES but not ANY. Aggregations that can operate on multiple input types (e.g. TermsAggregation, which can operate on strings or numbers) use ValuesSourceType.ANY to indicate this, which is deceptive since later in the process they are restricted to more specific source types.
ValuesSourceType Ordinal Serialization Issue
In addition to any other refactoring work being done here, ValuesSourceType is an enum being serialized by ordinal value, which is prone to error. If we keep this enum, we should adopt a pattern of serializing based on an ID we control, to allow for deletes & reorderings later. See ValueType, which does this correctly.
ValueType.NUMBER and ValueType.NUMERIC
ValueType specifies both NUMBER and NUMERIC, which are apparently identical as far as the enum is concerned. There's a five year old TODO asking about the difference between these two. We should either merge them, or clarify why we need both.
Rename ValuesSource
ValuesSource is very close to ValueSource, which is an interface in org.elasticsearch.ingest. It's easy to pull up the wrong class in Intellij, especially if you've only heard the name not seen it.
The aggregations framework defines several similar but subtly different classes & enums to represent what type of data an aggregation operates on. Their correct usage and interaction is not obvious, especially to new developers. This refactoring aims to make input type specification for aggregations easier to understand.
Plan
To replace the assorted hard coded
ValuesSourcereferences, we want to build a dynamic registry which will map Field types to Values Sources and specific aggregator implementations.TODO -
beforeFast follow after merging to masterValuesSourcesout ofValuesSourceConfigtoValuesSource(Extract Empty/Script/Missing ValuesSource behavior to an interface #48320)resolve- This will get done with the registry prototypeValueSourceType(working in prototype for simple case)ValuesSourceTypedynamically[ ] Significant Text (@polyfractal WIP)not VSABValueType. This will involve changing how we do type checking of userValueTypeHints in the parser. Vs refactor parser cleanup #53198ValuesSourcebased aggsValueTypeValueType, but refactoring those is out of scope for the initial merge.ValueTypefor parsing user value hints. There's a post-merge PR for removing that.ValuesSourceTypeinfo fromValuesSourceAggregationBuilder(Remove ValuesSourceType argument to ValuesSourceAggregationBuilder #48638)ValuesSourceConfig(and probably other places)ValuesSourceType.ANY& and remove ANY Remove ValuesSourceType.ANY #51539CoreValuesSourceTypeCoreValuesSourceType no longer implements Writable #51276ValuesSourcereferencesValuesSourceConfigimmutable Soft immutability for VSConfig #52729o.e.s.aggregations.support(Tozzi) Vs refactor javadoc and comment cleanup #53427TODO - backport
List.of()usageTODO - Post merge
serializeTargetValueType(doesn't strictly need to be part of this refactor). The following classes override this method:CardinalityAggregationBuilderMissingAggregationBuilderValueCountAggregationBuilderRareTermsAggregationBuilderSignificantTermsAggregationBuilderTermsAggregationBuilderAggregatorSupplier) between Min&Max and other metrics.MultiValuesSourceAggregationBuildersubclassesArrayValuesSourceAggregationBuildersubclassesValueTypeOut of Scope:
After considerable discussion, we've decided to deal with the multivalued aggregations at a later point. The current abstractions there aren't working as well as we'd like, so investing more in them right now doesn't seem like a good use of time.
Background
Involved Classes
ValuesSource & its subclasses
These are thin wrappers which provide a unified interface to different ways of getting input data (e.g. DocValues from Lucene, or script output). A class hierarchy defines the type of values returned by the source.
ValueSourceType
A small enum who's values roughly correspond to the first level subclasses of
ValuesSource. This gets passed as a constructor argument toValuesSourceConifgas a hint for what type of values source it should later construct.ValueType
A more robust enum specifying a wider range of types than
ValuesSourceType. The instances of this enum provide a mapping back to aValuesSourceType. This is used when users supply type hints, e.g. as part of a missing value specification.ValuesSourceConfig
This class resolves the actual
ValuesSourceto use on a given shard for the aggregation. It considers script values, missing values, and mapped fields.ValuesSourceAggregationBuilder
This is the base class for all aggregations using the value source model, and it ties together usage of all of the above classes. It is generic over a
ValuesSourceclass, and accepts aValuesSourceTypeand twoValueTypeparameters. How those different value classes interact is not at all obvious; a good first step at this refactoring would be just documenting those relationships.Next Steps
I'm open to discussion as to what the best way to clean this up should be. In the interest of seeding that discussion, the following are my first-pass suggestions.
Clarify ValuesSourceAggregationBuilder
This would consist of descriptive field names & comments around the intended use of the various type parameters
ValuesSourceAggregationBuilderaccepts. Some javadoc on the intended roles ofValueSourceTypeandValueTypewould help too.Formalize relationship between
ValuesSourceTypeandValuesSourceCurrently, these are related via some cascading
ifstatements inValuesSourceConfig, but this misses out on many of the benefits of having an enum in the first place. Notably, nothing enforces that all enum values are accounted for, and developers wishing to understand the relationship between two classes must look to a third class for that information.ValueSourceType.ANY
ValuesSourceType.ANYcreates a few edge cases. It's the onlyValuesSourceTypethat doesn't map cleanly back to aValuesSourcesubclass.ValuesSourceConfiginterprets it as a bytes source, except in the case of a script, which allowsBYTESbut notANY. Aggregations that can operate on multiple input types (e.g. TermsAggregation, which can operate on strings or numbers) useValuesSourceType.ANYto indicate this, which is deceptive since later in the process they are restricted to more specific source types.ValuesSourceType Ordinal Serialization Issue
In addition to any other refactoring work being done here,
ValuesSourceTypeis an enum being serialized by ordinal value, which is prone to error. If we keep this enum, we should adopt a pattern of serializing based on an ID we control, to allow for deletes & reorderings later. SeeValueType, which does this correctly.ValueType.NUMBERandValueType.NUMERICValueTypespecifies bothNUMBERandNUMERIC, which are apparently identical as far as the enum is concerned. There's a five year old TODO asking about the difference between these two. We should either merge them, or clarify why we need both.Rename
ValuesSourceValuesSourceis very close toValueSource, which is an interface inorg.elasticsearch.ingest. It's easy to pull up the wrong class in Intellij, especially if you've only heard the name not seen it.