Description
Is your feature request related to a problem? Please describe.
OpenSearch currently lacks a native field type for storing and aggregating pre-computed HyperLogLog++ (HLL++) sketches. The existing cardinality aggregation is excellent for calculating unique counts on raw data, but it cannot be used on data that has already been aggregated, as the final numeric count is not re-aggregatable.
This presents two major problems:
- Inefficient Data Ingestion: Users with data pipelines (Spark, Flink, etc.) that can pre-compute HLL++ sketches to reduce data volume have no efficient way to use them in OpenSearch. The only option is to store them in a binary field, which prevents any server-side aggregation and forces all merge logic to the client side.
- Blocker for Multi-Tier Rollups: This is the most critical issue. The inability to re-aggregate unique counts is the primary reason that multi-tier rollups are not safely supported. Users cannot create a rollup with a 1-minute unique user count and then re-aggregate that into a correct 1-hour unique user count.
Describe the solution you'd like
I propose the creation of a new field type, tentatively named hll_sketch. This feature would consist of two main components:
- HLLSketchFieldMapper: A new field mapper that accepts an HLL++ sketch (e.g., as a Base64-encoded string), stores it efficiently as a binary doc value, and makes it available for aggregation.
- merge_hll_sketches aggregation: A new metric aggregation that operates on hll_sketch fields. This aggregation would collect the sketches from the matching documents, merge them into a single sketch, and return the final cardinality.
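To make the merge step concrete, here is a minimal Python sketch of what the proposed aggregation might do with the stored values. It assumes each sketch is a dense array of one-byte HLL registers, Base64-encoded; that layout is an assumption for illustration only, since this issue does not fix a wire format. The helper names `decode_sketch` and `merge_sketches` are hypothetical.

```python
import base64


def decode_sketch(payload: str) -> bytes:
    """Decode a Base64 string into a dense register array (assumed layout)."""
    return base64.b64decode(payload)


def merge_sketches(payloads):
    """Merge sketches by taking the register-wise maximum."""
    sketches = [decode_sketch(p) for p in payloads]
    if not sketches:
        raise ValueError("no sketches to merge")
    # Sketches built with different precisions have different register counts
    # and cannot be merged register by register.
    if len({len(s) for s in sketches}) != 1:
        raise ValueError("sketches have mismatched register counts")
    return bytes(max(column) for column in zip(*sketches))


# Toy 4-register sketches standing in for the doc values of three documents.
doc_values = [
    base64.b64encode(bytes(registers)).decode("ascii")
    for registers in ([0, 3, 1, 0], [2, 1, 1, 5], [1, 0, 4, 2])
]

merged = merge_sketches(doc_values)
print(list(merged))  # [2, 3, 4, 5]
```

Because the register-wise maximum is commutative and associative, sketches can be merged per shard and then merged again when the shard results are combined, in any order.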
How This Unlocks Multi-Tier Rollups
This new field type is the foundational building block for enabling safe, accurate, multi-tier rollups for cardinality metrics.
The workflow would be as follows:
- Initial Rollup (Tier 1): An initial rollup job would run on the raw data. Instead of calculating the final cardinality, it would generate and store the raw HLL++ sketch in a field mapped as hll_sketch.
- Subsequent Rollup (Tier 2): A second rollup job could then safely target the Tier 1 rollup index. It would use the new merge_hll_sketches aggregation on the hll_sketch field to accurately combine the sketches from the first tier into a new, higher-level sketch or final count.
This solves the limitation of re-aggregating final counts by instead merging the underlying data structures, making tiered data retention strategies a native capability.
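The two-tier workflow can be simulated with a small Python sketch. It uses a classic HyperLogLog with linear counting for small cardinalities, not the bias-corrected HLL++ estimator, and the precision `P`, the SHA-256 hash, and the synthetic user IDs are all assumptions chosen for illustration. It shows that merging per-minute sketches yields exactly the registers of a sketch built from the raw data for the whole hour, while summing per-minute counts overcounts returning users.

```python
import hashlib
import math

P = 12        # precision (assumed): 2**P registers
M = 1 << P


def new_sketch():
    return [0] * M


def add(sketch, value):
    """Record one value in the sketch."""
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                       # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)


def merge(a, b):
    """Register-wise maximum of two sketches with the same precision."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    """Classic HLL estimate with linear counting for small cardinalities."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Tier 1: one sketch per minute; consecutive minutes share most users.
minute_sketches = []
for minute in range(60):
    s = new_sketch()
    for user in range(minute * 50, minute * 50 + 500):
        add(s, f"user-{user}")
    minute_sketches.append(s)

# Tier 2: merge the 60 minute-level sketches into one hour-level sketch.
hour_sketch = new_sketch()
for s in minute_sketches:
    hour_sketch = merge(hour_sketch, s)

# Reference: a sketch built directly from the hour's raw data.
TRUE_DISTINCT = 59 * 50 + 500   # 3450 distinct users in the hour
direct_sketch = new_sketch()
for user in range(TRUE_DISTINCT):
    add(direct_sketch, f"user-{user}")

summed_counts = sum(estimate(s) for s in minute_sketches)
print("merged == direct:", hour_sketch == direct_sketch)
print("hourly estimate from merged sketch:", round(estimate(hour_sketch)))
print("sum of per-minute estimates:", round(summed_counts))
```

The merged sketch is identical to the directly built one because each register only ever keeps its maximum, so the order and grouping of updates do not matter. This is the property that makes re-aggregation across tiers safe.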
Example Workflow
1. Define the Mapping
PUT my-analytics-index
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "user_id_sketch": {
        "type": "hll_sketch"
      }
    }
  }
}
2. Ingest a Pre-Computed Sketch
POST my-analytics-index/_doc
{
  "timestamp": "2025-09-30T14:30:00Z",
  "user_id_sketch": "AAEGEAgaA...base64-encoded-hll-sketch...AgA="
}
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
opensearch-project/index-management#1493
opensearch-project/index-management#1490