[Feature Request] Introduce a New Field Mapper for HyperLogLog++ Sketches #19487

@sandeshkr419

Is your feature request related to a problem? Please describe.

OpenSearch currently lacks a native field type for storing and aggregating pre-computed HyperLogLog++ (HLL++) sketches. The existing cardinality aggregation works well for computing unique counts over raw data, but it cannot be applied to data that has already been aggregated: a final numeric count discards the underlying sketch, so counts from different buckets cannot be combined into a correct unique count.

This presents two major problems:

  1. Inefficient Data Ingestion: Users with data pipelines (Spark, Flink, etc.) that can pre-compute HLL++ sketches to reduce data volume have no efficient way to use them in OpenSearch. The only option is to use a binary field, which prevents any server-side aggregation and forces all merge logic to the client side.
  2. Blocker for Multi-Tier Rollups: This is the most critical issue. The inability to re-aggregate unique counts is the primary reason that multi-tier rollups are not safely supported. Users cannot create a rollup with a 1-minute unique user count and then re-aggregate that into a correct 1-hour unique user count.
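To make the re-aggregation problem concrete, here is a minimal sketch using exact Python sets in place of HLL++ sketches. The user IDs and bucket names are made up for illustration.

```python
# Per-minute unique users; "u2" is active in both minutes.
minute_1 = {"u1", "u2", "u3"}
minute_2 = {"u2", "u4"}

# Re-aggregating the stored final counts double-counts "u2".
sum_of_counts = len(minute_1) + len(minute_2)

# The correct figure needs the underlying sets (or a mergeable sketch of them).
true_unique = len(minute_1 | minute_2)

print(sum_of_counts, true_unique)  # 5 4
```

A rollup that stores only the numbers 3 and 2 has no way to recover 4; a rollup that stores a mergeable structure does.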

Describe the solution you'd like

I propose the creation of a new field type, tentatively named hll_sketch. This feature would consist of two main components:

  1. An HLLSketchFieldMapper: A new field mapper that accepts an HLL++ sketch (e.g., as a Base64-encoded string), stores it efficiently as a binary doc value, and makes it available for aggregation.
  2. A merge_hll_sketches Aggregation: A new metric aggregation that can operate on hll_sketch fields. This aggregation would collect the sketches from the relevant documents, merge them into a single sketch, and return the final cardinality.

How This Unlocks Multi-Tier Rollups

This new field type is the foundational building block for enabling safe, accurate, multi-tier rollups for cardinality metrics.

The workflow would be as follows:

  1. Initial Rollup (Tier 1): An initial rollup job would run on the raw data. Instead of calculating the final cardinality, it would generate and store the raw HLL++ sketch in a field mapped as hll_sketch.
  2. Subsequent Rollup (Tier 2): A second rollup job could then safely target the Tier 1 rollup index. It would use the new merge_hll_sketches aggregation on the hll_sketch field to accurately combine the sketches from the first tier into a new, higher-level sketch or final count.

This avoids the problem of re-aggregating final counts by merging the underlying sketches instead, making tiered data retention strategies a native capability.
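The property the tiered workflow relies on can be sketched with a simplified, plain HyperLogLog (not the full HLL++ with its sparse encoding and bias correction). Everything here is an assumption made for illustration: the precision `P`, the SHA-1 based hash, and the helper names `new_sketch`, `add`, `merge`, and `estimate`. None of this is the proposed OpenSearch implementation.

```python
import hashlib
import math

P = 10          # precision: 2**P registers (illustrative choice)
M = 1 << P

def _hash64(value: str) -> int:
    # 64-bit hash taken from SHA-1; production sketches use faster hashes.
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

def new_sketch() -> list:
    return [0] * M

def add(registers: list, value: str) -> None:
    h = _hash64(value)
    idx = h >> (64 - P)                      # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    registers[idx] = max(registers[idx], rank)

def merge(a: list, b: list) -> list:
    # Merging is an element-wise max, so it is lossless and order-independent.
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers: list) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)       # small-range (linear counting) correction
    return raw

# Tier 1: one sketch per minute, with 500 users active in both minutes.
minute_1, minute_2, direct = new_sketch(), new_sketch(), new_sketch()
for i in range(0, 1000):
    add(minute_1, f"user-{i}")
    add(direct, f"user-{i}")
for i in range(500, 1500):
    add(minute_2, f"user-{i}")
    add(direct, f"user-{i}")

# Tier 2: merging the tier-1 sketches equals sketching the raw data directly.
hourly = merge(minute_1, minute_2)
print(hourly == direct)           # True
print(round(estimate(hourly)))    # an approximation of the 1500 distinct users
```

Because the merged registers are identical to those built from the raw data, a second-tier rollup loses no accuracy compared with a single rollup over the raw data.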

Example Workflow

1. Define the Mapping

PUT my-analytics-index
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "user_id_sketch": {
        "type": "hll_sketch" 
      }
    }
  }
}

2. Ingest a Pre-Computed Sketch

POST my-analytics-index/_doc
{
  "timestamp": "2025-09-30T14:30:00Z",
  "user_id_sketch": "AAEGEAgaA...base64-encoded-hll-sketch...AgA="
}
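The steps above stop at ingestion. As a hedged illustration of the client side, the following builds an ingest document and a possible search request body. The sketch bytes are placeholders, since the serialization format is not specified in this proposal, and the request shape for `merge_hll_sketches` (a single `field` parameter nested under a `date_histogram`) is an assumption, not a defined API.

```python
import base64
import json

# Placeholder bytes standing in for a serialized HLL++ sketch; the actual
# wire format is not specified in this proposal.
sketch_bytes = bytes([0, 3, 1, 0, 2, 5, 0, 1])

# Body for POST my-analytics-index/_doc
doc = {
    "timestamp": "2025-09-30T14:30:00Z",
    "user_id_sketch": base64.b64encode(sketch_bytes).decode("ascii"),
}

# Hypothetical body for POST my-analytics-index/_search: hourly buckets, each
# merging the sketches of its documents. The parameter layout is assumed.
query = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "hour"},
            "aggs": {
                "unique_users": {"merge_hll_sketches": {"field": "user_id_sketch"}}
            },
        }
    },
}

print(json.dumps(doc))
print(json.dumps(query, indent=2))
```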

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

opensearch-project/index-management#1493
opensearch-project/index-management#1490

Status

✅ Done
