Skip to content

HLL field Mapper for Cardinality Rollups#20129

Merged
sandeshkr419 merged 6 commits intoopensearch-project:mainfrom
sandeshkr419:hll_gem
Jan 22, 2026
Merged

HLL field Mapper for Cardinality Rollups#20129
sandeshkr419 merged 6 commits intoopensearch-project:mainfrom
sandeshkr419:hll_gem

Conversation

@sandeshkr419
Copy link
Copy Markdown
Member

@sandeshkr419 sandeshkr419 commented Nov 30, 2025

Description

The hll field type stores pre-aggregated cardinality data using HyperLogLog++ (HLL++) sketch data structures. This field type is designed for internal use by OpenSearch and its plugins (such as the ISM plugin for multi-tier rollup operations) to enable efficient cardinality aggregation on rolled-up data.

Key Features

  • Efficient Storage: Stores approximate cardinality estimates in compact binary format sketches
  • Merge Support: Supports merging multiple HLL++ sketches during aggregation
  • Configurable Precision: Allows tuning the accuracy-memory tradeoff via precision parameter
  • Rollup Integration: Designed for use with ISM plugin rollup operations

Parameters

  • precision (optional): Integer between 4 and 18 (consistent with OpenSearch) that controls the accuracy-memory tradeoff
    • Default: 12 (keeping consistent with OpenSearch Cardinality Aggregation)
    • Higher precision = better accuracy but more memory

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • New Features

    • Added HyperLogLog field type support for efficient cardinality aggregations on pre-aggregated sketches.
    • Cardinality aggregations now support merging pre-aggregated HyperLogLog data from HLL fields.
  • Bug Fixes

    • Enhanced error messaging for sketch precision mismatch scenarios during merge operations.
  • Tests

    • Added comprehensive test coverage for HyperLogLog field mapping and cardinality aggregation behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 30, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This pull request introduces HyperLogLog++ (HLL) sketch field support to OpenSearch, including a new field mapper, field data implementation, specialized cardinality aggregator for pre-aggregated sketch merging, and comprehensive test coverage to enable efficient cardinality computation across rollup and aggregated data scenarios.

Changes

Cohort / File(s) Summary
Field Mapping & Field Data Implementation
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java, server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java
Introduces HllFieldMapper and HllFieldData to handle HLL sketches as a new field type, storing pre-aggregated sketch data as binary doc values with precision validation and lazy deserialization via LeafFieldData
Cardinality Aggregation
server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java, server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorFactory.java
Adds HllCardinalityAggregator to merge pre-aggregated HLL sketches per segment and integrates it into CardinalityAggregatorFactory for automatic routing to HLL fields
Framework Registration
server/src/main/java/org/opensearch/indices/IndicesModule.java
Registers HllFieldMapper as a new field type in the mapper registry
Error Handling Enhancement
server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java
Improves merge precision mismatch error message to include both precision values
Integration Tests
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java
Comprehensive integration tests validating field creation, sketch ingestion, multi-document rollups, sketch merging via cardinality aggregation, and bucketed aggregation behavior
Unit Tests
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java, server/src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java
Unit tests for HLL field mapper validation, sketch serialization/deserialization, precision boundaries, query behavior, and aggregator test setup

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Indexer as Index Writer
    participant FieldMapper as HllFieldMapper
    participant FieldData as HllFieldData
    participant LeafData as HllLeafFieldData
    participant Aggregator as HllCardinalityAggregator
    participant Sketch as HyperLogLogPlusPlus

    Client->>Indexer: Index doc with HLL sketch bytes
    Indexer->>FieldMapper: Parse field
    FieldMapper->>FieldMapper: Validate sketch precision
    FieldMapper->>FieldData: Store as binary doc value
    
    Client->>Aggregator: Execute cardinality aggregation
    Aggregator->>FieldData: Load field data per segment
    FieldData->>LeafData: Create leaf field data
    
    Aggregator->>LeafData: For each document, getSketch(docId)
    LeafData->>LeafData: Read binary doc value
    LeafData->>Sketch: deserialize via readFrom()
    Sketch-->>LeafData: HyperLogLogPlusPlus instance
    LeafData-->>Aggregator: Sketch
    
    Aggregator->>Aggregator: Initialize/resize accumulator on first sketch
    Aggregator->>Sketch: Merge document sketch into accumulator
    Aggregator->>Aggregator: Track per-bucket cardinality
    
    Aggregator-->>Client: Return InternalCardinality with merged result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • HllCardinalityAggregator: Dynamic accumulator initialization and resizing logic; careful resource management and bucket tracking warrant close inspection
  • Integration tests: Comprehensive end-to-end scenarios with multiple sketch merging paths; verify test assertions match expected cardinality behavior
  • Precision validation: Ensure mismatch handling is consistent across field mapper, field data, and aggregator
  • LeafFieldData.getSketch(): Verify stream/BytesRef lifecycle and null handling for missing values

Suggested labels

Search:Aggregations, enhancement, Indexing

Suggested reviewers

  • sachinpkale
  • shwetathareja
  • andrross
  • dbwiddis
  • msfroh
  • kotwanikunal
  • jed326
  • CEHENKLE

Poem

🐰 Sketches dance in binary form,
HyperLogLog++ keeps cardinality warm,
Merging sketches, bucket by bucket,
Pre-aggregated data—let's crunch it!
wiggles nose 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.86% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: introducing an HLL field mapper for cardinality rollups, which aligns with the primary objective of the changeset.
Description check ✅ Passed The description provides comprehensive information about the HLL field type, key features, parameters, and use cases. However, the Related Issues section contains only a placeholder without an actual issue number, and the checklist items are not checked off.

Comment @coderabbitai help to get the list of available commands and usage tips.

@sandeshkr419 sandeshkr419 changed the title hll changes for cardinality rollup HLL field Mapper for Cardinality Rollups Nov 30, 2025
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 97b7392: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@asimmahmood1
Copy link
Copy Markdown
Contributor

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 17, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (5)
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java (1)

139-141: Minor inconsistency: consider using BitMixer.mix64 for consistency.

Here raw integer values are collected directly, while testMergeProducesUnionCardinality (line 317) uses BitMixer.mix64(i). While not a correctness issue (HLL handles any input), using BitMixer consistently would better simulate real hash values.

                     // Add values to reach the target cardinality
                     for (int i = 0; i < cardinality; i++) {
-                        originalSketch.collect(0, i);
+                        originalSketch.collect(0, BitMixer.mix64(i));
                     }
server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (2)

71-85: Consider exponential growth to avoid O(n²) resize overhead.

The current resize strategy creates a new HyperLogLogPlusPlus with capacity bucket + 1 and copies all existing data. If buckets arrive in increasing order, this results in O(n²) total work. Consider using exponential growth (e.g., Math.max(bucket + 1, counts.maxOrd() * 2)) to amortize resize costs.

                         // Grow if needed to accommodate this bucket
                         if (bucket >= counts.maxOrd()) {
-                            HyperLogLogPlusPlus newCounts = new HyperLogLogPlusPlus(precision, context.bigArrays(), bucket + 1);
+                            long newCapacity = Math.max(bucket + 1, counts.maxOrd() * 2);
+                            HyperLogLogPlusPlus newCounts = new HyperLogLogPlusPlus(precision, context.bigArrays(), newCapacity);
                             for (long i = 0; i < counts.maxOrd(); i++) {
                                 if (counts.cardinality(i) > 0) {
                                     newCounts.merge(i, counts, i);

108-111: Potential out-of-bounds access in metric() method.

Unlike buildAggregation() which checks owningBucketOrd >= counts.maxOrd(), this method passes owningBucketOrd directly to counts.cardinality(). While the underlying ByteArray may handle this internally, adding an explicit bounds check would be more defensive and consistent with buildAggregation().

     @Override
     public double metric(long owningBucketOrd) {
-        return counts == null ? 0 : counts.cardinality(owningBucketOrd);
+        return counts == null || owningBucketOrd >= counts.maxOrd() ? 0 : counts.cardinality(owningBucketOrd);
     }
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (1)

201-246: Hardcoded expected cardinality may be fragile.

The test hardcodes expectedCardinality = 500 and asserts an exact match against the aggregated result. HLL++ is a probabilistic data structure, and while it may return exact values for small cardinalities, this assertion is fragile if internals change.

Consider removing the hardcoded assertion and relying only on the comparison with expectedMerged.cardinality(0):

-        long expectedCardinality = 500;
         // ...
-            // The aggregated cardinality should match the expected cardinality
-            assertEquals("Aggregation should merge sketches correctly", expectedCardinality, aggregatedCardinality);
-
             // The aggregated cardinality should match the cardinality from expected expectedMerged
             assertEquals("Aggregation should merge sketches correctly", expectedMerged.cardinality(0), aggregatedCardinality);
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (1)

59-62: Unused parameters in build method.

The cache and breakerService parameters are ignored. This may be intentional since HLL sketches are loaded on-demand, but consider documenting this design decision.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee30dc and 97b7392.

📒 Files selected for processing (9)
  • server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (1 hunks)
  • server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java (1 hunks)
  • server/src/main/java/org/opensearch/indices/IndicesModule.java (2 hunks)
  • server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorFactory.java (2 hunks)
  • server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (1 hunks)
  • server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java (1 hunks)
  • server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (1 hunks)
  • server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java (1 hunks)
  • server/src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (2)
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (1)
  • HllFieldData (43-176)
server/src/main/java/org/opensearch/search/aggregations/LeafBucketCollector.java (1)
  • LeafBucketCollector (51-179)
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (3)
server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java (1)
  • HyperLogLogPlusPlus (66-465)
test/framework/src/main/java/org/opensearch/test/OpenSearchSingleNodeTestCase.java (1)
  • OpenSearchSingleNodeTestCase (103-492)
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java (1)
  • Bucket (96-221)
server/src/main/java/org/opensearch/indices/IndicesModule.java (1)
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java (1)
  • HllFieldMapper (40-234)
🪛 GitHub Actions: Gradle Precommit
server/src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java

[error] 1-1: License header problems were found. Unapproved licenses detected for this test file. See server/build/reports/licenseHeaders/rat.log for details. Execution failed: ':server:licenseHeaders'.

🔇 Additional comments (14)
server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java (1)

194-196: LGTM! Improved error messaging.

The updated error message now includes both precision values, which significantly aids debugging when sketches with mismatched precisions are incorrectly merged.

server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorFactory.java (1)

126-135: LGTM! Clean integration for HLL field type.

The logic correctly detects HLL fields and delegates to HllCardinalityAggregator. Using hllFieldType.precision() (from the field mapping) rather than the aggregation's precisionThreshold is the correct approach since HLL fields have pre-aggregated sketches with a fixed precision.

server/src/main/java/org/opensearch/indices/IndicesModule.java (1)

59-59: LGTM! Standard mapper registration.

The HllFieldMapper is correctly registered following the established pattern for built-in mappers.

Also applies to: 164-164

server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java (1)

27-57: LGTM! Comprehensive test coverage for HLL field mapper.

The test class covers essential scenarios: mapping defaults, precision validation, query behavior, serialization round-trips, merge semantics (commutativity, associativity, union cardinality), and error handling for invalid/mismatched data.

server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (1)

31-56: LGTM! Clean aggregator implementation.

The lazy initialization pattern for counts, proper resource management via Releasables.close(), and delegation to HllFieldData for per-document sketch retrieval are well-designed.

server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (3)

29-33: LGTM!

Clear class-level documentation and appropriate base class for single-node integration testing.


252-327: LGTM!

Good test coverage for bucketed cardinality aggregations with appropriate tolerance for HLL++ approximation.


329-395: LGTM!

Well-structured simplified test case for validating two-sketch merge behavior with proper resource cleanup in the finally block.

server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (2)

137-145: LGTM!

The ramBytesUsed() returning 0 and close() being a no-op are appropriate since HllLeafFieldData doesn't own the LeafReader.


98-120: LGTM!

Correctly preventing sorting operations on HLL fields with descriptive error messages.

server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java (4)

53-96: LGTM!

Well-structured builder with proper precision validation using constants from AbstractHyperLogLog.


164-186: LGTM!

Correct handling of both programmatic external values and parser-based binary input, with proper null handling.


230-233: LGTM!

Standard merge builder pattern for ParametrizedFieldMapper.


126-138: The valueFetcher implementation is correct as written.

When _source data containing binary sketches is parsed, BytesRef objects are properly reconstructed from the base64-encoded JSON representation. The implementation correctly encodes these to base64 strings for readability in API responses, and gracefully handles any other input types by returning them unchanged.

Copy link
Copy Markdown
Contributor

@asimmahmood1 asimmahmood1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach looks good, still looking at test cases.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for a36290b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@asimmahmood1
Copy link
Copy Markdown
Contributor

We trying to restart gradle check but I don't have permission to reopen. @sandeshkr419 @prudhvigodithi

@sandeshkr419 sandeshkr419 reopened this Jan 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for a36290b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@navneet1v
Copy link
Copy Markdown
Contributor

Approved the code. Please fix the CIs

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Copy link
Copy Markdown
Contributor

@bharath-techie bharath-techie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for d0c34a0: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhancement or improvement to existing feature or request Indexing & Search Search:Aggregations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Introduce a New Field Mapper for HyperLogLog++ Sketches

4 participants