HLL field Mapper for Cardinality Rollups#20129
HLL field Mapper for Cardinality Rollups#20129sandeshkr419 merged 6 commits intoopensearch-project:mainfrom
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughThis pull request introduces HyperLogLog++ (HLL) sketch field support to OpenSearch, including a new field mapper, field data implementation, specialized cardinality aggregator for pre-aggregated sketch merging, and comprehensive test coverage to enable efficient cardinality computation across rollup and aggregated data scenarios. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Indexer as Index Writer
participant FieldMapper as HllFieldMapper
participant FieldData as HllFieldData
participant LeafData as HllLeafFieldData
participant Aggregator as HllCardinalityAggregator
participant Sketch as HyperLogLogPlusPlus
Client->>Indexer: Index doc with HLL sketch bytes
Indexer->>FieldMapper: Parse field
FieldMapper->>FieldMapper: Validate sketch precision
FieldMapper->>FieldData: Store as binary doc value
Client->>Aggregator: Execute cardinality aggregation
Aggregator->>FieldData: Load field data per segment
FieldData->>LeafData: Create leaf field data
Aggregator->>LeafData: For each document, getSketch(docId)
LeafData->>LeafData: Read binary doc value
LeafData->>Sketch: deserialize via readFrom()
Sketch-->>LeafData: HyperLogLogPlusPlus instance
LeafData-->>Aggregator: Sketch
Aggregator->>Aggregator: Initialize/resize accumulator on first sketch
Aggregator->>Sketch: Merge document sketch into accumulator
Aggregator->>Aggregator: Track per-bucket cardinality
Aggregator-->>Client: Return InternalCardinality with merged result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Comment |
|
❌ Gradle check result for 97b7392: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (5)
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java (1)
139-141: Minor inconsistency: consider usingBitMixer.mix64for consistency.Here raw integer values are collected directly, while
testMergeProducesUnionCardinality(line 317) usesBitMixer.mix64(i). While not a correctness issue (HLL handles any input), usingBitMixerconsistently would better simulate real hash values.// Add values to reach the target cardinality for (int i = 0; i < cardinality; i++) { - originalSketch.collect(0, i); + originalSketch.collect(0, BitMixer.mix64(i)); }server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (2)
71-85: Consider exponential growth to avoid O(n²) resize overhead.The current resize strategy creates a new
HyperLogLogPlusPluswith capacitybucket + 1and copies all existing data. If buckets arrive in increasing order, this results in O(n²) total work. Consider using exponential growth (e.g.,Math.max(bucket + 1, counts.maxOrd() * 2)) to amortize resize costs.// Grow if needed to accommodate this bucket if (bucket >= counts.maxOrd()) { - HyperLogLogPlusPlus newCounts = new HyperLogLogPlusPlus(precision, context.bigArrays(), bucket + 1); + long newCapacity = Math.max(bucket + 1, counts.maxOrd() * 2); + HyperLogLogPlusPlus newCounts = new HyperLogLogPlusPlus(precision, context.bigArrays(), newCapacity); for (long i = 0; i < counts.maxOrd(); i++) { if (counts.cardinality(i) > 0) { newCounts.merge(i, counts, i);
108-111: Potential out-of-bounds access inmetric()method.Unlike
buildAggregation()which checksowningBucketOrd >= counts.maxOrd(), this method passesowningBucketOrddirectly tocounts.cardinality(). While the underlyingByteArraymay handle this internally, adding an explicit bounds check would be more defensive and consistent withbuildAggregation().@Override public double metric(long owningBucketOrd) { - return counts == null ? 0 : counts.cardinality(owningBucketOrd); + return counts == null || owningBucketOrd >= counts.maxOrd() ? 0 : counts.cardinality(owningBucketOrd); }server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (1)
201-246: Hardcoded expected cardinality may be fragile.The test hardcodes
expectedCardinality = 500and asserts an exact match against the aggregated result. HLL++ is a probabilistic data structure, and while it may return exact values for small cardinalities, this assertion is fragile if internals change.Consider removing the hardcoded assertion and relying only on the comparison with
expectedMerged.cardinality(0):- long expectedCardinality = 500; // ... - // The aggregated cardinality should match the expected cardinality - assertEquals("Aggregation should merge sketches correctly", expectedCardinality, aggregatedCardinality); - // The aggregated cardinality should match the cardinality from expected expectedMerged assertEquals("Aggregation should merge sketches correctly", expectedMerged.cardinality(0), aggregatedCardinality);server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (1)
59-62: Unused parameters in build method.The
cacheandbreakerServiceparameters are ignored. This may be intentional since HLL sketches are loaded on-demand, but consider documenting this design decision.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java(1 hunks)server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java(1 hunks)server/src/main/java/org/opensearch/indices/IndicesModule.java(2 hunks)server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorFactory.java(2 hunks)server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java(1 hunks)server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java(1 hunks)server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java(1 hunks)server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java(1 hunks)server/src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (2)
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (1)
HllFieldData(43-176)server/src/main/java/org/opensearch/search/aggregations/LeafBucketCollector.java (1)
LeafBucketCollector(51-179)
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (3)
server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java (1)
HyperLogLogPlusPlus(66-465)test/framework/src/main/java/org/opensearch/test/OpenSearchSingleNodeTestCase.java (1)
OpenSearchSingleNodeTestCase(103-492)server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java (1)
Bucket(96-221)
server/src/main/java/org/opensearch/indices/IndicesModule.java (1)
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java (1)
HllFieldMapper(40-234)
🪛 GitHub Actions: Gradle Precommit
server/src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java
[error] 1-1: License header problems were found. Unapproved licenses detected for this test file. See server/build/reports/licenseHeaders/rat.log for details. Execution failed: ':server:licenseHeaders'.
🔇 Additional comments (14)
server/src/main/java/org/opensearch/search/aggregations/metrics/HyperLogLogPlusPlus.java (1)
194-196: LGTM! Improved error messaging.The updated error message now includes both precision values, which significantly aids debugging when sketches with mismatched precisions are incorrectly merged.
server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorFactory.java (1)
126-135: LGTM! Clean integration for HLL field type.The logic correctly detects HLL fields and delegates to
HllCardinalityAggregator. UsinghllFieldType.precision()(from the field mapping) rather than the aggregation'sprecisionThresholdis the correct approach since HLL fields have pre-aggregated sketches with a fixed precision.server/src/main/java/org/opensearch/indices/IndicesModule.java (1)
59-59: LGTM! Standard mapper registration.The
HllFieldMapperis correctly registered following the established pattern for built-in mappers.Also applies to: 164-164
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperTests.java (1)
27-57: LGTM! Comprehensive test coverage for HLL field mapper.The test class covers essential scenarios: mapping defaults, precision validation, query behavior, serialization round-trips, merge semantics (commutativity, associativity, union cardinality), and error handling for invalid/mismatched data.
server/src/main/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregator.java (1)
31-56: LGTM! Clean aggregator implementation.The lazy initialization pattern for
counts, proper resource management viaReleasables.close(), and delegation toHllFieldDatafor per-document sketch retrieval are well-designed.server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java (3)
29-33: LGTM!Clear class-level documentation and appropriate base class for single-node integration testing.
252-327: LGTM!Good test coverage for bucketed cardinality aggregations with appropriate tolerance for HLL++ approximation.
329-395: LGTM!Well-structured simplified test case for validating two-sketch merge behavior with proper resource cleanup in the finally block.
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java (2)
137-145: LGTM!The
ramBytesUsed()returning 0 andclose()being a no-op are appropriate sinceHllLeafFieldDatadoesn't own theLeafReader.
98-120: LGTM!Correctly preventing sorting operations on HLL fields with descriptive error messages.
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java (4)
53-96: LGTM!Well-structured builder with proper precision validation using constants from
AbstractHyperLogLog.
164-186: LGTM!Correct handling of both programmatic external values and parser-based binary input, with proper null handling.
230-233: LGTM!Standard merge builder pattern for
ParametrizedFieldMapper.
126-138: The valueFetcher implementation is correct as written.When _source data containing binary sketches is parsed, BytesRef objects are properly reconstructed from the base64-encoded JSON representation. The implementation correctly encodes these to base64 strings for readability in API responses, and gracefully handles any other input types by returning them unchanged.
server/src/main/java/org/opensearch/index/fielddata/plain/HllFieldData.java
Show resolved
Hide resolved
server/src/test/java/org/opensearch/index/mapper/HllFieldMapperIntegrationTests.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/opensearch/search/aggregations/metrics/HllCardinalityAggregatorTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/index/mapper/HllFieldMapper.java
Outdated
Show resolved
Hide resolved
asimmahmood1
left a comment
There was a problem hiding this comment.
Overall approach looks good, still looking at test cases.
97b7392 to
32434fd
Compare
|
❌ Gradle check result for a36290b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
We trying to restart gradle check but I don't have permission to reopen. @sandeshkr419 @prudhvigodithi |
|
❌ Gradle check result for a36290b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Approved the code. Please fix the CIs |
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Description
The
hllfield type stores pre-aggregated cardinality data using HyperLogLog++ (HLL++) sketch data structures. This field type is designed for internal use by OpenSearch and its plugins (such as the ISM plugin for multi-tier rollup operations) to enable efficient cardinality aggregation on rolled-up data.Key Features
Parameters
precision(optional): Integer between 4 and 18 (consistent with OpenSearch) that controls the accuracy-memory tradeoffRelated Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.