Populate HeadStats in _tsdb/stats endpoint #70
wentingwang wants to merge 6 commits into opensearch-project:main
Conversation
Codecov Report
@@ Coverage Diff @@
## main #70 +/- ##
============================================
+ Coverage 89.09% 89.11% +0.01%
- Complexity 5009 5026 +17
============================================
Files 321 321
Lines 15421 15486 +65
Branches 2328 2345 +17
============================================
+ Hits 13740 13800 +60
- Misses 1014 1015 +1
- Partials 667 671 +4
    @Override
    public InternalAggregation buildEmptyAggregation() {
        // Return empty coordinator-level stats
        InternalTSDBStats.CoordinatorLevelStats emptyStats = new InternalTSDBStats.CoordinatorLevelStats(null, Map.of());
        return InternalTSDBStats.forCoordinatorLevel(name, null, emptyStats, metadata());
    }
shouldn't this return shardLevel stats?
what happens when there's no data on a shard? or is that not possible?
buildEmptyAggregation() returns coordinator-level stats, but shard reduce rejects them

TSDBStatsAggregator.java:236-240 — buildEmptyAggregation() returns CoordinatorLevelStats:

    public InternalAggregation buildEmptyAggregation() {
        InternalTSDBStats.CoordinatorLevelStats emptyStats = ...;
        return InternalTSDBStats.forCoordinatorLevel(name, null, emptyStats, metadata());
    }

But InternalTSDBStats.java:413-414 — the shard-level reduce throws when it encounters coordinator-level stats:

    if (stats.shardStats == null) {
        throw new IllegalStateException("Expected shard-level stats but got coordinator-level stats in shard reduce");
    }

OpenSearch calls buildEmptyAggregation() for shards with zero matching documents. In a multi-shard deployment where some shards have data and others don't, the shard-level reduce will mix buildAggregation() results (shard-level) with buildEmptyAggregation() results (coordinator-level) and crash with IllegalStateException.
If TSDBStatsAggregator is used as a sub-aggregation under a bucket aggregator (e.g., a filters or terms agg), then any bucket with zero docs will call buildEmptyAggregation(), producing coordinator-level stats. Those get mixed with shard-level stats from non-empty buckets during reduceShardLevel(), triggering the crash.
Is this possible in TSDB, or will this never happen?
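To make the failure mode concrete, here is a minimal, self-contained sketch. The class and field names (`LevelMixDemo`, `Stats`) are hypothetical stand-ins, not the actual OpenSearch types; only the guard's behavior mirrors the code quoted above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the two-level stats described in this thread.
public class LevelMixDemo {
    // Exactly one of the two maps is non-null, like shardStats/coordinatorStats.
    public record Stats(Map<String, Long> shardStats, Map<String, Long> coordinatorStats) {}

    // Mirrors the shard-level reduce guard: coordinator-level inputs are rejected.
    public static Stats reduceShardLevel(List<Stats> inputs) {
        Map<String, Long> merged = new HashMap<>();
        for (Stats s : inputs) {
            if (s.shardStats() == null) {
                throw new IllegalStateException("Expected shard-level stats but got coordinator-level stats in shard reduce");
            }
            s.shardStats().forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return new Stats(merged, null);
    }

    public static void main(String[] args) {
        Stats withData = new Stats(Map.of("numSeries", 3L), null);
        Stats emptyCoordinator = new Stats(null, Map.of()); // what buildEmptyAggregation() produces today
        Stats emptyShard = new Stats(Map.of(), null);       // what an empty-shard result could look like

        try {
            reduceShardLevel(List.of(withData, emptyCoordinator));
        } catch (IllegalStateException e) {
            System.out.println("mixing levels crashes: " + e.getMessage());
        }
        // An empty *shard-level* result reduces cleanly:
        System.out.println(reduceShardLevel(List.of(withData, emptyShard)).shardStats()); // {numSeries=3}
    }
}
```

The sketch suggests why returning an empty shard-level result from buildEmptyAggregation() would sidestep the crash.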
what happens when there's no data on a shard? or is that not possible?
It still goes through buildAggregation, which builds an empty ShardLevelStats. TSDBStatsAggregator is a top-level aggregation for now, so in practice we never call buildEmptyAggregation(), but we can change it to return shard-level stats.
Also fixed reduce(), so it won't rely on isFinalReduce for the partialReduce case.
all the changes are fixed in #57, so it will be easier to review this PR later
    if (seriesFingerprintSet != null) {
        out.writeBoolean(true);
        out.writeVLong(numSeries);
        out.writeCollection(seriesFingerprintSet, StreamOutput::writeVLong);
writeVLong used for fingerprint serialization — crashes on negative values

InternalTSDBStats.java:128,146 — ShardLevelStats serializes fingerprint sets using writeVLong/readVLong:

    out.writeCollection(seriesFingerprintSet, StreamOutput::writeVLong); // line 128
    out.writeCollection(fingerprintSet, StreamOutput::writeVLong);       // line 146

writeVLong throws IllegalStateException for negative values. The fingerprints come from TSDBStatsAggregator.java:148:

    long seriesId = seriesIdDocValues.longValue();

This reads from LABELS_HASH (a hash field) whose values can absolutely be negative. Any negative hash will crash serialization during shard-to-coordinator transport.

Fix: Use writeLong/readLong (or writeZLong/readZLong for variable-length encoding that supports negatives) instead of writeVLong/readVLong for fingerprint sets.
please double-check if this is a real issue; I do remember the stable hash will produce negative values
good catch, changed it to ZLong. but interestingly, when I tested on ts-poc even for a 1.6M query, it didn't have exceptions.
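For context on why ZLong works where VLong does not: zig-zag encoding (the scheme behind ZLong-style variable-length serialization) maps signed longs to non-negative values before the variable-length bytes are written, so negative hashes survive. A standalone sketch of the encoding follows; it is not the actual StreamOutput implementation.

```java
// Standalone sketch of zig-zag encoding; not the OpenSearch StreamOutput code.
public class ZigZagDemo {
    // Maps ... -2 -> 3, -1 -> 1, 0 -> 0, 1 -> 2, 2 -> 4 ...
    public static long encode(long v) {
        return (v << 1) ^ (v >> 63);
    }

    public static long decode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    public static void main(String[] args) {
        long negativeHash = -123456789L;                      // e.g. a negative LABELS_HASH value
        long encoded = encode(negativeHash);
        System.out.println(encoded >= 0);                     // true: safe for unsigned vlong bytes
        System.out.println(decode(encoded) == negativeHash);  // true: lossless round-trip
    }
}
```

Small-magnitude values (positive or negative) stay small after encoding, so the variable-length size advantage is preserved.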
    public record ShardLevelStats(Set<Long> seriesFingerprintSet, Map<String, Map<String, Set<Long>>> labelStats,
                                  boolean includeValueStats) {
can you document what each of the fields is, with an example?
    (valueOutput, fingerprintSet) -> { // Value writer: Set<Long> or null
        if (fingerprintSet != null) {
            valueOutput.writeBoolean(true);
            valueOutput.writeCollection(fingerprintSet, StreamOutput::writeVLong);
what's the difference between this fingerprintSet and seriesFingerprintSet above?
seriesFingerprintSet is all the fingerprints we've seen so far (gives the total numSeries for the "fetch" statement); fingerprintSet is the fingerprint set per label value (gives the numSeries per label value)
    // Sanity check: exactly one of shardStats or coordinatorStats must be non-null
    if ((shardStats == null) == (coordinatorStats == null)) {
        throw new IllegalArgumentException("Exactly one of shardStats or coordinatorStats must be non-null");
    }
this feels more like something for an assert than IllegalArgumentException?
this is not something controlled by the caller, right? if this happens, it's a bug in our code?
yes, if it happens, it is a bug. fixed
    long seriesId = seriesIdDocValues.longValue();

    // Already processed this series - skip entire document
    if (!seenSeriesIdentifiers.add(seriesId)) {
iirc LSI.reference and labels_hash are not the same value, so this check would not be able to detect that the same series was already accounted for.
Does this affect correctness/count values (e.g. result in double counting)? Or would it just be less optimal?
i think this can potentially affect correctness. e.g. if a labels hash is between 0 and 1,000,000 (which will be in the series Ref range, which is just the count of the number of series in LSI), this will incorrectly skip processing. and vice versa.
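A toy illustration of that collision concern (values are made up, and it assumes the two id spaces could overlap, which the rest of the thread debates):

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration: if LSI references were small counters while CCI ids were
// 64-bit hashes, a shared Set<Long> dedup could collide across the two spaces.
public class DedupCollisionDemo {
    // Returns true if the id has not been seen before (same pattern as
    // seenSeriesIdentifiers.add(seriesId) in the aggregator).
    public static boolean isNewSeries(Set<Long> seen, long id) {
        return seen.add(id);
    }

    public static void main(String[] args) {
        Set<Long> seen = new HashSet<>();
        long cciLabelsHash = 5L;  // hash of some series, happens to be small
        long lsiReference = 5L;   // unrelated series: the 5th one in the live index
        System.out.println(isNewSeries(seen, cciLabelsHash)); // true: processed
        System.out.println(isNewSeries(seen, lsiReference));  // false: wrongly skipped
    }
}
```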
no, they are the same: https://github.com/opensearch-project/time-series-db/blob/main/src/main/java/org/opensearch/index/engine/TSDBEngine.java#L432. It is just that later we save it in the translog and call it seriesReference. We should use the same name to store it in LSI and CCI.
ok i see, can you add a comment to the constructor where this.seriesIdDocValues is initialized, and mention we assume LSI.reference == CCI.labels_hash and that it's required for this to function properly.
And let's add a test case to HeadTests similar to HeadTests#testHeadLifecycle, but instead assert that if you scan through all the docs across LSI and CCI to get the seriesId, they are all equal (since that test only has 1 series). You can mention in the javadoc/comments that the stats aggregator requires this behavior.
synced offline with @itschrispeck; we'd still need to decode labels from CCI and LSI to always fully dedup across both. changing it back to decoding labels instead of fetching reference/labels_hash
    try {
        labelValuePairOrdinalMap.close();
    } catch (Exception e) {
        // Log and continue - BytesRefHash may fail on multiple close calls
says log, but doesn't log
already fixed upstream, will rebase.
        new CachedWildcardQueryBuilder(org.opensearch.tsdb.core.mapping.Constants.IndexSchema.LABELS, labelFilter)
    );
    } else {
        // Exact term query on labels field
        labelQuery.should(
            QueryBuilders.termQuery(org.opensearch.tsdb.core.mapping.Constants.IndexSchema.LABELS, labelFilter)
    }

    // ========== Grouped Format Tests ==========
    // ========== Grouped Format Tests (Combined: 3 → 2) ==========
what does (Combined: 3 → 2) mean?
at one point, Claude generated many more unit tests and I asked it to consolidate them; probably from that.
    index_configs:
      - name: "tsdb_stats_test"
        shards: 1
tests should have multiple shards to verify coordinator reduce
    time_config:
      min_timestamp: "2025-01-01T00:00:00Z"
      max_timestamp: "2025-01-01T01:00:00Z"
      step: "5m"
this will only test the LSI/Head block without CCI, right?
we should use sparser timestamps to ensure that behavior is also tested, e.g. a > 6 hour range to test 2+ blocks
fixed in the upstream diff
|
others I'll leave here; you can take a look and see if these are actual issues. if not, you can ignore them
What it said is true. This works for TSDBStatsResponseListener but not for other callers. Will fix later.
This is intentional. But maybe because of include=labelStats the naming is confusing. Originally I wanted a mode where we only return label values per key, to cover the /search m3 endpoint use case. Renamed the param to include=labelValues.
a12c508 to b6b8342
    // Accumulate per-doc HeadStats for live series segments
    if (includeHeadStats && tsdbLeafReader instanceof LiveSeriesIndexLeafReader) {
        headNumSeries++;
        headNumChunks += ((LiveSeriesIndexLeafReader) tsdbLeafReader).numChunksForDoc(doc, tsdbDocValues);
HeadStats accumulation sits after the seenSeriesIds dedup check. If a series exists in both a ClosedChunkIndex segment and a LiveSeriesIndex segment (same stableHash), and the closed chunk doc is processed first, the live doc gets deduped, and HeadStats skips it — undercounting headNumSeries/headNumChunks.
We can move this block before the dedup check, since HeadStats should count all series in the head independently of label-stats dedup.
makes sense, fixed
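A minimal sketch of the ordering fix, assuming a simplified per-doc loop (Doc and collectHeadStats are illustrative stand-ins for the real reader plumbing):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the aggregator's per-doc loop: HeadStats accumulation
// runs BEFORE the series dedup check, so a live-series doc still counts even if
// the same series was already seen in a closed-chunk segment.
public class HeadStatsOrderDemo {
    public record Doc(long seriesId, boolean liveSegment, long numChunks) {}

    // Returns {headNumSeries, headNumChunks}.
    public static long[] collectHeadStats(List<Doc> docs) {
        Set<Long> seenSeriesIds = new HashSet<>();
        long headNumSeries = 0, headNumChunks = 0;
        for (Doc doc : docs) {
            if (doc.liveSegment()) {         // accumulate before the dedup check
                headNumSeries++;
                headNumChunks += doc.numChunks();
            }
            if (!seenSeriesIds.add(doc.seriesId())) {
                continue;                    // dedup applies to label stats only
            }
            // ... label stats collection would happen here ...
        }
        return new long[] { headNumSeries, headNumChunks };
    }

    public static void main(String[] args) {
        // Series 7 exists in a closed-chunk segment (processed first) and the head.
        List<Doc> docs = List.of(new Doc(7, false, 0), new Doc(7, true, 4));
        long[] stats = collectHeadStats(docs);
        System.out.println(stats[0]); // 1 head series, not 0
        System.out.println(stats[1]); // 4 head chunks
    }
}
```

With the accumulation after the dedup check, the same input would report 0 head series, which is the undercount described above.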
    * "headStats": {
    *   "numSeries": 508,
    *   "chunkCount": 937,
    *   "minTime": 1591516800000,
chunkCount was removed from this javadoc example, but numChunks wasn't added to replace it. Same with the flat-format example below.
Collect HeadStats from LiveSeriesIndexLeafReader in TSDBStatsAggregator, merge across shards by summing numSeries/chunkCount and taking min/max of time bounds, and include in the endpoint response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wenting Wang <wenting.wang@uber.com>

rebase for reviewing stacked diff, unit tests failed, will fix in the next commit

Signed-off-by: Wenting Wang <wenting.wang@uber.com>

fix unit test due to rebase

Signed-off-by: Wenting Wang <wenting.wang@uber.com>

add includeHeadStats in TSDBStatsAggregator

Signed-off-by: Wenting Wang <wenting.wang@uber.com>
43aa46f to 43e06d4
Summary
- Collect HeadStats from LiveSeriesIndexLeafReader in TSDBStatsAggregator (numSeries, numChunks, minTime, maxTime)
- Add numChunksForDoc in LiveSeriesIndexLeafReader to read numChunks in the Head
- Merge HeadStats across shards in InternalTSDBStats.reduce() by summing numSeries/chunkCount and taking min/max of time bounds
- Updated integration tests to include HeadStats in expected responses
- Change integration tests to insert data based on timestamp to avoid OOO cutoff

Test plan
- ./gradlew check passes
- ./gradlew javaRestTest passes

🤖 Generated with Claude Code
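The cross-shard merge described in the summary can be sketched as follows. This is an illustration, not the actual InternalTSDBStats.reduce() code; the field names are assumed from the response example quoted earlier.

```java
// Illustrative sketch of merging per-shard HeadStats: counts are summed,
// time bounds take the min/max across shards.
public class HeadStatsMergeDemo {
    public record HeadStats(long numSeries, long numChunks, long minTime, long maxTime) {
        public HeadStats merge(HeadStats other) {
            return new HeadStats(
                numSeries + other.numSeries,
                numChunks + other.numChunks,
                Math.min(minTime, other.minTime),
                Math.max(maxTime, other.maxTime)
            );
        }
    }

    public static void main(String[] args) {
        HeadStats shard1 = new HeadStats(300, 500, 1_000L, 5_000L);
        HeadStats shard2 = new HeadStats(208, 437, 500L, 6_000L);
        HeadStats merged = shard1.merge(shard2);
        System.out.println(merged.numSeries()); // 508
        System.out.println(merged.numChunks()); // 937
        System.out.println(merged.minTime());   // 500
        System.out.println(merged.maxTime());   // 6000
    }
}
```

Summing series counts across shards assumes a series lives on one shard; if the same series could appear on multiple shards, the coordinator would need fingerprint sets (as elsewhere in this PR) rather than plain sums.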