Skip to content

feat: add Random Sample reducer to FT.Aggregate groupby stage#949

Open
rileydes-improving wants to merge 27 commits intovalkey-io:mainfrom
Bit-Quill:feature/random-sample-implementation
Open

feat: add Random Sample reducer to FT.Aggregate groupby stage#949
rileydes-improving wants to merge 27 commits intovalkey-io:mainfrom
Bit-Quill:feature/random-sample-implementation

Conversation

@rileydes-improving
Copy link
Copy Markdown

Summary

Adds a RANDOM_SAMPLE reducer to FT.AGGREGATE GROUPBY stage, enabling reservoir-sampled field values to be returned as RESP arrays from aggregate queries in RediSearch compatible behavior (Not testable as sampling is random)

Changes

  • RandomSample class in ft_aggregate_exec.cc implements Algorithm R reservoir sampling with nil-skipping, a 1000-element cap, and a thread_local std::mt19937; registered in GroupBy::reducerTable as RANDOM_SAMPLE with 2 arguments (@ field, sample_size).
  • ft.aggregate.json and ft.aggregate.md updated to include RANDOM_SAMPLE as a valid reducer.
  • Uses vector value type from feat: Add vector value type support to expression system #893

Testing

  • Unit: ft_aggregate_exec_test.cc covers empty group, single element, sample < / == / > group size, nil exclusion, string and mixed-type inputs, multiple independent reducers, and per-group isolation
  • Integration: test_non_vector.py covers basic sampling, size < / == / > group, sample size 0, multiple reducers, per-category GROUPBY, and numeric field values.
  • Compatibility tests not added

Example

FT.AGGREGATE products "@price:[1 1000]"
  LOAD 1 price  APPLY 1 AS all
  GROUPBY 1 @all
  REDUCE RANDOM_SAMPLE 2 @price 5 AS sample

Returns one group with sample as a RESP array of up to 5 randomly selected price values.

This commit squashes 59 commits containing:
- Refactored FT.AGGREGATE reducers to process records in batch
- Vector reducer function optimizations
- Text search performance improvements
- Compatibility test enhancements
- Multi-DB support for FT.SEARCH
- Various bug fixes and test coverage improvements
- Locking optimizations for text indexes
- Memory allocation improvements

Key changes:
- Changed ProcessRecords interface to accept all records at once for batch processing
- Count reducer now O(1) instead of O(n) function calls
- Improved text index locking and performance
- Added comprehensive timeout coverage for search execution
- Enhanced compatibility testing framework
- Fixed lazy expiration crash in search commands
- Optimized stemming, tokenization, and rax mutations
- Added ingestion performance scenarios
- Better separation between data collection and processing phases

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
@allenss-amazon
Copy link
Copy Markdown
Member

I didn't review this because it's largely a duplicate of 946. It's too hard to see the differences. Let's fixup 946 and then rebase this one on top of it.

@rileydes-improving rileydes-improving force-pushed the feature/random-sample-implementation branch from f3c20cd to 5aca69c Compare April 13, 2026 17:57
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
@rileydes-improving rileydes-improving force-pushed the feature/random-sample-implementation branch from 5aca69c to 221ced0 Compare April 15, 2026 15:54
AlexFilipImproving and others added 16 commits April 17, 2026 17:25
Refactor GroupBy reducers from batch ProcessRecords to incremental
ProcessRecord, allowing each reducer to process records as they arrive
instead of collecting all values first. Reducers now own their parsing
logic via BasicReducerParser, decoupling reducer construction from the
aggregate parser. ReducerInfo becomes a factory function pointer and
Reducer is now a virtual base class with MakeInstance.

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
This commit adds comprehensive vector value type support to the expression
evaluation system, enabling vectors to be used in FT.AGGREGATE operations.

Key changes:
- Implement AsVector() accessor method for Value class
- Add Vector variant to expr::Value with std::vector<double> storage
- Implement vector-specific functions (VECLEN, VECGET, VECSET, etc.)
- Add vector support to mathematical and string functions
- Implement vector serialization to RESP format
- Add comprehensive error handling for vector operations
- Add vector comparison operators and nested vector support
- Refactor Value::Nil to use std::string instead of const char*

Testing:
- Add comprehensive unit tests for vector operations
- Add nested vector tests
- Add vector comparison tests

This also includes upstream changes from main branch merged during
development, including text search optimizations, compatibility tests,
and various bug fixes.

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
note: ProcessRecords cannot propogate errors which is why errors return nill

- Add RANDOM_SAMPLE reducer to ft.aggregate command
- Update command documentation with RANDOM_SAMPLE syntax and description
- Implement RANDOM_SAMPLE parsing in ft_aggregate_parser.h
- Add RANDOM_SAMPLE execution logic in ft_aggregate_exec.cc
- Update ft.aggregate.json command schema with RANDOM_SAMPLE definition
- Add comprehensive unit and integration tests for RANDOM_SAMPLE

Signed-off-by: Riley Des <riley.desserre@improving.com>
Signed-off-by: Riley Des <riley.desserre@improving.com>
Signed-off-by: Riley Des <riley.desserre@improving.com>
…ic once

Signed-off-by: Riley Des <riley.desserre@improving.com>
Signed-off-by: Riley Des <riley.desserre@improving.com>
…nges

- Move sample_size_ initialization from ProcessRecords to constructor
- Change ProcessRecords to ProcessRecord for single-record processing
- Extract RandomSampleReducer struct with custom parser for parse-time validation
- Validate sample size at parse time instead of runtime for better error handling
- Remove ReducerInstanceVector struct (no longer needed)
- Register RandomSampleReducerParser in reducerTable for RANDOM_SAMPLE command
- Simplifies runtime processing by pre-validating parameters during aggregation setup (matching Redis compatibility)

Signed-off-by: Riley Des <riley.desserre@improving.com>
…ity tests

- Add comprehensive error case tests for RANDOM_SAMPLE reducer in compatibility test suite
- Include validation for wrong argument counts, non-numeric/non-integer sample sizes, negative values, and values exceeding the 1000 cap
- Remove duplicate error handling tests from test_non_vector.py that are now covered by compatibility tests
- Add TryParseStages helper method to ft_aggregate_exec_test.cc for testing parse-time error paths
- Add RandomSampleParseErrorsTest to verify negative sample sizes and other invalid inputs are rejected at parse time
- Consolidate error validation into compatibility layer to avoid redundant test coverage

Signed-off-by: Riley Des <riley.desserre@improving.com>
Signed-off-by: Riley Des <riley.desserre@improving.com>
@rileydes-improving rileydes-improving force-pushed the feature/random-sample-implementation branch from 221ced0 to bae15fe Compare April 23, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants