feat: add TOLIST reducer for FT.AGGREGATE GROUPBY stage#932
feat: add TOLIST reducer for FT.AGGREGATE GROUPBY stage#932AlexFilipImproving wants to merge 20 commits intovalkey-io:mainfrom
Conversation
This commit squashes 59 commits containing: - Refactored FT.AGGREGATE reducers to process records in batch - Vector reducer function optimizations - Text search performance improvements - Compatibility test enhancements - Multi-DB support for FT.SEARCH - Various bug fixes and test coverage improvements - Locking optimizations for text indexes - Memory allocation improvements Key changes: - Changed ProcessRecords interface to accept all records at once for batch processing - Count reducer now O(1) instead of O(n) function calls - Improved text index locking and performance - Added comprehensive timeout coverage for search execution - Enhanced compatibility testing framework - Fixed lazy expiration crash in search commands - Optimized stemming, tokenization, and rax mutations - Added ingestion performance scenarios - Better separation between data collection and processing phases Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
This commit adds comprehensive vector value type support to the expression evaluation system, enabling vectors to be used in FT.AGGREGATE operations. Key changes: - Implement AsVector() accessor method for Value class - Add Vector variant to expr::Value with std::vector<double> storage - Implement vector-specific functions (VECLEN, VECGET, VECSET, etc.) - Add vector support to mathematical and string functions - Implement vector serialization to RESP format - Add comprehensive error handling for vector operations - Add vector comparison operators and nested vector support - Refactor Value::Nil to use std::string instead of const char* Testing: - Add comprehensive unit tests for vector operations - Add nested vector tests - Add vector comparison tests This also includes upstream changes from main branch merged during development, including text search optimizations, compatibility tests, and various bug fixes. Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add ToList ReducerInstance class that collects all distinct values of a field into a vector using absl::flat_hash_set<expr::Value>. Handles nil value skipping and one-level vector input flattening. Register TOLIST in the GroupBy::reducerTable with min_nargs=1, max_nargs=1. Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add ToListReducerTest covering basic distinct value collection, deduplication behavior, and TOLIST combined with COUNT in the same GROUPBY stage. Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add four TOLIST entries to the TestStages vector: - Valid case: basic TOLIST with one argument - Valid case: TOLIST with AS alias - Invalid case: TOLIST with 0 arguments - Invalid case: TOLIST with 2 arguments Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
|
|
||
| return data_set | ||
|
|
||
| def _check_and_regenerate_answers(answer_file, generator_file): |
There was a problem hiding this comment.
I think this code belongs in the build scripts, not here.
| def _check_and_regenerate_answers(answer_file, generator_file): | ||
| root_dir = os.getenv("ROOT_DIR") | ||
| answer_path = f"integration/compatibility/{answer_file}" | ||
| generator_path = f"integration/compatibility/{generator_file}" | ||
|
|
||
| answer_time = subprocess.run(["git", "log", "-1", "--format=%ct", answer_path], | ||
| capture_output=True, text=True, cwd=root_dir).stdout.strip().split('\n')[-1] | ||
| generator_time = subprocess.run(["git", "log", "-1", "--format=%ct", generator_path], | ||
| capture_output=True, text=True, cwd=root_dir).stdout.strip().split('\n')[-1] | ||
|
|
||
| if not answer_time or (generator_time and int(generator_time) > int(answer_time)): | ||
| print(f"Regenerating {answer_file}...") | ||
| os.system(f"cd {root_dir}/integration/compatibility && pytest {generator_file}") | ||
|
|
There was a problem hiding this comment.
I love the idea of having the answer files being re-generated. But I think the cmake/build scripts are the right place for that code.
| if (values[0].IsVector()) { | ||
| auto vec = values[0].GetVector(); | ||
| for (const auto& elem : *vec) { | ||
| if (!elem.IsNil()) { | ||
| values_.insert(elem); | ||
| } | ||
| } |
| }; | ||
|
|
||
| class ToList : public GroupBy::ReducerInstance { | ||
| absl::flat_hash_set<expr::Value> values_; |
There was a problem hiding this comment.
Is a hash_set correct or should it be a vector? Does the compatibility testing cover the cases:
- Duplicate values.
- Ordering of values.
| explicit Value(std::string&& s) : value_(std::move(s)) {} | ||
|
|
||
| // Vector constructors | ||
| explicit Value(Vector vec) : value_(vec) {} |
| const char* reason_; | ||
| std::string reason_; | ||
| }; | ||
| using Vector = std::shared_ptr<std::vector<Value>>; |
There was a problem hiding this comment.
I'd like to see us differentiate between a vector, in the vector-search sense and a collection of Values -- which I think could be called an Array.
A vector is guaranteed to be a collection of numbers (no nils, or strings, etc.). This will allow future functions to efficiently compute values from vectors (like vector distance, etc.)
An Array is a heterogeneous collection of values. It's less efficient because you have to test each element as you go -- and like end up with Nils in some cases.
We can also think about a Tensor type, where a "Vector" is a rank 1 Tensor.
Summary
Adds the
TOLISTreduction function forFT.AGGREGATE'sGROUPBYstage. TOLIST collects all distinct values of a specified field within each group into a vector (RESP array), providing RediSearch-compatible behavior.Changes
ToListreducer class inft_aggregate_exec.ccusingabsl::flat_hash_set<expr::Value>for deduplicationTOLISTinGroupBy::reducerTablewith min/max nargs of 1GetResult()returns anexpr::Value::Vector(serialized via existingSerializeVectorToResp)Testing
ft_aggregate_exec_test.cccovering basic collection, deduplication, vector type, and sizeft_aggregate_parser_test.ccfor valid syntax, AS alias, and invalid argument countsExample
Returns a nested RESP array of distinct
@namevalues per@categorygroup.