Skip to content

feat: add TOLIST reducer for FT.AGGREGATE GROUPBY stage#932

Open
AlexFilipImproving wants to merge 20 commits intovalkey-io:mainfrom
Bit-Quill:feature/tolist
Open

feat: add TOLIST reducer for FT.AGGREGATE GROUPBY stage#932
AlexFilipImproving wants to merge 20 commits intovalkey-io:mainfrom
Bit-Quill:feature/tolist

Conversation

@AlexFilipImproving
Copy link
Copy Markdown
Collaborator

Summary

Adds the TOLIST reduction function for FT.AGGREGATE's GROUPBY stage. TOLIST collects all distinct values of a specified field within each group into a vector (RESP array), providing RediSearch-compatible behavior.

Changes

  • Added ToList reducer class in ft_aggregate_exec.cc using absl::flat_hash_set<expr::Value> for deduplication
  • Registered TOLIST in GroupBy::reducerTable with min/max nargs of 1
  • Nil values are skipped; vector inputs are flattened one level
  • GetResult() returns an expr::Value::Vector (serialized via existing SerializeVectorToResp)

Testing

  • Unit tests in ft_aggregate_exec_test.cc covering basic collection, deduplication, vector type, and size
  • Parser tests in ft_aggregate_parser_test.cc for valid syntax, AS alias, and invalid argument counts
  • Integration test covering end-to-end RESP output, deduplication, multi-reducer GROUPBY, and AS alias

Example

FT.AGGREGATE idx "*" GROUPBY 1 @category REDUCE TOLIST 1 @name AS names

Returns a nested RESP array of distinct @name values per @category group.

This commit squashes 59 commits containing:
- Refactored FT.AGGREGATE reducers to process records in batch
- Vector reducer function optimizations
- Text search performance improvements
- Compatibility test enhancements
- Multi-DB support for FT.SEARCH
- Various bug fixes and test coverage improvements
- Locking optimizations for text indexes
- Memory allocation improvements

Key changes:
- Changed ProcessRecords interface to accept all records at once for batch processing
- Count reducer now O(1) instead of O(n) function calls
- Improved text index locking and performance
- Added comprehensive timeout coverage for search execution
- Enhanced compatibility testing framework
- Fixed lazy expiration crash in search commands
- Optimized stemming, tokenization, and rax mutations
- Added ingestion performance scenarios
- Better separation between data collection and processing phases

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
This commit adds comprehensive vector value type support to the expression
evaluation system, enabling vectors to be used in FT.AGGREGATE operations.

Key changes:
- Implement AsVector() accessor method for Value class
- Add Vector variant to expr::Value with std::vector<double> storage
- Implement vector-specific functions (VECLEN, VECGET, VECSET, etc.)
- Add vector support to mathematical and string functions
- Implement vector serialization to RESP format
- Add comprehensive error handling for vector operations
- Add vector comparison operators and nested vector support
- Refactor Value::Nil to use std::string instead of const char*

Testing:
- Add comprehensive unit tests for vector operations
- Add nested vector tests
- Add vector comparison tests

This also includes upstream changes from main branch merged during
development, including text search optimizations, compatibility tests,
and various bug fixes.

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add ToList ReducerInstance class that collects all distinct values of a
field into a vector using absl::flat_hash_set<expr::Value>. Handles nil
value skipping and one-level vector input flattening. Register TOLIST in
the GroupBy::reducerTable with min_nargs=1, max_nargs=1.

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add ToListReducerTest covering basic distinct value collection,
deduplication behavior, and TOLIST combined with COUNT in the
same GROUPBY stage.

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Add four TOLIST entries to the TestStages vector:
- Valid case: basic TOLIST with one argument
- Valid case: TOLIST with AS alias
- Invalid case: TOLIST with 0 arguments
- Invalid case: TOLIST with 2 arguments

Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>

return data_set

def _check_and_regenerate_answers(answer_file, generator_file):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code belongs in the build scripts, not here.

Comment on lines +493 to +506
def _check_and_regenerate_answers(answer_file, generator_file):
root_dir = os.getenv("ROOT_DIR")
answer_path = f"integration/compatibility/{answer_file}"
generator_path = f"integration/compatibility/{generator_file}"

answer_time = subprocess.run(["git", "log", "-1", "--format=%ct", answer_path],
capture_output=True, text=True, cwd=root_dir).stdout.strip().split('\n')[-1]
generator_time = subprocess.run(["git", "log", "-1", "--format=%ct", generator_path],
capture_output=True, text=True, cwd=root_dir).stdout.strip().split('\n')[-1]

if not answer_time or (generator_time and int(generator_time) > int(answer_time)):
print(f"Regenerating {answer_file}...")
os.system(f"cd {root_dir}/integration/compatibility && pytest {generator_file}")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea of having the answer files being re-generated. But I think the cmake/build scripts are the right place for that code.

Comment on lines +370 to +376
if (values[0].IsVector()) {
auto vec = values[0].GetVector();
for (const auto& elem : *vec) {
if (!elem.IsNil()) {
values_.insert(elem);
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen?

};

class ToList : public GroupBy::ReducerInstance {
absl::flat_hash_set<expr::Value> values_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a hash_set correct or should it be a vector? Does the compatibility testing cover the cases:

  1. Duplicate values.
  2. Ordering of values.

explicit Value(std::string&& s) : value_(std::move(s)) {}

// Vector constructors
explicit Value(Vector vec) : value_(vec) {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const & ?

const char* reason_;
std::string reason_;
};
using Vector = std::shared_ptr<std::vector<Value>>;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see us differentiate between a vector, in the vector-search sense and a collection of Values -- which I think could be called an Array.

A vector is guaranteed to be a collection of numbers (no nils, or strings, etc.). This will allow future functions to efficiently compute values from vectors (like vector distance, etc.)

An Array is a heterogeneous collection of values. It's less efficient because you have to test each element as you go -- and like end up with Nils in some cases.

We can also think about a Tensor type, where a "Vector" is a rank 1 Tensor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants