[FEA] Improve join_match_context API design and usability

**Is your feature request related to a problem? Please describe.**

The `join_match_context` and `join_partition_context` APIs reflect intentional design decisions that have proven suboptimal in practice:

1. **Non-standard naming**: Public members use leading underscores (`_left_table`, `_match_counts`), a convention normally reserved for private members
2. **Redundant unique_ptr wrapper**: `std::unique_ptr<rmm::device_uvector<size_type>>` adds an extra level of indirection but provides no ownership benefit, since `device_uvector` is already non-copyable and move-only
3. **Inconsistent terminology**: `hash_join` names its inputs `probe`/`build` (implementation-specific) while the context stores `_left_table`. All join types should use `left_table`/`right_table`, which is deterministic and matches join result semantics
4. **No encapsulation**: Direct member access prevents future changes without breaking users
5. **Code duplication**: `left_join_match_context` and `full_join_match_context` have identical implementations
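
To illustrate point 2, here is a minimal sketch that compiles without cudf or rmm. `move_only_buffer`, `wrapped_context`, and `direct_context` are hypothetical stand-ins invented for this example (the buffer mimics only the move-only ownership of `rmm::device_uvector`, not its real interface). The `static_assert`s check that both layouts have the same copy and move properties:

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for rmm::device_uvector<size_type>:
// owns a buffer, can be moved, cannot be copied.
class move_only_buffer {
 public:
  explicit move_only_buffer(std::size_t n) : data_(n) {}
  move_only_buffer(move_only_buffer const&)            = delete;
  move_only_buffer& operator=(move_only_buffer const&) = delete;
  move_only_buffer(move_only_buffer&&) noexcept            = default;
  move_only_buffer& operator=(move_only_buffer&&) noexcept = default;
  std::size_t size() const { return data_.size(); }

 private:
  std::vector<int> data_;
};

// Current style: the buffer sits behind a unique_ptr.
struct wrapped_context {
  std::unique_ptr<move_only_buffer> match_counts;
};

// Proposed style: the buffer is stored directly.
struct direct_context {
  move_only_buffer match_counts;
};

// Both layouts are move-only, so the wrapper adds no ownership guarantee.
static_assert(!std::is_copy_constructible_v<wrapped_context>);
static_assert(!std::is_copy_constructible_v<direct_context>);
static_assert(std::is_move_constructible_v<wrapped_context>);
static_assert(std::is_move_constructible_v<direct_context>);
```

One behavioral difference remains: a `std::unique_ptr` member can be null, whereas a directly stored buffer always exists. Any caller that relies on a null `_match_counts` would need to be checked before removing the wrapper.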

**Describe the solution you'd like**

1. **Remove leading underscores** from public members
2. **Store container directly**:
   ```cpp
   struct join_match_context {
     table_view left_table;
     rmm::device_uvector<size_type> match_counts;  // No unique_ptr wrapper
   };
   ```
3. **Add accessor methods** (if converting to a class):
   ```cpp
   device_span<size_type const> match_counts_span() const;
   table_view left_table() const;
   ```
4. **Standardize on `left_table`/`right_table`** terminology across all join types
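
A rough sketch of how items 2 and 3 could combine if `join_match_context` became a class. It is not the cudf implementation: `table_view`, `count_buffer`, and `count_span` below are simplified host-side stand-ins (assumptions made so the example compiles on its own) for `cudf::table_view`, `rmm::device_uvector<size_type>`, and `device_span<size_type const>`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using size_type = int;  // assumption: mirrors cudf::size_type

// Hypothetical stand-ins so the sketch builds without cudf/rmm.
struct table_view {
  size_type num_rows;
};
using count_buffer = std::vector<size_type>;  // stands in for rmm::device_uvector<size_type>

// Minimal read-only view, standing in for device_span<size_type const>.
struct count_span {
  size_type const* data;
  std::size_t size;
};

class join_match_context {
 public:
  join_match_context(table_view left, count_buffer counts)
    : left_table_{left}, match_counts_{std::move(counts)}
  {
    // One match count per row of the left table.
    assert(match_counts_.size() == static_cast<std::size_t>(left_table_.num_rows));
  }

  // Accessors expose read-only views; the storage type can change later
  // without breaking callers.
  table_view left_table() const { return left_table_; }
  count_span match_counts_span() const { return {match_counts_.data(), match_counts_.size()}; }

 private:
  table_view left_table_;      // stored directly
  count_buffer match_counts_;  // stored directly, no unique_ptr wrapper
};
```

Note that an accessor named `left_table()` cannot coexist with a data member named `left_table` in the same class, so the sketch uses trailing-underscore names for the private members. That naming is an assumption of this example, not part of the proposal.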

**Describe alternatives you've considered**

- **Minimal approach**: Add accessor methods while keeping struct members public (backward compatible)
- **Hybrid approach**: Convert `join_match_context` to a class but keep `join_partition_context` as a struct
- **Keep current design**: Maintain backward compatibility at the cost of usability improvements

**Additional context**

Current usage requires awkward dereferencing:
```cpp
auto match_context = hash_join.inner_join_match_context(probe, stream);
expect_match_counts_equal(*match_context._match_counts, {1, 0, 2, 1, 2}, stream);  // Must dereference

// Proposed: cleaner direct access with same ownership semantics
expect_match_counts_equal(match_context.match_counts, {1, 0, 2, 1, 2}, stream);
```

The partition API works well for chunking large datasets (documented in `sort_merge_join.hpp`). A potential enhancement would be adding a helper for equal-sized chunks:
```cpp
// Proposed helper for common chunking pattern
static std::vector<join_partition_context> equal_chunks(
  join_match_context const& ctx, size_type chunk_size);

// Would simplify current manual loop:
for (size_type start = 0; start < num_rows; start += chunk_size) {
  size_type end = std::min(start + chunk_size, num_rows);
  auto part_ctx = join_partition_context{context, start, end};
  // ...
}
```
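
For discussion, one possible shape for such a helper, written as a free function. The two structs are simplified stand-ins: the real contexts carry a table view and device data, and the field names `num_rows`, `context`, `start`, and `end` are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using size_type = int;  // assumption: mirrors cudf::size_type

// Hypothetical stand-ins; only the row count matters for chunking.
struct join_match_context {
  size_type num_rows;
};

struct join_partition_context {
  join_match_context const* context;  // non-owning; must outlive the partition
  size_type start;                    // first row of the chunk (inclusive)
  size_type end;                      // one past the last row (exclusive)
};

// Split [0, num_rows) into consecutive chunks of at most chunk_size rows.
// Only the final chunk may be shorter.
std::vector<join_partition_context> equal_chunks(join_match_context const& ctx,
                                                 size_type chunk_size)
{
  assert(chunk_size > 0);
  std::vector<join_partition_context> parts;
  size_type start = 0;
  while (start < ctx.num_rows) {
    // Clamp the length rather than the end index, so start + chunk_size
    // is never computed and cannot overflow.
    size_type const end = start + std::min(chunk_size, ctx.num_rows - start);
    parts.push_back({&ctx, start, end});
    start = end;
  }
  return parts;
}
```

With 10 rows and a chunk size of 4, this yields the ranges `[0, 4)`, `[4, 8)`, and `[8, 10)`. An empty context yields no partitions.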

Related APIs: `sort_merge_join::inner_join_match_context`, `hash_join::{inner,left,full}_join_match_context`

[FEA] Improve join_match_context API design and usability #20958
