Improve parquet dictionary encoding #10635
rapids-bot[bot] merged 14 commits into rapidsai:branch-22.06
Conversation
Codecov Report

```
@@             Coverage Diff              @@
##           branch-22.06   #10635      +/- ##
================================================
+ Coverage        86.15%   86.36%    +0.20%
================================================
  Files              141      140        -1
  Lines            22510    22304      -206
================================================
- Hits             19394    19263      -131
+ Misses            3116     3041       -75
================================================
```

Continue to review full report at Codecov.
rerun tests
```cpp
uniq_elem_size = [&]() -> size_type {
  if (not is_unique) { return 0; }
  switch (col->physical_type) {
    case Type::INT32: return 4;
    case Type::INT64: return 8;
    case Type::INT96: return 12;
    case Type::FLOAT: return 4;
    case Type::DOUBLE: return 8;
    case Type::BYTE_ARRAY:
      if (data_col.type().id() == type_id::STRING) {
        // Strings are stored as 4 byte length + string bytes
        return 4 + data_col.element<string_view>(val_idx).size_bytes();
      }
    case Type::FIXED_LEN_BYTE_ARRAY:
      if (data_col.type().id() == type_id::DECIMAL128) { return sizeof(__int128_t); }
    default: CUDF_UNREACHABLE("Unsupported type for dictionary encoding");
  }
```
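For reference, the per-type sizing above can be exercised on the host. This is a simplified sketch, not the device code itself: the `Type` enum and the `is_string`/`string_bytes` parameters are stand-ins (assumptions) for the real `parquet::Type` and the cudf column accessors, and the unreachable fallthrough is collapsed into a plain return.

```cpp
#include <cassert>
#include <cstddef>

// Host-side sketch of the per-type sizing logic. Type, is_string, and
// string_bytes are simplified stand-ins for parquet::Type and the cudf
// column accessors used in the device code.
enum class Type { INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };

// Bytes one newly inserted (unique) element adds to the dictionary page.
int uniq_elem_size(Type physical_type, bool is_unique, bool is_string, std::size_t string_bytes)
{
  if (!is_unique) { return 0; }
  switch (physical_type) {
    case Type::INT32: return 4;
    case Type::INT64: return 8;
    case Type::INT96: return 12;
    case Type::FLOAT: return 4;
    case Type::DOUBLE: return 8;
    case Type::BYTE_ARRAY:
      // Strings are stored as a 4-byte length prefix followed by the bytes.
      if (is_string) { return 4 + static_cast<int>(string_bytes); }
      return 0;  // simplified: the real code treats remaining cases as unreachable
    case Type::FIXED_LEN_BYTE_ARRAY:
      return static_cast<int>(sizeof(__int128_t));  // DECIMAL128
  }
  return 0;
}
```

Note the device version returns 0 before the switch whenever the element was not newly inserted, so only the first occurrence of each value contributes its size to the dictionary page estimate.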
This switch seems redundant with the type_dispatcher. Couldn't the map_insert_fn be made to return the same information and avoid the extra switch?
Specifically, it seems like this could be simplified to:

```cpp
auto const [is_unique, element_size] = is_valid ? type_dispatcher(...) : {0, 0};
```
These are parquet types, not cudf types.
I assume the parquet type can be derived from the cudf type?
Even better would be to push the is_valid check inside map_insert_fn. Then I'd write this as:

```cpp
while (val_idx - block_size < end_value_idx) {
  thrust::optional<size_type> unique_element_size = type_dispatcher(...);
  ...
  auto const num_unique = block_reduce(reduce_storage).Sum(unique_element_size.has_value());
  __syncthreads();
  auto const unique_data_size = block_reduce(reduce_storage).Sum(unique_element_size.value_or(0));
  ...
}
```
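The shape of that suggestion can be mimicked on the host with `std::optional`: a minimal sketch in which `reduce_block` stands in for the two `cub::BlockReduce` sums (the struct and function names are hypothetical, and `std::optional` substitutes for `thrust::optional`).

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Each element is nullopt when the value was invalid or already present in
// the map, otherwise it holds the encoded size of the newly inserted element.
struct BlockTotals {
  int num_unique;        // mirrors Sum(unique_element_size.has_value())
  int unique_data_size;  // mirrors Sum(unique_element_size.value_or(0))
};

// Host-side stand-in for the two block-wide reductions: counting uniques is
// a sum over has_value(), sizing them is a sum over value_or(0).
BlockTotals reduce_block(std::vector<std::optional<int>> const& sizes)
{
  BlockTotals t{0, 0};
  for (auto const& s : sizes) {
    t.num_unique += s.has_value() ? 1 : 0;
    t.unique_data_size += s.value_or(0);
  }
  return t;
}
```

The point of the optional is that one dispatch result feeds both reductions, so the separate size switch and the separate uniqueness flag collapse into a single value per thread.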
> I assume the parquet type can be derived from the cudf type?

Not trivially. There can be multiple parquet types associated with a cuDF type, and which one is used is determined by user-passed metadata, which is available here. If the user wants timestamps to be encoded as int96 instead of int64, that can affect the decision to use dictionary encoding.
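As a hedged illustration of that one-to-many mapping, consider the sketch below; the enum values and the `to_parquet_type` function are simplified stand-ins, not the real cudf or parquet APIs.

```cpp
#include <cassert>

// Simplified stand-ins (assumptions) for cudf::type_id and parquet::Type.
enum class CudfType { INT64, TIMESTAMP_NS, STRING };
enum class ParquetType { INT64, INT96, BYTE_ARRAY };

// The parquet physical type is not a pure function of the cudf type: for
// timestamps, user-supplied writer metadata can select the legacy INT96
// representation instead of INT64, which changes the element size (12 vs 8
// bytes) and thus the dictionary-size estimate.
ParquetType to_parquet_type(CudfType t, bool user_requests_int96)
{
  switch (t) {
    case CudfType::TIMESTAMP_NS:
      return user_requests_int96 ? ParquetType::INT96 : ParquetType::INT64;
    case CudfType::STRING: return ParquetType::BYTE_ARRAY;
    default: return ParquetType::INT64;
  }
}
```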
Don't benchmark using file output; use void output. The IO time dominates the kernel running time and hides any actual improvements.
hyperbolic2346 left a comment

Looks good, thanks!
@gpucibot merge
This PR includes several changes to improve parquet dictionary encoding:

- specify `__launch_bounds__`
- use a `while` loop

Other ideas tested but not eventually included in this PR due to zero or negative performance impact:

- `cg::shfl` instead of shared memory + sync for `insert`/`find`
- combined reduction of `num_dict_entries` and `uniq_data_size`
- `cg::reduce` instead of `cub::BlockReduce`

Before:

Now: