GH-36388: [C++][Array] Add Overflow check for string Repeat#38504
GH-36388: [C++][Array] Add Overflow check for string Repeat#38504llama90 wants to merge 7 commits intoapache:mainfrom
Conversation
|
|
|
@AlenkaF Hello. Is it appropriate to fix the part you mentioned as an issue like this? If you have time, I would appreciate it if you could leave a review. |
js8544
left a comment
There was a problem hiding this comment.
Thanks for your contribution! The implementation looks good in general. I left a few comments.
| auto scalar = std::make_shared<StringScalar>("aa"); | ||
|
|
||
| // Use a length that will cause an overflow when multiplied by the size of the string | ||
| int64_t length = static_cast<int64_t>(std::numeric_limits<int64_t>::max()) / 2 + 1; |
There was a problem hiding this comment.
We should probably test StringArray with int32_t::max and LargeStringArray with int64_t::max
|
@js8544 Thank you for your review!! I'll take your advice and try to improve it. |
|
@js8544 Hello. I've done as you mentioned. I've now utilized Furthermore, I've put in place checks for when the length exceeds the Thank you for your insightful review. It has certainly helped in refining my approach and making sure all bases are covered. Your guidance is greatly appreciated! |
cpp/src/arrow/array/util.cc
Outdated
| return Status::Invalid( | ||
| "length exceeds the maximum value of offset_type: " + std::to_string(length_) + | ||
| " is greater than " + | ||
| std::to_string(std::numeric_limits<typename T::offset_type>::max())); |
There was a problem hiding this comment.
I have no problem having this check here.
BTW You can just use Status::Invalid("some message ", length_, " some message ", number) without using to_string and +
There was a problem hiding this comment.
On second thought, what if the input scalar an empty string ""? Can the length be larger than int32::max?
There was a problem hiding this comment.
It could be larger than in32::max if the string were empty. We should only apply this check to the product of (length of repeated string) * (number of repeats).
| auto array_result = MakeArrayFromScalar(*scalar, length); | ||
|
|
||
| std::string err_msg = "offset overflow in repeated array construction"; | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr(err_msg), array_result); |
There was a problem hiding this comment.
For each case, can you also add a case where it "just" works to avoid false positives. E.g. "aa" and int32::max/2
|
I've done a rollback to the previous code. |
7cd6b03 to
2cc9882
Compare
cpp/src/arrow/array/util.cc
Outdated
| auto size = static_cast<typename T::offset_type>(value->size()); | ||
|
|
||
| typename T::offset_type length_casted; | ||
| if (length_ > std::numeric_limits<typename T::offset_type>::max()) { |
There was a problem hiding this comment.
This check is not correct; please remove it. We only need to check whether (length of repeated string) * (number of repeats) can be represented as an offset, and I think it'd be better for that check to reside in CreateOffsetsBuffer (since then it will also be applied to list types)
cpp/src/arrow/array/util.cc
Outdated
|
|
||
| typename T::offset_type total_size; | ||
| if (MultiplyWithOverflow(size, length_casted, &total_size)) { | ||
| if (MultiplyWithOverflow(size, static_cast<typename T::offset_type>(length_), |
There was a problem hiding this comment.
I think in order to make this correct, this multiplication must be moved into CreateOffsetsBuffer.
| OffsetType total_size; | ||
| if (MultiplyWithOverflow(value_length, static_cast<OffsetType>(length_), | ||
| &total_size)) { | ||
| return Status::Invalid("offset overflow in repeated array construction"); | ||
| } | ||
|
|
||
| if (length_ > std::numeric_limits<OffsetType>::max()) { | ||
| return Status::Invalid( | ||
| "length exceeds the maximum value of offset_type: ", std::to_string(length_), | ||
| " is greater than ", std::to_string(std::numeric_limits<OffsetType>::max())); |
There was a problem hiding this comment.
| OffsetType total_size; | |
| if (MultiplyWithOverflow(value_length, static_cast<OffsetType>(length_), | |
| &total_size)) { | |
| return Status::Invalid("offset overflow in repeated array construction"); | |
| } | |
| if (length_ > std::numeric_limits<OffsetType>::max()) { | |
| return Status::Invalid( | |
| "length exceeds the maximum value of offset_type: ", std::to_string(length_), | |
| " is greater than ", std::to_string(std::numeric_limits<OffsetType>::max())); | |
| int64_t total_size; | |
| if (MultiplyWithOverflow(static_cast<int64_t>(value_length), length_, &total_size) || | |
| total_size > std::numeric_limits<OffsetType>::max()) { | |
| return Status::Invalid( | |
| "length exceeds the maximum value of offset_type: ", std::to_string(total_size), | |
| " is greater than ", std::to_string(std::numeric_limits<OffsetType>::max())); |
|
Thank you for your contribution. Unfortunately, this |
|
Hello, I don't have enough time to work on this issue/PR at the moment. |
Rationale for this change
As mentioned in the issue, the
Repeatfunction does not currently check for overflow when multiplying the repeat count by the string size.What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.