Skip to content

GH-36388: [C++][Array] Add Overflow check for string Repeat#38504

Closed
llama90 wants to merge 7 commits intoapache:mainfrom
llama90:bugfix/check-repeat-length
Closed

GH-36388: [C++][Array] Add Overflow check for string Repeat#38504
llama90 wants to merge 7 commits intoapache:mainfrom
llama90:bugfix/check-repeat-length

Conversation

@llama90
Copy link
Copy Markdown
Contributor

@llama90 llama90 commented Oct 29, 2023

Rationale for this change

As mentioned in the issue, the Repeat function does not currently check for overflow when multiplying the repeat count by the string size.

What changes are included in this PR?

  • Added check logic for overflow.
  • Added unit test for overflow scenarios.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #36388 has been automatically assigned in GitHub to PR creator.

@llama90
Copy link
Copy Markdown
Contributor Author

llama90 commented Oct 31, 2023

@AlenkaF Hello. Is it appropriate to fix the part you mentioned as an issue like this?

If you have time, I would appreciate it if you could leave a review.

Copy link
Copy Markdown
Contributor

@js8544 js8544 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! The implementation looks good in general. I left a few comments.

auto scalar = std::make_shared<StringScalar>("aa");

// Use a length that will cause an overflow when multiplied by the size of the string
int64_t length = static_cast<int64_t>(std::numeric_limits<int64_t>::max()) / 2 + 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably test StringArray with int32_t::max and LargeStringArray with int64_t::max

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 8, 2023
@llama90
Copy link
Copy Markdown
Contributor Author

llama90 commented Nov 8, 2023

@js8544 Thank you for your review!! I'll take your advice and try to improve it.

@llama90
Copy link
Copy Markdown
Contributor Author

llama90 commented Nov 9, 2023

@js8544 Hello.

I've done as you mentioned. I've now utilized typename T::offset_type to accommodate type-specific maximum lengths. Additionally, I've included tests for String and LargeString to ensure verification.

Furthermore, I've put in place checks for when the length exceeds the maximum allowed size or is negative. Do you think these checks could be excessive?

Thank you for your insightful review. It has certainly helped in refining my approach and making sure all bases are covered. Your guidance is greatly appreciated!

@llama90 llama90 requested a review from js8544 November 10, 2023 00:08
Copy link
Copy Markdown
Contributor

@js8544 js8544 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM now with a couple of nits.
cc @pitrou @bkietz for comments

return Status::Invalid(
"length exceeds the maximum value of offset_type: " + std::to_string(length_) +
" is greater than " +
std::to_string(std::numeric_limits<typename T::offset_type>::max()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem having this check here.
BTW You can just use Status::Invalid("some message ", length_, " some message ", number) without using to_string and +

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, what if the input scalar an empty string ""? Can the length be larger than int32::max?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be larger than in32::max if the string were empty. We should only apply this check to the product of (length of repeated string) * (number of repeats).

auto array_result = MakeArrayFromScalar(*scalar, length);

std::string err_msg = "offset overflow in repeated array construction";
EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr(err_msg), array_result);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each case, can you also add a case where it "just" works to avoid false positives. E.g. "aa" and int32::max/2

@llama90 llama90 requested a review from js8544 November 11, 2023 09:45
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Nov 13, 2023
@llama90 llama90 requested a review from bkietz November 13, 2023 23:44
@llama90
Copy link
Copy Markdown
Contributor Author

llama90 commented Nov 14, 2023

I have found that preventing empty strings affects the other tests. I will make the necessary corrections ASAP.

I've done a rollback to the previous code.

@llama90 llama90 force-pushed the bugfix/check-repeat-length branch from 7cd6b03 to 2cc9882 Compare November 14, 2023 04:28
auto size = static_cast<typename T::offset_type>(value->size());

typename T::offset_type length_casted;
if (length_ > std::numeric_limits<typename T::offset_type>::max()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not correct; please remove it. We only need to check whether (length of repeated string) * (number of repeats) can be represented as an offset, and I think it'd be better for that check to reside in CreateOffsetsBuffer (since then it will also be applied to list types)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 14, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 15, 2023
@llama90 llama90 requested a review from bkietz November 15, 2023 00:20

typename T::offset_type total_size;
if (MultiplyWithOverflow(size, length_casted, &total_size)) {
if (MultiplyWithOverflow(size, static_cast<typename T::offset_type>(length_),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in order to make this correct, this multiplication must be moved into CreateOffsetsBuffer.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 15, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 15, 2023
@llama90 llama90 requested a review from bkietz November 15, 2023 15:44
Comment on lines +851 to +860
OffsetType total_size;
if (MultiplyWithOverflow(value_length, static_cast<OffsetType>(length_),
&total_size)) {
return Status::Invalid("offset overflow in repeated array construction");
}

if (length_ > std::numeric_limits<OffsetType>::max()) {
return Status::Invalid(
"length exceeds the maximum value of offset_type: ", std::to_string(length_),
" is greater than ", std::to_string(std::numeric_limits<OffsetType>::max()));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
OffsetType total_size;
if (MultiplyWithOverflow(value_length, static_cast<OffsetType>(length_),
&total_size)) {
return Status::Invalid("offset overflow in repeated array construction");
}
if (length_ > std::numeric_limits<OffsetType>::max()) {
return Status::Invalid(
"length exceeds the maximum value of offset_type: ", std::to_string(length_),
" is greater than ", std::to_string(std::numeric_limits<OffsetType>::max()));
int64_t total_size;
if (MultiplyWithOverflow(static_cast<int64_t>(value_length), length_, &total_size) ||
total_size > std::numeric_limits<OffsetType>::max()) {
return Status::Invalid(
"length exceeds the maximum value of offset_type: ", std::to_string(total_size),
" is greater than ", std::to_string(std::numeric_limits<OffsetType>::max()));

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting change review Awaiting change review labels Nov 15, 2023
@llama90 llama90 requested a review from bkietz November 15, 2023 16:36
@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@thisisnic
Copy link
Copy Markdown
Member

Thank you for your contribution. Unfortunately, this
pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label
or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you
do not have repository permissions to reopen the PR, please tag a maintainer.

@llama90
Copy link
Copy Markdown
Contributor Author

llama90 commented Nov 30, 2025

Hello, I don't have enough time to work on this issue/PR at the moment.
So I will close this PR. If anyone is interested, please feel free to continue the work.
Thank you.

@llama90 llama90 closed this Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting changes Awaiting changes awaiting review Awaiting review Component: C++ Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python][C++] pyarrow.repeat returns an invalid array when a chunked array is required.

4 participants