Skip to content

GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813

Open
abtom87 wants to merge 2 commits intoapache:mainfrom
abtom87:fix-overflow-in-string-func
Open

GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813
abtom87 wants to merge 2 commits intoapache:mainfrom
abtom87:fix-overflow-in-string-func

Conversation

@abtom87
Copy link
Copy Markdown

@abtom87 abtom87 commented Apr 20, 2026

Rationale for this change

Fix the overflow in functions where strings are used

What changes are included in this PR?

Fixes overflow and handles negative lengths

Are these changes tested?

Yes, modified relevant google tests and ran them locally.

Are there any user-facing changes?

No

This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes potential integer-overflow/invalid-length issues in Gandiva string functions by adding overflow-checked allocation sizing and expanding unit tests to cover extreme and negative lengths.

Changes:

  • Add overflow-checked allocation calculations for quote_utf8 and to_hex_binary.
  • Introduce a shared allocation-length helper for gdv_fn_lower_utf8, gdv_fn_upper_utf8, and gdv_fn_initcap_utf8, including negative-length rejection.
  • Add/extend GoogleTest coverage for overflow and negative-length scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
cpp/src/gandiva/precompiled/string_ops.cc Adds overflow-checked allocation sizing for quote_utf8 and to_hex_binary.
cpp/src/gandiva/precompiled/string_ops_test.cc Adds overflow-focused tests for quote_utf8 and to_hex_binary.
cpp/src/gandiva/gdv_string_function_stubs.cc Adds shared allocation-length helper and overflow checks; adjusts some integer types in translate.
cpp/src/gandiva/gdv_function_stubs_test.cc Adds tests for max-length overflow and negative-length handling in upper/lower/initcap and translate.
Comments suppressed due to low confidence (1)

cpp/src/gandiva/gdv_string_function_stubs.cc:792

  • In the multibyte path, the from_for <= from_len loop reads from[from_for] (to compute len_char_from) before checking from_for == from_len. When from_for reaches from_len (the sentinel iteration used to handle the "not found" case), this does a 1-past-the-end read. Reorder the loop so the from_for == from_len case is handled before indexing into from (or change the loop bounds to avoid the sentinel iteration and handle the fallback after the loop).
        for (int32_t from_for = 0; from_for <= from_len; from_for += len_char_from) {
          // Updating len to char in this position
          len_char_from = gdv_fn_utf8_char_length(from[from_for]);
          // Making copy to std::string with length for this char position
          std::string copy_from_compare(from + from_for, len_char_from);
          if (from_for == from_len) {
            // If it's not in the FROM list, just add it to the map and the result.
            std::string insert_copy_value(in + in_for, len_char_in);
            // Insert in map to next loops

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/src/gandiva/precompiled/string_ops_test.cc Outdated
int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20;
out_str = to_hex_binary(ctx_ptr, binary_string, bad_in_len, &out_len);
EXPECT_EQ(out_len, 0);
EXPECT_EQ(out_str, "");
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions compare const char* pointers (EXPECT_EQ(out_str, "")) rather than string contents, which can be non-deterministic. Use EXPECT_STREQ(out_str, "") or compare std::string(out_str, out_len) instead, and consider asserting/resetting the context error since the overflow path sets an error message.

Suggested change
EXPECT_EQ(out_str, "");
EXPECT_STREQ(out_str, "");
EXPECT_TRUE(ctx.has_error());
ctx.Reset();

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
} else { // If there are no multibytes in the input, work with std::strings
} else {
// If there are no multibytes in the input, work with std::strings
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment in the has_multi_byte branch is inverted: this else executes when multibyte UTF-8 is present, but the comment says "If there are no multibytes". Please correct the comment to match the control flow to avoid confusion during future maintenance.

Suggested change
// If there are no multibytes in the input, work with std::strings
// If there are multibytes in the input, work with std::strings

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +1928 to 1940
int32_t alloc_length = 0;
// Check overflow: 2 * in_len
if (ARROW_PREDICT_FALSE(
arrow::internal::MultiplyWithOverflow(2, in_len, &alloc_length))) {
gdv_fn_context_set_error_msg(context, "Would overflow maximum output size");
*out_len = 0;
return "";
}

// try to allocate double size output string (worst case)
auto out =
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, (in_len * 2) + 2));
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 2));
if (out == nullptr) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quote_utf8 checks overflow for 2 * in_len, but the allocation size is alloc_length + 2. For boundary values (e.g., in_len == INT32_MAX/2), 2 * in_len fits in int32_t while + 2 overflows, which can lead to a negative/incorrect allocation size being passed to gdv_fn_context_arena_malloc. Please include the + 2 in the overflow-safe size computation (or separately check that alloc_length <= INT32_MAX - 2).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed where/how?

Copy link
Copy Markdown
Author

@abtom87 abtom87 Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines check for the suggested overflow. using the AddwithOverflow method.

Comment on lines 2837 to 2853
auto ret =
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, text_len * 2 + 1));
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 1));

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_hex_binary only special-cases text_len == 0. If text_len is negative, the overflow check may not trigger and alloc_length can become negative, which then gets passed to gdv_fn_context_arena_malloc (takes an int32_t size) and can result in invalid/huge allocations or crashes. Treat negative lengths as invalid (or at least as empty) by changing the guard to text_len <= 0 (and ideally setting an error for negative input, consistent with other functions).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch 3 times, most recently from d0280e4 to 8bb980f Compare April 23, 2026 03:45
out_str = to_hex_binary(ctx_ptr, binary_string, neg_in_len, &out_len);
EXPECT_EQ(out_len, 0);
EXPECT_STREQ(out_str, "");
ctx.Reset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a couple more boundary tests here just to make the hardening feel more complete? I’m thinking of cases like quote_utf8 right around the INT32_MAX / 2 boundary, substring_index with negative lengths, and maybe one concat_ws_* case where the combined output size would overflow.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I will add more tests to cover these, later today.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see the commit history in pull request - easier to review diffs that way.
You found few issues in the implementation via these new tests, correct?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmitry-chirkov-dremio Done! Could you please check? I think the concat_ws* function needs to be reimplemented using variadic template or so, a lot of redundant code can be avoided.

@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from 8bb980f to 0ccb5ee Compare April 23, 2026 17:30
return "";
}

int32_t temp = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also 3, 4 and 5 argument versions of concat_ws below

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they have been handled, as well. Perhaps a better way might be to approach them using variadic template or so.

@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from 0ccb5ee to f02d531 Compare April 23, 2026 17:54
EXPECT_EQ(std::string(out_str, out_len), "'\\'\\'\\'\\'\\'\\'\\'\\'\\''");
EXPECT_FALSE(ctx.has_error());

int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 1? Might need to have UndefinedBehaviorSanitizer to actually catch the issue 🤔

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that worked, I can try adding that too. What did not work though is (max/2) -1. The logic should allocate that memory, but i got a segfault, probably because the size was too huge.

Edit: added.

@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from f02d531 to e7ff043 Compare April 23, 2026 17:58
…ction_stubs

Fixes potential integer-overflow/invalid-length issues in
gdv_string_function_stubs.cc. Expanded unit-tests.

Incorporated review comments.
@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from e7ff043 to be80053 Compare April 23, 2026 18:48
Fixes potential integer-overflow/invalid-length issues in
string_ops.cc. Expanded unit-tests for concat_ws*.

Incorporated review comments.
@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from be80053 to 5713446 Compare April 23, 2026 18:56
Copy link
Copy Markdown
Author

@abtom87 abtom87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added suggested changes

EXPECT_EQ(std::string(out_str, out_len), "'\\'\\'\\'\\'\\'\\'\\'\\'\\''");
EXPECT_FALSE(ctx.has_error());

int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20;
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that worked, I can try adding that too. What did not work though is (max/2) -1. The logic should allocate that memory, but i got a segfault, probably because the size was too huge.

Edit: added.

return "";
}

int32_t temp = 0;
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they have been handled, as well. Perhaps a better way might be to approach them using variadic template or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants