-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-40053: [Python] Preserve dict key order when inferring struct type #48813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…struct type When converting Python dictionaries to PyArrow arrays, struct fields were previously sorted alphabetically due to the use of std::map. This change preserves the original dictionary key insertion order, which is the expected behavior since Python 3.7+ guarantees dict ordering. The fix replaces std::map with a vector + unordered_map combination, following the same pattern used in Arrow's JSON parser. This maintains O(1) lookup performance while preserving insertion order.
|
|
|
|
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @gabrielsimoes ! The solution looks good in general, here are some relatively minor comments.
|
Thanks for the speedy review @pitrou; I have addressed them and the linter/doctest CI failures. |
|
|
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok to me, thank you @gabrielsimoes . I'll wait for CI to pass and then I think we can merge.
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit cff2c52. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
Rationale for this change
Fixes #40053
When converting Python dictionaries to PyArrow arrays, struct fields are sorted alphabetically instead of preserving the original dictionary key insertion order. Since Python 3.7+, dictionaries maintain insertion order, and users expect this order to be preserved.
Expected:
struct<b: int64, a: int64>What changes are included in this PR?
Replace
std::map<std::string, TypeInferrer>withstd::vector<std::pair<std::string, TypeInferrer>>+std::unordered_map<std::string, size_t>in the type inference code. This follows the same pattern used in the JSON parser (cpp/src/arrow/json/parser.cc) for the same problem.Are these changes tested?
Updated existing tests to verify field ordering.
Are there any user-facing changes?
Struct field order now matches dictionary key insertion order instead of being sorted alphabetically. This is a behavioral change but aligns with user expectations and Python semantics.
pa.array(<pd.Series of structs>)changes field order to be sorted #40053