@gabrielsimoes gabrielsimoes commented Jan 10, 2026

Rationale for this change

Fixes #40053

When converting Python dictionaries to PyArrow arrays, struct fields are sorted alphabetically instead of preserving the original dictionary key insertion order. Dictionaries have been guaranteed to preserve insertion order since Python 3.7, and users expect the inferred struct type to follow that order.

>>> import pyarrow as pa
>>> pa.array([{"b": 2, "a": 1}]).type
struct<a: int64, b: int64>

Expected: struct<b: int64, a: int64>

What changes are included in this PR?

Replace std::map<std::string, TypeInferrer> with std::vector<std::pair<std::string, TypeInferrer>> + std::unordered_map<std::string, size_t> in the type inference code. This follows the same pattern used in the JSON parser (cpp/src/arrow/json/parser.cc) for the same problem.
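
For readers unfamiliar with the pattern, here is a minimal standalone sketch of an insertion-ordered map built from a vector plus an index. It is not the code from this PR: FieldInferrer and OrderedFieldMap are hypothetical names, and FieldInferrer is a trivial stand-in for the real TypeInferrer. It only illustrates why the vector preserves first-seen order while the unordered_map keeps lookups at average constant time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-field TypeInferrer; it only counts values.
struct FieldInferrer {
  int values_seen = 0;
};

// Insertion-ordered map (illustrative, not the Arrow implementation).
// The vector keeps fields in first-seen order; the unordered_map maps a
// field name to its position in that vector.
class OrderedFieldMap {
 public:
  // Returns the inferrer for `name`, appending a new entry on first sight.
  // The reference may be invalidated by a later insertion that grows the vector.
  FieldInferrer& GetOrInsert(const std::string& name) {
    auto it = index_.find(name);
    if (it == index_.end()) {
      // The new entry will live at the current end of the vector.
      it = index_.emplace(name, fields_.size()).first;
      fields_.emplace_back(name, FieldInferrer{});
    }
    assert(it->second < fields_.size());
    return fields_[it->second].second;
  }

  // Field names in insertion order, unlike std::map which iterates sorted.
  std::vector<std::string> Names() const {
    std::vector<std::string> names;
    names.reserve(fields_.size());
    for (const auto& field : fields_) {
      names.push_back(field.first);
    }
    return names;
  }

 private:
  std::vector<std::pair<std::string, FieldInferrer>> fields_;
  std::unordered_map<std::string, std::size_t> index_;
};
```

Visiting the keys "b", "a", "b" with GetOrInsert leaves Names() as b, a, whereas iterating a std::map with the same keys would yield a, b.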

Are these changes tested?

Updated existing tests to verify field ordering.

Are there any user-facing changes?

Struct field order now matches dictionary key insertion order instead of being sorted alphabetically. This is a behavioral change but aligns with user expectations and Python semantics.

…struct type

When converting Python dictionaries to PyArrow arrays, struct fields
were previously sorted alphabetically due to the use of std::map.
This change preserves the original dictionary key insertion order,
which is the expected behavior since Python 3.7+ guarantees dict
ordering.

The fix replaces std::map with a vector + unordered_map combination,
following the same pattern used in Arrow's JSON parser. This maintains
O(1) lookup performance while preserving insertion order.
@github-actions
⚠️ GitHub issue #40053 has been automatically assigned in GitHub to PR creator.

@gabrielsimoes gabrielsimoes changed the title from "GH-40053: [C++][Python] Preserve dict key order when inferring struct type" to "GH-40053: [Python] Preserve dict key order when inferring struct type" Jan 10, 2026

Member

@pitrou pitrou left a comment

Thanks for the PR @gabrielsimoes ! The solution looks good in general, here are some relatively minor comments.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jan 12, 2026
@gabrielsimoes
Contributor Author

Thanks for the speedy review @pitrou; I have addressed your comments and fixed the linter/doctest CI failures.

Member

@pitrou pitrou left a comment

This looks ok to me, thank you @gabrielsimoes . I'll wait for CI to pass and then I think we can merge.

@pitrou pitrou added the "Breaking Change" label (includes a breaking change to the API) Jan 13, 2026
@pitrou pitrou merged commit cff2c52 into apache:main Jan 13, 2026
22 checks passed
@pitrou pitrou removed the "awaiting committer review" label Jan 13, 2026
@conbench-apache-arrow
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit cff2c52.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

Labels

Breaking Change (includes a breaking change to the API), Component: Python

Development

Successfully merging this pull request may close these issues.

[Python] pa.array(<pd.Series of structs>) changes field order to be sorted

2 participants