Motivation / use-case
Many users need to coerce all columns of a CSV to the same Arrow type—most commonly string() to keep raw text—when the schema is unknown or very wide.
Today the API only permits either:
- passing an explicit column_types={"colA": pa.string(), …} map, or
- letting the reader infer per-column types.
That forces callers to a) know every header in advance and b) enumerate them, which is painful for dynamic files.
The limitation was raised in ARROW-5811.
Current docs confirm no built-in way exists beyond the explicit map.
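For reference, the workaround callers are left with today can be sketched roughly as follows: read the header row first, then build the explicit map from it. This is a minimal standard-library sketch, not code from Arrow. The helper name uniform_column_types is hypothetical, and it assumes the first row holds the column names and the file uses the default comma delimiter.

```python
import csv
import io


def uniform_column_types(csv_stream, dtype):
    """Map every header name found in csv_stream to the same dtype.

    Hypothetical helper: only the first row is read, and it is
    assumed to contain the column names.
    """
    header = next(csv.reader(csv_stream))
    return {name: dtype for name in header}


# Stand-in for a real file opened with open("data.csv", newline="").
sample = io.StringIO("id,name,price\n1,apple,0.5\n")
types = uniform_column_types(sample, "string")
print(types)
```

With pyarrow, dtype would be pa.string() and the resulting dict would be passed as ConvertOptions(column_types=...). The file has to be opened once for the header and again for the actual read, which is the extra step this proposal aims to remove.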
Proposed change
Option A – sentinel entry in column_types
Honor a magic key (e.g. "*", "__default__", or a constant kWildcardColumn) inside ConvertOptions.column_types.
Lookup order in MakeConversionSchema() becomes:
- exact match in column_types
- sentinel key
- current fallback (type inference)
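The lookup order above can be modelled in a few lines of plain Python. This is only an illustrative sketch of the cascade, not the C++ implementation: the function name resolve_with_sentinel and the WILDCARD constant are hypothetical, and plain strings stand in for Arrow DataType objects.

```python
WILDCARD = "*"  # hypothetical sentinel; "__default__" is the other candidate


def resolve_with_sentinel(name, column_types, wildcard=WILDCARD):
    """Return the requested type for a column, or None to request inference."""
    if name in column_types:
        return column_types[name]      # 1. exact match
    if wildcard in column_types:
        return column_types[wildcard]  # 2. sentinel key
    return None                        # 3. fall back to type inference


column_types = {"*": "string", "id": "int64"}
for col in ["id", "name"]:
    print(col, resolve_with_sentinel(col, column_types))
```

One consequence visible in this model: a CSV whose header really contains a column named "*" cannot be given its own type without also setting the default for every other column.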
Option B – new field default_column_type
Add std::shared_ptr<DataType> default_column_type = nullptr to ConvertOptions.
If non-null, columns not listed in column_types are converted with that type.
Both approaches are backwards-compatible; Option B is explicit and avoids magic strings, while Option A is a one-line API addition.
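To make the Option B semantics concrete, here is a small Python model of the intended behaviour. It is a sketch under stated assumptions: ConvertOptionsModel is a toy stand-in for ConvertOptions (not the real class), plan_conversions is a hypothetical name for the per-column decision, and strings stand in for Arrow types. A result of None means "infer as today".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ConvertOptionsModel:
    """Toy stand-in for ConvertOptions; not the real Arrow class."""
    column_types: Dict[str, str] = field(default_factory=dict)
    default_column_type: Optional[str] = None  # proposed field (Option B)


def plan_conversions(
    column_names: List[str], opts: ConvertOptionsModel
) -> List[Tuple[str, Optional[str]]]:
    """Pick a type per column: explicit entry, then default, then inference."""
    plan = []
    for name in column_names:
        if name in opts.column_types:
            plan.append((name, opts.column_types[name]))   # explicit wins
        elif opts.default_column_type is not None:
            plan.append((name, opts.default_column_type))  # proposed default
        else:
            plan.append((name, None))                      # legacy inference
    return plan


opts = ConvertOptionsModel(column_types={"id": "int64"},
                           default_column_type="string")
print(plan_conversions(["id", "name", "price"], opts))
```

Because default_column_type defaults to None, an options object that never sets it takes the inference branch for every unlisted column, which is why existing callers would see no change.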
Python examples
import pyarrow as pa
import pyarrow.csv as pcsv

# Option A (sentinel): "*" supplies the type for every column not listed explicitly
opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
tbl = pcsv.read_csv("data.csv", convert_options=opts)

# Option B (explicit field)
opts = pcsv.ConvertOptions(
    default_column_type=pa.string(),  # NEW
    column_types={"id": pa.int64()},  # explicit override
)
tbl = pcsv.read_csv("data.csv", convert_options=opts)
Affected code (C++ path overview)
| Layer | File(s) | Change summary | Notes |
|---|---|---|---|
| Public API | cpp/src/arrow/csv/options.h | • Add std::shared_ptr<DataType> default_column_type; to struct ConvertOptions (Option B) or define static const std::string kWildcardColumn = "__default__"; (Option A). • Document the new knob in the Doxygen comment. | Keeps the setting user-visible. |
| | cpp/src/arrow/csv/options.cc | • In ConvertOptions::Defaults(), initialise opts.default_column_type = nullptr;. • Extend ConvertOptions::Validate() to raise Status::Invalid for an illegal dtype or duplicate sentinel. | Ensures default behaviour remains unchanged. |
| Core logic | cpp/src/arrow/csv/reader.cc (inside MakeConversionSchema()) | Replace the existing two-branch decision with a three-branch cascade: 1. explicit mapping → 2. default_column_type / sentinel → 3. infer type (legacy path). | ~10 LOC patch; confined to one lambda. |
| Unit tests (C++) | cpp/src/arrow/csv/options_test.cc (new) | Add three cases: • default only: every column gets that type. • default + explicit overrides: explicit wins. • default == nullptr: legacy inference. | Guards against regressions. |
| Python binding | python/pyarrow/_csv.pyx (Cython) | • Expose default_column_type keyword (accept None or DataType). • Map to/from the underlying C++ field. | Maintains PyArrow feature parity. |
| | python/pyarrow/tests/test_csv.py | Mirror the three C++ test scenarios. | Confirms binding wiring. |
| Documentation | docs/source/cpp/csv.rst, docs/source/python/csv.rst | Add one bullet and a quick example for the new option. | Makes the feature discoverable. |
| Other bindings (optional) | R, GLib, Rust wrappers | Add the field/property if those wrappers already expose ConvertOptions. | Can be staged separately. |
Build system: No CMake or Meson changes should be needed beyond registering the new test file, if one is added; the dataset/file-CSV paths automatically inherit the updated ConvertOptions.
Cross-language bindings checklist
| Language | File / area | Binding note |
|---|---|---|
| Python (pyarrow) | _csv.pyx | add default_column_type kwarg with None ⇒ nullptr |
| R (arrow::r::csv) | r/src/ | mirror the field in convert_options() constructor |
| GLib | glib/arrow-gio/csv-options.cpp | expose property default-column-type |
| Rust | arrow-csv crate | add default_column_type: Option<DataType> |
| Java / JNI | none (CSV reader lives in C++ backend) | no change |
These additions are mechanical once the C++ core is in place.
Component(s)
C++