[REVIEW] List parquet writer support #6075
Conversation
Replace hardcoded page.num_vals in encode page headers with temp fix
Fix for ColumnMetadata's path_in_schema
Please update the changelog in order to start CI tests. View the gpuCI docs here.

@devavret please retarget to
vuule
left a comment
Overall structure looks good and there's a lot of readable new code 👍
Left a lot of more detail-oriented feedback below.
```cpp
  EXPECT_EQ(expected_metadata.column_names, result.metadata.column_names);
}

TEST_F(ParquetWriterTest, ListColumn)
```
do we need tests for other types?
There's a string type in there as well. I can add a double type too.
I left it as a question since I'm not sure it would actually add code coverage. Feel free to ignore if it does not.
Strongly recommend a suite of tests that:
- Nest at least 3 levels deep
- Contain nulls at all 3 levels, especially in combination
- Empty lists all over the place
On the reader end, this is where most of the truly fun bugs were.
Edit: I see you've got a bunch of this below.
Additionally, on the writer, there are additional corner-cases for batch_list.size() != 1 that only show up if the dataset is greater than 1GB (writer splits the writes in max_bytes_in_batch=1GB chunks to minimize peak memory utilization).
Might need to make this a parameter in the cpp interface or actually generate datasets > 1GB for coverage.
@OlivierNV, the batch processing functionality seems to overlap chunked writing. Should batched processing be taken out? Chunked write gives the user more control over the partitions of the table they want to write at a time. @nvdbaranec is that a reasonable assumption?
I think we need both. Chunked writing really isn't an official parquet feature. We just support it as a handy tool.
Another idea then is to internally use chunked writing for large tables. We'd at least reduce logic duplication.
@OlivierNV The batch list logic seems to partition row-groups into batches, right? Even with the list writing changes, the list data for a row doesn't span multiple row groups. In fact, even though it is allowed by parquet spec, row data does not span across page boundaries. (This is for convenience and compatibility with v2). Are there any other factors affecting this?
The high-level batch list logic groups multiple rowgroups together, mainly there to limit the peak memory utilization (assumption is that gpu is fully utilized at >= 1GB, so the splitting avoids having post-rle and post-compression buffers that are as large as the dataframe itself).
It might be possible to implement it by just going through the chunked writing logic (I don't see a downside OTOH, but it's been a while)
[Edit] Oh, one downside would be that in the chunked_write path, you'd have to preemptively know the rough rowgroup boundaries prior to having access to early dictionary stats (maybe not a big deal if it's only for huge tables with many rowgroups per batch).
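The 1GB batching discussed above can be sketched roughly like this. This is a hypothetical pure-Python illustration (the helper name and shape are assumptions; the real logic lives in writer_impl.cu and may differ):

```python
# Hypothetical sketch of the batching idea described above, NOT the
# actual writer_impl.cu logic: group consecutive rowgroups into batches
# whose combined size stays under max_bytes_in_batch, so the
# post-RLE / post-compression scratch buffers are bounded per batch.
def build_batches(rowgroup_sizes, max_bytes_in_batch=1 << 30):
    batches, current, current_bytes = [], [], 0
    for idx, size in enumerate(rowgroup_sizes):
        if current and current_bytes + size > max_bytes_in_batch:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Three 400 MiB rowgroups fit under the 1 GiB cap as two batches:
# build_batches([400 << 20] * 3) -> [[0, 1], [2]]
```

An oversized single rowgroup still becomes its own batch, which matches the idea that batching only bounds peak memory rather than splitting rowgroups.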
nvdbaranec
left a comment
First pass. Still have more to review.
cpp/src/io/parquet/writer_impl.cu (outdated)

```cpp
rmm::device_vector<size_type const *> _offsets_array;
rmm::device_vector<size_type> _dremel_offsets;
rmm::device_vector<uint32_t> _rep_level;
rmm::device_vector<uint32_t> _def_level;
```

Suggested change:

```cpp
// O(num leaf values)
rmm::device_vector<uint32_t> _def_level;
```
Same comment as above. Might make sense to make this a vector of uint8_t.
Sure, but the comment is not entirely true. It's not entirely dependent on leaf values. It also has to encode empty lists. Leaf values may be empty but this will be non-zero.
Definition levels would only be used for encoding null lists I think, but your point is correct. Maybe something like
O(num leaf values + num null lists)
As for rep_level, maybe
O(num lists)
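To make those size bounds concrete, here is a small counting helper (a hypothetical pure-Python illustration for a single level of list nesting only, not cudf code): every leaf slot contributes one level entry, and every empty or null list contributes one extra entry of its own, which is why the level arrays are not simply O(num leaf values).

```python
# Hypothetical illustration of the level-array sizing discussed above
# (single level of list nesting only): the definition/repetition level
# arrays need one entry per leaf slot plus one entry per empty or null
# list, so their length differs from the leaf value count.
def level_array_len(rows):
    total = 0
    for row in rows:
        # empty and null lists still need a level entry despite
        # contributing no leaf values
        total += 1 if row is None or len(row) == 0 else len(row)
    return total

# 3 leaf values, but 5 level entries (one extra each for [] and None):
# level_array_len([[1, 2], [], None, [3]]) -> 5
```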
rerun tests
vuule
left a comment
Minor suggestions, did not dig into the new algorithm.
devavret
left a comment
This still needs some python changes to fix the case where writing by cudf and reading by pandas doesn't work. Details below.
```python
gdf.to_parquet(fname)
assert os.path.exists(fname)

# TODO: This is currently unreadable by pandas. fix it
```
Pandas shows an error when reading this:

pytest error:

```
dtype = 'list'

    def pandas_dtype(dtype) -> DtypeObj:
        """
        Convert input into a pandas only dtype object or a numpy dtype object.

        Parameters
        ----------
        dtype : object to be converted

        Returns
        -------
        np.dtype or a pandas dtype

        Raises
        ------
        TypeError if not a dtype
        """
        # short-circuit
        if isinstance(dtype, np.ndarray):
            return dtype.dtype
        elif isinstance(dtype, (np.dtype, ExtensionDtype)):
            return dtype

        # registered extension types
        result = registry.find(dtype)
        if result is not None:
            return result

        # try a numpy dtype
        # raise a consistent TypeError if failed
        try:
>           npdtype = np.dtype(dtype)
E           TypeError: data type 'list' not understood
```
Triaged this down to us eventually calling pyarrow.pandas_compat.construct_metadata via duck typing, which eventually calls into str(dtype) here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131
This won't work for types like list, struct, decimal, etc. where there isn't a pandas / numpy native type for us to use.
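A minimal repro of that failure mode, independent of pyarrow (only the `np.dtype(str(dtype))` step is exercised here):

```python
import numpy as np

# The pandas_compat path effectively ends up calling
# np.dtype(str(dtype)); numpy has no dtype named "list",
# while "object" is accepted as the generic object dtype.
print(np.dtype("object"))   # object
try:
    np.dtype("list")
except TypeError as exc:
    print(exc)              # data type 'list' not understood
```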
And also,

```python
In [61]: gdf = cudf.DataFrame({'a': [[[1, 2], [3, 4]], None, [[5, 6], None]]})

In [62]: gdf.a.dtype
Out[62]: ListDtype(ListDtype(int64))

In [63]: gdf.a.dtype.to_pandas()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-c6a0a93e83a4> in <module>
----> 1 gdf.a.dtype.to_pandas()

~/miniconda3/envs/cudf_dev/lib/python3.7/site-packages/cudf-0.16.0a0+1920.gc3c0779108.dirty-py3.7-linux-x86_64.egg/cudf/core/dtypes.py in to_pandas(self)
    153
    154     def to_pandas(self):
--> 155         super().to_pandas(integer_object_nulls=True)
    156
    157     def __eq__(self, other):

AttributeError: 'super' object has no attribute 'to_pandas'
```

```python
            + "supported by the gpu accelerated parquet writer"
        )
    elif is_list_dtype(col):
        types.append(col.dtype.to_arrow())
```
This list of types is given to pyarrow to construct pandas metadata using pa.pandas_compat.construct_metadata. That creates the string:

```json
{"index_columns": [], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "list[list[int64]]", "numpy_type": "list", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.2"}
```

Notice the `"numpy_type": "list"` in columns metadata. This is invalid and it should be `"numpy_type": "object"`.
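One possible post-processing fix (a sketch, not cudf's actual code; the keys are copied from the metadata string above, the patching loop itself is an assumption): rewrite the invalid `numpy_type` before embedding the metadata in the file.

```python
import json

# Hypothetical post-processing sketch: rewrite the invalid
# "numpy_type": "list" entries in the pandas metadata to "object"
# before embedding the metadata in the parquet file.
meta = json.loads(
    '{"columns": [{"name": "a", "field_name": "a", '
    '"pandas_type": "list[list[int64]]", "numpy_type": "list", '
    '"metadata": null}]}'
)
for col in meta["columns"]:
    if col["numpy_type"] == "list":
        col["numpy_type"] = "object"

# meta["columns"][0]["numpy_type"] is now "object"
```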
python/cudf/cudf/core/dtypes.py (outdated)

```python
def __str__(self):
    return "object"
```
I'm not sure we want to do this for List dtypes. cc @shwina @brandon-b-miller
For now, it seems like the only approach that works around what PyArrow is doing here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131 and the expectations of #6075 (comment)
I'd be okay if this was a short term fix in nightlies, but given this is right before 0.16 release this is concerning to me.
Should we just stop using the arrow function directly and reimplement it as needed while additionally adding cudf specific metadata? The json constructed doesn't seem overly complex.
I encountered this building cuDF dtypes. The only workaround I could see was to construct the metadata from real_data.head(0).to_pandas().
I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.
> I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.
We try to capture Pandas metadata in the file so Pandas can read the data correctly as expected as well as round trip things like RangeIndex objects without issue. We're currently putting `list` as a `numpy_type`, which is invalid.
If we have to make it work with pandas, short of rewriting that pyarrow function I suppose we could monkey patch this change into those dtype instances at runtime and then undo it. But I suppose what I'm asking is, does roundtripping through a library that doesn't support list make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)? I think if spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error right?
> But I suppose what I'm asking is, does roundtripping through a library that doesn't support `list` make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)?
It is supported as an object column of lists for example. Pandas parquet reading / writing goes through PyArrow and in the case of List / Struct / Map columns PyArrow handles the translation to/from Pandas object columns as needed.
> I think if spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error right?
I just checked. It works. I suppose the metadata is for additional pandas specific information like which column is the index etc. But I was able to read a parquet file that had no metadata.
Add a missing typo
Adds List writing ability to parquet writer.

How I achieved this:

- Modified the `parquet_column` class to point to leaf data in case of a list column. Added members to define list structure.
- Added a `get_levels` function in page_enc.cu that takes a cudf `column_view` of list type and generates the dremel repetition and definition level values for it. Along with this, it generates row-wise offsets into these rep/def level value arrays. This is because these arrays are not the same size as the leaf data array.
- Row data does not span across page boundaries, as in the `DataPageHeaderV2` of the parquet spec. Therefore, there is no danger of a row of a list column being split across pages.