[REVIEW] List parquet writer support #6075
Conversation
Replace hardcoded page.num_vals in encode page headers with temp fix
Fix for ColumnMetadata's path_in_schema
Please update the changelog in order to start CI tests. View the gpuCI docs here.

@devavret please retarget to
vuule
left a comment
Overall structure looks good and there's a lot of readable new code 👍
Left a lot of more detail-oriented feedback below.
```cpp
  EXPECT_EQ(expected_metadata.column_names, result.metadata.column_names);
}

TEST_F(ParquetWriterTest, ListColumn)
```
do we need tests for other types?
There's a string type in there as well. I can add a double type too.
I left it as a question since I'm not sure it would actually add code coverage. Feel free to ignore if it does not.
Strongly recommend a suite of tests that:
- Nest at least 3 levels deep
- Contain nulls at all 3 levels, especially in combination
- Empty lists all over the place
On the reader end, this is where most of the truly fun bugs were.
Edit: I see you've got a bunch of this below.
Additionally, on the writer, there are additional corner-cases for batch_list.size() != 1 that only show up if the dataset is greater than 1GB (writer splits the writes in max_bytes_in_batch=1GB chunks to minimize peak memory utilization).
Might need to make this a parameter in the cpp interface or actually generate datasets > 1GB for coverage.
@OlivierNV, the batch processing functionality seems to overlap chunked writing. Should batched processing be taken out? Chunked write gives the user more control over the partitions of the table they want to write at a time. @nvdbaranec is that a reasonable assumption?
I think we need both. Chunked writing really isn't an official parquet feature. We just support it as a handy tool.
Another idea then is to internally use chunked writing for large tables. We'd at least reduce logic duplication.
@OlivierNV The batch list logic seems to partition row-groups into batches, right? Even with the list writing changes, the list data for a row doesn't span multiple row groups. In fact, even though it is allowed by parquet spec, row data does not span across page boundaries. (This is for convenience and compatibility with v2). Are there any other factors affecting this?
The high-level batch list logic groups multiple rowgroups together, mainly there to limit the peak memory utilization (assumption is that gpu is fully utilized at >= 1GB, so the splitting avoids having post-rle and post-compression buffers that are as large as the dataframe itself).
It might be possible to implement it by just going through the chunked writing logic (I don't see a downside OTOH, but it's been a while)
[Edit] Oh, one downside would be that in the chunked_write path, you'd have to preemptively know the rough rowgroup boundaries prior to having access to early dictionary stats (maybe not a big deal if it's only for huge tables with many rowgroups per batch).
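The 1GB batching discussed above can be sketched roughly like this. This is a hypothetical pure-Python illustration (the helper name and shape are assumptions; the real logic lives in writer_impl.cu and may differ):

```python
# Hypothetical sketch of the batching idea described above, NOT the
# actual writer_impl.cu logic: group consecutive rowgroups into batches
# whose combined size stays under max_bytes_in_batch, so the
# post-RLE / post-compression scratch buffers are bounded per batch.
def build_batches(rowgroup_sizes, max_bytes_in_batch=1 << 30):
    batches, current, current_bytes = [], [], 0
    for idx, size in enumerate(rowgroup_sizes):
        if current and current_bytes + size > max_bytes_in_batch:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Three 400 MiB rowgroups fit under the 1 GiB cap as two batches:
# build_batches([400 << 20] * 3) -> [[0, 1], [2]]
```

An oversized single rowgroup still becomes its own batch, which matches the idea that batching only bounds peak memory rather than splitting rowgroups.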
nvdbaranec
left a comment
First pass. Still have more to review.
cpp/src/io/parquet/writer_impl.cu (outdated)

```cpp
rmm::device_vector<size_type const *> _offsets_array;
rmm::device_vector<size_type> _dremel_offsets;
rmm::device_vector<uint32_t> _rep_level;
rmm::device_vector<uint32_t> _def_level;
```

Suggested change:

```cpp
// O(num leaf values)
rmm::device_vector<uint32_t> _def_level;
```
Same comment as above. Might make sense to make this a vector of uint8_t.
Sure, but the comment is not entirely true. It's not entirely dependent on leaf values. It also has to encode empty lists. Leaf values may be empty but this will be non-zero.
Definition levels would only be used for encoding null lists I think, but your point is correct. Maybe something like
O(num leaf values + num null lists)
As for rep_level, maybe
O(num lists)
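To make those size bounds concrete, here is a small counting helper (a hypothetical pure-Python illustration for a single level of list nesting only, not cudf code): every leaf slot contributes one level entry, and every empty or null list contributes one extra entry of its own, which is why the level arrays are not simply O(num leaf values).

```python
# Hypothetical illustration of the level-array sizing discussed above
# (single level of list nesting only): the definition/repetition level
# arrays need one entry per leaf slot plus one entry per empty or null
# list, so their length differs from the leaf value count.
def level_array_len(rows):
    total = 0
    for row in rows:
        # empty and null lists still need a level entry despite
        # contributing no leaf values
        total += 1 if row is None or len(row) == 0 else len(row)
    return total

# 3 leaf values, but 5 level entries (one extra each for [] and None):
# level_array_len([[1, 2], [], None, [3]]) -> 5
```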
rerun tests
vuule
left a comment
Minor suggestions, did not dig into the new algorithm.
devavret
left a comment
This still needs some python changes to fix the case where writing by cudf and reading by pandas doesn't work. Details below.
```python
gdf.to_parquet(fname)
assert os.path.exists(fname)

# TODO: This is currently unreadable by pandas. fix it
```
Pandas shows an error when reading this:

pytest error:

```
dtype = 'list'

    def pandas_dtype(dtype) -> DtypeObj:
        """
        Convert input into a pandas only dtype object or a numpy dtype object.

        Parameters
        ----------
        dtype : object to be converted

        Returns
        -------
        np.dtype or a pandas dtype

        Raises
        ------
        TypeError if not a dtype
        """
        # short-circuit
        if isinstance(dtype, np.ndarray):
            return dtype.dtype
        elif isinstance(dtype, (np.dtype, ExtensionDtype)):
            return dtype

        # registered extension types
        result = registry.find(dtype)
        if result is not None:
            return result

        # try a numpy dtype
        # raise a consistent TypeError if failed
        try:
>           npdtype = np.dtype(dtype)
E           TypeError: data type 'list' not understood
```
Triaged this down to us eventually calling pyarrow.pandas_compat.construct_metadata via duck typing, which eventually calls into str(dtype) here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131
This won't work for types like list, struct, decimal, etc. where there isn't a pandas / numpy native type for us to use.
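A minimal repro of that failure mode, independent of pyarrow (only the `np.dtype(str(dtype))` step is exercised here):

```python
import numpy as np

# The pandas_compat path effectively ends up calling
# np.dtype(str(dtype)); numpy has no dtype named "list",
# while "object" is accepted as the generic object dtype.
print(np.dtype("object"))   # object
try:
    np.dtype("list")
except TypeError as exc:
    print(exc)              # data type 'list' not understood
```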
And also,

```python
In [61]: gdf = cudf.DataFrame({'a': [[[1, 2], [3, 4]], None, [[5, 6], None]]})

In [62]: gdf.a.dtype
Out[62]: ListDtype(ListDtype(int64))

In [63]: gdf.a.dtype.to_pandas()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-c6a0a93e83a4> in <module>
----> 1 gdf.a.dtype.to_pandas()

~/miniconda3/envs/cudf_dev/lib/python3.7/site-packages/cudf-0.16.0a0+1920.gc3c0779108.dirty-py3.7-linux-x86_64.egg/cudf/core/dtypes.py in to_pandas(self)
    153
    154     def to_pandas(self):
--> 155         super().to_pandas(integer_object_nulls=True)
    156
    157     def __eq__(self, other):

AttributeError: 'super' object has no attribute 'to_pandas'
```

```python
            + "supported by the gpu accelerated parquet writer"
        )
    elif is_list_dtype(col):
        types.append(col.dtype.to_arrow())
```
This list of types is given to pyarrow to construct pandas metadata using pa.pandas_compat.construct_metadata. That creates the string:

```json
{"index_columns": [], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "list[list[int64]]", "numpy_type": "list", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.2"}
```

Notice the `"numpy_type": "list"` in columns metadata. This is invalid and it should be `"numpy_type": "object"`.
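One possible post-processing fix (a sketch, not cudf's actual code; the keys are copied from the metadata string above, the patching loop itself is an assumption): rewrite the invalid `numpy_type` before embedding the metadata in the file.

```python
import json

# Hypothetical post-processing sketch: rewrite the invalid
# "numpy_type": "list" entries in the pandas metadata to "object"
# before embedding the metadata in the parquet file.
meta = json.loads(
    '{"columns": [{"name": "a", "field_name": "a", '
    '"pandas_type": "list[list[int64]]", "numpy_type": "list", '
    '"metadata": null}]}'
)
for col in meta["columns"]:
    if col["numpy_type"] == "list":
        col["numpy_type"] = "object"

# meta["columns"][0]["numpy_type"] is now "object"
```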
python/cudf/cudf/core/dtypes.py (outdated)

```python
def __str__(self):
    return "object"
```
I'm not sure we want to do this for List dtypes. cc @shwina @brandon-b-miller
For now, it seems like the only approach that works around what PyArrow is doing here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131 and the expectations of #6075 (comment)
I'd be okay if this was a short term fix in nightlies, but given this is right before 0.16 release this is concerning to me.
Should we just stop using the arrow function directly and reimplement it as needed while additionally adding cudf specific metadata? The json constructed doesn't seem overly complex.
I encountered this building cuDF dtypes. The only workaround I could see was to construct the metadata from real_data.head(0).to_pandas().
I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.
> I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.
We try to capture Pandas metadata in the file so Pandas can read the data correctly as expected as well as round trip things like RangeIndex objects without issue. We're currently putting `list` as a `numpy_type`, which is invalid.
If we have to make it work with pandas, short of rewriting that pyarrow function I suppose we could monkey patch this change into those dtype instances at runtime and then undo it. But I suppose what I'm asking is, does roundtripping through a library that doesn't support list make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)? I think if spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error right?
> But I suppose what I'm asking is, does roundtripping through a library that doesn't support `list` make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)?
It is supported as an object column of lists for example. Pandas parquet reading / writing goes through PyArrow and in the case of List / Struct / Map columns PyArrow handles the translation to/from Pandas object columns as needed.
> I think if spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error right?
I just checked. It works. I suppose the metadata is for additional pandas specific information like which column is the index etc. But I was able to read a parquet file that had no metadata.
Add a missing typo
Adds List writing ability to parquet writer.

How I achieved this:

- Modified the `parquet_column` class to point to leaf data in case of a list column. Added members to define list structure.
- Added a `get_levels` function in page_enc.cu that takes a cudf `column_view` of list type and generates the dremel repetition and definition level values for it. Along with this, it generates row-wise offsets into these rep/def level value arrays. This is because these arrays are not the same size as the leaf data array.
- Row data does not span across page boundaries, as in the `DataPageHeaderV2` of the parquet spec. Therefore, there is no danger of a row of a list column being split across pages.