[REVIEW] List parquet writer support#6075

Merged
kkraus14 merged 55 commits into rapidsai:branch-0.16 from devavret:list-parqet-write
Oct 7, 2020

Conversation

@devavret
Contributor

@devavret devavret commented Aug 24, 2020

Adds list writing ability to the Parquet writer.
How I achieved this:

  1. Changed the parquet_column class to point to leaf data in the case of a list column. Added members to define the list structure.
  2. Added a standalone get_levels function in page_enc.cu that takes a cudf column_view of list type and generates the Dremel repetition and definition level values for it. Along with this, it generates row-wise offsets into the rep/def level value arrays. This is needed because these arrays are not the same size as the leaf data array.
  3. In initFragments, instead of parsing a fixed number of values to find out their size, it now parses a fixed number of rows.
  4. Created a distinction between page.num_rows and page.num_values, as num_values can be greater for list columns. Row data cannot span multiple pages, in compliance with the DataPageHeaderV2 section of the Parquet spec. Therefore, there is no danger of a row of a list column being split across pages.
  5. Changed pageEncode to RLE-encode pre-processed rep/def levels as opposed to calculating def levels on the fly. This change applies only to list columns; flat columns continue to generate def levels on the fly.
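For illustration, the Dremel level generation in step 2 can be modeled on the host like this (a minimal Python sketch, assuming a single-level list column whose lists and leaf values are both nullable; the actual implementation lives in page_enc.cu and runs on the GPU):

```python
def dremel_levels(rows):
    """Compute repetition/definition levels for a one-level nullable
    list<int> column. Max definition level is 3:
      0 = null list, 1 = empty list, 2 = null value, 3 = present value.
    Repetition level is 0 at the start of each row, 1 within a list."""
    rep, defn, leaf = [], [], []
    for row in rows:
        if row is None:
            rep.append(0); defn.append(0)
        elif len(row) == 0:
            rep.append(0); defn.append(1)
        else:
            for i, value in enumerate(row):
                rep.append(0 if i == 0 else 1)
                if value is None:
                    defn.append(2)
                else:
                    defn.append(3)
                    leaf.append(value)
    return rep, defn, leaf

# Rep/def arrays get one entry per leaf slot plus one per null/empty list,
# so they are longer than the leaf data; this is why separate row-wise
# offsets into them must be generated.
rep, defn, leaf = dremel_levels([[1, 2], None, [], [3, None]])
```

Here the rep/def arrays have six entries while only three leaf values exist, which motivates the separate per-row offset array described above.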

@devavret devavret requested a review from a team as a code owner August 24, 2020 16:54
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@kkraus14
Collaborator

@devavret please retarget to branch-0.16.

@devavret devavret changed the base branch from branch-0.15 to branch-0.16 August 24, 2020 17:13
Contributor

@vuule vuule left a comment

Overall structure looks good and there's a lot of readable new code 👍
I've left a lot of detail-oriented feedback below.

EXPECT_EQ(expected_metadata.column_names, result.metadata.column_names);
}

TEST_F(ParquetWriterTest, ListColumn)
Contributor

do we need tests for other types?

Contributor Author

There's a string type in there as well. I can add a double type too.

Contributor

I left it as a question since I'm not sure it would actually add code coverage. Feel free to ignore if it does not.

Contributor

Strongly recommend a suite of tests that:

  • Nest at least 3 levels deep
  • Contain nulls at all 3 levels, especially in combination
  • Empty lists all over the place

On the reader end, this is where most of the truly fun bugs were.
Edit: I see you've got a bunch of this below.

Contributor

Additionally, on the writer, there are additional corner cases for batch_list.size() != 1 that only show up if the dataset is greater than 1GB (the writer splits writes into max_bytes_in_batch=1GB chunks to minimize peak memory utilization).
Might need to make this a parameter in the C++ interface or actually generate datasets > 1GB for coverage.

Contributor Author

@OlivierNV, the batch processing functionality seems to overlap with chunked writing. Should batch processing be taken out? Chunked write gives the user more control over the partitions of the table they want to write at a time. @nvdbaranec, is that a reasonable assumption?

Contributor

I think we need both. Chunked writing really isn't an official parquet feature. We just support it as a handy tool.

Contributor Author

Another idea then is to internally use chunked writing for large tables. We'd at least reduce logic duplication.

Contributor Author

@OlivierNV The batch list logic seems to partition row-groups into batches, right? Even with the list writing changes, the list data for a row doesn't span multiple row groups. In fact, even though it is allowed by the Parquet spec, row data does not span page boundaries (this is for convenience and compatibility with V2). Are there any other factors affecting this?

Contributor

@OlivierNV OlivierNV Sep 10, 2020

The high-level batch list logic groups multiple rowgroups together, mainly to limit peak memory utilization (the assumption is that the GPU is fully utilized at >= 1GB, so the splitting avoids having post-RLE and post-compression buffers that are as large as the dataframe itself).
It might be possible to implement it by just going through the chunked writing logic (I don't see a downside offhand, but it's been a while).
[Edit] Oh, one downside would be that in the chunked_write path you'd have to preemptively know the rough rowgroup boundaries prior to having access to early dictionary stats (maybe not a big deal if it's only for huge tables with many rowgroups per batch).
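That batching idea can be sketched roughly like this (a hypothetical Python model, not the writer's actual code; only the max_bytes_in_batch name comes from the discussion above):

```python
def batch_rowgroups(rowgroup_sizes, max_bytes_in_batch=1 << 30):
    """Greedily group consecutive rowgroups into batches whose total
    size stays under max_bytes_in_batch, so that post-RLE and
    post-compression scratch buffers are bounded per batch rather
    than growing with the whole dataframe."""
    batches, current, current_bytes = [], [], 0
    for idx, size in enumerate(rowgroup_sizes):
        if current and current_bytes + size > max_bytes_in_batch:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Three 600 MB rowgroups against a 1 GB cap: no two fit together,
# so each rowgroup lands in its own batch.
batches = batch_rowgroups([600 << 20, 600 << 20, 600 << 20])
```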

Contributor

@nvdbaranec nvdbaranec left a comment

First pass. Still have more to review.

rmm::device_vector<size_type const *> _offsets_array;
rmm::device_vector<size_type> _dremel_offsets;
rmm::device_vector<uint32_t> _rep_level;
rmm::device_vector<uint32_t> _def_level;
Contributor

Suggested change
- rmm::device_vector<uint32_t> _def_level;
+ // O(num leaf values)
+ rmm::device_vector<uint32_t> _def_level;

Same comment as above. Might make sense to make this a vector of uint8_t.

Contributor Author

Sure, but the comment is not entirely true. It's not entirely dependent on leaf values. It also has to encode empty lists. Leaf values may be empty but this will be non-zero.

Contributor

Definition levels would only be used for encoding null lists I think, but your point is correct. Maybe something like
O(num leaf values + num null lists)

As for rep_level, maybe
O(num lists)


@kkraus14 kkraus14 changed the title from [REVIEW] List parqet writer support to [REVIEW] List parquet writer support Sep 28, 2020
@devavret devavret requested review from cwharris and vuule September 29, 2020 22:55
@devavret
Contributor Author

rerun tests

Contributor

@vuule vuule left a comment

Minor suggestions, did not dig into the new algorithm.

Contributor Author

@devavret devavret left a comment

This still needs some Python changes to fix the case where a file written by cudf can't be read by pandas. Details below.

gdf.to_parquet(fname)
assert os.path.exists(fname)

# TODO: This is currently unreadable by pandas. fix it
Contributor Author

Pandas shows an error when reading this:

pytest error

dtype = 'list'
    def pandas_dtype(dtype) -> DtypeObj:
        """
        Convert input into a pandas only dtype object or a numpy dtype object.
        Parameters
        ----------
        dtype : object to be converted
        Returns
        -------
        np.dtype or a pandas dtype
        Raises
        ------
        TypeError if not a dtype
        """
        # short-circuit
        if isinstance(dtype, np.ndarray):
            return dtype.dtype
        elif isinstance(dtype, (np.dtype, ExtensionDtype)):
            return dtype
        # registered extension types
        result = registry.find(dtype)
        if result is not None:
            return result
        # try a numpy dtype
        # raise a consistent TypeError if failed
        try:
>           npdtype = np.dtype(dtype)
E           TypeError: data type 'list' not understood

Collaborator

Triaged this down to us eventually calling pyarrow.pandas_compat.construct_metadata via duck typing, which eventually calls into str(dtype) here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131

This won't work for types like list, struct, decimal, etc. where there isn't a pandas / numpy native type for us to use.

Contributor Author

And also,

In [61]: gdf = cudf.DataFrame({'a': [[[1, 2], [3, 4]], None, [[5, 6], None]]})

In [62]: gdf.a.dtype
Out[62]: ListDtype(ListDtype(int64))

In [63]: gdf.a.dtype.to_pandas()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-c6a0a93e83a4> in <module>
----> 1 gdf.a.dtype.to_pandas()

~/miniconda3/envs/cudf_dev/lib/python3.7/site-packages/cudf-0.16.0a0+1920.gc3c0779108.dirty-py3.7-linux-x86_64.egg/cudf/core/dtypes.py in to_pandas(self)
    153 
    154     def to_pandas(self):
--> 155         super().to_pandas(integer_object_nulls=True)
    156 
    157     def __eq__(self, other):

AttributeError: 'super' object has no attribute 'to_pandas'

+ "supported by the gpu accelerated parquet writer"
)
elif is_list_dtype(col):
types.append(col.dtype.to_arrow())
Contributor Author

This list of types is given to pyarrow to construct pandas metadata using pa.pandas_compat.construct_metadata. That creates the string:

{"index_columns": [], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "list[list[int64]]", "numpy_type": "list", "metadata": null}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.2"}

Notice the "numpy_type": "list" in columns metadata. This is invalid and it should be "numpy_type": "object".
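A tiny illustration of the fix direction (a hypothetical stand-in class, not cudf's actual ListDtype; pyarrow records str(dtype) as the numpy_type field, and pandas holds nested list data in object columns):

```python
# Hypothetical minimal stand-in for a list dtype, showing why
# str(dtype) must name a real numpy type for the pandas metadata.
class ListDtype:
    def __str__(self):
        # pandas stores list data in object-dtype columns
        return "object"

column_meta = {
    "name": "a",
    "field_name": "a",
    "pandas_type": "list[list[int64]]",
    "numpy_type": str(ListDtype()),  # "object", not "list"
    "metadata": None,
}
```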

Comment on lines +170 to +171
def __str__(self):
return "object"
Collaborator

I'm not sure we want to do this for List dtypes. cc @shwina @brandon-b-miller

Contributor

@shwina shwina Oct 3, 2020

For now, it seems like the only approach that works around what PyArrow is doing here: https://github.com/apache/arrow/blob/50d6252ea6b6692ad090e148d096a67605f13811/python/pyarrow/pandas_compat.py#L131 and the expectations of #6075 (comment)

Collaborator

I'd be okay if this was a short term fix in nightlies, but given this is right before 0.16 release this is concerning to me.

Should we just stop using the arrow function directly and reimplement it as needed while additionally adding cudf specific metadata? The json constructed doesn't seem overly complex.

Contributor

I encountered this building cuDF dtypes. The only workaround I could see was to construct the metadata from real_data.head(0).to_pandas().

I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.

Collaborator

I don't know much about the parquet writer overall but is this change just to pacify the pandas parquet reader, and thus the tests? If pandas doesn't support lists but these parquet files can be read by other frameworks (spark, etc) I wonder how far out of our way we should go to make it work with pandas.

We try to capture pandas metadata in the file so pandas can read the data back as expected, as well as round-trip things like RangeIndex objects without issue. We're currently putting list as a numpy_type, which is invalid.

Contributor

@brandon-b-miller brandon-b-miller Oct 5, 2020

If we have to make it work with pandas, short of rewriting that pyarrow function I suppose we could monkey-patch this change into those dtype instances at runtime and then undo it. But I suppose what I'm asking is: does roundtripping through a library that doesn't support lists make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)? I think if Spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error, right?

Collaborator

But I suppose what I'm asking is, does roundtripping through a library that doesn't support list make sense as a test for this, as opposed to some alternative (maybe pyarrow, or just cuDF itself)?

It is supported as an object column of lists for example. Pandas parquet reading / writing goes through PyArrow and in the case of List / Struct / Map columns PyArrow handles the translation to/from Pandas object columns as needed.

Contributor Author

@devavret devavret Oct 5, 2020

I think if spark wrote a parquet file that had a list column, and then pandas was used to try and read that file, it would error right?

I just checked. It works. I suppose the metadata is for additional pandas specific information like which column is the index etc. But I was able to read a parquet file that had no metadata.

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 6, 2020