Skip to content

feat(transform): Add FormatFromParquet Transform#302

Merged
andrew-kline merged 1 commit intobrexhq:mainfrom
jshlbrd:jshlbrd/feat/fmt-from-parquet
Jan 3, 2026
Merged

feat(transform): Add FormatFromParquet Transform#302
andrew-kline merged 1 commit intobrexhq:mainfrom
jshlbrd:jshlbrd/feat/fmt-from-parquet

Conversation

@jshlbrd
Copy link
Contributor

@jshlbrd jshlbrd commented Jun 17, 2025

Description

  • Adds a transform for reading Parquet files into messages

Motivation and Context

This is similar to the formatFromZip transform in that a Parquet file is made up of rows which are converted to Substation's internal message format. This works out of the box with existing applications since Parquet is not a text-based file format.

How Has This Been Tested?

Added unit tests and an example config.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@jshlbrd jshlbrd marked this pull request as ready for review June 23, 2025 15:44
@jshlbrd jshlbrd requested a review from a team as a code owner June 23, 2025 15:44
rows := make([]any, reader.NumRows())
for {
if n, err := reader.Read(rows); err != nil {
if err.Error() == "EOF" || n == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If err is not nil and not EOF, should it break or return an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this returns the error (line 66) if the error is not EOF or skips the file if it's empty (mimics the behavior here).

reader := parquet.NewGenericReader[any](fi)
defer reader.Close()

rows := make([]any, reader.NumRows())
Copy link
Contributor

@Mallika05 Mallika05 Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add a batchSize here to be careful with larger files and memory issues (or with scalability in mind)?
The Read() function seems like it can accept a chunk size rather than processing all rows together 🤔
https://deepwiki.com/parquet-go/parquet-go/2-reading-parquet-files?utm_source=chatgpt.com#:~:text=Person%2C%20100)-,//%20Read%20in%20batches,-for%20%7B%0A%20%20%20%20n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same here as other file readers -- there's no protection from reading large files. I haven't observed this as an issue in production deployments but that issue could exist in multiple places throughout the source code, so it may be worth a refactor later if there's interest in it.

Copy link
Contributor

@Mallika05 Mallika05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple questions to clarify!

@andrew-kline andrew-kline merged commit 64928e1 into brexhq:main Jan 3, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants