Adding alternatives to the exchange of record-oriented json messages #29

@Marnixvdb

As great and universal as JSON is for exchanging messages, the fact that Singer tap -> target communication requires a record-oriented JSON format is a big drawback (at least for me): every record is serialised by the tap and deserialised again by the target, and that unnecessary overhead becomes a real pain when processing (analytical) bulk data.
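To make the overhead concrete, here is a minimal sketch of the record-oriented exchange using only the standard library. The `RECORD` message shape follows the Singer spec; the helper names `write_records` and `read_records`, the `orders` stream, and the sample rows are made up for illustration.

```python
import io
import json


def write_records(stream_name, rows, out):
    """Tap side: emit one Singer-style RECORD message per row."""
    for row in rows:
        message = {"type": "RECORD", "stream": stream_name, "record": row}
        # One json.dumps call per record.
        out.write(json.dumps(message) + "\n")


def read_records(inp):
    """Target side: parse every line back into a Python dict."""
    # One json.loads call per record.
    return [json.loads(line)["record"] for line in inp if line.strip()]


# Hypothetical sample data; a real tap would stream many more rows.
rows = [{"id": i, "amount": i * 1.5} for i in range(3)]

pipe = io.StringIO()  # stands in for the stdout -> stdin pipe
write_records("orders", rows, pipe)
pipe.seek(0)
decoded = read_records(pipe)
```

Both sides do work proportional to the number of records, and the field names are repeated in every message, which is where the cost comes from on bulk data.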

I was wondering how much room the community sees for extending the spec in this area, and how important it considers this.

My first thought would be to add the option to use Apache Arrow Inter Process Communication (IPC). For those unfamiliar with Arrow: Arrow is a standardised columnar memory specification, and IPC is a way of transferring Arrow record batches without the need for per-record serialisation/deserialisation, because the data is sent in the same columnar layout it has in memory. As many data storage systems are adopting Arrow Flight, there will be a lot of value in data pipelines that use Arrow as the shared data layout in every step from extraction to loading.

Let me know if this is of interest, or if more information is needed, and I will add more detail.
