As great and universal as JSON is for exchanging messages, the fact that Singer tap -> target communication requires a record-oriented JSON format is a big drawback (at least for me): the per-record serialisation/deserialisation overhead becomes a real pain when processing (analytical) bulk data.
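To make the overhead concrete, here is a minimal sketch of what happens today, using only the standard library. The `RECORD` message shape follows the Singer spec; the stream name `orders`, the sample rows, and the helper names `tap_lines` and `target_rows` are made up for illustration.

```python
import json

# Hypothetical sample data standing in for a bulk extract.
rows = [{"id": i, "amount": i * 1.5} for i in range(3)]


def tap_lines(stream, rows):
    """Tap side: every row is serialised into its own JSON line."""
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})


def target_rows(lines):
    """Target side: every line is parsed back into a Python dict."""
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            yield message["record"]


lines = list(tap_lines("orders", rows))
received = list(target_rows(lines))
```

Each row costs one `json.dumps` on the tap and one `json.loads` on the target, so the work grows linearly with the row count, and the field names are repeated in every single message.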
I was wondering how much room/importance the community sees in extending the spec in this area.
My first thought would be to add the option to use Apache Arrow Inter-Process Communication (IPC). For those unfamiliar with Arrow: Arrow is a standardised columnar memory specification, and IPC is a way of transferring Arrow record batches without the need for serialisation/deserialisation. As many data storage systems are adopting Arrow Flight, there will be a lot of value in data pipelines that use Arrow as the shared data layout in every step from extraction to loading.
Let me know if this is of interest, or if more information is needed, and I will add more detail.