-
Notifications
You must be signed in to change notification settings - Fork 47
Description
Edit: I noticed this is somewhat related to #365 so please let me know if I should move my discussion there. Thanks!
Hi maintainers! First of all, I would love to thank you for your work bringing apache avro into the Rust ecosystem. I am truly grateful. I have been trying to adopt avro more at work, but in its current state, it is quite brittle, especially with respect to using serde with Unions that contain a null variant. I plan on contributing to fix these issues, but before then I would like some input from the maintainers with respect to what we would like to support wrt this crate.
TLDR
- There are arguably 3 ways to represent avro nullable unions in Rust, some work and some don't, but I'd like input on which ones we'd like to support or if we should take a stance on a single correct way.
- For nullable unions with >3 variants, the rusty way requires
#[serde(untagged)]to work but it fails on multiple edge cases. Would like input on which plan to address is better direction for this crate
The Rust Representation Conundrum
Given an schema of
{
"name": "MyUnion",
"type": [
"null",
"int"
]
}
There are currently 3 arguable ways to represent this with Rust types
The current avro-rs way
enum MyUnionNullable {
Null,
Int(i32),
}
Advantages
This aligns closest to how avro binary encoding works and is simplest to serialize. Use the variant index from serde to properly encode the branch index then encode the contained value
Disadvantages
This is not "rusty", as null or missing values are generally represented as Option<T>, and by doing this you lose the nice semantics of Option.
This also isn't the proper avro json encoding if you choose to serialize it with serde_json.
According to the avro specification,
The value of a union is encoded in JSON as follows:
- if its type is null, then it is encoded as a JSON null;
- otherwise it is encoded as a JSON object with one name/value pair whose
name is the type’s name and whose value is the recursively encoded value.
For Avro’s named types (record, fixed or enum) the user-specified name is
used, for other types the type name is used.
Which means that the null branch should be encoded as null and the int branch should be encoded as { "int": 1 }, but in this representation the null branch is encoded as "Null".
The "Rusty" way
Option<i32>
Advantages
This is the most intuitive for Rust users
Disadvantages
This also isn't the proper avro json encoding if you choose to serialize it with serde_json because the int branch gets encoded as 1 without the discriminator
The Anchor On Avro Json Encoding And Serde Json Way
enum MyUnionAvroJsonEncoding {
#[serde(rename = "int")]
Int(i32),
}
Option<MyUnionAvroJsonEncoding>
Advantages
This is technically the most "correct" representation because it serializes exactly according to the avro json encoding specification
Disadvantages
This is even worse looking than the avro-rs way
Implications
This begs the question of
What variations of these Rust representations should we support in our serializer and deserializer, and is it even possible to support said representation?
As of today, 1 and 2 are fully supported. EXCEPT, for nullable enums, thanks to me causing a regression with #337 (Which is unreleased so I guess not so bad). Not going to get into the details but basically Enums and Unions are the same type in Rust, and with serde you can't tell if your unit variant is serializing an Enum like type or a Union like type.
IMHO the current plan of only supporting 1 and 2 for 2 variant nullable unions is probably sufficient, and avro json encoding should not be done by serde json but instead should be exposed by apache-avro with its own serialization of converting an apache-avro value to a serde_json value.
But what about Nullable Unions with >3 variants?
{
"name": "MyUnion",
"type": [
"null",
"int",
{
"type": "enum",
"name": "MyEnum",
"symbols": ["A", "B"]
},
{
"type": "record",
"name": "MyRecord",
"fields": [
{"name": "a", "type": "int"}
]
}
]
}
There are in my opinion, two ways to represent this in rust
The current avro-rs way
enum MyEnum {
A,
B,
}
struct MyRecord {
a: i32,
}
enum MyUnionNullable {
Null,
Int(i32),
MyEnum(MyEnum),
MyRecord(MyRecord),
}
The "Rusty" way
enum MyEnum {
A,
B,
}
struct MyRecord {
a: i32,
}
enum MyUnion {
Int(i32),
MyEnum(MyEnum),
MyRecord(MyRecord),
}
Option<MyUnion>
Implications
My #337 PR was actually aimed to solve this problem. Before said PR, the "rusty" way of representing combination unions actually didn't work at all since it would immediately pick the first variant of of the schema and try to select it.
HOWEVER, my PR was an mediocre fix at best because in order for it to be supported, #[serde(untagged)] was required to be decorated on the type, and it would break in 2 edge cases.
- Enums didn't work. This is because of the same problem that I mentioned regarding the regression that I created
- Because of untagged, deserialization would deserialize to the incorrect variant if their fields were identical, or if an earlier variant could "serde" deserialize a later variant with its Option semantics, example shown below
const NULLABLE_RECORD_SCHEMA: &str = r#"
{
"name": "MyUnion",
"type": [
"null",
{
"type": "record",
"name": "MyRecordA",
"fields": [
{"name": "a", "type": "int"}
]
},
{
"type": "record",
"name": "MyRecordB",
"fields": [
{"name": "a", "type": "int"}
]
}
]
}
"#;
#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct MyRecordA {
a: i32,
}
#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct MyRecordB {
a: i32,
}
#[derive(Debug, Serialize, Deserialize, PartialEq)]
#[serde(untagged)]
enum MyUnionUntagged {
MyRecordA(MyRecordA),
MyRecordB(MyRecordB),
}
In the above example, an avro of variant MyRecordB will always deserialize to MyRecordA because untagged means there's no discriminant and the best effort deserialization chooses MyRecordA
Ways to address
The goals of my 2 proosals here are to get rid of the reliance on #[serde(untagged)] for serialization. I haven't dove into the deserialization yet, but rest assured, these plans will not become a PR unless a roundtrip can be successful.
-
Lookup Union variants by name rather than index. serde serialize_x_variant provides the name of the variant branch you are in. Instead of lookup by variant_index, lookup by variant_name to match the schema. Unnamed types like primitves in avro will just follow the name as provided by json encoding. This change would allow enums to be in whatever order they want, but add a restriction that names must strictly follow their primitive name i.e. Int String Long. This would be a breaking change since previous users would now need to rename their variants to align with the naming convention. Also there may be a potential performance hit because we need to scan our union schemas to match, although I'm not sure how relevant that is. We could also technically only do Lookup by Name if we use the Option representation, since we would be notified through
serialize_some -
Still use index based lookup, BUT make the assumption that if you use Option, The null variant must be the 0th branch, and properly bump index by 1 when you arrive at "serialize_x_variant." However, this would mean that if you use Option, you are restricted to having a certain schema shape, but in the vast majority of cases, null is always the first variant (I recall its mentioned in the spec that this is standard practice)
Final thoughts
serde and avro don't play the nicest together, but its what we have. My goal is for serialization and deserialization to work properly for everyone trying to use the crate while trying to follow idiomatic rust as possible. Sorry for the long post, but thanks for your attention.
Appendix
I attached a comprehensive minimal test code that documents the different successes and failures of representations of unions with null variant in rust. (GH doesn't let me upload an rs file so its in txt)