Skip to content

Value to JSON serialisation is wrong for Unions that are not Null #449

@PookieBuns

Description

@PookieBuns

Edit: I noticed this is somewhat related to #365 so please let me know if I should move my discussion there. Thanks!

Hi maintainers! First of all, I would love to thank you for your work bringing apache avro into the Rust ecosystem. I am truly grateful. I have been trying to adopt avro more at work, but in its current state, it is quite brittle, especially with respect to using serde with Unions that contain a null variant. I plan on contributing to fix these issues, but before then I would like some input from the maintainers with respect to what we would like to support wrt this crate.

TLDR

  • There are arguably 3 ways to represent avro nullable unions in Rust, some work and some don't, but I'd like input on which ones we'd like to support or if we should take a stance on a single correct way.
  • For nullable unions with >3 variants, the rusty way requires #[serde(untagged)] to work but it fails on multiple edge cases. Would like input on which plan to address is better direction for this crate

The Rust Representation Conundrum

Given an schema of

    {
        "name": "MyUnion",
        "type": [
            "null",
            "int"
        ]
    }

There are currently 3 arguable ways to represent this with Rust types

The current avro-rs way

enum MyUnionNullable {
    Null,
    Int(i32),
}

Advantages

This aligns closest to how avro binary encoding works and is simplest to serialize. Use the variant index from serde to properly encode the branch index then encode the contained value

Disadvantages

This is not "rusty", as null or missing values are generally represented as Option<T>, and by doing this you lose the nice semantics of Option.

This also isn't the proper avro json encoding if you choose to serialize it with serde_json.

According to the avro specification,

The value of a union is encoded in JSON as follows:

  • if its type is null, then it is encoded as a JSON null;
  • otherwise it is encoded as a JSON object with one name/value pair whose
    name is the type’s name and whose value is the recursively encoded value.
    For Avro’s named types (record, fixed or enum) the user-specified name is
    used, for other types the type name is used.

Which means that the null branch should be encoded as null and the int branch should be encoded as { "int": 1 }, but in this representation the null branch is encoded as "Null".

The "Rusty" way

Option<i32>

Advantages

This is the most intuitive for Rust users

Disadvantages

This also isn't the proper avro json encoding if you choose to serialize it with serde_json because the int branch gets encoded as 1 without the discriminator

The Anchor On Avro Json Encoding And Serde Json Way

enum MyUnionAvroJsonEncoding {
    #[serde(rename = "int")]
    Int(i32),
}

Option<MyUnionAvroJsonEncoding>

Advantages

This is technically the most "correct" representation because it serializes exactly according to the avro json encoding specification

Disadvantages

This is even worse looking than the avro-rs way

Implications

This begs the question of

What variations of these Rust representations should we support in our serializer and deserializer, and is it even possible to support said representation?

As of today, 1 and 2 are fully supported. EXCEPT, for nullable enums, thanks to me causing a regression with #337 (Which is unreleased so I guess not so bad). Not going to get into the details but basically Enums and Unions are the same type in Rust, and with serde you can't tell if your unit variant is serializing an Enum like type or a Union like type.

IMHO the current plan of only supporting 1 and 2 for 2 variant nullable unions is probably sufficient, and avro json encoding should not be done by serde json but instead should be exposed by apache-avro with its own serialization of converting an apache-avro value to a serde_json value.

But what about Nullable Unions with >3 variants?

    {
        "name": "MyUnion",
        "type": [
            "null",
            "int",
            {
                "type": "enum",
                "name": "MyEnum",
                "symbols": ["A", "B"]
            },
            {
                "type": "record",
                "name": "MyRecord",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            }
        ]
    }

There are in my opinion, two ways to represent this in rust

The current avro-rs way

    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnionNullable {
        Null,
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }

The "Rusty" way

    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnion {
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }

    Option<MyUnion>

Implications

My #337 PR was actually aimed to solve this problem. Before said PR, the "rusty" way of representing combination unions actually didn't work at all since it would immediately pick the first variant of of the schema and try to select it.

HOWEVER, my PR was an mediocre fix at best because in order for it to be supported, #[serde(untagged)] was required to be decorated on the type, and it would break in 2 edge cases.

  1. Enums didn't work. This is because of the same problem that I mentioned regarding the regression that I created
  2. Because of untagged, deserialization would deserialize to the incorrect variant if their fields were identical, or if an earlier variant could "serde" deserialize a later variant with its Option semantics, example shown below
    const NULLABLE_RECORD_SCHEMA: &str = r#"
    {
        "name": "MyUnion",
        "type": [
            "null",
            {
                "type": "record",
                "name": "MyRecordA",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            },
            {
                "type": "record",
                "name": "MyRecordB",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            }
        ]
    }
    "#;

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    struct MyRecordA {
        a: i32,
    }

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    struct MyRecordB {
        a: i32,
    }

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    #[serde(untagged)]
    enum MyUnionUntagged {
        MyRecordA(MyRecordA),
        MyRecordB(MyRecordB),
    }

In the above example, an avro of variant MyRecordB will always deserialize to MyRecordA because untagged means there's no discriminant and the best effort deserialization chooses MyRecordA

Ways to address

The goals of my 2 proosals here are to get rid of the reliance on #[serde(untagged)] for serialization. I haven't dove into the deserialization yet, but rest assured, these plans will not become a PR unless a roundtrip can be successful.

  1. Lookup Union variants by name rather than index. serde serialize_x_variant provides the name of the variant branch you are in. Instead of lookup by variant_index, lookup by variant_name to match the schema. Unnamed types like primitves in avro will just follow the name as provided by json encoding. This change would allow enums to be in whatever order they want, but add a restriction that names must strictly follow their primitive name i.e. Int String Long. This would be a breaking change since previous users would now need to rename their variants to align with the naming convention. Also there may be a potential performance hit because we need to scan our union schemas to match, although I'm not sure how relevant that is. We could also technically only do Lookup by Name if we use the Option representation, since we would be notified through serialize_some

  2. Still use index based lookup, BUT make the assumption that if you use Option, The null variant must be the 0th branch, and properly bump index by 1 when you arrive at "serialize_x_variant." However, this would mean that if you use Option, you are restricted to having a certain schema shape, but in the vast majority of cases, null is always the first variant (I recall its mentioned in the spec that this is standard practice)

Final thoughts

serde and avro don't play the nicest together, but its what we have. My goal is for serialization and deserialization to work properly for everyone trying to use the crate while trying to follow idiomatic rust as possible. Sorry for the long post, but thanks for your attention.

Appendix

I attached a comprehensive minimal test code that documents the different successes and failures of representations of unions with null variant in rust. (GH doesn't let me upload an rs file so its in txt)

test_nullable_union.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions