Value to JSON serialisation is wrong for Unions that are not Null

**Edit: I noticed this is somewhat related to #365 so please let me know if I should move my discussion there. Thanks!**

Hi maintainers! First of all, I would love to thank you for your work bringing apache avro into the Rust ecosystem. I am truly grateful. I have been trying to adopt avro more at work, but in its current state, it is quite brittle, especially with respect to using serde with Unions that contain a null variant. I plan on contributing to fix these issues, but before then I would like some input from the maintainers with respect to what we would like to support wrt this crate.

# TLDR
- There are arguably 3 ways to represent avro nullable unions in Rust, some work and some don't, but I'd like input on which ones we'd like to support or if we should take a stance on a single correct way.
- For nullable unions with >3 variants, the rusty way requires `#[serde(untagged)]` to work but it fails on multiple edge cases. Would like input on which plan to address is better direction for this crate

# The Rust Representation Conundrum

Given an schema of 

```
    {
        "name": "MyUnion",
        "type": [
            "null",
            "int"
        ]
    }
```

There are currently 3 arguable ways to represent this with Rust types

## The current `avro-rs` way

```
enum MyUnionNullable {
    Null,
    Int(i32),
}
```

### Advantages

This aligns closest to how avro binary encoding works and is simplest to serialize. Use the variant index from serde to properly encode the branch index then encode the contained value

### Disadvantages

This is not "rusty", as null or missing values are generally represented as `Option<T>`, and by doing this you lose the nice semantics of `Option`.

This also isn't the proper avro json encoding if you choose to serialize it with `serde_json`. 

According to the avro specification,

> The value of a union is encoded in JSON as follows:
>
> - if its type is null, then it is encoded as a JSON null;
> - otherwise it is encoded as a JSON object with one name/value pair whose
  name is the type’s name and whose value is the recursively encoded value.
  For Avro’s named types (record, fixed or enum) the user-specified name is
  used, for other types the type name is used.

Which means that the null branch should be encoded as `null` and the int branch should be encoded as `{ "int": 1 }`, but in this representation the null branch is encoded as `"Null"`.

## The "Rusty" way

```
Option<i32>
```

### Advantages

This is the most intuitive for Rust users

### Disadvantages

This also isn't the proper avro json encoding if you choose to serialize it with `serde_json` because the int branch gets encoded as `1` without the discriminator

## The Anchor On Avro Json Encoding And Serde Json Way

```
enum MyUnionAvroJsonEncoding {
    #[serde(rename = "int")]
    Int(i32),
}

Option<MyUnionAvroJsonEncoding>
```

### Advantages

This is technically the most "correct" representation because it serializes exactly according to the avro json encoding specification

### Disadvantages

This is even worse looking than the `avro-rs` way

# Implications

This begs the question of

**What variations of these Rust representations should we support in our serializer and deserializer, and is it even possible to support said representation?**

As of today, 1 and 2 are fully supported. EXCEPT, for nullable enums, thanks to me causing a regression with #337 (Which is unreleased so I guess not so bad). Not going to get into the details but basically Enums and Unions are the same type in Rust, and with serde you can't tell if your unit variant is serializing an Enum like type or a Union like type. 

IMHO the current plan of only supporting 1 and 2 for 2 variant nullable unions is probably sufficient, and avro json encoding should not be done by serde json but instead should be exposed by `apache-avro` with its own serialization of converting an `apache-avro` value to a `serde_json` value.

# But what about Nullable Unions with >3 variants?

```
    {
        "name": "MyUnion",
        "type": [
            "null",
            "int",
            {
                "type": "enum",
                "name": "MyEnum",
                "symbols": ["A", "B"]
            },
            {
                "type": "record",
                "name": "MyRecord",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            }
        ]
    }
```
There are in my opinion, two ways to represent this in rust

## The current `avro-rs` way

```
    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnionNullable {
        Null,
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }
```

## The "Rusty" way
```
    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnion {
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }

    Option<MyUnion>
```
# Implications

My #337 PR was actually aimed to solve this problem. Before said PR, the "rusty" way of representing combination unions actually didn't work at all since it would immediately pick the first variant of of the schema and try to select it. 

HOWEVER, my PR was an mediocre fix at best because in order for it to be supported, `#[serde(untagged)]` was required to be decorated on the type, and it would break in 2 edge cases.

1. Enums didn't work. This is because of the same problem that I mentioned regarding the regression that I created
2. Because of untagged, deserialization would deserialize to the incorrect variant if their fields were identical, or if an earlier variant could "serde" deserialize a later variant with its Option semantics, example shown below

```
    const NULLABLE_RECORD_SCHEMA: &str = r#"
    {
        "name": "MyUnion",
        "type": [
            "null",
            {
                "type": "record",
                "name": "MyRecordA",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            },
            {
                "type": "record",
                "name": "MyRecordB",
                "fields": [
                    {"name": "a", "type": "int"}
                ]
            }
        ]
    }
    "#;

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    struct MyRecordA {
        a: i32,
    }

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    struct MyRecordB {
        a: i32,
    }

    #[derive(Debug, Serialize, Deserialize, PartialEq)]
    #[serde(untagged)]
    enum MyUnionUntagged {
        MyRecordA(MyRecordA),
        MyRecordB(MyRecordB),
    }
```

In the above example, an avro of variant `MyRecordB` will always deserialize to `MyRecordA` because untagged means there's no discriminant and the best effort deserialization chooses MyRecordA

## Ways to address

The goals of my 2 proosals here are to get rid of the reliance on #[serde(untagged)] for serialization. I haven't dove into the deserialization yet, but rest assured, these plans will not become a PR unless a roundtrip can be successful.

1. Lookup Union variants by name rather than index. serde serialize_x_variant provides the name of the variant branch you are in. Instead of lookup by variant_index, lookup by variant_name to match the schema. Unnamed types like primitves in avro will just follow the name as provided by json encoding. This change would allow enums to be in whatever order they want, but add a restriction that names must strictly follow their primitive name i.e. Int String Long. This would be a breaking change since previous users would now need to rename their variants to align with the naming convention. Also there may be a potential performance hit because we need to scan our union schemas to match, although I'm not sure how relevant that is. We could also technically only do Lookup by Name if we use the Option<Enum> representation, since we would be notified through `serialize_some`

2. Still use index based lookup, BUT make the assumption that if you use Option<Enum>, The null variant must be the 0th branch, and properly bump index by 1 when you arrive at "serialize_x_variant." However, this would mean that if you use Option<Enum>, you are restricted to having a certain schema shape, but in the vast majority of cases, null is always the first variant (I recall its mentioned in the spec that this is standard practice)

# Final thoughts

`serde` and `avro` don't play the nicest together, but its what we have. My goal is for serialization and deserialization to work properly for everyone trying to use the crate while trying to follow idiomatic rust as possible. Sorry for the long post, but thanks for your attention.

# Appendix

I attached a comprehensive minimal test code that documents the different successes and failures of representations of unions with null variant in rust. (GH doesn't let me upload an rs file so its in txt)

[test_nullable_union.txt](https://github.com/user-attachments/files/24978848/test_nullable_union.txt)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Value to JSON serialisation is wrong for Unions that are not Null #449

TLDR

The Rust Representation Conundrum

The current `avro-rs` way

Advantages

Disadvantages

The "Rusty" way

Advantages

Disadvantages

The Anchor On Avro Json Encoding And Serde Json Way

Advantages

Disadvantages

Implications

But what about Nullable Unions with >3 variants?

The current `avro-rs` way

The "Rusty" way

Implications

Ways to address

Final thoughts

Appendix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Value to JSON serialisation is wrong for Unions that are not Null #449

Description

TLDR

The Rust Representation Conundrum

The current avro-rs way

Advantages

Disadvantages

The "Rusty" way

Advantages

Disadvantages

The Anchor On Avro Json Encoding And Serde Json Way

Advantages

Disadvantages

Implications

But what about Nullable Unions with >3 variants?

The current avro-rs way

The "Rusty" way

Implications

Ways to address

Final thoughts

Appendix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

The current `avro-rs` way

The current `avro-rs` way