Simple encoding of unions in C#

# Simple encoding of unions in C#

Unions are a feature where the runtime representation matters a lot to the trade-off between expressiveness and performance. Here's a simple way to encode *type unions* that are very expressive, and *tagged unions* that *are* type unions instead of an alternative to them. This approach trades off performance compared to others, notably around allocation, in exchange for simplicity and expressiveness. 

It might just be the right approach for us! But regardless, it may serve as a useful endpoint of the spectrum, and we can use it to measure any comparative loss of expressiveness, elegance and unification incurred by lower-allocation alternatives.

I won't dwell on syntax. I'll adopt some suggestive syntax, but not discuss it. The focus is on the underlying structures. I lean heavily on the great work of the [Discriminated Unions Working Group](https://github.com/dotnet/csharplang/tree/main/meetings/working-groups/discriminated-unions).

# Type unions

A "type union" corresponds to a union of a set of types, as in `union (string | int | bool)`. A value belongs to the union if and only if it belongs to (at least) one of the constituent types - in this case `string`, `int` or `bool`. 
Type unions can take the form of *anonymous unions* in e.g. method signatures, but can also grow up to become *declared unions* when circumstances make it preferable, e.g. because of verbosity, repeated use or the need to add additional members - similar to how tuples can grow up to be declared as struct records.

A declared union type could look something like:

``` c#
public union Result : (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}
```

Declared unions can't have additional state, but function members can operate on the value of `this` itself.

# Runtime representation

A union type lowers to a struct type which wraps the single `object` field `__value` containing the value of the union. The struct declaration will contain some mix of members and metadata to tell the compiler what the constituent types are, and enable it to implement union-specific semantics. In the case of declared unions, any declared function members are lowered to make use of the underlying `__value` instead of `this`. E.g.:

``` c#
[Union(typeof(string), typeof(int))]
public struct Result 
{
    // Generated members
    public object __value;

    // Declared members
    public bool IsEmpty => __value is null or "" or 0;
}
```

The exact format of the metadata is TBD - the `[Union]` attribute is just for illustrative purposes.

Anonymous unions lower to a declaration with a compiler-generated name (say `__Union1`), and the compiler is free to reuse declarations for equivalent union types across the same assembly.

If any of the constituent types are value types, those values will be boxed into the `__value` field. There is no need to separately record the type of the contents of `__value`.

If all the constituent types of a union have a shared base class, we might be able to use that instead of `object` as the type of `__value`.

This simple layout means that all unions have the same representation in memory, so that they are easily interchangeable, and so that generated types for anonymous unions are optimally shareable.

Also by being a single reference field they are small enough to behave atomically and not be subject to "tearing"; i.e., inconsistent state caused by multiple threads mutating simultaneously.

Finally, by using a reference field for the value, declared union types can be recursive, containing other instances of the same type.

# Type testing

The usual mode of consumption for a type union is through a type test, e.g. a type pattern, to see what type its value actually has. 

``` c#
if (result is string s) { ... }
```

The generated code for this is trivial; just redirect the type test to the underlying `__value`:

``` c#
if (result.__value is string s) { ... }
```

The runtime already knows how to test the type of the underlying member. No need to invent a new encoding.

# Implicit conversions

This runtime representation allows us to offer extensive structural conversions between unions, whether declared or anonymous. 

Where two union types `U1` and `U2` are deemed equivalent by the compiler, there is an implicit conversion between them in both directions. (This isn't quite an identity conversion, because unfortunately it won't apply to arrays or generic types over union types). 

Where `U1` is deemed a subtype of `U2` by the compiler, an implicit conversion exists from `U1` to `U2`.

In addition, an implicit conversion exists from each constituent type to the union type, and from the union type to any shared base type of its constituent types.

``` c#
class B { }
class C : B { }
class D : B { }
union (C | D) u1 = new C();     // C --> union (C | D)
union (D | C) u2 = u1;          // union (C | D) <--> union (D | C)
union (C | D | string) u3 = u1; // union (C | D) --> union (C | D | string)
B b = u1;                       // union (C | D) --> B
```

The runtime implementation of these implicit conversions is trivial, since it can just wrap the `__value` of the source in the target type:

``` c#
__Union1 u1 = new () { __value = new C() };
__Union2 u2 = new () { __value = u1.__value };
__Union3 u3 = new () { __value = u1.__value };
B b = (B)u1.__value;
```

The declaration of `u2` might in fact lower like this instead:

``` c#
__Union1 u2 = u1;
```

Since the compiler would have realized that `union (C | D)` and `union (D | C)` are equivalent anonymous unions and might have reused the same lowered declaration for both.

The declaration of `b` might look like this instead:

``` c#
B b = u1.__value;
```

Since the compiler would have realized that `B` is a shared base class of `C` and `D`, and might therefore have declared the `__value` field of `__Union1` to be of type `B` instead of `object`, thus obviating the explicit cast.

# Tagged unions

The general idea of tagged unions (or "discriminated unions") is that each "variant" or "case" of a union has a tag, as well as some contents (zero, one or more values of given types). A given tag can then be matched with a pattern that is syntactically identical to a type pattern:

``` c#
return result switch
{
    Success(var outcome) => ...,
    Failure(var exception) => ...
}
```

What if this isn't *like* a type pattern, but *is* a type pattern? Consider the following declarations:

``` c#
public struct record Success(Outcome outcome);
public struct record Failure(Exception e);
public union Result : (Success | Failure);
```

Assuming `result` is of type `Result`, this makes the above switch expression work exactly as expected!

The proposal then is that we unify tagged unions into type unions by saying that a tag is simply a named wrapper type for zero, one or more values. You *can* deconstruct them immediately upon matching, but you don't have to. As types they are already an existing concept and first class citizens in the language. No new machinery or semantics needed.

Of course, a union type would be free to mix tagged and untagged constituent types, and it might well make sense to do so.

This is not to say that there shouldn't be convenient syntax directly supporting tagged unions, though perhaps it, too, should be more integrated into the syntax of type unions. The main point is that, whatever the syntax, tags, semantically speaking, are just types.

Tag type declarations could certainly be nested within type union declarations, so that they're proprietary to the given union type and don't "pollute" the top-level type space. If there's a dedicated syntax for tagged unions, maybe it generates the tag types nested inside of the union type by default. This might be paired with a feature like https://github.com/dotnet/csharplang/issues/2926 to keep the pattern matching against nested types elegant.

# Performance

The main performance downside of this approach to unions is the frequent need for allocation. Value types need to be boxed, and "tag types" would also need to be allocated to go into the reference field of a union type. This seems particularly glaring when talking about a small union of a few small value types.

There are probably some mitigations possible without deviating from the format. For instance, common boxed values (`true`, `false`, `-1`, `0`, `1`, `2`,...) could be cached and reused. But fundamentally, most value types would be boxed most of the time.

## Non-allocating alternatives

On the other hand, [non-allocating alternatives have performance downsides too](https://github.com/dotnet/csharplang/issues/7016). Trying to represent a union value as a struct that has a field for each possible type gets very space inefficient. While some packing is probably possible, it is complicated, as it must respect the runtime's need to keep references and non-reference data separate. So unions of many and/or large value types could become quite wasteful (lots of zeros in memory!), expensive to copy around and complicated to pack and unpack to underlying values.

On top of that, every union type would have a unique layout. Even if two of them are structurally related, any interop between them, such as implicit conversion or equality comparison, would be quite complicated and expensive, or maybe outright prohibited.

## Recursive types

We should note that [F# represents unions as classes](https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/discriminated-unions) by default. While different in detail from what's proposed here, it leads to a similar amount of allocation. In F#, discriminated unions are used - indeed usually preferred - as an alternative to class hierarchies to describe your data model. While you can explicitly ask for struct unions in F#, those are quite limited in comparison, partly because they cannot be recursive. 

If you're doing serious data modeling, chances are you're going to need recursive types - where a value can directly or indirectly contain another value of its same type. In order to represent that, you need *something* to be a reference type: Value types cannot contain themselves! If union types don't support that directly, you're going to have to go out of your way to explicitly introduce reference types elsewhere. Your code gets more complicated, and the end tally of allocations will end up much the same.

## Middle ground?

It's possible that we could pick another struct representation that is still uniform across all union types while avoiding a large portion of allocations. For instance, we could standardize on a scheme similar to https://github.com/JeremyKuhne/ValuePrototype, which in addition to a reference field keeps a certain number of bits on the side for storing small enough values without boxing. Perhaps we could expand the scheme to also efficiently represent "tag types" without extra allocation in many cases.

Compared to the single reference approach, such a scheme would make all union values bigger (perhaps twice as big), taking up more inline space and costing more to copy, but at least not in a way that grows proportionally to the number of constituent types involved. In addition, they would no longer be safe against tearing. But the approach would maintain the all-important property that the same value is represented by the same bits, regardless of which union type it is stored in. As such, it could still straightforwardly support many of the semantic benefits.

# Conclusion

I think the described approach compares favorably to others in terms of expressiveness and inherent simplicity. As such it will be useful to measure against as we explore other approaches with different performance characteristics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simple encoding of unions in C# #7544

Simple encoding of unions in C#

Type unions

Runtime representation

Type testing

Implicit conversions

Tagged unions

Performance

Non-allocating alternatives

Recursive types

Middle ground?

Conclusion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Simple encoding of unions in C# #7544

Description

Simple encoding of unions in C#

Type unions

Runtime representation

Type testing

Implicit conversions

Tagged unions

Performance

Non-allocating alternatives

Recursive types

Middle ground?

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions