-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Simple encoding of unions in C# #7544
Description
Simple encoding of unions in C#
Unions are a feature where the runtime representation matters a lot to the trade-off between expressiveness and performance. Here's a simple way to encode type unions that are very expressive, and tagged unions that are type unions instead of an alternative to them. This approach trades off performance compared to others, notably around allocation, in exchange for simplicity and expressiveness.
It might just be the right approach for us! But regardless, it may serve as a useful endpoint of the spectrum, and we can use it to measure any comparative loss of expressiveness, elegance and unification incurred by lower-allocation alternatives.
I won't dwell on syntax. I'll adopt some suggestive syntax, but not discuss it. The focus is on the underlying structures. I lean heavily on the great work of the Discriminated Unions Working Group.
Type unions
A "type union" corresponds to a union of a set of types, as in union (string | int | bool). A value belongs to the union if and only if it belongs to (at least) one of the constituent types - in this case string, int or bool.
Type unions can take the form of anonymous unions in e.g. method signatures, but can also grow up to become declared unions when circumstances make it preferable, e.g. because of verbosity, repeated use or the need to add additional members - similar to how tuples can grow up to be declared as struct records.
A declared union type could look something like:
public union Result : (string | int)
{
public bool IsEmpty => this is null or "" or 0;
}Declared unions can't have additional state, but function members can operate on the value of this itself.
Runtime representation
A union type lowers to a struct type which wraps the single object field __value containing the value of the union. The struct declaration will contain some mix of members and metadata to tell the compiler what the constituent types are, and enable it to implement union-specific semantics. In the case of declared unions, any declared function members are lowered to make use of the underlying __value instead of this. E.g.:
[Union(typeof(string), typeof(int))]
public struct Result
{
// Generated members
public object __value;
// Declared members
public bool IsEmpty => __value is null or "" or 0;
}The exact format of the metadata is TBD - the [Union] attribute is just for illustrative purposes.
Anonymous unions lower to a declaration with a compiler-generated name (say __Union1), and the compiler is free to reuse declarations for equivalent union types across the same assembly.
If any of the constituent types are value types, those values will be boxed into the __value field. There is no need to separately record the type of the contents of __value.
If all the constituent types of a union have a shared base class, we might be able to use that instead of object as the type of __value.
This simple layout means that all unions have the same representation in memory, so that they are easily interchangeable, and so that generated types for anonymous unions are optimally shareable.
Also by being a single reference field they are small enough to behave atomically and not be subject to "tearing"; i.e., inconsistent state caused by multiple threads mutating simultaneously.
Finally, by using a reference field for the value, declared union types can be recursive, containing other instances of the same type.
Type testing
The usual mode of consumption for a type union is through a type test, e.g. a type pattern, to see what type its value actually has.
if (result is string s) { ... }The generated code for this is trivial; just redirect the type test to the underlying __value:
if (result.__value is string s) { ... }The runtime already knows how to test the type of the underlying member. No need to invent a new encoding.
Implicit conversions
This runtime representation allows us to offer extensive structural conversions between unions, whether declared or anonymous.
Where two union types U1 and U2 are deemed equivalent by the compiler, there is an implicit conversion between them in both directions. (This isn't quite an identity conversion, because unfortunately it won't apply to arrays or generic types over union types).
Where U1 is deemed a subtype of U2 by the compiler, an implicit conversion exists from U1 to U2.
In addition, an implicit conversion exists from each constituent type to the union type, and from the union type to any shared base type of its constituent types.
class B { }
class C : B { }
class D : B { }
union (C | D) u1 = new C(); // C --> union (C | D)
union (D | C) u2 = u1; // union (C | D) <--> union (D | C)
union (C | D | string) u3 = u1; // union (C | D) --> union (C | D | string)
B b = u1; // union (C | D) --> BThe runtime implementation of these implicit conversions is trivial, since it can just wrap the __value of the source in the target type:
__Union1 u1 = new () { __value = new C() };
__Union2 u2 = new () { __value = u1.__value };
__Union3 u3 = new () { __value = u1.__value };
B b = (B)u1.__value;The declaration of u2 might in fact lower like this instead:
__Union1 u2 = u1;Since the compiler would have realized that union (C | D) and union (D | C) are equivalent anonymous unions and might have reused the same lowered declaration for both.
The declaration of b might look like this instead:
B b = u1.__value;Since the compiler would have realized that B is a shared base class of C and D, and might therefore have declared the __value field of __Union1 to be of type B instead of object, thus obviating the explicit cast.
Tagged unions
The general idea of tagged unions (or "discriminated unions") is that each "variant" or "case" of a union has a tag, as well as some contents (zero, one or more values of given types). A given tag can then be matched with a pattern that is syntactically identical to a type pattern:
return result switch
{
Success(var outcome) => ...,
Failure(var exception) => ...
}What if this isn't like a type pattern, but is a type pattern? Consider the following declarations:
public struct record Success(Outcome outcome);
public struct record Failure(Exception e);
public union Result : (Success | Failure);Assuming result is of type Result, this makes the above switch expression work exactly as expected!
The proposal then is that we unify tagged unions into type unions by saying that a tag is simply a named wrapper type for zero, one or more values. You can deconstruct them immediately upon matching, but you don't have to. As types they are already an existing concept and first class citizens in the language. No new machinery or semantics needed.
Of course, a union type would be free to mix tagged and untagged constituent types, and it might well make sense to do so.
This is not to say that there shouldn't be convenient syntax directly supporting tagged unions, though perhaps it, too, should be more integrated into the syntax of type unions. The main point is that, whatever the syntax, tags, semantically speaking, are just types.
Tag type declarations could certainly be nested within type union declarations, so that they're proprietary to the given union type and don't "pollute" the top-level type space. If there's a dedicated syntax for tagged unions, maybe it generates the tag types nested inside of the union type by default. This might be paired with a feature like #2926 to keep the pattern matching against nested types elegant.
Performance
The main performance downside of this approach to unions is the frequent need for allocation. Value types need to be boxed, and "tag types" would also need to be allocated to go into the reference field of a union type. This seems particularly glaring when talking about a small union of a few small value types.
There are probably some mitigations possible without deviating from the format. For instance, common boxed values (true, false, -1, 0, 1, 2,...) could be cached and reused. But fundamentally, most value types would be boxed most of the time.
Non-allocating alternatives
On the other hand, non-allocating alternatives have performance downsides too. Trying to represent a union value as a struct that has a field for each possible type gets very space inefficient. While some packing is probably possible, it is complicated, as it must respect the runtime's need to keep references and non-reference data separate. So unions of many and/or large value types could become quite wasteful (lots of zeros in memory!), expensive to copy around and complicated to pack and unpack to underlying values.
On top of that, every union type would have a unique layout. Even if two of them are structurally related, any interop between them, such as implicit conversion or equality comparison, would be quite complicated and expensive, or maybe outright prohibited.
Recursive types
We should note that F# represents unions as classes by default. While different in detail from what's proposed here, it leads to a similar amount of allocation. In F#, discriminated unions are used - indeed usually preferred - as an alternative to class hierarchies to describe your data model. While you can explicitly ask for struct unions in F#, those are quite limited in comparison, partly because they cannot be recursive.
If you're doing serious data modeling, chances are you're going to need recursive types - where a value can directly or indirectly contain another value of its same type. In order to represent that, you need something to be a reference type: Value types cannot contain themselves! If union types don't support that directly, you're going to have to go out of your way to explicitly introduce reference types elsewhere. Your code gets more complicated, and the end tally of allocations will end up much the same.
Middle ground?
It's possible that we could pick another struct representation that is still uniform across all union types while avoiding a large portion of allocations. For instance, we could standardize on a scheme similar to https://github.com/JeremyKuhne/ValuePrototype, which in addition to a reference field keeps a certain number of bits on the side for storing small enough values without boxing. Perhaps we could expand the scheme to also efficiently represent "tag types" without extra allocation in many cases.
Compared to the single reference approach, such a scheme would make all union values bigger (perhaps twice as big), taking up more inline space and costing more to copy, but at least not in a way that grows proportionally to the number of constituent types involved. In addition, they would no longer be safe against tearing. But the approach would maintain the all-important property that the same value is represented by the same bits, regardless of which union type it is stored in. As such, it could still straightforwardly support many of the semantic benefits.
Conclusion
I think the described approach compares favorably to others in terms of expressiveness and inherent simplicity. As such it will be useful to measure against as we explore other approaches with different performance characteristics.