Message ur-formats

ray · January 30, 2025, 10:18am

Thesis: message formats are not inherently complex. Complex types exist in the system, but not at the level of message processing; handling them is a delegated behavior. However, we have a few ways of representing them internally; they have different purposes (e.g. Protobuf for “your language already has a compiler for protobufs” interoperability, Erlang terms for use inside the node and between nodes on the same VM host, Juvix and noun for execution) but they express the same things.

Therefore, I present an ur-format with very few moving parts for defining messages, with a sort of pseudocode-ish representation, along with ways to represent it in the various languages present in the system. It’s meant as a common denominator, not necessarily a lowest one but a low one.

Opaque

We start by introducing an “opaque”. This is just something which is irreducible as far as the message definition is concerned; it could be anything, but it’s not the message format’s responsibility to look inside it. (It might compare it for equality, sometimes, but I see that as pretty extensional; in cases where it might be problematic, like a program, this is already “a Nockma program” specifically and noun equality suffices.)

Some examples of opaque things could be:

an integer
a JPEG of a cat
a program
a proof
a message that we’re not processing here and now

In our pseudocode, this is just represented descriptively.

integral-message : integer
We handle an integral-message by making a ‘ping’ sound if it is nonnegative, and a ‘pong’ sound if it is negative.

cat-message : jpeg
We handle a cat-message by decoding the JPEG and displaying it on the screen.

We could just say “binary” here and not be far off, but saying that would imply some sort of requirement to marshal them to bits, always, which is often redundant. Say it’s a function; we would really prefer to just pass it around, until we reach a boundary where that’s not possible, because we probably plan to evaluate it soon.

However, we do want our opaque things to be able to be marshalled to bits; we might send them over the network. If you can make it a noun, you get jam for free; this works for e.g. Nockma programs. If it’s already a binary, you’re also set; this works for e.g. Cairo programs.
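As a minimal sketch of this idea in Python (the `Opaque` class, its `marshal` field, and the integer encoding are all hypothetical names and choices for illustration, not part of any Anoma codebase), an opaque value can be carried around as-is and only turned into bits when a boundary demands it:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Opaque:
    """A value the message layer never looks inside (hypothetical name)."""
    value: Any
    # How to turn the value into bytes; only called when a boundary is reached.
    marshal: Callable[[Any], bytes]

    def to_bytes(self) -> bytes:
        return self.marshal(self.value)


def int_to_bytes(n: int) -> bytes:
    # Signed big-endian encoding, chosen purely for illustration.
    length = max(1, (n.bit_length() + 8) // 8)
    return n.to_bytes(length, "big", signed=True)


# A function stays a function while it is passed around locally;
# the marshaller here is a placeholder, not a real code serializer.
double = Opaque(lambda x: 2 * x, marshal=lambda f: b"<code placeholder>")
# Something that is already binary marshals to itself.
cat = Opaque(b"\xff\xd8\xff\xe0", marshal=bytes)
# An integer needs an encoding chosen for it.
count = Opaque(-3, marshal=int_to_bytes)
```

The message layer only ever calls `to_bytes()`; what the bytes mean stays the concern of whoever handles the opaque value.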

Record

The above doesn’t suffice for most messages, which have multiple fields. We therefore introduce records with named fields. This is reminiscent of a product type, but the fields do have names; this is type information about them, so you could just say it is a product type and have a strong case.

Order sometimes matters here, though not in every language representation. Where there is an order, the one given is canonical; otherwise the record might just be a map from field name to field value.

In our pseudocode, we’ll use square brackets:

cat-message : [
 name: string
 id: natural
 image: jpeg
 destination: node-id
]

In Elixir, this would look like (using the typedstruct macro to shorten things):

typedstruct do
  field(:name, String.t())
  field(:id, non_neg_integer())
  field(:image, JPEG.t())
  field(:destination, Anoma.Node.node_id())
end

In Juvix:

type CatMessage :=
  catMessage@{
    name : String;
    id : Nat;
    image : JPEG;
    destination : NodeID;
  };

As a protobuf, our opaque things are nearly always going to be bytes:

message CatMessage {
  string name = 1;
  bytes id = 2; // if it's capped, could be uint64 or similar
  bytes image = 3;
  NodeID destination = 4;
}

As a noun, the field names are only type information, so the record would just be a cell (following the brackets-associate-right rule to build tuples larger than 2). Giving some pseudo-Hoon since our Elixir compiler syntax isn’t decided yet:

+$  cat-message
  $:
    name=@t
    id=@u
    image=@
    destination=node-id
  ==

Whatever the compiler, it’s capable of knowing that name is at 2, id is at 6, image is at 14, and destination is at 15; this is most of its job.
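The axis arithmetic can be sketched in a few lines of Python. This assumes standard Nock tree addressing (axis 1 is the whole noun, the head of axis n is 2n, the tail is 2n + 1); `tuple_axes` is a hypothetical helper name, not an existing function:

```python
from typing import List


def tuple_axes(arity: int) -> List[int]:
    """Axis of each element in a right-associated tuple of `arity` elements.

    Axis 1 is the whole noun; the head of axis n is 2n and its tail is 2n + 1.
    """
    if arity < 1:
        raise ValueError("a tuple needs at least one element")
    axes = []
    axis = 1  # start at the root
    for _ in range(arity - 1):
        axes.append(2 * axis)  # this element is the head of the current cell
        axis = 2 * axis + 1    # the rest of the tuple lives in the tail
    axes.append(axis)          # the last element is the final tail itself
    return axes


fields = ["name", "id", "image", "destination"]
print(dict(zip(fields, tuple_axes(len(fields)))))
# → {'name': 2, 'id': 6, 'image': 14, 'destination': 15}
```

Every element except the last sits at the head of some cell, so its axis is even; the last element is the remaining tail.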

Tagged union

Tagged unions are not sum types, but we want something sum-type-like, and only Elixir and Hoon support true (untagged) sum types. This isn’t a huge deal since most practical sums include a tag anyway, but it will matter for some representations later.

We use the vertical bar in our pseudocode, since this is the most common syntax, and angle brackets to set them off since these are the least commonly used brackets.

cat-or-vampire-message : [
 name: string
 id: natural
 image: <cat: jpeg> | <vampire: nil>
 destination: node-id
]

(Vampires, of course, can’t be photographed.)

Elixir has sum types, so it can be simplified somewhat to:

typedstruct do
  field(:name, String.t())
  field(:id, non_neg_integer())
  field(:image, JPEG.t() | nil)
  field(:destination, Anoma.Node.node_id())
end

However, most cases will actually need the tag; in practice it’ll look more like {:cat, JPEG.t()} | :vampire ({:vampire, nil} would be very non-idiomatic redundancy).

I am not actually certain how to do this in Juvix without pushing the tag outward to the whole message; I just don’t know enough Juvix. I think you’d define something like

type CatOrVampireImage :=
  | Cat JPEG
  | Vampire ();

and use it as the field, possibly.

Protobuf would use optional because this is an option:

message CatOrVampireMessage {
  string name = 1;
  bytes id = 2;
  optional bytes cat_image = 3;
  NodeID destination = 4;
}

However, for something more complicated than “optional” it would use oneof instead. This comes with the tag for free (you can check which field of a oneof is set); it also comes with some concerns about ensuring the field number is unique, but all protobuf has that.

For nouns, we’re going to need the tag because atoms are our only ur-element so we can’t just sum them with nil or similar; we use 0 for “nil”. Pseudo-Hoon:

+$  cat-or-vampire-message
  $:
    name=@t
    id=@u
    $=  image
      $%
        [%cat jpeg=@]
        [%vampire ~]
      ==
    destination=node-id
  ==
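To see why the tag is needed on the noun side, here is a rough Python sketch that models atoms as non-negative integers and cells as 2-tuples. The helper names (`cord`, `encode_image`, `decode_image`) are hypothetical, and the bytes-to-atom conversion is a simplification: it drops trailing zero bytes, so a real encoding would need to carry a length.

```python
from typing import Optional, Tuple, Union

# Simplified noun model: an atom is an int, a cell is a pair of nouns.
Noun = Union[int, Tuple["Noun", "Noun"]]


def cord(text: str) -> int:
    """Encode text as an atom, first character in the lowest byte."""
    return int.from_bytes(text.encode("utf-8"), "little")


def encode_image(jpeg: Optional[bytes]) -> Noun:
    """Tagged union: [%cat jpeg] or [%vampire ~], with ~ written as 0."""
    if jpeg is None:
        return (cord("vampire"), 0)
    return (cord("cat"), int.from_bytes(jpeg, "little"))


def decode_image(noun: Noun) -> Optional[bytes]:
    tag, payload = noun
    if tag == cord("vampire"):
        return None
    if tag == cord("cat"):
        # Assumption: no trailing zero bytes were lost (see note above).
        return payload.to_bytes((payload.bit_length() + 7) // 8, "little")
    raise ValueError(f"unknown tag atom: {tag}")
```

Without the tag, the atom 0 would be ambiguous: it could be nil or an image whose bytes happen to encode to zero. The tag atom in the head of the cell removes that ambiguity.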

cwgoes · January 31, 2025, 3:01am

Thanks for this write-up. In general, the concept of a “low common denominator” format for defining messages with these basic elements (opaque, record/product, tagged union/sum) makes sense to me. I think it would be helpful to brainstorm (or to share, if you’ve already thought about this) a bit more of the surrounding context on how we expect this to be used, for example:

Would this ur-format come with some kind of code generation (ur-format pseudocode → Elixir/Juvix/Protobuf/Nock)? Alternatively (or additionally), we could write macros to process native types written in these languages into the ur-format.
Would this ur-format come with a defined, canonical “serialize” and “deserialize” pair (or “to_bytes” and “from_bytes”), implemented in each supported language, that guarantees that an instance of a particular format serialized in one language could be deserialized in another? This will require us to either make some more decisions about how that serialization / deserialization actually works, or choose one language which already has a binary format (e.g. protobuf) and make that language’s binary format canonical.
Do I infer correctly that the tags are purely “semantic” information and would not need to be represented in the serialized (binary) format?
Can we also support JSON? (I expect this should be straightforward)

degregat · January 31, 2025, 8:00am

The idea for a minimal common representation arose from the need to make behavioral commitments over messages uniquely definable, without assuming any specific language and implicitly importing implementation details from it into the specification.

It seems most straightforward to me to define a mapping to a representation that most/all implementations will need to deal with, i.e. a canonical wireformat. Ideally that should be one that already supports many languages, e.g. protobuf.

If any language does not have a mapping to and from the wireformat, a mapping to another language that does can be used if available, but then correctness of parser composition to translate through multiple formats becomes relevant.

In theory, we could also make the definitions in protobuf canonical directly, or specify a subset of protobuf as the ur-format, but I’m not sure if we might be importing anything we don’t want.

ray · February 5, 2025, 2:04pm

The reason we use protobufs for things at all is mainly the highly useful “your preferred language probably already has a protobuf compiler” property. How do you make a gRPC call? Compile our protobufs to your language and make one of those.

ur-message comes with conversions out. It’s not even necessarily a concrete thing; it’s satisfied by anything with conversions out into [Erlang, Protobuf, Noun, JSON, Juvix, &c]. The issue with inventing serialization formats for such is needing to write a decoder everywhere in every language; we use protobuf to sidestep this sometimes, because your language probably has a compiler.

Tagged unions do physically contain the tags; they have to. They’re not sum types, which of all the relevant languages only Erlang/Elixir and (to a limited extent) Hoon even support. In nearly all cases you can’t have A + B, only (Tag-A x A) + (Tag-B x B), even where a tag isn’t mathematically needed. And in practice both will use tags anyway, sometimes pushed inward: the typespec %StructA{} | %StructB{} doesn’t include a tag and is directly A + B, but the struct name inside the struct is being used as one.

cwgoes · February 6, 2025, 4:55am

I’m not quite following here. Maybe an alternative way to ask this question is simply: how do we convert between different ur-format representations? Surely we need to do that at some points, e.g. when accepting messages over the network we’ll need to convert a protobuf representation to an Elixir one. In order for the system to behave as we expect, this conversion must be bijective (or perhaps a slightly weaker set of properties, but something close), right?

degregat · February 6, 2025, 11:17am

Every language would have a mapping from native data structures to protobuf and back. Since there is a correspondence from ur-messages to protobufs, that should be sufficient to convert between all languages (source ↔ protobuf ↔ target).

We assume that all these protobuf mappings are correct, i.e. corresponding native data types exist in the respective languages and transformations are not lossy.

The mappings should be bijective (up to isomorphism) from source to target representations, but the serialized representations might not be unique, as serialization is not canonical.
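The distinction between a bijective mapping and a canonical serialization can be illustrated with a small Python sketch. JSON is used here only as a stand-in wire format so the example runs without generated protobuf code; the `CatMessage` dataclass and the `encode`/`decode` helpers are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CatMessage:
    name: str
    id: int
    destination: str


def encode(msg: CatMessage, sort_keys: bool) -> bytes:
    # Two valid encoders may order fields differently on the wire.
    return json.dumps(asdict(msg), sort_keys=sort_keys).encode("utf-8")


def decode(wire: bytes) -> CatMessage:
    return CatMessage(**json.loads(wire))


msg = CatMessage(name="Tom", id=7, destination="node-a")
wire_a = encode(msg, sort_keys=False)  # fields in declaration order
wire_b = encode(msg, sort_keys=True)   # fields sorted alphabetically

assert wire_a != wire_b                          # different byte strings...
assert decode(wire_a) == decode(wire_b) == msg   # ...for the same message
```

Decoding recovers the same value from either byte string, so the mapping between values round-trips, but comparing or hashing the serialized bytes directly would not be reliable without an agreed canonical form.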
