Message ur-formats

Thesis: message formats are not inherently complex. Complex types exist in the system, but not at the level of message processing; handling them is a delegated behavior. However, we have a few ways of representing them internally; they have different purposes (e.g. Protobuf for “your language already has a compiler for protobufs” interoperability, Erlang terms for use inside the node and between nodes on the same VM host, Juvix and noun for execution) but they express the same things.

Therefore, I present an ur-format with very few moving parts for defining messages, with a sort of pseudocode-ish representation, along with ways to represent it in the various languages present in the system. It’s meant as a common denominator, not necessarily a lowest one but a low one.

Opaque

We start by introducing an “opaque”. This is just something which is irreducible as far as the message definition is concerned; it could be anything, but it’s not the message format’s responsibility to look inside it. (It might compare it for equality, sometimes, but I see that as pretty extensional; in cases where it might be problematic, like a program, this is already “a Nockma program” specifically and noun equality suffices.)

Some examples of opaque things could be:

  • an integer
  • a JPEG of a cat
  • a program
  • a proof
  • a message that we’re not processing here and now

In our pseudocode, this is just represented descriptively.

integral-message : integer

We handle an integral-message by making a ‘ping’ sound if it is nonnegative, and a ‘pong’ sound if it is negative.

cat-message : jpeg

We handle a cat-message by decoding the JPEG and displaying it on the screen.

We could just say “binary” here and not be far off, but saying that would imply some sort of requirement to marshal them to bits, always, which is often redundant. Say it’s a function; we would really prefer to just pass it around, until we reach a boundary where that’s not possible, because we probably plan to evaluate it soon.

However, we do want our opaque things to be able to be marshalled to bits; we might send them over the network. If you can make it a noun, you get jam for free; this works for e.g. Nockma programs. If it’s already a binary, you’re also set; this works for e.g. Cairo programs.
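As a rough illustration of "pass it around, marshal only at a boundary", here is a minimal Python sketch. The `Opaque` wrapper, the `encode_integer` byte layout, and the handler name are all hypothetical, invented for this example; nothing here is an existing API in the system.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Opaque:
    """A value the message layer never looks inside (hypothetical helper)."""
    value: Any                      # passed around as-is inside the node
    encode: Callable[[Any], bytes]  # only invoked when we hit a boundary

    def to_bytes(self) -> bytes:
        # Marshal on demand, e.g. just before a network send.
        return self.encode(self.value)


def encode_integer(n: int) -> bytes:
    # Assumed layout, for illustration only: sign byte + big-endian magnitude.
    sign = b"\x01" if n < 0 else b"\x00"
    magnitude = abs(n)
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return sign + magnitude.to_bytes(length, "big")


def handle_integral_message(msg: Opaque) -> str:
    # The handler uses the value directly; no marshalling happens here.
    return "ping" if msg.value >= 0 else "pong"


integral_message = Opaque(-5, encode_integer)
# Something that is already a binary needs no extra work to marshal.
cat_message = Opaque(b"\xff\xd8\xff\xe0", lambda raw: raw)
```

The point of the design is that `to_bytes` is never called on the local path; only a sender that actually crosses a boundary pays for the encoding.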

Record

The above doesn’t suffice for most messages, which have multiple fields. We therefore introduce records with named fields. This is reminiscent of a product type, except that the fields have names. The names are only type information about the fields, though, so you could just call it a product type and have a strong case.

Order sometimes matters here, but not in every language’s representation. If there is an order, the one given is canonical; otherwise the record might just be a map from field name to field value.

In our pseudocode, we’ll use square brackets:

cat-message : [
 name: string
 id: natural
 image: jpeg
 destination: node-id
]
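To make the ordering point concrete, here is a small Python sketch of the same record viewed both ways: as an ordered tuple (where the declared order is canonical) and as a map (where order is irrelevant). The helper names are made up for this example, and `destination` is a plain string standing in for node-id.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CatMessage:
    name: str
    id: int
    image: bytes        # the opaque JPEG
    destination: str    # stand-in for node-id (assumption)


def as_ordered(record) -> tuple:
    # Ordered view: values in declaration order, names dropped.
    return tuple(getattr(record, f.name) for f in fields(record))


def as_map(record) -> dict:
    # Unordered view: field name -> field value.
    return {f.name: getattr(record, f.name) for f in fields(record)}


def from_map(cls, mapping: dict):
    # Rebuilding from a map does not depend on key order.
    return cls(**mapping)


msg = CatMessage("Tom", 7, b"\xff\xd8", "node-a")
shuffled = dict(reversed(list(as_map(msg).items())))
```

Both views carry the same information as long as the field names are known from the type.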

In Elixir, this would look like (using the typedstruct macro to shorten things):

typedstruct do
  field(:name, String.t())
  field(:id, non_neg_integer())
  field(:image, JPEG.t())
  field(:destination, Anoma.Node.node_id())
end

In Juvix:

type CatMessage :=
  catMessage@{
    name : String;
    id : Nat;
    image : JPEG;
    destination : NodeID;
  };

As a protobuf, our opaque things are nearly always going to be bytes:

message CatMessage {
  string name = 1;
  bytes id = 2; // if it's capped, could be uint64 or similar
  bytes image = 3;
  NodeID destination = 4;
}

In Nock, the field names are only type information, so the value would just be a cell (following the brackets-associate-right rule for tuples larger than 2). Here is some pseudo-Hoon, since our Elixir compiler syntax isn’t decided yet:

+$  cat-message
  $:
    name=@t
    id=@u
    image=@
    destination=node-id
  ==

Whatever the compiler, it’s capable of knowing that name is at 2, id is at 6, image is at 14, and destination is at 15 (the last element of a right-nested tuple is the final tail, not a head); this is most of its job.
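For readers less familiar with noun addressing, here is a minimal Python sketch of how those tree addresses fall out of the brackets-associate-right rule. It models cells as 2-tuples and atoms as ints; `tuple_axes` and `slot` are names chosen for this example, not functions from any existing compiler.

```python
def tuple_axes(n: int) -> list:
    """Axes of the n elements of a right-nested tuple [a0 [a1 [... a(n-1)]]]."""
    if n < 1:
        raise ValueError("a tuple needs at least one element")
    axes = []
    axis = 1  # axis 1 is the whole noun
    for _ in range(n - 1):
        axes.append(2 * axis)  # head of the current cell
        axis = 2 * axis + 1    # step into the tail
    axes.append(axis)          # the last element is the final tail itself
    return axes


def slot(axis: int, noun):
    """Fetch the subtree at the given axis (2a = head of a, 2a+1 = tail of a)."""
    if axis < 1:
        raise ValueError("axes start at 1")
    if axis == 1:
        return noun
    parent = slot(axis // 2, noun)
    if not isinstance(parent, tuple):
        raise ValueError("cannot take head or tail of an atom")
    return parent[axis % 2]


# [name id image destination] as a right-nested cell, with toy atoms.
cat_message = (10, (20, (30, 40)))
```

Calling `tuple_axes(4)` gives the address of each of the four fields, and `slot` reads a field back out of the cell at that address.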

Tagged union

Tagged unions are not quite the same thing as plain (untagged) union types, but we want something union-like, and only Elixir and Hoon support untagged unions. This isn’t a huge deal since most practical unions include a tag anyway, but it will matter for some representations later.

We use the vertical bar in our pseudocode, since this is the most common syntax, and angle brackets to set them off since these are the least commonly used brackets.

cat-or-vampire-message : [
 name: string
 id: natural
 image: <cat: jpeg> | <vampire: nil>
 destination: node-id
]

(Vampires, of course, can’t be photographed.)

Elixir typespecs allow untagged unions, so it can be simplified somewhat to:

typedstruct do
  field(:name, String.t())
  field(:id, non_neg_integer())
  field(:image, JPEG.t() | nil)
  field(:destination, Anoma.Node.node_id())
end

However, most cases will actually need the tag; in practice it’ll look more like {:cat, JPEG.t()} | :vampire ({:vampire, nil} would be very non-idiomatic redundancy).

I am not actually certain how to do this in Juvix without pushing the tag outward to the whole message; I just don’t know enough Juvix. I think you’d define something like

type CatOrVampireImage :=
  | Cat JPEG
  | Vampire;

and use it as the field, possibly.

Protobuf would use optional because this is an option:

message CatOrVampireMessage {
  string name = 1;
  bytes id = 2;
  optional bytes cat_image = 3;
  NodeID destination = 4;
}

However, for something more complicated than “optional” it would use oneof instead. This comes with the tag for free (you can check which field of a oneof is set); it also comes with some concerns about ensuring the field number is unique, but all protobuf has that.

For nouns, we’re going to need the tag: atoms are our only ur-element, so we can’t just sum them with nil or similar. We use 0 for “nil”. Pseudo-Hoon:

+$  cat-or-vampire-message
  $:
    name=@t
    id=@u
    $=  image
      $%
        [%cat jpeg=@]
        [%vampire ~]
      ==
    destination=node-id
  ==
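The noun encoding of the image field can be sketched in Python as a cell of tag atom and payload atom. This assumes the usual Hoon convention that a tag like %cat is the atom formed by reading the bytes of its text as a little-endian integer; the function names and the `("cat", bytes)` / `("vampire", None)` input shape are made up for the example.

```python
def tag_atom(name: str) -> int:
    # Hoon-style cord: bytes of the text, read least-significant byte first.
    return int.from_bytes(name.encode("utf-8"), "little")


def bytes_to_atom(raw: bytes) -> int:
    # Note: trailing zero bytes are lost this way; a real encoding would
    # need to carry the length as well.
    return int.from_bytes(raw, "little")


def encode_image(image) -> tuple:
    """Encode ("cat", jpeg_bytes) or ("vampire", None) as a [tag payload] cell."""
    tag, payload = image
    if tag == "cat":
        return (tag_atom("cat"), bytes_to_atom(payload))
    if tag == "vampire":
        return (tag_atom("vampire"), 0)  # 0 stands in for nil
    raise ValueError(f"unknown tag: {tag!r}")


def decode_image(noun: tuple) -> tuple:
    """Dispatch on the head of the cell; the payload stays an opaque atom."""
    tag, payload = noun
    for name in ("cat", "vampire"):
        if tag == tag_atom(name):
            return (name, payload)
    raise ValueError(f"unknown tag atom: {tag}")
```

The receiver only ever inspects the head of the cell to decide which branch it is in, which is what the tag buys us over a bare atom.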

Thanks for this write-up. In general, the concept of a “low common denominator” format for defining messages with these basic elements (opaque, record/product, tagged union/sum) makes sense to me. I think it would be helpful to brainstorm (or to share, if you’ve already thought about this) a bit more of the surrounding context on how we expect this to be used, for example:

  1. Would this ur-format come with some kind of code generation (ur-format pseudocode → Elixir/Juvix/Protobuf/Nock)? Alternatively (or additionally), we could write macros to process native types written in these languages into the ur-format.
  2. Would this ur-format come with a defined, canonical “serialize” and “deserialize” pair (or “to_bytes” and “from_bytes”), implemented in each supported language, that guarantees that an instance of a particular format serialized in one language could be deserialized in another? This will require us to either make some more decisions about how that serialization / deserialization actually works, or choose one language which already has a binary format (e.g. protobuf) and make that language’s binary format canonical.
  3. Do I infer correctly that the tags are purely “semantic” information and would not need to be represented in the serialized (binary) format?
  4. Can we also support JSON? (I expect this should be straightforward)
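On the JSON question, a minimal Python sketch of one possible mapping is below. It is only an illustration under stated assumptions: opaque bytes are base64-encoded, the tagged union is written as an object with explicit `tag` and `value` keys (which presumes the tag is kept in the serialized form, an open question above), and the function names are invented here.

```python
import base64
import json


def to_json(msg: dict) -> str:
    tag, payload = msg["image"]
    doc = {
        "name": msg["name"],
        # Large naturals may need a string encoding for JSON parsers
        # that read numbers as 64-bit floats.
        "id": msg["id"],
        "image": {
            "tag": tag,
            "value": None if payload is None
            else base64.b64encode(payload).decode("ascii"),
        },
        "destination": msg["destination"],
    }
    # sort_keys gives a stable textual form; field order is not significant.
    return json.dumps(doc, sort_keys=True)


def from_json(text: str) -> dict:
    doc = json.loads(text)
    value = doc["image"]["value"]
    payload = None if value is None else base64.b64decode(value)
    return {
        "name": doc["name"],
        "id": doc["id"],
        "image": (doc["image"]["tag"], payload),
        "destination": doc["destination"],
    }


cat = {"name": "Tom", "id": 7, "image": ("cat", b"\xff\xd8"), "destination": "node-a"}
vampire = {"name": "Vlad", "id": 8, "image": ("vampire", None), "destination": "node-b"}
```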

The idea for a minimal common representation arose from the need to make behavioral commitments over messages uniquely definable, without assuming any specific language and implicitly importing implementation details from it into the specification.

It seems most straightforward to me to define a mapping to a representation that most/all implementations will need to deal with, i.e. a canonical wireformat. Ideally that should be one that already supports many languages, e.g. protobuf.

If a language does not have a mapping to and from the wireformat, it can go through a mapping to another language that does, if one is available; but then the correctness of composing parsers to translate through multiple formats becomes a concern.

In theory, we could also make the definitions in protobuf canonical directly, or specify a subset of protobuf as the ur-format, but I’m not sure if we might be importing anything we don’t want.
