Use of multiformats in the Anoma protocol

Recently, while thinking about how to represent different virtual machines, I again came across the excellent work from IPFS/Filecoin on multiformats, which I think might help us with disambiguation and forward compatibility. The basic idea of multiformats, as I understand it, is to make values such as hashes and encodings self-describing by adding a simple numerical prefix which maps, in some canonical table, to a particular hash algorithm, encoding format, etc. I think we could consider using multiformats for hashes and encoding algorithms, as they already do, and we could also consider using them for canonical representations of functions, which would look roughly like this (for example):

| Code | Virtual machine |
|------|-----------------|
| 0x01 | Nockma          |
| 0x02 | Cairo           |
| 0x03 | RISC-V          |

Function representations would then be encoded as the concatenation of the virtual machine code and the actual term (encoded in a way specific to each virtual machine). In order to run a function, a node would need to look up the virtual machine in question and use it to parse and evaluate the term. A node which does not know the virtual machine associated with a particular code can simply fail (and at least the node has the meta-knowledge that it doesn't know this particular format). In the future we can add new virtual machines simply by assigning new codes, and nodes can gradually upgrade, although consensus-critical code updates will need to wait until all relevant nodes have attested that they can support the new format. Using multiple formats does mean that we lose a single clear equality judgement, but we don't really have that anyway. We still have the guarantee that syntactically equal representations are semantically equal, and we can pass around other (verifiable) judgements of semantic equality at runtime.
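
To make the shape of this concrete, here is a minimal sketch in Python, assuming (purely for illustration) single-byte codes and the hypothetical table above; the names `encode_function`/`decode_function` are mine, not anything from the multiformats libraries, and a real table would use the multiformats unsigned-varint convention rather than a single byte:

```python
# Hypothetical VM code table; the codes are illustrative, not canonical.
VM_CODES = {
    0x01: "nockma",
    0x02: "cairo",
    0x03: "risc-v",
}

def encode_function(vm_code: int, term: bytes) -> bytes:
    """Concatenate the VM code with the VM-specific term encoding."""
    return bytes([vm_code]) + term

def decode_function(blob: bytes) -> tuple[str, bytes]:
    """Split a self-describing function value into (VM name, term).

    A node that does not know the code fails, but it at least has the
    meta-knowledge that this particular format is unsupported.
    """
    if not blob:
        raise ValueError("empty function representation")
    vm_code, term = blob[0], blob[1:]
    if vm_code not in VM_CODES:
        raise LookupError(f"unknown VM code: {vm_code:#04x}")
    return VM_CODES[vm_code], term
```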

An interesting alternative to simply using a table would be to write a canonical interpreter for each VM in one particular representation, such as Nockma (or perhaps even Geb), and use the hash of this canonical interpreter as the code. This has the nice benefit of being content-addressed, so we could automatically verify equivalences in certain cases. It would be a lot of initial work, though, and I think we can do this in the future anyway, simply by adding a code for a further-wrapped content-addressed interpreter mode.
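
A rough sketch of that alternative (hypothetical names, and sha2-256 chosen arbitrarily as the hash): the code for a VM is simply the hash of its canonical interpreter, so two parties holding byte-identical interpreters derive the same code automatically.

```python
import hashlib

def vm_code_for(canonical_interpreter: bytes) -> bytes:
    """Derive a content-addressed VM code: the hash of the canonical
    interpreter's encoding (e.g. its Nockma representation), rather
    than an entry assigned in a registry table."""
    return hashlib.sha256(canonical_interpreter).digest()

# Equal interpreter encodings imply equal codes, which is what would let
# us automatically verify (syntactic) equivalences in certain cases.
```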

The really neat thing about using multiformats for representation of virtual machines is that we should be able to bootstrap everything else off a single table of VMs in a content-addressed way, because we can define all other functions (encodings, hashes, etc.) in terms of one of these VMs.
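
One way to picture the bootstrapping claim, continuing the hypothetical sketch above (the placeholder bytes stand in for real terms):

```python
# The VM table is the only primitive registry; every other algorithm
# (a hash function, an encoding, ...) is itself just a function
# representation, i.e. a VM code plus a term in that VM.
def encode_function(vm_code: int, term: bytes) -> bytes:
    return bytes([vm_code]) + term

sha256_fn = encode_function(0x01, b"<nockma term computing sha-256>")
cbor_codec = encode_function(0x01, b"<nockma term for cbor encode/decode>")
```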

cc @degregat @mariari @Moonchild @tg-x @vveiln @terence for further input here

At first glance this is personally interesting, in that one of my long-term intents for Geb has been to use its syntax-independence to allow translation across multiple syntaxes, without each pair of syntaxes having to be aware of each other, by routing hub-and-spoke through a syntax with an IPFS multicodec.

Indeed it’s plausible to view Geb itself as a multicodec for programming languages.

So if you have a multicodec for its syntax, then you have a multicodec for concrete representations of programming languages (and programs).

For a hub syntax, the points are efficiency, ease of parsing, and such, not human readability. So I expect it would be some kind of bytecode representation (as opposed to the textual syntax of most programming languages) of universal morphisms (ultimately the universal objects/morphisms of FinSet, though interpreted differently in different contexts by Geb).

And functions of FinSet are ultimately polynomials, so in the end it would be a bytecode representation of polynomials. Specifically which one(s) would be most efficient for which purposes, however, is a number-theory/cryptography question which I am hopelessly unqualified to answer :slight_smile:
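
To make the "functions are polynomials" step concrete (this is standard Lagrange interpolation, not anything Geb-specific): any function $f : S \to \mathbb{F}_p$ on a finite subset $S \subseteq \mathbb{F}_p$ is computed by the polynomial

$$
f(x) \;=\; \sum_{a \in S} f(a) \prod_{\substack{b \in S \\ b \neq a}} \frac{x - b}{a - b},
$$

which has degree less than $|S|$ and agrees with $f$ on all of $S$, so a bytecode for polynomials suffices to represent such functions.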

For long-term application data this seems useful, and similarly for any sort of serialized data.

For network protocols, however, using protocol-native enums or similar is a better fit for protocol evolution, since the list of allowed algorithms can change over time: new ones get added and old ones removed (e.g. deprecated crypto algorithms).

Do you see these choices as in opposition to each other? Why? It seems to me like we could use multiformats for hashes and encryption algorithms at the network layer as well, but nodes can change which subset they support over time. Is there a disadvantage to that?
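
For instance, a small sketch of "stable codes, evolving support" (the three codes are real entries from the multicodec table; everything else here is made up):

```python
# The global multicodec table is append-only: codes are never reassigned,
# so old data stays identifiable even after an algorithm is deprecated.
HASH_CODES = {0x12: "sha2-256", 0x13: "sha2-512", 0xb220: "blake2b-256"}

# A node's *supported* set is a mutable subset that can evolve over time:
# enable new algorithms, disable broken ones, without touching the table.
supported: set[int] = {0x12, 0xb220}

def can_verify(code: int) -> bool:
    """True iff this node both recognises and currently supports the code."""
    return code in HASH_CODES and code in supported
```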

I wonder how large the hash space should be, such that message size does not get blown up too much (what would that mean concretely?) while still not producing collisions. It sounds to me like this should also support versioning of the canonical interpreters?
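
On size: multicodec codes are unsigned varints, so small codes cost a single prefix byte and the code space can still grow essentially without bound; collisions are a table-management question in the registry variant, and become a hash-collision-resistance question only in the content-addressed interpreter variant. A quick sketch:

```python
def uvarint(n: int) -> bytes:
    """Unsigned LEB128 varint, the prefix encoding multiformats use."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

assert len(uvarint(0x12)) == 1      # codes below 128 cost one byte
assert len(uvarint(0x1000)) == 2    # codes below 2**14 cost two bytes
```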

I think we had a similar idea on engineering, as we need to deal with resources of different kinds being run through things. @ray, @xuyang, and I have discussed this before, and this would be better than what we are doing now.
