Simple Merkleization System

cwgoes · March 29, 2024, 11:42am

Recently, I’ve been thinking about encoding standards, particularly those which we will want to standardize soon for application-facing components of the protocol such as the resource machine - relevant data structures here include resources, transactions, and data processed therein (such as authorizations which are signed).

We don’t want to reinvent-the-encoding-wheel, so one option is to use an existing standard such as protobuf, RLP, or SSZ. It’s not clear that these formats provide exactly what we need, however. I think our needs are roughly:

typed serialization (unlike RLP or JSON) - more efficient and we need to typecheck anyways
supports basic types + recursive product/coproduct types
efficient to deserialize / check deserialization in a circuit
defines canonical Merkle root of every object
as simple as possible given these constraints

One interesting option which I’ve been considering is not to define an encoding scheme at all, but instead only to define a Merkleization one. In particular, consider the following scheme:

Define an ObjectType as either:

a basic type (such as a bytestring or unsigned integer)
a product type (of object types)
a coproduct type (of object types)

i.e. (in Haskell, assuming a basic type BasicType)

data ObjectType
  = BasicType BasicType
  | ProductType [ObjectType]
  | CoproductType [ObjectType]

and, similarly, ObjectValue

data ObjectValue T
  = BasicValue T
  | ProductValue [ObjectValue T]
  | CoproductValue Nat (ObjectValue T)

where the natural number index indicates which coproduct “slot” is occupied.

We can then define the following recursive Merkleization function, assuming hash:

root :: ObjectValue T -> ObjectType -> ByteString
root (BasicValue b) (BasicType _) = hash b
root (ProductValue xs) (ProductType _) = 
  hash $ concat (hash 0 <> map hash xs)
root (CoproductValue index x) (CoproductType _) =
  hash $ hash (x + 1) <> hash x

Represented visually:

IF we simply:

Send around Merkle roots of each object, and
Send any part of the data for that object (encoded in any fashion)

THEN - I think - we do not need to standardize any further encoding at all! All we need to do is send around Merkle roots for all objects and check - when deserializing or processing objects - that the data provided matches the Merkle root. Naively implemented, this would incur a lot of overhead (calling hash many times), but this overhead is probably not meaningful compared to the data storage and bandwidth costs, we can cache to avoid recomputation, etc.

This Merkleization scheme also has several nice features:

it’s possible to verify constant-size parts of a large object with only constant-size data
product type Merkleization can be parallelized
it’s very simple

We will still need to standardize encoding for basic types with proper domain separation (distinct values for different types should not have the same hash), but this can probably be borrowed from existing work.

cc @mariari @Moonchild @vveiln

Moonchild · March 30, 2024, 4:39am

Interesting. This is basically defining a somewhat-more-abstract encoding format, but it is still a concrete encoding format, which will necessarily have redundancies (for instance, x AND (y OR z) is the same identity as (x AND y) OR (x AND z), but those will likely be two different encodings with different hashes), which means we still have to remember incidental structure. We do get to throw away somewhat more of it than we otherwise might without sacrificing hash-roundtrips. But maybe that’s not particularly valuable insofar as we still might want to throw away more structure than that → enforce a strong separation between places that maintain all incidental structure vs. places that get to throw away as much of it as they want → doesn’t matter very much.

Also, this doesn’t avoid the need to specify concrete encoding formats. It doesn’t matter if two parties can agree on the slightly-more-abstract state of a value when given a hash, if they can’t negotiate a way to actually communicate the contents over the wire. Aside that the negotiation itself requires some form of serialisation, it would be Bad if two parties supported disjoint sets of encoding formats and so couldn’t communicate.

This does open the door to having, say, a required serialisation format which is a bit simple and dumb, and allowing parties to transparently negotiate formats that are faster to decode without sacrificing some forms of interoperability. But I expect the performance benefits to be vastly outweighed by the need to compute hashes in this way.

product type Merkleization can be parallelized

Hashing in general can be parallelised.

it’s possible to verify constant-size parts of a large object with only constant-size data

I question the value of this, but if we really want it, we can get it anyway by merkleising w.r.t. some constant chunk size of the encoded data, and then verifying only the chunks that correspond to the parts of the value that we’re interested in. And then we can pick the chunk size so the constant factors on hashing are right (just having the chunk size be a constant helps the constant factors significantly, even if it’s very small!). (Ed: also worth noting that ‘part of a value’ is also defined w.r.t. the concrete encoding: I may want a constant number of bits of information, but I need the entirety of the concrete encoded value to compute those bits; then we can’t save anything.)

Don’t mean to say this is necessarily a bad idea; just some thoughts on the limitations it has.

cwgoes · April 1, 2024, 9:55am

Fantastic feedback, thanks! Responses/questions:

(for instance, x AND (y OR z) is the same identity as (x AND y) OR (x AND z), but those will likely be two different encodings with different hashes)

Yes - by itself, this still doesn’t provide extensional function equality for encodings (encode a == encode b if a = b where a and b are functions). Whether or not that property would hold will depend on how we encode functions themselves into product/coproduct/basic types, and there’s an efficiency tradeoff - e.g. if we encode functions as lookup tables, we can get this property, but that’s horrendously inefficient - and more efficient encodings are unlikely to have a single canonical form for each equivalence class of extensionally equal functions.

which means we still have to remember incidental structure. We do get to throw away somewhat more of it than we otherwise might without sacrificing hash-roundtrips. But maybe that’s not particularly valuable insofar as we still might want to throw away more structure than that → enforce a strong separation between places that maintain all incidental structure vs. places that get to throw away as much of it as they want → doesn’t matter very much.

I’m not sure that I follow this part. In general, our current plan with “extensional equalities” / equivalence relations of interest, such as those for external identities, is to simply track and process them separately, e.g. in the signsFor and readsFor relations processed by the identity machine. Can you give me a few concrete examples of what you’re thinking of here?

Also, this doesn’t avoid the need to specify concrete encoding formats. It doesn’t matter if two parties can agree on the slightly-more-abstract state of a value when given a hash, if they can’t negotiate a way to actually communicate the contents over the wire. Aside that the negotiation itself requires some form of serialisation, it would be Bad if two parties supported disjoint sets of encoding formats and so couldn’t communicate.

That’s right - each two parties must still agree on a serialization format for each interaction - but we no longer need to globally agree on a serialization format. We would still want to recommend one as a default, but I think (ceteris paribus) it’s an advantage not to rely on this default in the protocol - different parties may have different serialization preferences, and the less we need to specify in the protocol, the better.

This does open the door to having, say, a required serialisation format which is a bit simple and dumb, and allowing parties to transparently negotiate formats that are faster to decode without sacrificing some forms of interoperability. But I expect the performance benefits to be vastly outweighed by the need to compute hashes in this way.

I agree that hashing like this adds lots of overhead compared to a scheme which doesn’t require it, but are we really sure that this matters? For replicated verification, we will typically need to Merkleize or at least hash data anyways. For simpler P2P messages, we wouldn’t, but if we wanted to start implementing bandwidth optimizations (e.g. where A sends an object reference to B, B requests parts that B doesn’t know by hash, repeat…) we would eventually, at least in some form.

I don’t know where to find the best benchmarks, but one answer here lists a hashing bandwidth of multiple GB/s - on a single core - which seems like it would be more than sufficient to make network bandwidth - not hashing bandwidth - the bottleneck, even if we basically hash everything that comes over the network.

Hashing in general can be parallelised.

Yes - more specifically I mean the Merkle tree construction, which can be parallelized only per the tree structure since you need the child hashes to compute the parent ones for each level.

I question the value of this, but if we really want it, we can get it anyway by merkleising w.r.t. some constant chunk size of the encoded data, and then verifying only the chunks that correspond to the parts of the value that we’re interested in. And then we can pick the chunk size so the constant factors on hashing are right (just having the chunk size be a constant helps the constant factors significantly, even if it’s very small!).

Yes - I think this is also an option worth considering - but the primary difference to me seems to be whether we Merkleize in a way which preserves some kind of tree structure (as in my proposal above), or in a way which flattens the data structure and uses a constant chunk size (as you propose). There’s a performance tradeoff here between:

the cost of computing the object hash (potentially lower if flattened)
the cost of verifying sub-components (potentially higher if flattened)

Another option worth noting - once we have Merkleized an object one way, we can also Merkleize it another way, make a ZKP that this second Merkleization is correct, and send around the two hashes, proof of equivalence, and Merkle proof for the second way to anyone if it’s a cheaper way of verifying whatever they want to verify.

(Ed: also worth noting that ‘part of a value’ is also defined w.r.t. the concrete encoding: I may want a constant number of bits of information, but I need the entirety of the concrete encoded value to compute those bits; then we can’t save anything.)

For this, you need a succinct proof - then you can get a constant number of bits for bits + C, where C is a constant size (with recent schemes, under 1 KB).

mariari · April 2, 2024, 7:53am

Overall my thoughts are rather positive, however I do have some comments/questions.

Elijah’s Post

I believe, what @Moonchild is getting at, if one happens to know some merkelization of data, then the data itself doesn’t need to be sent, it can be lookedup from some local store/cache. However since this encoding does not define out such an equivalence, the data would need to be sent anyways.

My comments

My Assumptions

To send around a hash, It seems all parties have to agree upon the following given this merkalization schema for any hash h:

Endiness
Encoding of the Basic types for the hash function
The hash function used

Further, along with sending around the result of the root function, we must also send around the result of root-type : ObjectType -> ByteString on the type.

We must send it around, else we reach a collision:

(def h hash)

(def RightArr (Coproduct (Basic Arr) (Basic Arr)))
(def Basic    (Basic Arr))

(root (right [1,2,3,4,5]) ≈≈≻ (h (h 1) (h 1) (h 2) (h 3) (h 4) (h 5))
(root [1, 1, 2, 3, 4, 5]) ≈≈≻ (h (h 1) (h 1) (h 2) (h 3) (h 4) (h 5))

Further, we can’t send around RightArr or Basic raw either, as then we’ll need to agree upon a lot more facts.

Disagreements with reality

My assumptions differ from reality in one important place:

which means we aren’t sending the root-type of the given ObjectType, but rather the ObjectType itself. I believe this would lead us to basically agree upon a encoding scheme, which contradicts the express purpose of the OP:

Therefore I would argue that we should send around the root of the ObjectType rather than the ObjectType itself.

Further, to me I’m not sure the collision case I mentioned earlier was properly thought of in the flow of computation as:

Assumes that we are only sending around the merkle roots themselves, which runs into the collision I posted previously.

Issues in the given specification

This type is quite naive and improperly specified for what is wanted.

Namely in the generic T, this implies all the BasicValue’s must be of the same type. When in reality your example disproves this.

(product 2 "xyz" 3).

What was probably meant was

data Primitives = Integer Integer | ...

data ObjectValue
  = BasicValue Basic
  | ProductValue [ObjectValue T]
  | CoproductValue Nat (ObjectValue T)

Another specification issue is that root is poorly defined.

cwgoes:

We can then define the following recursive Merkleization function, assuming hash:

root :: ObjectValue T -> ObjectType -> ByteString
root (BasicValue b) (BasicType _) = hash b
root (ProductValue xs) (ProductType _) = 
  hash $ concat (hash 0 <> map hash xs)
root (CoproductValue index x) (CoproductType _) =
  hash $ hash (x + 1) <> hash x

The root function is not recursive, as we simply run hash on the product and coproduct, rather than run the root algorithm recursively.

Other notes

how would we encode objects, I assume if we have object style things along with ADT style things, then we’d encode it something like

(root (product "ObjectName" object-details slot1 slot2 ... slotn))

where we essentially replace the Coproduct tag, with an actual name reference of the particular Object.

If we were to consider the standard Either type as an object, then we would likely have this encoding for it right?

(root (product "left" object-details slot1))
(root (product "right" object-details slot1))

Where object-details may be omitted.

Basically since the [] type of product is of an arbitrary size, we can likely cut the coproduct entirely and encode those details as natural numbers in the product.

I’m not sure how much this is thought of and the tradeoffs here.

cwgoes · April 2, 2024, 9:29am

Merkleization of this form would allow for cache lookup, yes, and even distributed cache lookup over the network - but it doesn’t mean that “equivalent” values could be looked up from the cache. I think this question is pretty much orthogonal, though, as I mentioned - we would need a canonical function encoding, which has tradeoffs unrelated to the choice of Merkelization or serialization scheme.

Endiness

Encoding of the Basic types for the hash function

The hash function used

Do you mean endianness? I agree, although I would consider endianness part of (2).

Further, along with sending around the result of the root function, we must also send around the
result of root-type : ObjectType -> ByteString on the type.

If we do not do this, with the current scheme, we would have a collision where the same bytestring could be interpreted differently as two objects with different types, as you mention. My original goal was simply for bijectivity given a type - but we have a few options here:

assume that the recipient always knows the type of the message they expect to receive
send around the root type, as you mention
change the scheme slightly for collision resistance (special sentinel values for product/coproduct)

which means we aren’t sending the root-type of the given ObjectType , but rather the ObjectType itself. I believe this would lead us to basically agree upon a encoding scheme, which contradicts the express purpose of the OP:

I’m not sure that I follow here - we need to standardize encoding for e.g. integers (big-endian versus little-endian), yes. That’s a far simpler problem than a whole encoding scheme for structured data (a la protobuf or borsh) though.

This type is quite naive and improperly specified for what is wanted.

T is a type parameter, it could be instantiated e.g. with Primitives as you mention.

The root function is not recursive, as we simply run hash on the product and coproduct, rather than run the root algorithm recursively.

Ah yes, this is wrong - thanks.

how would we encode objects, I assume if we have object style things along with ADT style things, then we’d encode it something like

Seems possible. I’m not sure it’s always necessary to send around the actual strings; if those are shared between many objects it should be possible to put them in the type, or associate them with it.

tg-x · April 2, 2024, 3:06pm

for reference/inspiration, some encoding formats that consider this:

typedefs (github)
wtf (syntax)
- simplified typedefs
tasl (github)
- inspired by RDF, IPLD, typedefs

cwgoes · April 2, 2024, 4:12pm

These are good inspirations, but as far as I can tell none of them provide for Merkleization - do you know of anything which does, or which tackles this area of the problem space?

tg-x · April 4, 2024, 4:06pm

here’s one I did, which uses convergent encryption, content hash block IDs, and each block contains children refs:

LoFiRe: Local-First Repositories for Asynchronous Collaboration over Community Overlay Networks
see type BlockV0 at the end for the type definition using BARE

cwgoes · April 24, 2024, 11:47am

In LoFiRe, is there a specification as to how data structures (e.g. product/coproduct types as described in this thread) should be represented as objects, or is that left up to the choice of the user (whoever is using the block storage)?

cwgoes · May 3, 2024, 4:01pm

After reading through:

I’m leaning towards:

using SSZ
writing a mapping from abstract/algebraic data types to SSZ
- need to see exactly what this looks like
- if this doesn’t work well, we can reconsider

tg-x · May 3, 2024, 4:29pm

LoFiRe does not define such type system, merely the serialization format with convergent encryption of data blocks, where large binary data is split up into a Merkle tree of smaller blocks.
Such a type system could be built around this primitive of single objects, though.

Wondertan · May 7, 2024, 4:55pm

I recommend looking into https://capnproto.org/

Particularly in its performance characteristics. It provides arena allocators and in-memory
field access from allocated buffers, so there is effectively no serialization and deserialization.

Cap’n Proto has a well-defined and efficient canonical form that is necessary for the cryptographic nature of Anoma.

Cap’n Proto learns from multi-year accumulated protobuf experience and is authored by the same expert.

It reached a stable v1 release after incrementally releasing minor releases from v0.1. This indicates maturity and willingness for long-term support. Besides, Cloudflare uses it extensively, which is another significant signal.

Cap’n Proto provides an RPC framework and a versatile IDL to describe those. It’s not necessary to use it initially, but it might be handy at later stages for interoperability across programming languages.

Moonchild · May 11, 2024, 6:57am

Sorry it took me so long to send this!

which means we still have to remember incidental structure. We do get to throw away somewhat more of it than we otherwise might without sacrificing hash-roundtrips. But maybe that’s not particularly valuable insofar as we still might want to throw away more structure than that → enforce a strong separation between places that maintain all incidental structure vs. places that get to throw away as much of it as they want → doesn’t matter very much.

I’m not sure that I follow this part. In general, our current plan with “extensional equalities” / equivalence relations of interest, such as those for external identities, is to simply track and process them separately, e.g. in the signsFor and readsFor relations processed by the identity machine. Can you give me a few concrete examples of what you’re thinking of here?

Bah, I’m not used to these questions having decidable answers :)—though I’m not sure they are in this case? The exact mechanism of signs-for/reads-for is still not entirely clear to me. If an id can only be key/and/or/top/bot, then it is trivial, given two ids, to canonicalise both and check if the canonical forms are the same, but I don’t think this is how it works given the specs’ mention of ‘reads-for/signs-for evidence’. Either way, we also have turing-complete transactions, for which equality is certainly undecidable in general and hence attestation of equality needs explicit proof.

I guess the general case where this would come up is the following: B wants to prove to C that A said x. Obviously, A has to have signed off on x, and B has to keep around the signature. But now suppose: A signed x, but B actually wants to prove that A said x’, where x’ is abstractly equivalent to (or implied by) x, but concretely different. If B is not acting maliciously, then presumably it has good reasons for doing this, but it has to remember its reasoning, and produce and communicate to C a proof that x’ is equivalent to or implied by x. Whereas, since compute resources are finite, we cannot (for example) keep around all pieces of reads-for/signs-for evidence we encounter forever. (Otherwise, it would be a trivial way to DOS any node.) This isn’t super concrete still, but hopefully slightly more so?

each two parties must still agree on a serialization format for each interaction - but we no longer need to globally agree on a serialization format. We would still want to recommend one as a default, but I think (ceteris paribus) it’s an advantage not to rely on this default in the protocol - different parties may have different serialization preferences, and the less we need to specify in the protocol, the better.

I don’t understand this at all. What is the point of specifying the merkleisation scheme? If it is to define a standard of interoperability, then it doesn’t suffice to interoperate without an encoding scheme. If it is not to define a standard of interoperability, then why is it being defined at all?

I don’t know where to find the best benchmarks, but one answer here lists a hashing bandwidth of multiple GB/s - on a single core - which seems like it would be more than sufficient to make network bandwidth - not hashing bandwidth - the bottleneck, even if we basically hash everything that comes over the network.

I didn’t realise they were that fast (thought it was closer to 2gb/s/core)–neat–but that wasn’t going to be the bottleneck anyway. Performance on von neumann machines is mostly about scheduling movement of data between queues, and you’ve added a bunch of dispatch and heterogeneous data structures–it will be difficult to keep the queues full. (And this is hashing small blocks, which, even if you do manage to amortise the dispatch, there’s plain work involved with hash finalisation that you don’t get to amortise.) And, so I am given to understand, I/O is no longer the bottleneck for anything interesting.

Merkle tree construction, which can be parallelized only per the tree structure since you need the child hashes to compute the parent ones for each level.

Yes—span is tree height. All the more reason to use larger blocks and a wider splay!

performance tradeoff here between:

the cost of computing the object hash (potentially lower if flattened)

the cost of verifying sub-components (potentially higher if flattened)

Perhaps, but you’re playing a constant factors game now, and I’m not at all convinced the finer-grained scheme actually wins; see above re queues.

…but now I’m going to completely contradict everything I said performance-wise and suggest something else: maybe it’s just a bad idea to have large objects, period, for operational reasons, and we should rather have many small objects that refer to each other. I was thinking about this recently in context of content-addressing for code: what should be the unit of content-addressability? If it’s, say, the function, then we lose the ability to share subtrees that multiple functions have in common; but if it’s, say, the AST node, then we pay more overhead for that. So there’s some tradeoffs here–your proposal is like making the base unit the AST node.

I am intrigued by the possibility of proving equivalence of merkleisation schemes. I assume that verification will be reasonably fast, but proving may not be? Is it possible to amortise proof cost across multiple queries?

you need a succinct proof - then you can get a constant number of bits for bits + C, where C is a constant size (with recent schemes, under 1 KB).

Fascinating. This works for arbitrary functions, with verification time independent of the running time of the original function?

Topic		Replies	Views
Data & function encoding Protocol Design	37	392	December 18, 2024
First draft of application state encoding format Specs v2	0	52	May 6, 2024
Stored data format: resource machine <> storage Resource Machine Stuff	21	474	April 3, 2024
Towards a general proof type Protocol Design	22	139	August 30, 2024
Clarifying proof structures Resource Machine Stuff	8	59	December 16, 2024