Data & function encoding

Note for readers: this is a follow-on thread to discussions here, although it should be readable standalone.

Introduction

In the Anoma protocol, we do not want to pick a single canonical encoding scheme (serialization format) or virtual machine, both because:

  • fashions in serialization schemes and virtual machines change over time, and what scheme / VM makes most sense or is most suitable depends on specific hardware and specific applications
  • we do not need to, and avoiding doing so makes the protocol simpler and easier to understand: we can abstract over the implementation details of serialization schemes and virtual machines, and one can understand how the protocol works without understanding those details (or being misled into thinking that they are especially relevant to how the protocol works, which they are not)

In order to facilitate interoperability between encoding schemes and virtual machines, I think we do need to standardize:

  • a type of data types, which allows for data to be deserialized based on a known type and for inputs and outputs to functions represented in different virtual machines to be typechecked
  • a multiformat for encoding schemes
  • a multiformat for virtual machines (more context here)
  • the functions each encoding scheme must provide, and the expected properties of those functions
  • the functions each virtual machine must provide, and the expected properties of those functions

The following is written as a potential component of the specification, in order to define these concepts clearly and lay out how interfaces and interactions would work. I’m putting this down in writing in order to solicit review - particularly from @AHart @isheff @tg-x @mariari @degregat @vveiln @jonathan @graphomath - review from others welcome as well.

Data values and data types

The protocol standardises basic types and general algebraic data types. All messages sent and received within and between nodes and all parts of state must have an associated data type. Data types used by the protocol itself are fixed a priori (by this specification). Implementation languages will typically represent these types as native types in their typesystem, such that the implementation language can provide typechecking. Data types used by applications are chosen by application authors, serialised, and passed around at runtime in order to handle data appropriately.

Data types

Basic types

A basic type is defined as either:

  • a boolean (bit) type
  • a ring type Z_n of natural numbers \mathrm{mod}~n
  • a finite field type \mathbb{F}_n of order n
  • a binary (binary data of unbounded length) type
type BasicType :=
  | BooleanT
  | RingT Nat
  | FiniteFieldT Nat
  | BinaryT

Data types

A data type is defined as either:

  • a basic type
  • a product of other algebraic data types
  • a coproduct of other algebraic data types
  • a function type from one data type to another data type
type DataType :=
  | BasicT BasicType
  | ProductT [DataType]
  | CoproductT [DataType]
  | FunctionT DataType DataType

Basic values

A basic value is defined as either:

  • a boolean value
  • a ring value (a natural number between 0 and n-1, inhabiting the ring type Z_n)
  • a finite field value (a natural number k representing the kth element of the finite field \mathbb{F}_n)
  • a binary (binary data of unbounded length) value
  • a function value (represented with a particular virtual machine, identified by a natural number)
type BasicValue :=
  | BooleanV Boolean
  | RingV Nat
  | FiniteFieldV Nat
  | BinaryV Bytestring
  | FunctionV Nat Bytestring

Data value

A data value is defined as either:

  • a basic value
  • a tuple of other data values (inhabitant of a product type)
  • an option with an index and a data value (inhabitant of a coproduct type)
type DataValue :=
  | BasicV BasicValue
  | TupleV [DataValue]
  | OptionV Nat DataValue
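
As a concrete, non-normative illustration of how these constructors compose (writing the pseudocode constructors as Haskell-style values; the names are purely illustrative):

-- A product type: a pair of a boolean and an element of Z_7.
pairT :: DataType
pairT = ProductT [BasicT BooleanT, BasicT (RingT 7)]

-- An inhabitant of that type: (True, 3 mod 7).
pairV :: DataValue
pairV = TupleV [BasicV (BooleanV True), BasicV (RingV 3)]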

Typechecking

The typechecking relation

typecheck : DataValue -> DataType -> Boolean

can be implemented in the obvious manner. Typechecking functions would require typechecking support from particular virtual machines, which isn’t covered for now, but could be added in the future in a straightforward way.
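
For concreteness, here is a minimal sketch of such an implementation in Haskell, transcribing the type and value grammars above one-to-one; the names, the deriving clauses, and the blanket acceptance of function values are illustrative assumptions, not normative:

import Data.ByteString (ByteString)
import Numeric.Natural (Natural)

data BasicType
  = BooleanT
  | RingT Natural
  | FiniteFieldT Natural
  | BinaryT
  deriving (Eq, Show)

data DataType
  = BasicT BasicType
  | ProductT [DataType]
  | CoproductT [DataType]
  | FunctionT DataType DataType
  deriving (Eq, Show)

data BasicValue
  = BooleanV Bool
  | RingV Natural
  | FiniteFieldV Natural
  | BinaryV ByteString
  | FunctionV Natural ByteString
  deriving (Eq, Show)

data DataValue
  = BasicV BasicValue
  | TupleV [DataValue]
  | OptionV Natural DataValue
  deriving (Eq, Show)

typecheck :: DataValue -> DataType -> Bool
typecheck (BasicV (BooleanV _))     (BasicT BooleanT)         = True
typecheck (BasicV (RingV k))        (BasicT (RingT n))        = k < n
typecheck (BasicV (FiniteFieldV k)) (BasicT (FiniteFieldT n)) = k < n
typecheck (BasicV (BinaryV _))      (BasicT BinaryT)          = True
-- Function values are opaque bytestrings at this layer; really checking
-- them would need typechecking support from the VM in question.
typecheck (BasicV (FunctionV _ _))  (FunctionT _ _)           = True
typecheck (TupleV vs) (ProductT ts) =
  length vs == length ts && and (zipWith typecheck vs ts)
typecheck (OptionV i v) (CoproductT ts) =
  fromIntegral i < length ts && typecheck v (ts !! fromIntegral i)
typecheck _ _ = False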

Encoding scheme

An encoding scheme is a bijective mapping between structured data and a series of bytes, uniquely defined by the pair of serialisation and deserialisation functions.

Serialisation

The serialise function serialises a data value into a bytestring.

type Serialise = DataValue -> Bytes

Deserialisation

The deserialise function attempts to deserialise a bytestring into a data value of the specified type.

type Deserialise = DataType -> Bytestring -> Maybe DataValue

Properties

These functions must be inverses of each other, in that:

  • deserialising a serialised value will result in Just <that value>
  • serialising a deserialised value will result in the same bytestring

Furthermore, the mapping - given a type - must be bijective: fixing a given type, no two distinct bytestrings can deserialise into the same value, and no two distinct values can serialise into the same bytestring. A bijective mapping without fixing a type can be achieved simply by also serialising the type.
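
Stated as executable predicates, the two inverse properties above look roughly as follows (a sketch reusing the Haskell transcription from the typechecking sketch, with ser/de standing for the serialise/deserialise pair of whichever encoding scheme is under consideration):

type Serialise   = DataValue -> ByteString
type Deserialise = DataType -> ByteString -> Maybe DataValue

-- deserialising a serialised value yields Just that value
serialiseRoundTrip :: Serialise -> Deserialise -> DataType -> DataValue -> Bool
serialiseRoundTrip ser de ty v =
  not (typecheck v ty) || de ty (ser v) == Just v

-- serialising a deserialised value yields the original bytestring
deserialiseRoundTrip :: Serialise -> Deserialise -> DataType -> ByteString -> Bool
deserialiseRoundTrip ser de ty bs =
  case de ty bs of
    Nothing -> True        -- the bytestring is not valid for this type
    Just v  -> ser v == bs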

Multiformat

The protocol standardizes a table of encoding schemes, where each encoding scheme is associated with a unique natural number. For example:

Code Encoding scheme
0x01 SSZ
0x02 JSON
0x03 Borsh

Nodes running the protocol then associate each number with a pair of serialise and deserialise functions. In order to interoperate correctly, nodes must agree on which number is associated with which encoding scheme, so this table is part of the definition of any particular protocol version, and new entries to the table, once added, cannot be changed. In general, adding new entries to the table should not break anything - a node encountering an encoding scheme it does not know simply fails.
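
A sketch of what this node-local table might look like, assuming Data.Map together with the Serialise/Deserialise types from the sketches above (names are illustrative):

import qualified Data.Map.Strict as Map

-- Multiformat code -> (serialise, deserialise) for that encoding scheme.
type EncodingSchemeTable = Map.Map Natural (Serialise, Deserialise)

-- Deserialise a bytestring given its multiformat code; an unknown code
-- simply fails (Nothing), as described above.
decodeWithScheme :: EncodingSchemeTable -> Natural -> DataType -> ByteString -> Maybe DataValue
decodeWithScheme table code ty bs = do
  (_, de) <- Map.lookup code table
  de ty bs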

Virtual machine

A virtual machine is a way to represent functions, uniquely defined by three functions encode, decode, and evaluate. The protocol standardizes a table of virtual machines, where each virtual machine is associated with a unique natural number. We will refer to the virtual machine associated with n as VM_n, and its associated functions as encode_n, decode_n, and evaluate_n. Each virtual machine, in the implementation language, also comes with an opaque internal representation type VM_n.t.

Decoding

The decoding function decode_n attempts to decode a DataValue into an internal representation VM_n.t. Decoding which encounters a FunctionV associated with a different virtual machine will simply represent that as data (instead of code) in the internal representation type.

type Decode = DataValue -> Maybe VM_n.t

Encoding

The encoding function encode_n encodes an internal representation of a function and/or data into a DataValue. Functions in the internal representation will be serialised in some fashion and paired with the natural number associated with the virtual machine in a FunctionV.

type Encode = VM_n.t -> DataValue

Properties

The encoding and decoding functions must be inverses of each other, in that:

  • decoding an encoded value will result in Just <that value>
  • encoding a decoded value will result in the original data value

Evaluation

The evaluation function evaluate_n calls a function (in the internal representation) on the provided list of arguments (also in the internal representation). Evaluation must be deterministic. Evaluation must also meter gas, a measure of compute and memory resource expenditure. Different virtual machines will have different gas scales. Evaluation takes a gas limit. During VM internal execution, gas must be tracked, and evaluation must terminate if the gas limit is exceeded. Should execution complete successfully within the gas limit, the VM must return the gas actually used.

type Evaluate =
  VM_n.t ->
  [VM_n.t] ->
  Natural ->
  (Maybe VM_n.t, Natural)

Note: In the future, gas will likely change from a scalar to a vector to allow for metering compute and memory resources differently.

Provable virtual machines

A provable virtual machine is a virtual machine with a proof type P_n and two additional functions parameterized over P_n, prove_n and verify_n.

Proving

The proving function prove_n generates a proof for a given program (logical relation), public input, and private input.

type Prove =
  VM_n.t (T0 -> T1 -> boolean) ->
  VM_n.t (T0) ->
  VM_n.t (T1) ->
  P_n

Verification

The verification function verify_n verifies a proof for a given program and public input.

type Verify =
  VM_n.t (T0 -> T1 -> boolean) ->
  VM_n.t (T1) ->
  P_n ->
  boolean

Properties

These functions must be correct and complete, in that:

  • valid proofs can only be created for valid inputs (correctness), and a valid proof can be created for any valid input (completeness)
  • i.e. verify f public proof = true if and only if proof = prove f public' private' where public = public' and evaluate f [public', private'] g = (true, _) for some sufficient gas limit g (we could probably split evaluation into gassy and gasless versions)

Should P_n not reveal any information about VM_n.t (T0), the provable virtual machine can be said to be zero-knowledge.

Multi-functions

Multiencoding

The multiencode function takes a DataValue and a set of preferences, and tries to encode it in a bytestring according to those preferences. Preferences determine which encoding scheme(s) is/are chosen, and whether or not we attempt to convert between virtual machine representations (requiring compilation, which may not be supported in all cases). Multiformat codes will be included to indicate which formats are in use (see here for a longer description of how this works).

multiencode : Preferences -> DataValue -> Bytestring

Multidecoding

The multidecode function takes a value in bytestring representation and tries to decode it into an internal representation, according to the multiformat information and the encoding schemes known by the decoding party. Attempting to decode unknown formats will result in an error.

multidecode : Bytestring -> DataType -> Maybe DataValue
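
A minimal sketch of this dispatch, under some simplifying assumptions: the EncodingSchemeTable sketch above, preferences reduced to a single preferred scheme code, and the multiformat code written naively as one leading byte rather than a varint:

import qualified Data.ByteString as BS

multiencodeSketch :: EncodingSchemeTable -> Natural -> DataValue -> Maybe ByteString
multiencodeSketch table code v = do
  (ser, _) <- Map.lookup code table
  pure (BS.cons (fromIntegral code) (ser v))   -- prefix the multiformat code

multidecodeSketch :: EncodingSchemeTable -> ByteString -> DataType -> Maybe DataValue
multidecodeSketch table bs ty = do
  (codeByte, rest) <- BS.uncons bs             -- read the multiformat code back
  (_, de) <- Map.lookup (fromIntegral codeByte) table
  de ty rest                                   -- unknown codes fail at the lookup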

Equality

Note that, in general, equality of multiencoded representations implies equality of data values, but equality of data values does not imply equality of multiencoded representations (as different encoding schemes may be used). Phrased more succinctly:

  1. multiencode a = multiencode b → a = b
  2. multiencode a /= multiencode b does not imply a /= b

Usage

In general, canonical commitments to data and code are made over the output of multiencode.
Implication (1) guarantees that (subject to the usual cryptographic assumptions) equality of two succinct commitments (cryptographic hash function applied to the multiencoded value) implies equality of the data values so encoded and committed to.
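
A sketch of such a commitment, using SHA-256 (via the cryptohash-sha256 package) purely as a stand-in hash function, and assuming multiencode and Preferences as specified above; the protocol does not fix these choices here:

import qualified Crypto.Hash.SHA256 as SHA256

-- A succinct commitment is a cryptographic hash of the multiencoded value;
-- by implication (1), equal commitments imply equal data values (modulo the
-- usual collision-resistance assumptions).
commit :: Preferences -> DataValue -> ByteString
commit prefs v = SHA256.hash (multiencode prefs v)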

Multiencoding is also used before storing data or sending it over the network. Any party who knows the encoding scheme table should be able to take any piece of stored or sent data and deserialise it with multidecode.

Multievaluation

The multievaluate function evaluates the application of a function to a value. Whenever multievaluate encounters a FunctionV n f, it looks up the appropriate virtual machine as specified by the natural index n, decodes f using decode_n, then calls evaluate_n, tracking and summing the gas used in subsequent evaluations.

Note: this implies a uniform gas scale, the details of which we elide for now; achieving it would probably require e.g. benchmarking on the hardware in question.

multievaluate :
  DataValue ->
  [DataValue] ->
  Natural ->
  Maybe (DataValue, Natural)
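
A sketch of the dispatch-and-gas logic just described, modelling each virtual machine as its decode/encode/evaluate triple with the internal representation hidden behind an existential (this assumes the Haskell transcription and Data.Map usage from the sketches above; ExistentialQuantification is a GHC extension):

{-# LANGUAGE ExistentialQuantification #-}

-- decode_n, encode_n, and evaluate_n for one VM, with VM_n.t hidden.
data SomeVM = forall t. SomeVM
  (DataValue -> Maybe t)                        -- decode_n
  (t -> DataValue)                              -- encode_n
  (t -> [t] -> Natural -> (Maybe t, Natural))   -- evaluate_n

type VMTable = Map.Map Natural SomeVM

multievaluateSketch :: VMTable -> DataValue -> [DataValue] -> Natural -> Maybe (DataValue, Natural)
multievaluateSketch vms fun@(BasicV (FunctionV n _)) args gasLimit =
  case Map.lookup n vms of
    Nothing -> Nothing                          -- unknown VM
    Just (SomeVM decode encode evaluate) -> do
      f  <- decode fun                          -- decode the function itself
      xs <- traverse decode args                -- decode the arguments
      case evaluate f xs gasLimit of
        (Just r,  gasUsed) -> Just (encode r, gasUsed)
        (Nothing, _)       -> Nothing           -- evaluation failed or ran out of gas
multievaluateSketch _ _ _ _ = Nothing           -- not a function value

This handles a single top-level call; summing gas across nested calls into other virtual machines is the subject of the next section.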

Recursive multievaluation

What if one VM is passed a data value including a FunctionV represented in a different VM? The VM in question can treat this code as data - in the sense that it can examine, introspect, modify it, etc. - but it cannot treat this code as code, since it doesn’t know how to interpret functions represented in another VM (in general). However, we can allow one VM to call another easily enough simply by passing a pointer to multievaluate itself into the evaluation context. Then, when it encounters a function encoded for a different VM which it wishes to evaluate, the VM can simply call multievaluate, tracking gas consumption as appropriate. This technique can also be used to allow for something such as a data query, where data queries are represented as functions understood by a specific, specialized VM.
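
One way to sketch this "pointer to multievaluate in the evaluation context" is as a host callback handed to each virtual machine (the exact ABI/memory layout is elided, as the note below says; all names here are illustrative):

-- The host-provided callback: exactly the signature of multievaluate.
type HostCall = DataValue -> [DataValue] -> Natural -> Maybe (DataValue, Natural)

-- A VM whose evaluate_n additionally receives the host callback, so that when
-- it encounters a FunctionV for another VM it can hand evaluation back to the
-- host, which dispatches and meters gas as before.
data SomeHostedVM = forall t. SomeHostedVM
  (DataValue -> Maybe t)
  (t -> DataValue)
  (HostCall -> t -> [t] -> Natural -> (Maybe t, Natural))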

Note: there’s some ABI/FFI memory layout logic to be figured out here - multievaluate must be called with a function and arguments formatted as DataValues - this should be specified in detail.

Multicompilation

What if we wish to convert functions between different VM representations? We can define a multicompile function which attempts to compile functions represented in a data value based on a set of preferences. For example, preferences could be to convert all functions to a particular VM, or to convert where conversions are known in some order of preference. Compilation will fail if unknown VM conversions are attempted.

Multicompilation depends on a known set of conversions compile_{i,j} which convert between VM_i.t and VM_j.t representations of functions, which must preserve extensional equality under evaluation.

multicompile :
  Preferences ->
  DataValue ->
  Maybe DataValue
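
A sketch of this, with preferences reduced to a single target VM code and the conversions compile_{i,j} modelled as a table of DataValue-level rewrites turning a FunctionV i into an extensionally equal FunctionV j (all names are illustrative):

-- compile_{i,j}: rewrite a FunctionV i _ into an equivalent FunctionV j _.
type ConversionTable = Map.Map (Natural, Natural) (DataValue -> Maybe DataValue)

multicompileSketch :: ConversionTable -> Natural -> DataValue -> Maybe DataValue
multicompileSketch convs target v = case v of
  BasicV (FunctionV i _)
    | i == target -> Just v                       -- already in the target VM
    | otherwise   -> case Map.lookup (i, target) convs of
        Nothing   -> Nothing                      -- unknown conversion: fail
        Just conv -> conv v
  TupleV vs    -> TupleV    <$> traverse (multicompileSketch convs target) vs
  OptionV i x  -> OptionV i <$> multicompileSketch convs target x
  _            -> Just v                          -- other basic values untouched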

Equality proofs

With multicompilation, we can create evidence that a = b, where a and b are arbitrary data values, including functions (where multiencode a /= multiencode b). This evidence would consist simply of a proof that multicompile prefs a = b for some preferences prefs. This proof could be of varying practical forms - the verifier could simply run multicompile themselves, the verifier could trust another’s run of multicompile, or the verifier could check a succinct proof of computational correctness. Many details are elided here for now.

Smart multievaluation

With multicompilation and equality proofs, we can also define a version of multievaluate which uses available evidence + preferences intelligently at evaluation time to use certain known-to-be-equivalent versions of functions (e.g. compiled versions) instead of others. Then known optimized versions of functions can be used, and even something like “JIT” compilation can happen e.g. asynchronously once certain statistical thresholds are met. Optionally, this “smart multievaluate” can also use known proofs of the results of certain evaluations (instead of repeating the evaluations), where provable VMs are involved.

Afaik correctness and completeness correspond to the same property in the context of proving systems, and since we are bringing in zero-knowledge here, that context makes the use of both correctness and completeness confusing. The correctness property here rather corresponds to the soundness property (no false statement should have a convincing proof) in proving systems.


Are you sure? Finite fields can’t be ordered. You’re ordering representations, which means you’ve smuggled in a representation.


Good point. Still, we do want to be able to reason about finite field elements in our typesystem, and to pass them between virtual machines. Do you have an idea of what best to do here?

Every real-world finite field is:

  • a prime field GF(p) (like, something you have an elliptic curve over) which is just \mathbb{Z}/p\mathbb{Z}, or
  • it’s a binary field GF(2^n) representable, if not necessarily canonically, as binary data of n bits

Finite fields GF(p^n) for p > 2 are likely not going to come up, though I’m not up on all my arcana (if they do, they’re just polynomials over GF(p) with n terms, i.e., an array of n integers mod p).


a multiformat

what is “a multiformat”? I know you define a specific “multiformat” later, but what do you mean, in general, by this term?

| ProductT [DataType]
| CoproductT [DataType]

this creates de facto unit ProductT [] and empty CoproductT [] types. This is fine, but is this intended?

It would probably be appropriate to also define a standard set of primitive functions which are expected to be supported over the types. Bit operations for bits, ring and field operations for the fields, I don’t know what for bitstrings (appending 1 and 0 + folding?), projections for tuples, universal properties for both products and coproducts, and application and currying for function types. Something like that. Otherwise, what makes the rings rings and the fields fields?

An encoding scheme is a bijective mapping

this seems too strict. It’s reasonable to expect encoding schemes, even efficient ones, to merely be injective to Bytes. For example, essentially any prefix-free or error-correcting encoding, common in communication, won’t be bijective.

type Serialise = DataValue -> Bytes

what, exactly, is “Bytes”? Is this different from “binary data of unbounded length”? Is the notion of an actual byte (eight bits) relevant here?

Also, this looks like it’s defining the type of the Serialise function. Is there supposed to be a function called something like serialise such that serialise : Serialise?

Also, though this is not significant really but I figured it’s worth pointing out; “serialise” is the less common UK specific spelling of serialize. Even in the UK, the Oxford spelling is “serialize”. I don’t personally care that much, but do with that what you will.

The deserialise function attempts to deserialise a bytestring into a data value of the specified type.

So the encoding isn’t a bijection?

These functions must be inverses of each other, in that:

Deserialise (Serialise x) = Just x

Deserialise x = Just y ⇒ Serialise y = x

must be bijective: fixing a given type, no two distinct bytestrings can deserialise into the same value, and no two distinct values can serialise into the same bytestring.

This doesn’t describe a bijection, this just describes two injections going in opposite directions. For this to be a bijection, Serialise and Deserialise have to be surjections; that is, all bytestrings must decode into some unique value and Serialise has to be able to produce any bytestring on some input.

Maybe I don’t understand what you mean by “fixing a type”. Do you mean we define a set Repr T = {x | ∃y. (y : T) & Serialise y = x} ⊆ ran Serialise, and, if we restrict the domain of Deserialise to this set, then we get a bijection? And, similarly, if we restrict the domain of Deserialise to ran Serialise, we also get a bijection? If this is what you mean, it’s not clear.

Virtual machine

Provable virtual machines

Most of this stuff falls in line with what we talked about yesterday. I think this part broadly makes sense.


If we want to use Nat as a parameter here, shouldn’t Nat be a basic type too?

What does the index do here? Do we assume a canonical ordering of internal types and use the index to identify which type the element is an inhabitant of? If so, we need to make this canonical ordering explicit.

This sounds like something we want to be able to restrict on the user side via policies, to limit recursion depth, but maybe limiting gas expenditure is enough.

Thinking more about it, users might want to express preferences for encoding/decoding/evaluation in general, since different encoding schemes and VMs will have different security properties, and users might want to set policies to allow/disallow/limit accordingly. This should be done via policies that are always checked, not only as preferences for multiencode and smart multievaluate.

While bijectivity is a nice-to-have property, I wonder how feasible it is in practice.
Which formats provide this property?
Is this a hard requirement to ensure some specific property in our system?

More details about the nuances of bijectivity are in this discussion from BARE; it turns out not to be that easy, with lots of subtleties around ints, floats, maps, UTF-8 strings, etc.
BARE is close but does not pursue this property, although with a custom encoder that ensures this could be possible.


You want injectivity of encode (we can encode any actual value to an encoded value distinct from every other encoded value) and surjectivity of decode (an encoded value can represent any possible actual value); injectivity of decode (no two encoded values decode to the same actual value) is difficult but has some desirable features (comparing encodings for equality); and surjectivity of encode (every bit string can decode to something) is theoretically possible but extremely painful to do and most people don’t bother.

The issues in the link aren’t real; the posters there have some really deranged idea that if you have a utf-8 string (which is some binary data) or a float (which is some binary data) you want to go in and inspect the semantics of some completely different system as part of your serialization scheme; this is clearly deranged and no one actually wants their wire format/generic serializer to delve into Unicode arcana to unify “á” made of one codepoint with “á” made of two codepoints, they want their UTF-8 string put in a field. Even less do they want all NaN floats coalesced, since usually when you have floats you multiply them, not compare them. (And of course, NaN != NaN, which is obvious if you understand floats (the naming convention is poor since none of them are numbers))

But even though they aren’t real issues, this sort of encoding generally isn’t bijective. There are binaries which are not the encoding of anything. The alternative is extreme difficulty.


For the most part, creating a bijective encoding isn’t actually that hard so long as your underlying structure doesn’t have a difficult notion of isomorphism (for example, unlabeled graphs are hard to do this with, but you can do that inefficiently with a stream/filter/index encoding). See this stuff.

However, it precludes many common features of encodings, such as being prefix free.


Addressing most of @AHart’s and @degregat’s comments here, I will address the questions around injectivity/bijectivity separately.

I took the word and concept mostly from Multiformats (originally developed by Protocol Labs). In general, by “multiformat”, I mean an interface (consisting of typed functions) associated with a table of multiple implementations of that interface, each associated with a unique non-negative integer, such that the integer can be included to indicate which implementation should be used to interpret/process some data. Conceptually, I also understand multiformats as basically an implementation of runtime-updateable typeclasses (in the Haskell sense), where new instances can be added associated with new codes.

No, good point, these should be non-empty lists.

Agreed.

No, not really, good point.

:laughing: - revealing my pretensions of British English eh? It shall be changed.

Not sure - the type of natural numbers and the type of binary data are the same, right (modulo interpretation - they have the same members)? Do we need that distinction encoded at the type level?

Yes. The canonical ordering of members of a coproduct type is explicit in the list CoproductT (list) - are you asking for something more, or is that sufficient?

Yes, good points, agreed.


Thanks for the clarifications here. I believe that we want:

  • injectivity of encode
  • surjectivity of decode
  • injectivity of decode, given a particular fixed type (as used with deserialise), which will allow us to compare encodings for equality (as you mention)

What is difficult about injectivity of decode?

Then it’s a bit confusing to have different BasicValues be parametrized over Boolean, Nat, and Bytestring (where a boolean could also be a bytestring, with 0x00 == false and everything else == true, or vice versa).

Maybe we want to have a section on PrimitiveValues, or somewhere else, where we make the requirements for Boolean, Nat, ByteString and their interpretations explicit?

Should be sufficient if this is made explicit, e.g. by defining [ ], which seems to be “implicit” but basic so far.


Fair, I’ll clarify this, we can make all the types clearly distinct and the domains explicit.

:+1:


@cwgoes Do I understand correctly that we would build an option type as follows?

type SomeOptionT := CoproductT [ SomeType, Nothing ]

If we want to have option types, we probably want to include Nothing. For general ones, Just would also be useful.

Or do we have “|” s.t. we can just define further types, in the same sense that the basic definitions are introduced? If so, we should make that explicit.

Yes. The data constructors Just and Nothing would be partially applied OptionV, specifically OptionV 0 and OptionV 1 in this case. Typically the programmer wouldn’t need to write these - further abstraction would be provided by the language (e.g. Juvix) - but I don’t think we need to add anything to the typesystem for what you seek here.
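
For concreteness, a non-normative sketch of Maybe Bool along these lines, with the Nothing branch carrying the empty product as a unit (note that the earlier amendment requiring non-empty Product/Coproduct lists would call for a dedicated unit type instead):

-- Maybe Bool: index 0 is the Just branch, index 1 is the Nothing branch.
maybeBoolT :: DataType
maybeBoolT = CoproductT [BasicT BooleanT, ProductT []]

justTrue, nothingV :: DataValue
justTrue = OptionV 0 (BasicV (BooleanV True))   -- Just True
nothingV = OptionV 1 (TupleV [])                -- Nothing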

No - this basic typesystem only allows construction of types from the primitives provided, and there’s no way to declare a named type. I imagine that named types would often be useful, but they should be provided at a different layer, I think, both for simplicity here and because we need to manage name mappings.


Notes from discussion with @jonathan:

  • Basic types at this level could be simplified to Nat and FinSet
  • Product/CoproductT could take two DataTypes instead of a list
  • FunctionT should move to BasicType
  • Should List be a basic constructor?

With FiniteSet n, a typeclass could define operations that characterize rings & finite fields (or other things).

Alternatively put, this proposal would simplify basic types to simply specify the size of a value (finite of a particular size or infinite), and leave any kind of semantic interpretation to a separate layer (which should still be standardized in the specs, but doesn’t really need to be part of the encoding per se). I think this makes sense, but then we also need to figure out what this second layer looks like. @jonathan mentioned that the specs currently have a standard prelude here which is an effort in a similar direction.

If we were to characterize FinSet in terms of finite limits and colimits (and exponential objects) rather than just finite products and coproducts, then we could express both constraints and equivalences in the data types themselves. That in turn would allow us to write things like finite fields with their constraints in the type system itself, rather than postulating them as basic types (and thus having to figure out all the basic types that we need ahead of time). It would suffice for example to recursively include all of:

  • Initial object
  • Binary pushouts
  • Terminal object
  • Binary pullbacks
  • Exponential objects

A system with recursive types would be definable by extending the introduction/elimination rules specified by the universal properties above with those of W-types built solely from the above, where W-types are specified as, for example, the initial algebras of functors defined as in polynomial functor in nLab (example code at geb/geb-idris/src/Library/IdrisCategories.idr at 7253188543c0839a0c428f11fd6978d54d945333 · anoma/geb · GitHub, doubtless among many other places), where the four objects and three morphisms involved all come only from FinSet. M-types (terminal coalgebras, i.e. not-necessarily-well-founded types such as streams) can also be defined in terms of those.


Thanks - what you propose sounds very helpful, but I’m afraid I don’t understand all of the concepts involved here myself yet - @jonathan / @AHart / @Jamie does this proposal make more sense to you, perhaps?

A typical really simple and fast encoding will have some bits that are sometimes ignored when they don’t apply, because we care a lot about having fixed widths for things (even if, when produced by the encoder, they’re always all 0). You can sort of take the easy way out by declaring that if a currently-unused bit is not 0 it’s invalid, I guess, but now you’ve sacrificed simple and fast by putting in more branches.
