Data & function encoding

By my reckoning, your substantive new suggestions are

  • pullbacks and pushouts
  • (limited) W-types
  • stating things in compositional rather than applicative form

Pullbacks are just subsets of products. Their elements would be pairs w/ an equation. In other words, they are special cases of a subset type; a subset of an existing type with a predicate that each element must satisfy. In other words, it’s what we already have w/ products plus a semantic side-condition that the type checker knows about. Do we want this? It’s not relevant to the serializer, but it could be used by a type checker.
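For reference, the concrete reading of “subsets of products” here: the pullback of f : A \to C and g : B \to C is

A \times_C B = \{\, (a,b) \in A \times B \mid f(a) = g(b) \,\}

with the two projections restricted to it.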

Pushouts are a separate story. Their point is to have syntactically distinct things which are considered equivalent by fiat. Syntactically, a pushout’s elements are the same as a coproduct’s, but some inl(x)s are considered the same as some inr(y)s. What is the serializer supposed to do with that? I think the answer is “nothing”, and it’s just the type checker that sees them as the same. This disconnect raises the question of what the exact relation between the serializer and the type checker is supposed to be. That’s a question for @cwgoes, I suppose. Then again, I could be totally wrong. If you have something else in mind for how pushouts are supposed to work, it would help to get a better understanding of your suggestion. The same goes for all my interpretations.
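For reference, the standard presentation of the “equivalent by fiat” reading: the pushout of f : C \to A and g : C \to B is the coproduct quotiented by the identifications induced by C,

A +_C B = (A + B)/\sim \quad\text{where}\quad \mathrm{inl}(f(c)) \sim \mathrm{inr}(g(c)) \ \text{for all } c \in C.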

I think there are two discussions going on here. The first goal is to define a canonical set of serializable types supported by VMs and, separately, there’s a goal of defining a type theory for reasoning about interfacing between these VMs that’s some happy medium between minimal and practical. It does not seem like arbitrary nestings of function types should be serializable, but the meta-type checker should be able to reason about them. It’s not clear from your suggestion what, exactly, the serializable data is supposed to be. I think you started with FinSet since everything in it (and everything emerging from it via the W-type construction) will have serializable elements. At the same time, what is the serializer supposed to do with predicate/quotient information? The type checker can use that, but does it need to be limited to what you described? I don’t think so, but I’m also shaky on what the complete goal is with that.

As for W-types, that’s worth considering. It may cut the knot on the question of whether lists etc. should be a fundamental supported type. It’s really not obvious to me that your specific restriction is the correct one. In general, so long as, for any serializable X, F(X) is also serializable, the initial algebra of F will be serializable, since all its elements are just finitely many F layers. Equivalently, if F(ℕ) ≃ ℕ then Fix F ≃ ℕ (where the latter isomorphism can usually be defined as the ana+cata over the (co)algebras in the F(ℕ) ≃ ℕ isomorphism). Interface-wise, I think we can assume F-catamorphisms (folds) AND F-anamorphisms (unfolds), since I don’t believe there’s a termination requirement. (As an aside, if termination isn’t required, should the type theory have a bottom element? Should it have intersection or union types? Do we want our type theory to be more extensional rather than intensional?) I think this would be larger than your restricted W-types. As an example, the initial algebra of List will have serializable elements. Correct me if I’m wrong, but I don’t think that type can be defined under your restriction.
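To make that concrete, a small Haskell sketch (all names mine, purely illustrative): given any isomorphism witness between F ℕ and ℕ, cata and ana give the two directions of Fix F ≃ ℕ; as noted, decoding carries no termination guarantee unless the coalgebra is well-founded.

import Numeric.Natural (Natural)

newtype Fix f = In { out :: f (Fix f) }

-- catamorphism (fold) and anamorphism (unfold) for the fixed point of f
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . out

ana :: Functor f => (a -> f a) -> a -> Fix f
ana coalg = In . fmap (ana coalg) . coalg

-- an assumed isomorphism witness between f Natural and Natural
data Iso a b = Iso { to :: a -> b, from :: b -> a }

-- given f ℕ ≃ ℕ, cata/ana give the two halves of Fix f ≃ ℕ
encodeFix :: Functor f => Iso (f Natural) Natural -> Fix f -> Natural
encodeFix i = cata (to i)

decodeFix :: Functor f => Iso (f Natural) Natural -> Natural -> Fix f
decodeFix i = ana (from i)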

As for things being in compositional rather than applicative form, this would mean all our data are morphisms and, for example, the stuff being serialized will be (some of?) the morphisms out of the terminal object. That’s going to look very alien to most people, and I’d suggest working out an example. For example, what would constructing a finite field look like? What would a user actually type to get that done? And would this be better than the applicative style (e.g. something like Agda records)? To me, that seems more like a UI question. As someone who’s not too enamored with point-free programming, I’m not going to advocate it over an applicative style, but it’s something worth considering. You’ll have to put a clearer image in people’s heads, though, for a proper determination to be made.
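As a minimal illustration of the compositional reading (not the finite-field example, which I’ll leave to you): a value x : A becomes a global element x : 1 \to A, a pair becomes \langle x, y \rangle : 1 \to A \times B, and an application f(x) becomes the composite f \circ x : 1 \to B.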


I’m responding to @cwgoes’s direct question “does this proposal make … sense to you”, where ‘this proposal’ refers to the post by @terence here.

I’m afraid that I don’t have a clear answer to this, but I can make some general comments:

  1. Generally speaking, expressing something simple in terms of primitives that are more complex is not advised.
    There may be good reasons for this – e.g. if the more-complex primitives are inherently canonical (e.g. characterising the natural numbers as \omega); or fixed by some environment that we decide we want to live in (e.g. the impredicative characterisations that come out of System F) – but if so, this needs to be carefully justified. Such a justification is not evident to me from reading the text (though this may reflect a lack of background knowledge on my part, of course).
    Can such a justification be provided, preferably in elementary terms?

  2. The assertion “If we were to characterize FinSet in terms of finite limits and colimits (and exponential objects) rather than just finite products and coproducts” invites the question of why we would want to characterise FinSet in such terms in the first place (cf. previous point).
    I don’t mean this in a hostile way; it just seems to be assumed in the post that this is desirable. Why?

  3. Is some fixed type regime assumed here (Haskell? Idris? Something else?)? Or are we contemplating building one from scratch?

  4. A standard technique is to write down whatever types you want (just as they are and without trying to shoehorn them into another system) and develop intro- and elim-rules and a term language to populate them in Curry-Howard style. The discussion in this thread seems to invite such an analysis. Has this been considered?
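    For instance, the familiar intro- and elim-rules for a product type, which this style yields directly for each type one writes down:
    \frac{\Gamma \vdash a : A \quad \Gamma \vdash b : B}{\Gamma \vdash (a,b) : A \times B} \qquad \frac{\Gamma \vdash p : A \times B}{\Gamma \vdash \pi_1(p) : A} \qquad \frac{\Gamma \vdash p : A \times B}{\Gamma \vdash \pi_2(p) : B}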


Thanks @terence @AHart and @jamie for your input - very helpful for me. To synthesize at a high level here for now:

  • My goal in this initial approach is to figure out the minimum necessary to standardize data and function encoding in the operational sense. Typechecking and reasoning are goals we may have later, but they aren’t the immediate goal here - and thus I think we should avoid introducing anything that adds proof obligations into these basic types right now.
  • This level will definitely not be the level at which programmers actually write - programmers will usually be writing at higher levels of abstraction, which might include other types, other type-systems, proof obligations, etc. In a sense I think we can understand this level as the “type-erased” level - the minimum information necessary for serialization and execution.
  • Given all that, I think that we should split out these more complex types and embedded proof obligations into a separate discussion (which we should also have, but which I will not immediately aim to merge into the specs).

@cwgoes To me it’s not clear what framework is implicit in the term “in the operational sense”. This could mean two things:

  1. We assume some existing framework X (e.g. X=Idris). In this case, are we embedding our data and functions as types of X, or as terms of X?

  2. We do not assume any existing framework. We just want to express (e.g. in mathematical notation, or in pseudocode) what we want.

Re-reading the thread above, I suspect you intend option 2. This is what you are doing in your original post.

@cwgoes I would be grateful if you could please confirm?


Yes - we do not want to assume any existing framework, but rather specify (here a bit imprecisely using pseudocode and mathematical notation, and in the future perhaps more precisely in the specs) what we want.

OK, thank you, this is clear.

Here’s a random question then. This pseudocode

multidecode : Bytestring -> DataType -> Maybe DataValue

seems to imply dependent types, since the type DataValue may depend on the contents of the Bytestring.

Let’s try to prune this complexity and see what happens. How would you feel about just removing the types, and working in an untyped setting? To frame this slightly differently: we treat data as an abstract algebraic object (of course in an implementation, types would be provided), and focus on how that data is manipulated without worrying too much about what it represents (numbers, finite field elements, etc).

The first half of your original post would get abstracted/simplified away. E.g. serialisation would be present, but would just take data to data.

What would be left would be the stuff that’s currently in your headlines, such as (casting a quick glance over the text):

  • Encoding & decoding.
  • Multi-functions.
  • Multicompilation.

That was not what I meant to imply (perhaps this pseudocode style is ambiguous) - DataValue is itself a type, specifically a sum type whose inhabitants are a basic value, a tuple value, or an option value, as described in the first post. Which inhabitant of that type is returned depends on the contents of the bytestring, but the DataValue type itself is fixed - no dependent types here.
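For concreteness, something like the following sketch (the constructor and field names here are illustrative placeholders, not the spec’s):

import Numeric.Natural (Natural)
import Data.ByteString (ByteString)

-- illustrative stand-in for the primitive values described in the first post
data BasicValue = BNat Natural | BBytes ByteString

-- DataValue as an ordinary (non-dependent) sum type
data DataValue
  = Basic  BasicValue
  | Tuple  [DataValue]
  | Option (Maybe DataValue)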

I see. Thanks. So here’s my two cents, for what it’s worth.

DataValue is … a sum type” means to a first approximation that you’re already working in an untyped setting.

In this case, it seems to me that we just have a bunch of objects:

  • \mathsf{Data}, \mathsf{Bytestring}, \mathsf{VMCode}, …

together with some functions between them. Some of these are standard. For example, the injective serialise map

\mathsf{Data} \stackrel{serialise}{\hookrightarrow}\mathsf{Bytestring}

has a surjective left inverse deserialise (i.e. \mathit{deserialise} \circ \mathit{serialise} = \mathrm{id}). At a high level, knowing that \mathsf{Data} and \mathsf{Bytestring} are related by these two arrows is all we need to know. So far, so good.

Some of the functions are less evident, like Multievaluation. What jumps out at me is that its verbal description references virtual machines and gas, but the type given for it (reproduced below) does not (I assume that the Natural parameter below is the natural index n that you mention, and is not gas):

multievaluate :
  DataValue ->
  [DataValue] ->
  Natural ->
  Maybe (DataValue, Natural)

I wonder whether the description above (whether in ‘type-flavoured’ notation or in ‘maths-flavoured’ notation) could be enriched to clarify this in some clean way.

For me, the interesting question is to get as clean a description as possible of the objects at play.

(p.s. I acknowledge that I’m really just saying back to you what you said to me.)


On further reflection, I think this discussion bundles together two questions which probably should be treated as separate (although related):

  1. What sort of type information should we standardize for encoding schemes?
  2. What sort of type information should we standardize for interoperation between virtual machines?

The answer to (1) is pretty clearly “the minimum possible which will allow encoding schemes to operate efficiently” - namely, the size of the encoded values, and any structure we might care to be able to efficiently navigate in an encoded representation (e.g. products/coproducts/maybe maps in the future). Standardizing additional type information doesn’t make a lot of sense here as it makes the encoding scheme requirements more complex, doesn’t add any efficiency, and mixes concerns - if we want to check additional type information, we can always do this separately.

Furthermore, clearly separating these layers brings benefits - e.g. suppose that we had some kind of higher-level type “natural number between 0 and 999, where the number is even”. From a serialization/encoding perspective, this can be encoded as a FiniteSet 500 - just as “natural number between 0 and 9999, where the number mod 20 is 0” can - it’s the same size. What distinguishes these two values is some additional information on the type layer which the serialization scheme doesn’t need to care about.
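For illustration, both encode onto the same 500-element index set via maps like these (names and encodings made up here):

import Numeric.Natural (Natural)

-- “even naturals in [0, 998]” -> index in [0, 499]
encodeEven :: Natural -> Natural
encodeEven n = n `div` 2

-- “multiples of 20 in [0, 9980]” -> index in [0, 499]
encodeMult20 :: Natural -> Natural
encodeMult20 n = n `div` 20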

The answer to (2) is less clear. Less precise types simplify the interface and ABI encoding requirements, while more precise types allow us to perform stronger checks that interoperation is correct, since they effectively provide more semantic information about how values are “supposed” to be treated. On the extreme end of the spectrum, if we had full dependent types, VMs wouldn’t really need to “trust each other” in the same way they do now, in the sense that they wouldn’t need to care how precisely functions are represented or computed, as long as certain (typechecked) properties hold.

Ah, the natural number there is indeed intended to be gas - a case of insufficiently clear notation, as you mention!

the natural number there is indeed intended to be gas

Perfect, thank you. At first blush, it seems likely that a suitable model for gas would be a monad. You already have the Maybe monad in there, so let’s declare a WithGas monad that subsumes and extends Maybe with a natural number etc.

Then we get something like this:

multievaluate :
  [DataValue] ->               # list of arguments
  WithGas DataValue ->         # input value, with gas 
  WithGas DataValue            # output value, with gas

The above may be technically incorrect (e.g. perhaps the list of arguments comes with gas too, though I doubt it!) but I hope it gives a flavour of how I am thinking.
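To make the flavour concrete, one possible Haskell rendering (the representation as Maybe (Natural, a) and the helper names are assumptions of this sketch, nothing more):

import Numeric.Natural (Natural)

-- gas accounting as a writer-style monad over (Natural, +), combined with
-- Maybe for failure
newtype WithGas a = WithGas { runWithGas :: Maybe (Natural, a) }

instance Functor WithGas where
  fmap f (WithGas m) = WithGas (fmap (fmap f) m)

instance Applicative WithGas where
  pure x = WithGas (Just (0, x))
  WithGas mf <*> WithGas mx = WithGas $ do
    (g1, f) <- mf
    (g2, x) <- mx
    pure (g1 + g2, f x)

instance Monad WithGas where
  WithGas m >>= k = WithGas $ do
    (g1, x) <- m
    (g2, y) <- runWithGas (k x)
    pure (g1 + g2, y)

charge :: Natural -> WithGas ()    -- record a gas cost
charge g = WithGas (Just (g, ()))

outOfGas :: WithGas a              -- abort, like Nothing in Maybe
outOfGas = WithGas Nothing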

(I did a quick web search for “gas monad”, hoping to quickly link to a mathematical treatment of gas as a monad structure, off-the-shelf. This may well exist, but unfortunately my hits were swamped by Monad (“Monad is a high-performance Ethereum-compatible L1”). Oh well.)


@cwgoes As discussed I have made a start at translating your post above into mathematical language. I welcome comments (from anyone) on what is below, to guide me if and when I continue.

I’ll use monads following categorical notation as per this page.

Data serialisation

We assume functions

\begin{array}{r@{\ }r@{\ }l} \mathit{serialise} : & \mathsf{Data} \hookrightarrow \mathsf{ByteString} \\ \mathit{deserialise} : & \mathsf{ByteString} \twoheadrightarrow \mathsf{Data} \end{array}

subject to

\forall d{\in}\mathsf{Data}.\mathit{deserialise}(\mathit{serialise}(d)) = d .

Gas

We define a monad-like structure \mathsf{WithGas} such that:

\hspace{-1em} \begin{array}{r@{\ }l@{\ }l}
\mathsf{WithGas}(X) =& \mathbb Z \times X \\
\eta(n,x) = \mathsf{return}(n,x) =& (n,x) & \eta : (\mathbb Z\times X) \to \mathsf{WithGas}(X) \\
\mu((n,x),(n',x')) =& (n+n',(x,x')) & \mu : \mathsf{WithGas}(X)^2 \to \mathsf{WithGas}(X^2) \\
\mathsf{bind}((n,x), f) =& (n+\pi_1(f(x)),\pi_2(f(x))) & \mathsf{bind} : \mathsf{WithGas}(X) \times (X\to\mathsf{WithGas}(Y)) \to \mathsf{WithGas}(Y) \\
\mathsf{extract}(n,x) =& \begin{cases} \mathsf{Just}(x) & n\geq 0 \\ \mathsf{Nothing} & n<0 \end{cases} & \mathsf{extract}:\mathsf{WithGas}(X) \to X_\bot
\end{array}

Note:

  • This isn’t quite a monad, because \eta has type \mathbb Z\times X \to \mathsf{WithGas}(X), instead of X\to\mathsf{WithGas}(X). (I don’t see this as a particular problem, but if it is, it’s straightforward to fiddle with the types to fix it, at some slight cost in complexity.)
  • X_\bot is just \mathsf{Maybe}(X).
  • I didn’t set things up here so that execution terminates if gas falls below zero — this is because this is a declarative framework, so there is no execution order and thus if you run out of gas you can always go back in time and add more. However, \mathsf{extract} will return \mathsf{Nothing} if you apply it to a state (n,x) for which n<0 (i.e. with negative gas).

Virtual Machine

We let virtual machines be a monoid (a set with an associative binary composition operation and an identity element),

\cdot : \mathsf{VM} \times \mathsf{VM} \to \mathsf{VM} \qquad \mathit{id} : \mathsf{VM},

and we assume functions

\begin{array}{r@{\ }l@{\ }l} \mathit{encode} :& \mathsf{VM} \hookrightarrow \mathsf{Data} \\ \mathit{decode} :& \mathsf{Data} \twoheadrightarrow \mathsf{VM} \\ \mathit{eval} :& \mathsf{VM} \to \mathsf{WithGas}(\mathsf{VM}) \end{array}

subject to the usual condition \forall v{\in}\mathsf{VM}.\mathit{decode}(\mathit{encode}(v)) = v.

Note:

  • For now I treat just one virtual machine above, for simplicity.
  • This looks different from Christopher’s presentation. I’ve experimented: I tried to simplify things by giving VMs a monoid structure, which under the hood just pairs VM data. If a decode fails, it just returns the empty VM, which also acts as an identity element for the monoid operation.

Both \mathit{decode} and \mathit{eval} exploit the monoid structure. \mathit{decode} returns the empty VM if decoding fails. \mathit{eval} tries to split its input v into (v',[v_1,\dots,v_n]), and if this succeeds, it carries out the usual evaluation as per Christopher’s note. Gas cost just goes straight into the \mathsf{WithGas} monad.
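To give a concrete flavour, a minimal Haskell sketch of the monoid idea (every type and name below is a stand-in for illustration, not part of any spec):

-- a VM as a bag of underlying VM data; composition just concatenates
newtype VM = VM [String]

instance Semigroup VM where
  VM xs <> VM ys = VM (xs ++ ys)

instance Monoid VM where
  mempty = VM []              -- the empty VM, identity for composition

-- an assumed partial decoder; decode falls back to the empty VM on failure
tryDecode :: String -> Maybe VM
tryDecode s = if null s then Nothing else Just (VM [s])

decode :: String -> VM
decode = maybe mempty id . tryDecode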


Syntactically, a pushout’s elements are the same as a coproduct’s, but some inl(x)s are considered the same as some inr(y)s. What is the serializer supposed to do with that? I think the answer is “nothing”,

Yes, that’s right. Products/coproducts are structural and computational; equalizers/coequalizers are assertive. (Although it’s not impossible in some cases to make use of proven constraints/equivalences to perform optimizations, of course.)

I think this would be larger than your restricted W-types. As an example, the initial algebra of List will have serializable elements. Correct me if I’m wrong, but I don’t think that type can be defined under your restriction.

There’s a verbose, uncommented, and (with respect to final Geb interfaces) obsolete example of how to define the free monad of List (where the initial algebra of it emerges as the special case of the application of that monad to the initial object, Void) using W-types here: geb/geb-idris/src/LanguageDef/GebTopos.idr at 855f41fbd65d7396e24707daaf474d2cd5fd84c7 · anoma/geb · GitHub

That particular example generates atom-or-list-style S-expressions where an atom is either a given finite type or a list of natural numbers, which is more complicated than just the free monad of List; I’ll pare down what is required for the latter:

  • The domain and codomain of the functor are FinSet / Fin(3) where the three elements of the type we slice over represent ATOM, EXPR, and LIST_EXPR
  • The ATOM slice has one position, which has one direction (this is simply the constructor which says "an atom is any term of the type that we pass into the free monad to obtain an expression type")
  • The EXPR slice has two positions:
    • An EXPR_ATOM position, which has one direction, of slice ATOM
    • An EXPR_LIST position, which has one direction, of slice LIST(EXPR)
  • The LIST_EXPR slice has two positions:
    • A NIL position, which has no directions
    • A CONS position, which has two directions, HEAD of slice EXPR, and TAIL of slice LIST_EXPR

The free monad of this (which any W-type has) – call it FreeL – is an endofunctor on Set / Fin(3). Then we can define an endofunctor on Set as follows:

X : Set ↦ FreeL(X, Void, Void)

This endofunctor is the free monad of List (so in particular, its application to Void produces the initial algebra of List).
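For concreteness, a Haskell rendering of the three slices above (names mine; the type parameter a plays the role of ATOM, positions become constructors, and directions become fields):

-- EXPR slice: two positions, EXPR_ATOM and EXPR_LIST
data Expr a
  = EAtom a                     -- one direction, of slice ATOM
  | EList (ListExpr a)          -- one direction, of slice LIST_EXPR

-- LIST_EXPR slice: two positions, NIL and CONS
data ListExpr a
  = Nil                         -- no directions
  | Cons (Expr a) (ListExpr a)  -- HEAD of slice EXPR, TAIL of slice LIST_EXPR

-- a ↦ Expr a is then the free monad of List, and Expr Void its application
-- to the initial object, i.e. the initial algebra of List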

As far as I have seen so far, this seems to be a general technique – when you need to define a recursive data structure which contains one or more instances not only of itself but of some functor applied to itself, you factor out “functor applied to itself” as a type of its own, expand it, and define everything by mutual recursion using one slice per type. Thus you end up with:

  • Each type (or, when you generalize to dependent types, type family) is a slice
  • Each constructor is a position
  • Each field is a direction

The proposition that this can be done for any data structure, using only finite types for the types of slices, positions, and directions, then amounts simply to the observation that even if your types may be infinite, your source code cannot!


Generally speaking, expressing something simple in terms of primitives that are more complex is not advised.

Agreed – this is my goal, which I think I can probably best explain by answering this question:

why we would want to characterise FinSet in such terms in the first place

The essential reason is that, while for mathematicians, “Natural numbers were created by God”, for programmers, it’s data types that were created by God. And the types I listed are the atomic, universal (in the category-theoretic sense) components of data types:

  • Products are records
  • Coproducts are enumerated types/ADTs
  • Exponential objects (which can be defined in terms of the above, as in the earliest versions of Geb) are function types
  • Equalizers are constraints
  • Coequalizers are claims of equivalence (for example to some normal form)
  • W-types are recursive data structures
  • M-types are co-recursive data structures
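In everyday Haskell terms (purely for illustration), the first three of these are the familiar constructs:

data Point = Point { px :: Int, py :: Int }   -- product: a record
data Shape = Circle Int | Rect Int Int        -- coproduct: an enumerated type / ADT
type Area  = Shape -> Int                     -- exponential object: a function type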

My perspective is that potential types such as “finite field” are more complicated (in particular, in the sense of the universal objects/morphisms, or, in the type-theoretic view, introduction and elimination rules) than those of the above (viewed individually), and that they can be composed from the simpler types listed above.

There is a large pre-existing pool of theory applying to each of the above types, which we can inherit for the purposes of proving things about our derived types (both on paper and in code).

I would argue that this is the free monad on X ↦ 1 + X × X, that is, the type of full binary trees (plus some junk). Implicitly, you’re using an isomorphism where you identify binary trees with rose trees by accumulating left branches as cons and right branches as nestings. While it’s an elegant isomorphism, it is still an isomorphism. If you’re going to implicitly identify isomorphic things, why not just assert the existence of a natural numbers object and call it a day, since it will be isomorphic to any countably infinite thing?

I suppose the argument would be a compromise between practicality and minimality. We don’t have only an NNO for the same reason we don’t want to represent FinSet skeletally; it’s possible but not convenient in practice. At the same time, you want to place a limit like “we should only allow atomic type constructors with a bounded number of arguments”, and that would eliminate the initial algebra of List (but there would still be things isomorphic to it). Similarly, for any type which is a nesting of (finite many) fixed points, we can, in practice, identify it with an isomorphic collapse into a single fixed point a la

Fix List 
  = Fix (X ↦ Fix (Y ↦ 1 + X × Y))
  ≃ Fix (X ↦ 1 + X × X)
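A Haskell witness of that isomorphism, for concreteness (the usual first-child/next-sibling encoding; Rose stands in for Fix List and Bin for Fix (X ↦ 1 + X × X)):

data Rose = Rose [Rose]          -- Fix List
data Bin  = Leaf | Node Bin Bin  -- Fix (X ↦ 1 + X × X)

-- a forest becomes a binary tree: left child = first child, right child = next sibling
forestToBin :: [Rose] -> Bin
forestToBin []              = Leaf
forestToBin (Rose cs : sib) = Node (forestToBin cs) (forestToBin sib)

binToForest :: Bin -> [Rose]
binToForest Leaf       = []
binToForest (Node l r) = Rose (binToForest l) : binToForest r

-- Rose ≅ [Rose] ≅ Bin, hence Fix List ≅ Fix (X ↦ 1 + X × X)
roseToBin :: Rose -> Bin
roseToBin (Rose cs) = forestToBin cs

binToRose :: Bin -> Rose
binToRose = Rose . binToForest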

I think that’s reasonable, honestly, and the restriction to constructors with bounded argument numbers seems natural to me.

*** Edit: I think I misunderstood your construction. Rereading it, I think what you’re describing is the transformation of (something like) this mutual inductive type

Mutual Inductive ListMut : Type :=
  | Nil : ListMut
  | Cons : FixMut -> ListMut -> ListMut
Mutual Inductive FixMut : Type :=
  | Fix : ListMut -> FixMut

Into (something like) this dependent type

Inductive FixListDep : Bool -> Type :=
  | Nil : FixListDep True
  | Cons : FixListDep False -> FixListDep True -> FixListDep True
  | Fix : FixListDep True -> FixListDep False

the latter being the initial dialgebra over

F, G : (Bool → Type) → Type × Type × Type
F(X) := (Unit, X False × X True, X True)
G(X) := (X True, X True, X False)

I suppose this is different in that this has exactly and only the same terms as Fix List would have, but the induction principle wouldn’t be exactly the same. That may not matter, though? I think the core of my original comments is still valuable, but I hope this is closer to what you were saying.

*** End edit.

I do think this would break down at some point. For example, consider the functor Cyc(X) whose elements are lists over X quotiented by the action of the cyclic group whose order is the length of the list (i.e. lists considered up to cyclic rotation). So, for example

[1,2,3] = [2,3,1] = [3,1,2] : Cyc(ℕ)

and

[0,5] = [5,0] : Cyc(ℕ)

If we consider the initial algebra of Cyc, there would be an isomorphic type of binary trees quotiented similarly (where they are equivalent under certain permutations of right branches), but the binary-tree version wouldn’t have the quotient follow the inductive structure. It’s not obvious to me that this could be defined under your restriction, though; if you can demo that, then I’ll be a true believer. This is an obscure example, so I don’t mind dismissing it, and it’s not relevant to the serializer discussion anyway, but it came to mind so I thought to mention it.


ADTs are really “coproducts implemented in the bad constructor-soup way ML does it”, and I’m on the anti-ADT train for this reason. An a is already an a \lor b; it’s deeply silly that with the usual ML-style ADTs an a isn’t an a \lor b, but rather an AOrBConstructor(a), which is manifestly not an a (it has extra garbage piled on it), is what counts as an a \lor b.

Forget Hindley-Milner and just use coproducts and follow the definitionally-exists forgetting arrow you get for free!

More concretely “an integer” like 5 is also “an integer or a pair of integers” like 5. I just followed an arrow on the type level when typechecking f: integer or pair of integers -> _; f(5). In this manner is ADT hell avoided for the working programmer.
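In Haskell terms, for concreteness, the objection is to the explicit injection step:

-- with ML-style sums, 5 is not itself a value of the union type;
-- it has to be wrapped in a constructor before f will accept it
f :: Either Int (Int, Int) -> Int
f (Left n)       = n
f (Right (a, b)) = a + b

use :: Int
use = f (Left 5)   -- the “AOrBConstructor(a)” step objected to above
-- f 5             -- rejected: an Int is not an Either Int (Int, Int)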


I suppose this is different in that this has exactly and only the same terms as Fix List would have, but the induction principle wouldn’t be exactly the same. That may not matter, though? I think the core of my original comments is still valuable, but I hope this is closer to what you were saying.

Yes, that’s right – I thought I was going to have to write that up myself, so thanks for making the edit and saving me the time! :slight_smile:

It’s not obvious to me that this could be defined under your restriction, though; if you can demo that, then I’ll be a true believer.

Thanks – I’ll work on figuring out whether I can, then! That is precisely the kind of challenging test I would want for my hypothesis that we can express any data structure by the finitary, dependent-mutual-recursion technique of which I’ve provided the one example of the List initial algebra, because it is only a hypothesis, not something I can prove, and indeed I’m not even sure whether there’s a sensible way of precisely formalizing the hypothesis or whether it’s purely a statement about the practice of programming.