Sketch of a possible VampIR VM

So I think the main points of focus would be:

  1. Alternative non-von Neumann architecture/ISA - the Eden VM is a strong motivating example here
  2. Alternative memory model
  3. Concrete efficiency optimizations
  4. Well-defined “runtime”/introspection model

Let me define what I mean by “runtime”: in the “zkVM” model it would be anything that computes on the program itself, which includes instruction fetch/decode, branching, and introspection (e.g. loading new code/programs). Generally, if the VM is based on recursion, much of this so-called “runtime” might (though sometimes might not) end up on the other side of the recursion cycle from the “execution”.
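
As a rough illustration of this split, here is a hypothetical classification; the `Op` type and all of its variants are invented for this sketch, not taken from any existing design:

```rust
// Hypothetical classification of VM operations into "runtime" work
// (computation on the program itself) and "execution" work
// (computation on the program's data). All names are invented.
enum Op {
    // "Runtime" side:
    FetchDecode, // instruction fetch/decode
    Branch,      // dynamic control flow
    Introspect,  // e.g. loading new code/programs
    // "Execution" side:
    Arith,     // field arithmetic, hashing, ...
    MemAccess, // reads/writes against the memory model
}

// In a recursion-based VM, the runtime side would (sometimes) sit on
// the other side of the recursion cycle from the execution side.
fn is_runtime(op: &Op) -> bool {
    matches!(op, Op::FetchDecode | Op::Branch | Op::Introspect)
}

fn main() {
    assert!(is_runtime(&Op::Branch));
    assert!(!is_runtime(&Op::Arith));
}
```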

My thesis is that it’s extremely important to define the “runtime” of the VM and to minimize its overhead - in particular, to remove as much of the runtime overhead as possible from the circuits that do the actual execution.

The memory model is probably the easiest to define: commitments can act as pointers to arbitrary-length memory, which serves as an “intermediate” memory store - in between long-term storage (likely Merkle trees) and short-term “storage” in the cells of a circuit.
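
A minimal sketch of that three-tier model, with every type and function invented for illustration (in particular, `Commitment` stands in for whatever commitment scheme the VM would actually use):

```rust
// Hypothetical three-tier memory model; nothing here is an existing API.
struct MerkleRoot([u8; 32]); // long-term storage: Merkle trees
struct Commitment([u8; 32]); // intermediate: arbitrary-length memory
struct Cell(u64);            // short-term: a cell in a circuit

// A "pointer" is just a commitment; dereferencing it means opening
// the commitment inside a circuit, which is the expensive part.
fn read(_ptr: &Commitment, offset: usize, opening: &[u64]) -> Cell {
    // A real VM would verify `opening` against `_ptr` in-circuit;
    // here we only model the interface.
    Cell(opening[offset])
}

fn main() {
    let ptr = Commitment([0u8; 32]);
    let memory = [7u64, 8, 9]; // the data the commitment commits to
    assert_eq!(read(&ptr, 1, &memory).0, 8);
}
```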

Of course this presents a performance challenge (mostly for proving performance): every memory access requires opening a commitment in a circuit, which can be expensive depending on the number and size of the accesses. There is also some “runtime” overhead whenever the VM dynamically decodes instructions or recurses, for example. I think this is why we still see 2-3x proving-time overhead of VMs over the equivalent statically compiled circuit, when (IMHO) it should be possible to get this below 1.5x and possibly lower.

So we are motivated (at minimum) to minimize “memory” accesses. The ideal case would be an optimization common in CPU architectures: avoid consecutive writes and reads of the same data (here, an instruction writing to a “memory commitment” and the next instruction reading that same memory back). One approach is to add “registers” or similar constructs to the VM and use common compiler techniques, but I argue this is still too tied to traditional architectures, because Halo2 already pretty much solves this problem by allowing efficient dataflow between adjacent “chips”. “zkVM registers” would always carry at least some runtime overhead if implemented in a completely generic way, whereas a hand-coded Halo2 circuit can avoid it entirely.
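
To make the contrast concrete, here is an invented side-by-side comparison; `Cell` loosely stands in for an assigned cell passed between Halo2 chips via copy constraints:

```rust
// Invented example: generic "zkVM registers" vs. direct dataflow
// between adjacent chips in a hand-coded circuit.
struct Cell(u64); // stand-in for an assigned circuit cell

// Register style: the intermediate value round-trips through a generic
// register file, costing constraints even though the very next
// instruction immediately reads it back.
fn via_registers(regs: &mut [u64], x: u64) -> u64 {
    regs[0] = x * x; // instruction 1: write r0
    regs[0] + 1      // instruction 2: read r0
}

// Hand-coded-circuit style: chip 2 consumes chip 1's output cell
// directly, so no register machinery appears in the circuit at all.
fn via_direct_dataflow(x: Cell) -> Cell {
    let squared = Cell(x.0 * x.0); // chip 1
    Cell(squared.0 + 1)            // chip 2, wired to chip 1's output
}

fn main() {
    let mut regs = [0u64; 4];
    assert_eq!(via_registers(&mut regs, 3), 10);
    assert_eq!(via_direct_dataflow(Cell(3)).0, 10);
}
```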

Another way of thinking about it: registers are essential in CPU architectures because the functional units of the CPU cannot be connected directly together, in both a programmatic sense (the CPU is immutable) and a physical sense (sequential computation adds latency beyond the CPU clock cycle). So CPU circuits have to store intermediate data in registers between instructions, but Halo2 circuits don’t.

Another substantial difference is the cost of instruction decoding (traditionally something that mattered in CPUs too) and whether it motivates simple or complex “instructions”. In theory, I think a zkVM should get “instruction decoding” essentially for free (!) if most of it is done outside of the circuit. This is where static analysis and compilation are important: if a sequence of nonbranching/nonintrospecting instructions is compiled into a single circuit, then the VM only needs to “decode” the verifying key of that circuit. Since verifying keys are succinct, that decoding cost stays small no matter how much computation the circuit contains.
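
A sketch of the compilation pass this implies, splitting the instruction stream at branching/introspecting instructions and compiling each resulting block into a single circuit; `Instr`, `VerifyingKey`, and `compile_block` are all hypothetical names:

```rust
// Hypothetical static compilation of nonbranching sequences.
enum Instr {
    Straightline, // any nonbranching, nonintrospecting instruction
    Branch,       // dynamic control flow
    Introspect,   // e.g. loads new code
}

struct VerifyingKey; // stand-in for a succinct Halo2 verifying key

fn compile_block(_block: &[Instr]) -> VerifyingKey {
    // Would synthesize one Halo2 circuit for the whole nonbranching
    // sequence and return its (succinct) verifying key.
    VerifyingKey
}

// The runtime then only ever "decodes" verifying keys, never
// individual instructions.
fn compile(program: &[Instr]) -> Vec<VerifyingKey> {
    program
        .split_inclusive(|i| matches!(i, Instr::Branch | Instr::Introspect))
        .map(compile_block)
        .collect()
}

fn main() {
    use Instr::*;
    let program = [Straightline, Straightline, Branch, Straightline];
    assert_eq!(compile(&program).len(), 2); // two blocks, two keys
}
```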

So far this proposal (statically compile sequences of nonbranching instructions) is mostly about concrete efficiency: it is agnostic to the specific ISA and memory model, and doesn’t achieve an asymptotic speedup. It’s about taking an arbitrary ISA and memory model, assumed to be statically analyzable to some extent, and simultaneously minimizing memory accesses and maximizing the program/runtime ratio (as a function of circuit size; area-degree product or similar).
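
To pin down what I mean by that ratio (my own formalization, not from an existing spec): writing A(C) for the area-degree product of a circuit C, the quantity to maximize is

```latex
% Hypothetical "program/runtime ratio": execution work relative to
% runtime work, both measured by area-degree product A(C).
\frac{\sum_{C \in \mathrm{execution}} A(C)}{\sum_{C \in \mathrm{runtime}} A(C)}
```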

In conclusion, this places two requirements on the IR/ISA:

  1. The IR is statically analyzable with respect to branching/introspection locations
  2. The IR is statically analyzable with respect to dataflow (so it can be compiled into equivalent Halo2 circuits)

and a requirement on the VM “runtime”:

  1. A representation of the compiled program as a sequence of “augmented” Halo2 verifying keys (augmented with dataflow, branching, and introspection metadata, rather than a sequence of IR instructions) - sketched below
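
Concretely, the compiled program might look something like the following; every type here is invented for illustration:

```rust
// Hypothetical shape of a compiled program: a sequence of Halo2
// verifying keys augmented with the metadata the runtime needs.
struct VerifyingKey; // succinct Halo2 verifying key (stand-in)
struct WireSpec;     // e.g. which memory commitments are read/written
struct BranchSpec;   // e.g. (condition, successor block index)

struct AugmentedVk {
    vk: VerifyingKey,
    dataflow: Vec<WireSpec>,   // how the block's inputs/outputs connect
    branches: Vec<BranchSpec>, // possible successor blocks, if any
    introspects: bool,         // whether the block may load new code
}

// The runtime interprets this instead of a sequence of IR instructions.
type CompiledProgram = Vec<AugmentedVk>;

fn main() {
    let program: CompiledProgram = vec![AugmentedVk {
        vk: VerifyingKey,
        dataflow: vec![WireSpec],
        branches: vec![],
        introspects: false,
    }];
    assert_eq!(program.len(), 1);
}
```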

There are probably some performance downsides to this:

  1. Since prover time is O(n log n) in the size of the circuit, increasing the size of each execution circuit (while decreasing the total number of execution circuits) does impose a slightly larger log factor - see the worked example after this list
  2. (Perhaps) increased communication complexity between the execution circuits and “runtime” circuit
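
On point 1, the extra log factor is mild. A quick worked example (my own numbers, assuming proving cost proportional to n log n and ignoring constants): merging k = 2^6 execution circuits of 2^14 rows each into one circuit of N = 2^20 rows changes the total cost from

```latex
k \cdot \frac{N}{k} \log_2 \frac{N}{k} = 14N
\qquad\text{to}\qquad
N \log_2 N = 20N
```

i.e. a factor of 20/14 ≈ 1.43 in the most extreme case of merging everything into a single circuit; merging only short nonbranching blocks costs correspondingly less.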

Nevertheless these costs should be insignificant compared to the memory read/write and runtime savings. In general, any time computation can be moved out of a circuit, performance will improve, since computation inside a circuit is always an order of magnitude slower than the same computation outside of it.