Sketch of a possible VampIR VM


This is a rough sketch of what a “VampIR VM” might look like. Goals of this VM design:

  • Abstract (not tied to a specific proof system)
  • Minimal (no more complexity than necessary)
  • Clean separation of semantic specification and efficiency / optimisation concerns


The entire VM is parameterized over a totally ordered finite field F, with:

  • Multiplication: *, with type F \to F \to F
  • Addition: +, with type F \to F \to F
  • Additive identity: 0, with type F
  • Multiplicative identity: 1, with type F
  • Additive inverse: -, with type F \to F
  • Comparison: < (less than, exclusive), with type F \to F \to {0 | 1}

Input program representation

Programs are represented as extended multivariate polynomials over a_0 ... a_n in canonical form, with designated variable b (where a_0 .... a_n can be understood as the inputs, and b can be understood as the output result):

a_0^3 + 2a_1^2 - 3a_0 + 2 = b

This is, of course, equivalent to a_0^3 + 2a_1^2 - 3a_0 + 2 - b = 0.

To provide control flow, we introduce two new primitives:

  • Comparison lt(a, b), where:
    • lt(a, b) = 0 if a < b
    • lt(a, b) = 1 otherwise
  • Branching branch(a, b, c), where:
    • branch(0, b, c) = b
    • branch(1, b, c) = c
    • branch(\_, b, c) (if a is neither identity element) is undefined

TODO: Consider whether “undefined” is the right semantics here. “Crash” might also be appropriate.

Note: Assuming that a \in {0, 1}, branch(a, b, c) is semantically equivalent to ((1 - a) * b) + (a * c), and some kind of smart bidirectional transformation between these two should be possible.

Thus, for example,

branch((lt(a_0, a_1), a_2, a_3^2) = b

is a valid program.

Transformation to circuit

In order to transform programs to a circuit, they are first factorized (via some algorithm), resulting in a program of a form

seq_0 * seq_1 * seq_2 ... = b

TODO: Figure out how lt and branch interact with factorization.

The factors are ordered in canonical form.

The program is then transformed to a circuit-style DAG, with a_0 ... a_n as the inputs and b as the output. There are only five gate types:

  • ADD
  • MUL
  • NEG
  • LT

Transformation to instructions

Now, we have an AST of the form (for example):

data Circuit
	= Input Nat
	| Neg Circuit
	| Add Circuit Circuit
	| Mul Circuit Circuit
	| Lt Circuit Circuit
	| Branch Circuit Circuit Circuit

For sequential execution, this AST can be compiled with SSA to a finite set of registers and a sequence of instructions of the same names, where evaluation is eager, except for branch which is compiled to compare-jump (avoiding execution of the branch not taken).

For example, define the sequential VM state as the tuple (pc, ins, rt) where

  • pc is the program counter
  • ins is the instruction sequence
  • rt is the register table

Instructions are then:

data Instruction
  = Neg R R
  | Add R R R
  | Mul R R R
  | Lt R R R
  | Branch R P P

Each step, the VM reads the instruction at the current pc, and executes as follows:

  • NEG, ADD, MUL, LT: Perform the appropriate operation, writing to the last (output) register, and increment pc by 1.
  • BRANCH: Read register R, and set pc to the left branch if R = 0, or to the right branch if R = 1.

At the end, the output value will be stored in the register assigned to b.

For parallel execution, disjoint parts of the circuit may be executed in parallel as usual, and compile-time branch prediction would lead to early parallel evaluation of (parts of) both branches.

Note: the optimisation function here is input-value-dependent, so a well-defined solution would depend on a probability distribution over input values)



Add primitives and associated circuit gates + instructions:

  • reflect(a) - reflects a \in F up to a program PROG
  • repr(PROG) - represents PROG as a field element a \in F

Future tasks

  • Figure out how and when to deal with overflow (seems like this might be hard/annoying)
    • Ideally we just do this at the definition stage by using a large enough field
    • But otherwise we might have to do intricate things
  • Consider different finite fields (of program definitions and execution backends)
  • Consider prefixes / sub-fields
  • Consider what happens if the field is only partially ordered / what implications this structure has
  • Consider a potential non-deterministic VM version where (basically) some of the inputs are undefined

The primary optimization question here (cc @joe) is to transform between finite field arithmetic on different fields (set by the programmer in the input program, and set by the machine for the actual VM).

1 Like

I think BRANCH needs careful thought here.

In circuits you can save some witness generation time (part of proving time), but not constraint generation time or constraint count by avoiding branches not taken. The constraints in the branch not taken must still be a part of the circuit. However the Prover, who knows which branches they will take, can choose to fill the witnesses on untaken branches with zeros, or any other garbage, knowing that their proof will still verify.

I think witness generation is comparable to “execution” of a typical program and common optimizations for program execution should be applicable there.

However, constraint generation is a different animal and requires a different treatment.

Perhaps we need to think of Vamp-IR as having two VMs operating on the same programs. After all, only a Prover needs to do witness generation at all.

Aye - to me, one of the primary arguments for building a VM in the first place is that we wouldn’t need to evaluate all branches - a program could be split into parts, some of which are compiled directly (as VampIR currently does, evaluating all branches, but with no VM overhead), and some of which are compiled to a sequence of VM instructions, which we then prove the interpretation result of using a circuit specifically for the VampIR VM (so we only need to prove correctness of branches taken).

There’s a very good chance I am misunderstanding you, but I don’t see how this could work and remain zero-knowledge. Does this not leak information about the taken branches to anyone receiving the circuit? If it doesn’t leak information, how does the Verifier know how to construct the same circuit the Prover has?

The circuit wouldn’t change - the prover and verifier would have to run the compiler with the same settings, but the circuit is still determined at compile-time - it’s just that parts of the program could be compiled directly, and parts of this program could be evaluated on the VampIR VM interpreter circuit (similar to how other zkVMs work).

How does using a VM interpreter circuit avoid proving correctness of branches not taken?

Is the idea to use Lurk-style recursion?

We would just prove some number of iterations of applying the VM step interpreter function (if the program is too long, we’d need to use continuations or recursion or such). I believe that RISC0 uses continuations - Introducing Continuations (RISC Zero Study Club - Continuations Part 2/5) - YouTube - and Lurk uses recursion, yes.

Ok, I think I understand you now, and why the VM sketch is the way that it is.

It looks like getting performance benefits this way requires recursion, which both RISC0 and Lurk exploit. When not using recursion it should be more constraint-count-efficient to compile everything directly.

We should try somehow to quantify the cost of the VM overhead over direct compilation of the same circuits.

Hmm - I think that we should definitely investigate recursion, but I’m not sure that I understand why it would always be more efficient to compile directly. Suppose that I have a very long program, but only actually evaluate a few branches - couldn’t the VM interpreter circuit be more efficient in that case?

I think recursion is necessary to realize any efficiency benefits of a VM model (though I don’t know to prove it).

In the Risc Zero video you linked to they use recursion to realize continuations.

My intuition is that zkVMs need recursion to do anything “introspective” (like avoiding proving on unused branches) without breaking ZK.

I need to think about this some more and brush up on VMs in general.

After some thought about it, here’s what I think:

  1. I agree with @lopeetall that recursion, or at least something morally close to it, is necessary to get most of the efficiency benefits
  2. The benefit of avoiding unnecessary branching paths depends a lot on the application. Cryptographic applications often don’t need much branching (indeed, it’s undesirable) and so it’s not been necessary for applications up until now. But of course, it could be that useful applications with lots of branching simply have been impractical without something like this.
  3. I don’t think the “interpreter” style of VM is necessary at all here, and it’s possible to achieve the stated goals without it. Indeed the word “zkVM” is now heavily overloaded, making things a bit murky.

If the primary goal is to allow efficient branching, then I think it’s not worth moving many parts of the execution model into the circuit, in particular the (I guess common) approach of having a circuit compute some steps or cycles of the VM by interpreting opcodes. I can understand why this approach is sometimes necessary (indeed, why VMs in general are useful) but I think it’s entirely tangential to the goal here.

Rather, when the zk-program has been compiled to an IR, the IR can be partitioned at every possible branch-point (or possibly more often) and each segment of the partition hardcoded into a circuit. Then a circuit on the other side of the cycle can ensure that each segment is computed with the correct sequence of VM states, and ensure that the correct sequence of segments is accumulated. Since the “instruction trace” of each segment is perfectly known at compile time, the resulting circuits don’t need to compute anything extra.

1 Like

Great point, I didn’t think of this - the verifier would just need to check each “subpath” of program execution which was actually taken, and that the subpaths taken correspond correctly to the branch instructions that specify which subpath should be taken according to the results of computation thus far.

So I think the main points of focus would be:

  1. Alternative non-von Neumann architecture/ISA - the Eden VM is highly motivational here
  2. Alternative memory model
  3. Concrete efficiency optimizations
  4. Well defined “runtime”/introspection model

Let me define what I mean by “runtime” - in the “zkVM” model it would be anything that that computes on the actual program itself, which includes instruction fetch/decode, branching, and introspection (e.g. loading new code/programs). I think, generally, if the VM is based on recursion, a lot of this so-called “runtime” might (though, sometimes might not) end up on the other side of the cycle from the “execution”.

My thesis is that it’s extremely important to define the “runtime” of the VM and to minimize its overhead. In particular, to try to remove as much of the runtime overhead from the circuits that do the actual execution.

The memory model is probably the easiest to define - commitments can act as pointers to arbitrary length memory (which acts as an “intermediate” memory store - in between long term storage likely to be Merkle trees, and short term “storage” in cells in a circuit)

Of course this presents a performance challenge (mostly for proving performance), where every memory access requires opening a commitment in a circuit, which can be expensive depending on the number and size of each access. There is also some “runtime” overhead whenever the VM dynamically decodes instructions or recurses, for example. I think this is where we are still seeing 2-3x proving time overhead of VMs over the equivalent statically compiled circuit, when (IMHO) it should be possible to get this below 1.5x and possibly lower.

So we are motivated (at minimum) to minimize “memory” accesses. The ideal case would be to do an optimization common for CPU architectures and avoid consecutive reads and writes of the same data (here, represented as an instruction writing to a “memory commitment”, and the next instruction reading the same memory). One approach is to add “registers” or other constructs to the VM and use common compiler techniques, but I argue this is still too traditionally-focused because Halo2 already pretty much solves this problem, allowing efficient dataflow between adjacent “chips”. “zkVM registers” would always have at least some runtime overhead if implemented in a completely generic way, whereas a hand-coded Halo2 circuit can completely avoid this.

Another way of thinking about it is that registers are essential in CPU architectures because of the inability to connect functional units of the CPU directly together, both in a programmatic sense (the CPU is immutable) and also a physical sense (sequential computation adds latency more than the cpu clock cycle). So CPU circuits have to store intermediate data between instructions in registers, but Halo2 circuits don’t.

Another substantial difference is the cost of instruction decoding (traditionally something that mattered in CPUs too) and whether it motivates simple or complex “instructions”. In theory, I think a zkVM should get “instruction decoding” essentially for close to free (!) if much of it is done outside of the circuit. This is where static analysis and compilation is important: if a sequence of nonbranching/nonintrospecting instructions are compiled into a single circuit, then the VM only needs to “decode” the verifying key of that circuit. Since verifying keys are succinct, the efficiency should be better.

So far this proposal (statically compile sequences of nonbranching instructions) is mostly about concrete efficiency - it is actually agnostic to the specific ISA or memory model, and doesn’t achieve an asymptotic speedup. It’s more about taking an arbitrary ISA or memory model, which are assumed to be statically analyzable to some extent, and simultaneously minimizing memory accesses and maximizing the program/runtime ratio (as a function of circuit size; area-degree product or similar).

In conclusion, this presents two requirements on the IR/ISA:

  1. The IR is statically analyzable with respect to branching/introspection locations
  2. The IR is statically analyzable with respect to dataflow (to be compiled into a equivalent Halo2 circuits)

and a requirements on the VM “runtime”:

  1. A representation of the compiled program as a sequence of “augmented” Halo2 verifying keys (augmented with dataflow, branching, and introspection; rather than a sequence of IR instructions)

There are probably some performance downsides to this:

  1. Since prover time is O(n log n) in the size of the circuit, increasing the size of each execution circuit (while decreasing the total number of execution circuits) does impose slightly extra log cost
  2. (Perhaps) increased communication complexity between the execution circuits and “runtime” circuit

Nevertheless these costs should be insignificant to memory read/write savings and runtime savings. In general, anytime any computation can be moved out of a circuit, the performance will improve since computation in a circuit is always an order of magnitude slower than outside of the circuit.

Thanks; this is helpful. A few questions:

What is “execution” here - checking that a particular instruction was executed correctly?

This is possible, but would it be necessary? Suppose that we can statically analyze all “memory allocations”, such that - although we don’t want to generate one big circuit which allocates for all branches - we can simply generate the circuit for each segment with the right number of public / private inputs to include memory values transmitted from or to other segments. What do we need these extra commitments for, exactly?

Or perhaps this is what you are proposing below (not to do this)?

By “dataflow”, you mean memory dependencies, right? e.g. the compiler would need to track where variables are written & read and generate a directed graph (possibly DAG) so that we can identify which values to pass (in some way) between segments

What exactly is a Halo 2 verifying key, in terms of abstract mathematical structure? (presumably this technique would not be specific to only Halo 2)

What needs to be communicated between the execution circuits and “runtime” circuit? Mostly, in this model, it is that the correct branches were actually taken, right? (and then data needs to be communicated between different execution circuits, but shouldn’t need to be communicated with the runtime circuit except perhaps for the start and end of execution).

Right, by “execution” I mean the actual computations of the program (field arithmetic, EC arithmetic, hashes, etc) which are done by Halo 2 “chips”

I guess I didn’t really say exactly. I don’t mean that there is a single memory store which the entire program shares, but rather that each commitment can be thought of as a pointer to some specific piece of memory. I thought that this kind of memory model is at least sometimes necessary because the memory state has to be binding between segments (or single instructions).

You have an interesting point, though, that it might not be necessary - it could be, maybe, it’s possible to do “direct” memory sharing; that is, the “runtime” circuit simply provides the “execution” circuits with the relevant memory states as “public” inputs rather than private inputs with corresponding private commitment pointers. The immediate problem is that, as public inputs, they would reveal the contents of memory publicly. However, this might or might not present an actual privacy problem, because perhaps it’s possible to “hide” the relevant public inputs when doing the recursion so they’re not actually public. This seems plausible, but it depends a lot on the details of the proof system, so further investigation required.

We can even think of this as a completely dynamic allocation, where the allocation is size determined at runtime, although of course it’s not possible to compute on a dynamically sized data without branches. But as you say, each segment only needs to read and write the memory as needed transmit between the other segments.

Right, I think this is not really a novel idea, since most compilers have to do this kind of dataflow analysis.

Mathematically it’s a commitment(s) to the fixed columns (hardcoded inputs, basically) and all of the permutations used in the circuit. Also relevant is the “configuration” of the circuit (basically we can think of it as which gates/chips are available to use) although this is not really considered part of the verifying key. This is roughly the verifying key for any plonk-ish proof system. For a different kind of proof system the verifying key mathematically would be different, but the abstraction is the same (given a known verifying key, public inputs, and a proof, the verifier can check that the predicate corresponding to the key is true for the given public inputs and proof)

Unfortunately it’s quite tricky to define a general stable serialized format for Halo 2 verifying keys, which is an open issue on the Halo 2 repository and matters for e.g. zkEVM (and VampIR VM too) but isn’t as much of a problem for Orchard and Taiga.

At minimum branching, but as Jeremy pointed out to me it can include any level of introspection, e.g. loading arbitrary new programs. So this would concretely look like the execute computation “deciding” which “program” / verifying keys to “execute” next and passing this data to the “runtime” side of the cycle. So this might not include just bits of data but perhaps larger amounts (although much more rarely)

Let’s start with a brief overly simplified review of how a Halo2 (or Plonkish) circuit looks, abstractly - we can think of it as a rectangular array of cells arranged in rows and columns, upon which there are various constraints (equality, polynomial, lookup) where there are 3 types of columns: fixed (constants), instance (or public input), and advice (private input). The relevant metric for performance analysis is area-degree product (rows x columns x maximum polynomial degree) which motivates avoiding overly complex gates (polynomial degree too high) and also too many constraints (too many rows). If the set of possible polynomial constraints and lookup tables is fixed (or standardized) then a circuit is mainly defined by its values on its fixed columns and its set of “gates” (the exact constraints used for each row/col).

Now, Halo2 is exciting because it enables incrementally verifiable computation, which is exactly what is needed for a VM. For this purpose, we’ll ignore the actual details of how IVC works and, indeed, describe IVC incorrectly (in an unimportant way).

For IVC, we want to think about it as: there is some prior state, along with a proof of correctness of this prior state, and we want to create a new state, along with a proof of correctness of the new state. In general this would require checking that the state transition is acceptable, as well as verifying the proof of correctness of the prior state.

IVC is directly useful for VMs in an obvious way. The prior state is just the prior state of the VM, and the correct state transition is given by whatever operation/instruction is next in the sequence. Therefore, the state of the VM always has a proof that verifies the entire execution history. Unlike compiling directly into a circuit, we can branch efficiently (along with other operations) in this model by selecting the proper state transition, and the IVC does not have to compute all the untaken branches.

There are several problems with this description. One is that Halo2 IVC doesn’t work in this way - you cannot reasonably verify the prior state proof in the circuit used to prove correctness of the new state. Instead, you must use accumulation and add the proof of correctness to the accumulator. The correctness of the VM state at any point in time can be checked by checking the accumulator (outside of any circuit). However, this distinction is fairly irrelevant to the actual design of the VM as we can just view this proving system as a black box.

A much more relevant problem is that Halo2 IVC also does not work in the way described in the paragraph above, because the accumulator would run on the “opposite” curve of the cycle of curves, from the “state transition” circuit. So we must not only alternate sides of the curve (checking state transition on one side, then accumulate the proof of correctness of the transition on the other side) but then also accumulate on both sides (since the proof of correct accumulation itself needs to be accumulated on the other side). While accumulation is asymptotically efficient, there are still concrete costs that happen for every IVC step.

So what can we do about these costs? The above proposal (implicitly) addresses this by suggesting that there is an optimally sized IVC step, which does a significant amount of computation (and not just one small operation) in each IVC step. This amount should not be “too” large, because occasionally branches might be necessary (the edge cases are: no branching ever, in which case the optimal size is probably to put the entire computation in one circuit, and when every operation branches, in which case an IVC step does only one operation). The exact amount is probably heuristic and greatly depends on application, but since branching is typically somewhat rare (cryptographic code often avoids branching on purpose) it is reasonable that the optimal amount is rather large.

There are also other issues with this approach. Sometimes memory is the constrained resource, in which case the optimal size may be much smaller.

Now we can formalize the VM. For simplicity, let’s assume that every possible state transition is defined by a Halo2 circuit of standardized format (# of columns, polynomial constraints/lookups, etc). Then each IVC step consists of a circuit over curve 1 that looks like:

  1. Input prior curve 1 state, new curve 1 state, prior accumulator over curve 2, new accumulator over curve 2, and the prior curve 2 proof
  2. check a (fixed, branchless) state transition from prior curve 1 state to new curve 1 state
  3. check the accumulation of the prior curve 2 proof into the prior accumulator to get the new accumulator

The exact form of the state can vary, but in general we can think of it as the backing memory store of the VM.

Then the circuit on the other side of the cycle looks like:

  1. Input prior curve 2 state, new curve 2 state, prior accumulator over curve 1, new accumulator over curve 1, and the prior curve 1 proof
  2. Check that the prior curve 2 state and new curve 2 state are consistent with the “program”
  3. Check the accumulation of prior curve 1 proof into the prior accumulator to get the new accumulator

Note that rather than accumulating one IVC step, the VM could accumulate several, although this may defeat the purpose if the IVC steps are maximal size and always branch after every IVC step (since otherwise we may as well make each IVC step larger on the other side of the cycle)

We’ve left some things undefined here: “program” and “curve 2 state”. It’s slightly less clear what these should mean, but we can propose some options. A program is a (possibly Merkelized) sequence of Halo2 circuit verifying keys (which because of standardization, are essentially just commitments to the constraints).

The “curve 2 state” could consist of: 1. a program, 2. a program counter, and 3. commitment to the curve 1 state. Then an acceptable state transition is one that checks that the program counter is updated properly (incremented or branched, etc), that the curve 1 proof that is accumulated is of the circuit at the proper place in the program, and that the proper prior/new curve 1 state are in the public input of the IVC step. This state could also include curve 2 native field arithmetic operations, if the VM implements this field arithmetic by deferring it to the native circuit over curve 2. Dynamically loading new programs is also possible.

There were several things oversimplified or ignored in the above. One is the exact details of the accumulation, and it’s assumed that the circuit “infrastructure” exists to do these things. Another is “communication” between curve 1 and 2. The most notable is how to update the program counter using a result from the IVC step (which is the natural way branching works). It would be rather subtle to take curve 1 data (the new PC value) and convert to curve 2 data (where the PC is stored), although this problem is not directly related to IVC step size (unless e.g. a non-circuit friendly hash function is used for this purpose, in which case the high fixed cost of the hash motivates larger IVC step size)

1 Like

Have we considered using Jolt as our zkVM? In particular, it addresses the problem of branching, i.e. branching does not make the circuit bigger, and it is apparently more performant than Halo2 (I haven’t checked it myself). The paradigm of mostly lookups is also appealing from an engineering and auditability perspective: it is easier and less error-prone to implement. I share below some overview of Jolt and Lasso:

Jolt is a frontend technique that applies to any instruction set architecture (ISA). It consists roughly of producing circuits that only perform lookups. In particular, the circuits output by Jolt perform lookups into a table of size more than 2^{128} that depends only on the ISA. The trick is that this table is never materialised. Most values in these tables will never be accessed, they exist only conceptually.

In Jolt, the complexity of the prover depends on the size of the inputs to each instruction, as long as they are decomposable. Decomposability means “that one can evaluate the instruction on a given pair of inputs (x, y) by breaking x and y up into smaller chunks, evaluating a small number of functions of each chunk, and combining the results.”. Arguably, decomposibility can be satisified for all instructions in an ISA.

Lasso, on the other hand, is a lookup argument that checks the validity of these lookups.

Jolt transforms a witness-checking program into an equivalent circuit-satifiability instance that performs only lookups, Lasso allows the prover to establish that it knows a satisfying assignment to the circuit by performing a lookup argument.

Another way of seeing this is that Jolt applies Lasso to an entire instruction set. The key insight is that most instructions can be computed by composing lookups from a handful of small tables.

For example, to verify an AND opcode on two 64-bit words, one doesn’t need a gigatable. Instead, one can split each word in 8 bytes, and apply 8 bitwise ANDs, on each pair of bytes. This works, because in an AND opcode the bit positions are independent.

We can represent any other opcode in a table: bitwise OR, bitwise XOR, ADD, SUB etc. Each opcode will have a different corresponding table

In a general sense, front-ends produce a circuit that, for each step of the computation:

  • (1) identifies what instruction to execute at that step, and then
  • (2) executes that instruction.

Lasso lets one replace (2) with a single lookup.

Benefits of using Jolt/Lasso

  • Branching doesn’t make the circuit bigger.
  • Performance
    • The basic idea behind Jolt+Lasso is to verify an execution trace by using table lookups. These are also implemented using polynomials, but simpler ones than those used by Plonk or HALO2, leading to a 10x speedup or more for the Prover. Unfortunately this comes at a cost of a 2x delay for the Verifier.
  • Better and more accessible developer experience
    • Prior approaches to SNARK design involve formulating and hand-optimizing CPU instructions as circuits – a low-level and security-critical task requiring niche expertise in a domain-specific language.
  • Easier auditability

@joe Some questions:

I think this is where we are still seeing 2-3x proving time overhead of VMs over the equivalent statically compiled circuit, when (IMHO) it should be possible to get this below 1.5x and possibly lower.

Where do you get these numbers from?

“zkVM registers” would always have at least some runtime overhead if implemented in a completely generic way, whereas a hand-coded Halo2 circuit can completely avoid this.

Can you point to this reference?

zkVM should get instruction decoding almost freely if much of it is done outside the circuit

Why is this?

“So we must not only alternate sides of the curve (checking state transition on one side, then accumulate the proof of correctness of the transition on the other side) but then also accumulate on both sides (since the proof of correct accumulation itself needs to be accumulated on the other side).”

I can’t see this and the rest of the writeup becomes a bit confusing without understanding this