Sketch of a possible VampIR VM

I was basing it on these benchmarks for Miden vs Plonk

I mean, if you have some kind of “instruction” which is dynamically decoded by the circuit, where the inputs and outputs are copied into or out of a “register”, this seems like it would always incur overhead compared to placing two Halo2 chips with relative references. This effect would be magnified a lot if the register has to be copied between circuits, for example if a circuit only executes a single instruction. Maybe this is not really a significant factor, but it would always exist.

If the VM works in a recursive-verifying cycle, executing one or more instructions on one side of the cycle and incrementally/recursively verifying/accumulating on the other side, then “instruction decoding” (or, more generally, deciding what to execute next) has to happen somewhere. But if the execution circuit is “specialized” - by which I mean it is not generic, but hardcoded to perform a particular operation - then the recursive verifier checking that the execution circuit is the correct one means there doesn’t have to be any instruction decoding inside the execution circuit. Instead, the VM compiles (out of circuit) one or more instructions into the specialized/hardcoded execution circuit and provides the correct verifying key for that circuit.

In order to get true IVC/recursive verification, since we have proofs on both sides of the cycle, we have to accumulate both sides of the cycle as well. Otherwise, if we only accumulate on one side, we still need to verify the entire history of proofs from the other side. Imagine that we only accumulate execution-circuit proofs with the verifier circuit: at each stage we have only one execution-circuit accumulator, but in order to trust that this accumulator was created properly, we need to check every verifier-circuit proof since the beginning of execution, which doesn’t give us the asymptotic efficiency we wanted. So the execution circuit also has to accumulate the verifier proofs, so that checking just the two accumulators assures us of the entire history.
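A minimal sketch of that data flow, with the actual accumulation logic elided and all types hypothetical (this is not the Halo2 API), might look like:

```rust
/// Accumulator carried on the "execution" side of the curve cycle.
#[derive(Clone, Default)]
struct ExecAccumulator(Vec<u8>);

/// Accumulator carried on the "verifier" side of the curve cycle.
#[derive(Clone, Default)]
struct VerifierAccumulator(Vec<u8>);

/// One round trip of the cycle: each side folds in the most recent proof
/// produced by the other side, so at any point checking just the two
/// accumulators (plus the latest proofs) vouches for the whole history.
fn cycle_step(
    exec_acc: ExecAccumulator,
    verifier_acc: VerifierAccumulator,
    exec_proof: &[u8],
    verifier_proof: &[u8],
) -> (ExecAccumulator, VerifierAccumulator) {
    // The verifier circuit accumulates the new execution proof ...
    let mut v = verifier_acc.0;
    v.extend_from_slice(exec_proof);
    // ... and the execution circuit accumulates the new verifier proof,
    // so neither side's proof history ever needs to be replayed.
    let mut e = exec_acc.0;
    e.extend_from_slice(verifier_proof);
    (ExecAccumulator(e), VerifierAccumulator(v))
}
```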

1 Like

I am working through a concrete example of a statement we could prove in this VM and comparing the Jolt and Halo2-style approaches.

Jolt uses “offline memory checking” to load programs. This is how it avoids the usual cost of branching in circuit. There is a “program counter” which indexes into the program code to retrieve instructions. The program counter itself can be manipulated by these instructions for the next cycle. The lookup table (via Lasso) is a separate element that only computes basic functions with no branching or introspection. So I think this is essentially the distinction @joe was making with “runtime” vs “execution”.
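To make that runtime/execution split concrete, here is a toy fetch/execute loop (not Jolt’s actual implementation; the opcode names and types are made up): the program counter indexes into the program and may be set by instructions, while the per-opcode work is a pure function that in Jolt would be a Lasso lookup rather than a match statement.

```rust
/// Toy instruction: an opcode plus register operands.
#[derive(Clone, Copy)]
enum Op {
    Add,
    Mul,
    JumpIfZero(usize),
}

#[derive(Clone, Copy)]
struct Instr {
    op: Op,
    src1: usize,
    src2: usize,
    dst: usize,
}

/// "Runtime": fetch the instruction at the program counter and update the PC.
/// "Execution": the per-opcode work, which here is a match but in Jolt would
/// be a lookup into a huge, implicitly encoded table.
fn run(program: &[Instr], regs: &mut [u64]) {
    let mut pc = 0usize;
    while pc < program.len() {
        let instr = program[pc]; // offline memory checking models this fetch
        let (a, b) = (regs[instr.src1], regs[instr.src2]);
        pc += 1; // default PC update; instructions may override it below
        match instr.op {
            Op::Add => regs[instr.dst] = a.wrapping_add(b),
            Op::Mul => regs[instr.dst] = a.wrapping_mul(b),
            Op::JumpIfZero(target) => {
                // uses src1 as the condition
                if a == 0 {
                    pc = target;
                }
            }
        }
    }
}
```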

I want to compare this to Halo2-style recursion. In my approach there are inner and outer proofs. The inner proofs handle “execution” blocks with no branching. The outer proof gets a transcript of branch paths taken between these blocks (which amount to applying conditionals to public outputs of the blocks) and all of the inner proofs. The outer proof accumulates the inner proofs using the verifying keys selected by booleans in the branch path transcript.

My problem is that I am not familiar enough with the Halo2 accumulation process and have made some assumptions I’m not sure are correct. Are verifying keys accumulated homomorphically “in-circuit”? Can I use a witness in a circuit to zero out the VK of the unused branch?

1 Like

I added some notes on Jolt/Lasso taken at the zksummit here

I like some of the ideas that Lasso/Jolt is bringing here. I think the Lasso/Jolt approach is trying to solve many of the same problems, but in a completely different way. It’s a little hard to work through some of the details, but I also don’t think Lasso/Jolt is mutually exclusive with some of the ideas we’ve discussed.

It seems to me like the way Jolt avoids a branching penalty is basically by making the execution operation so generic that regardless of the branch result, the operation is still basically a lookup. So the size of the resulting circuit is probably close to optimal for the execution trace of the program. This is a pretty clever approach, which only works because Lasso can implicitly encode such a large table.

My thoughts:

  1. Lasso/Jolt use R1CS, but I see no reason why the same decomposable lookup technique cannot also apply to plonkish/halo2 circuits (of course this would probably require re-engineering of the protocol). This could maybe preserve some of the plonkish benefits while also getting the Lasso benefits. That being said, there is probably an objection that if you use Lasso, you don’t need the benefits of plonk as much.
  2. Similarly, I don’t think there is any incompatibility between the recursive approach I described above and Lasso/Jolt; except that in general it’s probably better to put the entire “program” into one Lasso/Jolt circuit and so the benefit of “chunking” is probably negligible. Therefore recursive verification would probably be reserved for much higher level operations like providing function privacy, rollups of many programs, etc.
  3. Fundamentally, I think Jolt is designed around a more traditional instruction set architecture like risc-v or similar. In this way the lookup approach is probably far superior to anything else, definitely better than a VM that doesn’t use this kind of trick. The Lasso benchmarks which show much better proving time than Halo2 on 2^20 lookups are pretty impressive. It’s not as clear to me that it’s absolutely superior in every case - for example, that it would beat the Taiga or Orchard circuits, because those are so optimized for native scalar arithmetic. But if the programs are much more generic, especially something not designed for circuits (e.g. rust code, risc-v, webassembly, etc) then it would be difficult to do better.

So in conclusion, I think it depends on the nature of the programs. If they’re very “arithmetic heavy” then maybe the advantages are not quite as significant. If they’re very “lookup heavy” then the Jolt approach seems great.

I think this is a fair assumption for all reasonable ISAs. It’s not usual to have such unstructured instructions.

I think what happens with Jolt is that the “runtime” and “execution” are more decoupled, because the “runtime” operations (like changing the program counter) don’t actually change the execution which is now just a lookup. So the “runtime overhead” which I was concerned about before is really minimal.

I was thinking that the inner proofs are accumulated homomorphically in the outer circuit (which requires all the verifying keys for the inner circuits to perform the recursive accumulation “verifier” step). The outer circuit would handle branches, but there’s no “branching penalty” because each outcome of the branch invokes the same “functional unit” (a recursive verifier) so you only ever need exactly as many verifiers as the maximum number of blocks/chunks of the execution trace. With each branch, we just conditionally select the VK based on the result of the branch.
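For concreteness, here is a rough sketch of what I mean, with hypothetical placeholder types rather than real Halo2 accumulation calls: the outer circuit walks the branch-path transcript, selects the verifying key of the branch actually taken, and feeds it to the same verifier/accumulator “functional unit” each time.

```rust
/// Hypothetical placeholder for a verifying key (or a commitment to one).
#[derive(Clone)]
struct VerifyingKey([u8; 32]);

/// Hypothetical running accumulator over all inner proofs.
#[derive(Default)]
struct Accumulator(Vec<u8>);

/// In-circuit this would be an algebraic select over VK commitments, e.g.
/// b * vk_true + (1 - b) * vk_false; here it is just a plain conditional.
fn select_vk(taken: bool, vk_true: &VerifyingKey, vk_false: &VerifyingKey) -> VerifyingKey {
    if taken { vk_true.clone() } else { vk_false.clone() }
}

/// Placeholder for the accumulation-verifier step on one inner proof.
fn accumulate(acc: &mut Accumulator, vk: &VerifyingKey, inner_proof: &[u8]) {
    acc.0.extend_from_slice(&vk.0);
    acc.0.extend_from_slice(inner_proof);
}

/// One verifier invocation per block: the VK is chosen by the branch boolean,
/// so the number of verifier instances equals the number of blocks in the
/// trace, independent of how many branch outcomes exist.
fn accumulate_trace(
    branch_path: &[bool],
    branch_vks: &[(VerifyingKey, VerifyingKey)],
    inner_proofs: &[Vec<u8>],
) -> Accumulator {
    let mut acc = Accumulator::default();
    for ((taken, (vk_t, vk_f)), proof) in branch_path.iter().zip(branch_vks).zip(inner_proofs) {
        let vk = select_vk(*taken, vk_t, vk_f);
        accumulate(&mut acc, &vk, proof);
    }
    acc
}
```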

3 Likes

After reviewing the Vamp-IR report again, I think there might be a few separate goals going on. Assuming that one of them is creating an interface between high level programming languages and proof system backends (hence the “intermediate”), the initial idea was to statically compile to the backend. This presented a few somewhat related problems:

  1. High cost of branching
  2. Difficulty of metaprogramming (e.g. loading new code)

In theory, a “VM” of some kind can solve these problems. In this case, I will define a VM as anything that takes as input (in any form) the description of the computation to be done and guarantees the execution (or verification) of that computation or state transition (this definition might be quite different from some others).

The formal distinction between a VM and (for example) incrementally verifiable computation (IVC) or proof carrying data (PCD) can get blurred here quite a bit, particularly if IVC is used to help create a VM (e.g. by using an accumulation scheme to achieve function privacy, to dynamically take branches, load new circuits, etc). But fundamentally, while a VM might use IVC as a primitive, a VM must additionally perform at least some level of processing of the program/IR (whereas an IVC might solely use a static program).

Therefore, I suppose the direct goal is simply, “can there be a zk VM that takes as input Vamp-IR to execute?” This goal is a little bit vague by itself because we could be referring to the initial IR, the 3AC, or some yet to exist IR. Since the “backend” (the VM) and the IR can be codesigned, it probably makes more sense to optimize the IR simultaneously for the VM.

This goal is slightly incompatible with my “chunking” optimization above, because the “IR” that my design uses is a sequence of halo2 verifying keys (hardly an “intermediate” representation at all; this “IR” sits completely at the end of the backend). This could possibly be improved a bit if the relationship between the IR and the verifying keys is well defined, but in general the design is not really focused on the above goal of a “VM for Vamp-IR”.

The lasso+jolt method, on the other hand, is probably much more suited towards this goal. Since the VM circuit is actually “decoding” the IR, the IR can actually look a lot more like the Vamp-IR and also be more agnostic of the ultimate backend. The IR would probably look more like a standard instruction set architecture (although it might be far more suitable than LLVM or risc-v).

So assuming the goal is achieved and there is a VM that runs Vamp-IR, then the question turns to what uses Vamp-IR. Since the idea is that people can write Vamp-IR directly, or write in a higher level language, the question I have is: what does the HLL need from the IR? And which of these requirements propagate to the VM? If in general a HLL needs the full complexity of (say) LLVM or WebAssembly, then it would present huge difficulties for the IR and VM (and also make it very unclear whether this VM has any advantage over RISC0). On the other hand, if the IR is much simpler, does this impose any meaningful restriction on the HLLs as well, and if so, is it enough of a problem that people might prefer to use existing (or otherwise less “restricted”) HLLs?

To clarify, really the only requirement we have from the HLL beyond what current VampIR can provide is efficient branching (and efficient function calls, which can probably be provided in a very similar way). Juvix/GEB can already compile to VampIR - it’s just too inefficient to unroll everything and evaluate all branches.

In the interest of making the problem statement as concrete as possible, let’s avoid the use of “VM” entirely (ignoring my above post).

The immediate problem statement seems to be: branching is extremely expensive in circuits, how can we introduce branching into the IR without sacrificing performance (measured by circuit size)? This may not require an extremely sophisticated solution. Indeed, let’s assume that it’s preferable to keep everything statically compiled. Within the same compiler stack, a lot of assumptions can be made (e.g. regarding how halo2 verification keys are handled, etc)

What I imagine the direct problem to be is a “for loop” type construct (which is a problem in traditional compilers and CPUs as well) - there is some amount of code that runs some number of times in sequence. Let’s assume the code is nontrivial enough and runs enough iterations that it’s not feasible to unroll the loop entirely. This might be an ideal situation for “chunking” (since each chunk is presumably big enough to put in its own circuit).

However, maybe this can still be handled even more “statically” than having a “runtime” that selects “chunks”. The idea would be that branches can always be thought of as selecting one of multiple circuits to execute “next”, so maybe the compiler should compile the verifying keys of the two possible “paths” directly into the “circuit program” (note that this doesn’t use double the circuit space, because there would be only one “verifier” and then some constraints to conditionally select the “verifying key” that is used).

Now, there is a problem here because of switching curve cycles (e.g. the “program circuit” cannot recursively verify directly the possible branches). Instead the program circuit must select and commit to the verifying key and pass it to the “true” verifier on the other side of the curve cycle. But unlike a generic runtime, this verifier does not execute a generic IR but merely the specific choice of branch circuits that the program selects. In this way, the “program circuit” behaves as the true “program” and the verifier circuit is merely a primitive used to call some subcircuits.
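Roughly, the hand-off could look like the following sketch (all names are hypothetical, and the real verification is hidden behind a callback): the program circuit’s only job is to output a commitment to the chosen branch VK plus that branch’s public inputs, and the verifier on the other curve checks the branch proof against exactly that commitment.

```rust
/// Commitment to the verifying key of the branch circuit chosen by the program.
#[derive(Clone)]
struct VkCommitment([u8; 32]);

/// Public output of the "program circuit": which branch circuit must be
/// verified next, expressed only as a committed VK plus that branch's
/// public inputs. No verification happens on this side of the cycle.
struct ProgramOutput {
    selected_vk: VkCommitment,
    branch_public_inputs: Vec<u64>,
}

/// Out-of-circuit driver: the prover hands the committed VK, the branch's
/// public inputs and its proof to the verifier circuit on the other curve,
/// which checks the proof against exactly the VK the program committed to.
fn drive_branch(
    out: &ProgramOutput,
    branch_proof: &[u8],
    verify_on_other_curve: impl Fn(&VkCommitment, &[u64], &[u8]) -> bool,
) -> bool {
    verify_on_other_curve(
        &out.selected_vk,
        out.branch_public_inputs.as_slice(),
        branch_proof,
    )
}
```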

I sense this is probably much simpler and more efficient than chunking the whole program (rather, it better captures how branching usually works in programs, with smaller segments of the program within if/for/while loops). For example, the “top-level” program doesn’t always end at a branch; the circuit contains all the pre-branch and post-branch logic, and merely defers the code that would otherwise be unrolled to elsewhere.

To summarize, if the ultimate goal is simply to avoid the cost of unrolling loops or other branchings, then the simplest approach is probably not to design a VM at all, but rather to just use IVC (or something else, e.g. lasso, jolt, etc) directly as a primitive to substitute for unrolling each branch.

3 Likes

I agree that IVC (Incrementally Verifiable Computation) is what we want, and that a ZKVM is a particular instance of IVC:


where z_0 is the initial state and z_n is the final state, and the tuple (C, V) is the instruction/opcode executed. In our state machine we want to prove that, for the state transition function F, z_i = F^{(i)}(z_0).

This particular diagram is from Nova, a recent folding scheme. All other descriptions of IVC (from Valiant08 to Halo2) are very similar conceptually (they embed the verifier circuit V in the circuit C), and progress has been made in the direction of removing the cost of computing each step in the chain, via an accumulator or otherwise, and generally delaying the work of the verifier to the last step. Folding schemes are a type of recursive scheme that allows us to create non-monolithic SNARKs. They are efficient schemes that realise IVC while avoiding the creation of a SNARK at each of these intermediary steps.
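As a non-cryptographic sketch of this shape (the folding below is a stand-in, not Nova’s actual scheme), the prover iterates the step function F and folds each step’s claim into one running instance, so only the final accumulator and the last step need checking:

```rust
/// Stand-in for a folded/running instance; a real scheme keeps this
/// constant-size instead of growing it.
#[derive(Clone, Default)]
struct RunningInstance(Vec<u64>);

/// Placeholder fold: absorb the claim "F(z_i) = z_{i+1}" into the accumulator.
fn fold(acc: RunningInstance, step_claim: (u64, u64)) -> RunningInstance {
    let mut v = acc.0;
    v.push(step_claim.0);
    v.push(step_claim.1);
    RunningInstance(v)
}

/// Prove-and-fold loop for the statement z_n = F^{(n)}(z_0).
fn ivc<F: Fn(u64) -> u64>(f: F, z_0: u64, n: usize) -> (u64, RunningInstance) {
    let mut z = z_0;
    let mut acc = RunningInstance::default();
    for _ in 0..n {
        let z_next = f(z);            // one application of the step function F
        acc = fold(acc, (z, z_next)); // fold the claim for this step
        z = z_next;
    }
    (z, acc)
}
```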

However, IVC (understood as a single function that gets executed repeatedly) is not enough to realise our goal. In a ZKVM with multiple instructions, this would mean that the cost of each step is linear in the total size of all instructions. In other words, how do we choose which instruction to execute next efficiently? Similarly, how do we make sure that branching doesn’t blow up the size of the circuit? How can we avoid embedding every single branch in the circuit?

Fortunately, SuperNova introduced/popularised the concept of Non-uniform IVC (NIVC). This means that instead of one step function F you have a bunch of step functions \{F_i\} and you choose one at each step. The size of the circuit and the cost of computation are linear only in the F_i that gets executed.

\phi is a function that takes past information (\omega_{i-1}, z_{i-1}) and selects the next index i, and thus which C_i to execute.
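Schematically (again only a sketch with toy types, not a real folding implementation), NIVC replaces the single F with a family of step functions plus the selector \phi:

```rust
/// A family of step functions {F_i}; in a real NIVC scheme each would be a
/// separate circuit C_i with its own cost contribution per step.
type Step = fn(u64) -> u64;

/// phi: (advice w_{i-1}, state z_{i-1}) -> index of the next step function.
/// Toy selector that just branches on one bit; assumes at least two steps.
fn phi(advice: u64, z: u64) -> usize {
    ((advice ^ z) & 1) as usize
}

/// Each iteration runs (and, in a real scheme, folds) only the selected F_i,
/// so the per-step cost tracks that F_i alone, not the whole instruction set.
fn nivc(steps: &[Step], z_0: u64, advice: &[u64]) -> u64 {
    let mut z = z_0;
    for &w in advice {
        let i = phi(w, z); // choose which circuit C_i / function F_i to run
        z = steps[i](z);   // only F_i is "paid for" at this step
    }
    z
}
```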

This is exactly what we need for our goal and in fact, the branching problem of ZKVMs was the main motivating factor for this feature, as explained in this slide about the “Limitations of State-of-the-Arts” by the Protostar authors (another folding scheme).

Basically, NIVC allows us to select the opcode/branch at runtime, with the circuit proportional only to the size of the selected opcode or branch.

What work has been done in this regard?

In Nova, you have to embed all of the circuits for every opcode or branch in the recursive circuit. I believe this situation is similar to Halo2, i.e. neither of these can do NIVC.

Both Protostar (and its continuation work Protogalaxy) and SuperNova (and its continuation work HyperNova) realise Non-uniform IVC, and thus our purpose. As mentioned in previous posts, Jolt/Lasso also suits our goal. They all provide an “a la carte” cost profile, i.e. you pay only for what is executed.

Protostar has the advantage of being Plonk-friendly, and it was conceived with that goal in mind (as seen in the slide above), which may make it easily adaptable to Halo2. It also realises Non-uniform PCD, which proves circuits arranged in a DAG instead of sequentially as IVC does. This may be an avenue to explore.

2 Likes

I agree with the problem statement by @joe and the analysis by @alberto that we likely want to use some kind of NIVC scheme.

One question that comes to my mind as a non-expert in cryptography and compilers:

What are the tradeoffs between flexibility and performance/memory usage with regard to circuit size? I assume if we break it down to single operations (e.g. one addition) overhead will dominate.

For example, if we look at the AST of the Euclidean algorithm:

turning the two compare and the two assign blocks into chunks seems like a natural choice, but would it be (approximately) the “best” one?
Should/can the compiler be aware of the balance between circuit size/number and overhead and choose dynamic groupings accordingly?
Would developers need to provide annotations, or could they specify requirements for the compiler?

1 Like

This problem is discussed in section 2.3.1 of the Cairo paper

In order to design an efficient instruction set, one has to understand what property should be optimized. A reasonable measure is the expectation of the number of trace cells an average program (written optimally in the said instruction set) uses. This is an informal measure because, to be accurate, it would require knowledge of the distribution of the programs, and require that each of those programs is written in the most efficient way in all of the compared instruction sets. Nevertheless, one can still use this definition as a guideline for many decisions throughout the design.

And it uses the as_bool instruction as an example of the kind of decision that may appear in the design of the VM/state machine/IVC instance.

As an example, take two instruction sets, A and B. The instruction set A has an instruction “as_bool” that computes the expression “1 if x \neq 0, otherwise 0”. The instruction set B is an identical instruction set, except that “as_bool” is missing. Say that the cost, in trace cells, of executing a single step in a CPU based on instruction set A is a (for simplicity we assume all instructions cost the same number of trace cells), and that the cost of a single step when using instruction set B is b (where a > b due to the complexity of adding the additional instruction). On the other hand, a certain program may require k_A steps if written using instruction set A and k_B steps if written using instruction set B (here k_B \geq k_A since every time the program needs to use “as_bool” it might require more than a single instruction from instruction set B). If a \cdot k_A < b \cdot k_B, instruction set A is better for that program, and if a \cdot k_A > b \cdot k_B, instruction set B is better. When one decides whether to include an instruction or not, they should consider the additional cost per step (a/b) against the additional steps (k_B/k_A).
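As a quick worked instance of that comparison (the numbers below are made up purely for illustration):

```rust
/// Instruction set A is preferable for a given program exactly when
/// a * k_A < b * k_B (per-step cost times number of steps).
fn prefer_a(a: u64, k_a: u64, b: u64, k_b: u64) -> bool {
    a * k_a < b * k_b
}

fn main() {
    // Made-up numbers: with `as_bool`, a step costs a = 55 trace cells and the
    // program needs k_A = 1_000 steps; without it, b = 50 cells per step but
    // k_B = 1_200 steps. Then 55 * 1_000 = 55_000 < 50 * 1_200 = 60_000,
    // so instruction set A wins for this program.
    assert!(prefer_a(55, 1_000, 50, 1_200));
}
```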

This is the general approach in the ZKVM design space, and the common belief is that “simpler instruction sets lead to faster ZKVMs”. There are operations, such as comparing elements or range checks, that may not be efficient as instructions and are instead provided as builtins.

However, Jolt/Lasso bring a radical approach to this design space and in their paradigm, this question is not that relevant, since you can make the matrix encoding all the operations as big as you want. This is efficient because this matrix is not materialised and can be decomposed into smaller ones.
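A toy illustration of the decomposition idea (not Lasso’s actual protocol): a table for 32-bit AND would nominally have 2^{64} entries, but it can be evaluated with four lookups into a single 2^{16}-entry subtable over 8-bit chunks, so the huge table is never materialised.

```rust
/// Materialise only the small subtable: AND of two 8-bit values, indexed by
/// the 16-bit concatenation (a-chunk in the high byte, b-chunk in the low byte).
fn subtable() -> Vec<u8> {
    let mut t = vec![0u8; 1 << 16];
    for x in 0u16..=u16::MAX {
        t[x as usize] = ((x >> 8) as u8) & (x as u8);
    }
    t
}

/// Evaluate the "huge" 32-bit AND table via four lookups into the subtable.
fn and32_via_lookups(t: &[u8], a: u32, b: u32) -> u32 {
    (0..4).fold(0u32, |acc, i| {
        let (ai, bi) = (((a >> (8 * i)) & 0xff) as u16, ((b >> (8 * i)) & 0xff) as u16);
        let idx = ((ai << 8) | bi) as usize; // index into the 2^16-entry subtable
        acc | ((t[idx] as u32) << (8 * i))
    })
}

fn main() {
    let t = subtable();
    assert_eq!(
        and32_via_lookups(&t, 0xdead_beef, 0x0f0f_0f0f),
        0xdead_beef & 0x0f0f_0f0f,
    );
}
```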

From their presentation:

1 Like

Yes, if we can make circuits for “atomic” operations composable, or decompose large matrix representations of full circuits to generate efficient representations of programs dynamically, that would be ideal!

This is an important long-term question, but for the short term I think pragmas (e.g. in Juvix, and perhaps in the intermediate representation) to indicate how programs should be split up is totally reasonable.

But would this still apply when using the non-uniform IVC approach? Suppose that each of these instructions is a different non-deterministic function - as you mention, the size of the circuit and cost of computation are linear only in the F_i that gets executed - so more instructions shouldn’t matter, right? At most the final circuit would need a reference table or something, I imagine?

2 Likes

Yes, that’s true. Cairo operates within the old, pre-NIVC paradigm, so this doesn’t apply any more. Thanks for pointing that out!

1 Like

From talking with one of the guys implementing Lasso, it makes sense to implement Lasso as a backend to Halo2 (for lookups), not the other way around. This could be a good compromise for us: we keep the existing work on Halo2 while meeting our goal of solving the branching problem through the way Lasso does lookups.

I’m not sure if I’m thinking of the same thing, but what I would hope is that - by statically compiling as much as possible - the importance of the specific “instruction set” is diminished, because most execution is simply plonk-like circuits with chips/functional units that are directly laid out and connected by the compiler, and there’s no need to “decode” instructions (for the most part) because this is done statically by the compiler. Restricting the discussion to branching only: if the NIVC approach is applied, ideally each NIVC step is only instantiated when branching occurs. Although I’m not sure whether, in the NIVC approach, a large difference in circuit size between non-deterministic steps is a problem.

1 Like