From reading the paper and listening to Justin Thaler explain Lasso in the ZK Study Club talk, Lasso can be viewed as a component of a snark that provides efficient lookups.
A little history and comparison of snark lookups as I understand them (corrections or clarifications gladly accepted):
Plookup
Plookup is a way of adding a custom gate that provides lookup functionality into a Plonkish system. The whole lookup table is embedded into the circuit, making the circuit as large as the lookup table. Because of this, only small lookup tables can be used efficiently. A common choice is to use a table of length 2^16 to represent a binary relation on two 8-bit inputs. To simulate lookups of 256-bit inputs, each input would be unpacked into 32 8-bit limbs, 32 lookups would be performed, and the results would be packed back together using arithmetic gates into a 256-bit output. For an 256-bit XOR table (which has no carry bits to worry about) the packing of the results could cost up to 31 arithmetic gates in width-3 Plonk; fewer in systems with wider gates. Importantly, these gates are (usually?) multilinear.
Caulk, cq, etc
Caulk, cq, etc. separate the lookup table from the circuit itself, adding additional phases to proving and verifying. You could view a snark using one of these systems as two composed snarks: one verifies the arithmetic performed on private witnesses and lookup queries, and the other verifies that the lookup queries themselves were contained in the table. Homomorphic commitments to the table allow for some compression of the table, so that larger tables can be used. Caulk and cq use an additional pairing check for verifying lookup queries.
Lasso
Lasso also separates the lookup table from the circuit itself and adds phases to the proving and verifying algorithms like Caulk and cq, but rather than using an additional pairing check, it uses a sum-check protocol to verify lookup queries. By exploiting certain homomorphic properties of multilinear extensions that play well with the sum-check protocol, and by requiring that the lookup table is structured nicely, the lookup queries as well as the linear arithmetic needed to unpack/repack can be compressed together. This compression allows a single “lookup gate” to be represented by only 3\cdot c field elements. The structured lookup tables can be decomposed into c tables of size N^{1/c}, so the prover commits additionally to c\cdot N^{1/c} field elements, for a total of 3\cdot c \cdot m + c\cdot N^{1/c} for m lookups into a structured table of size N.
In Plonk, I could create a width c+1 circuit with two custom gates: one that packs/unpacks c “limbs” together and one that performs c lookups into c tables of size N^{1/c}. Furthermore, I think I could design the “packing” custom gate to accept selectors which would turn it into an addition gate or multiplication gate mod N = 2^n pretty easily. In this version of Plonk, addition mod N would cost one constraint, and XOR on n bits would cost one constraint, as would multiplication mod N, AND on n-bits, and even bit rotation. This is how I think about Lasso—lookups, packing, and simple limb-wise arithmetic are available in each constraint, so that in 3 rows you can do, say, 256-bit XOR or addition mod 2^256 etc.
In Plonk there would be a big cost to doing this because the tables would need to be represented in circuit and the prover would need to commit to vectors the same length as the table, and there is a lot of additional overhead for the lookup argument on top of the regular arithmetic gates. Lasso’s lookup argument requires much less overhead which makes this approach attractive.