Context
I’ve been running some load tests on the Anoma node, and I have come to some disappointing conclusions. The load test was run on a MacBook Pro M4. A summary of the load test is:
- A single node can ingest at most 1000 transactions per second before it horribly breaks down. And by breaking down I mean: it crashes, and we cannot throttle anything.
- The event broker system that relays all sorts of events within the node can deal with about 15,000 subscriptions per second. Why this is relevant will be clear in a minute.
Current design
The transaction execution is not my main responsibility within the engineering team, so I might be off the mark on some things. But I am good at Elixir and system design, so I will summarize my findings.
When a transaction is submitted into the node, the following things happen.
- The transaction is put into the `Mempool` engine’s state (`&Mempool.handle_tx/2`).
- The transaction is launched (`&Backends.execute/2`). Launching a transaction means that a new Erlang process P is created that will execute this transaction. During execution, a transaction can read values from storage. This is the tricky bit (a rough sketch of the launch step follows this list).
- The transaction eventually completes, and the process P is terminated.
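To make the launch step concrete, here is a minimal sketch of what "one Erlang process per transaction" can look like. Only `Backends.execute/2` is taken from the description above; the supervisor name and the shape of the arguments are assumptions for illustration, not the node’s actual API.

```elixir
# Sketch only: launch each transaction as its own supervised process P.
# `Backends.execute/2` is named above; everything else here is illustrative.
defmodule LaunchSketch do
  def launch(tx, env) do
    # One process per transaction; supervision isolates crashes and
    # gives us something to shut down later.
    Task.Supervisor.start_child(Anoma.TransactionTaskSupervisor, fn ->
      Backends.execute(tx, env)
    end)
  end
end
```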
The part of this whole flow I want to discuss is the reading from storage. Reading from storage is a blocking operation: reading a key at a specific time can cause the transaction to block for an indefinite period of time.
The process P can read from storage using `Ordering.read({time, key})`.
To avoid blocking the `Ordering` process while it waits for a write, a new process P_{wait} is created that will wait for the key to become readable.
If the key is already present in a previous block (or transaction? I got confused), then P_{wait} calls the `Storage` engine using `Storage.read/2`. `Storage` in turn will create a new process P_{wait2}, to avoid blocking the engine.
Waiting for a write is done by means of a subscription to the event broker. P_{wait2} will analyze specific events to determine if the key it’s waiting for has been written.
If the key is in the past and available, then `Storage` immediately returns the value.
When P_{wait2} receives the value, it synchronizes P_{wait}, which in turn synchronizes P.
The mechanism of synchronization here is the return of a `call`: P called P_{wait}, which in turn called P_{wait2}.
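In OTP terms, the chain described above looks roughly like the sketch below: a blocking `GenServer.call` from P, answered later via `GenServer.reply/2` by a task that does the actual waiting. This is a simplification under the assumption that the engines behave like GenServers; the module and message names are placeholders, not the actual Anoma code.

```elixir
# Simplified sketch of the current design: P blocks on a call, while Ordering
# offloads the waiting to a task (P_wait) so the engine itself stays free.
defmodule OrderingSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)
  def init(state), do: {:ok, state}

  # Called by the transaction process P; blocks until the key is readable.
  def read(server, {time, key}), do: GenServer.call(server, {:read, time, key}, :infinity)

  def handle_call({:read, _time, key}, from, state) do
    # P_wait: a task that does the (possibly indefinite) waiting, then answers P.
    Task.start(fn ->
      # Stand-in for Storage.read/2 and P_wait2's event-broker subscription:
      # idle until someone tells us the key has been written.
      receive do
        {:key_written, ^key, value} -> GenServer.reply(from, value)
      end
    end)

    {:noreply, state}
  end
end
```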
Note: I might be wrong here, and missing an implied invariant.
This design works, and it is elegant in the sense that it uses the computational model of Erlang to model a specific problem: idling processes model the concept of waiting for something, and `call`s keep the code synchronous.
I think there are two downsides to this, however.
1. Tasks are not great
The bookkeeping necessary to juggle a massive number of processes is easy enough with Supervisors, but Tasks are less elegant. A symptom of this is the fact that we can no longer shut down a node cleanly without causing processes to error out along the way.
The reason for this is simple: a process should ideally at all times be linked to and/or monitored by another process. Otherwise you can end up with orphan processes that just take up resources.
Compounding this is the fact that we use `call` as a synchronization mechanism. A process that `call`s another process monitors that called process. This is to make sure that a call to a process that crashes also crashes the caller. We want this.
However, if you have 3 processes which call each other A->B->C->A, you have a circle of processes that can never be shut down cleanly.
A call is synchronous, and returning from a call using an asynchronous task opens up the possibility of race conditions, deadlocks, and even livelocks.
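As a toy illustration of the cycle problem (not code from the node), three GenServers that `call` each other in a ring all end up blocked inside a call, none of them able to reply or shut down until the calls time out:

```elixir
# Toy example: A -> B -> C -> A. Each member synchronously calls its neighbour
# before it can reply, so once the ring closes, every process is stuck waiting.
defmodule RingMember do
  use GenServer

  def start_link({name, next}), do: GenServer.start_link(__MODULE__, next, name: name)
  def init(next), do: {:ok, next}

  def handle_call(:ping, _from, next) do
    # Blocks here until the neighbour replies, which it never will.
    {:reply, GenServer.call(next, :ping), next}
  end
end

# :a calls :b, :b calls :c, :c calls :a — the cycle never unwinds.
for pair <- [a: :b, b: :c, c: :a], do: RingMember.start_link(pair)

# GenServer.call(:a, :ping)  # hangs, then exits with a timeout after 5 seconds
```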
2. Modelling state as a process is not great
Because we use tasks to model the fact that some process P is waiting for an event E, we have intangible state. By this I mean that there is no way to measure/inspect/manipulate this state. We might count the number of processes somehow, but that’s about it.
Additionally, modeling the fact that some process P is waiting for an event E as a process costs more in terms of resources than other ways of modeling this.
Finally, because we model state as processes, there is no way for us to control the rate at which the state expands. If we wanted to constrain the system to only wait for X writes at a given point in time, we would have to start counting processes or something along those lines (sketched below).
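Concretely, "counting processes" is about the best we can do today. Assuming the waiting tasks run under a known task supervisor (the supervisor name here is an assumption), it would look something like:

```elixir
# Roughly the only handle on "how many reads are pending": count live tasks.
Task.Supervisor.children(Anoma.TransactionTaskSupervisor) |> length()
```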
My Proposal
Instead of modelling the fact that some process P is waiting for an event E as a process, I propose to model this as a simple data structure that keeps track of which process is waiting for which event. I know this sounds a lot like an event broker, but bear with me.
The reading of a yet-to-be-defined key should work as follows.
Transaction process P wants to read key k, which will be written when pigs can fly (i.e., a very long time).
P registers this intent (see what I did there) with the `Ordering` process.
The `Ordering` process keeps track of which process wants to be notified of which key in the form of a dictionary: `%{process_id => [key]}`. The `Ordering` process can manage a large number of these subscriptions easily.
The `Ordering` process subscribes to all events pertaining to writes. Each time a write event occurs, `Ordering` figures out if the event contains writes to keys that a process is waiting for. If that is the case, it sends an asynchronous message `{:your_key, key, value}` to that process.
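A minimal sketch of that bookkeeping, assuming write events reach `Ordering` as plain messages of the shape `{:write, key, value}` (the event shape and function names are illustrative, not the node’s actual API):

```elixir
defmodule OrderingProposalSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)
  def init(waiters), do: {:ok, waiters}

  # P registers its intent to be notified when `key` is written.
  def await_key(server, key), do: GenServer.cast(server, {:await, self(), key})

  def handle_cast({:await, pid, key}, waiters) do
    # waiters :: %{process_id => [key]}
    {:noreply, Map.update(waiters, pid, [key], &[key | &1])}
  end

  # A write event arrives from the event broker.
  def handle_info({:write, key, value}, waiters) do
    waiters =
      Map.new(waiters, fn {pid, keys} ->
        # Notify every waiting process asynchronously, then drop the key.
        if key in keys, do: send(pid, {:your_key, key, value})
        {pid, List.delete(keys, key)}
      end)

    {:noreply, waiters}
  end
end
```

On the transaction side, P simply does a `receive do {:your_key, ^key, value} -> ... end`; no extra processes are spawned on its behalf.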
I think this solves both problems above.
1. Tasks are not great
We do not create additional tasks for waiting on a particular read. The supervisor can be cleanly shut down because no pending `call` invocations exist.
2. Modelling state as a process is not great
At any given point in time it’s easy to obtain the number of pending reads and writes by inspecting the state of the `Ordering` engine.
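For instance, with the sketch above the pending reads are just entries in a map, and can be counted from the outside (here via `:sys.get_state/1`, purely as an illustration, assuming the process was started with `name: OrderingProposalSketch`):

```elixir
# Number of pending key subscriptions held by the Ordering sketch.
:sys.get_state(OrderingProposalSketch)
|> Map.values()
|> List.flatten()
|> length()
```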
The cost of a pending read is reduced to the minimum. The only additional cost for a process P to wait for a write on key k is a few extra bytes in the memory of the `Ordering` engine.
We don’t have pending `call` invocations anymore, and asynchronous behavior is no longer forced into the model of synchronous calls.
If we want to constrain the system to only have X pending writes or reads, we can limit the size of the state of the `Ordering` engine and deny new reads or writes with a `:delayed` message, or something along those lines. This message can then be handled by the transaction process.
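That limit could look something like the sketch below, turning registration into a call so the transaction process gets an immediate answer. The limit value and the `:delayed` handling are assumptions layered on top of the earlier sketch, not a worked-out design:

```elixir
defmodule OrderingWithLimitSketch do
  use GenServer

  @max_pending 10_000  # illustrative limit, not a tuned value

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)
  def init(waiters), do: {:ok, waiters}

  # Registration as a call: the caller immediately learns :ok or :delayed.
  def await_key(server, key), do: GenServer.call(server, {:await, self(), key})

  def handle_call({:await, pid, key}, _from, waiters) do
    if map_size(waiters) >= @max_pending do
      # Too many waiting processes: push back instead of growing the state.
      {:reply, :delayed, waiters}
    else
      {:reply, :ok, Map.update(waiters, pid, [key], &[key | &1])}
    end
  end
end
```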
Amazing that you read this far, have a gold star.