Node performance: a design suggestion

Context

I’ve been running some load tests on the Anoma node, and I have come to some disappointing conclusions. The load tests were run on a MacBook Pro M4. A summary of the results:

  1. A single node can ingest at most 1,000 transactions per second before it breaks down horribly. And by breaks down I mean: it crashes, and we cannot throttle anything.
  2. The event broker system that relays all sorts of events within the node can handle about 15,000 subscriptions per second. Why this is relevant will become clear in a minute.

Current design

Transaction execution is not my main responsibility within the engineering team, so I might be off the mark on some things. But I am good at Elixir and system design, so I will summarize my findings.

When a transaction is submitted into the node, the following things happen.

  1. The transaction is put into the Mempool engine's state (&Mempool.handle_tx/2)
  2. The transaction is launched (&Backends.execute/2)
    Launching a transaction means that a new Erlang process P is created that will execute this transaction (a toy sketch of this step follows the list).
    During execution, a transaction can read values from storage. This is the tricky bit.
  3. The transaction eventually completes, and the process P is terminated.
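
Step 2 is the one-process-per-transaction part. Here is a minimal, self-contained sketch of that pattern; the supervisor name and the fake execution body are mine, not the node's actual API.

```elixir
# Toy illustration of step 2: every transaction gets its own supervised
# process, which terminates on its own once execution finishes.
# TxSupervisor and the fake "execute" body are made up for this sketch.
{:ok, _sup} = Task.Supervisor.start_link(name: TxSupervisor)

launch = fn tx ->
  Task.Supervisor.async_nolink(TxSupervisor, fn ->
    # Stand-in for Backends.execute/2; real execution may block on
    # storage reads, which is the interesting part below.
    Process.sleep(10)
    {:executed, tx}
  end)
end

task = launch.(%{id: 1})
Task.await(task)
# => {:executed, %{id: 1}}
```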

The part of this whole flow I want to discuss is reading from storage. Reading from storage is a blocking operation: reading a key at a specific time can cause the transaction to block for an indefinite period of time.

The process P can read from storage using Ordering.read({time, key}).

To avoid blocking the Ordering process while it waits for a write, a new process P_{wait} is created that will wait for the key to become readable.

If the key is already present in a previous block (or transaction? I got confused), then P_{wait} calls the Storage engine using Storage.read/2. Storage in turn creates a new process P_{wait2}, to avoid blocking the engine.
Waiting for a write is done by means of a subscription to the event broker: P_{wait2} analyzes specific events to determine whether the key it's waiting for has been written.

If the key is in the past and available, then Storage immediately returns the value.
When P_{wait2} receives the value, it synchronizes P_{wait}, which in turn synchronizes P.

The synchronization mechanism here is the return of a call: P called P_{wait}, which in turn called P_{wait2}.

Note: I might be wrong here, and missing an implied invariant.

This design works, and it is elegant in the sense that it uses the computational model of Erlang to model a specific problem: idling processes model the concept of waiting for something, and calls keep the code synchronous.
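
To make the shape of this concrete, here is a toy, self-contained version of the pattern as I read it. Everything in it is mine: one engine stands in for both Ordering and Storage, and polling stands in for the event-broker subscription. The core trick is the same, though: the engine never blocks, because a spawned waiter answers the still-pending call via GenServer.reply/2.

```elixir
defmodule ToyStorage do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # The transaction process P calls this; the call blocks until the key
  # has been written, but the engine itself stays responsive.
  def read(key), do: GenServer.call(__MODULE__, {:read, key}, :infinity)

  def write(key, value), do: GenServer.cast(__MODULE__, {:write, key, value})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:read, key}, from, state) do
    case Map.fetch(state, key) do
      {:ok, value} ->
        # The key is already available: reply immediately.
        {:reply, value, state}

      :error ->
        # Not written yet: spawn a waiter (the P_wait role) and answer
        # the still-pending call from there with GenServer.reply/2.
        # Polling stands in for the real event-broker subscription.
        Task.start(fn -> GenServer.reply(from, await_key(key)) end)
        {:noreply, state}
    end
  end

  def handle_call({:peek, key}, _from, state), do: {:reply, Map.fetch(state, key), state}

  @impl true
  def handle_cast({:write, key, value}, state), do: {:noreply, Map.put(state, key, value)}

  defp await_key(key) do
    case GenServer.call(__MODULE__, {:peek, key}) do
      {:ok, value} ->
        value

      :error ->
        Process.sleep(50)
        await_key(key)
    end
  end
end
```

With this, ToyStorage.read(:k) in a transaction process blocks until something calls ToyStorage.write(:k, value), and the engine keeps serving other requests in the meantime.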

I think there are two downsides to this, however.

1. Tasks are not great

The bookkeeping necessary to juggle a massive number of processes is easy enough with Supervisors, but Tasks are less elegant. A symptom of this is the fact that we can no longer shut down a node cleanly without causing processes to error out along the way.

The reason for this is simple: a process should ideally at all times be linked to and/or monitored by another process. Otherwise you can end up with orphan processes that just take up resources.

Compounding this is the fact that we use call as a synchronization mechanism. A process that calls another process monitors the called process. This is to make sure that a call to a process that crashes also crashes the caller. We want this.

However, if you have 3 processes that call each other A->B->C->A, you have a circle of processes that can never be shut down cleanly.

A call is synchronous, and returning from a call using an asynchronous task opens up the possibility of race conditions, deadlocks, and even livelocks.
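
A toy illustration of the A->B->C->A situation (the module, message shape, and timeouts are made up): three GenServers forward a synchronous call around a ring, and as soon as the ring closes nothing can make progress; the only thing that "resolves" it is the call timeout crashing every process in the chain.

```elixir
defmodule Relay do
  use GenServer

  def start(name), do: GenServer.start(__MODULE__, nil, name: name)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:forward, []}, _from, state), do: {:reply, :done, state}

  def handle_call({:forward, [next | rest]}, _from, state) do
    # While this call is in flight we cannot serve anyone else,
    # including a process that is (transitively) waiting on us.
    {:reply, GenServer.call(next, {:forward, rest}, 1_000), state}
  end
end

for name <- [:a, :b, :c], do: {:ok, _} = Relay.start(name)

# :a waits on :b, :b waits on :c, :c waits on :a. Nothing can make
# progress until the 1-second timeouts fire and crash the whole chain.
try do
  GenServer.call(:a, {:forward, [:b, :c, :a]}, 5_000)
catch
  :exit, _reason -> :deadlocked
end
```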

2. Modelling state as a process is not great

Because we use tasks to model the fact that some process P is waiting for an event E, we have intangible state. By this I mean that there is no way to measure/inspect/manipulate this state. We might count the number of processes somehow, but that’s about it.

Additionally, modeling the fact that some process P is waiting for an event E as a process costs more in terms of resources than other ways of modeling it.

Finally, because we model state as processes, there is no way for us to control the rate at which that state grows. If we wanted to constrain the system to waiting for at most X writes at a given point in time, we would have to start counting processes or something along those lines.

My Proposal

Instead of modelling the fact that some process P is waiting for an event E as a process, I propose to model it as a simple data structure that keeps track of which process is waiting for which event. I know this sounds a lot like an event broker, but bear with me.

The reading of a yet-to-be-written key should work as follows.

Transaction process P wants to read key k, which will be written when pigs can fly (i.e., a very long time).
P registers this intent (see what I did there) with the Ordering process.
The Ordering process keeps track of which process wants to be notified of which key in the form of a dictionary: %{process_id => [key]}. The Ordering process can manage a large number of these subscriptions easily.

The Ordering process subscribes to all events pertaining to writes. Each time a write event occurs, Ordering figures out if the event contains writes to keys that a process is waiting for. If that is the case, it sends an asynchronous message to that process {:your_key, key, value}.
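
A minimal, self-contained sketch of that idea (the module, function names, and message shapes are all mine, and write/2 stands in for a write event arriving from the event broker; I also index the map by key rather than by pid so dispatch on a write is a single lookup, but the information is the same):

```elixir
defmodule ToyOrdering do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{waiters: %{}, store: %{}}, name: __MODULE__)
  end

  # A transaction process registers its interest in a key and later
  # receives {:your_key, key, value} as a plain message. No extra
  # process is spawned on its behalf.
  def await(key), do: GenServer.cast(__MODULE__, {:await, key, self()})

  # Stand-in for a write event arriving from the event broker.
  def write(key, value), do: GenServer.cast(__MODULE__, {:write, key, value})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:await, key, pid}, %{waiters: waiters, store: store} = state) do
    case Map.fetch(store, key) do
      {:ok, value} ->
        # Already written: notify immediately.
        send(pid, {:your_key, key, value})
        {:noreply, state}

      :error ->
        # Record who is waiting for what: a map entry instead of a
        # parked process.
        {:noreply, %{state | waiters: Map.update(waiters, key, [pid], &[pid | &1])}}
    end
  end

  def handle_cast({:write, key, value}, %{waiters: waiters, store: store} = state) do
    # Notify every waiter of this key, then drop the bookkeeping entry.
    for pid <- Map.get(waiters, key, []), do: send(pid, {:your_key, key, value})
    {:noreply, %{state | waiters: Map.delete(waiters, key), store: Map.put(store, key, value)}}
  end
end
```

A transaction process then calls ToyOrdering.await(:k) and simply receives {:your_key, :k, value} in its mailbox once the write happens; the number of pending reads is just the size of the waiters map.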

I think this solves both problems above.

1. Tasks are not great

We do not create additional tasks for waiting on a particular read. The supervisor can be cleanly shut down because no pending call invocations exist.

2. Modelling state as a process is not great

At any given point in time it’s easy to obtain the number of pending reads and writes by inspecting the state of the Ordering engine.

The cost of a pending read is reduced to a minimum. The only additional cost for a process P to wait for a write on key k is a few extra bytes in the Ordering engine's memory.

We don’t have pending call invocations anymore, and asynchronous behavior is no longer forced into the model of synchronous calls.

If we want to constrain the system to at most X pending writes or reads, we can limit the size of the Ordering engine's state and deny new reads or writes with a :delayed message, or something along those lines. This message can then be handled by the transaction process.
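
A sketch of what that back-pressure could look like, as an extra clause that would sit in front of the {:await, key, pid} clause in the ToyOrdering sketch above (the threshold, the :delayed message shape, and counting distinct awaited keys as the measure of "pending" are all made up):

```elixir
  @max_pending 10_000

  # Crude limit: once too many keys are being waited on, refuse to grow
  # the state and tell the transaction process to back off and retry.
  def handle_cast({:await, key, pid}, %{waiters: waiters} = state)
      when map_size(waiters) >= @max_pending do
    send(pid, {:delayed, key})
    {:noreply, state}
  end
```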

Amazing that you read this far, have a gold star :star:.


After our most recent discussion regarding the shards, I actually think this is worth revisiting once we shard our storage. I am saying this because it is not evident (at least to me at this point) how exactly, and how much of, the synchronous behavior will be migrated.

When we do have that implemented, however, and have run the appropriate benchmarks, we should revisit this formally during an architecture meeting.

How does that sound?


This sounds good. I will, when I have some time to spare, make a minimal implementation of both approaches to have a reference point.
