Basic OTA upgrade system

The v2 specs must include a basic OTA upgrade system which allows new code to be shipped over the network so that node operators can easily upgrade to it (optionally automatically). I think that we can implement this as an RM application with a special resource which is (optionally) read by the node after each block.

Basic requirements:

  • New releases (code directly, or Git commit & remote information) distributed over the P2P network
  • An optional embedded external identity which is allowed to authorize upgrades
  • Automatic recompilation & restart after an upgrade is activated
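To make the first two requirements concrete, here is one hypothetical shape such a release resource could take — all module, field, and function names are illustrative, not part of any existing spec; the payload is either inline code or a Git pointer, and acceptance is gated on a signature from the trusted identity:

```elixir
defmodule Release do
  # Hypothetical release resource: either inline code or a Git pointer,
  # plus the external identity (Ed25519 public key) allowed to
  # authorize upgrades.
  defstruct [:version, :payload, :authorized_by, :signature]

  # `payload` is either {:code, binary} or {:git, remote_url, commit_sha}.
  def new(version, payload, authorized_by, signature) do
    %__MODULE__{version: version, payload: payload,
                authorized_by: authorized_by, signature: signature}
  end

  # A node accepts a release only if it is signed by the identity the
  # node was configured to trust at startup.
  def acceptable?(%__MODULE__{} = rel, trusted_key) do
    rel.authorized_by == trusted_key and
      :crypto.verify(:eddsa, :none, payload_bytes(rel), rel.signature,
                     [trusted_key, :ed25519])
  end

  defp payload_bytes(rel), do: :erlang.term_to_binary({rel.version, rel.payload})
end
```

This keeps the trust decision independent of how the bytes arrived over the P2P layer: transport and authorization stay separate.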

Questions for engineering (cc @mariari @Moonchild @ray):

  • Is Elixir sandboxed? Do we need to further sandbox the node in order to enable safe automatic recompilation and restart?
  • What exactly is required to hot reload code? Should we try to do that, or just stop and restart the node?
  • Broadly, what do you recommend in terms of the operational procedure for upgrading after new code is downloaded and compiled?

Discuss! :desktop_computer:

sandboxing is always impossible. nor do i see why we would want that, unless you mean something different than i think?

hot reloading is possible. we should also have a way to specify a transition function (similar to common lisp update-instance-for-redefined-class) that updates the state of an engine to the format specified by the new code
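in OTP terms the analogue of update-instance-for-redefined-class is the code_change callback. a minimal sketch of an engine whose state gains a field across an upgrade (module and field names are illustrative):

```elixir
defmodule Engine do
  use GenServer

  def start_link(n), do: GenServer.start_link(__MODULE__, n)

  def init(n), do: {:ok, %{count: n}}

  def handle_call(:get, _from, state), do: {:reply, state, state}

  # Transition function: the runtime calls this during a hot upgrade,
  # and we map the old state shape onto the new one.
  # Old shape: %{count: n}  ->  new shape: %{count: n, last_seen: nil}
  def code_change(_old_vsn, state, _extra) do
    {:ok, Map.put_new(state, :last_seen, nil)}
  end
end
```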

it would be bad if there were a mix of old and new engines in the same system, so we presumably want the upgrade to be atomic. but, what if the transition function needs to talk to other engines or start new engines—while the update is in progress?
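on a single node, OTP's suspend/change/resume dance gives a form of atomicity — and also illustrates exactly this problem: while suspended, engines process no messages, so a transition function that calls into another engine mid-upgrade would deadlock. a sketch, assuming engines are GenServer processes (DemoEngine is illustrative):

```elixir
defmodule DemoEngine do
  use GenServer
  def start_link(n), do: GenServer.start_link(__MODULE__, n)
  def init(n), do: {:ok, %{count: n}}
  def handle_call(:get, _from, state), do: {:reply, state, state}
  def code_change(_old_vsn, state, _extra), do: {:ok, Map.put_new(state, :last_seen, nil)}
end

defmodule Upgrade do
  # Suspend every engine, run each one's transition function, then
  # resume. Between suspend and resume no engine handles messages, so
  # no request observes a mix of old and new code on this node -- but
  # it also means a code_change callback must not send a call to
  # another (suspended) engine.
  def atomically(pids, module, old_vsn) do
    Enum.each(pids, &:sys.suspend/1)
    Enum.each(pids, fn pid ->
      :ok = :sys.change_code(pid, module, old_vsn, nil)
    end)
    Enum.each(pids, &:sys.resume/1)
  end
end
```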

a moderately annoying point is updating native libraries

it might be a good idea to do the update offline, even though online is possible, as that way we have fewer mechanisms to implement and test (since snapshot/restore within a single version is already something we have to support). we still need a transition function, but perhaps there can be just one that operates on the entire snapshot—then, since it has access to all the state, the problem i mentioned goes away
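with offline upgrades the transition function can be a single pure function from old snapshot to new snapshot; since it sees all engine state at once, cross-engine migrations need no messaging and can even add engines. a sketch, assuming a snapshot is a map from engine id to state (all names illustrative):

```elixir
defmodule SnapshotMigration do
  # v1 -> v2: every engine state gains a :last_seen field, and a new
  # engine is added to the snapshot -- something an online, per-engine
  # code_change could not easily do.
  def migrate(%{version: 1, engines: engines}) do
    migrated =
      Map.new(engines, fn {id, state} ->
        {id, Map.put_new(state, :last_seen, nil)}
      end)

    %{version: 2, engines: Map.put(migrated, :metrics, %{counters: %{}})}
  end
end
```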

it would be cool if we can have transparent interoperability with arbitrary external data sources, including git. presumably that’s a ways out. in light of which sending the code directly for the time being probably makes more sense

What do you mean by “sandboxing”? I mean limiting the “host OS” scope which the node has access to so e.g. distributed code cannot read /etc/passwd. For example, docker sandboxes - not perfectly, but pretty well, from my understanding.

Yes, agreed.

I’m not sure how to address this entirely yet. One option could be for the upgrade itself to specify how engines need to be restarted. Perhaps some upgrades only upgrade specific engines - and do so in backwards-compatible ways, from the perspective of an observer of the engine - while others may require a full node restart.

Yes, we should also have the ability to do upgrades offline - and if online upgrading is too hard, we can punt on this problem until later; it’s not absolutely critical for v2 - just a nice-to-have.

Aye, let’s keep it simple for now.

sandboxing

I don’t think it makes sense to claim that any sandbox can be relied on as a primary security measure (especially if that sandbox’s name isn’t ‘v8’ or ‘jsc’—but even then). (Fun trivia: a security researcher I know has a cluster of cheap arm boards used for browser tabs. Every new browser tab gets its own completely isolated hardware. This seems like a very good idea and the only reason I haven’t done it is it’s a lot of work I haven’t gotten around to yet.) The primary security measure has to be that you trust the signature on the code. Running in a sandbox, then, might not be a bad idea as a defense-in-depth measure, but is not necessarily super interesting. (And that’s ignoring the hypothetical future where all the interesting stuff an attacker might want access to is part of the anoma node anyway.)

One option could be for the upgrade itself to specify how engines need to be restarted. Perhaps some upgrades only upgrade specific engines - and do so in backwards-compatible ways, from the perspective of an observer of the engine - while others may require a full node restart.

Where I suspect this leads: figuring out how to make every upgrade fully online; say, if an expensive data format migration is needed, do it incrementally and maintain the data in both the old format and the new until it’s done. Not necessarily a bad idea, but seems like a lot of engineering effort. We should decide if this is something we want to invest in or not, but I’m not sure if half measures make sense—if we’re ok asking people to tolerate downtime sometimes, then it should probably be ok to ask them to tolerate it every time.
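For reference, the fully-online alternative being weighed here is a lazy, dual-format migration: reads fall through from the new store to the old, migrating records as they are touched, until the old store drains. A sketch under the assumption of a simple key-value engine state (all names illustrative):

```elixir
defmodule LazyMigration do
  # State holds both formats during the migration window.
  defstruct old: %{}, new: %{}

  # Reads prefer the new store; on a miss, migrate the old record on
  # the spot, write it to the new store, and drop it from the old one.
  def fetch(%__MODULE__{} = s, key) do
    case Map.fetch(s.new, key) do
      {:ok, value} ->
        {value, s}

      :error ->
        value = migrate_record(Map.fetch!(s.old, key))
        {value, %{s | new: Map.put(s.new, key, value),
                      old: Map.delete(s.old, key)}}
    end
  end

  # The migration window closes once the old store is empty.
  def done?(%__MODULE__{old: old}), do: old == %{}

  # Example format change: old records were bare integers, new ones
  # are maps.
  defp migrate_record(n) when is_integer(n), do: %{value: n, format: 2}
end
```

Which is exactly the extra machinery (two stores, two code paths, a background sweep to finish the tail) that makes the "just take downtime every time" position attractive.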
