Merkle Patricia Trees (MPT) and Recovery
This page provides a more in-depth description of the Merkle Patricia Tree (MPT) data structure and how it is used within Züs.
When a miner or sharder is missing state, either because it is newly added to the active set or because it was temporarily unavailable, it must sync its state with the rest of the mining network.
To do this, the miner or sharder requests the MPT updates from the other miners.
Züs uses a consensus algorithm that is speculative, resulting in slightly lagging finality. This trade-off achieves high throughput and is similar to how hardware optimizes the instruction pipeline with speculative branch prediction to move as fast as possible. Operating at very high speed along with speculative branching requires a sophisticated data structure to manage the state.
Züs uses the same Merkle Patricia Trie (MPT) as Ethereum [4]. An MPT contains a single root node, several intermediate nodes, and leaf nodes. The intermediate nodes can be either full nodes or extension nodes [4]. All these nodes are persisted into RocksDB [10]. Due to speculation, not every computed node is immediately saved to the database, so it is necessary to operate on both persisted nodes and in-memory nodes. Further, among the in-memory nodes, it is also necessary to maintain several parallel paths, and following each path can result in different state values.
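The sketch below illustrates, in Go, the three kinds of nodes described above. The struct shapes and field names are illustrative assumptions, not the actual Züs type definitions.

```go
package mpt

// LeafNode carries a value at the end of a key path.
type LeafNode struct {
	Path  []byte // remaining key nibbles below the parent node
	Value []byte
}

// ExtensionNode compresses a run of nibbles shared by every key beneath it
// and points to a single child.
type ExtensionNode struct {
	Path     []byte
	ChildKey []byte // hash key of the child node in the node database
}

// FullNode (a branch node) has one child slot per possible nibble. A general
// MPT can also hold a value here, but the Züs global state keeps values only
// on leaf nodes.
type FullNode struct {
	Children [16][]byte // hash keys of children, indexed by nibble
}
```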
To deal with all these different types of nodes and to support both persistent and in-memory branches of nodes, an abstraction called NodeDB is used. There are three implementations of this abstraction (see the sketch after this list):
MemoryDB: This tracks the insert/update/delete changes of the current computation. The current computation can be the entire block or a single transaction. This distinction matters because it makes it possible to roll back a transaction that fails during execution without rolling back the entire block.
LevelDB: This supports multiple simultaneous branches. Each speculative branch path has a separate LevelDB for each block within the branch. A LevelDB is like a linked list with previous and current members, each of which is a NodeDB. The current member is either a MemoryDB or a PersistDB (see below for the latter). The previous member is either a LevelDB pointing to the state as of the prior block that is still in memory, or a PersistDB when any unmodified state beyond this LevelDB is already persisted.
PersistDB: This provides an interface to interact with the underlying RocksDB to query and save MPT nodes.
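The following is a minimal Go sketch of the NodeDB abstraction and its three implementations as described above. The interface, method names, and fields are illustrative assumptions rather than the exact Züs code, and the RocksDB calls are elided.

```go
package mpt

import "errors"

// Node is a serialized MPT node; Key is the hash under which it is stored.
type (
	Node []byte
	Key  string
)

var ErrNodeNotFound = errors.New("node not found")

// NodeDB abstracts where MPT nodes live: in memory, in a per-block level,
// or in RocksDB.
type NodeDB interface {
	GetNode(key Key) (Node, error)
	PutNode(key Key, node Node) error
	DeleteNode(key Key) error
}

// MemoryDB holds the insert/update/delete set of the current computation
// (a single transaction or a whole block), which is what allows one failed
// transaction to be rolled back without discarding the block.
type MemoryDB struct {
	nodes map[Key]Node
}

func (m *MemoryDB) GetNode(key Key) (Node, error) {
	if n, ok := m.nodes[key]; ok {
		return n, nil
	}
	return nil, ErrNodeNotFound
}
func (m *MemoryDB) PutNode(key Key, node Node) error { m.nodes[key] = node; return nil }
func (m *MemoryDB) DeleteNode(key Key) error         { delete(m.nodes, key); return nil }

// PersistDB fronts the RocksDB instance that stores finalized nodes; the
// actual RocksDB reads and writes are omitted in this sketch.
type PersistDB struct{}

func (p *PersistDB) GetNode(key Key) (Node, error)    { return nil, ErrNodeNotFound }
func (p *PersistDB) PutNode(key Key, node Node) error { return nil }
func (p *PersistDB) DeleteNode(key Key) error         { return nil }

// LevelDB layers a current NodeDB (a MemoryDB or a PersistDB) over a previous
// one (the prior block's LevelDB, or a PersistDB once everything older has
// been persisted).
type LevelDB struct {
	Current  NodeDB
	Previous NodeDB
}

// GetNode checks the current level first and falls back to the previous one,
// so reads see speculative in-memory changes before persisted state.
func (l *LevelDB) GetNode(key Key) (Node, error) {
	if n, err := l.Current.GetNode(key); err == nil {
		return n, nil
	}
	return l.Previous.GetNode(key)
}

// Writes and deletes in a level always go to the current member.
func (l *LevelDB) PutNode(key Key, node Node) error { return l.Current.PutNode(key, node) }
func (l *LevelDB) DeleteNode(key Key) error         { return l.Current.DeleteNode(key) }
```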
When a block is created (to propose or validate), its state starts with a LevelDB whose previous member points to the LevelDB of the prior block and whose current member points to a brand-new MemoryDB. As transactions are executed, the state changes are captured in the MemoryDB.
When a block is finalized, its state is persisted. At this point, the LevelDB of the finalized block is updated so that both the current and previous members point to the PersistDB. This design ensures that lookups never need to traverse any further back.
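Continuing the hypothetical package from the previous sketch, the following shows how a block's state database might be set up at creation and rebased onto the PersistDB at finalization. The function names are illustrative, not the Züs API.

```go
package mpt

// newBlockState starts a proposed or validated block with a fresh MemoryDB
// layered over the parent block's LevelDB, so all of its changes remain
// speculative and in memory.
func newBlockState(parent *LevelDB) *LevelDB {
	return &LevelDB{
		Current:  &MemoryDB{nodes: make(map[Key]Node)},
		Previous: parent,
	}
}

// finalizeBlockState writes the block's in-memory nodes to the PersistDB and
// then points both members at it, so lookups never traverse older in-memory
// levels.
func finalizeBlockState(state *LevelDB, persist *PersistDB) error {
	if mem, ok := state.Current.(*MemoryDB); ok {
		for key, node := range mem.nodes {
			if err := persist.PutNode(key, node); err != nil {
				return err
			}
		}
	}
	state.Current = persist
	state.Previous = persist
	return nil
}
```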
The above two ways of managing a block's database as the block goes through different stages ensure that a) speculative state is accumulated across blocks in memory and b) finalized state is committed to persistent storage. Older blocks are eventually deleted, freeing memory for new blocks and ensuring continuous progress of the blockchain without memory pressure. Since the number of blocks retained in memory at any given time has an upper bound, the memory requirement depends mostly on the number of transactions per block.
The MPT is a versioned data structure: any change to a leaf propagates changes all the way up to the root. As a result, as the blockchain progresses, nodes that are no longer used quickly accumulate. Because these are persisted, the database grows over time, which would severely impact performance.
To deal with this problem, a background state-pruning mechanism periodically removes unused MPT nodes, following an algorithm similar to the mark-and-sweep approach used for garbage collection. Each leaf node has an origin field which indicates the block round when that node was introduced. During the "mark" phase, the entire MPT is traversed from the root (as of a block sufficiently far in the past to provide a buffer) and all reachable nodes are updated with an UpdatedVersion indicating the round of that root; any node still reachable from a given root should be retained. During the "sweep" phase, all nodes within the persist DB are traversed and any node whose UpdatedVersion is less than the root version being pruned is deleted.
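A simplified, self-contained Go sketch of this mark-and-sweep pruning is shown below. The node layout and store interface are assumptions made for illustration, not the Züs implementation.

```go
package pruning

// node is a persisted MPT node with the version bookkeeping used by pruning.
type node struct {
	Origin         int64    // round in which the node was introduced
	UpdatedVersion int64    // last round in which the node was found reachable
	ChildKeys      []string // hash keys of children (empty for leaves)
}

// nodeStore is the subset of the persist DB the pruner needs.
type nodeStore interface {
	Get(key string) (*node, bool)
	Put(key string, n *node)
	Delete(key string)
	Keys() []string // iterate every persisted node
}

// mark walks the MPT from the given root (a root a few rounds behind the
// current block, to leave a safety buffer) and stamps every reachable node
// with the root's round.
func mark(db nodeStore, rootKey string, rootRound int64) {
	n, ok := db.Get(rootKey)
	if !ok {
		// A missing node means the state is not fully available; in Züs such
		// nodes would be synced from other miners/sharders.
		return
	}
	n.UpdatedVersion = rootRound
	db.Put(rootKey, n)
	for _, child := range n.ChildKeys {
		mark(db, child, rootRound)
	}
}

// sweep deletes every persisted node the mark phase did not reach, i.e.
// whose UpdatedVersion is older than the root version being pruned.
func sweep(db nodeStore, rootRound int64) {
	for _, key := range db.Keys() {
		if n, ok := db.Get(key); ok && n.UpdatedVersion < rootRound {
			db.Delete(key)
		}
	}
}
```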
It should be noted that when traversing the MPT from the root, it is possible to discover missing nodes, which indicates that the state is not fully available; such nodes are synced from other miners/sharders.
There are times when the prior state is not available. This can happen due to a) a miner joining as part of a view change, b) a temporary network outage, or c) a miner going down due to a crash, scheduled upgrade, and so on. Züs has a robust way to cope with such incomplete state: the blockchain progresses in an optimistic manner without waiting for the entire state DB to be synced. Sharders, however, must sync all blocks up to the current block.
When the prior state is not available, multiple things are tried. First, any missing previous blocks are fetched and their state is computed. This is done only up to a certain number of blocks, currently 10 in our configuration. Beyond that, the logic falls back to requesting the delta state change for the block (this delta can be securely verified). The delta changes are applied optimistically, directly on top of what is already present. This makes it possible to compute and validate state changes even when the entire state is not available (as long as the changes don't require retrieving any missing intermediate nodes).
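The sketch below outlines this fallback order in Go. The helper functions and types are hypothetical placeholders; only the order of attempts and the 10-block limit come from the description above.

```go
package statesync

import "errors"

// Hypothetical shapes, not the Züs types.
type Block struct {
	Hash     string
	PrevHash string
	StateOK  bool // whether this block's state has been computed locally
}

type StateDelta struct{} // the verifiable set of state changes for one block

// Hypothetical network/compute helpers, stubbed out for this sketch.
func fetchPreviousBlock(hash string) (*Block, error) { return nil, errors.New("not implemented") }
func computeStateFrom(prev, b *Block) error          { return errors.New("not implemented") }
func fetchStateDelta(b *Block) (*StateDelta, error)  { return nil, errors.New("not implemented") }
func applyDelta(b *Block, d *StateDelta) error       { return errors.New("not implemented") }

const maxMissingBlocks = 10 // per the configured limit mentioned above

// ensureState sketches the fallback order when a block's prior state is
// missing: walk back up to maxMissingBlocks blocks and recompute state, and
// beyond that fall back to the verifiable delta, applied optimistically on
// top of whatever state is already present.
func ensureState(b *Block) error {
	cur := b
	for i := 0; i < maxMissingBlocks; i++ {
		prev, err := fetchPreviousBlock(cur.PrevHash)
		if err != nil {
			break
		}
		if prev.StateOK {
			return computeStateFrom(prev, b)
		}
		cur = prev
	}
	delta, err := fetchStateDelta(b)
	if err != nil {
		return err
	}
	return applyDelta(b, delta)
}
```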
Because of all these sophisticated algorithms, it is important to make sure that state changes are applied correctly, handling issues such as concurrent updates.
It should be noted that the current implementation of the MPT is not fully general; it is implemented to satisfy the requirements of the global state. For example, a general MPT can contain values on full nodes, but such a scenario never arises for the global state. Similarly, all paths used to access the global-state MPT are of fixed size, since the path is the client id (a 64-byte hash). Adopting the global-state MPT for other use cases should keep this in mind, or the implementation should be changed appropriately to support additional use cases.
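To make the fixed-size path concrete, the following hypothetical helper converts a client id into a fixed-length nibble path. The nibble encoding shown is the conventional MPT approach and the function is not part of the Züs API.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// pathForClient converts a client id (a hex-encoded hash) into the nibble
// path used to walk the MPT. Because every client id has the same length,
// every path through the global state has the same depth.
func pathForClient(clientID string) ([]byte, error) {
	raw, err := hex.DecodeString(clientID)
	if err != nil {
		return nil, err
	}
	nibbles := make([]byte, 0, 2*len(raw))
	for _, b := range raw {
		nibbles = append(nibbles, b>>4, b&0x0f)
	}
	return nibbles, nil
}

func main() {
	p, _ := pathForClient("aabbccdd") // shortened id for illustration
	fmt.Println(p)                    // [10 10 11 11 12 12 13 13]
}
```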