IndexBtree

This class implements a standard (n, m) B+ tree that maps variable length secondary key blobs to primary key hashes for retrieval in RAMCloud index operations. The property (n, m) means the smallest non-root node in the system can only have n entries and the largest node in the system can be up to n/m entries at most. In our case, n = mininnerslot = 4 and m = innerslotmax = 8. This B+ does coalesce and rebalance nodes to maintain this invariant.

As with most B+ trees, there are leaf nodes which contain n keys/data and inner nodes that contain the n keys and n + 1 pointers to other nodes. However, unlike most trees, the key and data are one and the same in this case since it is desired to sort by not only the secondary key but also the pKHash number. Thus, the inner nodes in this B+ also contain data as well the keys. This key+data combination is called a BtreeEntry.

The B+ tree is organized such that the root is always in a fixed location after a recovery. This is achieved by storing the B+ nodes as RAMCloud objects so that the nodes are automatically recovered during a crash and the root node is always stored with the same RAMCloud object key (ROOT_ID) so it is possible to continue using the B+ tree after recovery. One important note is that the nextNodeId variable is recovered elsewhere in the system and it is CRUCIAL to set properly after a recovery. The reason is because the system-generated RAMCloud Object Keys for the Nodes depends on that variable. If set incorrectly, new nodes could overwrite live data.

One more property worth mentioning is that the entries stored in inner nodes slot refer to the rightmost key of the subtree where the subtree root is the child pointer at that position. This allows the B+ tree to store n entries for n + 1 child pointers as the last entry in the B+ tree can be used as a pivot point to choose between the final child or the second last child.

The properties of the B+ tree is summarized below

  • Keys in the system are (Secondary Key Blob) to (Primary Key Hash) mappings called BtreeEntry's
  • Smallest non-Root node contains at least mininnerslots entries (4)
  • Largest node contains at most innerslotmax entries (8)
  • Nodes are allocated as RAMCloud objects 
  • Keys are sorted within a nodes 
  • Inner Node Keys refer to the rightmost leaf key when traversing the subtree NodeId pointer.


Variable-Sized Nodes

IndexBtree::Node implements the base class for the inner and leaf nodes of the B+ tree. The main responsibility of the base class is to abstract the complexity of storing variable-sized blobs in a RAMCloud Buffer and provide a vector-like API with set/insert/get/erase operations. The semantics of the data and how to arrange the items within the logical array are determined by the IndexBtree class.

Nodes use Buffers to keep track of variable-sized secondary keys, which is also convenient since RAMCloud Objects are serialized to Buffers and read back with Buffers. Due to this structure, it's not uncommon for the Node metadata to be stored in the Buffer along with its secondary keys.

Using Buffers also affords the use of virtual copies (externalAppend's), which are used copiously to update the node. For example, when an insert occurs, instead of shifting data, the node performs the following operations (a) virtual copy the first half of the original keys, (b) make a hard copy of the new inserted key, and (c) virtual copy the second half of the original keys.

So a simple node structure after an insert may look like this:

------------------------------------------------------------------------

| Node A (meta) | keya, keyb | virtual keya, hard keyb, virtual keyc |..

------------------------------------------------------------------------

A simple erase (on c) may look something like this

------------------------------------------------------------------------

| Node A (meta) | keya, keyb, keyc | virtual keya, virtual keyb | .....

------------------------------------------------------------------------

The important thing to note is that the Node will ALWAYS append a copy (either virtual if it's part of the old keys or hard if it's passed in during an insert) of the keys to the end of the Buffer. This provides two useful properties. The first is that the contents of any pointer (such as a secondary key pointer) returned by the node during its lifetime will remain valid during its lifetime, even if the entry has been 'erased'. Here lifetime means from the time that the local copy is created/read from RAMCloud to the time that the local copy is garbage collected. The second useful property is this allows multiple nodes to the same buffer, saving on stack space. So it is also not uncommon to see a buffer layout such as

------------------------------------------------------------------------

| Node A(meta)| keysA v1 | Node B(meta)| keysB v1 | keysA v2 | Node C...

------------------------------------------------------------------------

The always-appending-a-copy-of-keys-to-the-end property ensures pointer safety within the B+ tree code and allows multiple nodes to share one keyBuffer without using additional space.