
The Proposed Server API page asks "What do we want keys to be?".
This page collects and evaluates the options.

Options

User-allocated variable-length or long fixed-length keys
  • These longer keys must be passed around in requests.
  • The master needs a hash table.
  • There is some feature overlap with indexes. Now you need an extremely fast "primary index" on the master to look up objects.
User-allocated 64-bit keys
  • I think there's a consensus that this is not a large enough space to be used as a reasonable hash value, so the only real use case is ICQ.
System-allocated 64-bit keys with reuse
  • This is by far the easiest solution to implement. Berthas (multi-level indexes analogous to page tables) seem to be the data structure of choice.
  • Dangling pointers in the application are extremely dangerous. If a stale key now identifies a different object of the same type, the application will not crash; instead it is likely to silently read the wrong object and leak information.
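
The Bertha idea can be sketched as a radix-style multi-level index, analogous to a page table. The two-level split, the 16-bit fanout, and all names below are illustrative, not RAMCloud code; dicts stand in for the fixed-size arrays a real implementation would use.

```python
# Sketch of a "Bertha": a multi-level index analogous to page tables.
# A 64-bit key is split into index fields, one per level.
LEVEL_BITS = 16
MASK = (1 << LEVEL_BITS) - 1

class Bertha:
    def __init__(self):
        self.root = {}                      # top-level node

    def insert(self, key, value):
        leaf = self.root.setdefault(key >> LEVEL_BITS, {})
        leaf[key & MASK] = value

    def lookup(self, key):
        leaf = self.root.get(key >> LEVEL_BITS)
        return None if leaf is None else leaf.get(key & MASK)
```

Lookup cost is a fixed number of indexed loads (one per level), which is why this structure is attractive for densely allocated key spaces.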
System-allocated 64-bit keys without reuse
  • This has "sweet" semantics: assigned keys strictly increase with time.
  • A 64-bit key space is large enough that we still don't have to worry about running out of keys and wrapping around.
  • Berthas would be the data structure of choice for tables that never have deletes. With deletes, other options are not so easy:
    • Berthas with leaves collapsed to hash tables when they become sparse
      • Older objects that have many nearby deletes now incur a performance penalty.
      • Inner nodes of the Bertha still grow over time.
    • Berthas with leaves reused as they become sparse
      • Now a leaf can have multiple pointers coming into it.
      • Some fraction of keys will have to be skipped over and not used.
      • Splitting a key range to another server now requires unmerging the leaf.
      • Inner nodes of the Bertha still grow over time.
    • Hash tables
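
The "leaves collapsed to hash tables" option above can be sketched as follows. The leaf size, the collapse threshold, and all names are illustrative; a dict stands in for the hash table.

```python
LEAF_SIZE = 256                 # entries per leaf; illustrative
COLLAPSE_BELOW = LEAF_SIZE // 4 # collapse threshold; illustrative

class Leaf:
    """Dense array leaf that collapses to a hash table once deletes make it sparse."""
    def __init__(self):
        self.array = [None] * LEAF_SIZE   # dense representation
        self.sparse = None                # dict once collapsed
        self.live = 0

    def get(self, idx):
        if self.sparse is not None:       # collapsed: hash lookup (the slower path)
            return self.sparse.get(idx)
        return self.array[idx]

    def put(self, idx, value):
        if self.sparse is not None:
            self.sparse[idx] = value
            return
        if self.array[idx] is None:
            self.live += 1
        self.array[idx] = value

    def delete(self, idx):
        if self.sparse is not None:
            self.sparse.pop(idx, None)
            return
        if self.array[idx] is not None:
            self.array[idx] = None
            self.live -= 1
            if self.live < COLLAPSE_BELOW:    # sparse: switch to hash table
                self.sparse = {i: v for i, v in enumerate(self.array)
                               if v is not None}
                self.array = None
```

This makes the trade-off above concrete: old, mostly deleted leaves shrink to hash tables, but every access to a surviving neighbor now pays the hash-lookup penalty, and the inner nodes above the leaf never shrink.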
System-allocated (64-bit key, 64-bit generation number) tuples
  • We can use Berthas again, storing the generation number in the inode.
  • Generation numbers aren't "sweet".
  • It now takes 128 bits to identify an object.
  • It is possible to have a global generation number, so if my object's generation number is larger than that of yours, mine was created more recently.
  • It is not possible, however, to enumerate objects in chronological order efficiently.
    • Unless we maintain some sort of linked list...
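
A sketch of how the generation number catches dangling references; all names are illustrative. The counter is global, which gives the "created more recently" comparison described above.

```python
class Table:
    """Slots identified by (key, generation); stale references are detected."""
    def __init__(self):
        self.slots = {}        # key -> (generation, object)
        self.next_gen = 0      # global generation counter

    def create(self, key, obj):
        self.next_gen += 1
        self.slots[key] = (self.next_gen, obj)
        return (key, self.next_gen)          # the 128-bit identifier

    def read(self, key, gen):
        entry = self.slots.get(key)
        if entry is None or entry[0] != gen:
            raise KeyError("stale or unknown reference")  # instead of leaking data
        return entry[1]
```

If a key is recycled, the old (key, generation) pair no longer matches, so the application gets an error rather than someone else's object.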
System-allocated (48-bit key, 16-bit generation number) tuples
  • 48-bit keys may not be large enough.
  • 16-bit generation numbers could wrap around fairly quickly. Do we burn the keys after that or do we wrap around and risk following a dangling pointer?
  • Similar trade-offs as system-allocated (64-bit key, 64-bit generation number) tuples, just fewer bits.
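
The appeal of this variant is that the identifier fits in a single 64-bit word. A sketch of the packing, with the field widths from the option above (names are illustrative):

```python
KEY_BITS, GEN_BITS = 48, 16

def pack(key, gen):
    """Pack a 48-bit key and 16-bit generation into one 64-bit word."""
    assert key < (1 << KEY_BITS) and gen < (1 << GEN_BITS)
    return (key << GEN_BITS) | gen

def unpack(word):
    return word >> GEN_BITS, word & ((1 << GEN_BITS) - 1)
```

The wrap-around question above is exactly the consequence of GEN_BITS being 16: after 65,536 reuses of a slot, either the key must be retired or stale references become undetectable.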

Hash Tables

The "sweeter" options demand fast hash tables.

For a fast hash table:

  • Pick a good hash function
  • Decide between chaining and open addressing
  • Decide how to size the hash table and when to resize it
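
A minimal sketch touching all three decisions: it relies on the language's built-in hash function, chooses open addressing with linear probing over chaining, and grows at 2/3 load. All names and thresholds are illustrative.

```python
class ProbingTable:
    """Open-addressing hash table with linear probing and grow-on-load resize."""
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def _probe(self, key):
        # Scan from the hashed slot until we find the key or an empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        if (self.count + 1) * 3 >= len(self.slots) * 2:   # resize at 2/3 load
            old = [s for s in self.slots if s is not None]
            self.slots = [None] * (len(self.slots) * 2)
            self.count = 0
            for k, v in old:
                self.slots[self._probe(k)] = (k, v)
                self.count += 1
        i = self._probe(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return None if slot is None else slot[1]
```

Open addressing keeps entries in one contiguous array (cache-friendly, no per-entry pointer), at the cost of the resize pauses that the sizing decision above has to manage.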

One level up, we need to decide how to use this primitive:

  • On each master, have a single hash table
  • On each master, have one hash table per RAMCloud table, covering all of that table's key ranges on this master
  • On each master, have one hash table per key range (shard)

This choice will affect how frequently resizing is necessary and at what scale.

We should also consider what happens when you want to evict part of a key range to another machine:

  1. Decide what to split.
    • Maybe you don't really care; you just want to free up some space. If you need more, you'll split something else too.
  2. Transfer the objects from the old master's log to the new master's log.
  3. On the new master, add entries into a hash table for this new key range.
  4. On the old master, you want to clean up the now invalid entries from the hash table.
    • If it's one hash table per key range, this is O(1).
    • If it's one hash table per RAMCloud table, this is more painful: O(size of that table's key ranges on this master).
    • If it's one hash table for the entire master, this is much more painful. O(machine size)
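
The cleanup-cost difference can be made concrete with a toy sketch (dicts stand in for hash tables; all names are illustrative):

```python
# One hash table per key range (shard): cleanup is a single drop.
per_shard = {
    "shard_a": {1: "x", 2: "y"},
    "shard_b": {100: "z"},
}
del per_shard["shard_b"]        # O(1): discard the whole table

# One hash table for the entire master: cleanup scans every entry.
whole_master = {1: "x", 2: "y", 100: "z"}
evicted = lambda k: 100 <= k < 200
for k in [k for k in whole_master if evicted(k)]:   # O(machine size)
    del whole_master[k]
```

The per-shard layout buys cheap eviction at the price of more, smaller tables, which feeds back into the resizing question above.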