Primary Keys

The Proposed Server API page asks "What do we want keys to be?".
This page collects and evaluates the options.

Options

User-allocated variable-length or long fixed-length keys
  • Application chooses values for keys; if keys need to be unique, it's up to the application to figure out how to do this.
  • These longer keys must be passed around in requests.
  • The master needs a hash table.
  • There is some feature overlap with indexes. Now you need an extremely fast "primary index" on the master to look up objects.
  • If this is what the application wants (e.g., I'm creating a table that will always be indexed by a URL), then this is more efficient than the system-allocated options, where a secondary index would be needed for virtually every lookup.
  • BigTable and memcached do it
  • Painful for allocating sequential keys
User-allocated 64-bit keys
  • I think there's a consensus that this is not a large enough space to be used as a reasonable hash value, so the only real use case is ICQ.

    what does "ICQ" mean? -JO

    ICQ is an ancient chat program that identified users by a large integer. See http://en.wikipedia.org/wiki/ICQ#UIN . -Diego

  • Painful for allocating sequential keys
  • Nice for migrating legacy apps
  • Decent memcache emulation
  • 2 levels of hash collision to build a hash table
User-allocated 64-bit keys with system key generation assistance
  • Winner
  • Nice for legacy migration
  • Fast memcache emulation
  • 2 levels of hash collision to build a hash table
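
    One plausible reading of "key generation assistance" (an assumption, since the mechanism isn't spelled out above): keys remain user-allocated 64-bit values, but the server keeps a per-table counter that hands out the next sequential key on request, which removes the pain of allocating sequential keys noted for the other user-allocated options. A minimal C++ sketch:

      #include <cstdint>
      #include <unordered_map>

      // Hypothetical per-table sequential key allocator on the server side.
      class KeyGenerator {
          std::unordered_map<uint64_t, uint64_t> nextKey;  // tableId -> next key
      public:
          uint64_t allocate(uint64_t tableId) {
              return nextKey[tableId]++;   // starts at 0 for a new table
          }
      };
      // An application that wants to pick its own keys (hashes, legacy IDs)
      // simply never calls allocate() for that table.
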
System-allocated 64-bit keys with reuse
  • In this approach the system allocates keys, starting with 0 and incrementing for each successive addition.
  • "With reuse" means that is an object is deleted the system can reuse its key for a new object in the future.
  • This is by far the easiest solution to implement. Berthas (multi-level indexes analogous to page tables) seem to be the data structure of choice.
  • With this scheme it's important to use the key space densely: if deletions result in sparsely used Berthas then this approach will become inefficient. Reuse makes it easier to ensure a dense key space.
  • Dangling pointers in the application (stale keys whose objects have been deleted and whose key values have since been reused) are extremely dangerous. If the application gets back the same type of object as was originally identified by that key, it will not crash but is likely to leak information.
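
    The "Bertha" mentioned above is a multi-level index analogous to a page table. Below is a minimal two-level sketch in C++; the level split, leaf size, and Object type are illustrative assumptions, not a committed design.

      // Minimal two-level Bertha sketch: upper bits of the key select a leaf,
      // lower bits select a slot within the leaf. Works well when the key
      // space is dense, which is why reuse (or burning few keys) matters.
      #include <cstdint>
      #include <array>
      #include <memory>
      #include <vector>

      struct Object;                                   // lives in the master's log

      class Bertha {
          static const int LEAF_BITS = 12;             // 4096 slots per leaf (assumed)
          static const uint64_t LEAF_SLOTS = 1ull << LEAF_BITS;

          struct Leaf {
              std::array<Object*, LEAF_SLOTS> slots{}; // nullptr = no object
          };
          std::vector<std::unique_ptr<Leaf>> inner;    // grows as keys are allocated

      public:
          Object* lookup(uint64_t key) const {
              uint64_t leaf = key >> LEAF_BITS;
              if (leaf >= inner.size() || !inner[leaf])
                  return nullptr;
              return inner[leaf]->slots[key & (LEAF_SLOTS - 1)];
          }

          void insert(uint64_t key, Object* object) {
              uint64_t leaf = key >> LEAF_BITS;
              if (leaf >= inner.size())
                  inner.resize(leaf + 1);
              if (!inner[leaf])
                  inner[leaf] = std::make_unique<Leaf>();
              inner[leaf]->slots[key & (LEAF_SLOTS - 1)] = object;
          }

          void remove(uint64_t key) {                  // frees the slot; with reuse,
              uint64_t leaf = key >> LEAF_BITS;        // the key can be handed out again
              if (leaf < inner.size() && inner[leaf])
                  inner[leaf]->slots[key & (LEAF_SLOTS - 1)] = nullptr;
          }
      };
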
System-allocated 64-bit keys without reuse
  • This has "sweet" semantics.
  • Key values increase strictly with time.
  • "Append" operations are simple: just create a new object. The order of events can be determined by comparing key values.
  • 64-bit keys mean we don't have to worry about running out of the key space and wrapping around.
  • If keys are never deleted in a table then it's easy to enumerate the table's objects in chronological order. However, if keys are deleted the enumeration is not so easy: the obvious approach of stepping through each of the key values in order may spend a lot of time looking for objects that no longer exist.
  • Berthas would be the data structure of choice for tables that never have deletes. With deletes, other options are not so easy:
    • Berthas with leaves collapsed to hash tables when they become sparse
      • Older objects that have many nearby deletes now incur a performance penalty.
      • Inner nodes of the Bertha still grow over time.
      • This option feels complicated because it combines a couple of different modes of operation.
    • Berthas with leaves reused as they become sparse (see the sketch after this list)

      I don't understand how this option works -JO

      This one would be a little tricky to get right, and it's easier to explain with a diagram. Say there's a leaf of 8 entries that's only using entries 1, 2, and 6 for keys 1201, 1202, and 1206. I can reuse this leaf starting at 2700 for 2700, 2703, 2704, 2705, and 2707 if I burn the keys 2701, 2702, and 2706. -Diego

      • Now a leaf can have multiple pointers coming into it
      • Some fraction of keys will have to be skipped over and not used
      • Splitting a key range to another server now requires unmerging the leaf.
      • Inner nodes of the Bertha still grow over time.
    • Hash tables
  • Looks like memory
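
    To make the leaf-reuse idea concrete, here is a small C++ sketch of Diego's example above: a sparse 8-entry leaf (old keys 1201, 1202, 1206 in slots 1, 2, 6) is reused under a new base key 2700, and the new keys whose slots are already occupied (2701, 2702, 2706) are burned. The names are illustrative assumptions.

      #include <cstdint>
      #include <array>
      #include <vector>

      struct Object;

      struct Leaf {
          static const int SLOTS = 8;            // 8 entries, as in the example
          std::array<Object*, SLOTS> entries{};  // nullptr = free slot
      };

      // Returns the new keys that can be allocated when this leaf is reused
      // under newBase; keys whose slots are already occupied are burned.
      std::vector<uint64_t> reusableKeys(const Leaf& leaf, uint64_t newBase) {
          std::vector<uint64_t> keys;
          for (int slot = 0; slot < Leaf::SLOTS; slot++) {
              if (leaf.entries[slot] == nullptr)
                  keys.push_back(newBase + slot);  // e.g. 2700, 2703, 2704, 2705, 2707
              // else: newBase + slot (2701, 2702, 2706) is burned
          }
          return keys;
      }

    The inner nodes for base 1200 and base 2700 would then both point at this leaf, which is where the extra bookkeeping (multiple incoming pointers, unmerging the leaf on splits) comes from.
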
System-allocated (64-bit key, 64-bit generation number) tuples
  • We can use Berthas again, storing the generation number in the inode (a lookup sketch follows this list).
  • Generation numbers aren't "sweet".
  • It now takes 128 bits to identify an object.
  • It is possible to have a global generation number, so if my object's generation number is larger than that of yours, mine was created more recently.
  • It is not possible, however, to enumerate objects in chronological order efficiently.
    • Unless we maintain some sort of linked list...
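
    Here is a sketch of how the generation number could catch dangling references, assuming the Bertha slot ("inode") stores the current generation; the type and function names are illustrative, not a proposed API.

      #include <cstdint>

      struct Object;

      struct Inode {                 // what a Bertha leaf slot would hold
          Object* object;            // nullptr if the key is currently unused
          uint64_t generation;       // generation of the object stored here
      };

      struct ObjectId {              // the 128 bits handed to the application
          uint64_t key;
          uint64_t generation;
      };

      // Lookup fails unless the generation in the reference still matches the
      // one in the inode, so a reused key cannot silently return a newer,
      // unrelated object.
      Object* lookup(const Inode* inodes, const ObjectId& id) {
          const Inode& inode = inodes[id.key];   // dense key -> inode (via Bertha)
          if (inode.object == nullptr || inode.generation != id.generation)
              return nullptr;                    // dangling reference detected
          return inode.object;
      }
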
System-allocated (48-bit key, 16-bit generation number) tuples
  • 48-bit keys may not be large enough
  • 16-bit generation numbers could wrap around fairly quickly. Do we burn the keys after that or do we wrap around and risk following a dangling pointer?
  • Similar trade-offs as system-allocated (64-bit key, 64-bit generation number) tuples, just fewer bits.

Hash Tables

The "sweeter" options demand fast hash tables.

For a fast hash table:

  • Pick a good hash function
  • Decide between chaining and open addressing
  • Decide how to size the hash table and when to resize it
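
    As a concrete illustration of those three decisions, here is a minimal open-addressing (linear-probing) table for 64-bit keys that resizes at 2/3 load. The hash function and policies are stand-ins rather than the chosen design, and deletion (tombstones) is omitted for brevity.

      #include <cstdint>
      #include <cstddef>
      #include <vector>

      struct Object;

      class KeyHashTable {
          struct Bucket { uint64_t key; Object* object; };  // object == nullptr: empty
          std::vector<Bucket> buckets;
          size_t entries = 0;

          // Decision 1: hash function (a 64-bit mixer as a stand-in).
          static uint64_t hash(uint64_t k) {
              k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
              k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
              k ^= k >> 33;
              return k;
          }

          // Decision 3: resize policy, double when load exceeds 2/3.
          void grow() {
              std::vector<Bucket> old = std::move(buckets);
              buckets.assign(old.size() * 2, Bucket{0, nullptr});
              entries = 0;
              for (const Bucket& b : old)
                  if (b.object != nullptr)
                      insert(b.key, b.object);
          }

      public:
          explicit KeyHashTable(size_t nBuckets = 1024)
              : buckets(nBuckets, Bucket{0, nullptr}) {}

          // Decision 2: open addressing with linear probing (vs. chaining).
          void insert(uint64_t key, Object* object) {
              if (3 * (entries + 1) > 2 * buckets.size())
                  grow();
              size_t i = hash(key) % buckets.size();
              while (buckets[i].object != nullptr && buckets[i].key != key)
                  i = (i + 1) % buckets.size();
              if (buckets[i].object == nullptr)
                  entries++;
              buckets[i] = Bucket{key, object};
          }

          Object* lookup(uint64_t key) const {
              size_t i = hash(key) % buckets.size();
              while (buckets[i].object != nullptr) {
                  if (buckets[i].key == key)
                      return buckets[i].object;
                  i = (i + 1) % buckets.size();
              }
              return nullptr;
          }
      };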

One level up, we need to decide how to use this primitive:

  • On each master, have a single hash table
    • This option probably has the best performance: a single large bucket array is likely to have fewer conflicts than a whole bunch of smaller bucket arrays.
    • Keys from different tables and shards are mixed up together, which may make some operations slower or more complicated.
  • On each master, have one hash table per RAMCloud table (grouping all of that table's key ranges on this master)
  • On each master, have one hash table per key range (shard)

This choice will affect how frequently resizing is necessary and at what scale.

We should also consider what happens when you want to evict part of a key range to another machine:

  1. Decide what to split.
    • Maybe you don't really care, you just want to free up some space. If you need more, you'll split something else too.
  2. Transfer the objects from the old master's log to the new master's log.
  3. On the new master, add entries into a hash table for this new key range.
  4. On the old master, you want to clean up the now invalid entries from the hash table.
    • If it's one hash table per key range, this is O(1).
    • If it's one hash table per RAMCloud table, this is more painful: O(size of that table's key ranges on this master).
    • If it's one hash table for the entire master, this is much more painful. O(machine size) (But, we have already scanned the log to copy objects to the new server; all we have to do is scan the same log again to identify the keys that have moved, then hash into the hash table and delete them from the hash table. This at most doubles the cost of eviction.)
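
    A sketch of that last case, assuming a single hash table per master: the eviction code re-scans the same log entries it just copied to the new master and removes each moved key from the hash table. The log-entry layout and hash-table interface here are assumptions for illustration.

      #include <cstdint>
      #include <vector>

      struct LogEntry { uint64_t tableId; uint64_t key; /* object data ... */ };
      struct KeyRange { uint64_t tableId; uint64_t firstKey; uint64_t lastKey; };

      static bool inRange(const LogEntry& e, const KeyRange& r) {
          return e.tableId == r.tableId && e.key >= r.firstKey && e.key <= r.lastKey;
      }

      // Second pass over the log entries that were copied to the new master:
      // delete each moved key from the (master-wide) hash table.
      // HashTable::remove(key) is assumed to exist on whatever table we build.
      template <typename HashTable>
      void cleanupAfterEviction(const std::vector<LogEntry>& copiedLog,
                                const KeyRange& evicted, HashTable& hashTable) {
          for (const LogEntry& entry : copiedLog)
              if (inRange(entry, evicted))
                  hashTable.remove(entry.key);
      }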