Main Log: index in memory, no additional persistence

Strategies

Straw-man proposal

Normal operation:

  1. No op

Recovery:

  1. Select a new index host.
  2. In parallel, the participating masters send the index entries for their objects to the new index host, which rebuilds the index from them (sketched just below).
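
A minimal sketch of what the new index host might do during step 2, assuming unique search keys; RecoveringIndexHost and receiveEntry are invented names, since the notes don't fix an interface:

    #include <cstdint>
    #include <map>
    #include <string>

    class RecoveringIndexHost {
      public:
        // Called once for each index entry a participating master streams over.
        // Assumes unique search keys for simplicity; a secondary index would
        // use a multimap instead.
        void receiveEntry(const std::string& searchKey, uint64_t oid) {
            index_[searchKey] = oid;
        }

      private:
        std::map<std::string, uint64_t> index_;   // the rebuilt index, in memory only
    };
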
Optimization: array of new index hosts

To avoid a possible in-cast problem on the new index host:

  1. Partition the search key space of the logical index.
  2. Select an array of new index hosts, each to recover one partition.
  3. The participating masters send index entries to the index hosts based on which partition the search key falls into.
  4. Later, if desired, the index partitions can be merged to fewer hosts.
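
A minimal sketch of the routing decision in step 3, assuming range partitioning of the search key space (hash partitioning would work the same way); Partition and hostForKey are hypothetical names:

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    struct Partition {
        std::string firstKey;   // inclusive lower bound of this partition's key range
        int indexHostId;        // the new index host recovering this partition
    };

    // partitions must be sorted by firstKey, with the first partition covering
    // the smallest keys (firstKey == "").
    int hostForKey(const std::string& searchKey,
                   const std::vector<Partition>& partitions) {
        assert(!partitions.empty() && partitions.front().firstKey.empty());
        auto it = std::upper_bound(
            partitions.begin(), partitions.end(), searchKey,
            [](const std::string& key, const Partition& p) { return key < p.firstKey; });
        return std::prev(it)->indexHostId;   // last partition with firstKey <= searchKey
    }

With a rule like this, each master streams each entry directly to its partition's host, so no single new index host absorbs the full in-cast.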

Discussion

Out-cast problem

It's not clear that there is always a solution. We talked about getting data from backups or sharding.

Replication: index in memory, data structure also duplicated on backup hosts

Strategies

Straw-man proposal

Normal operation:

  1. Replicate the entire index data structure to k backups.
  2. If power goes out, the backups should try to flush their replicas to disk.
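
A rough sketch of step 1 under assumed interfaces: every mutation of the real index is mirrored to k shadow copies. ShadowBackup stands in for what would really be an RPC to a backup host, and unique search keys are assumed for simplicity:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct ShadowBackup {
        // Stand-in for an RPC to a backup host; the shadow is a local map here
        // so the sketch is self-contained.
        void applyInsert(const std::string& key, uint64_t oid) { shadow[key] = oid; }
        void applyRemove(const std::string& key) { shadow.erase(key); }
        std::map<std::string, uint64_t> shadow;
    };

    class ReplicatedIndex {
      public:
        explicit ReplicatedIndex(std::vector<ShadowBackup*> backups)
            : backups_(std::move(backups)) {}

        void insert(const std::string& key, uint64_t oid) {
            index_[key] = oid;              // the real index on the index host
            for (auto* b : backups_)        // mirror to all k shadow copies
                b->applyInsert(key, oid);
        }

        void remove(const std::string& key) {
            index_.erase(key);
            for (auto* b : backups_)
                b->applyRemove(key);
        }

      private:
        std::map<std::string, uint64_t> index_;
        std::vector<ShadowBackup*> backups_;   // the k backups
    };
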

Recovery:

  1. Select a new index host.
  2. Case:
Optimization: use smaller k

The chances of having to fall back to Main Log recovery are larger, but the overhead during normal operation is smaller.

Optimization: Don't store search keys in the replicas

Normal operation:

  1. Replicate the entire index data structure to k backups, except don't tell them what the search keys are.
  2. If power goes out, the backups should try to flush their shadow indexes to disk.
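
One way to picture a key-less replica under this optimization (an assumption, not a design fixed by these notes): the backup keeps only the leaf-level order of oids, so recovering the actual search keys means going back to the objects on the masters. See also "Interior nodes" under Discussion below.

    #include <cstdint>
    #include <vector>

    struct ShadowLeaf {
        std::vector<uint64_t> oids;       // oids in the same order as the real leaf's keys
    };

    struct ShadowIndex {
        std::vector<ShadowLeaf> leaves;   // leaves in key order; search keys and
                                          // interior nodes are deliberately omitted
    };
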

Recovery:

  1. Select a new index host.
  2. Case:
Optimization: array of new index hosts

Discussion

Best solution to that bug?
Interior nodes

It's not strictly necessary to store interior nodes in a shadow tree the same way as in the real tree. A more compact but slower representation might be better.

Logging: index in memory, index changes logged separately

Strategies

Straw-man proposal

Normal operation:

  1. Log index operations to a new log. Each entry is either an add, (key, oid), or a tombstone, (segment number, offset); a tombstone marks the death of a previous add entry and points to it. Handle this log much the same as the main log.
  2. Augment the index entries in the data structure (tree/hash table) to record the segment number and offset of the index operation that added it.
  3. Cleaning:

Recovery:

  1. Select a new index host.
  2. Replay the index operations log.
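
A sketch of the index operation log and the replay in step 2, with assumed field widths and names (the notes don't fix an on-disk format):

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    struct AddEntry  { std::string key; uint64_t oid; };
    struct Tombstone { uint64_t segmentNumber; uint32_t offset; };   // the add it kills

    struct IndexLogEntry {
        enum class Type { ADD, TOMBSTONE } type;
        uint64_t segmentNumber;   // where this entry itself sits in the index log
        uint32_t offset;
        AddEntry add;             // meaningful when type == ADD
        Tombstone tombstone;      // meaningful when type == TOMBSTONE
    };

    // Rebuild the index from the logged operations. A real implementation would
    // stream entries from backups rather than buffer the whole log in memory.
    std::map<std::string, uint64_t>
    replay(const std::vector<IndexLogEntry>& log) {
        // Pass 1: note which add entries were later killed by a tombstone.
        std::set<std::pair<uint64_t, uint32_t>> killed;
        for (const auto& e : log)
            if (e.type == IndexLogEntry::Type::TOMBSTONE)
                killed.insert({e.tombstone.segmentNumber, e.tombstone.offset});

        // Pass 2: apply surviving adds in log order (later adds win on a key).
        std::map<std::string, uint64_t> index;
        for (const auto& e : log)
            if (e.type == IndexLogEntry::Type::ADD &&
                killed.count({e.segmentNumber, e.offset}) == 0)
                index[e.add.key] = e.add.oid;
        return index;
    }
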
Optimization: add entries don't store keys in memory

Change only the in-memory add entries to store a pointer to the index entry in the tree/hash table instead of repeating the (key, oid). To be clear, the add entries on disk still store the (key, oid); the add entries in memory only point to their contents.
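
A sketch of the two add-entry representations this optimization distinguishes, with assumed field names and widths:

    #include <cstdint>
    #include <string>

    struct IndexEntry {            // lives in the tree/hash table on the index host
        std::string key;
        uint64_t oid;
        uint64_t segmentNumber;    // where the add entry that created it was logged
        uint32_t offset;
    };

    struct AddEntryOnDisk {        // unchanged: still carries its contents
        std::string key;           // (serialized bytes in a real log)
        uint64_t oid;
    };

    struct AddEntryInMemory {      // optimized: a single pointer, no repeated (key, oid)
        IndexEntry* entry;
    };
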

These are the new cleaning rules:

Optimization: array of new index hosts (butterfly scramble)

Stackable on the in-memory add entries optimization.

On recovery, partition the key space and assign one partition to each host. The backups send only the relevant index operations to each new index host.

Discussion

Can we and should we use the same log as for objects?
Should the log point to index entries, or should index entries point to the log?

Discussion

RAM usage in replication vs. logging approach

If we suppose each index entry takes 12 bytes in the shadow index (8 for the oid and 4 more to amortize the cost of interior nodes/buckets and memory allocation; I think this is quite pessimistic), a shadow index with 1 billion entries will take 11.2GB of RAM.

Compare this to the memory usage of the most optimized log structure: index entries each store a segment number (no offset; 6 bytes), add entries take a pointer (6 bytes), and tombstones store a segment number and offset (8 bytes). Assume there is no space overhead from cleaning or from memory allocation for log entries. If the log is 100% clean, this is 12 bytes per index entry. Under these optimistic assumptions, for an index with 1 billion entries, the in-memory log overhead will be 11.2GB. This alone pays for the memory usage of k=1 replication.
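
The arithmetic behind the two 11.2GB figures, as a quick check under the stated per-entry assumptions:

    #include <cstdio>

    int main() {
        const double entries = 1e9;                   // 1 billion index entries
        const double shadowBytesPerEntry = 8 + 4;     // oid + amortized structure overhead
        const double logBytesPerEntry = 6 + 6;        // segment number + in-memory add pointer
        const double GiB = 1024.0 * 1024.0 * 1024.0;
        std::printf("shadow index: %.1f GB\n", entries * shadowBytesPerEntry / GiB);  // ~11.2
        std::printf("log overhead: %.1f GB\n", entries * logBytesPerEntry / GiB);     // ~11.2
        return 0;
    }
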

See John's 2010-01-27 email.