Main Log: index in memory, no additional persistence

Strategies

Straw-man proposal

Normal operation:

  1. No op

Recovery:

  1. Select a new index host.
  2. In parallel, the participating masters send the index entries for their objects to the new index host, which rebuilds the index from them (sketched just below).
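
A minimal sketch of what the new index host might do during step 2, assuming unique search keys; RecoveringIndexHost and receiveEntry are invented names, since the notes don't fix an interface:

    #include <cstdint>
    #include <map>
    #include <string>

    class RecoveringIndexHost {
      public:
        // Called once for each index entry a participating master streams over.
        // Assumes unique search keys for simplicity; a secondary index would
        // use a multimap instead.
        void receiveEntry(const std::string& searchKey, uint64_t oid) {
            index_[searchKey] = oid;
        }

      private:
        std::map<std::string, uint64_t> index_;   // the rebuilt index, in memory only
    };
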
Optimization: array of new index hosts

To avoid a possible in-cast problem on the new index host:

  1. Partition the search key space of the logical index.
  2. Select an array of new index hosts, each to recover one partition.
  3. The participating masters send index entries to the index hosts based on which partition the search key falls into.
  4. Later, if desired, the index partitions can be merged to fewer hosts.
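
A minimal sketch of the routing decision in step 3, assuming range partitioning of the search key space (hash partitioning would work the same way); Partition and hostForKey are hypothetical names:

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    struct Partition {
        std::string firstKey;   // inclusive lower bound of this partition's key range
        int indexHostId;        // the new index host recovering this partition
    };

    // partitions must be sorted by firstKey, with the first partition covering
    // the smallest keys (firstKey == "").
    int hostForKey(const std::string& searchKey,
                   const std::vector<Partition>& partitions) {
        assert(!partitions.empty() && partitions.front().firstKey.empty());
        auto it = std::upper_bound(
            partitions.begin(), partitions.end(), searchKey,
            [](const std::string& key, const Partition& p) { return key < p.firstKey; });
        return std::prev(it)->indexHostId;   // last partition with firstKey <= searchKey
    }

With a rule like this, each master streams each entry directly to its partition's host, so no single new index host absorbs the full in-cast.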

Discussion

Out-cast problem

It's not clear that there is always a solution. We talked about getting data from backups or sharding.

Replication: index in memory, data structure also duplicated on backup hosts

Strategies

Straw-man proposal

Normal operation:

  1. Replicate the entire index data structure to k backups.
  2. If power goes out, the backups should try to flush their replicas to disk.
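
A rough sketch of step 1 under assumed interfaces: every mutation of the real index is mirrored to k shadow copies. ShadowBackup stands in for what would really be an RPC to a backup host, and unique search keys are assumed for simplicity:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct ShadowBackup {
        // Stand-in for an RPC to a backup host; the shadow is a local map here
        // so the sketch is self-contained.
        void applyInsert(const std::string& key, uint64_t oid) { shadow[key] = oid; }
        void applyRemove(const std::string& key) { shadow.erase(key); }
        std::map<std::string, uint64_t> shadow;
    };

    class ReplicatedIndex {
      public:
        explicit ReplicatedIndex(std::vector<ShadowBackup*> backups)
            : backups_(std::move(backups)) {}

        void insert(const std::string& key, uint64_t oid) {
            index_[key] = oid;              // the real index on the index host
            for (auto* b : backups_)        // mirror to all k shadow copies
                b->applyInsert(key, oid);
        }

        void remove(const std::string& key) {
            index_.erase(key);
            for (auto* b : backups_)
                b->applyRemove(key);
        }

      private:
        std::map<std::string, uint64_t> index_;
        std::vector<ShadowBackup*> backups_;   // the k backups
    };
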

Recovery:

  1. Select a new index host.
  2. Case:
Optimization: use smaller k

The chances of having to fall back to Main Log recovery are larger, but the overhead during normal operation is smaller.

Optimization: Don't store search keys in the replicas

Normal operation:

  1. Replicate the entire index data structure to k backups, except don't tell them what the search keys are.
  2. If power goes out, the backups should try to flush their shadow indexes to disk.
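
One way to picture a key-less replica under this optimization (an assumption, not a design fixed by these notes): the backup keeps only the leaf-level order of oids, so recovering the actual search keys means going back to the objects on the masters. See also "Interior nodes" under Discussion below.

    #include <cstdint>
    #include <vector>

    struct ShadowLeaf {
        std::vector<uint64_t> oids;       // oids in the same order as the real leaf's keys
    };

    struct ShadowIndex {
        std::vector<ShadowLeaf> leaves;   // leaves in key order; search keys and
                                          // interior nodes are deliberately omitted
    };
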

Recovery:

  1. Select a new index host.
  2. Case:
Optimization: array of new index hosts

Discussion

Best solution to that bug?
Interior nodes

It's not strictly necessary to store interior nodes in a shadow tree the same way as in the real tree. A more compact but slower representation might be better.

Logging: index in memory, index changes logged separately

Strategies

Straw-man proposal

Normal operation:

  1. Log index operations to a new log. Each entry is either an add, (key, oid), or a tombstone, (segment number, offset); a tombstone marks the death of a previous add entry and points to it. Handle this log much the same as the main log.
  2. Augment the index entries in the data structure (tree/hash table) to record the segment number and offset of the index operation that added it.
  3. Cleaning:

Recovery:

  1. Select a new index host.
  2. Replay the index operations log.
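
A sketch of the index operation log and the replay in step 2, with assumed field widths and names (the notes don't fix an on-disk format):

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    struct AddEntry  { std::string key; uint64_t oid; };
    struct Tombstone { uint64_t segmentNumber; uint32_t offset; };   // the add it kills

    struct IndexLogEntry {
        enum class Type { ADD, TOMBSTONE } type;
        uint64_t segmentNumber;   // where this entry itself sits in the index log
        uint32_t offset;
        AddEntry add;             // meaningful when type == ADD
        Tombstone tombstone;      // meaningful when type == TOMBSTONE
    };

    // Rebuild the index from the logged operations. A real implementation would
    // stream entries from backups rather than buffer the whole log in memory.
    std::map<std::string, uint64_t>
    replay(const std::vector<IndexLogEntry>& log) {
        // Pass 1: note which add entries were later killed by a tombstone.
        std::set<std::pair<uint64_t, uint32_t>> killed;
        for (const auto& e : log)
            if (e.type == IndexLogEntry::Type::TOMBSTONE)
                killed.insert({e.tombstone.segmentNumber, e.tombstone.offset});

        // Pass 2: apply surviving adds in log order (later adds win on a key).
        std::map<std::string, uint64_t> index;
        for (const auto& e : log)
            if (e.type == IndexLogEntry::Type::ADD &&
                killed.count({e.segmentNumber, e.offset}) == 0)
                index[e.add.key] = e.add.oid;
        return index;
    }
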
Optimization: add entries don't store keys in memory

Change only the in-memory add entries to store a pointer to the index entry in the tree/hash table instead of repeating the (key, oid). To be clear, the add entries on disk still store the (key, oid); the add entries in memory only point to their contents.
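
A sketch of the two add-entry representations this optimization distinguishes, with assumed field names and widths:

    #include <cstdint>
    #include <string>

    struct IndexEntry {            // lives in the tree/hash table on the index host
        std::string key;
        uint64_t oid;
        uint64_t segmentNumber;    // where the add entry that created it was logged
        uint32_t offset;
    };

    struct AddEntryOnDisk {        // unchanged: still carries its contents
        std::string key;           // (serialized bytes in a real log)
        uint64_t oid;
    };

    struct AddEntryInMemory {      // optimized: a single pointer, no repeated (key, oid)
        IndexEntry* entry;
    };
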

These are the new cleaning rules:

Optimization: array of new index hosts (butterfly scramble)

Stackable on the in-memory add entries optimization.

On recovery, partition the key space and assign one partition to each host. The backups send only the relevant index operations to each new index host.

Discussion

Can we and should we use the same log as for objects?
Should the log point to index entries, or should index entries point to the log?

Discussion

RAM usage in replication vs. logging approach

If we suppose each index entry takes 12 bytes in the shadow index (8 for the oid and 4 more to amortize the cost of interior nodes/buckets and memory allocation; I think this is quite pessimistic), a shadow index with 1 billion entries will take 11.2GB of RAM.

Compare this to the memory usage of the most optimized log structure: index entries each store a segment number (no offset; 6 bytes), add entries take a pointer (6 bytes), and tombstones store a segment number and offset (8 bytes). Assume there is no space overhead from cleaning or from memory allocation for log entries. If the log is 100% clean, this is 12 bytes per index entry. Under these optimistic assumptions, for an index with 1 billion entries, the in-memory log overhead will be 11.2GB. This alone pays for the memory usage of k=1 replication.
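
The arithmetic behind the two 11.2GB figures, as a quick check under the stated per-entry assumptions:

    #include <cstdio>

    int main() {
        const double entries = 1e9;                   // 1 billion index entries
        const double shadowBytesPerEntry = 8 + 4;     // oid + amortized structure overhead
        const double logBytesPerEntry = 6 + 6;        // segment number + in-memory add pointer
        const double GiB = 1024.0 * 1024.0 * 1024.0;
        std::printf("shadow index: %.1f GB\n", entries * shadowBytesPerEntry / GiB);  // ~11.2
        std::printf("log overhead: %.1f GB\n", entries * logBytesPerEntry / GiB);     // ~11.2
        return 0;
    }
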

See John's 2010-01-27 email.