...

  1. Conjoin tombstones with object writes when objects are overwritten. The idea is that with string keys, tombstones would presumably contain the entire variable-length key. Overwrites would then be especially expensive, since we'd write both the new object with the key and the tombstone with its own copy of it. By specializing the overwrite case, some space could be saved: an overwrite would consist of a conjoined new object and a tombstone for the old one.

    There are a few wrinkles involved. For instance, when an object that has a conjoined tombstone is cleaned, the tombstone must be separated out if its referent is still alive. Likewise, if the tombstone part of a conjoined entry is cleaned, the object must be split off.
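
    A rough sketch of what a conjoined overwrite entry might look like (the struct and field names are illustrative assumptions, not RAMCloud's actual log format):

        #include <cstdint>

        // Hypothetical layout: a single log entry that both installs the new
        // object and tombstones the version it replaces, so the variable-length
        // key is stored only once.
        struct ConjoinedOverwrite {
            uint64_t tableId;
            uint64_t newVersion;     // version of the new object
            uint64_t oldVersion;     // version being tombstoned
            uint64_t oldSegmentId;   // segment holding the old object
            uint16_t keyLength;      // a single copy of the key follows
            uint32_t valueLength;    // the new object's data follows the key
            // char keyAndValue[];   // key bytes, then value bytes
        };
        // When the cleaner relocates such an entry it may have to split it:
        // keep the object half if the object is still live, and emit a
        // standalone tombstone if the old version still needs one.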
     
  2. Shrink tombstones so that they include only the last two fields (Version and SegmentId). This will work if we make versions server-unique. On recovery, recovering masters keep track of all object and tombstone versions seen and ensure that dead objects are not made live again. This probably requires at worst an extra 8 bytes per tombstone during recovery, which in the worst case would be about an additional 8/28 ~= 29% overhead if the partition contains only tombstones.

    Turns out this won't work. The problem is that during recovery the backups must filter objects and tombstones by partition in order to send them to the correct recovery master. If tombstones don't contain any tablet or object information, then they cannot be filtered. For this to work, we'd need at least the table id and probably the hash of the key. So we're back to four 64-bit fields (Version, SegmentId, TableId, ObjectHash).
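
    For concreteness, a sketch of the two tombstone layouts discussed above (field names are illustrative):

        #include <cstdint>

        // Option 2 as originally proposed: relies on server-unique versions.
        struct ShrunkTombstone {
            uint64_t version;     // server-unique version of the dead object
            uint64_t segmentId;   // segment the dead object lived in
        };

        // What backups actually need in order to route entries to the correct
        // recovery master: the table id and key hash as well.
        struct FilterableTombstone {
            uint64_t version;
            uint64_t segmentId;
            uint64_t tableId;     // identifies the tablet
            uint64_t objectHash;  // key hash, for hash-range partitioning
        };
        // The recovery-time bookkeeping is roughly 8 extra bytes per tombstone;
        // against a ~28-byte tombstone that is the 8/28 ~= 29% worst-case
        // overhead quoted above.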
     
  3. Nuke tombstones entirely. Well, sort of. Here's the proposal:

    Backups maintain an in-memory bit vector for each segment they store on disk. Each bit corresponds to the liveness of an object (say, 1 means the object is deleted). Since objects are variable length, we need a way to associate an object with a specific bit. This can be done by taking the minimum object size for a segment and allocating SEGMENT_SIZE / MIN_OBJECT_SIZE bits in the vector. We then logically split the segment into contiguous MIN_OBJECT_SIZE-byte regions. Since at most one object can start in each region, the index of the region in which an object starts serves as that object's index into the backup's bit vector.
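
    A small sketch of this indexing scheme (the constants and names below are assumptions for illustration, not RAMCloud's actual values):

        #include <cstdint>
        #include <vector>

        static const uint32_t SEGMENT_SIZE    = 8 * 1024 * 1024;  // illustrative
        static const uint32_t MIN_OBJECT_SIZE = 40;               // illustrative

        // One liveness bit per MIN_OBJECT_SIZE-byte region. At most one entry
        // can begin in any given region, so the region index uniquely
        // identifies the entry that starts there.
        struct SegmentLivenessBits {
            std::vector<bool> deleted;   // true = the entry starting there is dead

            SegmentLivenessBits()
                : deleted(SEGMENT_SIZE / MIN_OBJECT_SIZE, false) {}

            static uint32_t bitIndexFor(uint32_t entryOffsetInSegment) {
                return entryOffsetInSegment / MIN_OBJECT_SIZE;
            }

            void markDeleted(uint32_t entryOffsetInSegment) {
                deleted[bitIndexFor(entryOffsetInSegment)] = true;
            }
        };
        // With these numbers the vector is 8 MB / 40 ~= 210K bits, i.e. about
        // 26 KB of backup memory per stored segment replica.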

    When an object is deleted, RPCs are sent to the replicas of the segment it resides in, telling them to flip the appropriate bit. Unfortunately, this breaks recovery: when locating segment replicas during recovery, how do we know that a prospective replica has the latest version of the bit vector? We essentially now have quasi-mutable closed segments.

    A solution is to associate a version number with each segment. When an object is deleted, the following procedure takes place: First, the segment version number (V) is incremented on the appropriate segment (S). This value is sent to each replica along with an instruction to flip the appropriate liveness bit. Next, a special entry is appended to the head of the log stating that the minimum valid version for S is V. Finally, each Log Digest contains a <SegmentId, Version> tuple for each segment in the log (this implies that the special updates need only survive in the head segment - they can be immediately cleaned from any closed one). On recovery, the head is located and the latest digest is computed from the digest in the head plus any updates appended after it. A segment replica is used in recovery only if its version is at least as high as the version the head's digest records for that segment.
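
    A rough outline of this delete path (the types and RPCs below are hypothetical stand-ins, not existing RAMCloud interfaces):

        #include <cstdint>
        #include <vector>

        struct Backup {
            void flipLivenessBit(uint64_t segmentId, uint32_t offset, uint64_t version) {
                // ... RPC to this backup's replica of segmentId ...
            }
        };
        struct MinSegmentVersion { uint64_t segmentId; uint64_t version; };
        struct LogHead {
            void append(const MinSegmentVersion& update) { /* write to head segment */ }
        };
        struct Segment { uint64_t id; uint64_t version; std::vector<Backup*> replicas; };

        // Deleting the object at `offset` within closed segment S:
        void deleteObject(LogHead& head, Segment& S, uint32_t offset) {
            uint64_t v = ++S.version;                 // 1. bump S's version
            for (Backup* b : S.replicas)              // 2. flip the liveness bit on
                b->flipLivenessBit(S.id, offset, v);  //    each replica, tagged with v
            head.append({S.id, v});                   // 3. note in the log head that
                                                      //    copies of S below v are stale
        }
        // Recovery then accepts a replica of S only if its stored version is at
        // least the one the head's Log Digest names for S.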

    (Actually, the version need not be updated in the log itself. It could just be sent to the replicas for the head segment, and they could store it outside the log. This is a little gnarly and would require twice the segment buffer space in the worst case, but it avoids ever writing deletes into the log itself. Perhaps even better, though more or less gnarly, the backups could modify the in-log digest, removing the need for extra space. To make this less unpalatable, we could define the log such that masters don't write Log Digests; rather, backups maintain them in memory and tack them onto the ends of segments when they're closed or power fails.)

    There's one more wrinkle. In-memory cleaning can rearrange the contents of a segment; that is, an object may migrate towards the beginning of a segment over time. This means that when an object is deleted, its offset in the in-memory segment may not match the offset in the disk segment, so we can't know which bit to flip on the backup. One trivial solution is to add an index field to each entry, but this will likely require 2 to 3 bytes per entry. While we'd only need to add those bytes to segments that have been cleaned in memory, we'd expect the common case to be that most segments have been memory-cleaned at least once, so this may not win us much.
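
    The trivial fix would look something like this (a hypothetical field, for illustration): each entry carries the liveness-bit index for its position in the on-disk replica, so the right bit can still be found after in-memory compaction moves the entry.

        #include <cstdint>

        // Hypothetical per-entry header extension. A packed 24-bit field is
        // shown; the index is fixed at the entry's on-disk offset / region
        // and does not change when the in-memory copy is compacted.
        struct EntryHeaderExtra {
            uint32_t livenessBitIndex : 24;  // index into the backup's bit vector
        };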

    Some tradeoffs with this approach include:
    1. Extra log overhead per object to record which liveness bit corresponds to it. We probably need 18 to 24 bits, assuming 40-byte minimum entries (e.g., an 8 MB segment split into 40-byte regions has about 210,000 regions, which takes 18 bits to index; larger segments or headroom push toward 24).
      1. It turns out we have quite a few bits that can be squeezed from headers right now anyway.
    2. Extra RPCs to backups during deletes (we must both write the log entry to R backups and contact the R backups holding the target segment to flip bits).
      1. John has a clever enhancement for this where we write tombstones to special tombstone segments (or a separate tombstone log) and clean that log by batching out the bit-flip updates in the background. We still have the same number of messages in the worst case, but at least they're out of the fast path.
      2. Note that these extra RPCs don't have to occur in the fast path during overwrites. The delete bits need only be flipped when an object is removed from the system (if there's a newer live object, it will supersede the old one during recovery). Therefore on overwrites we can just queue up deletes for the prior objects and handle them asynchronously; we need only be sure that all previous deletes have been handled before deleting the latest version of an object (see the sketch after this list). If we expect deletes to be rarer operations (it seems reasonable to expect reads to dominate writes and overwrites to dominate deletes), this may be a reasonable tradeoff, especially if we buy our claims about fast RPCs making more synchronous operations tenable.
    3. Extra memory overhead on backups to keep the liveness bit vectors in memory for all segments stored. The overhead isn't so bad, but the need to flush all of this data to disk on power failure may be more of a problem.
    4. During recovery, overwrites of recovered objects by newer versions will require issuing deletes to the overwritten versions. It's not clear that these updates can be as effectively batched as the segments filled during recovery, so it may adversely affect recovery time.
      1. No! It turns out that if we do these synchronous deletes, the log being recovered will be completely pure (no old objects). There's no need to worry about getting old object versions during recovery. This may in fact speed up recovery, especially in workloads with lots of deletes.
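
    As referenced in tradeoff 2 above, the deferred handling of overwrite deletes might look roughly like this (the queue, functions, and drain helper are hypothetical, for illustration only):

        #include <cstdint>
        #include <deque>

        // A pending liveness-bit flip for an overwritten (superseded) object.
        struct PendingDelete {
            uint64_t segmentId;
            uint32_t offset;     // offset of the stale version in that segment
        };

        // Hypothetical per-master queue. Overwrites enqueue; a background task
        // drains the queue and batches the bit-flip RPCs to backups.
        std::deque<PendingDelete> pendingDeletes;

        void onOverwrite(uint64_t oldSegmentId, uint32_t oldOffset) {
            // Fast path: no backup RPCs. The stale version loses to the newer
            // one during recovery anyway, so the flip can happen lazily.
            pendingDeletes.push_back({oldSegmentId, oldOffset});
        }

        void onDelete(uint64_t tableId, uint64_t keyHash) {
            // Before removing the latest version of an object outright, all of
            // its queued (older-version) flips must have been applied, so the
            // object cannot resurface during recovery.
            // drainPendingDeletesFor(tableId, keyHash);  // hypothetical helper
            // ... then flip the bit for the latest version synchronously ...
        }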

...