Intro

RAMCloud currently uses tombstones to mark objects as deleted. The whole point is to avoid object resurrection when a server fails and its data is restored during crash recovery. When an object is deleted or overwritten, a special record is appended to the log that nullifies the object instance that is no longer valid. The tombstone must persist until the object it refers to is no longer in any segment that is part of the durable log.

Tombstones currently contain the following information:

  1. ObjectId of object deleted (8 bytes)
  2. TableId of table in which deleted object existed (8 bytes)
  3. Version of object deleted (8 bytes)
  4. SegmentId where deleted object resided (8 bytes)  [used to determine when a tombstone may be cleaned from the log]
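
For concreteness, here is a minimal sketch of such a record and of the check the SegmentId field enables, written as illustrative C++. The names (Tombstone, isFreeable) are hypothetical, not RAMCloud's actual definitions.

    #include <cstdint>
    #include <unordered_set>

    // Illustrative only; the fields follow the list above, not RAMCloud's
    // actual tombstone layout.
    struct Tombstone {
        uint64_t objectId;   // id of the deleted object
        uint64_t tableId;    // table the deleted object belonged to
        uint64_t version;    // version of the deleted object
        uint64_t segmentId;  // segment the deleted object resided in
    };

    // A tombstone may be discarded once the segment that held its object is
    // no longer part of the durable log (hypothetical helper).
    bool isFreeable(const Tombstone& t,
                    const std::unordered_set<uint64_t>& segmentsInLog) {
        return segmentsInLog.count(t.segmentId) == 0;
    }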

Unfortunately, tombstones have a number of crappy characteristics:

  • They're big - about 42 bytes, which is close to the minimum object size. With string keys they threaten to become even bigger.
  • They cannot be garbage collected without cleaning on disk first. That is, we must completely remove the object a tombstone refers to from the persistent log before the tombstone itself may be discarded. This affects in-memory cleaning: sometimes it must wait for disk cleaning to occur (even if the disk is nowhere near full) before it can make more progress.
  • Tracking when a tombstone is eligible for garbage collection is difficult. When an object is cleaned, there's no easy way to update statistics in the segment its tombstone resides in. The cleaner therefore has to either calculate dead tombstone space by scanning segments (expensive), or ignore their existence (inaccurate). An alternative is to maintain pointers between deleted objects and their tombstones (in both directions), so that i) tombstones can be marked as free space when the object's segment is cleaned and ii) pointers to tombstones can be updated when the tombstone's segment is cleaned and it's relocated.

Possible changes to tombstones
  1. Conjoin tombstones with object writes when objects are overwritten. With string keys, tombstones would presumably contain the entire variable-length key, so overwrites would be especially expensive: we'd write the new object with its key as well as the tombstone with its own copy of the key. By specializing the overwrite case some space could be saved. Basically, an overwrite consists of a conjoined new object and tombstone for the old one (see the first sketch after this list).

    There are a few wrinkles involved. For instance, when an object that has a conjoined tombstone is cleaned, the tombstone must be separated if its referent is still alive. Likewise if the tombstone part of a conjoined entry is cleaned, the object must be split off.
     
  2. Shrink tombstones such that they include only the last two fields (Version and SegmentId). This will work if we make versions server-unique. On recovery, recovering masters keep track of all object and tombstone versions seen and ensure that dead objects are not made live again. At worst this requires an extra 8 bytes of bookkeeping per tombstone during recovery, which would be about an additional 8/28 ~= 29% overhead if the partition contains only tombstones.

    Turns out this won't work. The problem is that during recovery the backups must filter out objects and tombstones based on their partition in order to report them to the correct recovery master. If tombstones don't contain any tablet and object information, then they cannot be filtered. For this to work, we'd need at least the table id and probably the hash of the key. So we're back to four 64-bit fields (Version, SegmentId, TableId, ObjectHash).
     
  3. Nuke tombstones entirely. Well, sort of. Here's the proposal:

    Backups maintain an in-memory bit vector for each segment they store on disk. Each bit records the liveness of one object (e.g. 1 means the object is deleted). Since objects are variable length, we need a way to associate an object with a specific bit. This can be done by taking the minimum object size for a segment and allocating SEGMENT_SIZE / MIN_OBJECT_SIZE bits in the vector. We then logically split the segment into MIN_OBJECT_SIZE-byte contiguous regions. Since at most one object can start in each region, the index of the region in which an object starts serves as that object's index into the bit vector on the backups (see the second sketch after this list).

    When an object is deleted, RPCs are sent to the replicas for the segment it resides in, telling them to flip the appropriate bit. Unfortunately, this breaks recovery. When finding a segment during recovery, how do we know that a prospective segment has the latest version of the bit vector? We essentially now have quasi-mutable closed segments.

    A solution to this is to associate a version number with each segment. When an object is deleted, the following procedure takes place (see the third sketch after this list). First, the segment version number (V) is incremented on the appropriate segment (S). This value is sent to each replica along with an instruction to flip the appropriate liveness bit. Next, a special entry is appended to the head of the log stating that the minimum valid version for S is V. Finally, each Log Digest contains <SegmentId, Version> tuples for each segment in the log (this implies that the special updates need only survive in the head segment - they can be immediately cleaned from any closed one). On recovery, the head is located and the latest digest is computed from the digest in the head plus any updates appended after it. Segments are used in recovery only if their version is at least as high as the highest one determined by the head.

    There's one more wrinkle. In-memory cleaning can rearrange the contents of a segment; an object may migrate towards the beginning of a segment over time. This means that when an object is deleted, its offset in the in-memory segment may not match its offset in the disk segment, so we can't know which bit to flip on the backup. One trivial solution is to add an index field to each entry, but this would likely cost 2 to 3 bytes per entry. While we'd only need to add those bytes to segments that have been cleaned in memory, we'd expect the common case to be that most segments have been cleaned in memory at least once, so this may not win us much.

    Some tradeoffs with this approach include:
    1. Extra log overhead per object to record which live bit corresponds to it.
    2. Extra RPCs to backups during deletes (we must both write to R backups via the log and contact R other backups to flip bits). John has a clever enhancement for this where we write tombstones to special tombstone segments (or a separate tombstone log) and clean that log by batching out the bit-flip updates in the background. We still have the same number of messages in the worst case, but at least they're out of the fast path.
    3. Extra memory overhead on backups to keep the liveness bit vectors in memory for all segments stored. The overhead isn't so bad, but the need to flush all of this data to disk on power failure may be more of a problem.
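
For proposal 1 above, here is one possible layout for a conjoined overwrite record, again as an illustrative C++ sketch. ConjoinedOverwrite and its fields are hypothetical, not RAMCloud's actual entry format; the point is just that the variable-length key is written once and shared by both halves.

    #include <cstdint>

    // Hypothetical conjoined entry: the new object plus the tombstone for the
    // version it replaces, sharing a single copy of the key.
    struct ConjoinedOverwrite {
        // Tombstone half: nullifies the old instance.
        uint64_t oldVersion;    // version being overwritten
        uint64_t oldSegmentId;  // segment holding the old instance
        // Object half: the new instance.
        uint64_t newVersion;
        uint32_t keyLength;     // the key appears only once, after the header
        uint32_t valueLength;
        // ...followed by keyLength key bytes and valueLength value bytes
    };

If the cleaner relocates only one half, the record would be split back into a standalone object or tombstone, as described above.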
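For proposal 3, here is a sketch of the backup-side liveness bit vector and its region-based indexing. The constants and the LivenessVector name are assumptions, not RAMCloud's real values.

    #include <cstdint>
    #include <vector>

    static const uint32_t SEGMENT_SIZE = 8 * 1024 * 1024;  // assumed 8 MB segments
    static const uint32_t MIN_OBJECT_SIZE = 32;            // assumed minimum entry size

    // One bit per MIN_OBJECT_SIZE-byte region. Since at most one entry can
    // start in each region, an entry's starting offset identifies its bit.
    class LivenessVector {
      public:
        LivenessVector() : deleted(SEGMENT_SIZE / MIN_OBJECT_SIZE, false) {}

        // Mark the entry starting at byte offset 'offset' as deleted.
        void markDeleted(uint32_t offset) {
            deleted[offset / MIN_OBJECT_SIZE] = true;
        }

        bool isDeleted(uint32_t offset) const {
            return deleted[offset / MIN_OBJECT_SIZE];
        }

      private:
        std::vector<bool> deleted;  // one bit per region; 1 = deleted
    };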
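Finally, a sketch of the segment-versioning procedure that keeps the mutable bit vectors safe for recovery. The RPC and log-append steps are left as comments because the page doesn't specify that machinery; deleteObject and usableForRecovery are hypothetical names.

    #include <cstdint>
    #include <map>

    // Minimum required version per segment, as reconstructed on recovery from
    // the head's Log Digest plus any minimum-valid-version entries after it.
    struct SegmentVersions {
        std::map<uint64_t, uint64_t> minValidVersion;  // segmentId -> version
    };

    // Master-side delete: bump the segment's version, have each replica flip
    // the liveness bit at that version, then record the new minimum in the head.
    void deleteObject(uint64_t segmentId, uint32_t offset, SegmentVersions& sv) {
        uint64_t v = ++sv.minValidVersion[segmentId];
        // for each replica of segmentId: send a flip-bit RPC carrying (offset, v)
        // append a "minimum valid version for segmentId is v" entry to the log head
        (void) offset;  // consumed only by the stand-in RPC above
    }

    // Recovery-side filter: a discovered replica is usable only if its bit
    // vector's version meets the minimum the head's digest requires.
    bool usableForRecovery(uint64_t segmentId, uint64_t replicaVersion,
                           const SegmentVersions& sv) {
        auto it = sv.minValidVersion.find(segmentId);
        return it == sv.minValidVersion.end() || replicaVersion >= it->second;
    }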