Rethinking Tombstones

Intro

RAMCloud currently uses tombstones to mark deleted objects as such. The whole point is to avoid object resurrection when a server fails and its data is restored during crash recovery. When an object is deleted or overwritten a special record is appended to the log that nullifies the particular object instance that is no longer valid. The tombstone must persist until the object it refers to is no longer in any segment that is part of the durable log.

Objects current contain the following information:

ObjectId of object deleted (8 bytes)
TableId of table in which deleted object existed (8 bytes)
Version of object deleted (8 bytes)
SegmentId where deleted object resided (8 bytes) [used to determine when a tombstone may be cleaned from the log]

Unfortunately, tombstones have a number of crappy characteristics:

They're big - about 42 bytes, which is close to the minimum object size. With string keys they threaten to become even bigger.
They cannot be garbage collected without cleaning on disk first. That is, we must completely remove from the persistent log the object a tombstone refers to before the tombstone itself may be discarded. This affects in-memory cleaning in that sometimes it must wait for disk cleaning to occur (even if the disk isn't nearly full) before it can make more progress.
Tracking when a tombstone is eligible for garbage collection is difficult. When an object is cleaned, there's no easy way to update statistics in the segment its tombstone resides in. The cleaner therefore has to either calculate dead tombstone space by scanning segments (expensive), or ignore their existence (inaccurate). An alternative is to maintain pointers between deleted objects and their tombstones (in both directions), so that i) tombstones can be marked as free space when the object's segment is cleaned and ii) pointers to tombstones can be updated when the tombstone's segment is cleaned and it's relocated.

Possible changes to tombstones

Conjoin with object writes when objects are overwritten. This piggybacking saves space at some complexity cost. Ask John/Ankita for details.
~~Shrink tombstones such that they only include the last two fields (Version and SegmentId)~~. This will work if we make versions server-unique. On recovery, recovering masters keep track of all object and tombstone versions seen and ensure that dead objects are not made live again. This probably requires at worst an extra 8 bytes per tombstone during recovery, which worst case would be about an additional 8/28 ~= 29% overhead if the partition contains only tombstones.

Turns out this won't work. The problem is that during recovery the backups must filter out objects and tombstones based on their partition in order to report them to the correct recovery master. If tombstones don't contain any tablet and object information, then they cannot be filtered. For this to work, we'd need at least the table id and probably the hash of the key. So we're back to four 64-bit fields (Version, SegmentId, TableId, ObjectHash).
Nuke tombstones entirely. XXX fill me in

Some tradeoffs with this approach include:
1. Extra log overhead per object to record which live bit corresponds to it.
2. Extra RPCs to backups during deletes (must both write to R backups via log and contact R other backups to flip bits). John has a clever enhancement for this where we write tombstones to special tombstone segments (or a separate tombstone log) and clean that log by batching out this bit flip updates in the background. We still have the same amount of messages in the worst case, but at least they're out of the fast path.
3. Extra memory overhead on backups to keep free bit vectors in memory for all segments stored.