
RAMCloud currently uses tombstones to mark deleted objects as such. The whole point is to avoid object resurrection when a server fails and its data is restored during crash recovery. When an object is deleted or overwritten a special record is appended to the log that nullifies the particular object instance that is no longer valid. The tombstone must persist until the object it refers to is no longer in any segment that is part of the durable log.

Tombstones currently contain the following information:

  1. ObjectId of object deleted (8 bytes)
  2. TableId of table in which deleted object existed (8 bytes)
  3. Version of object deleted (8 bytes)
  4. SegmentId where deleted object resided (8 bytes)  [used to determine when a tombstone may be cleaned from the log]
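As a point of reference, the four fields above pack into 32 bytes before the log entry header is added. A minimal sketch in Python (field names and the `<QQQQ` layout are illustrative, not RAMCloud's actual declarations):

```python
import struct

# Hypothetical packed layout of the four tombstone fields: four
# little-endian uint64s -- objectId, tableId, version, segmentId.
TOMBSTONE_FORMAT = "<QQQQ"

def pack_tombstone(object_id, table_id, version, segment_id):
    """Serialize a tombstone's fields. The ~10-byte log entry header
    that brings the total to roughly 42 bytes is not modeled here."""
    return struct.pack(TOMBSTONE_FORMAT, object_id, table_id,
                       version, segment_id)

fields = pack_tombstone(0x1234, 7, 42, 99)
assert len(fields) == 32   # 4 fields x 8 bytes each
```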

Unfortunately, tombstones have a number of unpleasant characteristics:

  • They're big - about 42 bytes, which is close to the minimum object size. With string keys they threaten to become even bigger.
  • They cannot be garbage collected without cleaning on disk first. That is, we must completely remove from the persistent log the object a tombstone refers to before the tombstone itself may be discarded. This affects in-memory cleaning in that sometimes it must wait for disk cleaning to occur (even if the disk isn't nearly full) before it can make more progress.
  • Tracking when a tombstone is eligible for garbage collection is difficult. When an object is cleaned, there's no easy way to update statistics in the segment its tombstone resides in. The cleaner therefore has to either calculate dead tombstone space by scanning segments (expensive), or ignore their existence (inaccurate). An alternative is to maintain pointers between deleted objects and their tombstones (in both directions), so that i) tombstones can be marked as free space when the object's segment is cleaned and ii) pointers to tombstones can be updated when the tombstone's segment is cleaned and it's relocated.
Possible changes to tombstones
  1. Conjoin with object writes when objects are overwritten. The idea is that with string keys tombstones would presumably contain the entire variable-length key. Overwrites would then be especially expensive, as we'd write the new object with the key as well as the tombstone with its own copy. By specializing the overwrite case some space could be saved. Basically an overwrite consists of a conjoined new object and tombstone for the old one.

    There are a few wrinkles involved. For instance, when an object that has a conjoined tombstone is cleaned, the tombstone must be separated if its referent is still alive. Likewise if the tombstone part of a conjoined entry is cleaned, the object must be split off.
  2. Shrink tombstones such that they only include the last two fields (Version and SegmentId). This will work if we make versions server-unique. On recovery, recovering masters keep track of all object and tombstone versions seen and ensure that dead objects are not made live again. This probably requires at worst an extra 8 bytes per tombstone during recovery, which in the worst case would be about an additional 8/28 ~= 29% overhead if the partition contains only tombstones.

    Turns out this won't work. The problem is that during recovery the backups must filter out objects and tombstones based on their partition in order to report them to the correct recovery master. If tombstones don't contain any tablet and object information, then they cannot be filtered. For this to work, we'd need at least the table id and probably the hash of the key. So we're back to four 64-bit fields (Version, SegmentId, TableId, ObjectHash).
  3. Nuke tombstones entirely. Well, sort of. Here's the proposal:

    Backups maintain an in-memory bit vector for each segment they store on disk. Each bit corresponds to the liveness of an object (e.g. 1 means the object is deleted). Since objects are variable length, we need to associate an object with a specific bit. This can be done by taking the minimum object size for a segment and allocating SEGMENT_SIZE / MIN_OBJECT_SIZE bits in the vector. Then we logically split the segment into MIN_OBJECT_SIZE contiguous regions. Since at most one object can start in each region, we can use the region offset in the segment as the index in the bit vector on the backups to denote that object's liveness.
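The region-to-bit mapping can be sketched as follows (SEGMENT_SIZE and MIN_OBJECT_SIZE are assumed example values, not RAMCloud constants):

```python
SEGMENT_SIZE = 8 * 1024 * 1024   # example: 8MB segments
MIN_OBJECT_SIZE = 40             # assumed minimum entry size

def bitvector_length(segment_size=SEGMENT_SIZE,
                     min_object_size=MIN_OBJECT_SIZE):
    # One bit per MIN_OBJECT_SIZE-byte region of the segment; at most
    # one object can start in each region. Round up for a partial
    # trailing region.
    return -(-segment_size // min_object_size)

def liveness_bit_index(object_offset, min_object_size=MIN_OBJECT_SIZE):
    # The region an object starts in gives its index into the backup's
    # bit vector.
    return object_offset // min_object_size
```

So for these parameters each backup keeps one bit vector of a couple hundred thousand bits per stored segment.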

    When an object is deleted, RPCs are sent to the replicas for the segment it resides in, telling them to flip the appropriate bit. Unfortunately, this breaks recovery. When finding a segment during recovery, how do we know that a prospective segment has the latest version of the bit vector? We essentially now have quasi-mutable closed segments.

    A solution to this is to associate a version number with each segment. When an object is deleted, the following procedure takes place: First, the segment version number (V) is incremented on the appropriate segment (S). This value is sent to each replica along with an instruction to flip the appropriate liveness bit. Next, a special entry is appended to the head of the log stating that the minimum valid version for S is V. Finally, each Log Digest contains <SegmentId, Version> tuples for each segment in the log (this implies that the special updates need only survive in the head segment - they can be immediately cleaned from any closed one). On recovery, the head is located and the latest digest is computed using the digest in the head and any updates that were appended. Segments are only used in recovery if they have a version at least as high as the highest one determined by the head.
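The digest-based filtering above might be sketched like this (an illustrative sketch, not RAMCloud's recovery code; data shapes are assumptions):

```python
def usable_replicas(head_digest, head_updates, candidate_replicas):
    """head_digest: {segment_id: version} from the Log Digest in the head.
    head_updates: iterable of (segment_id, min_version) special entries
    appended to the head after that digest was written.
    candidate_replicas: iterable of (segment_id, version) replicas found
    on backups during recovery.
    Returns the replicas whose version is current enough to use."""
    # Fold the later updates into the digest to get the latest minimum
    # valid version for each segment in the log.
    min_version = dict(head_digest)
    for seg, ver in head_updates:
        min_version[seg] = max(min_version.get(seg, 0), ver)
    # A replica is usable only if it belongs to the log and carries a
    # version at least as high as the head's minimum for that segment.
    return [(seg, ver) for seg, ver in candidate_replicas
            if seg in min_version and ver >= min_version[seg]]
```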

    (Actually, the version need not be updated in the log itself. It could just be sent to the replicas for the head segment and they could store it outside the log. This is a little gnarly and would require twice the segment buffer space in the worst case, but avoids ever writing deletes into the log itself. Perhaps even better, but more or less gnarly, the backups could modify the in-log digest, removing the need for extra space. To make this less unpalatable, we could define the log such that masters don't write Log Digests; rather, backups maintain them in memory and tack them on the ends of segments when they're closed or power fails.)

    There's one more wrinkle. In-memory cleaning can re-arrange the contents of a segment. That is, an object may migrate towards the beginning of a segment over time. This means that when an object is deleted, its offset in the in-memory segment may not match the offset in the disk segment, which means that we can't know which bit to flip on the backup. One trivial solution to this is to add an index field to each entry, but this will likely require 2 to 3 bytes per entry. While we'd only need to add those bytes to segments that have been cleaned in memory, we'd hope the common case would be that most segments have been memory cleaned at least once, so this may not win us much.

    Some tradeoffs with this approach include:
    1. Extra log overhead per object to record which live bit corresponds to it. We probably need 18 to 24 bits, assuming 40 byte min entries.
      1. It turns out we have quite a few bits that can be squeezed from headers right now anyway.
    2. Extra RPCs to backups during deletes (must both write to R backups via log and contact R other backups to flip bits).
      1. John has a clever enhancement for this where we write tombstones to special tombstone segments (or a separate tombstone log) and clean that log by batching out the bit flip updates in the background. We still have the same amount of messages in the worst case, but at least they're out of the fast path.
      2. Note that these extra RPCs don't have to occur in the fast path during overwrites. The delete bits need only be flipped when an object is removed from the system (if there's a newer live object, it will supersede during recovery). Therefore on overwrites we can just queue up deletes for the prior objects and handle them asynchronously. We need only be sure that all previous deletes have been handled when deleting the latest version of an object. If we expect deletes to be rarer operations (it seems reasonable to expect reads to dominate writes and overwrites to dominate deletes) this may be a reasonable tradeoff. Especially if we buy our claims about fast RPCs making more synchronous operations tenable.
    3. Extra memory overhead on backups to keep free bit vectors in memory for all segments stored. The overhead isn't so bad, but the necessity to flush all of the data to disk on power failure may be more of a problem.
    4. During recovery, overwrites of recovered objects by newer versions will require issuing deletes to the overwritten versions. It's not clear that these updates can be as effectively batched as the segments filled during recovery, so it may adversely affect recovery time.
      1. No! Turns out if we do these synchronous deletes, the log being recovered will be completely pure (no old objects). There's no need to worry about getting old object versions during recovery. This may in fact speed up recovery, especially in workloads with lots of deletes.
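The deferred bit-flip idea from tradeoff 2 might look roughly like this (the RPC to the R backups is stubbed out as a callback; class and method names are hypothetical):

```python
from collections import defaultdict

class PendingDeletes:
    """Sketch: overwrites queue the dead version's bit flip instead of
    issuing RPCs synchronously; deleting the newest version of an object
    must first drain the queue so no older version can be resurrected."""

    def __init__(self, send_flip_rpc):
        self.pending = defaultdict(list)     # segment_id -> [bit indexes]
        self.send_flip_rpc = send_flip_rpc   # assumed callback to R backups

    def overwrite(self, segment_id, bit_index):
        # The old version's liveness bit can be flipped lazily: a newer
        # live object supersedes it during recovery anyway.
        self.pending[segment_id].append(bit_index)

    def delete_latest(self, segment_id, bit_index):
        # Before removing the last live version, flush all queued flips
        # so recovery cannot see any stale predecessor.
        self.flush()
        self.send_flip_rpc(segment_id, [bit_index])

    def flush(self):
        # Batched background path: one RPC per segment with queued flips.
        for seg, bits in self.pending.items():
            self.send_flip_rpc(seg, bits)
        self.pending.clear()
```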

Problem: Interaction with parallel cleaning

How do deletes work when cleaning may occur in parallel? Objects are moved to new segments before they become part of the log. In this bit flipping world, a delete on an object moved before cleaning has completed would result in the bit being flipped for that survivor segment, but it is not yet part of the log. Another copy still exists in some other segment. A delete during cleaning followed by a failure before the old segment is removed from the log could result in zombie objects.


Another weird idea

If we wanted to trade off cleaning cost for memory utilization, we could stop keeping tombstones in master memory. They only really serve a purpose on disks since they're just used during recovery. Therefore we could in-memory clean closed segments and remove all of their tombstones. Backup recovery would then need to read a replica off of another disk. Also, disk cleaning on the master would require reading segments back from disk to ensure that tombstones are perpetuated. The former is complicated by the mirror group work, as it'd cause a ton of disk I/O in that group when a mirror fails (all masters reading a segment from it to rereplicate). The latter makes disk cleaning twice as expensive. The win is almost entirely nuking tombstones from server memory.

Back-of-the-envelope musings for the bit vector approach

Space overhead:

Overhead for the bit vector on backups is R/(8*MinObjSize) * MasterLogSize.

For example, if R = 4, MinObjSize = 40, and MasterLogSize = 64GB, we have 0.8GB of bits in memory on each backup. This is 1.25% overhead.

While 1.25% seems a little high, consider that no tombstones will be stored in the log. Tombstones are currently 42 bytes each, so the crossover point for a 64GB master is at about 20.5e6 tombstones per log. That sounds like a lot, but it's still only 1.25% of the log. I don't know how low we could really expect to get in terms of tombstone overhead.

There would also be 8 bytes per segment for the segment version number, but that would only add 256KB for 8MB segments, 64GB servers, R = 4.
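These figures can be checked with a few lines of arithmetic (a sketch using the example parameters above):

```python
GiB = 2**30
R = 4                          # replication factor
MIN_OBJECT_SIZE = 40           # assumed minimum entry size
MASTER_LOG_SIZE = 64 * GiB
SEGMENT_SIZE = 8 * 2**20       # 8MB segments

# Each backup stores ~R x MasterLogSize worth of replicas in steady
# state, with one liveness bit per MIN_OBJECT_SIZE bytes.
bitvector_bytes = R * MASTER_LOG_SIZE / (8 * MIN_OBJECT_SIZE)  # ~0.8 GiB

# As a fraction of the 64GB master log.
overhead = bitvector_bytes / MASTER_LOG_SIZE                   # 1.25%

# 8-byte version number per segment, across all R replicas.
version_bytes = (MASTER_LOG_SIZE // SEGMENT_SIZE) * 8 * R      # 256 KiB
```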

Entry offset overhead:

If each entry must describe which bit on replicas needs to be flipped when it dies, how big must the offset be?

With 8MB segments, 40 byte MinObjSize, we'd need to represent 209,716 values. So 18 bits. Having 24 bits would keep us safe up to 640MB segments.

Interestingly, this seems to argue for smaller segments. With 2.5MB segments we'd need 16 bits. 160KB segments would make 12 bits viable. Alternatively, a larger minimum object size would help, but we'd need to double the size to chop each additional bit.
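The bit counts above follow from a simple ceiling-log calculation (a sketch; parameter values are the examples used in this section):

```python
def offset_bits(segment_size, min_object_size=40):
    # Number of distinct regions an entry could start in (round up),
    # then the bits needed to name any one of them.
    regions = -(-segment_size // min_object_size)
    return (regions - 1).bit_length()

# 8MB segments, 40-byte minimum entries: 209,716 values -> 18 bits.
# 24 bits would address segments up to 2^24 * 40 bytes = 640MB.
```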
